Model Profile · Round 01

GLM 4.7

Z.AI · open weights · 128K ctx

Mid-pack across the board. No standout strength.

Composite Score
73.8
/100 · canonical
Arena ELO (R1)
1483
±43 · n=285
Multi-Turn ELO (R2)
1508
±93 · n=50
Reliability Rank
#7
avg 7.0
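The Arena and Multi-Turn numbers above are pairwise Elo ratings with bootstrap-style confidence intervals. As a hedged sketch of how such ratings move per matchup — the K-factor and starting rating here are illustrative assumptions, not rp-bench's actual parameters:

```python
# Minimal sketch of a pairwise Elo update, the mechanic behind
# blind-arena leaderboards. K=32 and the 400-point scale are the
# classic chess defaults, assumed here for illustration.
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 win, 0.5 tie, 0.0 loss. Returns new (r_a, r_b)."""
    e_a = elo_expected(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta
```

With n=285 votes and a ±43 interval, individual updates like this are noisy; the reported rating is the aggregate over all matchups.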
▌ Section 02 · The Lede

What this model is for.

GLM 4.7 is mid-pack across the board — community #9, failure-rank #4, no axis where it dominates and none where it floors. Strong tone consistency (4.62) and decent F13 context attention (4.57) mark it as a steady utility model. A lower-half community ranking makes it a niche backup pick — not a daily driver, not a specialist, but a reasonable fallback when the primary model is unavailable.

▌ Section 03 · At a Glance

Cross-test position

Composite
6
Arena ELO
9
Multi-Turn
10
Rubric
9
Adversarial
7
Cost · Latency
11
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Strong on tone consistency (4.62/5 — third-best in the field). Failure-rank #4 in the pool — better reliability than its community ELO suggests.
Weakness
Lower half of the community rankings (#9 ELO). 9.1% agency-violation rate. No axis where the model is the best choice.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency
Doesn't write your character's actions
4.38
/ 5
#12
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.27
/ 5
#5
F3 · Lore
Doesn't break worldbuilding
4.20
/ 5
#7
F8 · Momentum
Pushes scene forward when user goes passive
4.10
/ 5
#7
F12 · Instruction Drift
Keeps to the system prompt
4.30
/ 5
#10
F13 · Context Attention
Holds character cards 50+ turns deep
4.57
/ 5
#4
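The per-axis aggregation described above — a mean over judged sessions, then a rank within the pool — can be sketched as follows. The data shapes and function names are hypothetical; only the mean-then-rank logic comes from the methodology note.

```python
from statistics import mean

def axis_means(sessions: list[dict]) -> dict[str, float]:
    """Mean score per failure axis across a model's judged sessions.

    sessions: one {axis_name: score} dict per session, e.g. twenty
    dicts each holding F1..F13 scores on the 1-5 rubric.
    """
    axes = sessions[0].keys()
    return {a: mean(s[a] for s in sessions) for a in axes}

def rank_within_pool(model_mean: float, pool_means: list[float]) -> int:
    """Rank of a model's axis mean within the pool (1 = best);
    higher score ranks better, ties share the better rank."""
    return 1 + sum(m > model_mean for m in pool_means)
```

Under this scheme a model at 4.57 on F13 in a pool where only three models score higher would rank #4, matching the table's format.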
The backup pick — better failure-rank than community-rank, no axis where it leads.
Round 01 verdict · Steady, not standout
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
Engagement
4.41/5
Tone Consistency
4.62/5
Collaboration
4.42/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
222
pop avg 265 · -16%
Unique-word ratio
0.667
pop avg 0.655 · +2%
Repetition score
0.038
pop avg 0.049 · -22%
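One plausible implementation of the three behavioral metrics above — rp-bench's exact tokenizer and repetition window are not published on this page, so the tokenization and n-gram choice here are assumptions:

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; an assumed, simplified scheme."""
    return re.findall(r"[a-z']+", text.lower())

def avg_words_per_turn(turns: list[str]) -> float:
    return sum(len(tokenize(t)) for t in turns) / len(turns)

def unique_word_ratio(turns: list[str]) -> float:
    """Distinct words over total words across all turns."""
    words = [w for t in turns for w in tokenize(t)]
    return len(set(words)) / len(words)

def repetition_score(turns: list[str], n: int = 3) -> float:
    """Fraction of word trigrams that are repeats; higher = more
    repetitive. The n=3 window is an illustrative assumption."""
    grams = []
    for t in turns:
        w = tokenize(t)
        grams += [tuple(w[i:i + n]) for i in range(len(w) - n + 1)]
    return 1 - len(set(grams)) / len(grams) if grams else 0.0
```

Read against the table: 222 words/turn is terser than the pool, a 0.667 unique-word ratio is marginally richer vocabulary, and a 0.038 repetition score means fewer recycled phrasings than average.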
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score starts at 100, with deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range for the round is 12.8–46.9.
36.8
/ 100
▌ Score breakdown
Mean   36.8
Median   37.0
Fatal/sess   0.71
Major/sess   6.76
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
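A minimal sketch of the deduction scoring described above. Only the 100-minus-deductions shape comes from this page; the per-severity penalty weights are invented for illustration.

```python
# Assumed penalty weights per flag severity; the real rubric's
# weights for its 22 flag types are not published here.
PENALTY = {"fatal": 10.0, "major": 3.0, "minor": 1.0}

def flaw_hunter_score(flags: list[tuple[str, str]]) -> float:
    """Score a session set from its caught flaws.

    flags: (flag_type, severity) pairs, e.g.
    ("purple_prose", "major"). Score floors at 0.
    """
    score = 100.0 - sum(PENALTY[sev] for _, sev in flags)
    return max(score, 0.0)
```

The mean/median split (36.8 vs 37.0) and the per-session fatal/major counts in the breakdown are then simple aggregates over per-session scores like this one.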
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
GLM 4.7 is the round's quiet utility. Its failure-rank (#4) is better than its community ELO (#9) would suggest, and the tone consistency is real — third-best in the field. The reason to skip it is also the reason to keep it on the bench: there's no axis where it leads, but there's no axis where it falls off either. Reach for it as a backup, not a default.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset
Profile · GLM 4.7 · Round 01