Model Profile · Round 01
GLM 4.7
Z.AI · open weights · 128K ctx
Mid-pack across the board. No standout strength.
Composite Score
73.8
/100 · canonical
Arena ELO (R1)
1483
±43 · n=285
Multi-Turn ELO (R2)
1508
±93 · n=50
Reliability Rank
#7
avg 7.0
▌ Section 02 · The Lede
What this model is for.
GLM 4.7 is mid-pack across the board — community ELO #9, failure-rank #4, no axis where it dominates and no axis where it floors. Strong tone consistency (4.62) and solid F13 context attention (4.57) point to a steady utility model. Its lower-half community standing makes it a niche backup pick — not a daily driver, not a specialist, but a reasonable fallback when the primary model isn't available.
▌ Section 03 · At a Glance
Cross-test position
Composite
6
Arena ELO
9
Multi-Turn
10
Rubric
9
Adversarial
7
Cost · Latency
11
▌ Section 04 · Strength & Weakness
Where it shines. Where it stumbles.
▲ Strength
Strong on tone consistency (4.62/5 — third-best in the field). Failure-rank #4 in the pool — better reliability than its community ELO suggests.
▼ Weakness
Lower half of the community ladder (#9 ELO). 9.1% agency-violation rate. No axis where it is the best choice.
▌ Section 05 · Failure Modes
Per-axis breakdown.
Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). The right column shows the mean and the rank within the rp-bench pool.
F1 · Agency
Doesn't write your character's actions
4.38
/ 5
#12
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.27
/ 5
#5
F3 · Lore
Doesn't break worldbuilding
4.20
/ 5
#7
F8 · Momentum
Pushes scene forward when user goes passive
4.10
/ 5
#7
F12 · Instruction Drift
Keeps to the system prompt
4.30
/ 5
#10
F13 · Context Attention
Holds character cards 50+ turns deep
4.57
/ 5
#4
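The per-axis rows above reduce to a small aggregation: mean the session scores, then rank that mean against the pool. The sketch below illustrates the shape of that computation — the session scores and pool means are hypothetical, not the actual rp-bench data.

```python
from statistics import mean

def axis_summary(session_scores, pool_means):
    """Mean across sessions plus rank within the pool (1 = best)."""
    model_mean = mean(session_scores)
    # Rank = 1 + number of pool models with a strictly higher mean.
    rank = 1 + sum(1 for m in pool_means if m > model_mean)
    return model_mean, rank, mean(pool_means)

# Hypothetical F13-style data: 20 session scores and a 5-model pool.
sessions = [4.5, 4.7, 4.6, 4.4, 4.8, 4.5, 4.6, 4.7, 4.5, 4.6,
            4.4, 4.7, 4.6, 4.5, 4.8, 4.6, 4.5, 4.7, 4.4, 4.6]
pool = [4.1, 4.7, 4.58, 3.9, 4.3]

m, rank, pop = axis_summary(sessions, pool)
print(f"{m:.2f}/5  #{rank}  (pop mean {pop:.2f})")
```

Ties are broken pessimistically here (strictly-higher means only); the real pipeline's tie-break rule isn't documented in this profile.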
“The backup pick — better failure-rank than community-rank, no axis where it leads.”
— Round 01 verdict · Steady, not standout
▌ Section 06 · Subjective Dimensions
Engagement · Voice · Collaboration.
All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
Engagement
4.41/5
Tone Consistency
4.62/5
Collaboration
4.42/5
▌ Section 07 · Behavioral Metrics
How it writes.
Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
222↓
pop avg 265 · -16%
Unique-word ratio
0.667↑
pop avg 0.655 · +2%
Repetition score
0.038↓
pop avg 0.049 · -22%
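The deltas above are plain signed percent differences against the population mean. A minimal sketch, using the three metric values from this section:

```python
def pct_vs_pop(value, pop_avg):
    """Signed percent difference vs the population average, rounded."""
    return round((value - pop_avg) / pop_avg * 100)

# (model value, population average) pairs from Section 07.
metrics = {
    "avg_words_per_turn": (222, 265),
    "unique_word_ratio": (0.667, 0.655),
    "repetition_score": (0.038, 0.049),
}
for name, (val, pop) in metrics.items():
    print(f"{name}: {val} vs pop avg {pop} -> {pct_vs_pop(val, pop):+d}%")
# -> -16%, +2%, -22%, matching the rows above.
```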
▌ Section 08 · Flaw Hunter
Adversarial probe score.
The score starts at 100 and takes deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
36.8
/ 100
▌ Score breakdown
Mean 36.8
Median 37.0
Fatal/sess 0.71
Major/sess 6.76
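One way to read the score: start at 100 and subtract a per-flag deduction each time the judge raises a fail-mode flag, weighted by severity. The deduction weights below are illustrative assumptions — the actual deduction table isn't published in this profile.

```python
# Hypothetical severity weights (assumed, not the real table).
DEDUCTIONS = {"fatal": 10.0, "major": 5.0, "minor": 1.0}

def flaw_hunter_score(flags):
    """100 minus summed deductions, floored at 0.

    flags: one severity string per flag the judge raised
    in a session.
    """
    total = sum(DEDUCTIONS[sev] for sev in flags)
    return max(0.0, 100.0 - total)

# A session near this model's profile: 1 fatal, 6 major, 3 minor.
print(flaw_hunter_score(["fatal"] + ["major"] * 6 + ["minor"] * 3))
# -> 57.0 under these assumed weights
```

With 0.71 fatal and 6.76 major flags per session, most of the deduction mass here comes from major flags, whatever the real weights are.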
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
▌ Section 09 · Sample Responses
Highest- and lowest-rated turns.
▌ Pending Round 02
Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.
▌ Round 01 verdict
GLM 4.7 is the round's quiet utility. Its failure-rank of #4 is better than its community ELO of #9 would suggest, and the tone consistency is real — third-best in the field. The reason to skip it is also the reason to keep it on the bench: no axis where it leads, but no axis where it falls off either. Reach for it as a backup, not a default.
▌ Section 10 · Compare & Drill
Stack it against another model.
Profile · GLM 4.7 · Round 01