Model Profile · Round 01

GLM 5.1

Z.AI · open weights · 128K ctx

Strong on tone consistency, weak on multi-turn engagement.

Composite Score · 16.7/100 · canonical
Arena ELO (R1) · joined post-R01
Multi-Turn ELO (R2) · 1387 ±99 · n=30
Reliability Rank · #9 · avg 7.8
▌ Section 02 · The Lede

What this model is for.

Z.AI's GLM 5.1 enters Round 02 as the polished-prose pick: strongest tone consistency in its weight class (4.65/5), the field's lowest fatal-flaw rate (0.11 per session), and a flaw-hunter score of 45.8 that puts it ahead of every Anthropic model except Sonnet 4.5. The wrinkle is multi-turn engagement: voters slot it #19 on multi-turn ELO at 1387, near the bottom of the 20-model pool. Likely read: GLM 5.1 writes correctly but doesn't write excitingly, and Round 02 voters preferred denser, riskier prose. At $1.80/1M tokens it's pricier than DeepSeek v4 Flash and Gemini 3.1 Flash Lite, both of which beat it on multi-turn ELO. Niche pick.
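The multi-turn number is a blind pairwise-vote Elo (1387 ±99 on n=30). The page doesn't publish the leaderboard's update rule, so the sketch below is a minimal standard logistic Elo update; the K-factor, 400-point scale, and 1500 starting rating are all assumptions.

```python
# Minimal logistic Elo sketch. The K-factor (32), 400-point scale, and
# 1500 starting rating are assumptions; the page does not publish the
# leaderboard's actual update rule.

def expected(r_a: float, r_b: float) -> float:
    """P(A beats B) under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one blind-arena vote: move both ratings toward the result."""
    delta = k * ((1.0 if a_won else 0.0) - expected(r_a, r_b))
    return r_a + delta, r_b - delta

ratings = {"glm-5.1": 1500.0, "rival": 1500.0}
votes = [("glm-5.1", "rival", False)] * 5  # five straight losses for GLM
for a, b, a_won in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)
print(ratings)  # glm-5.1 drifts below 1500 after the losses
```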

▌ Section 03 · At a Glance

Cross-test position

GLM 5.1 sits at #19 on Multi-Turn — the caveat to watch.

Composite · #17
Arena ELO · joined post-R01
Multi-Turn · #19
Rubric · #6
Adversarial · #9
Cost · Latency · #14
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Lowest fatal-flaw rate in the entire pool (0.11/session). Strong tone consistency (4.65/5). Flaw-hunter mean of 45.8 sits ahead of every Opus model.
Weakness
Multi-turn ELO of 1387 puts it #19 of 20 — voters preferred almost everything else when picking blind. Priced above the Flash-tier models that outscore it on engagement.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Each row below shows the mean across sessions; the population mean is 4.20, and each axis is also ranked within the rp-bench pool. A sketch of the aggregation follows the table.
▌ Coverage: 4/6 · F3 (Lore) and F8 (Momentum) not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency · Doesn't write your character's actions · 4.50/5
F2 · POV / Tense · Holds 2nd-person, present-tense narration · 4.17/5
F3 · Lore · not yet run on this model
F8 · Momentum · not yet run on this model
F12 · Instruction Drift · Keeps to the system prompt · 4.33/5
F13 · Context Attention · Holds character cards 50+ turns deep · 4.57/5
▌ Round 02 verdict · Polished outsider
Cleanest prose, sleepiest votes. A craft pick voters won't fall for.
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment. A sketch of the judge-and-average loop follows the scores.
Engagement · 4.49/5
Tone Consistency · 4.65/5
Collaboration · 4.34/5
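The judge prompt itself isn't published here; the sketch below shows one plausible judge-and-average loop consistent with the description. The `call_judge` stub and the rubric wording are hypothetical, standing in for the actual Sonnet 4 call.

```python
from statistics import mean

DIMENSIONS = ("engagement", "tone_consistency", "collaboration")

RUBRIC = ("Rate the assistant's {dim} in this roleplay transcript "
          "from 1 to 5. Reply with a single integer.")

def call_judge(prompt: str, transcript: str) -> str:
    """Hypothetical stub; stands in for the actual Sonnet 4 API call."""
    raise NotImplementedError

def score_session(transcript: str) -> dict[str, int]:
    """One judged session: one 1-5 integer per subjective dimension."""
    scores = {}
    for dim in DIMENSIONS:
        reply = call_judge(RUBRIC.format(dim=dim.replace("_", " ")), transcript)
        scores[dim] = max(1, min(5, int(reply.strip())))  # clamp stray scores
    return scores

def aggregate(per_session: list[dict[str, int]]) -> dict[str, float]:
    """Mean per dimension across all sessions -> the three numbers above."""
    return {d: mean(s[d] for s in per_session) for d in DIMENSIONS}
```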
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models. A sketch of how each metric can be computed follows the table.
Avg words / turn · 240 · pop avg 265 · -9%
Unique-word ratio · 0.653 · pop avg 0.655 · -0.3%
Repetition score · 0.041 · pop avg 0.049 · -16%
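The first two metrics fall out of simple token counts. The repetition score's formula isn't given on this page, so the repeated-trigram fraction below is an assumption, not the benchmark's definition.

```python
def words(turn: str) -> list[str]:
    return turn.lower().split()

def avg_words_per_turn(turns: list[str]) -> float:
    return sum(len(words(t)) for t in turns) / len(turns)

def unique_word_ratio(turns: list[str]) -> float:
    all_words = [w for t in turns for w in words(t)]
    return len(set(all_words)) / len(all_words)

def repetition_score(turns: list[str], n: int = 3) -> float:
    """ASSUMPTION: fraction of n-grams already seen earlier in the session.
    The benchmark's actual repetition formula is not published on this page."""
    seen, repeated, total = set(), 0, 0
    for t in turns:
        ws = words(t)
        for i in range(len(ws) - n + 1):
            gram = tuple(ws[i:i + n])
            total += 1
            if gram in seen:
                repeated += 1
            seen.add(gram)
    return repeated / total if total else 0.0

turns = ["The rain kept falling.", "The rain kept falling on the tin roof."]
print(avg_words_per_turn(turns), unique_word_ratio(turns), repetition_score(turns))
```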
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Starts at 100 and subtracts deductions across 22 failure-mode flag types on adversarial 12-turn sessions; higher = fewer flaws caught. Population range across the round is 12.8–46.9. A sketch of the scoring shape follows the breakdown.
45.8 / 100
▌ Score breakdown
Mean   45.8
Median   46.0
Fatal/sess   0.11
Major/sess   6.44
▌ Top flaws caught
purple_prose · recycled_description · convenient_world
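A sketch of the scoring shape only: start at 100, subtract a deduction per caught flag, then average across sessions. The per-flag weights and the zero floor are assumptions; the page states only the 100-minus-deductions structure across 22 flag types.

```python
# Flaw-hunter scoring sketch. The per-flag WEIGHTS and the zero floor are
# ASSUMPTIONS; the page only states "100 minus deductions across 22 flag types".
from statistics import mean, median

WEIGHTS = {  # hypothetical weights for 3 of the 22 flag types
    "purple_prose": 2.0,
    "recycled_description": 2.0,
    "convenient_world": 3.0,
}

def session_score(flags: list[str]) -> float:
    """One adversarial 12-turn session: 100 minus a deduction per flag."""
    return max(0.0, 100.0 - sum(WEIGHTS.get(f, 1.0) for f in flags))

# One list of caught flags per session; real runs have twenty sessions.
sessions = [
    ["purple_prose", "recycled_description"] * 3,
    ["convenient_world"] * 4,
]
scores = [session_score(s) for s in sessions]
print(f"mean {mean(scores):.1f} · median {median(scores):.1f}")
```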
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 02 verdict
GLM 5.1 is what you pick when your evaluation criteria look more like an editor's checklist than a reader's gut reaction. The fatal-flaw and tone-consistency numbers are real, and they say the prose is unusually clean. The multi-turn ELO says nobody noticed. If your downstream product rewards consistency, this is a sleeper; if it rewards engagement, look elsewhere.
▌ Section 10 · Compare & Drill

Stack it against another model.

The Standings · all 11 models · full leaderboard, all tests, all filters · Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset