The Standings · Round 01 · April 2026

▌ At a glance
3,734 votes  ·  21 models  ·  801 voters
R01: 1,857 · R02: 1,877 · both closed
75% catch-pair · CC-BY 4.0
Composite · how it works

rp-benchmark's composite score is the canonical 'best model overall' answer. Each model's normalized z-score on each of the five axes is weighted, then the blend is mapped onto 0–100 for readability.

How it's scored
Multi-turn arena ELO (35%), LLM-judge Likert mean (25%), rubric overall (20%), flaw hunter (15%), behavioral metrics (5%). Models missing one of the five axes are still ranked; the missing component is treated as the pool mean so a partial entry doesn't get an artificial bump or drop.
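The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not rp-benchmark's actual code: the axis key names, the "higher is better on every axis" convention, and the toy pool are all assumptions. The page does not say how the blend is mapped onto 0–100; the sketch assumes a midpoint percentile rank, which happens to reproduce the evenly spaced scores published for the 21-model pool (97.6 for rank 1, 50.0 for rank 11).

```python
from statistics import mean, pstdev

# Weights as stated in the scoring note; the axis key names are hypothetical.
WEIGHTS = {
    "multi_turn_elo": 0.35,
    "judge_likert": 0.25,
    "rubric_overall": 0.20,
    "flaw_hunter": 0.15,
    "behavioral": 0.05,
}


def z_scores(values):
    """Z-score the present values; None (a missing axis) maps to 0.0, the pool mean."""
    present = [v for v in values if v is not None]
    if not present:
        return [0.0] * len(values)
    mu = mean(present)
    sigma = pstdev(present) or 1.0  # guard against a zero-variance axis
    return [0.0 if v is None else (v - mu) / sigma for v in values]


def rank_to_score(rank, pool_size):
    """Assumed 0-100 mapping: midpoint percentile of the rank within the pool."""
    return round(100 * (pool_size - rank + 0.5) / pool_size, 1)


def composite(pool):
    """pool maps model name -> {axis: raw value}; higher is assumed better on every axis."""
    names = list(pool)
    blend = dict.fromkeys(names, 0.0)
    for axis, weight in WEIGHTS.items():
        axis_z = z_scores([pool[n].get(axis) for n in names])
        for name, z in zip(names, axis_z):
            blend[name] += weight * z
    ordered = sorted(names, key=blend.get, reverse=True)
    return {n: rank_to_score(i, len(ordered)) for i, n in enumerate(ordered, start=1)}


# Made-up three-model pool; model_b has no behavioral axis and gets z = 0 there.
toy_pool = {
    "model_a": {"multi_turn_elo": 1600, "judge_likert": 4.2, "rubric_overall": 80,
                "flaw_hunter": 0.9, "behavioral": 0.8},
    "model_b": {"multi_turn_elo": 1500, "judge_likert": 4.0, "rubric_overall": 70,
                "flaw_hunter": 0.7},
    "model_c": {"multi_turn_elo": 1400, "judge_likert": 3.8, "rubric_overall": 60,
                "flaw_hunter": 0.5, "behavioral": 0.2},
}
print(composite(toy_pool))
```

Treating a missing axis as z = 0 means it adds nothing to the weighted sum, which is how a partial entry avoids an artificial bump or drop.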
How to read the table
Higher = better all-axis performance. The score isn't an ELO; it's a normalized blend, so a model at 50 sits at the pool's average across all five axes. Compare each row's composite rank against its multi-turn-only rank in the cross-test grid: big gaps surface 'great vibe, dirty prose' or vice versa.
The Pick · all models

I · Claude Opus 4.7

Anthropic · proprietary
▌ Why this pick

Default ranking: rp-benchmark's composite score, a weighted blend of multi-turn arena ELO (35%), LLM-judge Likert (25%), rubric overall (20%), flaw hunter (15%), and behavioral metrics (5%). Claude Opus 4.7 leads at 97.6/100 and now also tops the raw multi-turn votes. GLM 4.7 jumps from multi-turn #8 to composite #2 (92.9), making it the strongest open-weight option once cross-axis reliability is weighted in. Sonnet 4.5 climbs even further, from multi-turn #12 to composite #4 (83.3), because it scores well on every axis, not just engagement. The Engagement column was re-derived from a Sonnet-4 judge proxy on 2026-05-02.

Top of the multi-turn pool. Top-1 on agency respect and instruction drift. Runner-up: GLM 4.7.

Multi-Turn ELO (R2): 1583 ± 44 · Round 02 (closed) · n=142
SFW Win Rate · wins on safe-scene votes
NSFW Win Rate · wins on explicit-scene votes
Reliability Rank: #1 · avg 2.81 = most reliable in the 21-model pool
Cost / 1M tokens: $39.00 · blended 60% input / 40% output
Response Time: 10,174 ms · median generation time per response
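The blended cost figure is a 60/40 weighted average of input and output prices. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the two prices are illustrative placeholders chosen only so the result lands on the $39.00 shown on the card, not figures taken from the page.

```python
def blended_cost_per_million(input_price, output_price, input_share=0.6):
    """Blend per-1M-token input and output prices by the stated 60% / 40% split."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


# Hypothetical prices in USD per 1M tokens (assumed, not from the page).
example = blended_cost_per_million(15.00, 75.00)
print(f"${example:.2f}")
```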
All Tests columns show rank per test: ★ comp · E elo · MT m-turn · RU rub · AD adv · $ cost. Spread is the best–worst rank across those columns. Votes are R01+R02.

| # | Model | Tag | Maker · license · context | Score /100 | SFW | NSFW | Engage (judge /5) | ★ | E | MT | RU | AD | $ | Spread | Votes (R1 · R2) | Verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I | Claude Opus 4.7 | Champion | Anthropic · proprietary · 200K | 97.6 | | | 4.05 | 1 | | 1 | 1 | 1 | 19 | 1–19 | 142 (0 · 142) | Top of the multi-turn pool. Top-1 on agency respect and instruction drift. |
| II | GLM 4.7 | | Z.AI · open · 128K | 92.9 | 46% | 49% | 4.07 | 2 | 9 | 8 | 9 | 7 | 11 | 2–11 | 504 (285 · 219) | Mid-pack across the board. No standout strength. |
| III | Claude Opus 4.6 | Reliable | Anthropic · proprietary · 200K | 88.1 | | | 4.10 | 3 | | 4 | 2 | 2 | 20 | 2–20 | 215 (0 · 215) | Reliability runner-up. Top-2 on agency, complete failure-mode coverage. |
| IV | Claude Sonnet 4.5 | Reliable | Anthropic · proprietary · 200K | 83.3 | 51% | 51% | 4.06 | 4 | 5 | 12 | 4 | 3 | 17 | 3–17 | 409 (194 · 215) | Round 01 reliability leader. Tied #1 on context attention. |
| V | GPT-4.1 | | OpenAI · proprietary · 1M | 78.6 | 43% | 46% | 4.06 | 5 | 11 | 5 | 10 | 6 | 16 | 5–16 | 424 (215 · 209) | Community last in Round 01, top-5 in Round 02 multi-turn. The great inversion. |
| VI | DeepSeek v3.2 | | DeepSeek · open · 128K | 73.8 | 51% | 30% | 4.05 | 6 | 7 | 14 | 7 | 4 | 3 | 3–14 | 463 (241 · 222) | Reliable, NSFW-shy at 30%. Strong lore retention. |
| VII | DeepSeek v4 Pro | | DeepSeek · open · 128K | 69.0 | | | 4.02 | 7 | | 2 | 3 | 5 | 15 | 2–15 | 157 (0 · 157) | Strong tone consistency at fraction of Opus pricing. |
| VIII | DeepSeek v4 Flash | Cheap | DeepSeek · open · 128K | 64.3 | | | 3.94 | 8 | | 11 | 8 | 10 | 2 | 2–11 | 151 (0 · 151) | Cheapest tier, top flaw-hunter score. Multi-turn ELO drags it down. |
| IX | Kimi K2.5 | | Moonshot · open · 128K | 59.5 | | | 4.23 | 9 | | 10 | 5 | 8 | 13 | 5–13 | 150 (0 · 150) | Strong on tone consistency. Slow generation. |
| X | MiniMax M2.7 | | MiniMax · proprietary · 200K | 54.8 | 54% | 45% | 3.62 | 10 | 4 | 13 | 11 | 11 | 9 | 4–13 | 597 (393 · 204) | Strong narrative push. Fragile under adversarial pressure. |
| XI | Mistral Small Creative | NSFW | Mistral · open · local-friendly · 32K | 50.0 | 51% | 67% | 4.14 | 11 | 2 | 6 | 16 | 15 | 8 | 2–16 | 867 (646 · 221) | NSFW specialist. Fastest in the field. Drifts on long sessions. |
| XII | Kimi K2.6 | ⚠ Floor | Moonshot · open · 128K | 45.2 | | | 4.19 | 12 | | 9 | 18 | 17 | 12 | 9–18 | 152 (0 · 152) | Top-2 on flaw hunter. Catastrophic agency floor on bait scenes. |
| XIII | Gemini 3.1 Flash Lite | Cheap | Google · proprietary · 1M | 40.5 | | | 4.02 | 13 | | 3 | 14 | 16 | 1 | 1–16 | 146 (0 · 146) | Cheapest tier with Round 02 top-5 multi-turn ELO. |
| XIV | Gemini 3.1 Pro | | Google · proprietary · 1M | 35.7 | | | 4.11 | 14 | | 7 | 12 | 13 | 18 | 7–18 | 158 (0 · 158) | Deep context window, brittle on adversarial probes. |
| XV | Gemma 4 26B | Round 01 #1 | Google · open · local-friendly · 8K | 31.0 | 55% | 51% | 3.79 | 15 | 1 | 15 | 15 | 12 | 7 | 1–15 | 509 (302 · 207) | Round 01 champion, mid-pack on multi-turn. The cheap local-friendly hold. |
| XVI | GLM 5.1 | | Z.AI · open · 128K | 26.2 | | | 4.12 | 16 | | 17 | 6 | 9 | 14 | 6–17 | 149 (0 · 149) | Strong on tone consistency, weak on multi-turn engagement. |
| XVII | DeepSeek R1 0528 | | DeepSeek · open · 164K | 21.4 | | | | 17 | | | 13 | 14 | | 13–17 | | 2025-vintage reasoner. Clean prose metrics, weak instruction-keeping. No arena votes yet. |
| XVIII | Gemini 2.5 Flash | | Google · proprietary · 1M | 16.7 | 53% | 54% | 3.84 | 18 | 3 | 19 | 19 | 18 | 10 | 3–19 | 442 (241 · 201) | Round 01 top-3, dropped to bottom of Round 02 multi-turn. |
| XIX | Grok 4.1 | | xAI · proprietary · 128K | 11.9 | 50% | 52% | 4.04 | 19 | 6 | 18 | 17 | 20 | 4 | 4–20 | 541 (322 · 219) | Personality up front. Drifts fast under pressure. |
| XX | Qwen 3.5 Flash | ⚠ Floor | Alibaba · open · local-friendly · 128K | 7.1 | 48% | 42% | 3.92 | 20 | 8 | 20 | 20 | 19 | 6 | 6–20 | 608 (401 · 207) | Floor on agency and instruction drift. Caveat emptor. |
| XXI | Llama 4 Maverick | | Meta · open · 128K | 2.4 | 47% | 34% | 3.59 | 21 | 10 | 16 | 21 | 21 | 5 | 5–21 | 684 (474 · 210) | Last on every reliability mode. Open-source completist only. |
▌ Movers This Round
Climber · +6 → composite #2
GLM 4.7
"Mid-pack on raw multi-turn votes (ELO #8), yet vaults to composite #2 — the strongest open-weight blend of rubric, judge, and reliability in the pool"
Held · ═ composite #1
Claude Opus 4.7
"Holds the composite crown and retook #1 on raw multi-turn votes from DeepSeek v4 Pro in the June regen — top of both views"
Diver · −10 → composite #13
Gemini 3.1 Flash Lite
"Multi-turn ELO #3 (cheap + fast voters loved it) but bottom-quartile on flaw-hunter + behavioral"
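The mover figures appear to be the gap between a model's multi-turn-only rank and its composite rank. A small sketch under that assumption, using the rank pairs quoted for the three movers; the function names and label format are illustrative.

```python
# (multi-turn rank, composite rank) as quoted for each mover.
RANKS = {
    "GLM 4.7": (8, 2),
    "Claude Opus 4.7": (1, 1),
    "Gemini 3.1 Flash Lite": (3, 13),
}


def movement(multi_turn_rank, composite_rank):
    """Positive when a model ranks higher on composite than on multi-turn votes alone."""
    return multi_turn_rank - composite_rank


def label(delta):
    """Format a rank delta in the style of the mover cards."""
    if delta > 0:
        return f"Climber · +{delta}"
    if delta < 0:
        return f"Diver · −{-delta}"
    return "Held · ═"


for model, (mt_rank, comp_rank) in RANKS.items():
    print(f"{model}: {label(movement(mt_rank, comp_rank))} → composite #{comp_rank}")
```

A large positive or negative delta is the "big gap" the reading guide points to: strong on raw votes but weak on the other axes, or the reverse.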
▌ Coverage
1,857 total votes (Round 01)
271 pairs · median 7 votes/pair
75% catch-pair · n=335
47% judge–human disagreement
Methodology · Raw votes (CSV) · GitHub · HF dataset
Next issue · 05-15-2026