The Standings. Round 01 · April 2026

▌ At a glance
3,734 votes  ·  21 models  ·  801 voters
R01: 1,857 · R02: 1,877 · both closed
75% catch-pair · CC-BY 4.0

▌ Composite · how it works

The composite score is rp-benchmark's canonical answer to 'which model is best overall'. Each model's normalized z-score on each of five axes is weighted, and the weighted sum is mapped onto 0–100 for readability.

▌ How it's scored

Multi-turn arena ELO (35%), LLM-judge Likert mean (25%), rubric overall (20%), flaw hunter (15%), behavioral metrics (5%). Models missing an axis are still ranked; each missing component is treated as the pool mean, so a partial entry doesn't get an artificial bump or drop.
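A minimal sketch of the blend described above, under stated assumptions. The weights come from this page; the axis keys (`elo`, `judge`, `rubric`, `flaw`, `behavior`), the toy pool values, and the use of a population standard deviation are hypothetical. The page does not say how the blend is mapped onto 0–100; the sketch uses a midpoint percentile rank, which is an inference from the evenly spaced values in the Score column, not a documented method.

```python
from statistics import mean, pstdev

# Weights as stated on the page; the axis keys are hypothetical names.
WEIGHTS = {"elo": 0.35, "judge": 0.25, "rubric": 0.20, "flaw": 0.15, "behavior": 0.05}

def zscores(values):
    """z-score one axis; a missing value (None) becomes 0.0, i.e. the pool mean."""
    known = [v for v in values if v is not None]
    if len(known) < 2:
        return [0.0] * len(values)
    mu, sigma = mean(known), pstdev(known)
    if sigma == 0:
        return [0.0] * len(values)
    return [0.0 if v is None else (v - mu) / sigma for v in values]

def composite(pool):
    """Weighted z-score blend, mapped to 0-100 by midpoint percentile rank (assumed)."""
    names = list(pool)
    blend = dict.fromkeys(names, 0.0)
    for axis, weight in WEIGHTS.items():
        for name, z in zip(names, zscores([pool[n].get(axis) for n in names])):
            blend[name] += weight * z
    ranked = sorted(names, key=blend.get, reverse=True)
    total = len(ranked)
    return {n: round(100 * (total - i - 0.5) / total, 1) for i, n in enumerate(ranked)}

# Toy pool with made-up axis values (higher is better on every axis).
toy_pool = {
    "model_a": {"elo": 1580, "judge": 4.1, "rubric": 0.9, "flaw": 0.8, "behavior": 0.7},
    "model_b": {"elo": 1500, "judge": 4.0, "rubric": 0.8, "flaw": 0.9},  # behavior missing
    "model_c": {"elo": 1450, "judge": 3.6, "rubric": 0.5, "flaw": 0.4, "behavior": 0.5},
}
scores = composite(toy_pool)
print(scores)
```

With 21 models the same mapping gives 97.6 for rank 1, 50.0 for rank 11, and 2.4 for rank 21, which agrees with the Score column in the standings table.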

▌ How to read the table

Higher means better all-axis performance. The score is not an ELO but a normalized blend: a model at 50 sits at the pool's average across all five axes. Compare each row's composite rank (★) against its multi-turn-only rank (MT) in the cross-test grid; big gaps surface 'great vibe, dirty prose' or vice versa.
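The rank comparison can be sketched as a simple difference. The rank pairs below are copied from the ★ and MT columns of the standings table; the helper name `rank_gap` and the sign convention (positive means the model climbs once all axes are blended) are illustrative choices, not part of rp-benchmark.

```python
def rank_gap(multi_turn_rank: int, composite_rank: int) -> int:
    """Positive: the model ranks higher on composite than on multi-turn alone."""
    return multi_turn_rank - composite_rank

# (multi-turn rank, composite rank) as listed in the standings table.
rows = {
    "Claude Opus 4.7": (1, 1),
    "GLM 4.7": (8, 2),
    "DeepSeek v4 Pro": (2, 7),
    "Gemini 3.1 Flash Lite": (3, 13),
}
gaps = {name: rank_gap(mt, comp) for name, (mt, comp) in rows.items()}

# Largest absolute gaps first: these are the rows worth a closer look.
for name, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:24s} {gap:+d}")
```

A large negative gap marks a model that wins votes but slips once rubric, flaw hunter, and behavioral scores are blended in; a large positive gap marks the reverse.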

▌ The Pick · all models

Claude Opus 4.7

Anthropic · proprietary

▌ Why this pick

Default ranking: rp-benchmark's composite score, a weighted blend of multi-turn arena ELO (35%), LLM-judge Likert (25%), rubric overall (20%), flaw hunter (15%), and behavioral metrics (5%). Claude Opus 4.7 leads at 97.6/100 and now also tops the raw multi-turn votes. GLM 4.7 jumps from multi-turn #8 to composite #2 (92.9), making it the strongest open-weight option once cross-axis reliability is weighted in. Sonnet 4.5 climbs even further, from multi-turn #13 to composite #4 (83.3), because it scores well on every axis, not just engagement. The Engage column was re-derived from a Sonnet-4 judge proxy on 2026-05-02.

Top of the multi-turn pool. Top-1 on agency respect and instruction drift. Runner-up: GLM 4.7.

1583 · Multi-Turn ELO (R2) · ±44 · Round 02 (closed) · n=142
— · SFW Win Rate · wins on safe-scene votes
— · NSFW Win Rate · wins on explicit-scene votes
#1 · Reliability Rank · avg 2.81 = most reliable in the 21-model pool
$39.00 · Cost / 1M tokens · blended 60% input / 40% output
10,174 ms · Response Time · median generation time per response
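The blended cost figure is a weighted average of input and output prices. A minimal sketch, assuming the 60% / 40% split stated above; the function name and the example prices are hypothetical (the page does not list separate input and output prices) and were chosen only so the result lands on the $39.00 shown.

```python
def blended_cost(input_price: float, output_price: float, input_share: float = 0.6) -> float:
    """Blend per-1M-token prices: input_share on input, the remainder on output."""
    return input_share * input_price + (1.0 - input_share) * output_price

# Hypothetical prices in USD per 1M tokens, not taken from the page.
example = blended_cost(15.00, 75.00)
print(f"${example:.2f} per 1M tokens")
```

A 60/40 split weights the blend toward the cheaper input side, so models with a high output price look less expensive here than they would on output-heavy workloads.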

№	Spread	Model [tag]	Vendor · license · context	Verdict	Score (/100)	SFW	NSFW	Engage (judge /5)	★ comp	E elo	MT m-turn	RU rub	AD adv	$ cost	Votes (R01+R02)	R1	R2
I	1→19	Claude Opus 4.7 [Champion]	Anthropic · proprietary · 200K	Top of the multi-turn pool. Top-1 on agency respect and instruction drift.	97.6	—	—	4.05	1	—	1	1	1	19	142	0	142
II	2→11	GLM 4.7	Z.AI · open · 128K	Mid-pack across the board. No standout strength.	92.9	46%	49%	4.07	2	9	8	9	7	11	504	285	219
III	2→20	Claude Opus 4.6 [Reliable]	Anthropic · proprietary · 200K	Reliability runner-up. Top-2 on agency, complete failure-mode coverage.	88.1	—	—	4.10	3	—	4	2	2	20	215	0	215
IV	3→17	Claude Sonnet 4.5 [Reliable]	Anthropic · proprietary · 200K	Round 01 reliability leader. Tied #1 on context attention.	83.3	51%	51%	4.06	4	5	12	4	3	17	409	194	215
V	5→16	GPT-4.1	OpenAI · proprietary · 1M	Community last in Round 01, top-5 in Round 02 multi-turn. The great inversion.	78.6	43%	46%	4.06	5	11	5	10	6	16	424	215	209
VI	3→14	DeepSeek v3.2	DeepSeek · open · 128K	Reliable, NSFW-shy at 30%. Strong lore retention.	73.8	51%	30%	4.05	6	7	14	7	4	3	463	241	222
VII	2→15	DeepSeek v4 Pro	DeepSeek · open · 128K	Strong tone consistency at fraction of Opus pricing.	69.0	—	—	4.02	7	—	2	3	5	15	157	0	157
VIII	2→11	DeepSeek v4 Flash [Cheap]	DeepSeek · open · 128K	Cheapest tier, top flaw-hunter score. Multi-turn ELO drags it down.	64.3	—	—	3.94	8	—	11	8	10	2	151	0	151
IX	5→13	Kimi K2.5	Moonshot · open · 128K	Strong on tone consistency. Slow generation.	59.5	—	—	4.23	9	—	10	5	8	13	150	0	150
X	4→13	MiniMax M2.7	MiniMax · proprietary · 200K	Strong narrative push. Fragile under adversarial pressure.	54.8	54%	45%	3.62	10	4	13	11	11	9	597	393	204
XI	2→16	Mistral Small Creative [NSFW]	Mistral · open · local-friendly · 32K	NSFW specialist. Fastest in the field. Drifts on long sessions.	50.0	51%	67%	4.14	11	2	6	16	15	8	867	646	221
XII	9→18	Kimi K2.6 [⚠ Floor]	Moonshot · open · 128K	Top-2 on flaw hunter. Catastrophic agency floor on bait scenes.	45.2	—	—	4.19	12	—	9	18	17	12	152	0	152
XIII	1→16	Gemini 3.1 Flash Lite [Cheap]	Google · proprietary · 1M	Cheapest tier with Round 02 top-5 multi-turn ELO.	40.5	—	—	4.02	13	—	3	14	16	1	146	0	146
XIV	7→18	Gemini 3.1 Pro	Google · proprietary · 1M	Deep context window, brittle on adversarial probes.	35.7	—	—	4.11	14	—	7	12	13	18	158	0	158
XV	1→15	Gemma 4 26B [Round 01 #1]	Google · open · local-friendly · 8K	Round 01 champion, mid-pack on multi-turn. The cheap local-friendly hold.	31.0	55%	51%	3.79	15	1	15	15	12	7	509	302	207
XVI	6→17	GLM 5.1	Z.AI · open · 128K	Strong on tone consistency, weak on multi-turn engagement.	26.2	—	—	4.12	16	—	17	6	9	14	149	0	149
XVII	13→17	DeepSeek R1 0528	DeepSeek · open · 164K	2025-vintage reasoner. Clean prose metrics, weak instruction-keeping. No arena votes yet.	21.4	—	—	—	17	—	—	13	14	—	—	—	—
XVIII	3→19	Gemini 2.5 Flash	Google · proprietary · 1M	Round 01 top-3, dropped to bottom of Round 02 multi-turn.	16.7	53%	54%	3.84	18	3	19	19	18	10	442	241	201
XIX	4→20	Grok 4.1	xAI · proprietary · 128K	Personality up front. Drifts fast under pressure.	11.9	50%	52%	4.04	19	6	18	17	20	4	541	322	219
XX	6→20	Qwen 3.5 Flash [⚠ Floor]	Alibaba · open · local-friendly · 128K	Floor on agency and instruction drift. Caveat emptor.	7.1	48%	42%	3.92	20	8	20	20	19	6	608	401	207
XXI	5→21	Llama 4 Maverick	Meta · open · 128K	Last on every reliability mode. Open-source completist only.	2.4	47%	34%	3.59	21	10	16	21	21	5	684	474	210
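The Spread column appears to be the best and worst rank a model holds across the six per-test ranks (★, E, MT, RU, AD, $), skipping tests with no data. That reading is inferred from the rows above rather than documented on the page; the sketch below, with a hypothetical helper `spread`, reproduces the published values for two rows.

```python
def spread(ranks):
    """Best→worst rank over the tests a model has a rank for (None = no data)."""
    known = [r for r in ranks if r is not None]
    return f"{min(known)}→{max(known)}"

# (★ comp, E elo, MT m-turn, RU rub, AD adv, $ cost) copied from the table;
# a dash in the table is represented as None.
opus_47 = (1, None, 1, 1, 1, 19)
r1_0528 = (17, None, None, 13, 14, None)

print(spread(opus_47))
print(spread(r1_0528))
```

A wide spread such as 1→19 flags a model that is top-ranked on some tests and near the bottom on others, here driven by its cost rank.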

▌ Movers This Round

▲ Climber · +6 → composite #2

GLM 4.7

"Mid-pack on raw multi-turn votes (ELO #8), yet vaults to composite #2 — the strongest open-weight blend of rubric, judge, and reliability in the pool"

═ Held · ═ composite #1

Claude Opus 4.7

"Holds the composite crown and retook #1 on raw multi-turn votes from DeepSeek v4 Pro in the June regen — top of both views"

▼ Diver · −10 → composite #13

Gemini 3.1 Flash Lite

"Multi-turn ELO #3 (cheap and fast, and voters loved it) but bottom-quartile on flaw hunter and behavioral metrics"

▌ Coverage

1,857 total votes (R01)
271 pairs · median 7 votes/pair
75% catch-pair · n=335
47% judge–human disagreement

Methodology · Raw votes (CSV) · GitHub · HF dataset

Next issue · 2026-05-15