The Standings · Round 01 · April 2026
▌ At a glance
3,183 votes · 20 models · 665 voters
R01: 1,857 · R02: 1,326 in flight
75% catch-pair · CC-BY 4.0
▌ Composite · how it works
The composite score is rp-benchmark's canonical answer to 'best model overall'. Each model's normalized z-score on each of the five axes is weighted, and the weighted sum is mapped onto 0–100 for readability.
▌ How it's scored
Multi-turn arena ELO (35%), LLM-judge Likert mean (25%), rubric overall (20%), flaw hunter (15%), behavioral metrics (5%). Models missing one of the five axes are still ranked; the missing component is treated as the pool mean so a partial entry doesn't get an artificial bump or drop.
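The blend described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the axis keys, the function names, and above all the final mapping onto 0–100 (here a linear `50 + scale * z`, clipped) are assumptions, since the page only says that a pool-average model lands at 50. The weights and the "missing axis counts as pool mean" rule are taken from the text.

```python
from statistics import mean, pstdev

# Axis weights as stated in the text; the key names are hypothetical.
WEIGHTS = {
    "multi_turn_elo": 0.35,
    "judge_likert": 0.25,
    "rubric": 0.20,
    "flaw_hunter": 0.15,
    "behavioral": 0.05,
}


def z_scores(raw):
    """Z-score one axis over the models that actually have a value."""
    present = {m: v for m, v in raw.items() if v is not None}
    if not present:
        return {}
    mu = mean(present.values())
    sigma = pstdev(present.values()) or 1.0  # guard against zero spread
    return {m: (v - mu) / sigma for m, v in present.items()}


def composite(axes, scale=20.0):
    """axes: {axis: {model: raw value or None}} -> {model: score on 0-100}.

    `scale` and the linear mapping are assumptions made for this sketch.
    """
    models = {m for per_axis in axes.values() for m in per_axis}
    z = {axis: z_scores(vals) for axis, vals in axes.items()}
    scores = {}
    for m in models:
        # A missing axis contributes z = 0.0, i.e. the pool mean.
        blended = sum(w * z.get(axis, {}).get(m, 0.0) for axis, w in WEIGHTS.items())
        scores[m] = max(0.0, min(100.0, 50.0 + scale * blended))
    return scores


# Toy pool: B sits at the mean of every axis, D has no data at all.
demo = {axis: {"A": 1, "B": 2, "C": 3, "D": None} for axis in WEIGHTS}
for model, score in sorted(composite(demo).items()):
    print(model, round(score, 1))
```

In the toy pool, B (average everywhere) and D (no data on any axis) both score exactly 50, which is the behavior the text describes: a partial entry is neither rewarded nor penalized for what is missing.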
▌ How to read the table
Higher means better all-axis performance. The score is not an ELO but a normalized blend: a model at 50 sits at the pool's average across all five axes. Compare each row's composite rank against its multi-turn-only rank in the cross-test grid; a big gap surfaces a 'great vibe, dirty prose' model, or the reverse.
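As a rough illustration of that comparison, the snippet below computes the gap between multi-turn rank and composite rank for a handful of rows. The rank pairs are copied from the ★ (composite) and MT (multi-turn) values in the table; the sign convention (positive means the composite ranks the model higher than votes alone) and the function name are choices made for this sketch.

```python
# (composite rank, multi-turn rank), copied from the table's ★ and MT values.
RANKS = {
    "Claude Opus 4.7": (1, 3),
    "Claude Sonnet 4.5": (2, 10),
    "DeepSeek v4 Pro": (5, 1),
    "Gemini 3.1 Flash Lite": (13, 4),
    "GLM 5.1": (16, 17),
}


def rank_gaps(ranks):
    """Return (model, gap) pairs, largest absolute gap first.

    gap = multi-turn rank - composite rank, so a positive gap means the
    all-axis blend places the model higher than multi-turn votes do.
    """
    gaps = {model: mt - comp for model, (comp, mt) in ranks.items()}
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)


for model, gap in rank_gaps(RANKS):
    print(f"{model:24s} {gap:+d}")
```

The two largest gaps here, Gemini 3.1 Flash Lite at -9 and Claude Sonnet 4.5 at +8, are the same figures reported under Movers This Round.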
All Tests columns: ★ comp · E elo · MT m-turn · RU rub · AD adv · $ cost

| № | Spread | Model | Verdict | Score (/100) | SFW | NSFW | Engage (judge, /5) | ★ | E | MT | RU | AD | $ | Votes (R01+R02) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 1→19 | Claude Opus 4.7 [Champion] · Anthropic · proprietary · 200K | Top of the multi-turn pool. Top-1 on agency respect and instruction drift. | 97.6 | — | — | 4.05 | 1 | — | 3 | 1 | 1 | 19 | 94 (R1 0 · R2 94) |
| II | 2→17 | Claude Sonnet 4.5 [Reliable] · Anthropic · proprietary · 200K | Round 01 reliability leader. Tied #1 on context attention. | 92.9 | 51% | 51% | 4.06 | 2 | 5 | 10 | 4 | 3 | 17 | 341 (R1 194 · R2 147) |
| III | 2→20 | Claude Opus 4.6 [Reliable] · Anthropic · proprietary · 200K | Reliability runner-up. Top-2 on agency, complete failure-mode coverage. | 88.1 | — | — | 4.10 | 3 | — | 2 | 2 | 2 | 20 | 143 (R1 0 · R2 143) |
| IV | 4→16 | GPT-4.1 · OpenAI · proprietary · 1M | Community last in Round 01, top-5 in Round 02 multi-turn. The great inversion. | 83.3 | 43% | 46% | 4.06 | 4 | 11 | 5 | 10 | 6 | 16 | 356 (R1 215 · R2 141) |
| V | 1→15 | DeepSeek v4 Pro · DeepSeek · open · 128K | Strong tone consistency at fraction of Opus pricing. | 78.6 | — | — | 4.02 | 5 | — | 1 | 3 | 5 | 15 | 104 (R1 0 · R2 104) |
| VI | 6→11 | GLM 4.7 · Z.AI · open · 128K | Mid-pack across the board. No standout strength. | 73.8 | 46% | 49% | 4.07 | 6 | 9 | 9 | 9 | 7 | 11 | 441 (R1 285 · R2 156) |
| VII | 3→15 | DeepSeek v3.2 · DeepSeek · open · 128K | Reliable, NSFW-shy at 30%. Strong lore retention. | 69.0 | 51% | 30% | 4.05 | 7 | 7 | 15 | 7 | 4 | 3 | 402 (R1 241 · R2 161) |
| VIII | 5→13 | Kimi K2.5 · Moonshot · open · 128K | Strong on tone consistency. Slow generation. | 64.3 | — | — | 4.23 | 8 | — | 11 | 5 | 8 | 13 | 99 (R1 0 · R2 99) |
| IX | 2→13 | DeepSeek v4 Flash [Cheap] · DeepSeek · open · 128K | Cheapest tier, top flaw-hunter score. Multi-turn ELO drags it down. | 59.5 | — | — | 3.94 | 9 | — | 13 | 8 | 11 | 2 | 106 (R1 0 · R2 106) |
| X | 4→14 | MiniMax M2.7 · MiniMax · proprietary · 200K | Strong narrative push. Fragile under adversarial pressure. | 54.8 | 54% | 45% | 3.62 | 10 | 4 | 14 | 11 | 10 | 9 | 537 (R1 393 · R2 144) |
| XI | 8→17 | Kimi K2.6 [⚠ Floor] · Moonshot · open · 128K | Top-2 on flaw hunter. Catastrophic agency floor on bait scenes. | 50.0 | — | — | 4.19 | 11 | — | 8 | 17 | 16 | 12 | 107 (R1 0 · R2 107) |
| XII | 1→14 | Gemma 4 26B [Round 01 #1] · Google · open · local-friendly · 8K | Round 01 champion, mid-pack on multi-turn. The cheap local-friendly hold. | 45.2 | 55% | 51% | 3.79 | 12 | 1 | 12 | 14 | 12 | 7 | 456 (R1 302 · R2 154) |
| XIII | 1→15 | Gemini 3.1 Flash Lite [Cheap] · Google · proprietary · 1M | Cheapest tier with Round 02 top-5 multi-turn ELO. | 40.5 | — | — | 4.02 | 13 | — | 4 | 13 | 15 | 1 | 97 (R1 0 · R2 97) |
| XIV | 2→15 | Mistral Small Creative [NSFW] · Mistral · open · local-friendly · 32K | NSFW specialist. Fastest in the field. Drifts on long sessions. | 35.7 | 51% | 67% | 4.14 | 14 | 2 | 7 | 15 | 14 | 8 | 812 (R1 646 · R2 166) |
| XV | 6→18 | Gemini 3.1 Pro · Google · proprietary · 1M | Deep context window, brittle on adversarial probes. | 31.0 | — | — | 4.11 | 15 | — | 6 | 12 | 13 | 18 | 118 (R1 0 · R2 118) |
| XVI | 6→17 | GLM 5.1 · Z.AI · open · 128K | Strong on tone consistency, weak on multi-turn engagement. | 26.2 | — | — | 4.12 | 16 | — | 17 | 6 | 9 | 14 | 107 (R1 0 · R2 107) |
| XVII | 4→19 | Grok 4.1 · xAI · proprietary · 128K | Personality up front. Drifts fast under pressure. | 16.7 | 50% | 52% | 4.04 | 17 | 6 | 18 | 16 | 19 | 4 | 477 (R1 322 · R2 155) |
| XVIII | 3→20 | Gemini 2.5 Flash · Google · proprietary · 1M | Round 01 top-3, dropped to bottom of Round 02 multi-turn. | 11.9 | 53% | 54% | 3.84 | 18 | 3 | 20 | 18 | 17 | 10 | 383 (R1 241 · R2 142) |
| XIX | 6→19 | Qwen 3.5 Flash [⚠ Floor] · Alibaba · open · local-friendly · 128K | Floor on agency and instruction drift. Caveat emptor. | 7.1 | 48% | 42% | 3.92 | 19 | 8 | 19 | 19 | 18 | 6 | 549 (R1 401 · R2 148) |
| XX | 5→20 | Llama 4 Maverick · Meta · open · 128K | Last on every reliability mode. Open-source completist only. | 2.4 | 47% | 34% | 3.59 | 20 | 10 | 16 | 20 | 20 | 5 | 637 (R1 474 · R2 163) |
▌ Movers This Round
▲ Climber · +8 → composite #2
Claude Sonnet 4.5
"Only mid-pack on raw multi-turn votes (ELO #10), yet vaults to composite #2 — elite rubric, flaw-hunter, and reliability the vote-only view undersells"
═ Held · ═ composite #1
Claude Opus 4.7
"Holds the composite crown even after DeepSeek v4 Pro overtook it on raw multi-turn votes — still #1 on the all-axis blend"
▼ Diver · −9 → composite #13
Gemini 3.1 Flash Lite
"Multi-turn ELO #4 (cheap + fast voters loved it) but bottom-quartile on flaw-hunter + behavioral"
▌ Coverage
1,857 total votes (Round 01)
271 pairs · median 7 votes/pair
75% catch-pair · n=335
47% judge–human disagreement
Next issue · May 15, 2026