The Standings. Round 01 · April 2026
▌ At a glance
3,734 votes · 21 models · 801 voters
R01: 1,857 · R02: 1,877 · both closed
75% catch-pair · CC-BY 4.0
▌ Composite · how it works
The composite score is rp-benchmark's canonical answer to 'best model overall'. Each model's normalized z-score on each of five axes is weighted, and the weighted blend is mapped onto 0–100 for readability.
▌ How it's scored
The five axes and their weights are multi-turn arena ELO (35%), LLM-judge Likert mean (25%), rubric overall (20%), flaw hunter (15%), and behavioral metrics (5%). Models missing one of the five axes are still ranked; the missing component is treated as the pool mean, so a partial entry doesn't get an artificial bump or drop.
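The weighting described above can be sketched roughly as follows. This is a minimal illustration, not rp-benchmark's actual code: the axis key names, the `scale` factor, and the linear map `50 + scale * z` (clipped to 0–100) are assumptions, since the page only says the blend is "mapped onto 0–100" with 50 at the pool average. It also assumes a higher raw value is better on every axis.

```python
from statistics import mean, pstdev

# Weights as stated on the page; the axis key names are hypothetical.
WEIGHTS = {
    "multi_turn_elo": 0.35,
    "judge_likert": 0.25,
    "rubric": 0.20,
    "flaw_hunter": 0.15,
    "behavioral": 0.05,
}

def z_scores(raw):
    """Turn {model: value or None} into {model: z or None}, using only present values."""
    present = [v for v in raw.values() if v is not None]
    if not present:
        return {m: None for m in raw}
    mu, sigma = mean(present), pstdev(present)
    return {
        m: None if v is None else (0.0 if sigma == 0 else (v - mu) / sigma)
        for m, v in raw.items()
    }

def composite(axes, scale=20.0):
    """axes: {axis: {model: raw value or None}}, one entry per key in WEIGHTS.

    scale is an assumed points-per-standard-deviation factor for the 0-100 map.
    """
    z = {axis: z_scores(axes[axis]) for axis in WEIGHTS}
    models = list(next(iter(axes.values())))
    scores = {}
    for m in models:
        # A missing axis contributes z = 0, which is the pool mean.
        blend = sum(
            w * (z[a][m] if z[a][m] is not None else 0.0)
            for a, w in WEIGHTS.items()
        )
        scores[m] = max(0.0, min(100.0, 50.0 + scale * blend))
    return scores
```

Under these assumptions, a model that sits exactly at the pool mean on every axis scores 50, and a model with one missing axis is neither rewarded nor penalized on that axis.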
▌ How to read the table
Higher means better performance across all axes. The score is not an ELO; it is a normalized blend, so a model at 50 sits at the pool's average across all five axes. Compare each row's composite rank against its multi-turn-only rank in the cross-test grid: big gaps surface 'great vibe, dirty prose' cases, or the reverse.
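The composite-versus-multi-turn comparison can be sketched as a simple rank difference. The rank pairs below are copied from the ★ and MT entries of the table; the function names and the `threshold` of 5 places are hypothetical choices for illustration.

```python
# (composite rank, multi-turn rank) for a few rows of the table.
RANKS = {
    "Claude Opus 4.7": (1, 1),
    "GLM 4.7": (2, 8),
    "DeepSeek v4 Pro": (7, 2),
    "Gemini 3.1 Flash Lite": (13, 3),
}

def rank_gaps(ranks):
    """Return {model: multi-turn rank minus composite rank}.

    Positive: the composite ranks the model higher than raw multi-turn votes do.
    Negative: voters liked it more than the all-axis blend does.
    """
    return {model: mt - comp for model, (comp, mt) in ranks.items()}

def biggest_gaps(ranks, threshold=5):
    """List (model, gap) pairs whose absolute gap meets the threshold, largest first."""
    flagged = [(m, g) for m, g in rank_gaps(ranks).items() if abs(g) >= threshold]
    return sorted(flagged, key=lambda item: -abs(item[1]))
```

With these rows, GLM 4.7 comes out at +6 and Gemini 3.1 Flash Lite at −10, which matches the Climber and Diver figures in the Movers section.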
| № | Spread (best→worst rank) | Model · Verdict | Score /100 | SFW | NSFW | Engage (judge /5) | ★ comp | E elo | MT m-turn | RU rub | AD adv | $ cost | Votes (R01+R02) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 1→19 | Claude Opus 4.7 [Champion] (Anthropic · proprietary · 200K). Top of the multi-turn pool. Top-1 on agency respect and instruction drift. | 97.6 | — | — | 4.05 | 1 | — | 1 | 1 | 1 | 19 | 142 (R1 0 · R2 142) |
| II | 2→11 | GLM 4.7 (Z.AI · open · 128K). Mid-pack across the board. No standout strength. | 92.9 | 46% | 49% | 4.07 | 2 | 9 | 8 | 9 | 7 | 11 | 504 (R1 285 · R2 219) |
| III | 2→20 | Claude Opus 4.6 [Reliable] (Anthropic · proprietary · 200K). Reliability runner-up. Top-2 on agency, complete failure-mode coverage. | 88.1 | — | — | 4.10 | 3 | — | 4 | 2 | 2 | 20 | 215 (R1 0 · R2 215) |
| IV | 3→17 | Claude Sonnet 4.5 [Reliable] (Anthropic · proprietary · 200K). Round 01 reliability leader. Tied #1 on context attention. | 83.3 | 51% | 51% | 4.06 | 4 | 5 | 12 | 4 | 3 | 17 | 409 (R1 194 · R2 215) |
| V | 5→16 | GPT-4.1 (OpenAI · proprietary · 1M). Community last in Round 01, top-5 in Round 02 multi-turn. The great inversion. | 78.6 | 43% | 46% | 4.06 | 5 | 11 | 5 | 10 | 6 | 16 | 424 (R1 215 · R2 209) |
| VI | 3→14 | DeepSeek v3.2 (DeepSeek · open · 128K). Reliable, NSFW-shy at 30%. Strong lore retention. | 73.8 | 51% | 30% | 4.05 | 6 | 7 | 14 | 7 | 4 | 3 | 463 (R1 241 · R2 222) |
| VII | 2→15 | DeepSeek v4 Pro (DeepSeek · open · 128K). Strong tone consistency at fraction of Opus pricing. | 69.0 | — | — | 4.02 | 7 | — | 2 | 3 | 5 | 15 | 157 (R1 0 · R2 157) |
| VIII | 2→11 | DeepSeek v4 Flash [Cheap] (DeepSeek · open · 128K). Cheapest tier, top flaw-hunter score. Multi-turn ELO drags it down. | 64.3 | — | — | 3.94 | 8 | — | 11 | 8 | 10 | 2 | 151 (R1 0 · R2 151) |
| IX | 5→13 | Kimi K2.5 (Moonshot · open · 128K). Strong on tone consistency. Slow generation. | 59.5 | — | — | 4.23 | 9 | — | 10 | 5 | 8 | 13 | 150 (R1 0 · R2 150) |
| X | 4→13 | MiniMax M2.7 (MiniMax · proprietary · 200K). Strong narrative push. Fragile under adversarial pressure. | 54.8 | 54% | 45% | 3.62 | 10 | 4 | 13 | 11 | 11 | 9 | 597 (R1 393 · R2 204) |
| XI | 2→16 | Mistral Small Creative [NSFW] (Mistral · open · local-friendly · 32K). NSFW specialist. Fastest in the field. Drifts on long sessions. | 50.0 | 51% | 67% | 4.14 | 11 | 2 | 6 | 16 | 15 | 8 | 867 (R1 646 · R2 221) |
| XII | 9→18 | Kimi K2.6 [⚠ Floor] (Moonshot · open · 128K). Top-2 on flaw hunter. Catastrophic agency floor on bait scenes. | 45.2 | — | — | 4.19 | 12 | — | 9 | 18 | 17 | 12 | 152 (R1 0 · R2 152) |
| XIII | 1→16 | Gemini 3.1 Flash Lite [Cheap] (Google · proprietary · 1M). Cheapest tier with Round 02 top-5 multi-turn ELO. | 40.5 | — | — | 4.02 | 13 | — | 3 | 14 | 16 | 1 | 146 (R1 0 · R2 146) |
| XIV | 7→18 | Gemini 3.1 Pro (Google · proprietary · 1M). Deep context window, brittle on adversarial probes. | 35.7 | — | — | 4.11 | 14 | — | 7 | 12 | 13 | 18 | 158 (R1 0 · R2 158) |
| XV | 1→15 | Gemma 4 26B [Round 01 #1] (Google · open · local-friendly · 8K). Round 01 champion, mid-pack on multi-turn. The cheap local-friendly hold. | 31.0 | 55% | 51% | 3.79 | 15 | 1 | 15 | 15 | 12 | 7 | 509 (R1 302 · R2 207) |
| XVI | 6→17 | GLM 5.1 (Z.AI · open · 128K). Strong on tone consistency, weak on multi-turn engagement. | 26.2 | — | — | 4.12 | 16 | — | 17 | 6 | 9 | 14 | 149 (R1 0 · R2 149) |
| XVII | 13→17 | DeepSeek R1 0528 (DeepSeek · open · 164K). 2025-vintage reasoner. Clean prose metrics, weak instruction-keeping. No arena votes yet. | 21.4 | — | — | — | 17 | — | — | 13 | 14 | — | — |
| XVIII | 3→19 | Gemini 2.5 Flash (Google · proprietary · 1M). Round 01 top-3, dropped to bottom of Round 02 multi-turn. | 16.7 | 53% | 54% | 3.84 | 18 | 3 | 19 | 19 | 18 | 10 | 442 (R1 241 · R2 201) |
| XIX | 4→20 | Grok 4.1 (xAI · proprietary · 128K). Personality up front. Drifts fast under pressure. | 11.9 | 50% | 52% | 4.04 | 19 | 6 | 18 | 17 | 20 | 4 | 541 (R1 322 · R2 219) |
| XX | 6→20 | Qwen 3.5 Flash [⚠ Floor] (Alibaba · open · local-friendly · 128K). Floor on agency and instruction drift. Caveat emptor. | 7.1 | 48% | 42% | 3.92 | 20 | 8 | 20 | 20 | 19 | 6 | 608 (R1 401 · R2 207) |
| XXI | 5→21 | Llama 4 Maverick (Meta · open · 128K). Last on every reliability mode. Open-source completist only. | 2.4 | 47% | 34% | 3.59 | 21 | 10 | 16 | 21 | 21 | 5 | 684 (R1 474 · R2 210) |
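The Spread column is consistent with the best and worst rank a model holds across the six test ranks (★, E, MT, RU, AD, $), with missing entries skipped. That reading is inferred from the rows, not stated on the page, so treat this sketch as an assumption; the function name is hypothetical.

```python
def spread(ranks):
    """Format the best and worst rank among the present entries, e.g. '3→14'.

    ranks: the six test ranks for one model, with None where the table shows a dash.
    """
    present = [r for r in ranks if r is not None]
    return f"{min(present)}→{max(present)}"
```

For example, DeepSeek v3.2's ranks of 6, 7, 14, 7, 4, 3 give a best rank of 3 and a worst of 14, which matches its Spread entry.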
▌ Movers This Round
▲ Climber · +6 → composite #2
GLM 4.7
"Mid-pack on raw multi-turn votes (ELO #8), yet vaults to composite #2 — the strongest open-weight blend of rubric, judge, and reliability in the pool"
═ Held · ═ composite #1
Claude Opus 4.7
"Holds the composite crown and retook #1 on raw multi-turn votes from DeepSeek v4 Pro in the June regen — top of both views"
▼ Diver · −10 → composite #13
Gemini 3.1 Flash Lite
"Multi-turn ELO #3 (cheap and fast, and voters loved it) but bottom-quartile on flaw hunter and behavioral metrics"
▌ Coverage
1,857 total votes (R01)
271 pairs · median 7 votes/pair
75% catch-pair · n=335
47% judge–human disagreement
Next issue · 05-15-2026