The Standings. Round 01 · April 2026
▌ At a glance
2,538 votes · 20 models · 469 voters
R01: 1,857 · R02: 507 in flight
75% catch-pair · CC-BY 4.0
▌ Composite · how it works
rp-benchmark's composite score is the canonical 'best model overall' answer. Each model's normalized z-score on each of the five axes is weighted, and the weighted sum is mapped onto 0–100 for readability.
▌ How it's scored
Multi-turn arena ELO (35%), LLM-judge Likert mean (25%), rubric overall (20%), flaw hunter (15%), behavioral metrics (5%). Models missing one of the five axes are still ranked; the missing component is treated as the pool mean so a partial entry doesn't get an artificial bump or drop.
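The blend described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not rp-benchmark's actual code: the axis keys, the function names, the use of the population standard deviation, and especially the mapping onto 0–100 (a normal CDF, picked only because it puts the pool average at 50) are all hypothetical. Only the five weights and the "missing axis counts as the pool mean" rule come from the text.

```python
from math import erf, sqrt
from statistics import mean, pstdev

# Weights from the scoring description; the key names are hypothetical.
WEIGHTS = {
    "mt_elo": 0.35,
    "judge_likert": 0.25,
    "rubric": 0.20,
    "flaw_hunter": 0.15,
    "behavioral": 0.05,
}

def z_scores(raw):
    """Map {model: value or None} for one axis to {model: z-score}.

    A missing value (None) becomes z = 0.0, i.e. the pool mean,
    so it neither helps nor hurts the model.
    """
    present = [v for v in raw.values() if v is not None]
    mu = mean(present)
    sigma = pstdev(present) or 1.0  # avoid dividing by zero on a flat axis
    return {m: 0.0 if v is None else (v - mu) / sigma for m, v in raw.items()}

def composite(axes):
    """Map {axis: {model: value or None}} to {model: score on 0-100}."""
    z = {axis: z_scores(values) for axis, values in axes.items()}
    models = next(iter(axes.values())).keys()
    scores = {}
    for m in models:
        blend = sum(w * z[axis][m] for axis, w in WEIGHTS.items())
        # Assumed mapping: standard normal CDF scaled to 0-100 (blend 0 -> 50).
        scores[m] = 100 * 0.5 * (1 + erf(blend / sqrt(2)))
    return scores

# Toy pool with made-up values: model_b sits at the mean of every axis.
toy = {axis: {"model_a": 3.0, "model_b": 2.0, "model_c": 1.0} for axis in WEIGHTS}
print(composite(toy)["model_b"])  # 50.0
```

Because a missing axis contributes a z-score of zero, a partial entry is scored only on the axes it actually has, which is the "no artificial bump or drop" behavior described above.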
▌ How to read the table
Higher = better all-axis performance. The score isn't an ELO; it's a normalized blend, so a model at 50 sits at the pool's average across all five axes. Compare each row's composite rank against its multi-turn-only rank in the cross-test grid: big gaps surface 'great vibe, dirty prose' or vice versa.
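The rank comparison is simple arithmetic, shown here as a small Python sketch. The rank pairs are copied from the ★ (composite) and MT (multi-turn) columns of the table; the function name and the sign convention (positive means the composite lifts the model above its multi-turn rank) are assumptions made for illustration.

```python
# (composite rank, multi-turn rank), copied from the ★ and MT columns.
RANKS = {
    "Claude Opus 4.7": (1, 1),
    "DeepSeek v3.2": (4, 12),
    "Gemini 3.1 Flash Lite": (13, 4),
}

def rank_gap(composite_rank, mt_rank):
    """Places gained (positive) or lost (negative) going from
    the multi-turn-only ranking to the composite ranking."""
    return mt_rank - composite_rank

for model, (comp, mt) in RANKS.items():
    print(f"{model}: {rank_gap(comp, mt):+d}")
```

Under this convention DeepSeek v3.2 comes out at +8 and Gemini 3.1 Flash Lite at -9, matching the climber and diver figures in Movers This Round.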
| № | Spread | Model · Verdict | Score (/100) | SFW | NSFW | Engage (judge, /5) | ★ comp | E elo | MT m-turn | RU rub | AD adv | $ cost | Votes (R01+R02) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 1→19 | Claude Opus 4.7 (Champion). Anthropic · proprietary · 200K. Top of the multi-turn pool. Top-1 on agency respect and instruction drift. | 97.6 | — | — | 4.05 | 1 | — | 1 | 1 | 1 | 19 | 26 (R1 0 · R2 26) |
| II | 2→17 | Claude Sonnet 4.5 (Reliable). Anthropic · proprietary · 200K. Round 01 reliability leader. Tied #1 on context attention. | 92.9 | 51% | 51% | 4.06 | 2 | 5 | 6 | 4 | 3 | 17 | 242 (R1 194 · R2 48) |
| III | 2→20 | Claude Opus 4.6 (Reliable). Anthropic · proprietary · 200K. Reliability runner-up. Top-2 on agency, complete failure-mode coverage. | 88.1 | — | — | 4.10 | 3 | — | 2 | 2 | 2 | 20 | 40 (R1 0 · R2 40) |
| IV | 3→12 | DeepSeek v3.2. DeepSeek · open · 128K. Reliable, NSFW-shy at 30%. Strong lore retention. | 83.3 | 51% | 30% | 4.05 | 4 | 7 | 12 | 7 | 4 | 3 | 291 (R1 241 · R2 50) |
| V | 5→16 | GPT-4.1. OpenAI · proprietary · 1M. Community last in Round 01, top-5 in Round 02 multi-turn. The great inversion. | 78.6 | 43% | 46% | 4.06 | 5 | 11 | 5 | 10 | 6 | 16 | 267 (R1 215 · R2 52) |
| VI | 6→11 | GLM 4.7. Z.AI · open · 128K. Mid-pack across the board. No standout strength. | 73.8 | 46% | 49% | 4.07 | 6 | 9 | 10 | 9 | 7 | 11 | 335 (R1 285 · R2 50) |
| VII | 3→15 | DeepSeek v4 Pro. DeepSeek · open · 128K. Strong tone consistency at a fraction of Opus pricing. | 69.0 | — | — | 4.02 | 7 | — | 3 | 3 | 5 | 15 | 31 (R1 0 · R2 31) |
| VIII | 1→14 | Gemma 4 26B (Round 01 #1). Google · open · local-friendly · 8K. Round 01 champion, mid-pack on multi-turn. The cheap local-friendly hold. | 64.3 | 55% | 51% | 3.79 | 8 | 1 | 9 | 14 | 12 | 7 | 344 (R1 302 · R2 42) |
| IX | 5→13 | Kimi K2.5. Moonshot · open · 128K. Strong on tone consistency. Slow generation. | 59.5 | — | — | 4.23 | 9 | — | 11 | 5 | 8 | 13 | 37 (R1 0 · R2 37) |
| X | 8→17 | Kimi K2.6 (⚠ Floor). Moonshot · open · 128K. Top-2 on flaw hunter. Catastrophic agency floor on bait scenes. | 54.8 | — | — | 4.19 | 10 | — | 8 | 17 | 16 | 12 | 35 (R1 0 · R2 35) |
| XI | 2→15 | Mistral SC (NSFW). Mistral · open · local-friendly · 32K. NSFW specialist. Fastest in the field. Drifts on long sessions. | 50.0 | 51% | 67% | 4.14 | 11 | 2 | 7 | 15 | 14 | 8 | 707 (R1 646 · R2 61) |
| XII | 2→16 | DeepSeek v4 Flash (Cheap). DeepSeek · open · 128K. Cheapest tier, top flaw-hunter score. Multi-turn ELO drags it down. | 45.2 | — | — | 3.94 | 12 | — | 16 | 8 | 11 | 2 | 26 (R1 0 · R2 26) |
| XIII | 1→15 | Gemini 3.1 Flash Lite (Cheap). Google · proprietary · 1M. Cheapest tier with Round 02 top-5 multi-turn ELO. | 40.5 | — | — | 4.02 | 13 | — | 4 | 13 | 15 | 1 | 28 (R1 0 · R2 28) |
| XIV | 4→17 | MiniMax M2.7. MiniMax · proprietary · 200K. Strong narrative push. Fragile under adversarial pressure. | 35.7 | 54% | 45% | 3.62 | 14 | 4 | 17 | 11 | 10 | 9 | 436 (R1 393 · R2 43) |
| XV | 4→19 | Grok 4.1. xAI · proprietary · 128K. Personality up front. Drifts fast under pressure. | 26.2 | 50% | 52% | 4.04 | 15 | 6 | 13 | 16 | 19 | 4 | 373 (R1 322 · R2 51) |
| XVI | 12→18 | Gemini 3.1 Pro. Google · proprietary · 1M. Deep context window, brittle on adversarial probes. | 21.4 | — | — | 4.11 | 16 | — | 14 | 12 | 13 | 18 | 41 (R1 0 · R2 41) |
| XVII | 6→19 | GLM 5.1. Z.AI · open · 128K. Strong on tone consistency, weak on multi-turn engagement. | 16.7 | — | — | 4.12 | 17 | — | 19 | 6 | 9 | 14 | 30 (R1 0 · R2 30) |
| XVIII | 3→20 | Gemini 2.5 Flash. Google · proprietary · 1M. Round 01 top-3, dropped to bottom of Round 02 multi-turn. | 11.9 | 53% | 54% | 3.84 | 18 | 3 | 20 | 18 | 17 | 10 | 294 (R1 241 · R2 53) |
| XIX | 6→19 | Qwen 3.5 Flash (⚠ Floor). Alibaba · open · local-friendly · 128K. Floor on agency and instruction drift. Caveat emptor. | 7.1 | 48% | 42% | 3.92 | 19 | 8 | 18 | 19 | 18 | 6 | 460 (R1 401 · R2 59) |
| XX | 5→20 | Llama 4 Maverick. Meta · open · 128K. Last on every reliability mode. Open-source completist only. | 2.4 | 47% | 34% | 3.59 | 20 | 10 | 15 | 20 | 20 | 5 | 539 (R1 474 · R2 65) |
▌ Movers This Round
▲ Climber · +8 → composite #4
DeepSeek v3.2
"Multi-turn ELO #12 by engagement alone — tops the open-weight pool when rubric, flaw-hunter, and behavior get weighted in"
═ Held · ═ #1
Claude Opus 4.7
"Top of both multi-turn ELO and the composite blend"
▼ Diver · −9 → composite #13
Gemini 3.1 Flash Lite
"Multi-turn ELO #4 (cheap + fast voters loved it) but bottom-quartile on flaw-hunter + behavioral"
▌ Coverage
1,857 total votes
271 pairs · median 7 votes/pair
75% catch-pair · n=335
47% judge–human disagreement
Next issue · 05-15-2026