The judges have spoken. The crowd hasn't. Two LLM judges ranked every Round 03 session before a single human vote was cast; here's where they landed.
Judge-scored only: no human arena votes yet. These rankings may invert once votes accumulate (Round 02 showed that single-turn and multi-turn rankings can invert). Scores compress hard across the top; the robust signal is the gap to the bottom, where the RP-specialist finetunes (the models marketed for exactly this) collapse. Help settle it: the human arena is open now on both tracks. Click any model for its full judge profile: 11-axis rubric, quality trajectory, flaw-hunter, and behavioral.
Multi-Turn Arena · New Pool
| # | Model | Craft | Agency | Consist. | Moment. | n |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 4.57 | 4.87 | 4.75 | 4.66 | 20 |
| 2 | Claude Opus 4.8 | 4.56 | 4.87 | 4.78 | 4.67 | 20 |
| 3 | MiniMax M3 | 4.46 | 4.80 | 4.71 | 4.61 | 20 |
| 4 | Owl Alpha | 4.46 | 4.71 | 4.74 | 4.67 | 19 |
| 5 | GPT-5.5 | 4.41 | 4.75 | 4.68 | 4.55 | 20 |
| 6 | MiMo 2.5 Pro | 4.40 | 4.72 | 4.67 | 4.55 | 20 |
| 7 | Gemini 3.5 Flash | 4.38 | 4.79 | 4.65 | 4.53 | 20 |
| 8 | Mistral Small 2603 | 4.29 | 4.80 | 4.58 | 4.41 | 20 |
| 9 | Gemma 4 31B | 4.29 | 4.70 | 4.59 | 4.44 | 20 |
| 10 | Qwen 3.7 Max | 4.28 | 4.75 | 4.54 | 4.42 | 20 |
| 11 | Qwen 3.6 27B | 4.25 | 4.75 | 4.51 | 4.33 | 19 |
| 12 | Qwen 3.6 35B A3B | 4.17 | 4.86 | 4.51 | 4.22 | 20 |
| 13 | DeepSeek V3 0324 | 3.90 | 4.40 | 3.93 | 4.16 | 20 |
| 14 | Grok 4.3 | 3.83 | 4.82 | 4.23 | 3.87 | 20 |
| 15 | Cydonia 24B (finetune) | 3.63 | 4.39 | 3.92 | 3.95 | 20 |
| 16 | Lunaris 8B (finetune) | 3.56 | 4.37 | 3.83 | 3.88 | 20 |
| 17 | Magnum v4 72B (finetune) | 3.41 | 4.12 | 3.58 | 3.62 | 20 |
| 18 | UnslopNemo 12B (finetune) | 3.00 | 3.81 | 3.29 | 3.42 | 20 |
| 19 | Skyfall 36B (finetune) | 2.59 | 3.54 | 2.65 | 2.92 | 20 |
| 20 | Euryale L3.3 70B (finetune) | 2.37 | 3.32 | 2.31 | 2.63 | 15 |
| 21 | Rocinante 12B (finetune) | 2.31 | 2.80 | 2.25 | 2.77 | 20 |
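The compression at the top and the gap to the bottom can be checked against the Craft column above. This is a minimal sketch with the scores copied from the table; the `general` / `finetune` split and the variable names are illustrative assumptions, not part of rp-benchmark.

```python
# Craft scores copied from the Multi-Turn table (ranks 1-14, then the
# seven finetune-tagged models at ranks 15-21).
general = [4.57, 4.56, 4.46, 4.46, 4.41, 4.40, 4.38,
           4.29, 4.29, 4.28, 4.25, 4.17, 3.90, 3.83]
finetune = [3.63, 3.56, 3.41, 3.00, 2.59, 2.37, 2.31]

# Spread across the top twelve models.
top12 = general[:12]
top_spread = round(max(top12) - min(top12), 2)

# Distance from rank 1 to rank 21.
gap_to_bottom = round(max(general) - min(finetune), 2)

print(top_spread)     # 0.4
print(gap_to_bottom)  # 2.26
```

Twelve models sit within 0.40 Craft points of each other, while first and last place are 2.26 points apart, so ordering inside the top group is far less certain than the separation from the finetunes.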
NSFW Arena · After Dark (18+)
Two axes, reported separately: craft (quality) and willingness (refusal %, i.e. how often a judge flagged a mid-scene refusal against the scene's explicit direction). A model can be high-craft yet refusal-prone, or fully willing yet low-craft.
| # | Model | Craft | R1 | Pacing | Anatomy | Consent | Refusal | n |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 4.65 | 4.96 | 4.84 | 4.81 | 4.95 | 5% | 20 |
| 2 | Claude Opus 4.6 | 4.63 | 5.00 | 4.88 | 4.81 | 4.93 | 0% | 19 |
| 3 | Claude Opus 4.7 | 4.63 | 4.99 | 4.87 | 4.84 | 4.90 | 0% | 20 |
| 4 | DeepSeek V4 Pro | 4.62 | 4.98 | 4.83 | 4.81 | 4.96 | 0% | 20 |
| 5 | GPT-5.5 | 4.62 | 5.00 | 4.85 | 4.82 | 4.96 | 0% | 20 |
| 6 | Claude Sonnet 4.6 | 4.60 | 4.90 | 4.68 | 4.80 | 4.94 | 10% | 20 |
| 7 | Owl Alpha | 4.59 | 4.99 | 4.85 | 4.81 | 4.93 | 0% | 20 |
| 8 | MiMo 2.5 Pro | 4.58 | 4.99 | 4.85 | 4.78 | 4.96 | 0% | 20 |
| 9 | MiniMax M3 | 4.58 | 4.89 | 4.72 | 4.76 | 4.84 | 5% | 20 |
| 10 | MiniMax M2.7 | 4.53 | 4.99 | 4.75 | 4.76 | 4.94 | 5% | 20 |
| 11 | GPT-4.1 | 4.52 | 4.99 | 4.83 | 4.70 | 4.97 | 0% | 19 |
| 12 | DeepSeek V3.2 | 4.50 | 4.99 | 4.84 | 4.72 | 4.96 | 0% | 20 |
| 13 | Claude Sonnet 4.5 | 4.50 | 4.98 | 4.84 | 4.71 | 4.91 | 0% | 20 |
| 14 | Gemini 3.5 Flash | 4.50 | 4.97 | 4.84 | 4.75 | 4.96 | 0% | 20 |
| 15 | Kimi K2.5 | 4.50 | 4.98 | 4.82 | 4.74 | 4.91 | 0% | 20 |
| 16 | GLM 5.1 | 4.50 | 4.96 | 4.83 | 4.70 | 4.94 | 10% | 20 |
| 17 | Kimi K2.6 | 4.49 | 4.97 | 4.82 | 4.75 | 4.94 | 0% | 19 |
| 18 | Qwen 3.6 35B A3B | 4.47 | 4.94 | 4.80 | 4.74 | 4.97 | 0% | 20 |
| 19 | Gemini 3.1 Pro | 4.46 | 4.99 | 4.81 | 4.68 | 4.97 | 0% | 20 |
| 20 | Gemma 4 31B | 4.46 | 4.99 | 4.81 | 4.65 | 4.95 | 0% | 20 |
| 21 | Qwen 3.7 Max | 4.46 | 4.97 | 4.79 | 4.71 | 4.92 | 0% | 20 |
| 22 | DeepSeek V4 Flash | 4.45 | 4.98 | 4.78 | 4.71 | 4.97 | 0% | 20 |
| 23 | Qwen 3.6 27B | 4.45 | 4.96 | 4.81 | 4.72 | 4.95 | 0% | 20 |
| 24 | DeepSeek R1 0528 | 4.43 | 4.98 | 4.80 | 4.67 | 4.90 | 0% | 20 |
| 25 | GLM 4.7 | 4.43 | 4.99 | 4.80 | 4.66 | 4.95 | 0% | 19 |
| 26 | Gemini 3.1 Flash Lite | 4.39 | 4.98 | 4.78 | 4.59 | 4.95 | 0% | 20 |
| 27 | Gemma 4 26B | 4.38 | 4.97 | 4.77 | 4.61 | 4.95 | 0% | 19 |
| 28 | Qwen 3.5 Flash | 4.37 | 4.98 | 4.75 | 4.67 | 4.96 | 0% | 19 |
| 29 | DeepSeek V3 0324 | 4.37 | 4.90 | 4.75 | 4.64 | 4.92 | 0% | 20 |
| 30 | Gemini 2.5 Flash | 4.32 | 4.92 | 4.67 | 4.66 | 4.99 | 0% | 20 |
| 31 | Llama 4 Maverick | 4.26 | 4.89 | 4.67 | 4.59 | 4.98 | 0% | 20 |
| 32 | Mistral Small 2603 | 4.24 | 4.89 | 4.72 | 4.63 | 4.81 | 0% | 20 |
| 33 | Grok 4.3 | 4.21 | 4.88 | 4.75 | 4.65 | 4.96 | 0% | 20 |
| 34 | Lunaris 8B (finetune) | 3.83 | 4.86 | 4.46 | 4.35 | 4.50 | 0% | 20 |
| 35 | Cydonia 24B (finetune) | 3.65 | 4.65 | 4.37 | 4.47 | 4.58 | 0% | 20 |
| 36 | Magnum v4 72B (finetune) | 3.23 | 3.98 | 3.83 | 4.00 | 4.28 | 0% | 19 |
| 37 | UnslopNemo 12B (finetune) | 2.61 | 3.88 | 3.57 | 4.14 | 4.12 | 0% | 20 |
| 38 | Skyfall 36B (finetune) | 2.52 | 3.83 | 3.41 | 3.95 | 3.95 | 0% | 20 |
| 39 | Rocinante 12B (finetune) | 2.49 | 3.66 | 3.52 | 3.91 | 4.05 | 0% | 19 |
| 40 | Euryale L3.3 70B (finetune) | 2.12 | 2.85 | 2.79 | 3.54 | 3.26 | 0% | 15 |
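To make the Refusal column concrete, here is a minimal sketch of how a refusal percentage could be derived from per-session judge flags. It assumes one boolean flag per session. The function name `refusal_rate` and the flag list are hypothetical, and the actual rp-benchmark aggregation may differ.

```python
def refusal_rate(flags):
    """Percentage of sessions a judge flagged as a mid-scene refusal.

    `flags` holds one boolean per judged session (an assumption; the
    real pipeline may aggregate differently).
    """
    if not flags:
        raise ValueError("need at least one judged session")
    return 100 * sum(flags) / len(flags)


# Hypothetical example: 1 flagged session out of n = 20.
flags = [True] + [False] * 19
print(f"{refusal_rate(flags):.0f}%")  # 5%
```

Under this assumption, with n = 20 each flagged session moves the rate by 5 points, so the 5% and 10% entries would correspond to one and two flagged sessions.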
These standings are an LLM-judge snapshot. The human arena is open on both tracks — as votes accumulate, the human ranking takes shape on the leaderboard, and Round 02 already showed it can invert the machines. Cast a few and help settle it.
Judges: Claude Sonnet (both tracks) + DeepSeek R1 (NSFW). Ranked by the Sonnet overall score, which discriminates between models; R1 scores hit a ceiling near 5.0. Source: rp-benchmark · regenerate with scripts/build-round3-judge-preview.py. No human arena votes are included in these standings.