PlotPoints Round 03 · Preview
Round 03 · Preliminary · Judge-Scored

The judges have spoken. The crowd hasn't. Two LLM judges ranked every Round 03 session before a single human vote was cast. Here's where they landed.

Provisional — read before you trust the numbers

These rankings are judge-scored only; no human arena votes are included yet. The human arena is open now on both tracks, and the rankings may invert once votes accumulate (Round 02 showed that single-turn and multi-turn rankings can invert). Scores compress hard across the top, so the robust signal is the gap to the bottom, where the RP-specialist finetunes (the models marketed for exactly this) collapse. Click any model for its full judge profile: 11-axis rubric, quality trajectory, flaw-hunter, and behavioral.

Multi-Turn Arena · New Pool

21 models · 413 sessions · Judge: Claude Sonnet · Ranked by Sonnet craft overall (1–5)
 #  Model                        Craft  Agency  Consist.  Moment.  n
 1  Claude Sonnet 4.6            4.57   4.87    4.75      4.66     20
 2  Claude Opus 4.8              4.56   4.87    4.78      4.67     20
 3  MiniMax M3                   4.46   4.80    4.71      4.61     20
 4  Owl Alpha                    4.46   4.71    4.74      4.67     19
 5  GPT-5.5                      4.41   4.75    4.68      4.55     20
 6  MiMo 2.5 Pro                 4.40   4.72    4.67      4.55     20
 7  Gemini 3.5 Flash             4.38   4.79    4.65      4.53     20
 8  Mistral Small 2603           4.29   4.80    4.58      4.41     20
 9  Gemma 4 31B                  4.29   4.70    4.59      4.44     20
10  Qwen 3.7 Max                 4.28   4.75    4.54      4.42     20
11  Qwen 3.6 27B                 4.25   4.75    4.51      4.33     19
12  Qwen 3.6 35B A3B             4.17   4.86    4.51      4.22     20
13  DeepSeek V3 0324             3.90   4.40    3.93      4.16     20
14  Grok 4.3                     3.83   4.82    4.23      3.87     20
15  Cydonia 24B (finetune)       3.63   4.39    3.92      3.95     20
16  Lunaris 8B (finetune)        3.56   4.37    3.83      3.88     20
17  Magnum v4 72B (finetune)     3.41   4.12    3.58      3.62     20
18  UnslopNemo 12B (finetune)    3.00   3.81    3.29      3.42     20
19  Skyfall 36B (finetune)       2.59   3.54    2.65      2.92     20
20  Euryale L3.3 70B (finetune)  2.37   3.32    2.31      2.63     15
21  Rocinante 12B (finetune)     2.31   2.80    2.25      2.77     20
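
To make the "compressed top, big gap to the bottom" reading concrete, here is a minimal sketch that compares the Sonnet craft column for the finetune-tagged models against the rest of the Multi-Turn pool. The scores are copied from the table above; the grouping, the unweighted means (not weighted by n), and the helper name `group_gap` are assumptions made for illustration, not part of the rp-benchmark tooling.

```python
from statistics import mean

# Sonnet craft scores copied from the Multi-Turn table, in rank order.
general = [4.57, 4.56, 4.46, 4.46, 4.41, 4.40, 4.38,
           4.29, 4.29, 4.28, 4.25, 4.17, 3.90, 3.83]
finetunes = [3.63, 3.56, 3.41, 3.00, 2.59, 2.37, 2.31]


def group_gap(a, b):
    """Return (mean of a, mean of b, difference), each rounded to 2 places.

    Hypothetical helper: means are unweighted, so a model with n=15
    counts the same as a model with n=20.
    """
    mean_a, mean_b = mean(a), mean(b)
    return round(mean_a, 2), round(mean_b, 2), round(mean_a - mean_b, 2)


general_mean, finetune_mean, gap = group_gap(general, finetunes)

# Spread between the best and worst non-finetune model, for comparison.
general_spread = round(max(general) - min(general), 2)

print(f"general mean:   {general_mean}")
print(f"finetune mean:  {finetune_mean}")
print(f"gap:            {gap}")
print(f"general spread: {general_spread}")
```

On these numbers the gap between the two group means (about 1.32) is larger than the entire spread among the 14 non-finetune models (0.74), which is the pattern the provisional note describes.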

NSFW Arena · After Dark · 18+

40 models · 787 sessions · Judges: Claude Sonnet + DeepSeek R1 · Sonnet craft (1–5) · Willingness (refusal %)

Two axes, reported separately: craft (quality) and willingness (refusal %: how often a judge flagged a mid-scene refusal against the scene's explicit direction). A model can be high-craft yet refusal-prone, or fully willing yet low-craft.
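
As a rough illustration of the willingness axis, the sketch below turns per-session refusal flags into a percentage. It assumes the rate is simply flagged sessions divided by sessions judged (the n column); the page does not spell out the exact formula, and `refusal_pct` is a hypothetical helper, not a function from rp-benchmark.

```python
def refusal_pct(flags):
    """Percentage of sessions in which a judge flagged a mid-scene refusal.

    `flags` holds one boolean per judged session (True = refusal flagged).
    Assumption: the denominator is the number of judged sessions.
    """
    if not flags:
        raise ValueError("need at least one judged session")
    return 100.0 * sum(flags) / len(flags)


# Hypothetical flag lists for a model judged on 20 sessions.
one_flag = [True] + [False] * 19
two_flags = [True, True] + [False] * 18
no_flags = [False] * 20

print(refusal_pct(one_flag))   # 5.0
print(refusal_pct(two_flags))  # 10.0
print(refusal_pct(no_flags))   # 0.0
```

Under that assumption, a 5% entry at n = 20 corresponds to a single flagged session and 10% to two, so small differences in the refusal column rest on very few sessions.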

 #  Model                        Craft  R1    Pacing  Anatomy  Consent  Refusal  n
 1  Claude Opus 4.8              4.65   4.96  4.84    4.81     4.95     5%       20
 2  Claude Opus 4.6              4.63   5.00  4.88    4.81     4.93     0%       19
 3  Claude Opus 4.7              4.63   4.99  4.87    4.84     4.90     0%       20
 4  DeepSeek V4 Pro              4.62   4.98  4.83    4.81     4.96     0%       20
 5  GPT-5.5                      4.62   5.00  4.85    4.82     4.96     0%       20
 6  Claude Sonnet 4.6            4.60   4.90  4.68    4.80     4.94     10%      20
 7  Owl Alpha                    4.59   4.99  4.85    4.81     4.93     0%       20
 8  MiMo 2.5 Pro                 4.58   4.99  4.85    4.78     4.96     0%       20
 9  MiniMax M3                   4.58   4.89  4.72    4.76     4.84     5%       20
10  MiniMax M2.7                 4.53   4.99  4.75    4.76     4.94     5%       20
11  GPT-4.1                      4.52   4.99  4.83    4.70     4.97     0%       19
12  DeepSeek V3.2                4.50   4.99  4.84    4.72     4.96     0%       20
13  Claude Sonnet 4.5            4.50   4.98  4.84    4.71     4.91     0%       20
14  Gemini 3.5 Flash             4.50   4.97  4.84    4.75     4.96     0%       20
15  Kimi K2.5                    4.50   4.98  4.82    4.74     4.91     0%       20
16  GLM 5.1                      4.50   4.96  4.83    4.70     4.94     10%      20
17  Kimi K2.6                    4.49   4.97  4.82    4.75     4.94     0%       19
18  Qwen 3.6 35B A3B             4.47   4.94  4.80    4.74     4.97     0%       20
19  Gemini 3.1 Pro               4.46   4.99  4.81    4.68     4.97     0%       20
20  Gemma 4 31B                  4.46   4.99  4.81    4.65     4.95     0%       20
21  Qwen 3.7 Max                 4.46   4.97  4.79    4.71     4.92     0%       20
22  DeepSeek V4 Flash            4.45   4.98  4.78    4.71     4.97     0%       20
23  Qwen 3.6 27B                 4.45   4.96  4.81    4.72     4.95     0%       20
24  DeepSeek R1 0528             4.43   4.98  4.80    4.67     4.90     0%       20
25  GLM 4.7                      4.43   4.99  4.80    4.66     4.95     0%       19
26  Gemini 3.1 Flash Lite        4.39   4.98  4.78    4.59     4.95     0%       20
27  Gemma 4 26B                  4.38   4.97  4.77    4.61     4.95     0%       19
28  Qwen 3.5 Flash               4.37   4.98  4.75    4.67     4.96     0%       19
29  DeepSeek V3 0324             4.37   4.90  4.75    4.64     4.92     0%       20
30  Gemini 2.5 Flash             4.32   4.92  4.67    4.66     4.99     0%       20
31  Llama 4 Maverick             4.26   4.89  4.67    4.59     4.98     0%       20
32  Mistral Small 2603           4.24   4.89  4.72    4.63     4.81     0%       20
33  Grok 4.3                     4.21   4.88  4.75    4.65     4.96     0%       20
34  Lunaris 8B (finetune)        3.83   4.86  4.46    4.35     4.50     0%       20
35  Cydonia 24B (finetune)       3.65   4.65  4.37    4.47     4.58     0%       20
36  Magnum v4 72B (finetune)     3.23   3.98  3.83    4.00     4.28     0%       19
37  UnslopNemo 12B (finetune)    2.61   3.88  3.57    4.14     4.12     0%       20
38  Skyfall 36B (finetune)       2.52   3.83  3.41    3.95     3.95     0%       20
39  Rocinante 12B (finetune)     2.49   3.66  3.52    3.91     4.05     0%       19
40  Euryale L3.3 70B (finetune)  2.12   2.85  2.79    3.54     3.26     0%       15

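
The table is ranked by the Sonnet craft column rather than the R1 column. One way to see why is to compare how far each judge's scores spread across the 33 models that are not tagged as finetunes. The sketch below is only an illustration under that assumption: the (craft, R1) pairs are copied from the NSFW table, while the 4.95 threshold and the names `spread` and `near_ceiling` are hypothetical choices.

```python
# (Sonnet craft, R1) pairs for NSFW ranks 1-33, copied from the table.
rows = [
    (4.65, 4.96), (4.63, 5.00), (4.63, 4.99), (4.62, 4.98), (4.62, 5.00),
    (4.60, 4.90), (4.59, 4.99), (4.58, 4.99), (4.58, 4.89), (4.53, 4.99),
    (4.52, 4.99), (4.50, 4.99), (4.50, 4.98), (4.50, 4.97), (4.50, 4.98),
    (4.50, 4.96), (4.49, 4.97), (4.47, 4.94), (4.46, 4.99), (4.46, 4.99),
    (4.46, 4.97), (4.45, 4.98), (4.45, 4.96), (4.43, 4.98), (4.43, 4.99),
    (4.39, 4.98), (4.38, 4.97), (4.37, 4.98), (4.37, 4.90), (4.32, 4.92),
    (4.26, 4.89), (4.24, 4.89), (4.21, 4.88),
]


def spread(values):
    """Best-minus-worst score, rounded to 2 decimal places."""
    return round(max(values) - min(values), 2)


craft_scores = [craft for craft, _ in rows]
r1_scores = [r1 for _, r1 in rows]

craft_spread = spread(craft_scores)
r1_spread = spread(r1_scores)

# How many models R1 scores at 4.95 or above (hypothetical threshold).
near_ceiling = sum(1 for score in r1_scores if score >= 4.95)

print(f"Sonnet craft spread: {craft_spread}")
print(f"R1 spread:           {r1_spread}")
print(f"R1 >= 4.95:          {near_ceiling} of {len(rows)}")
```

Across these rows the Sonnet craft scores span 0.44 points while the R1 scores span 0.12, with 25 of the 33 models at 4.95 or higher on R1, so the R1 column leaves little room to separate models.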
The judges said their piece. The crowd hasn't.

These standings are an LLM-judge snapshot. The human arena is open on both tracks — as votes accumulate, the human ranking takes shape on the leaderboard, and Round 02 already showed it can invert the machines. Cast a few and help settle it.


Judges: Claude Sonnet (both tracks) and DeepSeek R1 (NSFW only). Ranked by the Sonnet overall craft score, which discriminates between models; R1 scores sit near the 5.0 ceiling. Source: rp-benchmark · regenerate with scripts/build-round3-judge-preview.py. No human arena votes are included in these standings.