The judges have spoken. The crowd hasn't. Two LLM judges ranked every Round 03 session before a single human vote was cast; here's where they landed.

Provisional — read before you trust the numbers

Judge-scored only: no human arena votes are included yet. The human arena is open now on both tracks, and these rankings may invert once votes accumulate (in Round 02, the single-turn and multi-turn rankings inverted). Scores compress hard across the top, so the robust signal is the gap to the bottom, where the RP-specialist finetunes (the models marketed for exactly this) collapse. Vote to help settle it. Click any model for its full judge profile: 11-axis rubric, quality trajectory, flaw-hunter, and behavioral.

Multi-Turn Arena · New Pool

21 models · 413 sessions · Claude Sonnet · Sonnet craft overall (1–5)

Cast a vote →

#	Model	Craft	Agency	Consist.	Moment.	n
1	Claude Sonnet 4.6	4.57	4.87	4.75	4.66	20
2	Claude Opus 4.8	4.56	4.87	4.78	4.67	20
3	MiniMax M3	4.46	4.80	4.71	4.61	20
4	Owl Alpha	4.46	4.71	4.74	4.67	19
5	GPT-5.5	4.41	4.75	4.68	4.55	20
6	MiMo 2.5 Pro	4.40	4.72	4.67	4.55	20
7	Gemini 3.5 Flash	4.38	4.79	4.65	4.53	20
8	Mistral Small 2603	4.29	4.80	4.58	4.41	20
9	Gemma 4 31B	4.29	4.70	4.59	4.44	20
10	Qwen 3.7 Max	4.28	4.75	4.54	4.42	20
11	Qwen 3.6 27B	4.25	4.75	4.51	4.33	19
12	Qwen 3.6 35B A3B	4.17	4.86	4.51	4.22	20
13	DeepSeek V3 0324	3.90	4.40	3.93	4.16	20
14	Grok 4.3	3.83	4.82	4.23	3.87	20
15	Cydonia 24B (finetune)	3.63	4.39	3.92	3.95	20
16	Lunaris 8B (finetune)	3.56	4.37	3.83	3.88	20
17	Magnum v4 72B (finetune)	3.41	4.12	3.58	3.62	20
18	UnslopNemo 12B (finetune)	3.00	3.81	3.29	3.42	20
19	Skyfall 36B (finetune)	2.59	3.54	2.65	2.92	20
20	Euryale L3.3 70B (finetune)	2.37	3.32	2.31	2.63	15
21	Rocinante 12B (finetune)	2.31	2.80	2.25	2.77	20
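
The claim that scores compress at the top while the finetunes fall away can be checked directly against the craft column. The following is a minimal sketch, not part of the rp-benchmark tooling: the `ROWS` list is hand-copied from the table above, the names `top_spread` and `group_gap` are illustrative, and the group means are unweighted (they ignore each model's n), which is an assumption made for simplicity.

```python
from statistics import mean

# (model, Sonnet craft, is_finetune), hand-copied from the Multi-Turn table.
ROWS = [
    ("Claude Sonnet 4.6", 4.57, False),
    ("Claude Opus 4.8", 4.56, False),
    ("MiniMax M3", 4.46, False),
    ("Owl Alpha", 4.46, False),
    ("GPT-5.5", 4.41, False),
    ("MiMo 2.5 Pro", 4.40, False),
    ("Gemini 3.5 Flash", 4.38, False),
    ("Mistral Small 2603", 4.29, False),
    ("Gemma 4 31B", 4.29, False),
    ("Qwen 3.7 Max", 4.28, False),
    ("Qwen 3.6 27B", 4.25, False),
    ("Qwen 3.6 35B A3B", 4.17, False),
    ("DeepSeek V3 0324", 3.90, False),
    ("Grok 4.3", 3.83, False),
    ("Cydonia 24B", 3.63, True),
    ("Lunaris 8B", 3.56, True),
    ("Magnum v4 72B", 3.41, True),
    ("UnslopNemo 12B", 3.00, True),
    ("Skyfall 36B", 2.59, True),
    ("Euryale L3.3 70B", 2.37, True),
    ("Rocinante 12B", 2.31, True),
]

general = [craft for _, craft, is_ft in ROWS if not is_ft]
finetunes = [craft for _, craft, is_ft in ROWS if is_ft]

# Spread across the twelve highest-scoring models (ranks 1-12 in the table).
top12 = sorted(general, reverse=True)[:12]
top_spread = round(max(top12) - min(top12), 2)

# Unweighted gap between the two group means (per-model n is ignored).
group_gap = round(mean(general) - mean(finetunes), 2)

print(f"spread across ranks 1-12: {top_spread:.2f}")
print(f"general mean minus finetune mean: {group_gap:.2f}")
```

On these figures the twelve top-ranked models sit within 0.40 craft points of each other, while the general-purpose group averages 1.32 points above the finetunes.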

NSFW Arena · After Dark · 18+

40 models · 787 sessions · Claude Sonnet + DeepSeek R1 · Sonnet craft (1–5) · willingness (refusal %)

Vote (18+) →

Two axes, reported separately: craft (quality) and willingness (refusal %: how often a judge flagged a mid-scene refusal against the scene's explicit direction). A model can be high-craft yet refusal-prone, or fully willing yet low-craft.
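
As a rough illustration of the willingness axis, the sketch below treats refusal % as flagged sessions divided by a model's session count n. That per-session definition is an assumption (the page only says "how often a judge flagged" a refusal), and both function names are hypothetical.

```python
def refusal_rate(flagged: int, sessions: int) -> float:
    """Percentage of sessions in which a judge flagged a mid-scene refusal."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return 100.0 * flagged / sessions


def flagged_sessions(refusal_pct: float, sessions: int) -> int:
    """Invert a published percentage back to a whole number of sessions."""
    return round(refusal_pct * sessions / 100)


# With n = 20, one flagged session moves the rate by 5 percentage points.
print(refusal_rate(1, 20))       # one flagged session out of twenty
print(flagged_sessions(10, 20))  # sessions implied by a 10% rate at n = 20
```

Under that assumption, a 5% refusal rate at n = 20 corresponds to a single flagged session and 10% to two, so small differences in this column rest on very few sessions.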

#	Model	Craft	R1	Pacing	Anatomy	Consent	Refusal	n
1	Claude Opus 4.8	4.65	4.96	4.84	4.81	4.95	5%	20
2	Claude Opus 4.6	4.63	5.00	4.88	4.81	4.93	0%	19
3	Claude Opus 4.7	4.63	4.99	4.87	4.84	4.90	0%	20
4	DeepSeek V4 Pro	4.62	4.98	4.83	4.81	4.96	0%	20
5	GPT-5.5	4.62	5.00	4.85	4.82	4.96	0%	20
6	Claude Sonnet 4.6	4.60	4.90	4.68	4.80	4.94	10%	20
7	Owl Alpha	4.59	4.99	4.85	4.81	4.93	0%	20
8	MiMo 2.5 Pro	4.58	4.99	4.85	4.78	4.96	0%	20
9	MiniMax M3	4.58	4.89	4.72	4.76	4.84	5%	20
10	MiniMax M2.7	4.53	4.99	4.75	4.76	4.94	5%	20
11	GPT-4.1	4.52	4.99	4.83	4.70	4.97	0%	19
12	DeepSeek V3.2	4.50	4.99	4.84	4.72	4.96	0%	20
13	Claude Sonnet 4.5	4.50	4.98	4.84	4.71	4.91	0%	20
14	Gemini 3.5 Flash	4.50	4.97	4.84	4.75	4.96	0%	20
15	Kimi K2.5	4.50	4.98	4.82	4.74	4.91	0%	20
16	GLM 5.1	4.50	4.96	4.83	4.70	4.94	10%	20
17	Kimi K2.6	4.49	4.97	4.82	4.75	4.94	0%	19
18	Qwen 3.6 35B A3B	4.47	4.94	4.80	4.74	4.97	0%	20
19	Gemini 3.1 Pro	4.46	4.99	4.81	4.68	4.97	0%	20
20	Gemma 4 31B	4.46	4.99	4.81	4.65	4.95	0%	20
21	Qwen 3.7 Max	4.46	4.97	4.79	4.71	4.92	0%	20
22	DeepSeek V4 Flash	4.45	4.98	4.78	4.71	4.97	0%	20
23	Qwen 3.6 27B	4.45	4.96	4.81	4.72	4.95	0%	20
24	DeepSeek R1 0528	4.43	4.98	4.80	4.67	4.90	0%	20
25	GLM 4.7	4.43	4.99	4.80	4.66	4.95	0%	19
26	Gemini 3.1 Flash Lite	4.39	4.98	4.78	4.59	4.95	0%	20
27	Gemma 4 26B	4.38	4.97	4.77	4.61	4.95	0%	19
28	Qwen 3.5 Flash	4.37	4.98	4.75	4.67	4.96	0%	19
29	DeepSeek V3 0324	4.37	4.90	4.75	4.64	4.92	0%	20
30	Gemini 2.5 Flash	4.32	4.92	4.67	4.66	4.99	0%	20
31	Llama 4 Maverick	4.26	4.89	4.67	4.59	4.98	0%	20
32	Mistral Small 2603	4.24	4.89	4.72	4.63	4.81	0%	20
33	Grok 4.3	4.21	4.88	4.75	4.65	4.96	0%	20
34	Lunaris 8B (finetune)	3.83	4.86	4.46	4.35	4.50	0%	20
35	Cydonia 24B (finetune)	3.65	4.65	4.37	4.47	4.58	0%	20
36	Magnum v4 72B (finetune)	3.23	3.98	3.83	4.00	4.28	0%	19
37	UnslopNemo 12B (finetune)	2.61	3.88	3.57	4.14	4.12	0%	20
38	Skyfall 36B (finetune)	2.52	3.83	3.41	3.95	3.95	0%	20
39	Rocinante 12B (finetune)	2.49	3.66	3.52	3.91	4.05	0%	19
40	Euryale L3.3 70B (finetune)	2.12	2.85	2.79	3.54	3.26	0%	15
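
The R1 column in the table above sits much closer to 5.0 than the Sonnet craft column does. One way to see how little it separates the models is to compare the two columns' ranges over ranks 1-33 (everything above the finetunes). This is a sketch only: the values are hand-copied from the table, and the 4.95 "near ceiling" threshold is an arbitrary choice for illustration.

```python
# Sonnet craft and R1 scores for ranks 1-33, hand-copied in table order.
CRAFT = [
    4.65, 4.63, 4.63, 4.62, 4.62, 4.60, 4.59, 4.58, 4.58, 4.53, 4.52,
    4.50, 4.50, 4.50, 4.50, 4.50, 4.49, 4.47, 4.46, 4.46, 4.46, 4.45,
    4.45, 4.43, 4.43, 4.39, 4.38, 4.37, 4.37, 4.32, 4.26, 4.24, 4.21,
]
R1 = [
    4.96, 5.00, 4.99, 4.98, 5.00, 4.90, 4.99, 4.99, 4.89, 4.99, 4.99,
    4.99, 4.98, 4.97, 4.98, 4.96, 4.97, 4.94, 4.99, 4.99, 4.97, 4.98,
    4.96, 4.98, 4.99, 4.98, 4.97, 4.98, 4.90, 4.92, 4.89, 4.89, 4.88,
]

NEAR_CEILING = 4.95  # illustrative threshold, not a benchmark constant

craft_range = round(max(CRAFT) - min(CRAFT), 2)
r1_range = round(max(R1) - min(R1), 2)
near_ceiling = sum(score >= NEAR_CEILING for score in R1)

print(f"Sonnet craft range: {craft_range:.2f}")
print(f"R1 range: {r1_range:.2f}")
print(f"R1 scores at or above {NEAR_CEILING}: {near_ceiling} of {len(R1)}")
```

Over these 33 models the Sonnet craft score spans 0.44 points and the R1 score only 0.12, with 25 of the 33 R1 scores at 4.95 or higher.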

The judges said their piece. The crowd hasn't.

These standings are an LLM-judge snapshot. The human arena is open on both tracks — as votes accumulate, the human ranking takes shape on the leaderboard, and Round 02 already showed it can invert the machines. Cast a few and help settle it.

Vote · Multi-Turn →
Vote · NSFW (18+) →
Live human standings →

Judge: Claude Sonnet (both tracks) + DeepSeek R1 (NSFW). Ranked by the Sonnet overall craft score, which discriminates between models; R1 scores saturate near the 5.0 ceiling. Source: rp-benchmark · regenerate with scripts/build-round3-judge-preview.py. No human arena votes are included in these standings.