PlotPoints · Round 03 · Preview · Euryale L3.3 70B
Round 03 · Preliminary · Judge-Scored

Euryale L3.3 70B · RP finetune

Appears in: Multi-Turn Arena · New Pool (judge craft 2.37, #20); NSFW Arena · After Dark (judge craft 2.12, #40)

Judge-scored only: no human arena votes yet. These numbers may shift, or even invert, once human arena votes come in. Until then, this model has no human Elo, composite score, or cost/latency figures.

Multi-Turn Arena · New Pool

craft 2.37 · #20 of 21 · n=15
11-axis rubric · Sonnet judge (1–5)
Consistency: 2.31
Degradation: 1.69
Momentum: 2.63
Adaptive: 2.59
Agency: 3.32
Time: 2.97
Anti-Purple: 2.66
Anti-Repeat: 1.50
Show > Tell: 2.84
Subtext: 2.83
Pacing: 2.21
Quality trajectory ↓ degrades
Early: 3.78/5
Mid: 2.61/5
Late: 2.04/5
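
The Early / Mid / Late figures above are judge means, but the page does not say how turns are bucketed or how the "degrades" label is assigned. A minimal sketch of one plausible approach, assuming per-turn scores are split into three roughly equal thirds and the label comes from comparing late to early against a tolerance (the function names `trajectory` and `trend` and the `tol` value are hypothetical, not taken from the benchmark's script):

```python
from statistics import mean


def trajectory(turn_scores):
    """Average per-turn judge scores over early, mid and late thirds.

    Assumption: equal-thirds bucketing; the benchmark's real split is not documented.
    """
    n = len(turn_scores)
    if n < 3:
        raise ValueError("need at least three scored turns")
    a, b = n // 3, 2 * n // 3
    return {
        "early": mean(turn_scores[:a]),
        "mid": mean(turn_scores[a:b]),
        "late": mean(turn_scores[b:]),
    }


def trend(traj, tol=0.25):
    """Label the trajectory by the late-minus-early change (tol is a made-up threshold)."""
    delta = traj["late"] - traj["early"]
    if delta < -tol:
        return "degrades"
    if delta > tol:
        return "improves"
    return "holds"


if __name__ == "__main__":
    # Figures reported on this page for the Multi-Turn Arena.
    print(trend({"early": 3.78, "mid": 2.61, "late": 2.04}))
```

Under these assumptions the reported drop from 3.78 to 2.04 falls well outside any small tolerance, which is consistent with the "degrades" label shown.
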
Behavioral · computed from this model's prose
Avg words / reply: 687 (+313 vs pool)
Unique-word ratio: 0.493 (-0.112 vs pool)
Bigram repetition: 0.222 (+0.142 vs pool)
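
The three behavioral figures above are described only as computed from the model's prose; the tokenization and exact formulas are not given. A minimal sketch of one plausible way to compute them, assuming simple lowercase word tokens and defining bigram repetition as one minus the share of distinct bigrams (both are assumptions, and `tokenize` and `behavioral_metrics` are hypothetical names):

```python
import re


def tokenize(text):
    """Lowercase word tokens. Assumption: the benchmark's real tokenizer may differ."""
    return re.findall(r"[a-z0-9']+", text.lower())


def behavioral_metrics(replies):
    """Compute average reply length, unique-word ratio and bigram repetition.

    replies: list of reply strings produced by one model.
    """
    token_lists = [tokenize(r) for r in replies]
    all_tokens = [t for toks in token_lists for t in toks]
    if not all_tokens:
        raise ValueError("no tokens found in replies")

    # Mean number of word tokens per reply.
    avg_words = len(all_tokens) / len(token_lists)

    # Distinct words divided by total words, pooled over all replies.
    unique_ratio = len(set(all_tokens)) / len(all_tokens)

    # Bigrams are formed within each reply, then pooled.
    bigrams = [bg for toks in token_lists for bg in zip(toks, toks[1:])]
    # Assumed definition: 1 - distinct bigrams / total bigrams.
    bigram_rep = 1 - len(set(bigrams)) / len(bigrams) if bigrams else 0.0

    return {
        "avg_words_per_reply": avg_words,
        "unique_word_ratio": unique_ratio,
        "bigram_repetition": bigram_rep,
    }


if __name__ == "__main__":
    print(behavioral_metrics(["the cat sat", "the cat ran"]))
```

If the "vs pool" deltas are the model's value minus the pool mean, then +313 on 687 words would imply a pool average near 374 words per reply; that reading is an inference, not something the page states.
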

NSFW Arena · After Dark · 18+

craft 2.12 · R1 2.85 · #40 of 40 · n=15
NSFW-specific axes & willingness
Escalation pacing: 2.79/5
Anatomical coherence: 3.54/5
Consent / agency: 3.26/5
Refusal (willingness): 0% of sessions
11-axis rubric · Sonnet judge (1–5)
Consistency: 2.00
Degradation: 1.54
Momentum: 2.42
Adaptive: 2.73
Agency: 3.69
Time: 3.19
Anti-Purple: 2.20
Anti-Repeat: 1.20
Show > Tell: 2.33
Subtext: 2.47
Pacing: 1.97
Quality trajectory ↓ degrades
Early: 3.73/5
Mid: 2.50/5
Late: 1.73/5
Behavioral · computed from this model's prose
Avg words / reply: 676 (+372 vs pool)
Unique-word ratio: 0.468 (-0.153 vs pool)
Bigram repetition: 0.284 (+0.220 vs pool)

Flaw Hunter

Score (mean): 45.6 (median 49.0)
Fatal / session: 0.74
Major / session: 4.28
Sessions: 57
Top flaws: recycled description · narrating emotions · purple prose
From the Round 03 flaw-hunter catch-up run. Scores come from two separate runs, so compare absolute flaw scores between new and returning models with that caveat in mind.
This is the judges' call, not the crowd's.

Euryale L3.3 70B's scores here are judge-only. Read a session it played and vote; the human ranking takes shape on the leaderboard as ballots land.


Source: rp-benchmark · scripts/build-round3-judge-preview.py. Rubric & trajectory are Claude Sonnet judge means; behavioral is computed from the model's own dialogue. No human votes included.
