PlotPoints · Issue № 01 · April 2026
2,538 votes · 21 models · 2 rounds · CC-BY 4.0
The Roleplay AI
Verdict. Rounds 01 + 02.
Twenty models. Twenty-five hundred human votes. Six distinct tests. One open methodology — published with every output.
1535
Champion ELO · Gemma 4 26B
6
Tests run · ELO, multi-turn, rubric…
47%
Judge–human disagreement
The Voter Loop
Round 03 is coming.
Round 02 is closed — and the judges still disagree with humans about half the time. Round 03 puts a new pool, and an After Dark question, in front of real readers.
I
Closed · Opens soon
Multi-Turn Arena
Compare two full 12-turn sessions side-by-side. Round 02 closed at 1,877 votes; Round 03 brings a refreshed pool — Opus 4.8, GPT-5.5, and the RP-finetune crowd — on the same seeds.
See results →
II
Closed · Opens soon
NSFW Arena · 18+
After Dark track: 40 models on 20 NSFW-adversarial seeds — consent handling, mid-scene refusals, anatomical coherence. Do human voters overturn the judges? Round 03 finds out.
See results →
III
Closed · 3,890 votes
Past Rounds
Round 01 (single-turn, 2,013 votes) crowned Gemma; Round 02 (multi-turn, 1,877 votes) inverted the podium. Read the issues and the calibrated results.
See results →
Leaderboard · rp-benchmark canonical overall ranking — weighted blend across multi-turn, judge, rubric, flaws, behavior.
| № | Model | Score (/100) |
|---|---|---|
| 1 | Opus 4.7 | 97.6 |
| 2 | Sonnet 4.5 | 92.9 |
| 3 | Opus 4.6 | 88.1 |
| 4 | DeepSeek v3.2 | 83.3 |
| 5 | GPT-4.1 | 78.6 |
| 6 | GLM 4.7 | 73.8 |
| 7 | DeepSeek v4 Pro | 69.0 |
| 8 | Gemma 4 26B | 64.3 |
From the Issue
The champ that wouldn't move.
Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.
But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. Continue reading →
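For readers who want to see what "held the top at every checkpoint" means mechanically, here is a minimal sketch of a standard Elo replay over pairwise votes. It is not the PlotPoints implementation: the starting rating of 1500, the K-factor of 32, the `(winner, loser)` vote format, and the function names are all assumptions made for illustration.

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, a_won, k=32.0):
    """Return the new (r_a, r_b) after one head-to-head vote.

    k=32 is an assumed K-factor, not a value taken from the issue.
    """
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def replay(votes, start=1500.0, k=32.0, checkpoints=()):
    """Replay (winner, loser) votes in order.

    Returns the final ratings and, for each vote count listed in
    `checkpoints`, the model that was leading at that point.
    """
    ratings = {}
    leaders = {}
    for i, (winner, loser) in enumerate(votes, start=1):
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update(r_w, r_l, True, k)
        if i in checkpoints:
            leaders[i] = max(ratings, key=ratings.get)
    return ratings, leaders


# Hypothetical toy votes; the model names are placeholders.
toy_votes = [("a", "b"), ("a", "c"), ("b", "c")]
final_ratings, leaders = replay(toy_votes, checkpoints=(1, 3))
```

A leader that is the same at every checkpoint, as reported above for 540 through 2,000 votes, is the kind of stability this replay would surface. Elo is order-dependent, so a fuller check would also shuffle the vote order and repeat the replay.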
Next issue · 05-15-2026