PlotPoints · Issue № 01 · April 2026
2,538 votes · 20 models · 2 rounds · CC-BY 4.0
The Roleplay AI
Verdict. Rounds 01 + 02.
Twenty models. Twenty-five hundred human votes. Six distinct tests. One open methodology — published with every output.
1535
Champion ELO · Gemma 4 26B
6
Tests run · ELO, multi-turn, rubric…
47%
Judge–human disagreement
▌ The Voter Loop
Help us grade Round 02.
The LLM judges disagree with humans 47% of the time. Every vote you cast pulls them back toward real reader preference.
I
● Active · ~5 min
Multi-Turn Arena
Compare two full 12-turn sessions side-by-side across our 20-model pool. Tests degradation and narrative momentum across a whole scene. Round 02 is open.
Cast a vote →
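The arena ranking behind these sessions is Elo-based. As a rough sketch of how a single head-to-head vote could move two models' ratings, here is a standard Elo update; the K-factor of 32 and the starting ratings are assumptions for illustration, not PlotPoints' actual parameters.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one vote.

    Standard Elo: the expected score follows a logistic curve on the
    rating gap, and the winner gains what the loser sheds. K = 32 is
    an assumed volatility, not the benchmark's published setting.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two evenly matched models: the winner gains exactly k/2 points.
a, b = elo_update(1500.0, 1500.0, a_won=True)
# → (1516.0, 1484.0)
```

Because the update is zero-sum, total rating across the pool is conserved; upsets against higher-rated models move more points than expected wins.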
II
Closed · See round
Rubric Score
21-axis Likert scoring — agency, voice, subtext. The voter UI is ready, but Round 02 ships multi-turn only; rubric scoring opens with Round 03.
See results →
III
Closed · 2,013 votes
Single-Turn Arena
Two anonymous responses, side-by-side. Round 01 closed at 2,013 votes — see the calibrated results that came out of it.
See results →
Leaderboard · rp-benchmark canonical overall ranking — weighted blend across multi-turn, judge, rubric, flaws, behavior.
| № | Model | Score |
|---|---|---|
| 1 | Opus 4.7 | 97.6/100 |
| 2 | Sonnet 4.5 | 92.9/100 |
| 3 | Opus 4.6 | 88.1/100 |
| 4 | DeepSeek v3.2 | 83.3/100 |
| 5 | GPT-4.1 | 78.6/100 |
| 6 | GLM 4.7 | 73.8/100 |
| 7 | DeepSeek v4 Pro | 69.0/100 |
| 8 | Gemma 4 26B | 64.3/100 |
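The caption describes the overall score as a weighted blend across multi-turn, judge, rubric, flaws, and behavior. A minimal sketch of such a blend follows; the component names mirror the caption, but the weights and sample scores are invented for illustration — the benchmark's real weighting is not stated here.

```python
# Hypothetical weights: these are NOT the benchmark's published values.
WEIGHTS = {
    "multi_turn": 0.35,
    "judge": 0.25,
    "rubric": 0.20,
    "flaws": 0.10,
    "behavior": 0.10,
}

def overall_score(components: dict[str, float]) -> float:
    """Blend per-component scores (each 0-100) into one 0-100 score.

    Weights sum to 1.0, so the blend stays on the same 0-100 scale
    as its inputs.
    """
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

# Example with made-up component scores for a single model.
score = overall_score({
    "multi_turn": 90.0, "judge": 80.0, "rubric": 85.0,
    "flaws": 70.0, "behavior": 75.0,
})
```

A linear blend like this is why a model can top one axis (Gemma's Elo) yet rank mid-pack overall: weak components pull the weighted sum down.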
From the Issue
The champ that wouldn't move.
Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.
But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. Continue reading →
Next issue · May 15, 2026