PlotPoints · Issue № 01 · April 2026
2,538 votes · 20 models · 2 rounds · CC-BY 4.0

The Roleplay AI Verdict.
Rounds 01 + 02.

Twenty models. Twenty-five hundred human votes. Six distinct tests. One open methodology — published with every output.

Read the verdict → · Methodology · Donate a chat →
1535 · Champion ELO · Gemma 4 26B
6 · Tests run · ELO, multi-turn, rubric…
47% · Judge–human disagreement
▌ The Voter Loop
Help us grade Round 02.
The LLM judges disagree with humans 47% of the time. Every vote you cast pulls them back toward real reader preference.
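
For the curious, that disagreement figure falls out of simple bookkeeping over paired preference votes. A minimal Python sketch; the record fields ("judge_pick", "human_pick") are illustrative, not the published dataset schema:

    # Minimal sketch: judge-human disagreement over paired preference votes.
    # Field names are illustrative, not the published dataset schema.
    def disagreement_rate(votes: list[dict]) -> float:
        """Fraction of head-to-head votes where the LLM judge's pick
        differs from the human voter's."""
        if not votes:
            return 0.0
        misses = sum(1 for v in votes if v["judge_pick"] != v["human_pick"])
        return misses / len(votes)

    sample = [
        {"judge_pick": "Opus 4.7", "human_pick": "Opus 4.7"},
        {"judge_pick": "GPT-4.1", "human_pick": "Gemma 4 26B"},
    ]
    print(f"{disagreement_rate(sample):.0%}")  # -> 50%
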
Leaderboard · rp-benchmark canonical overall ranking: a weighted blend across multi-turn, judge, rubric, flaws, and behavior tests (a blend sketch follows the table).
#  Model            Score
1  Opus 4.7         97.6/100
2  Sonnet 4.5       92.9/100
3  Opus 4.6         88.1/100
4  DeepSeek v3.2    83.3/100
5  GPT-4.1          78.6/100
6  GLM 4.7          73.8/100
7  DeepSeek v4 Pro  69.0/100
8  Gemma 4 26B      64.3/100
Show all 11 →
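
How the blend works, in miniature: each model carries five per-test scores (0-100), and the overall number is their weighted sum. A minimal sketch assuming equal weights and made-up per-test inputs; the published methodology defines the real weighting:

    # Minimal sketch of the weighted blend behind the overall score.
    # Equal weights are an assumption; the published methodology
    # defines the actual weighting.
    WEIGHTS = {
        "multi_turn": 0.2,
        "judge": 0.2,
        "rubric": 0.2,
        "flaws": 0.2,
        "behavior": 0.2,
    }

    def overall_score(scores: dict[str, float]) -> float:
        """Blend per-test scores (each 0-100) into one 0-100 number."""
        return sum(w * scores[test] for test, w in WEIGHTS.items())

    # Illustrative per-test inputs, not any model's real subscores.
    print(round(overall_score({"multi_turn": 98, "judge": 99, "rubric": 97,
                               "flaws": 96, "behavior": 98}), 1))  # -> 97.6
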
From the Issue

The champ that wouldn't move.

Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.

But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. Continue reading →
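
The champion number itself comes from pairwise ELO, updated once per community vote. A minimal sketch, assuming the textbook K-factor of 32 and 400-point logistic scale rather than the benchmark's exact constants:

    # Minimal sketch of the standard Elo update applied per head-to-head vote.
    # K=32 and the 400-point scale are textbook defaults, assumed here.
    def elo_update(rating_a: float, rating_b: float, a_won: bool,
                   k: float = 32.0) -> tuple[float, float]:
        """Return updated (rating_a, rating_b) after one vote."""
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        delta = k * ((1.0 if a_won else 0.0) - expected_a)
        return rating_a + delta, rating_b - delta

    # One vote for a 1535-rated champion over a 1500-rated challenger.
    new_a, new_b = elo_update(1535.0, 1500.0, a_won=True)
    print(round(new_a), round(new_b))  # -> 1549 1486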

Methodology · Raw votes · GitHub · HuggingFace dataset · RSS
Next issue · May 15, 2026