PlotPoints · Issue № 01 · April 2026
2,538 votes · 20 models · 2 rounds · CC-BY 4.0
The Roleplay AI
Verdict. Rounds 01 + 02.
Twenty models. Twenty-five hundred human votes. Six distinct tests. One open methodology — published with every output.
1535
Champion ELO · Gemma 4 26B
6
Tests run · ELO, multi-turn, rubric…
47%
Judge–human disagreement
▌ The Voter Loop
Help us grade Round 02.
The LLM judges disagree with humans 47% of the time. Every vote you cast pulls them back toward real reader preference.
I
● Active · ~5 min
Multi-Turn Arena
Compare two full 12-turn sessions side-by-side across our 20-model pool. Tests degradation and narrative momentum across a whole scene. Round 02 is open.
Cast a vote →
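The arena ranking behind these sessions is Elo-based. As a rough sketch of how a single head-to-head vote could move two models' ratings, here is a standard Elo update; the K-factor of 32 and the starting ratings are assumptions for illustration, not PlotPoints' actual parameters.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one vote.

    Standard Elo: the expected score follows a logistic curve on the
    rating gap, and the winner gains what the loser sheds. K = 32 is
    an assumed volatility, not the benchmark's published setting.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two evenly matched models: the winner gains exactly k/2 points.
a, b = elo_update(1500.0, 1500.0, a_won=True)
# → (1516.0, 1484.0)
```

Because the update is zero-sum, total rating across the pool is conserved; upsets against higher-rated models move more points than expected wins.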
II
Closed · See round
Rubric Score
21-axis Likert scoring — agency, voice, subtext. The voter UI is ready, but Round 02 ships multi-turn only; rubric scoring opens with Round 03.
See results →
III
Closed · 2,013 votes
Single-Turn Arena
Two anonymous responses, side-by-side. Round 01 closed at 2,013 votes — see the calibrated results that came out of it.
See results →
Leaderboard · rp-benchmark canonical overall ranking — weighted blend across multi-turn, judge, rubric, flaws, behavior.
| № | Model | Score |
|---|---|---|
| 1 | Opus 4.7 | 97.6/100 |
| 2 | Sonnet 4.5 | 92.9/100 |
| 3 | Opus 4.6 | 88.1/100 |
| 4 | DeepSeek v3.2 | 83.3/100 |
| 5 | GPT-4.1 | 78.6/100 |
| 6 | GLM 4.7 | 73.8/100 |
| 7 | DeepSeek v4 Pro | 69.0/100 |
| 8 | Gemma 4 26B | 64.3/100 |
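The caption describes the overall score as a weighted blend across multi-turn, judge, rubric, flaws, and behavior. A minimal sketch of such a blend follows; the component names mirror the caption, but the weights and sample scores are invented for illustration — the benchmark's real weighting is not stated here.

```python
# Hypothetical weights: these are NOT the benchmark's published values.
WEIGHTS = {
    "multi_turn": 0.35,
    "judge": 0.25,
    "rubric": 0.20,
    "flaws": 0.10,
    "behavior": 0.10,
}

def overall_score(components: dict[str, float]) -> float:
    """Blend per-component scores (each 0-100) into one 0-100 score.

    Weights sum to 1.0, so the blend stays on the same 0-100 scale
    as its inputs.
    """
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

# Example with made-up component scores for a single model.
score = overall_score({
    "multi_turn": 90.0, "judge": 80.0, "rubric": 85.0,
    "flaws": 70.0, "behavior": 75.0,
})
```

A linear blend like this is why a model can top one axis (Gemma's Elo) yet rank mid-pack overall: weak components pull the weighted sum down.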
From the Issue
The champ that wouldn't move.
Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.
But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. Continue reading →
Next issue · May 15, 2026