PlotPoints · Round 01
Issue № 01 · April 2026
Closed

The champ that wouldn't move.

Eleven models, 1,857 blind votes, six snapshot checkpoints — one model held the top of every single one.

1,857 votes · 335 voters · 271 pairs · 75% catch-pair pass · 47% judge–human disagreement · 7 median votes / pair
From the Issue

The champ that wouldn't move.

Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.

But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. See the full standings →
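For readers curious how community ELO standings like these are computed from blind pairwise votes, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the 1,000-point starting rating are illustrative assumptions, not PlotPoints' actual parameters:

```python
# Minimal Elo update from pairwise blind votes (illustrative sketch).
# K=32 and the 1000-point starting rating are assumptions, not the
# arena's actual configuration.

def expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    """Apply one vote: `winner` beat `loser` in a blind pair."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, "model_a", "model_b")
print(round(ratings["model_a"], 1))  # → 1016.0
```

Because each vote moves only the two models in the pair, a leader that "doesn't move" across checkpoints means it kept winning its matchups as votes accumulated, not that it stopped being tested.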

▌ The Takeaways
  1. Gemma 4 26B held #1 community ELO across all six snapshot checkpoints — the only model that didn't move.
  2. Engagement (community votes) and reliability (adversarial probes) are different leaderboards: Sonnet 4.5 ranked #3 on failure-mode probes in the 20-model pool but only #6 on Round 01 community ELO. The inversion is the story.
  3. Mistral SC's NSFW specialty (a 67.4% NSFW win rate) carried it to #2 community ELO despite a 15.9% agency-violation rate.
  4. Catch-pair quality control flagged ~25% of voters as low-confidence; their ballots were downweighted accordingly.
  5. LLM judges disagreed with human voters on 47% of arena pairs — every additional human vote pulls the model rankings closer to reader preference.
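Takeaway 4's downweighting could work along these lines: voters occasionally see catch pairs with a known correct winner, and each ballot is then weighted by that voter's catch-pair accuracy. The 75% threshold matches the pass rate reported above, but the 0.25 floor and linear scaling are hypothetical choices, not the arena's documented scheme:

```python
def ballot_weight(catch_correct, catch_total, floor=0.25):
    """Weight a voter's ballots by catch-pair accuracy.
    Voters below 75% accuracy count as low-confidence and are
    downweighted; the 0.25 floor and linear scaling below the
    threshold are hypothetical choices."""
    if catch_total == 0:
        return floor  # no QC data yet: treat as low-confidence
    acc = catch_correct / catch_total
    if acc >= 0.75:
        return 1.0
    return max(floor, acc)  # downweight, never fully discard

print(ballot_weight(4, 4))  # → 1.0
print(ballot_weight(1, 4))  # → 0.25
```

Downweighting rather than discarding keeps suspect ballots in the tally at reduced influence, which is gentler on honest voters who simply misread a catch pair.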
Leaderboard · Methodology · Raw votes (CSV) · GitHub · HF dataset · RSS