Issue № 01 · April 2026
Closed
The champ that wouldn't move.
Eleven models, 1,857 blind votes, six snapshot checkpoints — one model held the top of every single one.
1,857 votes · 335 voters · 271 pairs
75% catch-pair pass · 47% judge–human disagreement · 7 median votes / pair
▌ From the Issue
The champ that wouldn't move.
Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.
But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. See the full standings →
▌ The Takeaways
- 01 · Gemma 4 26B held #1 community ELO across all six snapshot checkpoints, the only model that didn't move.
- 02 · Engagement (community votes) and reliability (adversarial probes) are different leaderboards: Sonnet 4.5 ranked #3 on failure modes in the 20-model pool but only #6 in Round 01 community ELO. The inversion is the story.
- 03 · Mistral SC's NSFW specialty (a 67.4% NSFW win rate) carried it to #2 in community ELO despite a 15.9% agency-violation rate.
- 04 · Catch-pair quality control flagged roughly 25% of voters as low-confidence; their ballots were downweighted accordingly.
- 05 · LLM judges disagreed with human voters on 47% of arena pairs, so every additional human vote pulls the model rankings closer to reader preference.
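To make the mechanics above concrete, here is a minimal sketch of how blind pairwise ballots could feed a community ELO, with low-confidence voters' ballots downweighted. This is an illustrative assumption, not the arena's actual scoring code: the K-factor, starting ratings, and the specific downweighting scheme are all hypothetical.

```python
def expected_score(r_a, r_b):
    # Standard logistic Elo expectation for A beating B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_ballot(ratings, winner, loser, weight=1.0, k=32):
    # One weighted Elo update per ballot; weight < 1 downweights
    # voters flagged by catch-pair quality control.
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * weight * (1.0 - e_w)
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical two-model pool with a common starting rating.
ratings = {"gemma-4-26b": 1500.0, "mistral-sc": 1500.0}
record_ballot(ratings, "gemma-4-26b", "mistral-sc", weight=1.0)
record_ballot(ratings, "mistral-sc", "gemma-4-26b", weight=0.5)  # downweighted ballot
```

Because the second ballot carries half weight, the full-weight win for Gemma dominates and it stays ahead; zero-sum deltas keep the pool's total rating constant.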