PlotPoints · Round 01
Issue № 01 · April 2026
Closed

The champ that wouldn't move.

Eleven models, 1,857 blind votes, six snapshot checkpoints — one model held the top of every single one.

1,857 votes · 335 voters · 271 pairs · 75% catch-pair pass · 47% judge–human disagreement · 7 median votes / pair
From the Issue

The champ that wouldn't move.

Google's open mid-size model held the top of our community ELO leaderboard across every checkpoint we ran — 540, 734, 890, 1,000, 1,600, and 2,000 votes. That's not noise.

But Gemma is only #7 on failure-mode rankings. Engagement and reliability are different leaderboards. See the full standings →
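For readers curious how community ELO standings like these are computed from blind pairwise votes, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the 1,000-point starting rating are illustrative assumptions, not PlotPoints' actual parameters:

```python
# Minimal Elo update from pairwise blind votes (illustrative sketch).
# K=32 and the 1000-point starting rating are assumptions, not the
# arena's actual configuration.

def expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    """Apply one vote: `winner` beat `loser` in a blind pair."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, "model_a", "model_b")
print(round(ratings["model_a"], 1))  # → 1016.0
```

Because each vote moves only the two models in the pair, a leader that "doesn't move" across checkpoints means it kept winning its matchups as votes accumulated, not that it stopped being tested.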

▌ The Takeaways
  1. Gemma 4 26B held #1 community ELO across all six snapshot checkpoints — the only model that didn't move.
  2. Engagement (community votes) and reliability (adversarial probes) are different leaderboards: Sonnet 4.5 ranked #3 on failure-mode probes in the 20-model pool but only #6 on Round 01 community ELO. The inversion is the story.
  3. Mistral SC's NSFW specialty (a 67.4% NSFW win rate) carried it to #2 community ELO despite a 15.9% agency-violation rate.
  4. Catch-pair quality control flagged ~25% of voters as low-confidence; their ballots were downweighted accordingly.
  5. LLM judges disagreed with human voters on 47% of arena pairs — every additional human vote pulls the model rankings closer to reader preference.
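Takeaway 4's downweighting could work along these lines: voters occasionally see catch pairs with a known correct winner, and each ballot is then weighted by that voter's catch-pair accuracy. The 75% threshold matches the pass rate reported above, but the 0.25 floor and linear scaling are hypothetical choices, not the arena's documented scheme:

```python
def ballot_weight(catch_correct, catch_total, floor=0.25):
    """Weight a voter's ballots by catch-pair accuracy.
    Voters below 75% accuracy count as low-confidence and are
    downweighted; the 0.25 floor and linear scaling below the
    threshold are hypothetical choices."""
    if catch_total == 0:
        return floor  # no QC data yet: treat as low-confidence
    acc = catch_correct / catch_total
    if acc >= 0.75:
        return 1.0
    return max(floor, acc)  # downweight, never fully discard

print(ballot_weight(4, 4))  # → 1.0
print(ballot_weight(1, 4))  # → 0.25
```

Downweighting rather than discarding keeps suspect ballots in the tally at reduced influence, which is gentler on honest voters who simply misread a catch pair.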
Leaderboard · Methodology · Raw votes (CSV) · GitHub · HF dataset · RSS