Issue № 02 · April–June 2026

Closed

Single-turn charm doesn't survive twelve turns.

Twenty models, 1,877 blind votes on full twelve-turn sessions, and two of Round 01's three podium models sank to #15 and #19.

1,877 votes · 466 voters · 190 pairs · 74% catch-pair pass

▌ From the Issue

The great inversion.

Round 01's community podium (Gemma 4 26B, Mistral Small Creative, Gemini 2.5 Flash) finished the multi-turn round at #15, #6, and #19. GPT-4.1, dead last with single-message voters, climbed into the top 5 once voters read whole sessions.

Single-message votes measure charm; twelve-turn votes measure stamina. Across the 11 models ranked in both rounds, the two leaderboards correlate weakly and negatively (Spearman ρ = −0.24).
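For readers who want to see how a figure like that ρ comes about, here is a minimal sketch of a Spearman rank correlation. It assumes no tied positions and re-ranks each leaderboard within the shared subset of models before comparing. The function names and the last two rows of the sample input are hypothetical; this is not the issue's actual analysis code or data.

```python
def rerank(positions):
    """Convert leaderboard positions (e.g. 15, 6, 19) into ranks 1..n within the subset."""
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    ranks = [0] * len(positions)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rho(positions_a, positions_b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(positions_a) != len(positions_b) or len(positions_a) < 2:
        raise ValueError("need two equal-length lists with at least two models")
    ranks_a, ranks_b = rerank(positions_a), rerank(positions_b)
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Illustrative input only. The first three multi-turn positions (15, 6, 19) are the
# Round 01 podium finishes quoted above; the last two rows are hypothetical.
single_message = [1, 2, 3, 4, 5]
multi_turn = [15, 6, 19, 2, 4]

rho = spearman_rho(single_message, multi_turn)
print(f"rho = {rho:.2f}")  # prints rho = -0.60 for this toy input
```

The re-ranking step matters because the multi-turn board has twenty models while only a subset appears on both boards; comparing raw positions would mix the two scales.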

▌ The Takeaways

01 Claude Opus 4.7 closed the round at #1 (1583), retaking the top spot from DeepSeek v4 Pro in the final six weeks of voting.
02 The single-message and multi-turn rankings diverge: Spearman ρ = −0.24 across the 11 models that appear in both. Engagement at message one and engagement at turn twelve are different skills.
03 Human multi-turn votes track the LLM judge only loosely (ρ = +0.53). Voters reward narrative pull that the judges' craft rubric undersells, which is exactly why the human round exists.
04 GPT-4.1's arc is the round in miniature: last with the community in Round 01, top 5 on multi-turn in Round 02.
05 Position bias stayed manageable: the B-side took 52.8% of decided votes across 1,877 ballots.
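
One way to put a number on the position-bias takeaway is a confidence interval around the B-side share. The sketch below uses a plain normal-approximation interval. The issue does not report how many of the 1,877 ballots were decided, so the counts here are assumptions: every ballot is treated as decided, and the B-side win count is back-computed from the 52.8% figure. The function name is hypothetical.

```python
import math


def b_side_interval(b_wins, decided, z=1.96):
    """Return (share, low, high): B-side share with a normal-approximation interval."""
    if decided <= 0 or not 0 <= b_wins <= decided:
        raise ValueError("need 0 <= b_wins <= decided and decided > 0")
    share = b_wins / decided
    half_width = z * math.sqrt(share * (1 - share) / decided)
    return share, share - half_width, share + half_width


# Assumed counts, not reported figures: all 1,877 ballots treated as decided,
# and B-side wins reconstructed from the published 52.8% share.
decided = 1877
b_wins = round(0.528 * decided)

share, low, high = b_side_interval(b_wins, decided)
print(f"B-side share {share:.3f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

With fewer decided votes than assumed here, the interval would be wider. The interval describes sampling uncertainty only; whether a given tilt counts as manageable remains a judgment about effect size.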

