Issue № 02 · April–June 2026
Closed
Single-turn charm doesn't survive twelve turns.
Twenty models, 1,877 blind votes on full twelve-turn sessions, and two of Round 01's three podium models sank to #15 and #19.
1,877 votes · 466 voters · 190 pairs · 74% catch-pair pass
▌ From the Issue
The great inversion.
Round 01's community podium — Gemma 4 26B, Mistral Small Creative, Gemini 2.5 Flash — finished the multi-turn round at #15, #6, and #19. GPT-4.1, dead last with single-message voters, climbed to top-5 once voters read whole sessions.
Single-message votes measure charm; twelve-turn votes measure stamina. The two leaderboards correlate negatively (ρ = −0.24). See the full standings →
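For readers who want to check a rank correlation like this themselves, here is a minimal sketch of Spearman's ρ computed over only the models that appear on both leaderboards. The model names and placements are invented placeholders, not the published standings, and nothing here is confirmed as the pipeline actually used for this issue.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical placements (1 = best). Placeholder data, NOT the real standings.
single_message = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4, "model_e": 5}
multi_turn = {"model_a": 4, "model_b": 2, "model_c": 5, "model_d": 1, "model_e": 3, "model_f": 6}

# Only models present on both leaderboards enter the comparison.
shared_models = sorted(set(single_message) & set(multi_turn))
rho = spearman_rho(
    [single_message[m] for m in shared_models],
    [multi_turn[m] for m in shared_models],
)
print(f"rho = {rho:.2f} over {len(shared_models)} shared models")
```

A negative ρ means models placed high on one board tend to place low on the other; the closer the value sits to zero, the weaker that tendency.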
▌ The Takeaways
- 01 Claude Opus 4.7 closed the round at #1 (1583), retaking the top from DeepSeek v4 Pro in the final six weeks of voting.
- 02 The single-message and multi-turn rankings run in opposite directions: Spearman ρ = −0.24 across the 11 models ranked in both. Engagement at message one and engagement at turn twelve are different skills.
- 03 Human multi-turn votes track the LLM judge only loosely (ρ = +0.53): voters reward narrative pull that the judges' craft rubric undersells, which is exactly why the human round exists.
- 04 GPT-4.1's arc is the round in miniature: community last in Round 01, multi-turn top-5 in Round 02.
- 05 Position bias stayed manageable: B-side took 52.8% of decided votes across 1,877 ballots.
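
One way to sanity-check a position-bias figure like the one in the last takeaway is to compare the B-side share of decided votes against an even split. The sketch below uses a normal approximation to the binomial. The issue does not state how many of the ballots were decided or which test, if any, was applied, so the counts here are illustrative assumptions chosen only to produce a 52.8% share.

```python
from math import erfc, sqrt

def position_bias(b_wins, decided):
    """Return (B-side share, z-score, two-sided p-value) against a 50/50 null.

    Uses the normal approximation to the binomial, which is reasonable
    when the number of decided votes is in the hundreds or more.
    """
    share = b_wins / decided
    standard_error = sqrt(0.25 / decided)  # sqrt(p * (1 - p) / n) at p = 0.5
    z = (share - 0.5) / standard_error
    p_two_sided = erfc(abs(z) / sqrt(2))
    return share, z, p_two_sided

# Hypothetical counts: the real number of decided votes is not given in the issue.
share, z, p_value = position_bias(b_wins=528, decided=1000)
print(f"B-side share = {share:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```

The same share becomes stronger evidence of bias as the number of decided votes grows, so the real count matters more than the percentage alone.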