PlotPoints Round 02
Issue № 02 · April–June 2026
Closed

Single-turn charm doesn't survive twelve turns.

Twenty models, 1,877 blind votes on full twelve-turn sessions — and Round 01's podium sank almost to the bottom.

1,877 votes · 466 voters · 190 pairs · 74% catch-pair pass

From the Issue

The great inversion.

Round 01's community podium — Gemma 4 26B, Mistral Small Creative, Gemini 2.5 Flash — finished the multi-turn round at #15, #6, and #19. GPT-4.1, dead last with single-message voters, climbed to top-5 once voters read whole sessions.

Single-message votes measure charm; twelve-turn votes measure stamina. The two leaderboards correlate negatively (ρ = −0.24).
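For readers who want to check a rank correlation like this themselves, here is a minimal sketch of Spearman's ρ in plain Python. It is not the site's own pipeline: the function names and the toy ranks below are illustrative assumptions, and the real inputs would be the two rank columns for the models that appear on both leaderboards.

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one side is constant")
    return cov / sqrt(vx * vy)


# Hypothetical toy ranks (NOT the published standings): five models,
# single-message rank vs. multi-turn rank.
toy_single = [1, 2, 3, 4, 5]
toy_multi = [5, 4, 3, 2, 1]
print(spearman_rho(toy_single, toy_multi))
```

A fully reversed ordering, as in the toy data, gives ρ = −1; a value like −0.24 indicates a much weaker negative relationship, especially with only a handful of shared models.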

▌ The Takeaways
  1. Claude Opus 4.7 closed the round at #1 (1583), retaking the top from DeepSeek v4 Pro in the final six weeks of voting.
  2. The single-message and multi-turn rankings invert: Spearman ρ = −0.24 across the 11 models that appear in both. Engagement at message one and engagement at turn twelve are different skills.
  3. Human multi-turn votes track the LLM judge only loosely (ρ = +0.53): voters reward narrative pull that the judges' craft rubric undersells, which is exactly why the human round exists.
  4. GPT-4.1's arc is the round in miniature: last in Round 01's community vote, top-5 in Round 02's multi-turn vote.
  5. Position bias stayed manageable: the B-side took 52.8% of decided votes across 1,877 ballots.
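One way to put the position-bias figure in context is a two-sided z-test of the B-side share against an unbiased 50%. The sketch below is an illustration, not the round's published method. It assumes votes are independent (in practice they are clustered within voters, so this is optimistic) and leaves the number of decided votes as a parameter, since the page gives the ballot total but not how many of those ballots were decided.

```python
from math import erfc, sqrt


def position_bias_z(b_share, n_decided):
    """z-score of the observed B-side share against a fair 50/50 split."""
    if n_decided <= 0 or not 0.0 <= b_share <= 1.0:
        raise ValueError("need a positive vote count and a share in [0, 1]")
    standard_error = sqrt(0.25 / n_decided)  # sqrt(p * (1 - p) / n) at p = 0.5
    return (b_share - 0.5) / standard_error


def two_sided_p(z):
    """Two-sided p-value under the normal approximation."""
    return erfc(abs(z) / sqrt(2))


# ASSUMPTION: n_decided is a placeholder. 1,877 is the ballot total, so it is
# only an upper bound on decided votes; substitute the real count if known.
n_decided = 1877
z = position_bias_z(0.528, n_decided)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.3f}")
```

Statistical significance and practical size are separate questions here: a share a few points off 50% can be detectable with this many ballots and still be small enough to average out when each pair is shown in both orders.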