Methodology
Six tests. Eleven models. Two thousand human votes. One open methodology — published with every output.
Every score on the leaderboard, every rank in a model profile, every floor flagged in a verdict comes out of the same test harness, judge pipeline, and analyzer scripts. This page is the spec for all of it.
Each section below explains how one of the six tests is conducted, how a model gets a number out of it, and how to read the leaderboard cells without having to read source code. Every claim is anchored to a specific file in the open-source repository. If a number on the leaderboard surprises you, follow the source citation back to the upstream JSON — the data is open and verifiable.
The six tests, at a glance
Round 01 ran six distinct tests against eleven models. Each test measures a different facet: community engagement, multi-turn coherence, holistic writing quality, robustness to adversarial probes, cost and speed, and, as a check on the votes themselves, voter quality control. A model that wins one test can lose another; the methodology surfaces those splits rather than collapsing them into a single score.
Arena ELO
Single-turn pairwise blind voting. A reader is shown two anonymous model responses to the same prompt and asked to pick which one they preferred. Pairings are randomized; the model identities are hidden during the vote and only revealed after the ballot is submitted.
Round 01 collected 1,857 votes from 335 readers across 271 model pairs. With a pool of 11 models, each pairing was sampled a median of 7 times; under-sampled pairings were re-sampled until every pairing had enough statistical power for a stable ELO estimate.
All catch-pair quality control (see Calibration) ran live during the round — duplicate-pair ballots were injected at roughly 1 in 8, and votes from readers with inconsistent catch-pair behavior were down-weighted in the final aggregation.
Each pairwise vote contributes to a Bayesian ELO update. The aggregator uses a logistic preference model with a non-informative prior, then runs MCMC to produce posterior means and 95% credible intervals per model.
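For intuition, here is a minimal sketch of that update: a flat prior, a Bradley-Terry-style logistic likelihood on the standard 400-point ELO curve, and a plain Metropolis sampler. Everything named here (step size, chain length, the 1500 anchor) is illustrative; the repository's aggregator is the authority.

```python
import numpy as np

def sample_elo_posterior(votes, n_models, n_steps=20_000, step=8.0, seed=0):
    """Metropolis MCMC over a logistic (Bradley-Terry-style) preference model.

    votes: (winner_idx, loser_idx) pairs from blind ballots.
    Returns posterior means and 95% credible intervals per model.
    """
    rng = np.random.default_rng(seed)
    win = np.array([v[0] for v in votes])
    los = np.array([v[1] for v in votes])
    ratings = np.full(n_models, 1500.0)

    def log_lik(r):
        # P(winner beats loser) on the standard 400-point logistic curve
        return -np.log1p(10.0 ** ((r[los] - r[win]) / 400.0)).sum()

    cur, kept = log_lik(ratings), []
    for i in range(n_steps):
        prop = ratings + rng.normal(0.0, step, n_models)
        prop += 1500.0 - prop.mean()          # re-anchor: only rating gaps matter
        new = log_lik(prop)
        if np.log(rng.random()) < new - cur:  # flat prior -> likelihood ratio
            ratings, cur = prop, new
        if i >= n_steps // 2:                 # discard burn-in
            kept.append(ratings.copy())
    kept = np.array(kept)
    lo, hi = np.percentile(kept, [2.5, 97.5], axis=0)
    return kept.mean(axis=0), lo, hi
```

Re-anchoring each proposal to a mean of 1500 pins down the translation invariance of pairwise preferences: only rating gaps affect the likelihood, so the absolute scale has to be fixed by convention.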
Snapshots were taken at six checkpoints (540, 734, 890, 1,000, 1,600, 2,000 votes) so we could watch the leaderboard stabilize. The final ELO column is the n=2,000 snapshot; ± is the standard deviation across those six snapshots — a stability indicator, not a credible interval.
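The ± column then reduces to one line over those checkpoints; the numbers below are invented for illustration.

```python
import numpy as np

# One model's ELO at the six checkpoints; values invented for illustration.
snapshots = np.array([1472.0, 1466.0, 1458.0, 1463.0, 1461.0, 1460.0])
plus_minus = snapshots.std()   # the leaderboard's ± (stability, not a CI)
final_elo = snapshots[-1]      # the final-snapshot score is what's published
```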
Higher ELO = the community picked this model more often, weighted by the strength of the opponents it beat. ELO is engagement-driven — it measures what readers enjoyed, not whether the model was technically correct.
A small ± means the score didn't wobble across snapshots (Gemma 4 26B was ±44, the most stable in the round). A large ± means the model's standing moved as more votes came in.
Engagement is one leaderboard; reliability is another. A model can lead community ELO while ranking mid-pack on the failure-mode rubric — Round 01 produced exactly that split (Gemma #1 community, #6 reliability).
Single-turn voting captures the opening of a scene, not the long arc. Multi-Turn measures the full session.
Multi-Turn
Each model ran a fixed battery of 12-turn scenes against a curated set of seed prompts. The same seed-character-scene combination was given to every model so the comparison stays apples-to-apples.
Voters compared two models' full sessions side-by-side — both 12-turn transcripts visible at once — and picked the one they preferred as a writing partner across the whole arc, not just one reply.
Round 01 ran 336 sessions across the 11-model pool. Sample sizes per model varied (12–20 sessions) depending on how late each model joined the round.
Same Bayesian ELO aggregator as the single-turn arena, but the unit of comparison is a full session rather than a single response. Catch-pair filtering is applied identically.
Multi-turn ELO can disagree with single-turn ELO by 30+ points on the same model. That gap is signal: a model can dominate openers (engaging hooks) and lose long sessions (degradation), or the inverse.
Higher multi-turn ELO = the model held up across a whole conversation. Pair this with single-turn ELO to detect models whose strengths are concentrated in their openings.
Sonnet 4.5's multi-turn ELO of 1542 is the highest in our pool, +36 over its single-turn 1506, a strong indicator that long-session quality is its actual edge.
12 turns is a short scene by RP standards. Actual long-session behavior (50+ turns) is captured better by the F13 context-attention probe in the Adversarial test.
Voter fatigue is real on 12-turn comparisons. Session counts per model are smaller than single-turn vote counts, so multi-turn ELO has wider credible intervals.
Rubric
Anthropic's Sonnet 4 acted as the LLM judge, scoring each model 1–5 across eleven writing-quality axes per session. Twenty 12-turn sessions per model fed the per-axis means.
Six of the eleven axes are session-level — Consistency, Degradation Resistance, Momentum, Adaptive Responsiveness, Agency Respect, Temporal Reasoning. The remaining five are quality-control — Anti-Purple-Prose, Anti-Repetition, Show-Don't-Tell, Subtext, Pacing.
Sonnet returned a numeric score plus a written rationale per axis. The rationale is logged in the upstream JSON; only the numeric mean per model per axis is surfaced on this site.
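A rough sketch of that aggregation, assuming one JSON record per model × axis × session; the field names here are ours, not the repo's.

```python
import json
from collections import defaultdict

def per_axis_means(judge_files):
    """Collapse per-session judge output into per-(model, axis) means.

    Assumes records shaped like {"model": ..., "axis": ..., "score": 1-5,
    "rationale": "..."}; the rationale stays in the upstream JSON and is
    dropped here, matching what the site surfaces.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for path in judge_files:
        with open(path) as f:
            for rec in json.load(f):
                t = totals[(rec["model"], rec["axis"])]
                t[0] += rec["score"]
                t[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}
```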
Per-axis means are computed across all sessions a model ran. The leaderboard's Rubric Score column averages whichever subset of axes the user has selected via the chip strip — no chips selected means all 11 weighted equally.
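The live re-average is a few lines; a sketch (the function name and data shapes are illustrative).

```python
def rubric_score(axis_means: dict[str, float], selected: set[str]) -> float:
    """Average one model's per-axis means over the chip selection.

    axis_means: e.g. {"Consistency": 4.3, "Momentum": 3.9, ...}
    selected:   axes toggled on; empty set = all 11 axes, equal weight.
    """
    axes = selected or set(axis_means)
    return sum(axis_means[a] for a in axes) / len(axes)
```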
Score interpretation: 1 = catastrophic failure on this axis, 3 = mid-pack, 5 = perfect. In practice almost all model means cluster between 3.2 and 4.7; the low end marks the weakest models in this field, not an absolute floor for the axis.
Higher = the LLM judge consistently saw the model handling that writing-quality axis well across normal scenes. The rubric measures holistic writing quality, not the model's response to adversarial probes — those live in the Adversarial test.
Click axis chips in the leaderboard's Rubric tab to weight only the axes you care about. Selected chips bold the corresponding cells; unselected dim. The Score column re-averages live and the table re-ranks live.
An LLM judge has its own biases: Sonnet 4 is from the same model family as Sonnet 4.5, one of the models being scored. Our `analyze_method_correlations.py` measures judge-human agreement across the round; the current judge-human disagreement rate is roughly 47%.
The full rubric defines 24 dimensions across three tiers (`scoring_rubric_v2.md`). Round 01 only scored 11 of them per session; the remaining 13 are aspirational and may surface in later rounds.
Adversarial
Adversarial probes are scenes specifically engineered to elicit a particular failure mode. Unlike the Rubric test, which judges normal sessions, this test judges sessions designed to break the model.
Six dimensions tested in Round 01: F1 (Agency hijacking — does the model write your character's actions for you?), F2 (POV / tense drift), F3 (Lore contradiction), F8 (Narrative momentum stall), F12 (Instruction drift), F13 (Long-context attention loss). The full F1–F13 taxonomy is defined in `judge_per_turn_failures.py`.
For each axis, Sonnet 4 grades the model's response on whether the targeted failure occurred. Each model ran ~20 sessions per axis with prompts engineered to maximize the chance of the failure.
Each axis scored 1–5: 1 = full failure on the probe, 5 = handled the probe cleanly. Per-axis means are computed across the ~20 sessions per axis.
The headline 'Avg Rank' on the leaderboard is the model's average position across the six axes within Levi's broader pool of ~20 models. Lower = better.
The model's `reliabilityRank` (1–11) is the within-our-11-pool ranking by the same metric. Sonnet 4.5 is #1 of 11, GPT-4.1 is #3, Llama 4 Maverick is #11.
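Both rank columns fall out of the same reduction. A minimal sketch, assuming rank 1 goes to the highest per-axis mean and ignoring ties (the data shape is ours, not the repo's).

```python
def avg_rank(per_axis_means: dict[str, dict[str, float]], model: str) -> float:
    """Average a model's rank position across the adversarial axes.

    per_axis_means: {axis: {model: mean score}}; rank 1 = highest mean.
    Run against the ~20-model pool for Avg Rank, or against the 11-model
    pool for reliabilityRank. Lower is better.
    """
    ranks = []
    for scores in per_axis_means.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        ranks.append(ordered.index(model) + 1)
    return sum(ranks) / len(ranks)
```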
Per-axis cells are color-coded: rose for means ≥ 4.4 (handled cleanly), neutral for ≥ 4.0 (typical), amber for ≥ 3.5 (lower-mid), and red tint below 3.5 (the structural breakdown floor: the model has a real seam on that axis).
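Those thresholds map directly onto a tint function; a sketch (the function name is ours).

```python
def cell_tone(mean: float) -> str:
    """Map a per-axis adversarial mean onto the leaderboard cell tint."""
    if mean >= 4.4:
        return "rose"      # handled cleanly
    if mean >= 4.0:
        return "neutral"   # typical
    if mean >= 3.5:
        return "amber"     # lower-mid
    return "red-tint"      # structural breakdown floor
```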
Compare adversarial scores to rubric scores on the same-named axes (F1 vs S.5 agency, F8 vs S.3 momentum). Big gaps between probe and holistic scoring tell you whether a model handles normal scenes well but breaks under pressure.
Probe design matters. A probe that's too aggressive will fail every model; a probe that's too easy passes everyone. Levi's targets are calibrated against pilot runs — see `analyze_failure_target_validation.py` for validation evidence.
We score 6 of 13 defined failure modes; the rest were not scored in Round 01 due to time. F4–F7, F9–F11 are tracked in the repository for future rounds.
Cost · Latency
Cost figures use the public provider pricing in effect at round close (April 2026). Each value is a blended input + output rate per 1,000 tokens, computed against the actual input/output ratio observed across the multi-turn sessions for that model.
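The blend is a weighted average of the provider's input and output rates, weighted by the observed token split. A sketch with invented prices:

```python
def blended_rate(input_usd_per_1k, output_usd_per_1k, input_share):
    """Blend provider prices into one $/1k-token figure using the observed
    input/output token split (input_share = input tokens / total tokens)."""
    return input_usd_per_1k * input_share + output_usd_per_1k * (1 - input_share)

# A model priced at $1.00 in / $5.00 out whose sessions ran 70% input
# tokens blends to 1.00*0.7 + 5.00*0.3 = $2.20 per 1k tokens.
blended_rate(1.00, 5.00, 0.70)  # -> 2.2
```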
Latency is wall-clock time from request to fully-streamed response, measured during the multi-turn battery and summarized as a per-model median.
There is no aggregate score. The leaderboard's Cost · Latency tab presents the raw $ per 1k tokens and median latency directly; the Pareto front is the set of models that no other model beats on both axes at once (nothing in the pool is simultaneously cheaper and faster).
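Concretely, with lower-is-better on both axes, the front can be computed like this (the data shape is illustrative):

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return models no rival beats on both axes at once.

    models: {name: (usd_per_1k, median_latency_ms)}, lower better on both.
    """
    front = []
    for name, (cost, lat) in models.items():
        dominated = any(
            c <= cost and t <= lat and (c < cost or t < lat)
            for other, (c, t) in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front
```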
Models are sorted ascending by cost when this tab is active.
Lower price + lower latency = better. Sub-300ms feels responsive; 400ms+ is noticeable lag in a chat interface. Costs across the round span $0.08 (Gemma) to $3.00 (Sonnet 4.5) per 1k tokens, a nearly 40-fold spread.
The Cost tab also surfaces context window and access mode, since both factor into 'is this model deployable for me?' Locally-friendly models (≤27B params) bypass the cost question entirely if you run your own GPU.
Provider pricing changes. Latency depends on geography, time of day, and load. Numbers are a snapshot; rerun for production deployment.
Speed is not quality. The fastest models in this pool (Gemini 2.5 Flash, Grok 4.1) are also among the most failure-prone on adversarial probes — the Pareto front of $ × ms doesn't include the quality axis.
Calibration
Voter quality control. Roughly 1 in 8 ballots is a 'catch pair' — the exact same prompt and response pair shown to the same voter twice in a session, identical except for left/right shuffling. A reader who's actually paying attention picks the same response both times.
Round 01 fielded 335 voters across 1,857 ballots. 75% of voters were consistent on their catch pairs and contributed full weight to the leaderboard; the rest were down-weighted but not excluded outright.
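A sketch of that down-weighting pass, assuming each ballot records the chosen response's identity (side-agnostic, since catch pairs only shuffle left/right); the 0.25 factor is invented, and the real weighting lives in the analyzer scripts.

```python
from collections import defaultdict

def voter_weights(ballots, full_weight=1.0, down_weight=0.25):
    """Weight voters by catch-pair consistency.

    ballots: (voter_id, pair_id, chosen_response_id) tuples, where pair_id
    identifies the prompt/response pair and the chosen id is side-agnostic.
    A voter who picks different responses on two showings of the same pair
    fails that catch pair and is down-weighted.
    """
    picks = defaultdict(set)                 # (voter, pair) -> distinct choices
    for voter, pair, choice in ballots:
        picks[(voter, pair)].add(choice)
    flipped = {voter for (voter, _), chosen in picks.items() if len(chosen) > 1}
    voters = {voter for voter, _, _ in ballots}
    return {v: (down_weight if v in flipped else full_weight) for v in voters}
```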
Calibration is one global number per round, not a per-model score. There is no leaderboard tab for it on this site — the pass rate surfaces in the Coverage block at the bottom of the leaderboard alongside vote counts.
A pass rate below ~60% would invalidate the round; 75% is healthy by community-arena standards.
Calibration tells you how seriously to trust the community ELO at all. High pass rate = the votes were thoughtful; low pass rate = the leaderboard is mostly noise.
Our `analyze_voter_quality.py` also reports per-voter consistency — useful for spotting brigading patterns, though Round 01 didn't surface any.
Catch pairs only catch random clickers, not motivated bias. A voter who consistently down-votes one model would pass catch-pair checks while still skewing the result.
The 75% pass rate is a population statistic. Individual reader behavior varies; if you cite a number from this benchmark, cite the catch-pair pass rate alongside it.
The 11 axes — plain language
Six session-level axes and five quality-control axes, scored 1–5 by Sonnet 4 across twenty 12-turn sessions per model. The full 24-axis rubric definition (with score-1-to-5 descriptions per axis) lives in the open-source repository at `analysis/scoring_rubric_v2.md`. These eleven are the subset Round 01 actually scored per session.
PlotPoints is designed, built, and maintained in-house by Levi at RoleCall Studios LLC (the company behind Plotlight). The test harness, judge prompts, scoring rubric, analyzer scripts, and raw vote dumps are all his work, released open-source so the community can verify, cite, and reproduce every result. Credit Levi and/or RoleCall Studios when you redistribute or rerun any of it.