PlotPoints · Round 01 · April 2026

Methodology

Six tests. Eleven models. Two thousand human votes. One open methodology — published with every output.

▌ Provider

Every model in every test runs through OpenRouter. This reduces the variables different providers might introduce and, in the case of local models, removes the many hidden variables that come from different machines running the same model very differently. We chose the most stable, best-known provider we could find.

Every score on the leaderboard, every rank in a model profile, every floor flagged in a verdict comes out of the same test harness, judge pipeline, and analyzer scripts. This page is the spec for all of it.

Each section below explains how one of the six tests is conducted, how a model gets a number out of it, and how to read the leaderboard cells without having to read source code. Every claim is anchored to a specific file in the open-source repository. If a number on the leaderboard surprises you, follow the source citation back to the upstream JSON — the data is open and verifiable.

Section 01 · Pipeline

The six tests, at a glance

Round 01 ran six distinct tests against eleven models. Each test measures a different facet of roleplay quality — community engagement, multi-turn coherence, holistic writing quality, resistance to adversarial breakdowns, cost / speed, and voter quality control. A model that wins one test can lose another; the methodology surfaces those splits rather than collapsing them into a single score.

01 ·
Arena ELO
Single-turn pairwise blind voting. The community picks which of two anonymous responses they prefer. 1,857 votes from 335 readers across 271 model pairs.
02 ·
Multi-Turn
Same blind pairwise voting, but on full 12-turn scenes. Tests whether a model holds up across a whole conversation, not just one reply.
03 ·
Rubric
A Sonnet 4 LLM judge scores each model 1–5 across eleven writing-quality axes — six session-level (consistency, momentum, agency, etc.) and five quality-control (purple-prose avoidance, repetition, show-don't-tell, subtext, pacing).
04 ·
Adversarial
Adversarial probes designed to elicit specific failure modes — agency hijacking, POV drift, lore breaks, momentum stalls, instruction drift, long-context attention loss. Sonnet 4 grades each axis 1–5.
05 ·
Cost · Latency
Price per 1k tokens plotted against median per-turn latency. Cost is blended input + output at provider rates as of round close; latency is wall-clock time across the multi-turn battery.
06 ·
Calibration
Quality control on the votes themselves. Roughly 1 in every 8 ballots is a duplicate-pair catch designed to detect random clickers. Round 01's overall pass rate was 75%.
Section 02 · Test

Arena ELO

Setup

Single-turn pairwise blind voting. A reader is shown two anonymous model responses to the same prompt and asked to pick which one they preferred. Pairings are randomized; the model identities are hidden during the vote and only revealed after the ballot is submitted.

Round 01 collected 1,857 votes from 335 readers across 271 model pairs. The pool of 11 models means each pair was sampled a median of 7 times; outliers were re-sampled until each pairing had enough power for a stable ELO estimate.

All catch-pair quality control (see Calibration) ran live during the round — duplicate-pair ballots were injected at roughly 1 in 8, and votes from readers with inconsistent catch-pair behavior were down-weighted in the final aggregation.

How it's scored

Each pairwise vote contributes to a Bayesian ELO update. The aggregator uses a logistic preference model with a non-informative prior, then runs MCMC to produce posterior means and 95% credible intervals per model.
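
For readers who want the shape of that aggregation without opening the script, here is a minimal sketch: a Bradley-Terry logistic preference model with a weak Normal prior, sampled with a random-walk Metropolis step and mapped onto the familiar ELO scale. The function names, prior width, proposal step, and 1500-centered conversion are illustrative assumptions; the production sampler in analyze_bayesian_arena_elo.py differs in its details.

```python
import numpy as np

def sample_bayesian_elo(votes, n_models, n_steps=20_000, burn_in=5_000, seed=0):
    """Toy Bradley-Terry sampler (illustrative, not the production script).

    votes: list of (winner_idx, loser_idx) tuples from blind ballots.
    Returns (posterior mean, posterior std) per model on an ELO-like scale.
    """
    rng = np.random.default_rng(seed)
    winners = np.array([w for w, _ in votes])
    losers = np.array([l for _, l in votes])

    def log_post(skill):
        prior = -0.5 * np.sum(skill ** 2)               # weak Normal(0, 1) prior (assumption)
        diff = skill[winners] - skill[losers]           # logistic preference model
        return prior - np.sum(np.log1p(np.exp(-diff)))  # sum of log sigmoid(diff)

    skill = np.zeros(n_models)
    current = log_post(skill)
    draws = []
    for step in range(n_steps):
        proposal = skill + rng.normal(0.0, 0.05, size=n_models)
        proposal -= proposal.mean()                      # zero-center for identifiability
        candidate = log_post(proposal)
        if np.log(rng.uniform()) < candidate - current:  # Metropolis accept/reject
            skill, current = proposal, candidate
        if step >= burn_in:
            draws.append(skill.copy())

    draws = np.array(draws)
    elo = 1500 + draws * 400 / np.log(10)                # map log-odds skill to an ELO scale
    return elo.mean(axis=0), elo.std(axis=0)

# Example: three models, four ballots (model 0 beats 1 twice, beats 2 once, 1 beats 2 once).
means, stds = sample_bayesian_elo([(0, 1), (0, 2), (1, 2), (0, 1)], n_models=3)
```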

Snapshots were taken at six checkpoints (540, 734, 890, 1,000, 1,600, 2,000 votes) so we could watch the leaderboard stabilize. The final ELO column is the n=2,000 snapshot; ± is the standard deviation across those six snapshots — a stability indicator, not a credible interval.

How to read it

Higher ELO = the community picked this model more often, weighted by the strength of the opponents it beat. ELO is engagement-driven — it measures what readers enjoyed, not whether the model was technically correct.

A small ± means the score didn't wobble across snapshots (Gemma 4 26B was ±44, the most stable in the round). A large ± means the model's standing moved as more votes came in.

Caveats

Engagement is one leaderboard; reliability is another. A model can lead community ELO while ranking mid-pack on the failure-mode rubric — Round 01 produced exactly that split (Gemma #1 community, #6 reliability).

Single-turn voting captures the opening of a scene, not the long arc. Multi-Turn measures the full session.

SOURCE · analyze_bayesian_arena_elo.py + results/community_arena_2000.json
Section 03 · Test

Multi-Turn

Setup

Each model ran a fixed battery of 12-turn scenes against a curated set of seed prompts. The same seed-character-scene combination was given to every model so the comparison stays apples-to-apples.

Voters compared two models' full sessions side-by-side — both 12-turn transcripts visible at once — and picked the one they preferred as a writing partner across the whole arc, not just one reply.

Round 01 ran 336 sessions across the 11-model pool. Sample sizes per model varied (12–20 sessions) depending on how late each model joined the round.

How it's scored

Same Bayesian ELO aggregator as the single-turn arena, but the unit of comparison is a full session rather than a single response. Catch-pair filtering is applied identically.

Multi-turn ELO can disagree with single-turn ELO by 30+ points on the same model. That gap is signal: a model can dominate openers (engaging hooks) and lose long sessions (degradation), or the inverse.

How to read it

Higher multi-turn ELO = the model held up across a whole conversation. Pair this with single-turn ELO to detect models whose strengths are concentrated in their openings.

Sonnet 4.5's multi-turn ELO of 1542 is the highest in our pool, +36 over its single-turn 1506 — strong indicator that long-session quality is its actual edge.

Caveats

12 turns is a short scene by RP standards. Actual long-session behavior (50+ turns) is captured better by the F13 context-attention probe in the Adversarial test.

Voter fatigue is real on 12-turn comparisons. Sessions per model are smaller than single-turn vote counts, so multi-turn ELO has wider credible intervals.

SOURCE · analyze_pairwise_elo.py + results/multiturn_merged_all_v2.json
Section 04 · Test

Rubric

Setup

Anthropic's Sonnet 4 acted as the LLM-judge, scoring each model 1–5 across eleven writing-quality axes per session. Twenty 12-turn sessions per model fed the per-axis means.

Six of the eleven axes are session-level — Consistency, Degradation Resistance, Momentum, Adaptive Responsiveness, Agency Respect, Temporal Reasoning. The remaining five are quality-control — Anti-Purple-Prose, Anti-Repetition, Show-Don't-Tell, Subtext, Pacing.

Sonnet returned a numeric score plus a written rationale per axis. The rationale is logged in the upstream JSON; only the numeric mean per model per axis is surfaced on this site.

How it's scored

Per-axis means are computed across all sessions a model ran. The leaderboard's Rubric Score column averages whichever subset of axes the user has selected via the chip strip — no chips selected means all 11 weighted equally.
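
Concretely, the chip re-averaging reduces to a mean over the selected axis columns. The sketch below is a hypothetical helper, not the site's frontend code; axis keys follow the S.x / 2.x naming used in the table.

```python
def rubric_score(axis_means: dict[str, float], selected: set[str] | None = None) -> float:
    """Average the per-axis judge means over the selected chips.

    axis_means: e.g. {"S.1": 4.3, "S.3": 3.9, "2.1": 4.6, ...}
    selected:   the chip subset; None or empty means all axes, equally weighted.
    """
    axes = list(selected) if selected else list(axis_means)
    return sum(axis_means[a] for a in axes) / len(axes)

# Example: weight only momentum and agency respect.
score = rubric_score({"S.3": 3.9, "S.5": 4.2, "2.1": 4.6}, selected={"S.3", "S.5"})
# score == 4.05
```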

Score interpretation: 1 = catastrophic failure on this axis, 3 = mid-pack, 5 = perfect. In practice almost all model means cluster between 3.2 and 4.7 — the low end is the floor of the field, not an absolute floor.

How to read it

Higher = the LLM judge consistently saw the model handling that writing-quality axis well across normal scenes. The rubric measures holistic writing quality, not the model's response to adversarial probes — those live in the Adversarial test.

Click axis chips in the leaderboard's Rubric tab to weight only the axes you care about. Selected chips bold the corresponding cells; unselected dim. The Score column re-averages live and the table re-ranks live.

Caveats

An LLM judge has its own biases — Sonnet 4 is from the same model family as Sonnet 4.5, one of the models being scored. Our `analyze_method_correlations.py` measures judge-human agreement across the round; the judge-human disagreement rate in Round 01 was roughly 47%.

The full rubric defines 24 dimensions across three tiers (`scoring_rubric_v2.md`). Round 01 only scored 11 of them per session; the remaining 13 are aspirational and may surface in later rounds.

SOURCE · analyze_combined.py + results/multiturn_merged_all_v2.json + analysis/scoring_rubric_v2.md
Section 05 · Test

Adversarial

Setup

Adversarial probes are scenes specifically engineered to elicit a particular failure mode. This differs from the Rubric test: that one judges normal sessions; this one judges sessions designed to break the model.

Six dimensions tested in Round 01: F1 (Agency hijacking — does the model write your character's actions for you?), F2 (POV / tense drift), F3 (Lore contradiction), F8 (Narrative momentum stall), F12 (Instruction drift), F13 (Long-context attention loss). The full F1–F13 taxonomy is defined in `judge_per_turn_failures.py`.

For each axis, Sonnet 4 grades the model's response on whether the targeted failure occurred. Each model ran ~20 sessions per axis with prompts engineered to maximize the chance of the failure.

How it's scored

Each axis scored 1–5: 1 = full failure on the probe, 5 = handled the probe cleanly. Per-axis means are computed across the ~20 sessions per axis.

The headline 'Avg Rank' on the leaderboard is the model's average position across the six axes within Levi's broader pool of ~20 models. Lower = better.

The model's `reliabilityRank` (1–11) is the within-our-11-pool ranking by the same metric. Sonnet 4.5 is #1 of 11, GPT-4.1 is #3, Llama 4 Maverick is #11.
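
A minimal sketch of how an average rank can be derived from per-axis means, assuming simple ordinal ranking with higher score = better; ties and the exact tie-breaking in analyze_adversarial.py are not reproduced here.

```python
def average_rank(per_axis_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """per_axis_scores: {axis: {model: mean 1-5 score}}.
    Rank 1 = best score on an axis; returns each model's mean rank across axes.
    """
    ranks: dict[str, list[int]] = {}
    for axis, scores in per_axis_scores.items():
        ordered = sorted(scores, key=scores.get, reverse=True)  # best score first
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: sum(r) / len(r) for model, r in ranks.items()}

# Example with two axes and three models (illustrative numbers only).
avg = average_rank({
    "F1": {"sonnet-4.5": 4.8, "gpt-4.1": 4.5, "llama-4-maverick": 3.2},
    "F8": {"sonnet-4.5": 4.6, "gpt-4.1": 4.1, "llama-4-maverick": 3.6},
})
# avg == {"sonnet-4.5": 1.0, "gpt-4.1": 2.0, "llama-4-maverick": 3.0}
```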

How to read it

Per-axis cells: rose ≥ 4.4 (handled cleanly), neutral ≥ 4.0 (typical), amber ≥ 3.5 (lower-mid), red-tint < 3.5 (structural breakdown floor — the model has a real seam on that axis).
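
Those thresholds map directly onto a band per cell; a hypothetical helper (names ours, cut-offs from the paragraph above):

```python
def cell_band(score: float) -> str:
    """Map a 1-5 adversarial axis mean onto the leaderboard's color band."""
    if score >= 4.4:
        return "rose"      # handled cleanly
    if score >= 4.0:
        return "neutral"   # typical
    if score >= 3.5:
        return "amber"     # lower-mid
    return "red-tint"      # structural breakdown floor
```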

Compare adversarial scores to rubric scores on the same-named axes (F1 vs S.5 agency, F8 vs S.3 momentum). Big gaps between probe and holistic scoring tell you whether a model handles normal scenes well but breaks under pressure.

Caveats

Probe design matters. A probe that's too aggressive will fail every model; a probe that's too easy passes everyone. Levi's targets are calibrated against pilot runs — see `analyze_failure_target_validation.py` for validation evidence.

We score 6 of 13 defined failure modes; the rest were not scored in Round 01 due to time. F4–F7, F9–F11 are tracked in the repository for future rounds.

SOURCE · judge_per_turn_failures.py + analyze_adversarial.py + results/per_turn_failures.jsonl
Section 06 · Test

Cost · Latency

Setup

Cost figures use the public provider pricing in effect at round close (April 2026). Each value is a blended input + output rate per 1,000 tokens, computed against the actual input/output ratio observed across the multi-turn sessions for that model.
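
A sketch of that blending, assuming per-1k input and output prices and the observed token mix for a model; the function and its arguments are ours, and the numbers in the example are illustrative.

```python
def blended_cost_per_1k(input_price_per_1k: float,
                        output_price_per_1k: float,
                        input_tokens: int,
                        output_tokens: int) -> float:
    """Blend input and output prices using the observed input/output token mix."""
    total = input_tokens + output_tokens
    input_share = input_tokens / total
    return input_share * input_price_per_1k + (1 - input_share) * output_price_per_1k

# Illustrative only: $3.00 / 1k in, $15.00 / 1k out, with a 4:1 observed input:output mix.
rate = blended_cost_per_1k(3.00, 15.00, input_tokens=400_000, output_tokens=100_000)
# rate == 5.40 ($ per 1k tokens)
```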

Latency is wall-clock time from request to fully-streamed response, measured during the multi-turn battery and summarized as a per-model median.

How it's scored

There is no aggregate score. The leaderboard's Cost · Latency tab presents the raw $ per 1k tokens and median latency directly; the Pareto front is the set of models not dominated on both axes, meaning no other model is simultaneously cheaper and faster.

Models are sorted ascending by cost when this tab is active.
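
For reference, the non-dominated (Pareto) filter on the two axes reduces to the sketch below, where lower is better on both; the function and field names are ours, not the analyzer's.

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """models: {name: (cost_per_1k, median_latency_ms)}, lower is better on both.
    Returns the names no other model beats on both axes at once."""
    front = []
    for name, (cost, latency) in models.items():
        dominated = any(
            other_cost <= cost and other_lat <= latency
            and (other_cost < cost or other_lat < latency)
            for other, (other_cost, other_lat) in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Illustrative numbers only: "a" (cheap, fast) and "c" (pricey, fastest) sit on the
# front; "b" is dominated by "a" on both axes.
front = pareto_front({"a": (0.08, 250), "b": (0.20, 300), "c": (3.00, 220)})
# front == ["a", "c"]
```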

How to read it

Lower price + lower latency = better. Sub-300ms feels responsive; 400ms+ is noticeable lag in a chat interface. Costs across the round span $0.08 (Gemma) to $3.00 (Sonnet 4.5) per 1k tokens — a spread of nearly 40×.

The Cost tab also surfaces context window and access mode, since both factor into 'is this model deployable for me?' Locally-friendly models (≤27B params) bypass the cost question entirely if you run your own GPU.

Caveats

Provider pricing changes. Latency depends on geography, time of day, and load. Numbers are a snapshot; rerun for production deployment.

Speed is not quality. The fastest models in this pool (Gemini 2.5 Flash, Grok 4.1) are also among the most failure-prone on adversarial probes — the Pareto front of $ × ms doesn't include the quality axis.

SOURCE · analyze_cost_efficiency.py + analyze_latency.py + results/cost_efficiency.json + results/latency_leaderboard.json
Section 07 · Test

Calibration

Setup

Voter quality control. Roughly 1 in 8 ballots is a 'catch pair' — the exact same prompt and response pair shown to the same voter twice in a session, identical except for left/right shuffling. A reader who's actually paying attention picks the same response both times.

Round 01 fielded 335 voters across 1,857 ballots. 75% of voters were consistent on their catch pairs and contributed full weight to the leaderboard; the rest were down-weighted but not excluded outright.

How it's scored

Calibration is one global number per round, not a per-model score. There is no leaderboard tab for it on this site — the pass rate surfaces in the Coverage block at the bottom of the leaderboard alongside vote counts.

A pass rate below ~60% would invalidate the round; 75% is healthy by community-arena standards.
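
A sketch of how per-voter catch-pair consistency could translate into ballot weights and a round-level pass rate; the threshold and down-weighting factor here are assumptions for illustration, and the real logic lives in analyze_voter_quality.py.

```python
def voter_weights(catch_results: dict[str, list[bool]],
                  full_weight_threshold: float = 1.0,
                  downweight: float = 0.5) -> dict[str, float]:
    """catch_results: {voter_id: [True if the voter picked the same response both
    times a catch pair was shown, else False]}.
    Voters consistent on every catch pair keep full weight; the rest are
    down-weighted rather than dropped (threshold and factor are assumptions).
    """
    weights = {}
    for voter, checks in catch_results.items():
        pass_rate = sum(checks) / len(checks) if checks else 1.0
        weights[voter] = 1.0 if pass_rate >= full_weight_threshold else downweight
    return weights

def round_pass_rate(catch_results: dict[str, list[bool]]) -> float:
    """Share of voters who passed every catch pair they were shown."""
    voters = [checks for checks in catch_results.values() if checks]
    return sum(all(checks) for checks in voters) / len(voters)
```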

How to read it

Calibration tells you how seriously to trust the community ELO at all. High pass rate = the votes were thoughtful; low pass rate = the leaderboard is mostly noise.

Our `analyze_voter_quality.py` also reports per-voter consistency — useful for spotting brigading patterns, though Round 01 didn't surface any.

Caveats

Catch pairs only catch random clickers, not motivated bias. A voter who consistently down-votes one model would pass catch-pair checks while still skewing the result.

75% pass rate is a population statistic. Individual reader behavior varies; if you cite a number from this benchmark, cite the catch-pair pass rate alongside it.

SOURCE · analyze_voter_quality.py + analyze_seed_discrimination.py
Section 08 · Rubric

The 11 axes — plain language

Six session-level axes and five quality-control axes, scored 1–5 by Sonnet 4 across twenty 12-turn sessions per model. The full 24-axis rubric definition (with score-1-to-5 descriptions per axis) lives in the open-source repository at analysis/scoring_rubric_v2.md. These eleven are the subset Round 01 actually scored per session.

Tier 1 · Session Dimensions
S.1 Consistency Over Time
Holds character voice over time
source · S.1_consistency_over_time
Does the model keep the character's voice, mannerisms, and worldview stable across all 12 turns? A model that subtly drifts into a generic narrator voice loses points here.
S.2 Degradation Resistance
Doesn't deteriorate across the scene
source · S.2_degradation_resistance
Does writing quality hold across the whole 12-turn arc, or does the model start strong and run out of ideas by turn 8? Tracks whether late-session output matches early-session quality.
S.3 Narrative Momentum
Pushes the scene forward
source · S.3_narrative_momentum
When the user gives the model an open lane, does it move the scene forward or stall waiting for the user to drive? High scores go to models that contribute plot beats and stakes without over-stepping.
S.4 Adaptive Responsiveness
Responds to the user's tonal shifts
source · S.4_adaptive_responsiveness
If the user goes silly, does the model match it? If the user pivots to vulnerable, does the model reset tone? Measures emotional and stylistic flexibility within a single session.
S.5 Agency Respect (session-level)
Doesn't hijack the user's character
source · S.5_agency_respect_session
Holistic version of the F1 adversarial probe. Does the model leave space for the user's character to act, feel, and decide — or does it pre-empt them with sentences like 'You smile and take his hand'?
S.6 Temporal Reasoning
Tracks scene timeline coherently
source · S.6_temporal_reasoning
Does the model remember when things happened in the scene? An injury from turn 4 should still affect turn 11. A character who left the room shouldn't show up answering a question.
Tier 2 · Quality Control
2.1 Anti-Purple-Prose
Avoids overwrought metaphor
source · 2.1_anti_purple_prose
Penalizes ornate vocabulary used as filler — every sentence weighted with adjectives, every emotion a thunderclap, every glance lingering. Rewards prose that earns its weight.
2.2 Anti-Repetition
Doesn't recycle phrasing
source · 2.2_anti_repetition
Models that re-use the same sentence shape or stock phrases ("a flicker of X crossed her face") across turns lose points. The metric is unique-word ratio + n-gram repeat rate, judged in context (a minimal sketch of both metrics follows this axis list).
2.5 Show Don't Tell
Demonstrates instead of stating
source · 2.5_show_dont_tell
Penalizes 'She felt sad' constructions in favor of behavior that implies the same. Awareness of subtext, body language, and indirection. The single biggest separator between 'flat' and 'literary' RP output.
2.6 Subtext
Layered meaning under dialogue
source · 2.6_subtext
Does the model write characters whose speech implies more than it states? High scores require restraint — the response should leave room for the user to read between the lines instead of explaining the subtext outright.
2.7 Pacing
Controls scene tempo
source · 2.7_pacing
Does the model adjust line length, beat density, and reflective pause to match scene weight? Action beats short, contemplative beats long, transitional beats quick. Penalizes the all-paragraphs-the-same-length default.
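
As promised under 2.2 Anti-Repetition, here is a minimal sketch of the two surface metrics the judge considers in context: unique-word ratio and n-gram repeat rate. Whitespace tokenization and the bigram default are simplifying assumptions, not the judge prompt's actual procedure.

```python
from collections import Counter

def unique_word_ratio(text: str) -> float:
    """Distinct words / total words; lower values suggest recycled vocabulary."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def ngram_repeat_rate(text: str, n: int = 2) -> float:
    """Share of n-grams that appear more than once; higher = more repetition."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```
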
▌ Credits

PlotPoints is designed, built, and maintained in-house by Levi at RoleCall Studios LLC (the company behind Plotlight). The test harness, judge prompts, scoring rubric, analyzer scripts, and raw vote dumps are all his work — released open-source so the community can verify, cite, and reproduce every result. Credit Levi and/or RoleCall Studios in some way when you redistribute or rerun any of it.

Author
Levi (@LeviTheWeasel)
Studio
RoleCall Studios LLC
Background
M.S. Applied Mathematics, Moscow Power Engineering Institute (Technical University, 2018). Graduate research: hyper-parameter optimization in machine-learning algorithms.
License
CC-BY 4.0
GitHub repo →
HuggingFace dataset →
Methodology · Round 01 · RoleCall Studios LLC