Model Profile · Round 01

GPT-4.1

OpenAI · proprietary · 1M ctx

Community last in Round 01, top-5 in Round 02 multi-turn. The great inversion.

Composite Score
78.6
/100 · canonical
Arena ELO (R1)
1470
±44 · n=215
Multi-Turn ELO (R2)
1552
±100 · n=52
Reliability Rank
#6
avg 6.4
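The arena ELO figures above come from pairwise blind votes. As a minimal sketch of the standard logistic Elo update such a leaderboard typically runs per vote (the K-factor and scale here are assumptions, not the benchmark's published parameters):

```python
# Hypothetical sketch of a pairwise Elo update for an arena-style
# leaderboard. K=32 and the 400-point logistic scale are assumed
# defaults, not rp-bench's actual configuration.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one blind head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta
```

Starting everyone at a common baseline (often 1500) and replaying the n=215 community votes is what would produce a spread like the 1470 shown above; the ± band narrows as n grows.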
▌ Section 02 · The Lede

What this model is for.

GPT-4.1 finishes last on community ELO (#11) and third on the failure-rank board: the great Round 01 inversion. It ties for #1 on narrative momentum with MiniMax (F8 score 4.30) and sits in the top lore tier (4.30, alongside Gemma and Mistral, behind DeepSeek's 4.50). The community didn't enjoy writing with it, but the failure-mode rubric says it doesn't break. It's the pick when reliability matters more than vibes, and a useful test case for the engagement-vs-reliability split this entire benchmark exists to surface.

▌ Section 03 · At a Glance

Cross-test position

GPT-4.1 sits at #16 on Cost · Latency — the caveat to watch.

Composite
#5
Arena ELO
#11
Multi-Turn
#5
Rubric
#10
Adversarial
#6
Cost · Latency
#16
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Tied #1 on narrative momentum (4.30/5, with MiniMax). Failure-rank #3 in the pool. Top tier on lore consistency (tied with Gemma + Mistral at 4.30, behind DeepSeek's 4.50).
Weakness
Bottom community ELO (#11/11). High cost ($2/1M input tokens). Voters consistently rank it last on engagement despite the strong reliability profile.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
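The aggregation described above (mean a judge's 1–5 session scores per axis, then rank within the pool) can be sketched as follows; the model names and scores in the demo are illustrative, not rp-bench data:

```python
# Sketch of the per-axis aggregation: mean the judge's 1-5 scores
# across sessions, then compute a 1-based rank within the model pool.
# Demo values are illustrative placeholders, not benchmark data.

def axis_mean(session_scores: list[float]) -> float:
    """Mean judge score for one failure-mode axis across all sessions."""
    return sum(session_scores) / len(session_scores)

def pool_rank(means_by_model: dict[str, float], model: str) -> int:
    """1-based rank within the pool; higher mean = better rank."""
    ordered = sorted(means_by_model, key=means_by_model.get, reverse=True)
    return ordered.index(model) + 1

pool = {"model_a": 4.43, "model_b": 4.30, "model_c": 4.50}
rank_b = pool_rank(pool, "model_b")  # model_b trails both others -> rank 3
```

Each bar in the section below is one `axis_mean`; the `#N` in the right column is the corresponding `pool_rank`.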
F1 · Agency
Doesn't write your character's actions
4.38
/ 5
#10
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.23
/ 5
#7
F3 · Lore
Doesn't break worldbuilding
4.30
/ 5
#3
F8 · Momentum
Pushes scene forward when user goes passive
4.30
/ 5
#1
F12 · Instruction Drift
Keeps to the system prompt
4.30
/ 5
#8
F13 · Context Attention
Holds character cards 50+ turns deep
4.43
/ 5
#10
Last on the community board — third on the failure-rank board.
Round 01 verdict · The great inversion
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by the Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement
4.38/5
Tone Consistency
4.60/5
Collaboration
4.38/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
212
pop avg 265 · -20%
Unique-word ratio
0.688
pop avg 0.655 · +5%
Repetition score
0.031
pop avg 0.049 · -37%
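The three behavioral signals above can be sketched roughly as below. The benchmark's exact tokenization and repetition definition aren't published here; whitespace word splitting and a repeated-bigram fraction are plausible stand-ins, not the canonical implementation:

```python
# Rough sketch of the behavioral metrics, assuming whitespace
# tokenization. The repetition definition is an assumption: the
# fraction of word bigrams that are repeats of an earlier bigram.

def avg_words_per_turn(turns: list[str]) -> float:
    """Mean word count per model turn."""
    return sum(len(t.split()) for t in turns) / len(turns)

def unique_word_ratio(turns: list[str]) -> float:
    """Type-token ratio: distinct words over total words, case-folded."""
    words = [w.lower() for t in turns for w in t.split()]
    return len(set(words)) / len(words)

def repetition_score(turns: list[str]) -> float:
    """Fraction of repeated bigrams (higher = more repetitive prose)."""
    words = [w.lower() for t in turns for w in t.split()]
    bigrams = list(zip(words, words[1:]))
    return 1 - len(set(bigrams)) / len(bigrams)
```

On this sketch, GPT-4.1's profile reads as shorter turns than the pool (212 vs 265 words), a slightly richer vocabulary (0.688 vs 0.655), and markedly less self-repetition (0.031 vs 0.049).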
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score starts at 100 and loses points for deductions across 22 failure-mode flag types caught in adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
27.6
/ 100
▌ Score breakdown
Mean   27.6
Median   42.0
Fatal/sess   0.75
Major/sess   6.83
▌ Top flaws caught
purple_prose · recycled_description · agency_violation
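The "100 minus deductions" scoring can be sketched as below. The flag names come from this page; the per-severity weights are assumptions chosen for illustration, not the benchmark's actual deduction table:

```python
# Sketch of a deduction-based flaw score. Weights per severity are
# ASSUMED values for illustration; only the flag names and the
# "100 minus deductions, higher is cleaner" shape come from the page.

SEVERITY_WEIGHT = {"fatal": 15.0, "major": 5.0, "minor": 1.0}  # assumed

def flaw_hunter_score(flags: list[tuple[str, str]]) -> float:
    """flags: (flag_type, severity) pairs caught in one session.
    Returns 100 minus summed deductions, clamped at 0."""
    deductions = sum(SEVERITY_WEIGHT[sev] for _, sev in flags)
    return max(0.0, 100.0 - deductions)

session = [("purple_prose", "major"),
           ("recycled_description", "major"),
           ("agency_violation", "fatal")]
score = flaw_hunter_score(session)  # 100 - 5 - 5 - 15 = 75.0
```

A mean (27.6) well below the median (42.0) is the signature of this shape: most sessions score high, but a minority with fatal flags drag the average down hard.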
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
GPT-4.1 is the cleanest argument the round produces for why we run two leaderboards. The failure-mode rubric ranks it #3 — tied #1 on narrative momentum with MiniMax, top-tier on lore behind DeepSeek — and the community ranks it dead last. Both are real signals, measuring different things. Pick it when reliability matters more than vibes. Pick Sonnet 4.5 if cost isn't a constraint and you want both.
▌ Section 10 · Compare & Drill

Stack it against another model.

All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset
Profile · GPT-4.1 · Round 01