Model Profile · Round 01

Claude Opus 4.7

Anthropic · proprietary · 200K ctx
Champion

Top of the multi-turn pool. Top-1 on agency respect and instruction drift.

Composite Score: 97.6 / 100 · canonical
Arena ELO (R1): joined post-R01
Multi-Turn ELO (R2): 1627 ±109 · n=26
Reliability Rank: #1 · avg 2.8
▌ Section 02 · The Lede

What this model is for.

Anthropic's Opus 4.7 dropped onto the multi-turn leaderboard at #1 in Round 02 with 26 votes against the rest of the 20-model field. It also leads two of our six adversarial axes outright — F1 agency at 4.60/5, the highest in the pool, and F12 instruction drift at 4.57/5, also the highest. The catch is the price tag: $39 per million tokens, the same as Opus 4.6 and an order of magnitude over Sonnet 4.5. If you're building a roleplay product where the model holds your character card without flinching and never writes your reactions for you, this is the strict pick. Anywhere price is a constraint, Sonnet 4.5 at 4.50 on F1 agency is the budget read.

▌ Section 03 · At a Glance

Cross-test position

Claude Opus 4.7 holds #1 in Composite, Multi-Turn, Rubric, and Adversarial. It sits at #19 on Cost · Latency, which is the caveat to watch.

Composite: #1
Arena ELO: no rank (joined post-R01)
Multi-Turn: #1
Rubric: #1
Adversarial: #1
Cost · Latency: #19
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Top-1 on agency respect (4.60/5) and instruction drift (4.57/5) across the 20-model pool — both adversarial axes that test whether the model will let the user drive.
Weakness
$39/1M tokens, tied with Opus 4.6 for most expensive in the pool. n=26 votes is the smallest sample in Round 02; confidence interval is wide.
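To make the wide interval concrete, here is a minimal sketch of what ±109 around 1627 could mean in head-to-head terms. It assumes the leaderboard uses the standard Elo logistic curve on a 400-point scale, which this page does not state, and it compares against a hypothetical 1500-rated opponent that is not a model from the pool.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B (assumed formula, not confirmed by the page)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def interval(rating: float, margin: float) -> tuple[float, float]:
    """Return the (low, high) bounds implied by a symmetric +/- margin."""
    return (rating - margin, rating + margin)


RATING = 1627    # Multi-Turn ELO (R2) from this profile
MARGIN = 109     # reported +/- at n=26
OPPONENT = 1500  # hypothetical opponent rating, for illustration only

low, high = interval(RATING, MARGIN)
for label, rating in (("low", low), ("point", RATING), ("high", high)):
    share = expected_score(rating, OPPONENT)
    print(f"{label:>5}: rating {rating:.0f}, expected score vs {OPPONENT}: {share:.2f}")
```

Under these assumptions, the edge over such an opponent is slim at the low end of the interval and decisive at the high end, which is why the small sample matters.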
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6. F3 · Lore and F8 · Momentum are not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency: Doesn't write your character's actions · 4.60 / 5 · #1
F2 · POV / Tense: Holds 2nd-person, present-tense narration · 4.43 / 5 · #2
F3 · Lore: not yet run on this model
F8 · Momentum: not yet run on this model
F12 · Instruction Drift: Keeps to the system prompt · 4.57 / 5 · #1
F13 · Context Attention: Holds character cards 50+ turns deep · 4.57 / 5 · #7
Top of the pool on the two axes that matter for character control. Pricey enough to make you mean it.
Round 02 verdict · Strict pick
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement: 4.61/5
Tone Consistency: 4.75/5
Collaboration: 4.53/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn: 407 · pop avg 265 · +54%
Unique-word ratio: 0.571 · pop avg 0.655 · -13%
Repetition score: 0.071 · pop avg 0.049 · +45%
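The percentage deltas above can be reproduced from the reported values. This is a small sketch, assuming each delta is the relative difference from the population mean rounded to the nearest whole percent; the function name `pct_delta` is illustrative and not part of the benchmark's tooling.

```python
def pct_delta(value: float, pop_mean: float) -> int:
    """Relative difference from the population mean, as a whole percent."""
    return round((value - pop_mean) / pop_mean * 100)


# (model value, population mean) pairs as reported in this section
METRICS = {
    "Avg words / turn": (407, 265),
    "Unique-word ratio": (0.571, 0.655),
    "Repetition score": (0.071, 0.049),
}

for name, (value, pop_mean) in METRICS.items():
    print(f"{name}: {pct_delta(value, pop_mean):+d}% vs pop avg {pop_mean}")
```

Read together, the three deltas say the model writes longer turns with a less varied vocabulary and more repetition than the population average.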
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
42.8 / 100
▌ Score breakdown
Mean: 42.8
Median: 48.0
Fatal/sess: 0.75
Major/sess: 5.92
▌ Top flaws caught
purple_prose · recycled_description · agency_violation
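The "100 minus deductions" rule in this section can be sketched as follows. The page does not publish the per-flag weights, so the weights, the floor at zero, and the three sessions below are all hypothetical and exist only to show the shape of the calculation. They are not the benchmark's values or this model's data.

```python
from statistics import mean, median

# Hypothetical deduction per flag, keyed by severity (assumption, not the benchmark's weights).
DEDUCTIONS = {"fatal": 20.0, "major": 7.0, "minor": 2.0}


def session_score(flag_counts: dict[str, int]) -> float:
    """Start at 100 and subtract a weighted deduction for every flag caught."""
    total = sum(DEDUCTIONS[severity] * count for severity, count in flag_counts.items())
    return max(0.0, 100.0 - total)  # flooring at 0 is an assumption


# Made-up flag counts for three sessions, for illustration only.
sessions = [
    {"fatal": 0, "major": 5, "minor": 3},
    {"fatal": 1, "major": 6, "minor": 2},
    {"fatal": 2, "major": 8, "minor": 4},
]

scores = [session_score(s) for s in sessions]
print(f"mean {mean(scores):.1f}, median {median(scores):.1f}")
```

A mean below the median, as in the score breakdown above, usually indicates that a few heavily penalized sessions pull the average down.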
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Opus 4.7 is the answer to "what's the best at staying out of my character's head, regardless of cost." Round 02 sample is small (26 votes) but every adversarial axis we've tested confirms the lead. If you can absorb the pricing, ship it. If you can't, Sonnet 4.5 is the reliability runner-up at a fifth the price.