Model Profile · Round 01

Claude Opus 4.7

Anthropic · proprietary · 200K ctx

Champion

Top of the multi-turn pool. Top-1 on agency respect and instruction drift.

Composite Score

97.6

/100 · canonical

Arena ELO (R1)

—

joined post-R01

Multi-Turn ELO (R2)

1583

±44 · n=142

Reliability Rank

avg 2.8

▌ Section 02 · The Lede

What this model is for.

Anthropic's Opus 4.7 entered the multi-turn leaderboard at #1 in Round 02 with 26 votes against the rest of the 20-model field. It also leads two of our six adversarial axes outright: F1 agency at 4.60/5 and F12 instruction drift at 4.57/5, each the highest in the pool. The catch is the price tag: $39 per million tokens, the same as Opus 4.6 and several times the price of Sonnet 4.5. If you're building a roleplay product where the model must hold your character card without flinching and never write your reactions for you, this is the strict pick. Where price is a constraint, Sonnet 4.5 at 4.50 on F1 agency is the budget pick.

▌ Section 03 · At a Glance

Cross-test position

Claude Opus 4.7 holds #1 in Composite, #1 in Multi-Turn, #1 in Rubric and #1 in Adversarial. It sits at #19 on Cost · Latency, which is the caveat to watch.

Positions charted: Composite, Arena ELO (—), Multi-Turn, Rubric, Adversarial, Cost · Latency

▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

▲ Strength

Top-1 on agency respect (4.60/5) and instruction drift (4.57/5) across the 20-model pool — both adversarial axes that test whether the model will let the user drive.

▼ Weakness

$39/1M tokens, tied with Opus 4.6 as the most expensive in the pool. At n=26 votes it has the smallest sample in Round 02, so the confidence interval is wide.

▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.

▌ Coverage: 4/6

F3 · Lore and F8 · Momentum not yet run on this model. Upstream rolls these out incrementally as new models join the pool.

F1 · Agency

Doesn't write your character's actions

4.60/5

F2 · POV / Tense

Holds 2nd-person, present-tense narration

4.43/5

F3 · Lore

not yet run on this model

—

—

F8 · Momentum

not yet run on this model

—

—

F12 · Instruction Drift

Keeps to the system prompt

4.57/5

F13 · Context Attention

Holds character cards 50+ turns deep

4.57/5
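The coverage figure and each axis's distance from the population mean can be recomputed from the numbers listed above. What follows is a minimal Python sketch, assuming the per-axis means are held in a plain dict. The names `axis_scores` and `summarize_axes` are illustrative and not part of rp-bench.

```python
from statistics import mean

POPULATION_MEAN = 4.20  # the black tick described in the text above

# Per-axis means for Claude Opus 4.7; None marks an axis not yet run.
axis_scores = {
    "F1 Agency": 4.60,
    "F2 POV / Tense": 4.43,
    "F3 Lore": None,
    "F8 Momentum": None,
    "F12 Instruction Drift": 4.57,
    "F13 Context Attention": 4.57,
}


def summarize_axes(scores, population_mean):
    """Return coverage, mean of the axes that ran, and delta vs the pool."""
    run = {axis: s for axis, s in scores.items() if s is not None}
    deltas = {axis: round(s - population_mean, 2) for axis, s in run.items()}
    return {
        "coverage": f"{len(run)}/{len(scores)}",
        "mean_of_run_axes": round(mean(run.values()), 2),
        "delta_vs_population": deltas,
    }


summary = summarize_axes(axis_scores, POPULATION_MEAN)
print(summary["coverage"])             # 4/6
print(summary["delta_vs_population"])  # F1 Agency is +0.4 over the pool mean
```

Axes that have not been run are excluded from the mean rather than counted as zero, so a partial battery does not drag the average down.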

“Top of the pool on the two axes that matter for character control. Pricey enough to make you mean it.”

— Round 02 verdict · Strict pick

▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these scores are the subjective half of that judgment.

Engagement

4.61/5

Tone Consistency

4.75/5

Collaboration

4.53/5

▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.

Avg words / turn

407↑

pop avg 265 · +54%

Unique-word ratio

0.571↓

pop avg 0.657 · -13%

Repetition score

0.071↑

pop avg 0.048 · +48%
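The percentage deltas above follow from each value and its population average. The sketch below reproduces them and shows one plausible way to compute the first two metrics from raw turns. It assumes the unique-word ratio is unique tokens divided by total tokens and uses a simple regex tokenizer; the benchmark's actual tokenizer and its repetition-score formula are not given here, so the helper names are hypothetical.

```python
import re


def words(text):
    # Assumed tokenizer: lowercased runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def avg_words_per_turn(turns):
    """Mean word count over a list of model turns."""
    return sum(len(words(t)) for t in turns) / len(turns) if turns else 0.0


def unique_word_ratio(turns):
    """Unique tokens divided by total tokens (an assumed definition)."""
    tokens = [w for turn in turns for w in words(turn)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def pct_vs_population(value, pop_avg):
    """Signed percentage difference from the population average."""
    return round((value - pop_avg) / pop_avg * 100)


print(pct_vs_population(407, 265))      # 54
print(pct_vs_population(0.571, 0.657))  # -13
print(pct_vs_population(0.071, 0.048))  # 48
```

Read together, the three deltas say the model writes longer turns with a smaller share of distinct words and more repetition than the pool average.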

▌ Section 08 · Flaw Hunter

Adversarial probe score.

The score is 100 minus deductions across 22 failure-mode flag types, measured on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.

42.8

/ 100

▌ Score breakdown

Mean   42.8
Median   48.0
Fatal/sess   0.75
Major/sess   5.92
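A "100 minus deductions" score and the per-session rates in the breakdown can be sketched as follows. The deduction weights, the floor at zero, and the sample sessions are all hypothetical, since the real weights are not published here. Only the three flag names come from this profile.

```python
from statistics import mean, median

# HYPOTHETICAL weights per severity; the real rubric values are not given.
DEDUCTION = {"fatal": 20, "major": 5, "minor": 1}


def session_score(flags):
    """flags: list of (flag_type, severity) pairs caught in one session."""
    total = sum(DEDUCTION[severity] for _, severity in flags)
    return max(0, 100 - total)  # floor at 0 is an assumption


def summarize_flaws(sessions):
    """Aggregate per-session scores and severity rates across sessions."""
    scores = [session_score(flags) for flags in sessions]
    n = len(sessions)

    def count(severity):
        return sum(sev == severity for flags in sessions for _, sev in flags)

    return {
        "mean": round(mean(scores), 1),
        "median": round(median(scores), 1),
        "fatal_per_session": count("fatal") / n,
        "major_per_session": count("major") / n,
    }


# Made-up sessions for illustration only.
sample_sessions = [
    [("purple_prose", "major"), ("agency_violation", "fatal")],
    [("recycled_description", "major"), ("purple_prose", "major")],
]
print(summarize_flaws(sample_sessions))
```

Reporting the median alongside the mean, as the breakdown does, shows whether a few heavily flagged sessions are pulling the average down.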

▌ Top flaws caught

purple_prose · recycled_description · agency_violation

▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict

Opus 4.7 is the answer to "what's the best at staying out of my character's head, regardless of cost." The Round 02 sample is small (26 votes), but the adversarial axes tested so far support the lead. If you can absorb the pricing, ship it. If you can't, Sonnet 4.5 is the reliability runner-up at a fifth the price.
