Model Profile · Round 01
Claude Opus 4.7
Anthropic · proprietary · 200K ctx
Champion · Top of the multi-turn pool. Top-1 on agency respect and instruction drift.
Composite Score
97.6
/100 · canonical
Arena ELO (R1)
—
joined post-R01
Multi-Turn ELO (R2)
1627
±109 · n=26
Reliability Rank
#1
avg 2.8
▌ Section 02 · The Lede
What this model is for.
Anthropic's Opus 4.7 entered the multi-turn leaderboard at #1 in Round 02, on 26 votes against the rest of the 20-model field. It also leads two of our six adversarial axes outright: F1 agency at 4.60/5 and F12 instruction drift at 4.57/5, each the highest score in the pool. The catch is the price tag: $39 per million tokens, the same as Opus 4.6 and roughly five times the price of Sonnet 4.5. If you're building a roleplay product where the model must hold your character card without flinching and never write your reactions for you, this is the strict pick. Where price is a constraint, Sonnet 4.5 at 4.50 on F1 agency is the budget read.
▌ Section 03 · At a Glance
Cross-test position
Claude Opus 4.7 holds #1 in Composite, #1 in Multi-Turn, #1 in Rubric and #1 in Adversarial. Sits at #19 on Cost · Latency — the caveat to watch.
▌ Section 04 · Strength & Weakness
Where it shines. Where it stumbles.
▲ Strength
Top-1 on agency respect (4.60/5) and instruction drift (4.57/5) across the 20-model pool — both adversarial axes that test whether the model will let the user drive.
▼ Weakness
$39/1M tokens, tied with Opus 4.6 for most expensive in the pool. n=26 votes is the smallest sample in Round 02; confidence interval is wide.
▌ Section 05 · Failure Modes
Per-axis breakdown.
Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6
F3 · Lore and F8 · Momentum not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
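The aggregation described above (per-session judge scores averaged per axis, then ranked within the pool, with some axes not yet run) can be sketched roughly as follows. This is a minimal illustration, not the rp-bench pipeline: the model names, the score values, and the helper names `axis_means`, `axis_ranks`, and `coverage` are all hypothetical.

```python
from statistics import mean
from typing import Optional

# Hypothetical per-session judge scores (1-5) per model and axis.
# Placeholder values, not rp-bench data. None marks an axis not yet run.
Scores = dict[str, dict[str, Optional[list[float]]]]

scores: Scores = {
    "model_a": {"F1": [5.0, 4.0, 5.0], "F3": None},
    "model_b": {"F1": [4.0, 4.0, 5.0], "F3": [3.0, 4.0, 4.0]},
    "model_c": {"F1": [3.0, 4.0, 4.0], "F3": [5.0, 4.0, 5.0]},
}

def axis_means(data: Scores, axis: str) -> dict[str, float]:
    """Mean score per model on one axis, skipping models where it was not run."""
    return {
        model: mean(axes[axis])
        for model, axes in data.items()
        if axes.get(axis) is not None
    }

def axis_ranks(data: Scores, axis: str) -> dict[str, int]:
    """Rank models on one axis, 1 = highest mean.

    Ties keep insertion order because sorted() is stable (an assumption;
    the source does not say how ties are broken).
    """
    means = axis_means(data, axis)
    ordered = sorted(means, key=means.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}

def coverage(data: Scores, model: str) -> tuple[int, int]:
    """Return (axes run, axes total) for one model."""
    axes = data[model]
    return sum(v is not None for v in axes.values()), len(axes)

print(axis_ranks(scores, "F1"))
print(coverage(scores, "model_a"))
```

A model with an axis set to `None` is simply left out of that axis's ranking, which mirrors how the unrun axes appear as "—" in the list below.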
F1 · Agency
Doesn't write your character's actions
#1
F2 · POV / Tense
Holds 2nd-person, present-tense narration
#2
F3 · Lore
not yet run on this model
—
—
F8 · Momentum
not yet run on this model
—
—
F12 · Instruction Drift
Keeps to the system prompt
#1
F13 · Context Attention
Holds character cards 50+ turns deep
#7
“Top of the pool on the two axes that matter for character control. Pricey enough to make you mean it.”
— Round 02 verdict · Strict pick
▌ Section 06 · Subjective Dimensions
Engagement · Voice · Collaboration.
All three dimensions scored 1–5 by Sonnet 4 LLM-judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
▌ Section 07 · Behavioral Metrics
How it writes.
Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
407↑
pop avg 265 · +54%
Unique-word ratio
0.571↓
pop avg 0.655 · -13%
Repetition score
0.071↑
pop avg 0.049 · +45%
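The percentage deltas above appear to be plain relative differences from the population average, rounded to a whole percent. A short sketch reproduces them from the figures shown; the function name `pct_delta` and the metric keys are illustrative, and the rounding rule is an assumption.

```python
def pct_delta(value: float, pop_avg: float) -> int:
    """Relative difference from the population average, as a whole percent."""
    return round((value - pop_avg) / pop_avg * 100)

# (model value, population average) pairs taken from the metrics above.
metrics = {
    "avg_words_per_turn": (407, 265),
    "unique_word_ratio": (0.571, 0.655),
    "repetition_score": (0.071, 0.049),
}

for name, (value, pop_avg) in metrics.items():
    print(f"{name}: {pct_delta(value, pop_avg):+d}%")
# avg_words_per_turn: +54%
# unique_word_ratio: -13%
# repetition_score: +45%
```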
▌ Section 08 · Flaw Hunter
Adversarial probe score.
Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
▌ Score breakdown
Mean 42.8
Median 48.0
Fatal/sess 0.75
Major/sess 5.92
▌ Top flaws caught
purple_prose · recycled_description · agency_violation
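The "100 minus deductions" rule from the top of this section can be sketched as below. The source does not publish the per-flag deductions, so the severity weights, the floor at zero, and the severities assigned to each flag here are hypothetical placeholders chosen only to show the shape of the calculation.

```python
from statistics import mean, median

# Hypothetical severity weights; the real deduction values are not published.
DEDUCTIONS = {"fatal": 20.0, "major": 8.0, "minor": 2.0}

def session_score(flags: list[tuple[str, str]]) -> float:
    """Score one session: start at 100, subtract one deduction per flag.

    Each flag is (flag_type, severity), e.g. ("purple_prose", "major").
    Flooring at 0 is an assumption.
    """
    total = sum(DEDUCTIONS[severity] for _, severity in flags)
    return max(0.0, 100.0 - total)

# Made-up sessions using the flag names listed above.
sessions = [
    [("purple_prose", "major"), ("recycled_description", "major")],
    [("agency_violation", "fatal"), ("purple_prose", "major"), ("purple_prose", "minor")],
    [],
]

scores = [session_score(s) for s in sessions]
print(scores)          # [84.0, 70.0, 100.0]
print(median(scores))  # 84.0
print(round(mean(scores), 1))
```

Reporting both mean and median, as the breakdown above does, helps when a few sessions with fatal flags pull the mean well below the typical session.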
▌ Section 09 · Sample Responses
Highest- and lowest-rated turns.
▌ Pending Round 02
Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.
▌ Round 02 verdict
Opus 4.7 is the answer to "what's the best at staying out of my character's head, regardless of cost." Round 02 sample is small (26 votes) but every adversarial axis we've tested confirms the lead. If you can absorb the pricing, ship it. If you can't, Sonnet 4.5 is the reliability runner-up at a fifth the price.