Model Profile · Round 01

Claude Opus 4.6

Anthropic · proprietary · 200K ctx
Reliable

Reliability runner-up. Top-2 on agency, complete failure-mode coverage.

Composite Score: 88.1 / 100 · canonical
Arena ELO (R1): joined post-R01
Multi-Turn ELO (R2): 1568 ±100 · n=40
Reliability Rank: #2 · avg 2.9
▌ Section 02 · The Lede

What this model is for.

Opus 4.6 is the completist Anthropic option in Round 02: every adversarial axis the benchmark tests has at least three sessions on it, whereas its successor, Opus 4.7, still lacks F3 (lore) and F8 (momentum) coverage. The numbers it does have are top tier: F2 POV/tense at 4.47 (highest in pool), F3 lore at 4.60 (highest in pool), and a fatal-flaw rate of 0.29 per session, half of Opus 4.7's. The trade-off is engagement and speed: Opus 4.6 reads denser, sits at #2 on multi-turn ELO behind 4.7 (1568 vs 1627), and takes twice as long to generate (20s vs 10s median). Round 02 holds it as the reliability runner-up at the same price.

▌ Section 03 · At a Glance

Cross-test position

Claude Opus 4.6 holds #2 in Multi-Turn, Rubric, and Adversarial, and #3 in Composite. It sits at #20 on Cost · Latency, the caveat to watch.

Composite: #3
Arena ELO: not ranked (joined post-R01)
Multi-Turn: #2
Rubric: #2
Adversarial: #2
Cost · Latency: #20
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Top-2 on agency respect (4.55/5), with complete failure-mode coverage where Opus 4.7 is missing the F3 and F8 axes. Lowest fatal-flaw rate of any Opus model.
Weakness
Highest phrase repetition in the Anthropic family (0.076 vs population 0.049). Slowest generation among the top-5 multi-turn finalists at 20s median.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency: Doesn't write your character's actions (4.55 / 5, rank #2)
F2 · POV / Tense: Holds 2nd-person, present-tense narration (4.47 / 5, rank #1)
F3 · Lore: Doesn't break worldbuilding (4.60 / 5, rank #1)
F8 · Momentum: Pushes scene forward when user goes passive (4.10 / 5, rank #9)
F12 · Instruction Drift: Keeps to the system prompt (4.47 / 5, rank #2)
F13 · Context Attention: Holds character cards 50+ turns deep (4.57 / 5, rank #4)
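To make the comparison against the population tick concrete, here is a minimal sketch that subtracts the quoted population mean from each axis mean. It assumes the 4.20 figure applies to every axis alike, which the page does not state outright; the function names are illustrative and not part of any rp-bench tooling.

```python
# Per-axis means for Claude Opus 4.6, copied from the breakdown above.
AXIS_MEANS = {
    "F1 Agency": 4.55,
    "F2 POV / Tense": 4.47,
    "F3 Lore": 4.60,
    "F8 Momentum": 4.10,
    "F12 Instruction Drift": 4.47,
    "F13 Context Attention": 4.57,
}

# ASSUMPTION: the single population mean quoted above is used for all axes.
POPULATION_MEAN = 4.20


def deltas_vs_population(axis_means, population_mean):
    """Return each axis mean minus the population mean, rounded to 2 decimals."""
    return {
        axis: round(score - population_mean, 2)
        for axis, score in axis_means.items()
    }


def below_population(axis_means, population_mean):
    """List the axes whose mean falls under the population mean."""
    return [axis for axis, score in axis_means.items() if score < population_mean]


if __name__ == "__main__":
    for axis, delta in deltas_vs_population(AXIS_MEANS, POPULATION_MEAN).items():
        print(f"{axis:<24}{delta:+.2f}")
    print("Below population mean:", below_population(AXIS_MEANS, POPULATION_MEAN))
```

Under that assumption, F8 Momentum is the only axis that lands below the tick, which matches its #9 rank.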
Every axis tested. Slowest in the Opus family. A reliability profile you can trust, end-to-end.
Round 02 verdict · Completist pick
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these scores are the subjective half of that judgment.
Engagement: 4.67/5
Tone Consistency: 4.75/5
Collaboration: 4.46/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn: 534 (pop avg 265 · +102%)
Unique-word ratio: 0.551 (pop avg 0.655 · -16%)
Repetition score: 0.076 (pop avg 0.049 · +55%)
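The percentages next to each metric appear to be plain relative differences from the population average. A minimal sketch of that arithmetic follows, assuming rounding to the nearest whole percent; the function name is illustrative.

```python
def pct_vs_population(value, population_avg):
    """Relative difference from the population average, as a whole-number percent."""
    return round((value - population_avg) / population_avg * 100)


# (model value, population average) pairs copied from the metrics above.
METRICS = {
    "avg_words_per_turn": (534, 265),
    "unique_word_ratio": (0.551, 0.655),
    "repetition_score": (0.076, 0.049),
}

if __name__ == "__main__":
    for name, (value, pop_avg) in METRICS.items():
        print(f"{name}: {pct_vs_population(value, pop_avg):+d}%")
```

Run on the three pairs above, this gives +102%, -16%, and +55%, which agrees with the figures shown.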
▌ Section 08 · Flaw Hunter

Adversarial probe score.

The score starts at 100 and subtracts deductions across 22 failure-mode flag types on adversarial 12-turn sessions. Higher means fewer flaws caught. The population range across the round is 12.8–46.9.
40.9 / 100
▌ Score breakdown
Mean: 40.9
Median: 42.0
Fatal/sess: 0.29
Major/sess: 6.82
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
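A rough sketch of how a "100 minus deductions" score and a top-flaws list could be computed is shown here. The per-severity weights, the "minor" tier, and the floor at zero are all hypothetical; the page gives neither the benchmark's real deduction values nor its aggregation rule, so this will not reproduce the 40.9 figure.

```python
from collections import Counter

# HYPOTHETICAL deduction per flag severity; the real weights are not published here.
DEDUCTION = {"fatal": 15.0, "major": 5.0, "minor": 1.0}


def session_score(flags):
    """Score one session from a list of (flag_type, severity) tuples.

    Starts at 100, subtracts the deduction for each flag, and floors at 0
    (the floor is an assumption).
    """
    total = sum(DEDUCTION[severity] for _, severity in flags)
    return max(0.0, 100.0 - total)


def top_flaws(sessions, n=3):
    """Return the n most frequently raised flag types across all sessions."""
    counts = Counter(flag for flags in sessions for flag, _ in flags)
    return [flag for flag, _ in counts.most_common(n)]


if __name__ == "__main__":
    example_sessions = [
        [("purple_prose", "major"), ("recycled_description", "major")],
        [("purple_prose", "major"), ("narrating_emotions", "minor")],
    ]
    for flags in example_sessions:
        print(session_score(flags))
    print(top_flaws(example_sessions, n=1))
```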
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Opus 4.6 is the model you pick when you need the full reliability matrix: every axis covered, every probe answered. It loses Round 02's headline ELO race to 4.7 by ~60 points (1568 vs 1627), but the gap fits inside the ±100 confidence interval at this sample size (n=40). Pick 4.6 over 4.7 if F3 (lore) or F8 (momentum) coverage matters to you; pick 4.7 if you want the fastest top-tier generation.
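For a sense of what a ~60-point gap means, the sketch below checks it against the ±100 margin and converts it to an expected head-to-head score. It assumes the leaderboard uses the standard logistic Elo formula on a 400-point scale, which the page does not confirm, and it uses only the margin quoted for Opus 4.6, since no interval is given for 4.7.

```python
def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def gap_within_margin(rating_a, rating_b, margin):
    """True when the rating gap is no larger than the quoted +/- margin."""
    return abs(rating_a - rating_b) <= margin


# Ratings and margin quoted on this page.
OPUS_46, OPUS_47, MARGIN = 1568, 1627, 100

if __name__ == "__main__":
    print("gap:", OPUS_47 - OPUS_46)
    print("within margin:", gap_within_margin(OPUS_46, OPUS_47, MARGIN))
    print(f"expected score for 4.7: {elo_expected_score(OPUS_47, OPUS_46):.2f}")
```

Under those assumptions, the 59-point gap corresponds to an expected score of roughly 0.58 for 4.7, a modest edge that the ±100 margin cannot separate from a tie.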