Model Profile · Round 01

Claude Sonnet 4.5

Anthropic · proprietary · 200K ctx

Reliable

Round 01 reliability leader. Tied #1 on context attention.

Composite Score

92.9

/100 · canonical

Arena ELO (R1)

1506

±45 · n=194

Multi-Turn ELO (R2)

1550

±92 · n=48

Reliability Rank

avg 4.6

▌ Section 02 · The Lede

What this model is for.

Anthropic's Sonnet 4.5 is the round's reliability anchor — tied #1 on long-context attention with DeepSeek (F13 score 4.60), #1 on agency (F1 4.50), tied in the top-tier on narrative momentum and instruction drift. Failure-rank #1 in our pool of 11. The cost: $3 per 1k tokens (~38× Gemma), and a community ELO of 1506 (#6) that puts it well behind the engagement leaders. The model that doesn't break.

▌ Section 03 · At a Glance

Cross-test position

Claude Sonnet 4.5 holds #2 in Composite and #3 in Adversarial. Sits at #17 on Cost · Latency — the caveat to watch.

Composite

Arena ELO

Multi-Turn

Rubric

Adversarial

Cost · Latency

▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

▲ Strength

Tied #1 on long-context attention (4.60/5, with DeepSeek). #1 on agency (F1 4.50). Failure-rank #1 in the pool — best multi-turn reliability across the round.

▼ Weakness

Highest cost per 1k tokens ($3.00 — ~38× the cheapest top-tier option). Mid-pack community engagement (#6 ELO).

▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.

F1 · Agency

Doesn't write your character's actions

4.50/ 5

F2 · POV / Tense

Holds 2nd-person, present-tense narration

4.20/ 5

#10

F3 · Lore

Doesn't break worldbuilding

4.20/ 5

F8 · Momentum

Pushes scene forward when user goes passive

4.20/ 5

F12 · Instruction Drift

Keeps to the system prompt

4.30/ 5

F13 · Context Attention

Holds character cards 50+ turns deep

4.60/ 5

“The model the community ranked sixth — and the model that never broke.”

— Round 01 verdict · Reliability ≠ engagement

▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions scored 1–5 by Sonnet 4 LLM-judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.

Engagement

4.58/5

Tone Consistency

4.67/5

Collaboration

4.37/5

▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.

Avg words / turn

314↑

pop avg 265 · +19%

Unique-word ratio

0.625↓

pop avg 0.655 · -5%

Repetition score

0.053↑

pop avg 0.049 · +8%

▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.

45.3

/ 100

▌ Score breakdown

Mean   45.3
Median   44.5
Fatal/sess   0.22
Major/sess   6.22

▌ Top flaws caught

purple_proserecycled_descriptionnarrating_emotions

▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict

Sonnet 4.5 is the model you reach for when the scene matters more than the vibe. It's #1 on the failure-rank board, tied #1 on long-context attention, #1 on agency, and tied in the top tier on momentum and instruction drift — every reliability axis we tested, Sonnet sits at or near the top. The community ELO and the price tag are the trade-off. Pick it for the campaign that has to last. Pick something cheaper for the everyday.

▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models

The Standings

Full leaderboard, all tests, all filters.

Compare →

Methodology · Raw votes (CSV) · GitHub · HF dataset

Profile · Claude Sonnet 4.5 · Round 01