Model Profile · Round 02

DeepSeek v4 Pro

DeepSeek · open weights · 128K ctx

Strong tone consistency at a fraction of Opus pricing.

Composite Score: 69.0 / 100 · canonical
Arena ELO (R1): joined post-R01
Multi-Turn ELO (R2): 1564 · ±101 · n=31
Reliability Rank: #5 · avg 5.8
▌ Section 02 · The Lede

What this model is for.

DeepSeek's v4 Pro lands as the best open-weight model in Round 02's multi-turn arena. It sits #3 on the multi-turn ELO at 1564, behind only the two Opus models, and at $3.20 per million tokens it's an order of magnitude cheaper than either. Strengths cluster on tone consistency (4.70/5, top of the pool) and reliability (avg failure rank 5.75 across the pool). The data is thinner than for the Anthropic entries (n=31 multi-turn votes), and the Likert-judge mean is dragged down by a higher-than-expected fatal-flaw count (0.50 per session), but the multi-turn community votes slot it firmly in the upper third. If you can't deploy Anthropic for compliance reasons, this is the open-weight read.

▌ Section 03 · At a Glance

Cross-test position

DeepSeek v4 Pro holds #3 in Multi-Turn and #3 in Rubric. Sits at #15 on Cost · Latency — the caveat to watch.

Composite: #7
Arena ELO: n/a (joined post-R01)
Multi-Turn: #3
Rubric: #3
Adversarial: #5
Cost · Latency: #15
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Strong tone consistency (4.70/5, top of pool). #3 on multi-turn ELO (1564). $3.20/1M, a tenth of Opus pricing.
Weakness
0.50 fatal flaws per session is mid-pack, not top-tier. F3 (lore) and F8 (momentum) adversarial coverage is missing upstream; those axes are blank cells on this profile.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6. F3 · Lore and F8 · Momentum have not yet been run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency: Doesn't write your character's actions · 4.40 / 5
F2 · POV / Tense: Holds 2nd-person, present-tense narration · 4.30 / 5
F3 · Lore: not yet run on this model
F8 · Momentum: not yet run on this model
F12 · Instruction Drift: Keeps to the system prompt · 4.40 / 5
F13 · Context Attention: Holds character cards 50+ turns deep · 4.57 / 5
Best open-weight in Round 02. Anthropic-tier reliability without the Anthropic price tag.
Round 02 verdict · Open-weight lead
▌ Section 06 · Subjective Dimensions

Engagement · Tone Consistency · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement
4.53/5
Tone Consistency
4.70/5
Collaboration
4.38/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn: 259 (pop avg 265 · -2%)
Unique-word ratio: 0.664 (pop avg 0.655 · +1%)
Repetition score: 0.040 (pop avg 0.049 · -18%)
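The page doesn't publish exact definitions for these signals. A minimal sketch under stated assumptions: unique-word ratio as a type-token ratio, and repetition score as the share of word-bigram occurrences that are repeats (both are assumptions, not the benchmark's confirmed formulas):

```python
from collections import Counter

def behavioral_metrics(turns):
    """Rough per-session signals over a list of turn strings.

    Assumed definitions (not confirmed by the profile page):
    - avg_words: mean word count per turn
    - unique_ratio: distinct words / total words (type-token ratio)
    - repetition: fraction of bigram occurrences whose bigram repeats

    Simplification: bigrams are counted over the concatenated word
    stream, so pairs spanning a turn boundary are included.
    """
    words = [w.lower() for t in turns for w in t.split()]
    avg_words = len(words) / len(turns)
    unique_ratio = len(set(words)) / len(words)
    bigrams = Counter(zip(words, words[1:]))
    repeated = sum(c for c in bigrams.values() if c > 1)
    repetition = repeated / max(1, sum(bigrams.values()))
    return avg_words, unique_ratio, repetition
```

On this reading, v4 Pro's 0.040 repetition score against a 0.049 population average means noticeably fewer recycled phrasings per session than the pool norm.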
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
19.4
/ 100
▌ Score breakdown
Mean: 19.4 · Median: 46.5 · Fatal/sess: 0.50 · Major/sess: 9.00
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
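The per-flag deduction table behind the 22 flag types isn't published here. A minimal sketch of the 100-minus-deductions shape, with hypothetical per-severity weights (the real weights and flag taxonomy are upstream's, not shown on this page):

```python
# Hypothetical severity weights; the real 22-flag deduction
# table is not published on this profile page.
WEIGHTS = {"fatal": 15.0, "major": 5.0, "minor": 1.0}

def flaw_hunter_score(flags):
    """Start at 100, subtract a weighted deduction per flagged
    flaw, and clamp at 0. `flags` is a list of severity strings."""
    deductions = sum(WEIGHTS[s] for s in flags)
    return max(0.0, 100.0 - deductions)
```

Under any weighting of this shape, a session heavy on fatal and major flags drains the score fast, which is consistent with 0.50 fatal and 9.00 major flags per session landing v4 Pro at 19.4 near the bottom of the 12.8–46.9 population range.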
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 02 verdict
DeepSeek v4 Pro is the model to pick if open weights are a hard requirement. The Round 02 sample is thin (n=31), but every axis with data lands in the top third. The blanks on F3 and F8 are real: if your scenes lean on lore consistency or narrative momentum, run a few sessions of your own before committing. For most cases, this is a credible Sonnet alternative.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset
Profile · DeepSeek v4 Pro · Round 02