Model Profile · Round 01

DeepSeek v4 Flash

DeepSeek · open weights · 128K ctx
Cheap

Cheapest tier, top flaw-hunter score. Multi-turn ELO drags it down.

Composite Score: 45.2/100 · canonical
Arena ELO (R1): joined post-R01
Multi-Turn ELO (R2): 1469 ±104 · n=26
Reliability Rank: #11 · avg 9.5
▌ Section 02 · The Lede

What this model is for.

DeepSeek's v4 Flash is the cheapest-tier entrant in Round 02 at $0.18 per million tokens, tied with Gemini 3.1 Flash Lite for the lowest price in the 20-model pool. Where it earns its shelf space is the flaw hunter score: 50.6/100, the highest in the entire field, ahead of every Anthropic and DeepSeek Pro model. The trade-off is multi-turn engagement. Round 02 voters slot it at #16 by ELO (1469), well below where its quality-axis numbers would suggest. Plausible read: voters prefer denser, more elaborate prose, and Flash writes leaner (173 words per turn against a pool average of 265). Either way, the price-to-flaw-hunter ratio is uncontested in this round.

▌ Section 03 · At a Glance

Cross-test position

DeepSeek v4 Flash holds #2 in Cost · Latency but sits at #16 on Multi-Turn, which is the caveat to watch.

Composite: #12
Arena ELO: no rank (joined post-R01)
Multi-Turn: #16
Rubric: #8
Adversarial: #11
Cost · Latency: #2
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Top-1 on flaw hunter (50.6/100) across all 20 models. Tied cheapest at $0.18/1M.
Weakness
Multi-turn ELO of 1469 puts it at #16, lower engagement than its quality numbers predict. The F3 (lore) and F8 (momentum) adversarial probes have not yet been run on it upstream.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6 · F3 (Lore) and F8 (Momentum) not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency · Doesn't write your character's actions · 4.50/5
F2 · POV / Tense · Holds 2nd-person, present-tense narration · 4.20/5
F3 · Lore · not yet run on this model
F8 · Momentum · not yet run on this model
F12 · Instruction Drift · Keeps to the system prompt · 4.30/5
F13 · Context Attention · Holds character cards 50+ turns deep · 4.53/5
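The coverage figure and the per-axis means above can be combined in a small sketch. The profile does not publish an averaged axis score, so the `coverage_and_mean` helper and the idea of averaging only the probes that were run are assumptions made for illustration. The scores themselves are copied from this page.

```python
from statistics import mean
from typing import Optional

# Per-axis means copied from the profile; None marks probes not yet run.
axis_scores: dict[str, Optional[float]] = {
    "F1 Agency": 4.50,
    "F2 POV/Tense": 4.20,
    "F3 Lore": None,
    "F8 Momentum": None,
    "F12 Instruction Drift": 4.30,
    "F13 Context Attention": 4.53,
}

def coverage_and_mean(scores: dict[str, Optional[float]]):
    """Hypothetical helper: report coverage and average only the axes that ran."""
    run = [s for s in scores.values() if s is not None]
    coverage = f"{len(run)}/{len(scores)}"
    return coverage, (round(mean(run), 2) if run else None)

coverage, covered_mean = coverage_and_mean(axis_scores)
print(coverage, covered_mean)
```

Skipping the missing axes rather than scoring them as zero keeps a partially covered model comparable with the rest of the pool, though any such average should be read next to the coverage figure.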
Highest flaw-hunter score, lowest price. Voters haven't caught up yet.
Round 02 verdict · Cost-efficiency lead
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these scores are the subjective half of that judgment.
Engagement
4.47/5
Tone Consistency
4.60/5
Collaboration
4.27/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
173
pop avg 265 · -35%
Unique-word ratio
0.709
pop avg 0.655 · +8%
Repetition score
0.030
pop avg 0.049 · -39%
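The percentage shown next to each metric is consistent with a plain relative difference from the population average. A minimal sketch, assuming that is how the page derives it (the `pct_delta` name is hypothetical; the values are copied from above):

```python
def pct_delta(value: float, pop_avg: float) -> int:
    """Relative difference from the population average, in whole percent."""
    return round((value - pop_avg) / pop_avg * 100)

# (model value, population average) pairs copied from the profile.
metrics = {
    "avg_words_per_turn": (173, 265),
    "unique_word_ratio": (0.709, 0.655),
    "repetition_score": (0.030, 0.049),
}

deltas = {name: pct_delta(v, avg) for name, (v, avg) in metrics.items()}
print(deltas)
```

Under that assumption the three results round to -35, +8 and -39, matching the figures printed on the page.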
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Each session starts at 100 and loses points for every flag raised across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
50.6/100
▌ Score breakdown
Mean   50.6
Median   58.0
Fatal/sess   0.36
Major/sess   5.27
▌ Top flaws caught
purple_prose · recycled_description · convenient_world
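The "100 minus deductions" scoring can be sketched as follows. The source does not publish the per-severity weights, so the `DEDUCTION` values are hypothetical, and the three sessions are made-up inputs for illustration only. None of the numbers below are taken from the benchmark.

```python
from statistics import mean, median

# HYPOTHETICAL severity weights; the real deductions are not published here.
DEDUCTION = {"fatal": 20, "major": 8}

def session_score(flags: dict[str, int]) -> int:
    """Start at 100, subtract a weighted penalty per flag, floor at 0."""
    lost = sum(DEDUCTION[severity] * count for severity, count in flags.items())
    return max(0, 100 - lost)

# Made-up flag counts for three sessions, for illustration only.
sessions = [
    {"fatal": 0, "major": 4},
    {"fatal": 1, "major": 6},
    {"fatal": 0, "major": 5},
]

scores = [session_score(s) for s in sessions]
print(scores, round(mean(scores), 1), median(scores))
```

In this toy data one heavily penalized session pulls the mean below the median. The profile shows the same ordering (mean 50.6, median 58.0), which would be consistent with a few weak sessions dragging the average down, although the page does not confirm that.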
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Pick Flash when budget is the binding constraint and you can stomach a lower-engagement read. The flaw-hunter number says the prose is structurally clean; the multi-turn ELO says voters preferred denser models when forced to choose. If your product values consistency over flair, Flash is the bargain. If you need the scene to grab the reader, Mistral SC at $0.50 hits much harder for ~3× the cost.
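The "~3×" cost comparison in the verdict can be checked with a few lines of arithmetic. The two prices come from this page. The 50M-token monthly volume and the `cost_usd` helper are assumptions chosen only to make the gap concrete.

```python
FLASH_PRICE = 0.18        # USD per 1M tokens, from the profile
MISTRAL_SC_PRICE = 0.50   # USD per 1M tokens, from the verdict above

def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a token volume at a flat per-million price, rounded to cents."""
    return round(tokens / 1_000_000 * price_per_million, 2)

multiple = round(MISTRAL_SC_PRICE / FLASH_PRICE, 2)

# HYPOTHETICAL workload: 50M tokens per month.
monthly_tokens = 50_000_000
flash_cost = cost_usd(monthly_tokens, FLASH_PRICE)
mistral_cost = cost_usd(monthly_tokens, MISTRAL_SC_PRICE)
print(multiple, flash_cost, mistral_cost)
```

The price multiple works out to about 2.78, which the verdict rounds to roughly 3×.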