Model Profile · Round 01

DeepSeek R1 0528

DeepSeek · open weights · 164K ctx

2025-vintage reasoner. Clean prose metrics, weak instruction-keeping. No arena votes yet.

Composite Score: 21.4 / 100 · canonical
Arena ELO (R1): joined post-R01
Multi-Turn ELO (R2): n/a
Reliability Rank: #14 · avg 10.9
▌ Section 02 · The Lede

What this model is for.

DeepSeek R1 0528 is the pool's odd entrant: a 2025-vintage open-weight reasoner added in June 2026 to test how the previous generation holds up against the current field — and the only model here with no human votes at all. Every number on this page is judge-scored: 20 adversarial multi-turn sessions, the flaw-hunter gauntlet, and the behavioral pass. The split it surfaces is sharp. Surface metrics are genuinely clean — 0.685 unique-word ratio and 0.037 bigram repetition are among the best in the field — but it ranks #19 of 21 on instruction drift and burns a median 33 seconds per reply on reasoning tokens. A curiosity pick for people who want to see what a year of progress looks like.

▌ Section 03 · At a Glance

Cross-test position

DeepSeek R1 0528 sits at #17 on Composite — the caveat to watch.

Composite: #17
Arena ELO: not listed
Multi-Turn: not listed
Rubric: #13
Adversarial: #14
Cost · Latency: not listed
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Cleanest-prose cluster of the open-weight field: 0.685 unique-word ratio and 0.037 bigram repetition beat every Anthropic model. Solid lore retention too — F3 lore-consistency at 4.30 sits #6 in the 21-model pool.
Weakness
Instruction drift is the wound: F12 at 4.17 ranks #19 of 21, and the flaw hunter logged 11 fatal agency violations across 14 sessions (0.79 fatal/session). For strict character work, DeepSeek's own v3.2 is more reliable at a fraction of the latency.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency: Doesn't write your character's actions. Mean 4.42 / 5, rank #9
F2 · POV / Tense: Holds 2nd-person, present-tense narration. Mean 4.20 / 5, rank #16
F3 · Lore: Doesn't break worldbuilding. Mean 4.30 / 5, rank #6
F8 · Momentum: Pushes scene forward when user goes passive. Mean 4.20 / 5, rank #7
F12 · Instruction Drift: Keeps to the system prompt. Mean 4.17 / 5, rank #19
F13 · Context Attention: Holds character cards 50+ turns deep. Mean 4.50 / 5, rank #10
Clean words, loose hands. The prose passes every filter; the character contract doesn't.
Phase C verdict · 2025-vintage baseline
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by the Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement: 4.53/5
Tone Consistency: 4.62/5
Collaboration: 4.37/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn: 274 (pop avg 265 · +3%)
Unique-word ratio: 0.685 (pop avg 0.657 · +4%)
Repetition score: 0.037 (pop avg 0.048 · -23%)
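
The deltas in this section are relative differences against the population average. A minimal sketch of that arithmetic follows, along with one plausible reading of the two lexical metrics. The page does not define unique-word ratio or repetition score, so the tokenization and both formulas below (distinct words over total words; share of bigram occurrences that repeat an earlier bigram) are assumptions, and every function name is hypothetical.

```python
import re
from collections import Counter


def pct_delta(value: float, pop_avg: float) -> int:
    """Relative difference from the population average, as a whole percent."""
    return round((value - pop_avg) / pop_avg * 100)


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase word tokens; the benchmark's tokenizer is not documented here.
    return re.findall(r"[a-z']+", text.lower())


def unique_word_ratio(text: str) -> float:
    """Assumed definition: distinct words divided by total words."""
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0


def bigram_repetition(text: str) -> float:
    """Assumed definition: fraction of bigram occurrences that repeat an earlier one."""
    words = tokenize(text)
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(bigrams)


# Figures quoted in this section.
print(pct_delta(274, 265))      # 3   (avg words / turn)
print(pct_delta(0.685, 0.657))  # 4   (unique-word ratio)
print(pct_delta(0.037, 0.048))  # -23 (repetition score)
```

Under these assumed definitions a higher unique-word ratio and a lower repetition score both indicate more varied prose, which matches how the page reads the two numbers.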
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
26.2 / 100
▌ Score breakdown
Mean: 26.2
Median: 30.0
Fatal/sess: 0.79
Major/sess: 8.21
▌ Top flaws caught
purple_prose · recycled_description · agency_violation
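
The scoring rule described in this section (100 minus deductions) and the per-session rates can be sketched as follows. The deduction weights are placeholders invented for illustration: the page does not publish the real weights, any severity tiers beyond fatal and major, or whether scores are floored at zero. The only figure checked against the page is the fatal rate, 11 fatal violations over 14 sessions.

```python
from statistics import mean, median

# Hypothetical deduction weights per flag severity; NOT the benchmark's real values.
ASSUMED_WEIGHTS = {"fatal": 20.0, "major": 5.0}


def session_score(flag_counts: dict[str, int],
                  weights: dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """100 minus weighted deductions, floored at 0 (the floor is an assumption)."""
    deductions = sum(weights[sev] * n for sev, n in flag_counts.items())
    return max(0.0, 100.0 - deductions)


def per_session_rate(total_flags: int, sessions: int) -> float:
    """Average number of flags of one severity per session, to two decimals."""
    return round(total_flags / sessions, 2)


# The fatal rate quoted on this page: 11 fatal violations across 14 sessions.
print(per_session_rate(11, 14))  # 0.79

# Made-up sessions, only to show how mean and median can diverge.
toy_sessions = [
    {"fatal": 0, "major": 6},
    {"fatal": 1, "major": 8},
    {"fatal": 2, "major": 12},
]
scores = [session_score(s) for s in toy_sessions]
print(mean(scores), median(scores))
```

In the toy data one very low session pulls the mean below the median, the same direction as the 26.2 mean and 30.0 median reported here.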
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
R1 0528 exists on this leaderboard as a baseline, and as a baseline it's instructive: one model generation ago, this was the open-weight reasoning frontier, and it still posts top-tier vocabulary diversity. But roleplay is a contract-keeping exercise, and the contract numbers — #19 instruction drift, 0.79 fatal flaws per session, 33-second median replies — put it at composite #17 of 21 with the multi-turn axis imputed. If you want DeepSeek for RP, v3.2 and v4 Pro are both better at everything except nostalgia. No arena votes yet; if Round 03 carries it forward, the community gets to confirm or overturn the judges.