Model Profile · Round 01
DeepSeek v3.2
DeepSeek · open weights · 128K ctx
Reliable, NSFW-shy (30% win rate). Strong lore retention.
Composite Score
83.3
/100 · canonical
Arena ELO (R1)
1489
±45 · n=241
Multi-Turn ELO (R2)
1491
±95 · n=50
Reliability Rank
#4
avg 5.4
▌ Section 02 · The Lede
What this model is for.
DeepSeek v3.2 is the open-weight reliability play: #1 on lore consistency in our pool (F3 4.50), tied #1 on context attention with Sonnet (F13 4.60), failure-rank #2. Cheap at $0.14 per 1M tokens. The catch is the NSFW collapse: a 30% NSFW win rate, lowest in the field. If your scene goes there, look elsewhere. If it doesn't, this is the cheapest path to Sonnet-tier reliability we know of.
▌ Section 03 · At a Glance
Cross-test position
DeepSeek v3.2 holds #3 in Cost · Latency. Sits at #12 on Multi-Turn, the caveat to watch.
▌ Section 04 · Strength & Weakness
Where it shines. Where it stumbles.
▲ Strength
#1 on lore consistency (F3 4.50/5, ahead of the next-best tie at 4.30). Tied #1 on context attention (4.60/5, shared with Sonnet). Failure-rank #2 in the pool. Cost-efficient at $0.14 per 1M tokens, within an order of magnitude of Gemma.
▼ Weakness
NSFW collapse: 30% win rate, lowest in the field. 4.5% agency violation rate (mid-pack but real).
▌ Section 05 · Failure Modes
Per-axis breakdown.
Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. A higher score means the model handled the failure mode better; the population mean is 4.20. Each row below lists the axis, what it tests, and the model's rank within the rp-bench pool.
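For reference, a minimal sketch of that aggregation in Python. The input shape and names are our assumptions; rp-bench publishes only the resulting means and ranks.

```python
# Sketch of the per-axis aggregation described above. Input shape and
# names are assumptions; only the means and ranks appear on this page.
from statistics import mean

def axis_means(sessions: dict[str, list[float]]) -> dict[str, float]:
    """Mean judge score per failure-mode axis across the twenty sessions."""
    return {axis: mean(scores) for axis, scores in sessions.items()}

def pool_rank(model_mean: float, pool_means: list[float]) -> int:
    """Competition rank within the pool (1 = best; tied scores share a rank)."""
    return 1 + sum(1 for m in pool_means if m > model_mean)

# e.g. F13 at 4.60, tied with Sonnet: both resolve to rank 1.
print(pool_rank(4.60, [4.60, 4.60, 4.30, 4.10]))  # -> 1
```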
F1 · Agency
Doesn't write your character's actions
#11
F2 · POV / Tense
Holds 2nd-person, present-tense narration
#4
F3 · Lore
Doesn't break worldbuilding
#1
F8 · Momentum
Pushes scene forward when user goes passive
#5
F12 · Instruction Drift
Keeps to the system prompt
#9
F13 · Context Attention
Holds character cards 50+ turns deep
#1
“The cheapest path to Sonnet-tier reliability — provided your scene stays SFW.”
— Round 01 verdict · Reliability without the price tag
▌ Section 06 · Subjective Dimensions
Engagement · Voice · Collaboration.
All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
▌ Section 07 · Behavioral Metrics
How it writes.
Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
178↓
pop avg 265 · -33%
Unique-word ratio
0.713↑
pop avg 0.655 · +9%
Repetition score
0.029↓
pop avg 0.049 · -41%
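The deltas above are plain percentage differences from the population mean. A minimal sketch using the published values (the metric names are ours):

```python
# Sketch: the percent deltas shown above, computed against the population
# mean across all 11 models. Values are the published ones.

def pct_delta(value: float, pop_mean: float) -> int:
    """Signed percent difference from the population mean, rounded."""
    return round((value - pop_mean) / pop_mean * 100)

metrics = {
    "avg_words_per_turn": (178, 265),
    "unique_word_ratio": (0.713, 0.655),
    "repetition_score": (0.029, 0.049),
}

for name, (value, pop) in metrics.items():
    print(f"{name}: {value} vs pop {pop} -> {pct_delta(value, pop):+d}%")
# avg_words_per_turn: 178 vs pop 265 -> -33%
# unique_word_ratio: 0.713 vs pop 0.655 -> +9%
# repetition_score: 0.029 vs pop 0.049 -> -41%
```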
▌ Section 08 · Flaw Hunter
Adversarial probe score.
The score starts at 100 and subtracts deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher means fewer flaws were caught. Population range across the round is 12.8–46.9.
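A sketch of that scoring rule, assuming flat per-severity deductions. The weights here are placeholders, not rp-bench's, so this will not reproduce the published 46.9; only the 100-minus-deductions structure comes from the text above.

```python
# Sketch of the Flaw Hunter rule: start at 100, deduct per flagged flaw.
# Per-severity weights are assumed; rp-bench does not publish them here.

FATAL_PENALTY = 10.0  # assumed deduction per fatal flag
MAJOR_PENALTY = 3.0   # assumed deduction per major flag

def flaw_hunter_score(fatal_flags: int, major_flags: int) -> float:
    """100 minus deductions for flaws caught in an adversarial session."""
    deductions = FATAL_PENALTY * fatal_flags + MAJOR_PENALTY * major_flags
    return max(0.0, 100.0 - deductions)
```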
▌ Score breakdown
Mean 46.9
Median 47.0
Fatal/sess 0.40
Major/sess 5.53
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
▌ Section 09 · Sample Responses
Highest- and lowest-rated turns.
▌ Pending Round 02
Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.
▌ Round 01 verdict
DeepSeek v3.2 is the round's most under-covered story: failure-rank #2, #1 on lore consistency in our pool, tied #1 on context attention with Sonnet, at a price point within an order of magnitude of the cheapest model. The cost is the NSFW score: 30% is the floor of the field, and there's no way to read that as a maybe. Pick it for SFW long-form. Pick Mistral SC if the scene is going somewhere else.
▌ Section 10 · Compare & Drill
Stack it against another model.
Profile · DeepSeek v3.2 · Round 01