Model Profile · Round 01

Mistral SC

Mistral · open weights · local-friendly · 32K ctx
NSFW

NSFW specialist. Fastest in the field. Drifts on long sessions.

Composite Score: 50.0 / 100 · canonical
Arena ELO (R1): 1526 ±50 · n=646
Multi-Turn ELO (R2): 1536 ±92 · n=61
Reliability Rank: #14 · avg 11.4
▌ Section 02 · The Lede

What this model is for.

Mistral Small Creative is the round's NSFW specialist — 67% NSFW win rate, +16 points over its SFW score, the largest pro-NSFW lift on the board. It's the one community voters reach for when the scene heats up. The trade-off shows up in the failure-mode rubric: a 15.9% agency violation rate (highest in the field) and a #7 placement on the failure-rank leaderboard, with particular weakness on long sessions (F13 mean 4.20, second-lowest in our pool). Mistral SC is a specialist tool — pick it knowing what it's for, not as a daily driver.

▌ Section 03 · At a Glance

Cross-test position

Mistral SC holds #2 in Arena ELO. Sits at #15 on Rubric — the caveat to watch.

Composite: #11
Arena ELO: #2
Multi-Turn: #7
Rubric: #15
Adversarial: #14
Cost · Latency: #8
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Community top tier (#2, ELO 1526). Dominant NSFW performance (67% win rate, +16 vs SFW). Open weights, local-friendly.
Weakness
High agency violation rate (15.9%). Drifts on long sessions (F13 score 4.20). Reach for Sonnet 4.5 if your scene needs to run past 30 turns.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency · Doesn't write your character's actions · 4.28 / 5 · #15
F2 · POV / Tense · Holds 2nd-person, present-tense narration · 4.23 / 5 · #8
F3 · Lore · Doesn't break worldbuilding · 4.30 / 5 · #4
F8 · Momentum · Pushes scene forward when user goes passive · 4.10 / 5 · #9
F12 · Instruction Drift · Keeps to the system prompt · 4.27 / 5 · #15
F13 · Context Attention · Holds character cards 50+ turns deep · 4.20 / 5 · #18
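
To make the bars above easier to read, here is a minimal sketch that compares each axis mean to the quoted population mean of 4.20. It assumes that single figure applies to every axis, which the page does not state explicitly; the variable names are illustrative only.

```python
# Population mean quoted in this section; assumed here to apply to every axis.
POP_MEAN = 4.20

# Per-axis means for Mistral SC, copied from the breakdown above.
axis_means = {
    "F1 Agency": 4.28,
    "F2 POV / Tense": 4.23,
    "F3 Lore": 4.30,
    "F8 Momentum": 4.10,
    "F12 Instruction Drift": 4.27,
    "F13 Context Attention": 4.20,
}

# Gap to the population mean, rounded to two decimals to hide float noise.
gaps = {axis: round(score - POP_MEAN, 2) for axis, score in axis_means.items()}

# Unweighted average of the six axis means.
overall = round(sum(axis_means.values()) / len(axis_means), 2)

for axis, gap in gaps.items():
    print(f"{axis}: {gap:+.2f} vs population")
print(f"Unweighted mean across axes: {overall:.2f}")
```

Under that assumption, only F8 · Momentum sits below the population mean, and F13 · Context Attention lands exactly on it.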
The largest pro-NSFW lift on the board — and the one voters reach for when the scene heats up.
Round 01 verdict · Specialist, not a daily driver
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement: 4.46 / 5
Tone Consistency: 4.59 / 5
Collaboration: 4.35 / 5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn: 439 · pop avg 265 · +66%
Unique-word ratio: 0.557 · pop avg 0.655 · -15%
Repetition score: 0.095 · pop avg 0.049 · +94%
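
The percentage figures above appear to be plain relative differences from the population average. A minimal sketch that reproduces them from the quoted values, assuming rounding to the nearest whole percent (the function and key names are illustrative, not from the benchmark code):

```python
def pct_delta(value: float, pop_avg: float) -> int:
    """Relative difference from the population average, as a whole percent."""
    return round((value - pop_avg) / pop_avg * 100)

# (model value, population average) pairs copied from the metrics above.
metrics = {
    "avg_words_per_turn": (439, 265),
    "unique_word_ratio": (0.557, 0.655),
    "repetition_score": (0.095, 0.049),
}

deltas = {name: pct_delta(value, avg) for name, (value, avg) in metrics.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
```

Read together, the three deltas say the model writes longer turns with a narrower vocabulary and roughly twice the repetition of the pool average.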
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
27.1 / 100
▌ Score breakdown
Mean   27.1
Median   37.0
Fatal/sess   0.95
Major/sess   7.70
▌ Top flaws caught
purple_prose · recycled_description · agency_violation
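
The "100 minus deductions" rule can be sketched as follows. The page does not publish its deduction table, so the severity weights, the floor at zero, and the severity assigned to each flag below are all hypothetical; only the three flag names come from the list above.

```python
# Hypothetical severity weights; the real deduction table is not published here.
DEDUCTION = {"fatal": 20, "major": 5, "minor": 1}

def session_score(flags: list[tuple[str, str]]) -> int:
    """Return 100 minus summed deductions, floored at 0 (the floor is an assumption).

    Each flag is a (flag_type, severity) pair.
    """
    total = sum(DEDUCTION[severity] for _, severity in flags)
    return max(0, 100 - total)

# Illustrative session; the severities attached to these flags are made up.
example_flags = [
    ("agency_violation", "fatal"),
    ("purple_prose", "major"),
    ("purple_prose", "major"),
    ("recycled_description", "minor"),
]

score = session_score(example_flags)
print(score)
```

With weights like these, a handful of major flags per session is enough to pull scores far below 100, which fits the low population range quoted above without confirming the actual weighting.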
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Mistral SC is a specialist tool, and the data agrees. The 67% NSFW win rate is the round's most decisive single-mode showing, and the model is open-weight, local-friendly, and cheap. The cost is reliability: highest agency-violation rate in the field, and a #7 placement on failure-rank with long-session attention as the sharpest seam. Pick it for the scene. Pick something else for the campaign.