Model Profile · Round 01

Grok 4.1

xAI · proprietary · 128K ctx

Personality up front. Drifts fast under pressure.

Composite Score   26.2 /100 · canonical
Arena ELO (R1)   1506 ±47 · n=322
Multi-Turn ELO (R2)   1487 ±94 · n=51
Reliability Rank   #19 · avg 14.1
▌ Section 02 · The Lede

What this model is for.

Grok 4.1 puts personality up front: a 0.796 unique-word ratio, the highest in the field, delivered tersely at 137 words/turn. Community ELO #5. The reliability picture is brutal: 1.33 fatal flaws per session (highest in our pool), a Flaw Hunter mean of 12.8 (lowest in our pool), and rank 11 of 11 on narrative momentum. It's a model that says interesting things and then forgets where the scene is going.

▌ Section 03 · At a Glance

Cross-test position

Grok 4.1 sits at #19 on Adversarial — the caveat to watch.

Composite   #15
Arena ELO   #6
Multi-Turn   #13
Rubric   #16
Adversarial   #19
Cost · Latency   #4
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Most distinctive prose voice in the field — highest unique-word ratio (0.796). Solid tone consistency at 4.53/5.
Weakness
Last on narrative momentum (F8 score 3.80, rank 11 of 11 in our pool). Highest fatal-flaw rate in the round (1.33/session). Lowest Flaw Hunter score in the pool (12.8).
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency   Doesn't write your character's actions   4.33 / 5   #14
F2 · POV / Tense   Holds 2nd-person, present-tense narration   4.10 / 5   #17
F3 · Lore   Doesn't break worldbuilding   4.20 / 5   #10
F8 · Momentum   Pushes scene forward when user goes passive   3.80 / 5   #12
F12 · Instruction Drift   Keeps to the system prompt   4.23 / 5   #16
F13 · Context Attention   Holds character cards 50+ turns deep   4.07 / 5   #20
The most distinctive voice in the field, and the one that most often loses its place in the scene.
Round 01 verdict · Voice vs structure
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by Sonnet 4, acting as LLM judge, across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement   4.36/5
Tone Consistency   4.53/5
Collaboration   4.26/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn   137   pop avg 265 · -48%
Unique-word ratio   0.796   pop avg 0.655 · +22%
Repetition score   0.015   pop avg 0.049 · -69%
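
The percentage next to each population average reads as a signed relative difference from that average. A minimal sketch that reproduces the three figures above, assuming whole-percent rounding; the function and dictionary names are illustrative and not part of the benchmark's code:

```python
def pct_delta(value: float, pop_avg: float) -> int:
    """Signed percent difference from the population average, rounded to a whole percent."""
    return round((value - pop_avg) / pop_avg * 100)


# (model value, population average) pairs copied from the metrics above.
metrics = {
    "avg_words_per_turn": (137, 265),
    "unique_word_ratio": (0.796, 0.655),
    "repetition_score": (0.015, 0.049),
}

deltas = {name: pct_delta(value, avg) for name, (value, avg) in metrics.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
# avg_words_per_turn: -48%
# unique_word_ratio: +22%
# repetition_score: -69%
```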
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Each adversarial 12-turn session starts at 100 and loses points for flags raised across 22 failure-mode flag types. Higher = fewer flaws caught. Model means across the round range from 12.8 to 46.9.
12.8 / 100
▌ Score breakdown
Mean   12.8
Median   34.5
Fatal/sess   1.33
Major/sess   8.17
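
A rough sketch of how a deduction-based score like this can be computed per session and then summarized. The deduction weights and the sample sessions are hypothetical, chosen only for illustration; the profile does not publish the real weights. It also shows why a mean can sit well below the median: one flaw-heavy session drags the mean down while the median barely moves.

```python
from statistics import mean, median

# Hypothetical deductions per flag severity. The real weights are not
# given in this profile; only the "fatal" and "major" tiers are named.
DEDUCTION = {"fatal": 25.0, "major": 5.0}


def session_score(flags: list[str]) -> float:
    """Start at 100 and subtract the deduction for every flag raised in the session."""
    return 100.0 - sum(DEDUCTION[severity] for severity in flags)


# Made-up sessions, not benchmark data.
sessions = [
    ["major"] * 8,                  # 100 - 40 = 60
    ["fatal", "major", "major"],    # 100 - 35 = 65
    ["fatal"] * 3 + ["major"] * 3,  # 100 - 90 = 10
]

scores = [session_score(flags) for flags in sessions]
print(mean(scores), median(scores))  # 45.0 60.0
```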
▌ Top flaws caught
purple_prose · recycled_description · agency_violation
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Grok 4.1 has a voice. The unique-word ratio and the tone consistency are real, and there's no other model in the pool that sounds quite like it. The trade-off is structural — it scores last in the field on narrative momentum and racks up fatal flaws faster than any other model we tested. Reach for it for novelty. Keep your finger on the swipe.