Model Profile · Round 01

Grok 4.1

xAI · proprietary · 128K ctx

Personality up front. Drifts fast under pressure.

Composite Score   11.9/100 · canonical
Arena ELO (R1)   1506 ±47 · n=322
Multi-Turn ELO (R2)   1452 ±39 · n=219
Reliability Rank   #20 · avg 15.0

▌ Section 02 · The Lede

What this model is for.

Grok 4.1 brings personality up front: a 0.796 unique-word ratio, the highest in the field, in terse, sharp turns averaging 137 words. Community ELO #5. The reliability picture is brutal: 1.33 fatal flaws per session (highest in our pool), Flaw Hunter mean 12.8 (lowest in our pool), and rank 11 of 11 on narrative momentum. It's a model that says interesting things and then forgets where the scene is going.

▌ Section 03 · At a Glance

Cross-test position

Grok 4.1 sits at #20 on Adversarial — the caveat to watch.

Chart axes: Composite, Arena ELO, Multi-Turn, Rubric, Adversarial, Cost · Latency

▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

▲ Strength

Most distinctive prose voice in the field — highest unique-word ratio (0.796). Solid tone consistency at 4.53/5.

▼ Weakness

Last on narrative momentum (F8 score 3.80, rank 11 of 11 in our pool). Highest fatal-flaw rate in the round (1.33/session). Lowest Flaw Hunter score in the pool (12.8).

▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.

F1 · Agency · Doesn't write your character's actions · 4.33/5 · #15
F2 · POV / Tense · Holds 2nd-person, present-tense narration · 4.10/5 · #18
F3 · Lore · Doesn't break worldbuilding · 4.20/5 · #11
F8 · Momentum · Pushes scene forward when user goes passive · 3.80/5 · #13
F12 · Instruction Drift · Keeps to the system prompt · 4.23/5 · #16
F13 · Context Attention · Holds character cards 50+ turns deep · 4.07/5 · #21
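The per-axis comparison above can be sketched in a few lines. This is only an illustration built from the scores listed in this profile; it assumes the 4.20 population mean applies to every axis, and the names `AXIS_SCORES` and `summarize` are hypothetical, not part of the rp-bench tooling.

```python
# Per-axis means copied from this profile (scale 1-5, higher is better).
AXIS_SCORES = {
    "F1 Agency": 4.33,
    "F2 POV/Tense": 4.10,
    "F3 Lore": 4.20,
    "F8 Momentum": 3.80,
    "F12 Instruction Drift": 4.23,
    "F13 Context Attention": 4.07,
}

# Assumption: one population mean (4.20) is shared by all six axes.
POPULATION_MEAN = 4.20


def summarize(scores, pop_mean):
    """Return the model's mean across axes and the axes strictly below pop_mean."""
    overall = sum(scores.values()) / len(scores)
    below = [axis for axis, score in scores.items() if score < pop_mean]
    return round(overall, 2), below


overall, below = summarize(AXIS_SCORES, POPULATION_MEAN)
print(f"mean across axes: {overall}")
print("below population mean:", ", ".join(below))
```

Under that assumption, three of the six axes (POV/Tense, Momentum, Context Attention) fall below the population mean, with Momentum the furthest behind.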

“The most distinctive voice in the field — and the most frequent to lose its place in the scene.”

— Round 01 verdict · Voice vs structure

▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.

Engagement   4.36/5
Tone Consistency   4.53/5
Collaboration   4.26/5

▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.

Avg words / turn   137↓   pop avg 265 · -48%
Unique-word ratio   0.796↑   pop avg 0.657 · +21%
Repetition score   0.015↓   pop avg 0.048 · -69%
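As a rough check on the deltas above, the sketch below recomputes each percentage from the reported value and population average. It also shows one plausible way a unique-word ratio could be computed (distinct words over total words); that definition and the tokenizer are assumptions, since the profile does not say how the benchmark measures it. The function names are hypothetical.

```python
import re


def unique_word_ratio(text):
    """Distinct words / total words, using a simple lowercase tokenizer (an assumption)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def pct_delta(value, pop_avg):
    """Signed percent difference of value relative to the population average."""
    return round((value - pop_avg) / pop_avg * 100)


# (model value, population average) pairs copied from this profile.
METRICS = {
    "Avg words / turn": (137, 265),
    "Unique-word ratio": (0.796, 0.657),
    "Repetition score": (0.015, 0.048),
}

for name, (value, pop_avg) in METRICS.items():
    print(f"{name}: {value} vs pop avg {pop_avg} ({pct_delta(value, pop_avg):+d}%)")
```

The recomputed deltas match the figures shown in the profile: -48%, +21%, and -69%.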

▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.

12.8 / 100

▌ Score breakdown

Mean   12.8
Median   34.5
Fatal/sess   1.33
Major/sess   8.17
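A "100 minus deductions" score can be sketched as follows. The severity weights and the session data here are invented for illustration; the profile does not publish the actual deduction per flag type, and the sketch assumes scores are not clamped at zero. It also shows how a single flaw-heavy session can pull the mean well below the median.

```python
from statistics import mean, median

# Hypothetical deductions per flag, by severity. Not the benchmark's real weights.
SEVERITY_WEIGHTS = {"fatal": 25.0, "major": 5.0, "minor": 1.0}


def flaw_hunter_score(flag_counts):
    """Return 100 minus weighted deductions for one session (unclamped, by assumption)."""
    deductions = sum(SEVERITY_WEIGHTS[sev] * n for sev, n in flag_counts.items())
    return 100.0 - deductions


# Made-up sessions: three ordinary ones and one with several fatal flaws.
sessions = [
    {"fatal": 0, "major": 12, "minor": 5},
    {"fatal": 0, "major": 13, "minor": 1},
    {"fatal": 1, "major": 8, "minor": 2},
    {"fatal": 4, "major": 14, "minor": 0},
]

scores = [flaw_hunter_score(s) for s in sessions]
print("scores:", scores)
print("mean:", mean(scores), "median:", median(scores))
```

In this toy data the median stays in the mid-30s while the mean drops to single digits, the same kind of gap as the 12.8 mean and 34.5 median reported above.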

▌ Top flaws caught

purple_prose · recycled_description · agency_violation

▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict

Grok 4.1 has a voice. The unique-word ratio and the tone consistency are real, and there's no other model in the pool that sounds quite like it. The trade-off is structural — it scores last in the field on narrative momentum and racks up fatal flaws faster than any other model we tested. Reach for it for novelty. Keep your finger on the swipe.
