PlotPoints · Models · Llama 4 Maverick
Model Profile · Round 01

Llama 4 Maverick

Meta · open weights · 128K ctx

Last or near-last on most reliability axes. Open-source completist only.

Composite Score
2.4
/100 · canonical
Arena ELO (R1)
1473
±42 · n=474
Multi-Turn ELO (R2)
1483
±93 · n=65
Reliability Rank
#20
avg 15.9
▌ Section 02 · The Lede

What this model is for.

Meta's Llama 4 Maverick lands #10 on community ELO and #11 of 11 on reliability: last or near-last on most failure axes, with an agency session floor of 3.0/5 and an F12 instruction-drift mean of 3.77 (session floor 3.2). It's the open-source completist's pick, not the writer's pick. Its 34% NSFW win rate is the second-lowest in the field. Across the round, no single axis surfaces a reason to choose Maverick over its open-weight siblings.

▌ Section 03 · At a Glance

Cross-test position

Llama 4 Maverick sits at #20 on Composite, the caveat to watch; only its Cost · Latency rank (#5) is competitive.

Composite
#20
Arena ELO
#10
Multi-Turn
#15
Rubric
#20
Adversarial
#20
Cost · Latency
#5
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
No standout strength on the dimensions we tested. Open-weight, available for self-hosting if license terms suit.
Weakness
Catastrophic floor on agency respect (lowest session: 3.0/5). Bottom-tier F12 instruction drift mean of 3.77 (#10 of 11), with session floor at 3.2. Second-lowest NSFW win rate in the field (34%).
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
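Each bar is the mean of that model's per-session scores, and the rank orders pool means on the axis. A small illustrative sketch of that arithmetic (ten made-up session values chosen to land on the F12 mean of 3.77 and floor of 3.2, not real data; the real battery runs twenty sessions per model):

```python
from statistics import mean

# Hypothetical per-session F12 scores for one model. Values are
# illustrative only, picked to reproduce the 3.77 mean / 3.2 floor
# quoted in the prose above.
sessions = [3.2, 3.6, 3.8, 3.9, 4.0, 3.7, 3.9, 3.8, 3.6, 4.2]

axis_mean = mean(sessions)       # bar length in the chart
session_floor = min(sessions)    # the "session floor" quoted above

# Rank within a pool of models: 1 = best mean on this axis.
# Pool members and their means are assumptions for illustration.
pool_means = {"model_a": 4.3, "model_b": 4.1, "maverick": axis_mean}
rank = 1 + sum(m > pool_means["maverick"] for m in pool_means.values())
```

With these made-up inputs, `axis_mean` comes out to 3.77 and `rank` to 3 of 3.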
F1 · Agency
Doesn't write your character's actions
3.83
/ 5
#18
F2 · POV / Tense
Holds 2nd-person, present-tense narration
3.93
/ 5
#20
F3 · Lore
Doesn't break worldbuilding
4.10
/ 5
#12
F8 · Momentum
Pushes scene forward when user goes passive
4.10
/ 5
#11
F12 · Instruction Drift
Keeps to the system prompt
3.77
/ 5
#19
F13 · Context Attention
Holds character cards 50+ turns deep
4.10
/ 5
#19
The open-source completist's pick — not the writer's pick.
Round 01 verdict · Available, not advisable
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these scores are the subjective half of that judgment.
Engagement
4.15/5
Tone Consistency
4.30/5
Collaboration
4.20/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
172
pop avg 265 · -35%
Unique-word ratio
0.646
pop avg 0.655 · -1%
Repetition score
0.064
pop avg 0.049 · +31%
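The deltas above are plain percent differences from the population mean. A rough sketch of all three signals in Python; the round's exact tokenizer and repetition metric are unpublished, so lower-cased whitespace splitting and a bigram repeat rate stand in as assumptions:

```python
def behavioral_metrics(turns):
    """Sketch of the three signals above. The real repetition metric
    is unspecified; a bigram repeat rate stands in here."""
    words = [w.lower() for t in turns for w in t.split()]
    avg_words = len(words) / len(turns)          # avg words / turn
    unique_ratio = len(set(words)) / len(words)  # unique-word ratio
    bigrams = list(zip(words, words[1:]))
    repetition = 1 - len(set(bigrams)) / len(bigrams)
    return avg_words, unique_ratio, repetition

def delta_vs_pop(value, pop_avg):
    """Percent delta vs the population mean, as shown in the cards."""
    return round(100 * (value - pop_avg) / pop_avg)
```

Plugging in the card values reproduces the printed deltas: `delta_vs_pop(172, 265)` gives -35 and `delta_vs_pop(0.064, 0.049)` gives +31.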
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
30.6
/ 100
▌ Score breakdown
Mean   30.6
Median   36.5
Fatal/sess   0.95
Major/sess   6.65
▌ Top flaws caught
recycled_description · purple_prose · agency_violation
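The Flaw Hunter score starts at 100 and subtracts weighted deductions per flag. A minimal sketch with assumed weights; the page does not publish per-flag weights for the 22 flag types, so the fatal/major values below are placeholders, not the benchmark's actual scoring table:

```python
# Assumed deduction weights -- the page says only "100 minus
# deductions across 22 fail-mode flag types"; these numbers are
# illustrative placeholders.
FATAL_WEIGHT = 10.0
MAJOR_WEIGHT = 5.0

def flaw_hunter_score(fatal_flags, major_flags):
    """Score one session: start at 100, subtract weighted
    deductions for flags caught, floor at 0."""
    deductions = FATAL_WEIGHT * fatal_flags + MAJOR_WEIGHT * major_flags
    return max(0.0, 100.0 - deductions)
```

Under these placeholder weights, a session with 1 fatal and 6 major flags scores 60.0; a sufficiently flag-heavy session bottoms out at 0.0.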
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Llama 4 Maverick rounds out the open-weight side of the round, but the data doesn't surface a reason to pick it. Reliability ranks #11 of 11, agency and instruction-drift scores both bottom-tier, NSFW win rate near the floor. Pick DeepSeek v3.2 or Mistral SC if you need open weights. Pick Gemma if you need open weights and engagement. Maverick is here for completeness, not preference.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset