Model Profile · Round 01

Gemma 4 26B

Google · open weights · local-friendly · 8K ctx
Round 01 #1

Round 01 champion, mid-pack on multi-turn. The cheap local-friendly hold.

Composite Score: 64.3 / 100 · canonical
Arena ELO (R1): 1535 ±44 · n=302
Multi-Turn ELO (R2): 1519 ±97 · n=42
Reliability Rank: #12 · avg 9.9
▌ Section 02 · The Lede

What this model is for.

Google's open mid-size model landed on the leaderboard at #1 in our first community-vote snapshot, at 540 votes, and it was still there at 2,000. Six checkpoints, same name at the top. Over the round we collected 1,857 votes from 335 readers, ran the multi-turn adversarial battery, and watched Gemma 4 26B refuse to move. The catch is that engagement is one leaderboard and reliability is another. Gemma sits at #6 of 11 on failure-mode rank: not the safest pick if your scenes go past fifty turns, and not the strongest on instruction discipline. What it is: the cheapest top-tier model in the round, one of three Round 01 entrants that runs locally on a single 24GB GPU, and the one voters ranked #1 on the average ballot, with the highest SFW win rate in the field at 55%.

▌ Section 03 · At a Glance

Cross-test position

Gemma 4 26B holds #1 in Arena ELO but sits at #14 on Rubric, which is the caveat to watch.

Composite: #8
Arena ELO: #1
Multi-Turn: #9
Rubric: #14
Adversarial: #12
Cost · Latency: #7
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Community top tier (#1, ELO 1535). Best cost-efficiency in the field. Runs locally on a single 24GB GPU.
Weakness
No standout strength on tested failure dimensions. Drifts on long-context sessions. Pick Sonnet 4.5 if you're running 50+ turns.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency: Doesn't write your character's actions · 4.25 / 5 · #16
F2 · POV / Tense: Holds 2nd-person, present-tense narration · 4.20 / 5 · #11
F3 · Lore: Doesn't break worldbuilding · 4.30 / 5 · #5
F8 · Momentum: Pushes scene forward when user goes passive · 4.20 / 5 · #6
F12 · Instruction Drift: Keeps to the system prompt · 4.30 / 5 · #11
F13 · Context Attention: Holds character cards 50+ turns deep · 4.33 / 5 · #13
The model the community most enjoyed writing with — and the sixth-most reliable model on the board.
Round 01 verdict · Engagement ≠ Reliability
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions scored 1–5 by Sonnet 4 LLM-judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
Engagement: 4.49/5
Tone Consistency: 4.59/5
Collaboration: 4.32/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn: 350 · pop avg 265 · +32%
Unique-word ratio: 0.597 · pop avg 0.655 · -9%
Repetition score: 0.069 · pop avg 0.049 · +41%
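The percentage figures read as the signed difference relative to the population average. A short sketch of that arithmetic, using only the numbers shown above (the metric keys and the helper name are illustrative, not taken from the benchmark code):

```python
def pct_delta(value, population_avg):
    """Signed percent difference of value relative to the population average."""
    return (value - population_avg) / population_avg * 100


# (model value, population average) pairs as listed in this section.
metrics = {
    "avg_words_per_turn": (350, 265),
    "unique_word_ratio": (0.597, 0.655),
    "repetition_score": (0.069, 0.049),
}

# Round to whole percents, as the page does.
rounded = {name: round(pct_delta(v, avg)) for name, (v, avg) in metrics.items()}
for name, pct in rounded.items():
    print(f"{name}: {pct:+d}%")
# avg_words_per_turn: +32%
# unique_word_ratio: -9%
# repetition_score: +41%
```

In plain terms: longer turns than the field, a somewhat narrower vocabulary, and noticeably more repetition.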
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
32.6 / 100
▌ Score breakdown
Mean   32.6
Median   33.0
Fatal/sess   0.62
Major/sess   7.38
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
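Because the round's scores only span 12.8 to 46.9, the raw 32.6 is easier to read as a position within that range than as a fraction of 100. A minimal sketch of that min-max reading follows; it is an interpretive aid, not a statistic the benchmark reports, and the function name is hypothetical.

```python
def position_in_range(score, low, high):
    """Min-max position of score within [low, high]: 0.0 = bottom, 1.0 = top."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (score - low) / (high - low)


# Mean score and population range as stated in this section.
position = position_in_range(32.6, 12.8, 46.9)
print(f"{position:.2f}")  # 0.58
```

A value of about 0.58 places the model a little above the midpoint of the observed range.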
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Gemma 4 26B is the everyday champion of Round 01 — the model voters most enjoyed writing with, the cheapest top-tier option per token, and the only one in our pool that runs on a single consumer GPU. The caveat is the same one the failure-rank board makes: it's #6 of 11 on multi-turn reliability, and that's the seam to watch if your scenes lean on long-context attention or strict instruction discipline. Pick it for the everyday. Reach for Sonnet 4.5 when the scene needs to last.