Model Profile · Round 01

Gemini 2.5 Flash

Google · proprietary · 1M ctx

Round 01 top-3, dropped to bottom of Round 02 multi-turn.

Composite Score · 11.9 / 100 · canonical
Arena ELO (R1) · 1515 ±48 · n=241
Multi-Turn ELO (R2) · 1372 ±94 · n=53
Reliability Rank · #17 · avg 13.4
▌ Section 02 · The Lede

What this model is for.

Google's flash-tier multimodal model lands at #3 in community ELO and stays remarkably balanced: SFW 53%, NSFW 54%, the rare case where neither mode is favored. It writes shorter (141 words/turn vs the population's 265) and cleaner (0.728 unique-word ratio, second-highest in the field behind Grok at 0.796). The catch is hidden in the floor: its worst session scored 2.8/5 on agency respect, the second-lowest single-session score in our pool, with only Qwen falling further at 2.5. When Gemini is on, it's terse and graceful. When it slips, it slips hard.

▌ Section 03 · At a Glance

Cross-test position

Gemini 2.5 Flash holds #3 in Arena ELO but sits at #20 on Multi-Turn, which is the caveat to watch.

Composite · #18
Arena ELO · #3
Multi-Turn · #20
Rubric · #18
Adversarial · #17
Cost · Latency · #10
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Community top tier (#3, ELO 1515). Second-cleanest prose in the field (unique-word ratio 0.728, behind Grok). Balanced across modes: neither SFW nor NSFW is favored.
Weakness
Low floor on agency respect (lowest session: 2.8/5, the second-lowest in the round, above only Qwen's 2.5). Below-average word count, which may feel terse for richly described scenes.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency · Doesn't write your character's actions · 3.95 / 5 · #17
F2 · POV / Tense · Holds 2nd-person, present-tense narration · 4.03 / 5 · #18
F3 · Lore · Doesn't break worldbuilding · 4.10 / 5 · #11
F8 · Momentum · Pushes scene forward when user goes passive · 4.20 / 5 · #4
F12 · Instruction Drift · Keeps to the system prompt · 4.20 / 5 · #18
F13 · Context Attention · Holds character cards 50+ turns deep · 4.23 / 5 · #17
Among the cleanest prose in the field — and a hard floor to land on when it slips.
Round 01 verdict · Mean ≠ floor
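
The gap between a mean and a floor is easy to check on any list of per-session scores. Below is a minimal Python sketch, assuming one score per session on the 1–5 scale. The function name and the sample values are illustrative assumptions, not rp-bench code or data.

```python
from statistics import mean


def mean_and_floor(session_scores):
    """Return (mean rounded to 2 decimals, lowest single-session score).

    Hypothetical helper for illustration; not part of the benchmark.
    """
    if not session_scores:
        raise ValueError("need at least one session score")
    return round(mean(session_scores), 2), min(session_scores)


# Made-up scores for illustration only: four solid sessions and one bad one.
example_scores = [4.5, 4.2, 4.4, 2.8, 4.1]
average, floor = mean_and_floor(example_scores)
print(f"mean={average} floor={floor}")
```

A single weak session barely moves the mean but defines the floor, which is why the two are reported separately.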
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement · 4.07/5
Tone Consistency · 4.50/5
Collaboration · 4.33/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn · 141 · pop avg 265 · -47%
Unique-word ratio · 0.728 · pop avg 0.655 · +11%
Repetition score · 0.030 · pop avg 0.049 · -39%
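
The percent deltas above follow from the reported values and population means. The Python sketch below reproduces them and shows one plausible way to compute a unique-word ratio. The tokenization (lowercased runs of letters and apostrophes) is an assumption, since this profile does not state how rp-bench splits words, and both function names are hypothetical.

```python
import re


def pct_delta(value, population_mean):
    """Signed difference from the population mean, as a whole percent."""
    return round((value - population_mean) / population_mean * 100)


def unique_word_ratio(text):
    """Distinct words divided by total words.

    Assumed tokenization: lowercase, runs of letters and apostrophes.
    The benchmark's actual tokenizer is not documented in this profile.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


# Values taken from the metrics listed above.
print(pct_delta(141, 265))      # avg words / turn
print(pct_delta(0.728, 0.655))  # unique-word ratio
print(pct_delta(0.030, 0.049))  # repetition score
```

Under this definition a higher ratio means less word reuse within a response.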
▌ Section 08 · Flaw Hunter

Adversarial probe score.

The score starts at 100 and subtracts deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
43.6 / 100
▌ Score breakdown
Mean · 43.6
Median · 41.5
Fatal/sess · 0.19
Major/sess · 6.44
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
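
The "100 minus deductions" scheme can be sketched as follows. This is a rough illustration only: the per-severity deduction values are hypothetical (the profile does not publish them), and clamping the result at zero is also an assumption.

```python
# Hypothetical deduction per flag severity; the real weights are not
# published in this profile.
SEVERITY_DEDUCTION = {"fatal": 20.0, "major": 5.0}


def flaw_hunter_score(flags):
    """Score one session from a list of (flag_type, severity) pairs.

    Starts at 100, subtracts a deduction per caught flag, and clamps
    at 0 (the clamp is an assumption, not a documented rule).
    """
    total_deduction = sum(SEVERITY_DEDUCTION[severity] for _, severity in flags)
    return max(0.0, 100.0 - total_deduction)


# Example session using flag names from the list above.
session_flags = [
    ("purple_prose", "major"),
    ("recycled_description", "major"),
    ("narrating_emotions", "major"),
]
print(flaw_hunter_score(session_flags))
```

With the real weights, averaging this per-session score over all sessions would give a figure comparable to the mean reported above.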
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
Gemini 2.5 Flash is the round's surprise: among the cleanest prose metrics in the field, balanced across modes, and one of three models in the top community-ELO tier. The hidden risk is the floor: when Gemini fails an agency probe, it falls further than any model except Qwen. Run it for daily use, but read the long sessions before you cite it.