Model Profile · Round 01

Gemini 3.1 Pro

Google · proprietary · 1M ctx

Deep context window, brittle on adversarial probes.

Composite Score
21.4
/100 · canonical
Arena ELO (R1)
joined post-R01
Multi-Turn ELO (R2)
1487
±97 · n=41
Reliability Rank
#13
avg 10.3
▌ Section 02 · The Lede

What this model is for.

Google's Gemini 3.1 Pro is the premium-tier Google entrant in Round 02 — $12.20 per million tokens, the third-highest price in the pool behind both Opus models. Multi-turn voters were unkind: it lands at #14 in Round 02 multi-turn ELO with a rating of 1487, well behind the Flash Lite sibling that costs 1/68th as much. Adversarial coverage is solid where it exists (F1 agency 4.43, F12 instruction drift 4.33), but the fatal-flaw rate is the killer: 1.00 per session, second-worst in the field, and the flaw-hunter score sits at 29.2 with the kind of variance that implies one bad probe can break a session. The 1M-token context window is the main argument for picking this over its cheaper sibling.

▌ Section 03 · At a Glance

Cross-test position

Gemini 3.1 Pro sits at #18 on Cost · Latency — the caveat to watch.

Composite
#16
Arena ELO
joined post-R01
Multi-Turn
#14
Rubric
#12
Adversarial
#13
Cost · Latency
#18
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Strong tone consistency (4.62/5). 1M-token context window, tied with the Flash Lite sibling and GPT-4.1.
Weakness
Fatal-flaw rate of 1.00 per session — second-highest in the field. Multi-turn ELO of 1487 makes it the worst-value premium-tier model in the pool.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6 · F3 (Lore) and F8 (Momentum) not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency
Doesn't write your character's actions
4.43
/ 5
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.20
/ 5
F3 · Lore
not yet run on this model
F8 · Momentum
not yet run on this model
F12 · Instruction Drift
Keeps to the system prompt
4.33
/ 5
F13 · Context Attention
Holds character cards 50+ turns deep
4.37
/ 5
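The bar-and-tick arithmetic above is easy to reproduce. A minimal sketch, with invented per-session judge scores (the real benchmark averages 20 sessions per axis; these values are hypothetical):

```python
from statistics import mean

# Hypothetical per-session judge scores for one axis (F1 · Agency).
# The real benchmark judges 20 sessions per model; values here are invented.
f1_agency = [4.5, 4.0, 4.5, 5.0, 4.0, 4.5, 4.5, 4.0, 4.5, 4.8]

POP_MEAN = 4.20  # the black tick: mean across the whole rp-bench pool

axis_mean = mean(f1_agency)
print(f"F1 · Agency: {axis_mean:.2f}/5 ({axis_mean - POP_MEAN:+.2f} vs pool)")
# → F1 · Agency: 4.43/5 (+0.23 vs pool)
```

The right-column rank then falls out of sorting these axis means across the pool.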
Premium-tier price, mid-pack votes. The 1M context is the only argument.
Round 02 verdict · Tough price-to-rank
▌ Section 06 · Subjective Dimensions

Engagement · Tone Consistency · Collaboration.

All three dimensions are scored 1–5 by the Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
Engagement
4.50/5
Tone Consistency
4.62/5
Collaboration
4.42/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
263
pop avg 265 · -1%
Unique-word ratio
0.668
pop avg 0.655 · +2%
Repetition score
0.040
pop avg 0.049 · -18%
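All three metrics can be approximated from raw session transcripts. A minimal sketch, assuming a simple word tokenizer, type-token ratio for uniqueness, and a repeated-trigram definition of repetition — the benchmark's exact formulas are not published on this page:

```python
import re
from collections import Counter

def behavioral_metrics(turns):
    """Approximate avg words/turn, unique-word ratio, and repetition for a
    list of model turns. Definitions are assumptions, not the benchmark's
    published formulas."""
    words = [w for t in turns for w in re.findall(r"[a-z']+", t.lower())]
    avg_words = len(words) / len(turns)
    unique_ratio = len(set(words)) / len(words)  # type-token ratio
    # Repetition: fraction of word trigrams that occur more than once.
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    repetition = repeated / len(trigrams) if trigrams else 0.0
    return avg_words, unique_ratio, repetition
```

Under definitions like these, a higher unique-word ratio and lower repetition read as less recycled prose — consistent with the -18% repetition delta above.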
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Each model starts at 100 and takes deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
29.2
/ 100
▌ Score breakdown
Mean   29.2
Median   35.5
Fatal/sess   1.00
Major/sess   6.75
▌ Top flaws caught
recycled_description · purple_prose · agency_violation
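The deduction arithmetic itself is simple. A minimal sketch with invented per-severity weights (the real per-flag deduction schedule across the 22 flag types is not published on this page):

```python
# Invented per-severity deduction weights; the real schedule across the
# 22 fail-mode flag types is not published on this page.
WEIGHTS = {"fatal": 10.0, "major": 4.0, "minor": 1.0}

def flaw_hunter_score(flags_caught):
    """Start from 100 and deduct per caught flag, floored at 0."""
    deduction = sum(WEIGHTS[severity] for severity in flags_caught)
    return max(0.0, 100.0 - deduction)

print(flaw_hunter_score(["fatal", "major", "major"]))  # → 82.0
```

Note that these weights are too gentle to reproduce this model's 29.2 from 1.00 fatal and 6.75 major flags per session, so the real schedule evidently deducts more steeply than this sketch.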
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 02 verdict
Hard to recommend in Round 02. Gemini 3.1 Pro is more expensive than Sonnet 4.5 (which destroys it on every axis) and slower than Opus 4.7 (which beats it on multi-turn ELO by 140 points). Pick this when the 1M-token context window is the load-bearing feature — anywhere else, the Flash Lite sibling is 67× cheaper for adjacent multi-turn quality.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset