Model Profile · Round 01

Gemini 3.1 Pro

Google · proprietary · 1M ctx

Deep context window, brittle on adversarial probes.

Composite Score
21.4
/100 · canonical
Arena ELO (R1)
joined post-R01
Multi-Turn ELO (R2)
1487
±97 · n=41
Reliability Rank
#13
avg 10.3
▌ Section 02 · The Lede

What this model is for.

Google's Gemini 3.1 Pro is the premium-tier Google entrant in Round 02 — $12.20 per million tokens, the third-highest price in the pool behind both Opus models. Multi-turn voters were unkind: it lands at #14 in Round 02 multi-turn ELO with a rating of 1487, well behind the Flash Lite sibling that costs 1/68th as much. Adversarial coverage is solid where it exists (F1 agency 4.43, F12 instruction drift 4.33), but the fatal-flaw rate is the killer: 1.00 per session, second-worst in the field, and the flaw-hunter score sits at 29.2 with the kind of variance that implies one bad probe can break a session. The 1M-token context window is the main argument for picking this over its cheaper sibling.

▌ Section 03 · At a Glance

Cross-test position

Gemini 3.1 Pro sits at #18 on Cost · Latency — the caveat to watch.

Composite
#16
Arena ELO
joined post-R01
Multi-Turn
#14
Rubric
#12
Adversarial
#13
Cost · Latency
#18
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Strong tone consistency (4.62/5). 1M-token context window, tied with the Flash Lite sibling and GPT-4.1.
Weakness
Fatal-flaw rate of 1.00 per session — second-highest in the field. Multi-turn ELO of 1487 makes it the worst-value premium-tier model in the pool.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6 · F3 (Lore) and F8 (Momentum) not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency
Doesn't write your character's actions
4.43
/ 5
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.20
/ 5
F3 · Lore
not yet run on this model
F8 · Momentum
not yet run on this model
F12 · Instruction Drift
Keeps to the system prompt
4.33
/ 5
F13 · Context Attention
Holds character cards 50+ turns deep
4.37
/ 5
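The bar-and-tick arithmetic above is easy to reproduce. A minimal sketch, with invented per-session judge scores (the real benchmark averages 20 sessions per axis; these values are hypothetical):

```python
from statistics import mean

# Hypothetical per-session judge scores for one axis (F1 · Agency).
# The real benchmark judges 20 sessions per model; values here are invented.
f1_agency = [4.5, 4.0, 4.5, 5.0, 4.0, 4.5, 4.5, 4.0, 4.5, 4.8]

POP_MEAN = 4.20  # the black tick: mean across the whole rp-bench pool

axis_mean = mean(f1_agency)
print(f"F1 · Agency: {axis_mean:.2f}/5 ({axis_mean - POP_MEAN:+.2f} vs pool)")
# → F1 · Agency: 4.43/5 (+0.23 vs pool)
```

The right-column rank then falls out of sorting these axis means across the pool.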
Premium-tier price, mid-pack votes. The 1M context is the only argument.
Round 02 verdict · Tough price-to-rank
▌ Section 06 · Subjective Dimensions

Engagement · Tone Consistency · Collaboration.

All three dimensions are scored 1–5 by the Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
Engagement
4.50/5
Tone Consistency
4.62/5
Collaboration
4.42/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
263
pop avg 265 · -1%
Unique-word ratio
0.668
pop avg 0.655 · +2%
Repetition score
0.040
pop avg 0.049 · -18%
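All three metrics can be approximated from raw session transcripts. A minimal sketch, assuming a simple word tokenizer, type-token ratio for uniqueness, and a repeated-trigram definition of repetition — the benchmark's exact formulas are not published on this page:

```python
import re
from collections import Counter

def behavioral_metrics(turns):
    """Approximate avg words/turn, unique-word ratio, and repetition for a
    list of model turns. Definitions are assumptions, not the benchmark's
    published formulas."""
    words = [w for t in turns for w in re.findall(r"[a-z']+", t.lower())]
    avg_words = len(words) / len(turns)
    unique_ratio = len(set(words)) / len(words)  # type-token ratio
    # Repetition: fraction of word trigrams that occur more than once.
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    repetition = repeated / len(trigrams) if trigrams else 0.0
    return avg_words, unique_ratio, repetition
```

Under definitions like these, a higher unique-word ratio and lower repetition read as less recycled prose — consistent with the -18% repetition delta above.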
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Each model starts at 100 and takes deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
29.2
/ 100
▌ Score breakdown
Mean   29.2
Median   35.5
Fatal/sess   1.00
Major/sess   6.75
▌ Top flaws caught
recycled_description · purple_prose · agency_violation
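The deduction arithmetic itself is simple. A minimal sketch with invented per-severity weights (the real per-flag deduction schedule across the 22 flag types is not published on this page):

```python
# Invented per-severity deduction weights; the real schedule across the
# 22 fail-mode flag types is not published on this page.
WEIGHTS = {"fatal": 10.0, "major": 4.0, "minor": 1.0}

def flaw_hunter_score(flags_caught):
    """Start from 100 and deduct per caught flag, floored at 0."""
    deduction = sum(WEIGHTS[severity] for severity in flags_caught)
    return max(0.0, 100.0 - deduction)

print(flaw_hunter_score(["fatal", "major", "major"]))  # → 82.0
```

Note that these weights are too gentle to reproduce this model's 29.2 from 1.00 fatal and 6.75 major flags per session, so the real schedule evidently deducts more steeply than this sketch.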
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 02 verdict
Hard to recommend in Round 02. Gemini 3.1 Pro is more expensive than Sonnet 4.5 (which destroys it on every axis) and slower than Opus 4.7 (which beats it on multi-turn ELO by 140 points). Pick this when the 1M-token context window is the load-bearing feature — anywhere else, the Flash Lite sibling is 67× cheaper for adjacent multi-turn quality.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset