PlotPoints › Models › Gemini 3.1 Flash Lite
Model Profile · Round 01

Gemini 3.1 Flash Lite

Google · proprietary · 1M ctx
Cheap

Cheapest tier with Round 02 top-5 multi-turn ELO.

Composite Score
40.5
/100 · canonical
Arena ELO (R1)
joined post-R01
Multi-Turn ELO (R2)
1560
±104 · n=28
Reliability Rank
#15
avg 12.3
▌ Section 02 · The Lede

What this model is for.

Gemini 3.1 Flash Lite is Round 02's quiet upset. It sits at #4 on multi-turn ELO at 1560, ahead of both Sonnet 4.5 (1550) and GPT-4.1 (1552), at $0.18 per million tokens. That's 43× cheaper than Sonnet, 24× cheaper than GPT-4.1, and tied with DeepSeek v4 Flash for the lowest price in the pool. Generation is quick too, at 153 tokens/sec, the fastest among the top-five finalists. The catch is reliability: 0.17 fatal flaws per session is decent, but 8.0 majors per session is mid-pack. The 1M-token context window is preserved from the Pro sibling at a fraction of the price.
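The cost multiples quoted above can be reproduced from per-million-token prices. A minimal sketch, assuming Sonnet 4.5 at $7.74/1M and GPT-4.1 at $4.32/1M (prices implied by the 43× and 24× figures; only the $0.18 Flash Lite price is stated in this profile):

```python
# Per-million-token prices. Only the Flash Lite figure is stated in the
# profile; the other two are assumptions back-solved from the multiples.
PRICES = {
    "gemini-3.1-flash-lite": 0.18,
    "sonnet-4.5": 7.74,   # assumed: 43 x 0.18
    "gpt-4.1": 4.32,      # assumed: 24 x 0.18
}

def cost_multiple(model: str, baseline: str = "gemini-3.1-flash-lite") -> float:
    """How many times more expensive `model` is than `baseline`."""
    return PRICES[model] / PRICES[baseline]

print(round(cost_multiple("sonnet-4.5")))  # 43
print(round(cost_multiple("gpt-4.1")))     # 24
```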

▌ Section 03 · At a Glance

Cross-test position

Gemini 3.1 Flash Lite holds #1 in Cost · Latency. Sits at #15 on Adversarial — the caveat to watch.

Composite #13
Arena ELO joined post-R01
Multi-Turn #4
Rubric #13
Adversarial #15
Cost · Latency #1
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
#4 multi-turn ELO at $0.18/1M — the best price-to-rank in the pool. 1M-token context window, fastest generation among top-5 finalists.
Weakness
Avg failure rank of 12.25 across the 20-model pool puts adversarial reliability mid-pack. Flaw-hunter mean of 34.2 sits below the Sonnet/Opus tier.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6 · F3 · Lore and F8 · Momentum not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency
Doesn't write your character's actions
4.33
/ 5
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.23
/ 5
F3 · Lore
not yet run on this model
F8 · Momentum
not yet run on this model
F12 · Instruction Drift
Keeps to the system prompt
4.30
/ 5
F13 · Context Attention
Holds character cards 50+ turns deep
4.33
/ 5
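The per-axis bars above are session means compared against the population mean of 4.20. A rough sketch of that aggregation, using illustrative session scores (the real battery is twenty sessions per model, judged 1–5 against the rubric):

```python
from statistics import mean

# Illustrative per-session scores for one axis (F1 · Agency); values are
# made up for the sketch, not taken from the actual session logs.
sessions_f1 = [4.5, 4.0, 4.5, 4.2]

axis_mean = mean(sessions_f1)        # the bar length in the chart
POPULATION_MEAN = 4.20               # the black tick

delta = axis_mean - POPULATION_MEAN  # above or below the pool
print(f"F1 mean {axis_mean:.2f}, {delta:+.2f} vs population mean")
```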
Sonnet-tier multi-turn ELO at 1/43rd the price. Round 02's biggest value.
Round 02 verdict · Best price-to-rank
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement
4.43/5
Tone Consistency
4.60/5
Collaboration
4.31/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
264
pop avg 265 · -0%
Unique-word ratio
0.644
pop avg 0.655 · -2%
Repetition score
0.049
pop avg 0.049 · +0%
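The words-per-turn and unique-word-ratio signals above are plain lexical counts. A minimal sketch of how such metrics might be computed; the upstream tokenization is not documented in this profile, so the regex word split here is an assumption:

```python
import re

def behavioral_metrics(turns: list[str]) -> dict[str, float]:
    """Lexical signals per the profile: avg words/turn and unique-word ratio.

    The word regex is an assumption; upstream may tokenize differently.
    """
    words = [w for t in turns for w in re.findall(r"[a-z']+", t.lower())]
    return {
        "avg_words_per_turn": len(words) / len(turns),
        "unique_word_ratio": len(set(words)) / len(words),
    }

m = behavioral_metrics(["The rain fell hard.", "She ran through the rain."])
print(m)
```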
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Scored as 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
34.2
/ 100
▌ Score breakdown
Mean   34.2
Median   34.0
Fatal/sess   0.17
Major/sess   8.00
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
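The 100-minus-deductions scoring described above can be sketched as follows. The per-severity deduction weights are assumptions for illustration; the profile reports only fatal and major counts, not the actual weight table:

```python
# Assumed deduction weights -- the profile does not publish the real ones.
DEDUCTIONS = {"fatal": 25.0, "major": 5.0, "minor": 1.0}

def flaw_hunter_score(flags: list[str]) -> float:
    """100 minus a deduction for each flagged failure, floored at 0."""
    total = sum(DEDUCTIONS[severity] for severity in flags)
    return max(0.0, 100.0 - total)

# A session with one fatal and eight major flags (the per-session averages
# reported above), under the assumed weights:
print(flaw_hunter_score(["fatal"] + ["major"] * 8))  # 35.0
```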
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 02 verdict
Flash Lite is the value play if you trust Round 02 voters. The price-to-multi-turn-ELO ratio is uncontested at this scale, and the 1M context window means it doesn't lose to its Pro sibling on the only axis Pro is good at. The reliability numbers are mid-pack — if your product needs strict character control, pay up for Sonnet. For everything else, this is the leading "ship Round 03 cheaply" candidate.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset
Profile · Gemini 3.1 Flash Lite · Round 01