Model Profile · Round 01

Kimi K2.6

Moonshot · open weights · 128K ctx
⚠ Floor

Top-2 on flaw hunter. Catastrophic agency floor on bait scenes.

Composite Score
54.8
/100 · canonical
Arena ELO (R1)
joined post-R01
Multi-Turn ELO (R2)
1535
±101 · n=35
Reliability Rank
#16
avg 12.8
▌ Section 02 · The Lede

What this model is for.

Kimi K2.6 is the K2.5 successor with a top-two flaw-hunter score (49.5, the highest in the Moonshot lineup) and a catastrophic agency floor: F1 agency lands at 3.77/5, the second-worst in the entire pool, and the lowest-scoring session it produced clocks in at 2.5. Multi-turn voters didn't see this happen — Round 02 ELO holds it at #8 at 1535, well ahead of K2.5 — because the agency-bait probes are a small slice of the multi-turn battery and most pairings don't engineer that failure. The model that ships clean writing 90% of the time and writes your character's actions for you 10% of the time is exactly the model that catches voters off guard. Not for slow-burn.

▌ Section 03 · At a Glance

Cross-test position

Kimi K2.6 sits at #17 on Rubric — the caveat to watch.

Composite · #10
Arena ELO · n/a (joined post-R01)
Multi-Turn · #8
Rubric · #17
Adversarial · #16
Cost · Latency · #12
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Top-2 flaw-hunter score (49.5/100), highest in the Moonshot lineup. Multi-turn ELO of 1535 puts it #8 in the pool. Lowest fatal-flaw rate among Round 02 entrants at 0.12/session.
Weakness
F1 agency mean of 3.77 is second-worst in the pool — the model will write your character's actions when probed. The 59-second median generation time is the slowest in the field by a wide margin.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6 · F3 (Lore) and F8 (Momentum) not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency
Doesn't write your character's actions
3.77
/ 5
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.27
/ 5
F3 · Lore
not yet run on this model
F8 · Momentum
not yet run on this model
F12 · Instruction Drift
Keeps to the system prompt
4.30
/ 5
F13 · Context Attention
Holds character cards 50+ turns deep
4.40
/ 5
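The bars above reduce to a simple aggregation: the per-axis mean across sessions, compared against the population tick (4.20). A minimal sketch with illustrative session scores (the real battery is twenty sessions per model, and these values are invented for the example):

```python
from statistics import mean

POP_MEAN = 4.20  # the black tick: mean across the whole model pool


def axis_summary(scores):
    """Mean judge score across sessions, plus delta vs the population mean."""
    m = mean(scores)
    return round(m, 2), round(m - POP_MEAN, 2)


# Illustrative per-session judge scores (1-5); not the model's real data.
f1_sessions = [3.5, 4.0, 2.5, 4.5, 4.0]
print(axis_summary(f1_sessions))  # (3.7, -0.5): half a point under the tick
```

Rank within the pool then follows from sorting each model's per-axis mean.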
Cleanest prose with a 3.77 agency floor. The model that catches voters off guard.
Round 02 verdict · Don't pick for romance
▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above — these are the subjective half of that judgment.
Engagement
4.44/5
Tone Consistency
4.39/5
Collaboration
4.12/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
221
pop avg 265 · -16%
Unique-word ratio
0.677
pop avg 0.655 · +3%
Repetition score
0.038
pop avg 0.049 · -22%
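The three signals above can be computed straight from session transcripts. A sketch under stated assumptions: the regex tokenizer and the repetition proxy (the share of trigrams that occur more than once) are my choices, since the page does not publish the benchmark's exact formulas.

```python
import re
from collections import Counter


def turn_metrics(turns):
    """Behavioral signals for one session.

    `turns` is a list of model-turn strings. The repetition proxy
    (repeated-trigram share) is an assumption, not the benchmark's
    published formula.
    """
    turn_words = [
        [w.lower() for w in re.findall(r"[A-Za-z']+", t)] for t in turns
    ]
    all_words = [w for tw in turn_words for w in tw]
    avg_words = len(all_words) / len(turns)
    unique_ratio = len(set(all_words)) / len(all_words)

    # Count trigrams per turn so n-grams never span a turn boundary.
    trigrams = Counter()
    for tw in turn_words:
        trigrams.update(zip(tw, tw[1:], tw[2:]))
    repeated = sum(c for c in trigrams.values() if c > 1)
    repetition = repeated / max(sum(trigrams.values()), 1)
    return avg_words, unique_ratio, repetition


session = [
    "The rain fell on the rain soaked street.",
    "She walked on the rain soaked street again.",
]
print(turn_metrics(session))  # (8.0, 0.5625, 0.5)
```

On this toy session the recycled "on the rain soaked street" phrase drives the repetition score up; lower is better on that axis, which is why K2.6's 0.038 (22% under the pool average) reads as a strength.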
▌ Section 08 · Flaw Hunter

Adversarial probe score.

The score starts at 100, with deductions across 22 fail-mode flag types on adversarial 12-turn sessions; higher means fewer flaws caught. Population range across the round is 12.8–46.9.
49.5
/ 100
▌ Score breakdown
Mean   49.5
Median   52.0
Fatal/sess   0.12
Major/sess   5.25
▌ Top flaws caught
recycled_description · purple_prose · convenient_world
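The scoring shape described above (start at 100, deduct per flagged fail mode) can be sketched as follows. The per-severity weights are hypothetical, since the page states the formula shape but not the weights; the flag names come from the "top flaws caught" list.

```python
# Hypothetical per-severity deduction weights; the page gives the
# formula shape (100 minus deductions) but not these numbers.
WEIGHTS = {"fatal": 25.0, "major": 5.0, "minor": 1.0}


def flaw_hunter_score(flags):
    """Score one adversarial session: start at 100, deduct per flag.

    `flags` maps a flag type (e.g. 'purple_prose') to its severity.
    """
    deductions = sum(WEIGHTS[sev] for sev in flags.values())
    return max(0.0, 100.0 - deductions)


session_flags = {
    "recycled_description": "major",
    "purple_prose": "major",
    "convenient_world": "minor",
}
print(flaw_hunter_score(session_flags))  # 100 - 5 - 5 - 1 = 89.0
```

The headline 49.5 would then be the mean of these per-session scores across the adversarial battery, with the 0.12 fatal/session and 5.25 major/session rates above feeding the deduction totals.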
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict
K2.6 is a paradox: structurally the cleanest prose the benchmark has scored, paired with a near-bottom agency floor that flips on adversarial probes. If your product never bait-tests user agency — say, you're generating villain monologues or NPC asides — this is one of the strongest open-weight picks. If user agency matters, the floor will eventually surface and you'll wish you picked K2.5 or DeepSeek v4 Pro instead.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset