Model Profile · Round 02

Kimi K2.5

Moonshot · open weights · 128K ctx

Strong on tone consistency. Slow generation.

Composite Score
59.5 / 100 · canonical
Arena ELO (R1)
joined post-R01
Multi-Turn ELO (R2)
1496
±103 · n=37
Reliability Rank
#8
avg 7.5
▌ Section 02 · The Lede

What this model is for.

Moonshot's Kimi K2.5 is Round 02's slow but considered entry. Generation latency is brutal — 44.7 seconds median, second-slowest in the pool behind only its K2.6 sibling — but the prose that comes out is meaningfully clean: tone consistency at 4.67 (top-2), flaw-hunter mean of 42, agency respect at 4.47. Multi-turn voters slot it at #11 on ELO at 1496, which is roughly mid-pack but ahead of every model that joined Round 02 from the new vendor pool except the Anthropic ones. At $1.36/1M, it's priced as a premium-tier option without quite delivering premium-tier multi-turn engagement.

▌ Section 03 · At a Glance

Cross-test position

Kimi K2.5 sits at #13 on Cost · Latency — the caveat to watch.

Composite: #9
Arena ELO: — (joined post-R01)
Multi-Turn: #11
Rubric: #5
Adversarial: #8
Cost · Latency: #13
▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

Strength
Strong tone consistency (4.67/5, top-2). Best agency respect of any Moonshot model. F13 context attention at 4.57 is competitive with Sonnet/Opus.
Weakness
44.7-second median generation makes it almost unusable for synchronous chat. Multi-turn ELO of 1496 is mid-pack despite the price tag.
▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
▌ Coverage: 4/6 · F3 Lore and F8 Momentum not yet run on this model. Upstream rolls these out incrementally as new models join the pool.
F1 · Agency
Doesn't write your character's actions
4.47 / 5
F2 · POV / Tense
Holds 2nd-person, present-tense narration
4.20 / 5
F3 · Lore
Not yet run on this model
F8 · Momentum
Not yet run on this model
F12 · Instruction Drift
Keeps to the system prompt
4.37 / 5
F13 · Context Attention
Holds character cards 50+ turns deep
4.57 / 5
45-second responses, 4.57 context attention. A choice you'd only make for batch generation.
Round 02 verdict · Slow polish
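The bars above reduce to simple arithmetic: a per-axis mean over the judged sessions, plotted against the fixed population tick. A minimal sketch of that aggregation, using invented placeholder scores rather than real rp-bench session data:

```python
from statistics import mean

# Population mean quoted above (the black tick on each bar).
POP_MEAN = 4.20

def axis_summary(session_scores):
    """Mean across judged sessions plus signed delta vs the population tick."""
    m = mean(session_scores)
    return round(m, 2), round(m - POP_MEAN, 2)

# Illustrative placeholder scores only, not real rp-bench data.
score, delta = axis_summary([4.5, 4.0, 4.5, 5.0, 4.35])
```

Any axis scoring above a zero delta sits right of the black tick; all four axes run on this model do.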
▌ Section 06 · Subjective Dimensions

Engagement · Tone Consistency · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
Engagement
4.56/5
Tone Consistency
4.67/5
Collaboration
4.39/5
▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
253
pop avg 265 · -4%
Unique-word ratio
0.681
pop avg 0.655 · +4%
Repetition score
0.037
pop avg 0.049 · -24%
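A hedged sketch of the lexical signals above. rp-bench's exact definitions aren't published here: unique-word ratio is assumed to be distinct words over total words, and the pop-avg delta a plain relative difference.

```python
# Assumed definitions, not rp-bench's published ones.
def unique_word_ratio(text: str) -> float:
    """Distinct words / total words (assumed metric definition)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def delta_vs_pop_pct(value: float, pop_avg: float) -> float:
    """Signed percent difference against the population average."""
    return round((value - pop_avg) / pop_avg * 100, 1)
```

Under this reading, `delta_vs_pop_pct(0.037, 0.049)` gives roughly -24.5, consistent with the -24% shown on the repetition card.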
▌ Section 08 · Flaw Hunter

Adversarial probe score.

Score of 100 minus deductions across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
42.0 / 100
▌ Score breakdown
Mean   42.0
Median   42.0
Fatal/sess   0.44
Major/sess   6.33
▌ Top flaws caught
recycled_description · purple_prose · narrating_emotions
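The scoring shape described above (100 minus deductions across the flag types, on a 0-100 scale) can be sketched as follows. Only that shape comes from the text; the per-severity deduction weights below are hypothetical placeholders, not rp-bench's real deduction table.

```python
# Hypothetical severity weights -- rp-bench's actual deduction table is
# not published in this profile.
ASSUMED_WEIGHTS = {"fatal": 15.0, "major": 5.0, "minor": 1.0}

def flaw_hunter_score(flag_counts: dict) -> float:
    """100 minus weighted deductions, clamped to the 0-100 scale."""
    deductions = sum(ASSUMED_WEIGHTS[sev] * n for sev, n in flag_counts.items())
    return max(0.0, 100.0 - deductions)
```

With these placeholder weights, a session flagged with 6 majors and 3 minors (close to this model's per-session averages) would land in the 60s; heavier fatal counts push a session toward the floor of the 12.8–46.9 population range.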
▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 02 verdict
K2.5 is hard to deploy live. The latency makes interactive roleplay punishing, and the multi-turn ELO doesn't justify the wait when Sonnet generates in a third of the time at 5× the cost. Real use case: batch generation pipelines where you can absorb latency in exchange for cleaner prose. For interactive product surfaces, look anywhere else first.
▌ Section 10 · Compare & Drill

Stack it against another model.

━ All 11 Models
The Standings
Full leaderboard, all tests, all filters.
Compare →
Methodology · Raw votes (CSV) · GitHub · HF dataset
Profile · Kimi K2.5 · Round 02