Model Profile · Round 01
MiniMax M2.7
MiniMax · proprietary · 200K ctx
Strong narrative push. Fragile under adversarial pressure.
Composite Score
35.7
/100 · canonical
Arena ELO (R1)
1510
±48 · n=393
Multi-Turn ELO (R2)
1437
±93 · n=43
Reliability Rank
#10
avg 9.3
▌ Section 02 · The Lede
What this model is for.
MiniMax M2.7 takes the round on narrative momentum (F8 score 4.30, top-2 in the field) and reads as the model that pushes the scene forward when the user goes passive. It sits at #4 on the community board, with a balanced word count (261 words/turn, almost exactly the population mean) and population-average prose metrics across the board. The weakness is reliability: 0.79 fatal flaws per session and a Flaw Hunter mean of 41.5. Strong scenes, frequent breakdowns.
▌ Section 03 · At a Glance
Cross-test position
MiniMax M2.7 sits at #17 on Multi-Turn — the caveat to watch.
▌ Section 04 · Strength & Weakness
Where it shines. Where it stumbles.
▲ Strength
Top-2 on narrative momentum (4.30/5). Population-average word count and prose quality — feels neutral, doesn't lean on a tic.
▼ Weakness
Frequent fatal flaws: 0.79 per session, fourth-worst in the field behind Grok (1.33) and Mistral and Llama (tied at 0.95). NSFW-averse: 45% NSFW win rate vs 54% SFW.
▌ Section 05 · Failure Modes
Per-axis breakdown.
Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.
F1 · Agency
Doesn't write your character's actions
#7
F2 · POV / Tense
Holds 2nd-person, present-tense narration
#12
F3 · Lore
Doesn't break worldbuilding
#9
F8 · Momentum
Pushes scene forward when user goes passive
#2
F12 · Instruction Drift
Keeps to the system prompt
#17
F13 · Context Attention
Holds character cards 50+ turns deep
#15
“The model that pushes the scene forward — and one that breaks too often getting there.”
— Round 01 verdict · Momentum vs reliability
▌ Section 06 · Subjective Dimensions
Engagement · Voice · Collaboration.
All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn multi-turn sessions. The same battery feeds the failure-mode rubric above; these are the subjective half of that judgment.
▌ Section 07 · Behavioral Metrics
How it writes.
Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.
Avg words / turn
261↓
pop avg 265 · -2%
Unique-word ratio
0.649↓
pop avg 0.655 · -1%
Repetition score
0.046↓
pop avg 0.049 · -6%
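The metrics above can be sketched under assumed definitions: unique-word ratio as distinct tokens over total tokens, and repetition score as the fraction of within-turn bigrams that repeat. The exact rp-bench definitions are not published in this profile, so both formulas and the sample turn are illustrative assumptions.

```python
# Assumed metric definitions (not confirmed by the profile):
#   unique-word ratio = distinct tokens / total tokens
#   repetition score  = 1 - (distinct bigrams / total bigrams)
import re

def tokenize(text: str) -> list[str]:
    # lowercase word tokens; apostrophes kept so "it's" stays one token
    return re.findall(r"[a-z']+", text.lower())

def unique_word_ratio(text: str) -> float:
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def repetition_score(text: str) -> float:
    tokens = tokenize(text)
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1 - len(set(bigrams)) / len(bigrams)

turn = "The rain falls and the rain keeps falling on the old stone road."
print(round(unique_word_ratio(turn), 3))  # -> 0.769
print(round(repetition_score(turn), 3))   # -> 0.083
```

Under these definitions, a lower repetition score means fewer repeated bigrams per turn, which is why the profile reads a -6% delta as favorable.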
▌ Section 08 · Flaw Hunter
Adversarial probe score.
Score starts at 100, with deductions taken across 22 fail-mode flag types on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.
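The 100-minus-deductions shape can be sketched as below. Only the start-at-100 scheme and the fatal/major severity split come from the profile; the per-severity weights and the floor at zero are assumptions for illustration.

```python
# Minimal sketch of a 100-minus-deductions score.
# The weights are ASSUMED, not published rp-bench values.
FATAL_WEIGHT = 15   # assumed deduction per fatal flaw
MAJOR_WEIGHT = 5    # assumed deduction per major flaw

def flaw_hunter_score(fatal: int, major: int) -> float:
    """Return 100 minus weighted deductions, floored at zero (assumed)."""
    deductions = fatal * FATAL_WEIGHT + major * MAJOR_WEIGHT
    return max(0.0, 100.0 - deductions)

# e.g. a session near this model's averages: 1 fatal, 6 major flaws
print(flaw_hunter_score(1, 6))  # -> 55.0 under these assumed weights
```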
▌ Score breakdown
Mean 41.5
Median 44.5
Fatal/sess 0.79
Major/sess 6.00
▌ Top flaws caught
purple_prose · recycled_description · narrating_emotions
▌ Section 09 · Sample Responses
Highest- and lowest-rated turns.
▌ Pending Round 02
Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.
▌ Round 01 verdict
MiniMax M2.7 is a narrative engine. When it's working, it's the model most willing to take the scene where it needs to go without waiting on the user. The flaw count is the cost — fourth-worst fatal-flaw rate in the pool (behind Grok, Mistral, and Llama), and it leans away from NSFW work. Pick it for stalled scenes. Be ready to swipe more than you would on Sonnet or DeepSeek.
▌ Section 10 · Compare & Drill
Stack it against another model.
Profile · MiniMax M2.7 · Round 01