Model Profile · Round 01

Qwen 3.5 Flash

Alibaba · open weights · local-friendly · 128K ctx

⚠ Floor

Floor on agency and instruction drift. Caveat emptor.

Composite Score

7.1

/100 · canonical

Arena ELO (R1)

1487

±47 · n=401

Multi-Turn ELO (R2)

1423

±41 · n=207

Reliability Rank

#19

avg 14.9

▌ Section 02 · The Lede

What this model is for.

Qwen 3.5 Flash sits at #8 on community ELO and #9 on reliability, with two structural floors: agency respect and instruction drift each bottom out at 2.5/5, the lowest single-session scores on those axes in our pool. Its F12 instruction-drift mean of 3.17 is also the lowest of our 11 models. The 42% NSFW win rate is mid-pack, but the floor scores tell the story: this is the field's caveat-emptor model.

▌ Section 03 · At a Glance

Cross-test position

Qwen 3.5 Flash sits at #20 on Composite — the caveat to watch.

Chart axes: Composite, Arena ELO, Multi-Turn, Rubric, Adversarial, Cost · Latency

▌ Section 04 · Strength & Weakness

Where it shines. Where it stumbles.

▲ Strength

No standout strength on the dimensions we tested. Best read as a budget option for non-critical scenes where structural failure is acceptable.

▼ Weakness

Catastrophic floors on both agency respect (lowest session 2.5) and instruction drift (lowest session 2.5): two structural Tier-1 failures. The F12 mean of 3.17 is also the lowest in our pool. Llama is the only other model with floors on both axes, but Qwen's are deeper on each. Reach for almost anything else if the system prompt matters.

▌ Section 05 · Failure Modes

Per-axis breakdown.

Six adversarial probes per session, twenty sessions per model, judged by Sonnet 4 against a fixed rubric. Higher score = the model handled the failure mode better. Bars below show the mean across sessions; the black tick marks the population mean (4.20). Right column shows mean and rank within the rp-bench pool.

F1 · Agency

Doesn't write your character's actions

3.80/5

#20

F2 · POV / Tense

Holds 2nd-person, present-tense narration

4.03/5

#20

F3 · Lore

Doesn't break worldbuilding

4.20/5

F8 · Momentum

Pushes scene forward when user goes passive

4.10/5

F12 · Instruction Drift

Keeps to the system prompt

3.17/5

#21

F13 · Context Attention

Holds character cards 50+ turns deep

4.30/5

#17
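This profile leans on the difference between an axis mean (the bars above) and an axis floor (the lowest single-session score). A minimal sketch of that distinction follows, assuming each session yields one score per axis on the 1-5 scale. The function name `axis_summary` and the session values are hypothetical and are not Round 01 data.

```python
from statistics import mean


def axis_summary(session_scores):
    """Return (mean, floor) for one axis, given one score per session.

    The floor is the single worst session, so it can sit far below the mean.
    """
    if not session_scores:
        raise ValueError("need at least one session score")
    return round(mean(session_scores), 2), min(session_scores)


# Hypothetical per-session scores (illustrative only, not benchmark data).
example_sessions = [4.5, 4.0, 2.5, 4.0, 4.5, 3.5]
axis_mean, axis_floor = axis_summary(example_sessions)
print(f"mean={axis_mean:.2f} floor={axis_floor:.1f}")
```

A model can post a respectable mean while one bad session drags its floor down, which is why the two numbers are reported separately.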

“Two structural floors — agency and instruction drift, both bottoming at the worst scores in the pool.”

— Round 01 verdict · Caveat emptor

▌ Section 06 · Subjective Dimensions

Engagement · Voice · Collaboration.

All three dimensions are scored 1–5 by a Sonnet 4 LLM judge across twenty 12-turn sessions. The same battery feeds the failure-mode rubric above; these scores are the subjective half of that judgment.

Engagement

4.18/5

Tone Consistency

4.16/5

Collaboration

4.13/5

▌ Section 07 · Behavioral Metrics

How it writes.

Quantitative signals from the same 20 multi-turn sessions, compared against the population mean across all 11 models.

Avg words / turn

229↓

pop avg 265 · -13%

Unique-word ratio

0.634↓

pop avg 0.657 · -4%

Repetition score

0.069↑

pop avg 0.048 · +44%
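The percentage deltas above read as relative differences from the population mean. The sketch below shows that arithmetic, plus one common way to compute a unique-word ratio (distinct words divided by total words). Both are assumptions: the benchmark's exact tokenization and rounding are not stated here, and the names `relative_delta_pct` and `unique_word_ratio` are hypothetical.

```python
import re


def relative_delta_pct(value, population_mean):
    """Percent difference of a model's value from the population mean."""
    return (value - population_mean) / population_mean * 100.0


def unique_word_ratio(text):
    """Distinct words / total words, with a simple lowercase tokenizer (assumed)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


# Repetition score from this profile: 0.069 against a population mean of 0.048.
print(round(relative_delta_pct(0.069, 0.048)))  # 44
# Unique-word ratio from this profile: 0.634 against 0.657.
print(round(relative_delta_pct(0.634, 0.657)))  # -4
print(unique_word_ratio("the cat and the hat"))  # 0.8
```

Recomputing from the rounded figures shown on the page can land a point away from the displayed delta. For example, 229 against 265 works out to roughly -13.6%, so the published percentages may come from unrounded inputs.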

▌ Section 08 · Flaw Hunter

Adversarial probe score.

The score starts at 100 and subtracts deductions across 22 failure-mode flag types, measured on adversarial 12-turn sessions. Higher = fewer flaws caught. Population range across the round is 12.8–46.9.

39.6

/ 100

▌ Score breakdown

Mean   39.6
Median   39.5
Fatal/sess   0.50
Major/sess   6.50

▌ Top flaws caught

recycled_description · purple_prose · narrating_emotions

▌ Section 09 · Sample Responses

Highest- and lowest-rated turns.

▌ Pending Round 02

Best- and worst-rated sample responses ship with the raw-vote endpoint in Round 02. When that lands, this section will surface the model’s highest- and lowest-scoring blind-arena turns side by side, scored on the same rubric the leaderboard uses.

▌ Round 01 verdict

Qwen 3.5 Flash is the model the failure-rank board flags hardest. There's no axis where it stands out, and two — agency respect and instruction drift — where its floor is the worst in the field. Avoid for any scene where character control matters. The 42% NSFW win rate isn't a saving grace; the structural failures show up regardless of mode.

