← back to index

Qwopus3.6-27B v2 (dense) · Q5_K_M evaluation

by Kyle Hessling · @KyleHessling1 on X

Same prompt suite as the 35B-A3B eval (5 agentic + 1 nothink rerun, 5 designs, 6 canvas), plus a full SWE-bench Verified pass on the controlled-202 slice. 15 of 17 outputs published; the Mandelbulb shader and Three.js crystal scene shipped well-formed HTML but didn't render strongly enough to keep, so they're parked in excluded-canvas/. All published runs use thinking on. Single RTX 5090, Q5_K_M, 160K fp16 context.

TL;DR

Setup

ItemValue
ModelQwopus3.6-27B v2 (dense) Q5_K_M GGUF, ~18 GB on disk
ArchitectureDense 27B, 64 layers, GQA (4 KV heads × 256 head dim), native 262 K ctx
LineageFinetune of Qwen/Qwen3.6-27B base
Runtimellama.cpp CUDA-12.8, --flash-attn on, embedded jinja template
Context163,840 tokens, fp16 K+V cache, single slot
HardwareRTX 5090 (32 GB), all 64 layers on GPU, ~31 GB VRAM resident
Sampling, design HTMLtemp 0.75 / top-p 0.95 / thinking on
Sampling, canvas (5 of 6)temp 1.0 / top-p 0.95 / thinking on
Sampling, physics_sandboxtemp 0.75 (first run shipped clean, kept)
Sampling, agentictemp 0.3 / top-p 0.9 / thinking on
SWE-bench harnessmini-swe-agent 2.2.8 · swebench 4.1.0 · temp 1.0 · step_limit 275 · workers 1

SWE-bench Verified

MetricValue
Slicecontrolled-202
Resolved152 / 202 · 75.25%
Empty patches1
Errors (patch failed eval)1
Step-limit (275) hits0
Context-overflow failures0
Median trajectory67 of 275 steps
Max trajectory148 (django-14351)
Wall clock19h 29m, parallel=1

Every instance exited Submitted. Not a single one hit the step limit or got within 50K tokens of the context cap. The 1 empty patch (matplotlib-24026) was the model calling submit with no diff after 64 steps. The 1 error (django-13028) submitted a 711-byte patch that didn't grade as correct.

Run agentic harnesses hot The most actionable result from this run. For thinking-on finetunes in multi-step harnesses, raising temperature collapses the empty-patch failure mode. Earlier runs at temp=0.1 hit 78 empty patches on this slice; this run at temp=1.0 hit 1.

Why low temperature hurts a thinking finetune

The intuition that runs counter to the result: finetunes are trained to converge, so greedy decoding should produce the cleanest output. That holds for short, well-anchored prompts. It is the wrong intuition for multi-step agentic loops.

Inside a harness, the model reads tool output, thinks (which on a thinking-on finetune is a long free-form chain), decides on an action, emits a tool call. At temp=0.1, every step follows the highest-probability path. Combine that with a finetune trained to think thoroughly, and the model lands on the "I should think more about this" branch at every single step. That produces 240-message trajectories where the model proposes, second-guesses, proposes again, and never submits.

At temp=1.0 the model still thinks, but it varies the path through reasoning and is more willing to commit. This run's trajectory data backs that up: median 67 steps, max 148, vs the earlier MoE finetune at temp=0.1 routinely hitting 240+ before timing out. Concrete recommendation: for any multi-step harness (mini-swe-agent, Cline, Claude-style tool use), start at temp=1.0 and walk back only if you see degradation.

Throughput

MetricQwopus 3.6 35B-A3B (MoE, Q5)Qwopus 3.6 27B v2 (dense, Q5)
avg tok/s161.943.9
tok/s range154.4 / 164.843.1 / 44.6
VRAM resident~25 GB (65K q8)~31 GB (160K fp16)
Total tokens (this suite)106,688119,036
Total runtime11.1 min45.3 min

The MoE wins on throughput by ~3.7×, which is what the A3B routing pattern is for: only ~3 B of weights move through cache per token on the MoE vs the full ~16 GB Q5_K_M pass on the dense 27B. The dense pays for that with per-token quality, which shows up most clearly on agentic work. Pick the dense for agentic depth, long context, and reasoning; pick the MoE for high-throughput short-context generation.

Throughput variance is the tightest I've measured on this hardware: 43.1 to 44.6 tok/s across all 17 runs. A 1.5 tok/s spread means the model is fully memory-bandwidth-bound, which is the expected steady state.

Web design

All 5 designs validated end-to-end: DOCTYPE present, </html> reached, no truncation, balanced scripts. Average shipped size 40.8 KB per page.

PromptHTML KBTokensTimeReasoning
saas_landing60.323,801552 s836
analytics_dashboard42.115,390354 s1,898
designer_portfolio32.511,612265 s1,459
pricing_page26.69,360213 s1,077
mobile_app_marketing42.316,590382 s1,650

designer_portfolio: the standout

The designer_portfolio prompt is wide-open: "design a designer portfolio site," no spec. Most models in this class respond with a generic skeleton and placeholder copy. The dense Qwopus picks an angle, commits to it, and executes: kinetic-typography hero, defensible visual rhythm through the case-study section, voice in the copy. It looks like a portfolio draft a human designer would actually ship. This is the clearest illustration of where the finetune still has real lift over the (much-improved) base. The base produces something competent; the finetune produces something with a point of view.

Notes on the other four

Canvas / WebGL

4 of 6 prompts shipped clean. The two that didn't (Mandelbulb shader, Three.js crystal scene) rendered well-formed HTML but the actual visual output wasn't strong enough to publish. Parked in excluded-canvas/ rather than deleted because the failure modes are recognizable:

PromptHTML KBTokensReasoningStatus
particle_attractor9.44,3081,513shipped
generative_flowfield13.97,2376,269shipped
physics_sandbox18.06,8271,665shipped
audio_reactive10.75,7317,645shipped
webgl_shader (Mandelbulb)11.54,9281,831parked
three_scene (crystal)12.54,9801,811parked

Highlights from the 4 published

Agentic reasoning

5 prompts plus a structured-extraction nothink rerun (17th run). Total agentic time: 3.2 min.

PromptCompletion tokensReasoning charsTime
multi_step_planning2,2387,06750 s
tool_use_json1,2622,80728 s
code_debug1,7535,22539 s
structured_extraction (thinking)1,7214,24539 s
self_critique1,2553,30928 s
structured_extraction_nothink35108 s

Lineage

Qwen 3.5 27B → Qwopus 3.5 27B → Qwen 3.6 27B (base) → Qwopus 3.6 27B v2 (this run).

Each step has been a real jump. The first Qwopus finetune was a substantial lift in one-shot front-end execution over Qwen 3.5. The biggest single jump in the lineage came from Alibaba: Qwen 3.6 base raised the floor on every aspect of dense inference at this size. The new base does, out of the box, what the previous generation needed a finetune to do.

The gap between base and finetune is now narrower than it used to be. But the finetune still wins meaningfully on creative execution. As foundation models get stronger, the marginal lift on objective benchmarks shrinks, while the qualitative gap on subjective work (creative writing, design, code style) widens. The finetune training signal is no longer fixing basic mistakes; it's adding aesthetics and judgment on top of an already-strong substrate. That's where Qwopus 3.6 lives.

Tuning recommendations

Caveats

Verdict

Qwopus 3.6 27B v2 (dense) is the agentic model of choice in this lineage, with one-shot design quality strong enough to share equal billing. It beats the 35B-A3B MoE finetune on SWE-bench by ~6 points on the same slice and effectively eliminates the empty-patch failure mode. The temperature-1.0 finding is the most actionable result; it's reproducible, the magnitude is large, and the underlying reasoning generalizes to other harnesses.

The creative bench is in genuinely good shape. All 5 designs ship complete, opinionated, production-quality pages on first try. The designer_portfolio output is the best one-shot pass I've seen at this size class. 4 of 6 canvas demos ship clean; the 2 that didn't are parked rather than deleted.

For high-throughput short-context work, the MoE is still the right model. For agentic, long-context, or reasoning-heavy work, this dense 27B is what you want. The lineage tells the story: the 3.6 base raised the floor enormously, and Qwopus 3.6 takes that base and adds the creative-execution edge.

Raw outputs, per-run metadata JSON, and backup samples preserved alongside each HTML / TXT file in this repo. Same harness and prompts as the Qwopus3.6-35B-A3B-v1 eval.