by Kyle Hessling · v2 update to the dense 27B preview
Same suite as the 35B-A3B eval (5 agentic + 1 nothink rerun, 5 web-design, 4 canvas). 2 canvas outputs excluded for visual quality and parked in excluded-canvas/. Thinking is on for every run except the nothink rerun. Q5_K_M on a single RTX 5090 via llama.cpp.
75.25% on SWE-bench Verified (controlled-202 slice)
Run separately under temp=1.0 / step_limit=275 / single-slot on the same controlled-202 instances we benchmark every Qwopus build against. 152 / 202 resolved, just 1 empty patch across all 202. The temp-1 + thinking-on combination effectively eliminates the reasoning-loop failure mode earlier finetunes ran into.
19h29m wall-clock on a single RTX 5090, 160K fp16 context. Every instance exited Submitted, 0 step-limit hits, 0 context overflows. Median trajectory length 67 / 275.
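As a quick sanity check on the headline numbers, the arithmetic can be reproduced from the counts quoted above. This is only a sketch: the constant and helper names are illustrative, and the per-instance average assumes the single-slot run processed instances strictly one after another.

```python
# Counts taken from the write-up; names are illustrative, not from the harness.
RESOLVED = 152
TOTAL = 202
EMPTY_PATCHES = 1
WALL_CLOCK_MINUTES = 19 * 60 + 29  # 19h29m
MEDIAN_STEPS = 67
STEP_LIMIT = 275


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


resolve_rate = percent(RESOLVED, TOTAL)        # headline score
empty_rate = percent(EMPTY_PATCHES, TOTAL)     # empty-patch share
# Average wall-clock per instance, assuming strictly sequential execution.
minutes_per_instance = round(WALL_CLOCK_MINUTES / TOTAL, 1)
# Share of the step budget used by the median trajectory.
median_budget_used = percent(MEDIAN_STEPS, STEP_LIMIT)

print(resolve_rate, empty_rate, minutes_per_instance, median_budget_used)
```

The median trajectory uses roughly a quarter of the step budget, which is consistent with the report of 0 step-limit hits.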
⚡
Run agentic harnesses hot.
Counter-intuitive but consistent across our runs: for agentic harnesses with thinking on, temp=1.0 outperforms temp=0.1 by a wide margin. Near-greedy decoding hands the finetune its strongest single-path reasoning chain back to itself every step, which is the recipe for over-deliberating, looping inside <think>, and the empty-patch failure mode. Raising temperature lets the finetune use the breadth of reasoning paths the training installed instead of refining one. The drop from 78 empty patches to 1 between our 35B-A3B temp-0.1 run and this 27B dense temp-1 run is the cleanest case we have. For one-shot creative HTML, drop back to 0.6 to 0.8 and tune the slider for your workload.
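To make the temperature point concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of next-token logits. The logits are made up for illustration and are not taken from any Qwopus run; the function name is hypothetical and this is not how llama.cpp implements sampling internally, just the standard formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Standard softmax over logits divided by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate continuations.
toy_logits = [2.0, 1.5, 1.0, 0.5]

cold = softmax_with_temperature(toy_logits, 0.1)  # near-greedy
hot = softmax_with_temperature(toy_logits, 1.0)   # the setting used here

print("temp=0.1:", [round(p, 4) for p in cold])
print("temp=1.0:", [round(p, 4) for p in hot])
```

With these toy numbers, temp=0.1 puts over 99% of the mass on the top candidate, so every step effectively replays the same path, while temp=1.0 leaves the top candidate under 50% and keeps the alternatives reachable.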
Three of the four canvas prompts run at temp=1.0 (thinking on); physics_sandbox stays at temp=0.75 since its first run shipped clean. The Mandelbulb shader and Three.js crystal scene rendered but weren't strong enough to publish; both outputs are parked in excluded-canvas/ for inspection.
Lineage: Qwen 3.5 27B → Qwopus 3.5 27B → Qwen 3.6 base → Qwopus 3.6 27B
Each step has been a real jump. The Qwen 3.5 → Qwopus 3.5 finetune was a big lift in front-end execution; the next big jump came from Alibaba raising the floor with the Qwen 3.6 base. The gap between base and finetune is narrowing, but Qwopus is still meaningfully better at executing creative briefs. The designer portfolio in this run is the best one-shot pass I've seen anywhere in this size class: the base produces something competent on the same prompt, while the finetune turns it into something with a point of view.