Dense Fine-tune · 27B

Qwopus3.6-27B v2 (dense)

Same suite as the 35B-A3B eval (5 agentic + 1 nothink rerun, 5 web-design, 4 canvas). 2 canvas outputs excluded for visual quality and parked in excluded-canvas/. Thinking is on for every run. Q5_K_M on a single RTX 5090 via llama.cpp.

i
75.25% on SWE-bench Verified (controlled-202 slice) Run separately under temp=1.0 / step_limit=275 / single-slot on the same controlled-202 instances we benchmark every Qwopus build against. 152 / 202 resolved, just 1 empty patch across all 202. The temp-1 + thinking-on combination effectively eliminates the reasoning-loop failure mode earlier finetunes ran into.
43.9avg tok/s
15 / 17published · 2 canvas parked
75.25%SWE-bench Verified (202)
119,036completion tokens
~31 GBVRAM · 160K fp16 KV

SWE-bench Verified · controlled-202 slice

RunSamplingResolvedEmptyResolve %
Qwopus 3.6 27B v2 (dense)temp 1.0, step 275, single-slot152 / 202175.25%

19h29m wall-clock on a single RTX 5090, 160K fp16 context. Every instance exited Submitted, 0 step-limit hits, 0 context overflows. Median trajectory length 67 / 275.

Run agentic harnesses hot. Counter-intuitive but consistent across our runs: for agentic harnesses with thinking-on, temp=1.0 outperforms temp=0.1 by a wide margin. Greedy decoding hands the finetune its strongest single-path reasoning chain back to itself every step, which is the recipe for over-deliberating, looping inside <think>, and the empty-patch failure mode. Raising temperature lets the finetune use the breadth of reasoning paths the training installed instead of refining one. The 78 to 1 collapse in empty patches between our 35B-A3B temp-0.1 and this 27B dense temp-1 run is the cleanest case we have. For one-shot creative HTML, drop back to 0.6 to 0.8 and try the slider for your workload.

Web design · open to preview

SaaS landing pageAI observability product page
60.3 KB · 23,801 tok · 552 s · thinking-on
Analytics dashboardLight-theme dashboard layout
42.1 KB · 15,390 tok · 354 s · thinking-on
Designer portfolioKinetic-typography portfolio
32.5 KB · 11,612 tok · 265 s · thinking-on
Pricing page3 tiers + animated toggle + FAQ
26.6 KB · 9,360 tok · 213 s · thinking-on
Mobile app marketingApp landing with device mock
42.3 KB · 16,590 tok · 382 s · thinking-on

Canvas / WebGL · creative coding

Three of four run at temp=1.0 (thinking on); physics_sandbox stays at temp=0.75 since the first run shipped clean. The Mandelbulb shader and Three.js crystal scene rendered but weren't strong enough to publish; both outputs are parked in excluded-canvas/ for inspection.

Particle attractor3000-particle fluid swarm
9.4 KB · 4,308 tok · 97 s · temp 1.0 · 1,513 chars reasoning
Generative flowfieldInk-line agents on noise
13.9 KB · 7,237 tok · 163 s · temp 1.0 · 6,269 chars reasoning
Soft-body physics sandboxVerlet integration playground
18.0 KB · 6,827 tok · 154 s · temp 0.75 · 1,665 chars reasoning
Audio-reactive visualizerFFT bars + bloom on mic input
10.7 KB · 5,731 tok · 129 s · temp 1.0 · 7,645 chars reasoning
Lineage: Qwen 3.5 27B → Qwopus 3.5 27B → Qwen 3.6 base → Qwopus 3.6 27B Each step has been a real jump. The Qwen 3.5 → Qwopus 3.5 finetune was a big lift in front-end execution; the next big jump came from Alibaba raising the floor with the Qwen 3.6 base. The gap between base and finetune is narrowing, but Qwopus is still meaningfully better at executing creative briefs. The designer portfolio in this run is the best one-shot pass I've seen anywhere in this size class; the base produces something competent on the same prompt, the finetune turns it into something with a point of view.

Agentic reasoning · text output

Multi-step planningURL shortener deploy plan
thinking: 2,238 tok · 50 s · 7,067 chars reasoning
Tool-use planningWeather + flights + hotel
thinking: 1,262 tok · 28 s · 2,807 chars reasoning
Code debug4-bug k-th smallest element
thinking: 1,753 tok · 39 s · 5,225 chars reasoning
Structured JSON extractionCalendar + roster from prose
thinking: 1,721 tok · 39 s · clean pass
Self-critique loopPalindrome · iterate to O(n²)
thinking: 1,255 tok · 28 s · 3,309 chars reasoning
JSON extraction · no-thinkSame prompt, thinking off
351 tok · 8 s