Same prompt suite as the 35B-A3B eval (5 agentic + 1 nothink rerun, 5 designs, 6 canvas), plus a full SWE-bench Verified pass on the controlled-202 slice. 15 of 17 outputs published; the Mandelbulb shader and Three.js crystal scene shipped well-formed HTML but didn't render strongly enough to keep, so they're parked in excluded-canvas/. All published runs use thinking on. Single RTX 5090, Q5_K_M, 160K fp16 context.
| Item | Value |
|---|---|
| Model | Qwopus3.6-27B v2 (dense) Q5_K_M GGUF, ~18 GB on disk |
| Architecture | Dense 27B, 64 layers, GQA (4 KV heads × 256 head dim), native 262 K ctx |
| Lineage | Finetune of Qwen/Qwen3.6-27B base |
| Runtime | llama.cpp CUDA-12.8, --flash-attn on, embedded jinja template |
| Context | 163,840 tokens, fp16 K+V cache, single slot |
| Hardware | RTX 5090 (32 GB), all 64 layers on GPU, ~31 GB VRAM resident |
| Sampling, design HTML | temp 0.75 / top-p 0.95 / thinking on |
| Sampling, canvas (5 of 6) | temp 1.0 / top-p 0.95 / thinking on |
| Sampling, physics_sandbox | temp 0.75 (first run shipped clean, kept) |
| Sampling, agentic | temp 0.3 / top-p 0.9 / thinking on |
| SWE-bench harness | mini-swe-agent 2.2.8 · swebench 4.1.0 · temp 1.0 · step_limit 275 · workers 1 |
| Metric | Value |
|---|---|
| Slice | controlled-202 |
| Resolved | 152 / 202 · 75.25% |
| Empty patches | 1 |
| Errors (patch failed eval) | 1 |
| Step-limit (275) hits | 0 |
| Context-overflow failures | 0 |
| Median trajectory | 67 of 275 steps |
| Max trajectory | 148 (django-14351) |
| Wall clock | 19h 29m, parallel=1 |
Every instance exited Submitted. Not a single one hit the step limit or got within 50K tokens of the context cap. The 1 empty patch (matplotlib-24026) was the model calling submit with no diff after 64 steps. The 1 error (django-13028) submitted a 711-byte patch that didn't grade as correct.
The intuition that runs counter to the result: finetunes are trained to converge, so greedy decoding should produce the cleanest output. That holds for short, well-anchored prompts. It is the wrong intuition for multi-step agentic loops.
Inside a harness, the model reads tool output, thinks (which on a thinking-on finetune is a long free-form chain), decides on an action, emits a tool call. At temp=0.1, every step follows the highest-probability path. Combine that with a finetune trained to think thoroughly, and the model lands on the "I should think more about this" branch at every single step. That produces 240-message trajectories where the model proposes, second-guesses, proposes again, and never submits.
At temp=1.0 the model still thinks, but it varies the path through reasoning and is more willing to commit. This run's trajectory data backs that up: median 67 steps, max 148, vs the earlier MoE finetune at temp=0.1 routinely hitting 240+ before timing out. Concrete recommendation: for any multi-step harness (mini-swe-agent, Cline, Claude-style tool use), start at temp=1.0 and walk back only if you see degradation.
| Metric | Qwopus 3.6 35B-A3B (MoE, Q5) | Qwopus 3.6 27B v2 (dense, Q5) |
|---|---|---|
| avg tok/s | 161.9 | 43.9 |
| tok/s range | 154.4 / 164.8 | 43.1 / 44.6 |
| VRAM resident | ~25 GB (65K q8) | ~31 GB (160K fp16) |
| Total tokens (this suite) | 106,688 | 119,036 |
| Total runtime | 11.1 min | 45.3 min |
The MoE wins on throughput by ~3.7×, which is what the A3B routing pattern is for: only ~3 B of weights move through cache per token on the MoE vs the full ~16 GB Q5_K_M pass on the dense 27B. The dense pays for that with per-token quality, which shows up most clearly on agentic work. Pick the dense for agentic depth, long context, and reasoning; pick the MoE for high-throughput short-context generation.
Throughput variance is the tightest I've measured on this hardware: 43.1 to 44.6 tok/s across all 17 runs. A 1.5 tok/s spread means the model is fully memory-bandwidth-bound, which is the expected steady state.
All 5 designs validated end-to-end: DOCTYPE present, </html> reached, no truncation, balanced scripts. Average shipped size 40.8 KB per page.
| Prompt | HTML KB | Tokens | Time | Reasoning |
|---|---|---|---|---|
| saas_landing | 60.3 | 23,801 | 552 s | 836 |
| analytics_dashboard | 42.1 | 15,390 | 354 s | 1,898 |
| designer_portfolio | 32.5 | 11,612 | 265 s | 1,459 |
| pricing_page | 26.6 | 9,360 | 213 s | 1,077 |
| mobile_app_marketing | 42.3 | 16,590 | 382 s | 1,650 |
The designer_portfolio prompt is wide-open: "design a designer portfolio site," no spec. Most models in this class respond with a generic skeleton and placeholder copy. The dense Qwopus picks an angle, commits to it, and executes: kinetic-typography hero, defensible visual rhythm through the case-study section, voice in the copy. It looks like a portfolio draft a human designer would actually ship. This is the clearest illustration of where the finetune still has real lift over the (much-improved) base. The base produces something competent; the finetune produces something with a point of view.
max_tokens to 48K produced this clean 16.6K-token result.4 of 6 prompts shipped clean. The two that didn't (Mandelbulb shader, Three.js crystal scene) rendered well-formed HTML but the actual visual output wasn't strong enough to publish. Parked in excluded-canvas/ rather than deleted because the failure modes are recognizable:
| Prompt | HTML KB | Tokens | Reasoning | Status |
|---|---|---|---|---|
| particle_attractor | 9.4 | 4,308 | 1,513 | shipped |
| generative_flowfield | 13.9 | 7,237 | 6,269 | shipped |
| physics_sandbox | 18.0 | 6,827 | 1,665 | shipped |
| audio_reactive | 10.7 | 5,731 | 7,645 | shipped |
| webgl_shader (Mandelbulb) | 11.5 | 4,928 | 1,831 | parked |
| three_scene (crystal) | 12.5 | 4,980 | 1,811 | parked |
5 prompts plus a structured-extraction nothink rerun (17th run). Total agentic time: 3.2 min.
| Prompt | Completion tokens | Reasoning chars | Time |
|---|---|---|---|
| multi_step_planning | 2,238 | 7,067 | 50 s |
| tool_use_json | 1,262 | 2,807 | 28 s |
| code_debug | 1,753 | 5,225 | 39 s |
| structured_extraction (thinking) | 1,721 | 4,245 | 39 s |
| self_critique | 1,255 | 3,309 | 28 s |
| structured_extraction_nothink | 351 | 0 | 8 s |
= vs ==, useless loop, off-by-one).search_flights, book_hotel, get_weather) with valid arg shapes.Qwen 3.5 27B → Qwopus 3.5 27B → Qwen 3.6 27B (base) → Qwopus 3.6 27B v2 (this run).
Each step has been a real jump. The first Qwopus finetune was a substantial lift in one-shot front-end execution over Qwen 3.5. The biggest single jump in the lineage came from Alibaba: Qwen 3.6 base raised the floor on every aspect of dense inference at this size. The new base does, out of the box, what the previous generation needed a finetune to do.
The gap between base and finetune is now narrower than it used to be. But the finetune still wins meaningfully on creative execution. As foundation models get stronger, the marginal lift on objective benchmarks shrinks, while the qualitative gap on subjective work (creative writing, design, code style) widens. The finetune training signal is no longer fixing basic mistakes; it's adding aesthetics and judgment on top of an already-strong substrate. That's where Qwopus 3.6 lives.
Qwopus 3.6 27B v2 (dense) is the agentic model of choice in this lineage, with one-shot design quality strong enough to share equal billing. It beats the 35B-A3B MoE finetune on SWE-bench by ~6 points on the same slice and effectively eliminates the empty-patch failure mode. The temperature-1.0 finding is the most actionable result; it's reproducible, the magnitude is large, and the underlying reasoning generalizes to other harnesses.
The creative bench is in genuinely good shape. All 5 designs ship complete, opinionated, production-quality pages on first try. The designer_portfolio output is the best one-shot pass I've seen at this size class. 4 of 6 canvas demos ship clean; the 2 that didn't are parked rather than deleted.
For high-throughput short-context work, the MoE is still the right model. For agentic, long-context, or reasoning-heavy work, this dense 27B is what you want. The lineage tells the story: the 3.6 base raised the floor enormously, and Qwopus 3.6 takes that base and adds the creative-execution edge.
Raw outputs, per-run metadata JSON, and backup samples preserved alongside each HTML / TXT file in this repo. Same harness and prompts as the Qwopus3.6-35B-A3B-v1 eval.