Qwopus3.6-27B v2 (dense) · Q5_K_M evaluation

by Kyle Hessling · @KyleHessling1 on X

Same prompt suite as the 35B-A3B eval (5 agentic + 1 nothink rerun, 5 designs, 6 canvas), plus a full SWE-bench Verified pass on the controlled-202 slice. 15 of 17 outputs published; the Mandelbulb shader and Three.js crystal scene shipped well-formed HTML but didn't render strongly enough to keep, so they're parked in excluded-canvas/. All published runs use thinking on. Single RTX 5090, Q5_K_M, 160K fp16 context.

TL;DR

75.25% on SWE-bench Verified (controlled-202): 152 resolved, 1 empty patch, 0 step-limit hits, 0 context overflows. Median trajectory 67 of 275 steps.
Run agentic harnesses hot. Lifting temperature from 0.1 to 1.0 collapsed empty/loop failures from 78 to 1 on the same slice between our earlier 35B-A3B run and this one. Details below.
One-shot HTML is excellent. All 5 designs ship complete, opinionated pages first try. The designer_portfolio output is the best one-shot pass I've seen at this size class.
4 of 6 canvas demos ship clean. Particle attractor, generative flowfield, soft-body sandbox, and audio-reactive visualizer all work. Mandelbulb and crystal scene rendered but not strongly; parked rather than deleted.
44 tok/s, ~31 GB VRAM at 160K fp16. Slower than the MoE 35B-A3B (162 tok/s), which is the expected dense-vs-MoE tradeoff.

Setup

Item	Value
Model	`Qwopus3.6-27B v2 (dense)` Q5_K_M GGUF, ~18 GB on disk
Architecture	Dense 27B, 64 layers, GQA (4 KV heads × 256 head dim), native 262 K ctx
Lineage	Finetune of `Qwen/Qwen3.6-27B` base
Runtime	llama.cpp CUDA-12.8, `--flash-attn on`, embedded jinja template
Context	163,840 tokens, fp16 K+V cache, single slot
Hardware	RTX 5090 (32 GB), all 64 layers on GPU, ~31 GB VRAM resident
Sampling, design HTML	temp 0.75 / top-p 0.95 / thinking on
Sampling, canvas (5 of 6)	temp 1.0 / top-p 0.95 / thinking on
Sampling, physics_sandbox	temp 0.75 (first run shipped clean, kept)
Sampling, agentic	temp 0.3 / top-p 0.9 / thinking on
SWE-bench harness	mini-swe-agent 2.2.8 · swebench 4.1.0 · temp 1.0 · step_limit 275 · workers 1

SWE-bench Verified

Metric	Value
Slice	controlled-202
Resolved	152 / 202 · 75.25%
Empty patches	1
Errors (patch failed eval)	1
Step-limit (275) hits	0
Context-overflow failures	0
Median trajectory	67 of 275 steps
Max trajectory	148 (django-14351)
Wall clock	19h 29m, parallel=1

Every instance exited Submitted. Not a single one hit the step limit or got within 50K tokens of the context cap. The 1 empty patch (matplotlib-24026) was the model calling submit with no diff after 64 steps. The 1 error (django-13028) submitted a 711-byte patch that didn't grade as correct.

Why low temperature hurts a thinking finetune

The intuition that runs counter to the result: finetunes are trained to converge, so greedy decoding should produce the cleanest output. That holds for short, well-anchored prompts. It is the wrong intuition for multi-step agentic loops.

Inside a harness, the model reads tool output, thinks (which on a thinking-on finetune is a long free-form chain), decides on an action, emits a tool call. At temp=0.1, every step follows the highest-probability path. Combine that with a finetune trained to think thoroughly, and the model lands on the "I should think more about this" branch at every single step. That produces 240-message trajectories where the model proposes, second-guesses, proposes again, and never submits.

At temp=1.0 the model still thinks, but it varies the path through reasoning and is more willing to commit. This run's trajectory data backs that up: median 67 steps, max 148, vs the earlier MoE finetune at temp=0.1 routinely hitting 240+ before timing out. Concrete recommendation: for any multi-step harness (mini-swe-agent, Cline, Claude-style tool use), start at temp=1.0 and walk back only if you see degradation.

Throughput

Metric	Qwopus 3.6 35B-A3B (MoE, Q5)	Qwopus 3.6 27B v2 (dense, Q5)
avg tok/s	161.9	43.9
tok/s range	154.4 / 164.8	43.1 / 44.6
VRAM resident	~25 GB (65K q8)	~31 GB (160K fp16)
Total tokens (this suite)	106,688	119,036
Total runtime	11.1 min	45.3 min

The MoE wins on throughput by ~3.7×, which is what the A3B routing pattern is for: only ~3 B of weights move through cache per token on the MoE vs the full ~16 GB Q5_K_M pass on the dense 27B. The dense pays for that with per-token quality, which shows up most clearly on agentic work. Pick the dense for agentic depth, long context, and reasoning; pick the MoE for high-throughput short-context generation.

Throughput variance is the tightest I've measured on this hardware: 43.1 to 44.6 tok/s across all 17 runs. A 1.5 tok/s spread means the model is fully memory-bandwidth-bound, which is the expected steady state.

Web design

All 5 designs validated end-to-end: DOCTYPE present, </html> reached, no truncation, balanced scripts. Average shipped size 40.8 KB per page.

Prompt	HTML KB	Tokens	Time	Reasoning
saas_landing	60.3	23,801	552 s	836
analytics_dashboard	42.1	15,390	354 s	1,898
designer_portfolio	32.5	11,612	265 s	1,459
pricing_page	26.6	9,360	213 s	1,077
mobile_app_marketing	42.3	16,590	382 s	1,650

designer_portfolio: the standout

The designer_portfolio prompt is wide-open: "design a designer portfolio site," no spec. Most models in this class respond with a generic skeleton and placeholder copy. The dense Qwopus picks an angle, commits to it, and executes: kinetic-typography hero, defensible visual rhythm through the case-study section, voice in the copy. It looks like a portfolio draft a human designer would actually ship. This is the clearest illustration of where the finetune still has real lift over the (much-improved) base. The base produces something competent; the finetune produces something with a point of view.

Notes on the other four

saas_landing · 60.3 KB. Real micro-interactions, real chart skeletons, real navigation states. Reads like a real product page rather than a screenshot of one.
analytics_dashboard · 42.1 KB. Light theme, emerald accent, hardcoded data with hover states. Legend, filter, and time-range chrome are wired up.
pricing_page · 26.6 KB. Three tiers with an animated monthly/annual toggle and a real FAQ accordion. Smaller than the saas_landing because the spec is more constrained.
mobile_app_marketing · 42.3 KB. Includes a CSS-only device mock. An earlier rerun of this prompt hit the 24K-token cap mid-SVG; bumping max_tokens to 48K produced this clean 16.6K-token result.

Canvas / WebGL

4 of 6 prompts shipped clean. The two that didn't (Mandelbulb shader, Three.js crystal scene) rendered well-formed HTML but the actual visual output wasn't strong enough to publish. Parked in excluded-canvas/ rather than deleted because the failure modes are recognizable:

Mandelbulb (webgl_shader): raymarching loop runs and the shader compiles, but the visible fractal is a flat shape. The structure of a raymarcher is right; the iterative distance-estimator math is underplayed.
Three.js crystal scene: scene loads, camera orbits, materials respond to light, but the composition lacks the dramatic transmission/refraction that makes the prompt interesting. A second turn fixing the material setup would likely land it.

Prompt	HTML KB	Tokens	Reasoning	Status
particle_attractor	9.4	4,308	1,513	shipped
generative_flowfield	13.9	7,237	6,269	shipped
physics_sandbox	18.0	6,827	1,665	shipped
audio_reactive	10.7	5,731	7,645	shipped
webgl_shader (Mandelbulb)	11.5	4,928	1,831	parked
three_scene (crystal)	12.5	4,980	1,811	parked

Highlights from the 4 published

physics_sandbox: soft-body verlet integration with mouse interaction. Cloth tearing and pinned-corner constraints both work. This is the one we kept at temp=0.75 because the first run shipped clean; everything else in canvas ran at temp=1.0.
audio_reactive: FFT bars + bloom on mic input. The model spent 7,645 chars of reasoning on this one, the most of any prompt in the suite. Bar-band layout, mic-permission flow, and bloom shader are all wired up. Needs HTTPS or local-file load for mic permission to grant.
generative_flowfield: simplex-noise vector field with ink-line agents. Aesthetic landing is good.
particle_attractor: 3000-particle fluid swarm. Smallest output (9.4 KB) but visually clean.

Agentic reasoning

5 prompts plus a structured-extraction nothink rerun (17th run). Total agentic time: 3.2 min.

Prompt	Completion tokens	Reasoning chars	Time
multi_step_planning	2,238	7,067	50 s
tool_use_json	1,262	2,807	28 s
code_debug	1,753	5,225	39 s
structured_extraction (thinking)	1,721	4,245	39 s
self_critique	1,255	3,309	28 s
structured_extraction_nothink	351	0	8 s

code_debug: caught all 4 bugs (sort order, = vs ==, useless loop, off-by-one).
self_critique: followed INITIAL → CRITIQUE → IMPROVED structure exactly. Stepped a brute-force palindrome to O(n²) expand-around-center.
multi_step_planning: 10-step deploy plan for the FastAPI URL shortener with explicit pip dependencies and Dockerfile hand-off.
tool_use_json: correct 3-tool sequence (search_flights, book_hotel, get_weather) with valid arg shapes.
structured_extraction: thinking-on produces valid JSON with all three people resolved and project ownership correctly mapped. The nothink rerun is also clean, just terser.

Lineage

Qwen 3.5 27B → Qwopus 3.5 27B → Qwen 3.6 27B (base) → Qwopus 3.6 27B v2 (this run).

Each step has been a real jump. The first Qwopus finetune was a substantial lift in one-shot front-end execution over Qwen 3.5. The biggest single jump in the lineage came from Alibaba: Qwen 3.6 base raised the floor on every aspect of dense inference at this size. The new base does, out of the box, what the previous generation needed a finetune to do.

The gap between base and finetune is now narrower than it used to be. But the finetune still wins meaningfully on creative execution. As foundation models get stronger, the marginal lift on objective benchmarks shrinks, while the qualitative gap on subjective work (creative writing, design, code style) widens. The finetune training signal is no longer fixing basic mistakes; it's adding aesthetics and judgment on top of an already-strong substrate. That's where Qwopus 3.6 lives.

Tuning recommendations

Agentic harnesses (Cline, mini-swe-agent, Claude-style tool use): temp=1.0, thinking on, top_p 0.9 to 0.95. The strongest recommendation in this report.
One-shot creative HTML (designs, dashboards): temp=0.75, thinking on, top_p 0.95.
Creative-canvas / generative code: temp=1.0, thinking on. Exception is physics_sandbox where the first run at 0.75 shipped clean and we kept it.
Structured output (JSON extraction, tool args, fixed schemas): temp=0.3, top_p 0.9. Thinking on for hard / ambiguous extractions, off for well-anchored ones.
If you don't know: start at temp=0.7, thinking on, and move the slider for your workload.

Caveats

2 of 6 canvas prompts didn't ship. Expect a second turn for complex shader / material work.
Dense throughput is ~3.7× slower than the MoE 35B-A3B. For high-volume short-context generation, take the MoE.
The temperature finding may not transfer cleanly to non-thinking finetunes. The general principle (greedy + thinking + multi-step loop = trouble) should hold, but the exact sweet spot is empirical.
75.25% on a curated 202-instance slice is directional, not authoritative. Treat as signal, not the full Verified set.
Designs have strong aesthetic priors. If you want output that doesn't look like these, be specific in the prompt about going against the grain.

Verdict

Qwopus 3.6 27B v2 (dense) is the agentic model of choice in this lineage, with one-shot design quality strong enough to share equal billing. It beats the 35B-A3B MoE finetune on SWE-bench by ~6 points on the same slice and effectively eliminates the empty-patch failure mode. The temperature-1.0 finding is the most actionable result; it's reproducible, the magnitude is large, and the underlying reasoning generalizes to other harnesses.

The creative bench is in genuinely good shape. All 5 designs ship complete, opinionated, production-quality pages on first try. The designer_portfolio output is the best one-shot pass I've seen at this size class. 4 of 6 canvas demos ship clean; the 2 that didn't are parked rather than deleted.

For high-throughput short-context work, the MoE is still the right model. For agentic, long-context, or reasoning-heavy work, this dense 27B is what you want. The lineage tells the story: the 3.6 base raised the floor enormously, and Qwopus 3.6 takes that base and adds the creative-execution edge.

Raw outputs, per-run metadata JSON, and backup samples preserved alongside each HTML / TXT file in this repo. Same harness and prompts as the Qwopus3.6-35B-A3B-v1 eval.