How AMD Qwen decode passed llama.cpp in six weeks on the Radeon AI PRO R9700
Quick answer: ZINC crossed llama.cpp on the scoped Qwen3.6 decode benchmark by fixing correctness first, then removing dead paths and moving repeated work into the hot Vulkan decode path. The remaining gap is prefill, not whether AMD RDNA4 can be a serious local Qwen inference target.
The first serious RDNA4 trace looked almost good, which is worse than looking broken. zinc was emitting English-shaped text at four tokens per second on the Radeon AI PRO R9700. Then the LM-head dump showed the truth: on Qwen3.5-35B-A3B-UD Q4_K_XL, 240,560 of the 248,320 vocabulary rows were still zero. The model was sampling from the three percent of logits our dispatcher happened to compute.
The fix was not glamorous. The host-side dispatch math for the matrix-vector kernel used the wrong rows-per-workgroup formula. The Q8_0 shader processes two output rows per workgroup, but the dispatcher launched it as if each workgroup covered 64 rows. On the LM head, that meant only 7,760 of 248,320 vocabulary rows were computed. The bug took two days to find and twenty seconds to fix. The longer write-up of that period is honest about how cheerful we were before we noticed.
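The arithmetic behind that bug is easy to reproduce. The sketch below is not zinc's dispatcher; the function names are hypothetical, and it assumes each launched workgroup writes exactly the number of rows the shader was built for. It only shows how a wrong rows-per-workgroup value on the host side leaves most of the LM head untouched.

```python
def workgroups_needed(total_rows: int, rows_per_workgroup: int) -> int:
    """Ceiling division: workgroups required to cover every output row."""
    return (total_rows + rows_per_workgroup - 1) // rows_per_workgroup


def rows_computed(num_workgroups: int, shader_rows_per_wg: int, total_rows: int) -> int:
    """Rows actually written when each workgroup handles a fixed row count."""
    return min(num_workgroups * shader_rows_per_wg, total_rows)


VOCAB_ROWS = 248_320       # LM-head rows quoted in the post
SHADER_ROWS_PER_WG = 2     # what the Q8_0 shader processes per workgroup
ASSUMED_ROWS_PER_WG = 64   # what the host-side formula wrongly assumed

# Buggy launch: sized for 64 rows per workgroup, executed at 2 rows each.
buggy_groups = workgroups_needed(VOCAB_ROWS, ASSUMED_ROWS_PER_WG)
buggy_rows = rows_computed(buggy_groups, SHADER_ROWS_PER_WG, VOCAB_ROWS)

# Fixed launch: host and shader agree on 2 rows per workgroup.
fixed_groups = workgroups_needed(VOCAB_ROWS, SHADER_ROWS_PER_WG)
fixed_rows = rows_computed(fixed_groups, SHADER_ROWS_PER_WG, VOCAB_ROWS)

print("buggy:", buggy_groups, "workgroups,", buggy_rows, "rows,",
      VOCAB_ROWS - buggy_rows, "rows left at zero")
print("fixed:", fixed_groups, "workgroups,", fixed_rows, "rows")
```

The buggy launch covers 7,760 rows and leaves 240,560 at zero, which matches the LM-head dump. A cheap guard against this class of bug is to assert, per pipeline, that workgroups times shader rows is at least the output row count.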
Six weeks later, the headline is narrower and stronger than the old draft made it sound. On the same Radeon AI PRO R9700 — 32 GB GDDR6, 128 AI accelerators, gfx1201 — the latest published benchmark artifact shows zinc decoding Qwen3.6-35B-A3B UD Q4_K_XL at 117.07 tok/s against llama.cpp’s 104.47 on the same machine and model file. The same artifact also shows the unfinished part: prefill is 88.08 tok/s in zinc against 181.95 in llama.cpp. The numbers are below.
This is not a broad “zinc beats llama.cpp everywhere” claim. The honest scope is narrower and more useful: Qwen3.6-35B-A3B UD Q4_K_XL decode on the Radeon AI PRO R9700 has crossed llama.cpp in the public suite, while hybrid MoE plus SSM prefill is still behind. That is a real milestone and a real caveat at the same time.
Why this result matters
The result matters because it moves the AMD conversation out of the usual fit-versus-fast tradeoff. A 32 GB RDNA4 card can fit a 35B-A3B model, but fitting the model is not the hard bar for local inference. The hard bar is whether the decode loop can stream a long answer at competitive speed once MoE routing, gated DeltaNet state, KV cache reads, sampler work, and Vulkan dispatch overhead all show up at the same time.
These three numbers measure different waits:
| Measurement | zinc | llama.cpp | Why it matters |
|---|---|---|---|
| Published-suite decode | 117.07 tok/s | 104.47 tok/s | The hot answer loop is now ahead on the flagship model. |
| Published-suite prefill | 88.08 tok/s | 181.95 tok/s | Prompt ingestion is still the biggest Qwen gap. |
| Journey delta | 0.8 tok/s broken -> 117.07 tok/s decode | 107 tok/s March llama.cpp reference | The path went from barely coherent to competitive on the same card class. |
The middle row is as important as the first. If a post says only “decode crossed,” it can be true while still hiding the user-visible pause before a long first token. If it says “inference is faster” without the prefill caveat, it is too broad. The interesting engineering state is the combination: decode is now good enough to take seriously, and prefill has a named structural backlog.
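One way to see how the two rows combine is to ask how long an answer has to be before the decode lead pays back the prefill deficit. The sketch below uses only the four published throughput numbers and makes a strong simplifying assumption: both rates stay constant regardless of prompt or output length, which real runs do not guarantee. The function names are ours.

```python
ZINC = {"prefill": 88.08, "decode": 117.07}        # tok/s, published suite
LLAMA_CPP = {"prefill": 181.95, "decode": 104.47}  # tok/s, published suite


def total_seconds(engine, prompt_tokens, output_tokens):
    """Prompt ingestion time plus generation time at constant rates."""
    return prompt_tokens / engine["prefill"] + output_tokens / engine["decode"]


def break_even_outputs_per_prompt_token(a, b):
    """Output tokens per prompt token at which engines a and b tie.

    Assumes a has slower prefill but faster decode than b.
    """
    extra_prefill = 1.0 / a["prefill"] - 1.0 / b["prefill"]  # s lost per prompt token
    saved_decode = 1.0 / b["decode"] - 1.0 / a["decode"]     # s gained per output token
    return extra_prefill / saved_decode


ratio = break_even_outputs_per_prompt_token(ZINC, LLAMA_CPP)
zinc_s = total_seconds(ZINC, 1000, 500)
llama_s = total_seconds(LLAMA_CPP, 1000, 500)

print(f"break-even: {ratio:.2f} output tokens per prompt token")
print(f"1000-token prompt, 500-token answer: zinc {zinc_s:.1f} s, llama.cpp {llama_s:.1f} s")
```

Under that constant-rate assumption the answer has to be several times longer than the prompt before zinc finishes first, which is why the prefill row matters as much as the decode row.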
The story is not a clean upward chart because the target kept getting harder. The model family moved from Qwen 2.5 to Qwen 3 to Qwen3.5-35B-A3B to Qwen3.6-35B-A3B while the engine was being written under it. The useful version of the history is the proof each phase gave us:
| Date | State | What it proved |
|---|---|---|
| Mar 27 | 0.8 tok/s and mostly wrong | Correctness had to come before optimization. |
| Mar 30 | 33.58 tok/s on Qwen3.5-35B-A3B | The Vulkan path could move real weight bandwidth once dispatch and memory placement were sane. |
| Apr 5 | Qwen 3 landed and the number fell | MoE plus gated DeltaNet was not a Llama-shaped porting problem. |
| Apr 26 | 90.24 tok/s prefill versus about 180 | The remaining gap was structural: SSM state and MoE batching, not one bad shader. |
| May 6 | Prefix KV reuse moved from idea to serving primitive | Cache semantics beat another round of micro-kernel tuning for repeated chat prefixes. |
| May 10 | 117.07 tok/s decode, 88.08 tok/s prefill | Decode crossed llama.cpp; prefill remained the active gap. |
The rest of the post is the longer version of each step and the dead ends we stopped carrying.
The first ten days
The first run that produced any plausible English was on March 30, five days after we started. The path from there is documented in the 33 tok/s recap. The shortest version:
Three changes mattered. We switched from VK_BUFFER_USAGE_TRANSFER_DST_BIT host-visible staging buffers to device-local memory with a single vkCmdCopyBuffer per layer, which killed a PCIe round trip per dot product. We wrote a fused dmmv kernel for the common case of a vector-by-matrix multiply where the matrix is Q4_K_M, replacing the dequant-then-matmul pipeline that was the inherited Vulkan baseline. And we collapsed the vkQueueSubmit per pipeline stage into one command buffer per layer.
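The fused-kernel idea can be shown without a GPU. The sketch below is a CPU illustration, not the zinc shader, and it uses a deliberately simplified 8-bit block format (32 values sharing one scale) rather than the real Q4_K layout. It compares the dequantize-then-multiply baseline, which materializes a full float row, with a fused loop that accumulates on the quantized values and applies each scale once per block.

```python
import random

BLOCK = 32  # values per quantization block (simplified, assumed layout)


def quantize_row(row):
    """Quantize a float row into a list of (scale, int8 values) blocks."""
    blocks = []
    for start in range(0, len(row), BLOCK):
        chunk = row[start:start + BLOCK]
        amax = max(abs(x) for x in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [round(x / scale) for x in chunk]))
    return blocks


def dequant_then_matvec(qmatrix, vec):
    """Baseline: rebuild each float row in memory, then take the dot product."""
    out = []
    for qrow in qmatrix:
        dense = [scale * q for scale, qs in qrow for q in qs]
        out.append(sum(w * x for w, x in zip(dense, vec)))
    return out


def fused_matvec(qmatrix, vec):
    """Fused: no dense row; one scale multiply per block instead of per value."""
    out = []
    for qrow in qmatrix:
        acc = 0.0
        for b, (scale, qs) in enumerate(qrow):
            base = b * BLOCK
            partial = 0.0
            for i, q in enumerate(qs):
                partial += q * vec[base + i]
            acc += scale * partial
        out.append(acc)
    return out


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(4)]
vec = [rng.uniform(-1, 1) for _ in range(64)]
qmatrix = [quantize_row(r) for r in weights]

baseline = dequant_then_matvec(qmatrix, vec)
fused = fused_matvec(qmatrix, vec)
print(max(abs(a - b) for a, b in zip(baseline, fused)))
```

The two paths agree up to floating-point rounding. On a GPU the saving is not the arithmetic but the intermediate buffer: the fused form never writes the dequantized weights back to memory.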
None of that was new. All of it was unimplemented in the Vulkan backend we forked from. The 33.58 tok/s number was on Qwen3.5-35B-A3B-UD Q4_K_XL, batch one, decode-only. It was also the moment the project stopped feeling like an interesting bring-up and started feeling like an actual inference engine.
The dead ends were useful
The most useful work in April was not always the work that stayed in the tree. The optimization loop tried a lot of ideas that sounded like “what llama.cpp does” and then measured them as flat or negative because the call site was wrong, the batch shape was wrong, or the engine was still paying a larger state-management tax somewhere else.
| Tried | What happened | What it taught |
|---|---|---|
| Port llama.cpp-style tiled GEMM foundations | mul_mm_q4k was bit-identical in the LM head, measured 78.14 tok/s against a 78.55 baseline, and 1,470 lines of dormant infrastructure were later reverted. | A correct kernel is not a win until it is wired into the hot path. |
| Chase 32-column DMMV before full GEMM | The weight-read math was right, but shared register budgets made the decode path pay for prefill’s column count. | Variant-specific pipelines matter on RDNA4 because VGPR pressure is throughput. |
| Plan Q8_1 activation quantization | The direction stayed right, but a standalone shader port was not enough without the layer-level reuse and buffer lifecycle. | Activation quantization is a systems change, not one shader. |
| Try a vocab-matched speculative draft | Nineteen public draft/verifier configs found no net win once SSM rewind cost was included. | MoE lowers verifier cost enough that the classic dense-model speculation math stops applying. |
| Follow the FP4 wave | FP4 saves footprint, but gfx1201 has no FP4 WMMA instruction, so the compute path dequantizes back toward FP16-class math. | The ISA decides which quantization fashions are real throughput wins. |
This table is why the May result does not feel like one trick. The winning path was not “copy llama.cpp.” It was copy the idea only after understanding which part of the idea was doing the work: fewer repeated reads, fewer launches, better cache keys, and state that lives at the right boundary.
April: this is fine
The dark days started immediately after. Qwen 3 landed on April 5, and Qwen 3 was structurally different from Llama 3 in five places the 7→33 work had not touched. Qwen 3.5/3.6 35B-A3B activates 3B parameters per token through MoE routing. Three out of four attention blocks are replaced by gated DeltaNet linear attention with a recurrent state. The KV cache shape, the sampler chain, and the prefill batching path all need different code from the dense Llama 3 path.
The SSM-state NaN was a gated DeltaNet decay-gate underflow in FP16, fixed by clamping the gate before the recurrence. The all-zero MoE router logits came from a top-k=2 selection running before softmax in a fused path, fixed by reordering the ops. The DeltaNet Q-norm drift was a missing RMSNorm on the recurrent state, which compounded into a 30 percent error by token 100. The negative infinity in the flash-attention softmax came from a cold pipeline state with an uninitialized FlashAttention scale.
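For the router fix, a small CPU reference makes fused-kernel regressions easy to catch. The sketch below is an illustrative reference, not zinc's code: it runs softmax over all router logits first and selects the top k afterwards. Renormalizing the selected weights so they sum to one is an assumption here, since the post does not say whether the model does that.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_softmax_then_topk(router_logits, k=2, renormalize=True):
    """Reference ordering: softmax over every expert, then keep the top k."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    if renormalize:  # assumption: selected weights are rescaled to sum to 1
        norm = sum(probs[i] for i in top)
        return {i: probs[i] / norm for i in top}
    return {i: probs[i] for i in top}


logits = [0.1, 2.0, -1.0, 1.5]  # made-up router logits for four experts
weights = route_softmax_then_topk(logits, k=2)
print(weights)
```

Comparing a fused kernel's expert indices and weights against a reference like this on a handful of tokens would flag an all-zero router output immediately.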
The cumulative effect over April was zinc going from “runs Qwen 3 dense” to “runs Qwen 3.5-35B-A3B, but the hybrid prefill path is structurally behind.” The April 26 gate post is the honest snapshot at the bottom of that valley: zinc at 90.24 tok/s prefill against about 180 tok/s for llama.cpp on Qwen 3.5-35B-A3B, with the SSM bucket sitting at 925 ms out of 2,110 ms of GPU phase time.
Late April: naming the prefill gap
The first structural prefill wins were not enough to declare victory, but they changed the shape of the problem. Vulkan specialization constants for the dmmv variants let kernels specialize the matrix shape at compile time instead of branching at runtime. Two days later, the vkQueueSubmit-per-prompt change attacked launch overhead. The larger lesson from the phase profile was that gated DeltaNet state had to become block-resident: keep the recurrent state in registers across the token loop instead of re-reading and re-writing 2 MB per layer per token.
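The block-resident idea is about memory traffic, not math, and a toy model shows the difference. The sketch below is a simplified stand-in: `CountingStateBuffer` is a hypothetical class that counts loads and stores, and the recurrence is a plain gated decay-and-add rather than the real gated DeltaNet update.

```python
class CountingStateBuffer:
    """Stands in for a per-layer recurrent state kept in device memory."""

    def __init__(self, state):
        self._state = list(state)
        self.loads = 0
        self.stores = 0

    def load(self):
        self.loads += 1
        return list(self._state)

    def store(self, state):
        self.stores += 1
        self._state = list(state)

    def snapshot(self):
        return list(self._state)


def step(state, gate, update):
    """Toy gated recurrence: decay the old state, add the new contribution."""
    return [gate * s + u for s, u in zip(state, update)]


def per_token(buffer, gates, updates):
    """Reload and rewrite the full state for every token."""
    for g, u in zip(gates, updates):
        buffer.store(step(buffer.load(), g, u))


def block_resident(buffer, gates, updates):
    """Load once, keep the state local across the token loop, store once."""
    state = buffer.load()
    for g, u in zip(gates, updates):
        state = step(state, g, u)
    buffer.store(state)


gates = [0.9, 0.8, 0.7]
updates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slow = CountingStateBuffer([0.0, 0.0])
fast = CountingStateBuffer([0.0, 0.0])
per_token(slow, gates, updates)
block_resident(fast, gates, updates)

print(slow.loads, slow.stores, fast.loads, fast.stores)
```

Both variants end with the same state, but the per-token version touches the buffer once per token in each direction while the block-resident version touches it once per block. On the GPU the local variable corresponds to state held in registers.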
By the May 1 attention-not-GEMM post, the question had changed. The chat-shaped long-prefill case was no longer a generic “we need a GEMM” problem. The remaining profile had named buckets: MoE cohort grouping, attention shape, SSM state, and cache reuse. The public May 10 number says the same thing more bluntly: Qwen3.6 prefill is the biggest gap left.
Early May: making decode credible
Decode is harder because there is less batch to hide behind. The matmul is bandwidth-bound by weight and KV reads, the sampler is on the hot path, and the attention has to be correct at every generated token. The May posts walk through the fixes, one per day; the summary below shows where each landed in the decode loop.
INT8 KV cache by default doubled the per-token KV bandwidth budget at 32k context. Attention sinks resident in KV stopped the long-context perplexity blowup that was forcing aggressive eviction. Min-p before temperature and DRY repetition penalty replaced the sampler chain that had been on by default. Prefix KV reuse remains the highest-leverage chat-serving direction because it removes repeated prefix work rather than making one more single-token dispatch marginally faster.
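The sampler ordering is the easiest of these to show in a few lines. The sketch below is a generic illustration of min-p filtering applied before temperature scaling, not zinc's sampler: the `min_p` and `temperature` values are arbitrary, and the DRY repetition penalty is left out entirely.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_then_temperature(logits, min_p=0.1, temperature=0.8):
    """Filter on temperature-1 probabilities, then reshape the survivors."""
    base_probs = softmax(logits)
    threshold = min_p * max(base_probs)  # cutoff scales with the top token
    kept = [i for i, p in enumerate(base_probs) if p >= threshold]
    scaled = softmax([logits[i] / temperature for i in kept])
    return dict(zip(kept, scaled))


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    r = rng.random()
    cumulative = 0.0
    token = None
    for token, p in dist.items():
        cumulative += p
        if r < cumulative:
            return token
    return token  # guards against rounding leaving r just above the total


logits = [5.0, 4.0, 0.0, -3.0]  # made-up logits for a four-token vocabulary
dist = min_p_then_temperature(logits)
token = sample(dist, random.Random(0))
print(dist, token)
```

Filtering first means the set of candidate tokens does not change when temperature changes; temperature only redistributes probability among tokens that already passed the cutoff.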
None of these were inventions. All of them were already known in the research and the local-inference community. The work was reading the literature, implementing the primitives correctly for gfx1201, and shipping them one per day until the decode loop was no longer leaving anything on the table.
What today’s number is and is not
The configuration for the headline numbers is the published May 10 RDNA artifact: Qwen3.6-35B-A3B UD Q4_K_XL on a Radeon AI PRO R9700 with 32 GB VRAM, ZINC CLI against llama.cpp on the same RDNA node, same model file, one discarded warmup, and three measured runs. The numbers quoted here are the top-line reference entry for that model, each reported as the median of the measured samples; the artifact also includes longer-context and extended-decode scenarios. It records ZINC commit 321309a2fe8b and llama.cpp commit 9725a313b.
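The aggregation described above is simple to reproduce. The sketch below assumes the samples arrive as a list of tok/s values in run order; the function name is hypothetical and the sample values are obviously synthetic placeholders, not numbers from the artifact.

```python
import statistics


def summarize_runs(samples_tok_s, warmup_runs=1):
    """Drop the leading warmup runs and report the median of the rest."""
    measured = samples_tok_s[warmup_runs:]
    if not measured:
        raise ValueError("need at least one measured run after warmup")
    return statistics.median(measured)


# Placeholder samples: one warmup followed by three measured runs.
samples = [10.0, 20.0, 22.0, 21.0]
print(summarize_runs(samples))
```

With three measured runs the median is the middle sample, so a single slow or fast outlier does not move the reported number.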
zinc is not faster than llama.cpp on a 5090. zinc is not faster on every dense model in the 7B to 14B class where llama.cpp’s Vulkan path has been polished for a year. zinc is not done with short-context prefill on hybrid MoE plus SSM models; the current public number is 88.08 tok/s against llama.cpp’s 181.95. zinc is not faster on the Qwen3-Next gated DeltaNet path because the state-checkpoint plane is half-written and the Vulkan MTP-head kernel has not landed. zinc is also not the right engine if you want to run an FP4 quantized model, because the RDNA4 silicon does not accelerate that format.
The honest scope of “we beat llama.cpp” is Qwen3.6-35B-A3B UD Q4_K_XL decode on the R9700 at 32 GB in the published RDNA suite. It is a narrow configuration. It is also the one that tells us the RDNA4 path is no longer a correctness demo. The remaining question is whether we can make prefill equally boring.
What comes next
Three items are open and roughly sized. First, close Qwen3.6 prefill by wiring the batched path through hybrid MoE plus SSM instead of falling back to per-token dispatch when n_experts > 0 or ssm_d_inner > 0. Second, keep the decode lead through the longer-context cases, where llama.cpp still has mature scheduling and cache behavior. Third, finish the Qwen3-Next state-checkpoint plane so prefix reuse can handle gated DeltaNet rollback cleanly instead of only helping transformer-shaped prefixes. FP8 weights via the MI350X-shared Triton path on vLLM is a separate stack but the same hardware, covered in today’s other post.
The shape of the work is the same as the first six weeks. Read the relevant paper. Implement the primitive correctly for gfx1201. Measure honestly. Ship one fix per day. Try not to chase the next quantization fashion before squeezing the throughput the silicon already pays for.
Six weeks ago zinc was 97 percent zero logits. Today, decode on the flagship Qwen3.6 model is fast enough that the interesting question moved to prefill and long-context serving. That is the kind of progress that matters: not a clean sweep, not a victory lap, but a benchmark artifact that narrows the next week of work to the right bottleneck.