A mixture-of-experts model was decoding at 9 tok/s on an RTX 4090 — slower than a dense 27B. This is how ZINC's CUDA backend diagnosed boost-clock starvation, batched its MoE experts into two GPU-side kernels (turning the catalog's slowest models into front-runners that outrun the big dense ones), fused gemma's attention dispatches, and kept every change token-for-token correct against llama.cpp.
ZINC is a Vulkan and Metal inference engine. On the WSL2 NVIDIA box the only Vulkan device is llvmpipe on the CPU, so ZINC could not touch the RTX 4090 or 5090 at all. This is how we wrote a CUDA backend by mirroring the Metal shim, found a per-head attention-gate bug by diffing residuals against llama.cpp, discovered that per-token sync gaps were starving the GPU boost clock, and shipped an async stream/event ring that pushed Qwen3.5-9B decode past llama.cpp.
ZINC is getting a fourth GPU backend: native CUDA for NVIDIA. The surprise that forced it — on Windows + WSL2, NVIDIA exposes only CUDA, not Vulkan, so the one Vulkan device ZINC can see is a CPU. The reassurance that makes it tractable — ZINC's matmuls are int8 dot products, not tensor-core GEMMs, so they map 1:1 onto __dp4a. This is the plan, what we've already validated on an RTX 5090, and the road to a first token.
Yesterday's RDNA4 post showed ZINC moving dense Qwen3-8B prefill from 42.9 to 207.9 tok/s. The tempting question is why Qwen3.6 35B-A3B cannot simply reuse that path. The answer is the hybrid wall: dense batching solved repeated weight reads for a transformer, while Qwen 35B needs batched MoE routing and block-resident SSM state before its prefill gate can come down.
The old RDNA4 batched-prefill docs note was really a blog post hiding in the docs tree. This is the cleaned-up version: how ZINC discovered that ZINC_BATCHED_PREFILL was a no-op, used validate mode to prove the forward pass was correct, fixed a stale GPU argmax sampler bug, replaced serial-over-K DMMV with K-parallel Q4_K/Q6_K batched shaders, and moved Qwen3-8B prompt ingestion on the R9700 from 42.9 to 207.9 tok/s.
ZINC_RT is still not the fast backend. The important change this week is smaller and more useful: direct AMDGPU CS kernels are starting to compute real model rows that the live decode path actually consumes. This post explains the difference between a diagnostic GPU probe and a consumed model slice, why the first row-parallel DMMV attempts failed, why the one-wave Q8_0 path works, and what this says about the road from host-assisted decode to a real direct runtime.
Qwen is where ZINC has the cleanest headline wins. Gemma is where the engine gets audited. The current dashboard shows ZINC close to llama.cpp on Gemma decode and even ahead on Metal dense prefill, but prefill and MoE coverage still expose the weak spots. This post explains why the two Gemma rows matter, what they reveal about sliding-window attention, asymmetric GQA, GEGLU, and command scheduling, and what has to improve before Gemma becomes a release-strength target.
The current ZINC dashboard tells a more interesting story than a single tok/s headline. ZINC is now ahead of llama.cpp on Qwen 3.6 35B A3B decode on both the Radeon AI PRO R9700 and Apple M4 Max. It is not ahead everywhere: prefill remains the broad gap, end-to-end harness latency is still harsh, and Intel Arc is experimental. This post explains how to read the numbers without flattening prefill, decode, and full request latency into one misleading score.
A 100 tok/s decode loop on the Radeon AI PRO R9700 leaves about ten milliseconds per token, and llama.cpp's Vulkan backend currently spends a measurable slice of that on a host-side vkWaitForFences between every submission. The fence is doing what a fence has to do: hold the CPU on the device until the GPU is idle, so the next submission can be safely recorded and queued. Timeline semaphores, core in Vulkan 1.2 since 2020, replace that with a single 64-bit counter the GPU bumps as each submission completes and the host queries when it wants the next result. zinc uses the counter to record three or four decode submissions ahead, hand them to the queue in one burst, and let the GPU pull them back-to-back while the CPU prepares the sampler input for whichever token finishes next. The host roundtrip disappears, and the decode loop turns into a pipeline rather than a ping-pong.
Quantize Qwen3-30B to IQ3_M without an imatrix and the model ships at roughly 3.63 bits per weight, which on a 32 GB Radeon AI PRO R9700 looks like a free win against Q4_K_M. Skip the calibration pass that produced the codebook lookup, though, and the same file gives a coherent assistant at the start of a chat and a confused one by turn three. IQ3_S and IQ3_M are not 3-bit quants in the way Q3_K is. They are 512-entry codebooks the quantizer searches against a weighted RMSE loss, and the only thing supplying the weights is the diagonal of a per-layer activation Hessian computed over a calibration corpus. Without it, the codebook is doing approximately a uniform fit, and the quality curve below 4 bpw collapses back onto k-quants. This is the offline pass nobody talks about and the reason every IQ-quant on Hugging Face has an `imatrix` step in its lineage.
Every decode step on Qwen3 hands the CPU a vector of 151,936 logits, and the default llama.cpp sampler chain walks that vector five times before it picks a token. At 100 tok/s on a Radeon AI PRO R9700, the GPU forward pass is around 10 ms and the naive sampler chain is no longer a rounding error inside it. The reason the chain still fits is a single optimization: ikawrakow's bucket sort from PR #5109, reused by PR #15665 across top-k, top-p, and min-p, so the chain pays one histogram and then walks only the buckets that contain the survivors. FlashSampling, the March 2026 paper that fuses sampling into the LM-head matmul epilogue, is the next move and the reason the CPU chain has a ceiling. On a single-user local engine the choice is not whether to sample on the GPU, it is which samplers can move there and which ones the CPU still has to carry.
A local model can spell an emoji or a full Chinese sentence correctly and still have it land in the browser as a row of black diamonds. The cause is not the model and not the network. Byte-level BPE, the tokenizer scheme behind Qwen3, GPT-2, and most modern LLMs, emits tokens that are spans of raw bytes rather than whole characters, so a single four-byte character like the slightly smiling face 🙂 (F0 9F 99 82) is routinely split across two or more tokens. Ship each token's bytes to the client the instant it decodes and the receiver's UTF-8 decoder sees an unfinished sequence, which it turns into the U+FFFD replacement character. zinc fixes this the way every serious serving stack eventually does: it detokenizes against the codepoint boundary instead of the token boundary, holding the incomplete trailing bytes, up to the three continuation bytes a four-byte sequence can leave dangling, until the next token closes the character. The boundary, not the token, is the unit you are allowed to put on the wire.
PagedAttention is one of the best systems ideas in LLM serving, and for a single-user local engine it is mostly the wrong default. Paging earns its keep by packing many concurrent, variable-length sequences into GPU memory with near-zero waste; a desktop assistant runs one conversation at a time and has nothing to pack, so it inherits the cost of a block-table lookup inside the attention kernel without the benefit. zinc's default reflects this: it borrows vLLM's memory-budget math and then allocates one contiguous arena, not a pool of blocks. The exception is the case that pays for paging anyway, and it is the same feature we called the cheapest speedup left on local chat: the moment conversations branch and share a prefix, block-granular allocation stops being overhead and becomes the only clean way to let siblings share a parent's KV without copying it.
The textbook says a draft that is free and always accepted should make decode several times faster. On Qwen3.6-35B-A3B it does the opposite. A public RTX 3090 benchmark ran n-gram and draft-model speculation at 100 percent acceptance on code and reasoning prompts and watched single-request decode fall from 135.7 tok/s to as low as 59. The reason is not the draft and not the acceptance rate. It is that a sparse mixture-of-experts verifies K drafted tokens by loading the union of the experts they route to, and below an expert-saturation threshold of about 94 tokens every extra drafted token wakes a fresh expert. That single fact decides whether speculation helps or hurts an A3B model, and it is why the right default on a single-user local engine is usually no speculation at all.
Qwen3 is trained to 32,768 tokens, and the way you get to 131,072 is a single config field: a YaRN rope_scaling block with factor 4.0. Flip it and the model reaches 128k. The catch is that open-source frameworks implement static YaRN, so that factor of 4 rescales the position math for every request, including the 1,500-token chat that never needed it. Qwen's own documentation says the quiet part out loud: do not enable YaRN unless you are processing long contexts, because it can degrade short text. That makes context length a per-request decision rather than a launch flag, and it is the difference between an engine that serves both the quick question and the 120k-token document well and one that taxes every prompt to keep the rare long one alive.
You can quantize Qwen3's weights to 4 bits and lose a fraction of a perplexity point. Quantize its activations to 4 bits the same naive way and the model falls apart. The reason lies in the activations themselves: a handful of channels carry values tens to thousands of times larger than the rest, and a single quantization scale cannot hold both the giants and everyone else. Those few channels are only about 0.1 percent of the features, but zeroing them degrades perplexity by hundreds of percent. That asymmetry is why zinc's prefill path quantizes activations to 8-bit q8_1 and not lower, why the clever quantization tricks all push difficulty onto the weights, and why the activation floor on RDNA4 sits at 8 bits while the weight floor sits at 4.
The A3B in Qwen3-30B-A3B means 3.3 billion active parameters, and the story that travels with it is that you get a 30B model at 3B inference cost. Half of that is true. At 4-bit the 3.3B active weights are about 1.9 GB read per token, but the full 30.5B that has to stay resident is about 17 GB, and the card has to hold all 17. Worse, the 1.9 GB is a different 1.9 GB every token, because the router picks a fresh 8 of 128 experts per layer each step. The active-parameter count tells you how fast the model decodes and nothing about how big a card you need, and on a mixture of experts those are very different questions.
A dense 70B at 4-bit needs about 40 GB, so it does not fit on one 32 GB Radeon AI PRO R9700 and has to be split across two cards. Tensor parallelism is the only split that makes a single local chat decode faster, because each card then reads half the weights, but it fires two all-reduces per layer, 160 cross-card synchronizations per token on an 80-layer model. On an Instinct card those run over a 128 GB/s Infinity Fabric link built for exactly this. On two R9700s they run over PCIe, the one link AMD did not give the consumer card, and that is what decides whether a tensor split or a layer split is the right default.
At batch size one, FlashAttention uses less than one percent of an A100, and the same starvation happens on a 64-compute-unit Radeon AI PRO R9700. A single-user local engine lives in that regime all day. Flash-Decoding fixes it by splitting the KV cache into chunks so the attention kernel fills every compute unit, and on the published A100 micro-benchmark it cuts batch=1 attention at 128k context from 4592 microseconds to 107. This is the long-context decode win RDNA4 leaves on the table, and the change is one extra parallelization dimension plus a small reduction kernel.
The Red Hat and AWS team published a state-of-FP8-KV-cache benchmark in April that put Qwen3's decode ITL slope at 54 percent of BF16 on an H100, with Flash Attention 3 doing the QK and PV matmuls in FP8 the whole way. RDNA4 ships the same E4M3FN WMMA instruction Triton already uses on MI350X, but no local Vulkan engine has a Qwen3 attention kernel that reads FP8 K and V. That is the next bandwidth cut on the R9700 decode loop, and the only thing standing between it and shipping is one cooperative-matrix shader.
We hit 117 tok/s decode on Qwen3.6-35B-A3B on a Radeon AI PRO R9700 through the Vulkan backend, ahead of llama.cpp at 104. That is 31 percent of the card's DRAM bandwidth. This is the long technical case for why we are leaving Vulkan in place and writing ZINC_RT alongside it, instead of switching to ROCm or HIP, and what the three stacks actually look like under the hood.
ZINCZINC_RTVulkanROCmHIPAMDRDNA4LLM inferenceGPU runtimescontinuous batchingmultitenant inferenceRadeon AI PRO R9700Qwen3
32 GB of ECC GDDR6, 608 GB/s of memory bandwidth, 256 XMX matrix engines, 367 INT8 TOPS, and PCIe 5.0 x16 in a 230 W card. The Intel Arc Pro B70 is the first card in this VRAM and bandwidth tier at workstation pricing rather than datacenter pricing. This is a long technical look at the Xe2 architecture, why the spec combination changes the math for local 35B-class LLM inference, and what ZINC's M7 T-Intel backend has to ship to use it.
ZINCZINC_RTIntelArcArc ProArc Pro B70BattlemageXe2DPASXMXLLM inferenceGPU runtimeslocal AIconsumer GPU LLM
Alibaba Cloud Model Studio caps a Qwen3 reasoning trace at the token budget you ask for, inserts the same fixed sentence the technical report names, and resumes decoding into the answer. Open-source engines do not. llama.cpp cannot reliably turn thinking off, has no budget, and exposes neither the splice nor the logit warp that would emulate one. On a 117 tok/s Radeon AI PRO R9700 that is fifteen seconds of avoidable reasoning every time the model picks a math problem to chew on.
EXL3 is the cleanest open quantization story of the last twelve months. It is a QTIP variant, it is coherent at 1.6 bits per weight on a 70B, and it runs the Marlin-style memory-bound GEMM that AWQ never quite managed. It also has zero non-CUDA backends. For zinc on a Radeon AI PRO R9700, that gap is the most expensive open problem on the local inference shelf right now.
Qwen3 ties the input and output embedding weights at 0.6B, 1.7B, and 4B. At 8B it stops. The boundary is also where the LMHead matmul stops being a rounding error and starts costing about half a millisecond per decode token on a Radeon AI PRO R9700, and where the GGUF file size jumps by a separate 510 MB tensor that local engines have to dequantize once per token. After the wave32 attention fix landed, that is the next visible decode tax.
Same Qwen3 prompt. Same seed. Temperature zero. One user, one chat tab, one local GPU. The completion changes between runs anyway. Thinking Machines named the mechanism in September 2025: kernels are run-to-run deterministic, but they are not batch-invariant, and the chunk size that decides first-token latency also decides which reduction tree fires. SGLang and llama.cpp now ship a batch-invariant path on CUDA. The Vulkan path on RDNA4 does not, and that is the next correctness fight for local inference.
On a long local Qwen3 prompt the first decode token does not appear until prefill finishes. The size of the prefill chunk is the knob that decides how long that wait is, how much activation memory the engine reserves, and whether a second chat slot gets to make progress at the same time. The default is 512 tokens because it is safe, not because it is fast.
Scalar flash attention in llama.cpp's Vulkan backend was running wave64 by default on AMD GPUs through the end of 2025, which is the wrong wavefront width for any RDNA part and especially wrong for RDNA4. The wave32 commit from PR 19625 is what finally restores long-context decode and prefill on the Radeon AI PRO R9700, with measured pp512@d16384 going from 84 to 247 tok/s on a sibling AMD card and an end-to-end 56 percent throughput gap closing on consumer RDNA hardware. The mechanism is small and architecturally clean: RDNA's native SIMD is 32 lanes wide, RDNA4's dynamic VGPRs only exist in wave32, and a scalar FA tile with row_split=1 has no work for the upper half of a wave64. The post walks through why the default was wrong, what the fix touches, and what the next two RDNA4 occupancy unlocks look like once wave32 is the floor.
Every public RDNA4 tuning thread starts at the GPU. Set the AMD power profile to COMPUTE, force the SMU to profile_peak, lock the SCLK, write a new kernel config. None of those move dense Qwen3 decode on a Radeon AI PRO R9700. The one knob that does live outside the GPU entirely. PCIe ASPM resets to the kernel's default powersave-leaning policy on every boot, and on dense decode that costs 10.8 percent of tokens per second on a 27B model under llama.cpp Vulkan. The fix is one line, the cost is no extra watts that the GPU is already not drawing, and the conversation has been pointed at the wrong subsystem.
The first serious zinc trace on AMD RDNA4 looked almost good, which was exactly the problem: it was English-shaped nonsense from a model whose LM head computed only three percent of the vocabulary rows. Six weeks later, on the same Radeon AI PRO R9700, the latest published suite shows zinc decoding Qwen3.6-35B-A3B UD Q4_K_XL at 117.07 tok/s against llama.cpp's 104.47. Prefill is not won yet: zinc is at 88.08 tok/s against llama.cpp's 181.95. The interesting part is not one magic shader. It is six weeks of correctness fixes, deleted dead ends, and Vulkan work that turned a 32 GB AMD card from a curiosity into a serious local Qwen 3.6 decode target.
FP4 weight formats landed across the GGUF ecosystem in the last two weeks: NVFP4 in mainline llama.cpp, MXFP4 in ik_llama.cpp, with Blackwell-native FP4 tensor cores flipped on in build b8967. The bandwidth math is right on every card. The compute math only works where the tensor cores speak FP4. The RDNA4 ISA on the Radeon AI PRO R9700 lists v_wmma_f32_16x16x16 in fp16, bf16, fp8_e4m3, int8, and iu4 forms but no fp4 form, so an NVFP4 or MXFP4 weight on gfx1201 dequantizes to FP16 before the matmul and lands at the 191 TFLOPS dense FP16 ceiling. The format that earns its place on this card is FP8 E4M3FN, where 383 TFLOPS dense WMMA already exists in hardware and the only thing standing in the way is a small AITER patch and a handful of tuned kernel configs that route gfx1201 through the MI350X Triton path. Skip the FP4 wave on RDNA4. Ship FP8 weights with the patch.
The April 28 argument that draft-model speculative decoding does not net out on Qwen 35B-A3B was the right read at the time. Two pieces moved in the next ten days. PR 22400 made gated DeltaNet rollback partial instead of full, eliminating the SSM rewind tax. PR 22673 wired MTP heads into llama.cpp as a built-in draft, and the measured speedup on Qwen3.6 27B is 2.5x at γ=3 with a 0.72 acceptance rate. The cost ratio that ruined the 0.8B draft drops by an order of magnitude when the draft is one transformer layer attached to the verifier's last hidden state, and on a 32 GB RDNA4 card the only thing standing between local Qwen3 and that 2.5x is a Vulkan kernel that has not landed yet.
Qwen3-Next replaces three out of every four attention blocks with a gated DeltaNet module that carries a fixed-size recurrent state instead of a per-token KV cache. The local engines that just shipped prefix caching for plain transformers cannot reuse anything across turns on this architecture, because the recurrent state at any position is a function of every prior token, not of the matching prefix. The fix is not a flag. It is a state-checkpoint plane that mirrors the radix-tree KV plane at fixed token boundaries, and on a 32 GB RDNA4 card it costs about 580 MB of VRAM per active session at 64k context in exchange for second-turn prefill that drops from 27 seconds back to 0.4.
A 4,800-token Qwen3 system prompt with tool definitions and a few-shot example pack re-prefills on every turn of a chat by default. On a Radeon AI PRO R9700 that is roughly 27 seconds of compute that produces no new tokens. By turn 20 of a session a clean prefix-aware KV cache has saved more wall time than every kernel optimization shipped in zinc since March. The implementation cost is a tokenized-prefix radix tree plus a checkpoint-aware KV slab. Of every lever still left on local Qwen3 chat, this is the one with the largest ratio of payoff to engineering work, and it holds second-turn prefill under half a second on the same card.
Classical repetition penalty punishes whichever individual tokens have already shown up. Looping is not a token-level event. It is a sequence-level event, and on a 64k-context Qwen3 chat the gap between those two definitions is the difference between a coherent reply and the same paragraph rewritten seventeen times. DRY closes that gap with an exponentially scaled penalty on whatever token would extend a verbatim n-gram from earlier in the prompt, and on a 32 GB RDNA4 decode it costs effectively nothing to run before min-p.
Top-p was the default truncation sampler for six years, and on a 151,936-token vocabulary at temperature 1.4 it still occasionally hands the sampler a noise token while three good continuations sit in the same nucleus. Min-p ties the threshold to the model's own top-token probability, which behaves better at high temperature on the kinds of distributions Qwen3 actually produces. The 2025 Stanford reanalysis weakened the original paper's quality claims, but on a local 32 GB RDNA4 decode the practical answer is unchanged: ship min-p as a first-class sampler, run temperature last, and stop pretending the nucleus is doing the work.
Grouped-query attention shrank the KV cache enough to make 128k local context fit on a 32 GB RDNA4 card. It does not shrink it enough to leave headroom. The next reduction is structural, not numerical: low-rank latent attention from the DeepSeek-V2 lineage halves the cache again on a Qwen-class model and turns the bandwidth wall on long-context decode into a different shape entirely.
Constrained JSON decoding used to add a millisecond-class tax to every step of a local Qwen3 generation, because the engine had to scan a 151,936-bit logit mask against a grammar before each sample. XGrammar and llguidance moved that work off the hot path with adaptive caches and a Rust Earley parser that runs in tens of microseconds. On a bandwidth-bound RDNA4 decode loop the change turns a noisy ten percent overhead into something a profiler will not flag, but only if the engine actually adopts the new path.
Every KV eviction scheme for local long-context inference has the same blind spot. Drop the wrong tokens and the model produces garbage; drop the right ones and it stays coherent across a million-token stream. The pivot is not most-recent or heaviest-hitting. It is the first handful of token positions, which the softmax forces every attention head to overweight regardless of content. On a 32 GB RDNA4 card the prefill ceiling and the decode bandwidth wall both end at the same set of four KV slots that cannot be touched.
Prefill on the Radeon AI PRO R9700 has two ceilings, not one. The matrix cores fix the GEMM ceiling. They do nothing for the attention ceiling. On Qwen3-30B-A3B the attention FLOP count crosses the active-weight GEMM FLOP count near 16k tokens, and past the crossover the same prefill spends more time in QK^T and PV than in every linear layer combined. Flash attention is what makes that ceiling reachable in the first place. The next lever past it is algorithmic, not numeric.
AMD specs the Radeon AI PRO R9700 at 389 TFLOPS of sparse FP16 and 1557 TOPS of sparse INT4. A local 35B-A3B decode on the same card runs at roughly 50 tokens per second and uses almost none of that compute. The arithmetic intensity of a single decoded token is around two operations per byte. The ridge point of the roofline on this card is north of six hundred. The matrix cores are not the lever for solo decode. They are the lever for prefill, fine-tuning, and any workload that touches the same weight twice.
A vocab-matched 0.8B draft model on Qwen 3.6 35B-A3B with llama.cpp's speculative decoding path failed to beat the baseline across nineteen public benchmark configurations on a single RTX 3090. The reason is not tuning. The same hybrid MoE-plus-SSM structure that put yesterday's KV crossover at 16k tokens also breaks the cost ratio that classical speculative decoding depends on. With only three billion active parameters in the verifier, a 0.8B draft is not small enough to be free, and the gated delta-net hidden state adds a rewind tax on every rejected token.
On a Radeon AI PRO R9700, decoding one token of Qwen 3.5 35B-A3B reads about 2 GB of active weights from VRAM. At 16k context the KV cache hits the same 2 GB, and past that point the KV cache is the larger tenant on every decode step. The bandwidth ceiling falls from roughly 213 tok/s at 8k to 37 tok/s at 128k, and the part of the picture that moves is not the weights.
ZINC's per-token prefill on Qwen 3.5/3.6 35B-A3B runs at 90.24 tok/s on a Radeon AI PRO R9700. llama.cpp on the same card hits 180 tok/s on the same prompt and weights. The remaining 2x gap is not a kernel-by-kernel gap. It is a single early return in canUseBatchedPrefillRdna that locks every Mixture-of-Experts plus state-space hybrid model onto a per-token decode loop, dispatching 45,000 workgroups per prompt where llama.cpp dispatches 288. Here is what 50 autonomous-loop cycles found, what is still left, and which two changes close most of the remaining ground.
On a 32 GB Radeon AI PRO R9700 the model weights are not the binding constraint at long context. The KV cache is. At 128k tokens on Qwen3.5-35B-A3B a default FP16 KV cache is 15.4 GiB, which together with the Q4_K_M weights walks straight off the 32 GB ceiling. Quantizing the KV cache is not a polish step. It is the only way the long-context local prompt fits at all.
ZINC currently records one Vulkan command buffer per transformer layer and calls vkQueueSubmit once for each. On a 60-layer Gemma 4 prefill that is 60 round trips through the amdgpu ioctl path, and the gap between submits is wide enough to land on every flame graph the engine produces. Collapsing those 60 submits into one is a small change at the recording layer with a measurable upside, and it is the right shape of fix to land before AMD's user-mode queues make the per-submit cost smaller but not zero.
Qwen3-8B prefill on ZINC's RDNA4 batched path runs at 187 tok/s. The same engine on Gemma 4 31B runs at 4.97 tok/s, because Gemma's batched-prefill gate is closed. Five of the six things blocking it open with small, mechanical fixes. The sixth is a single head_dim push constant on flash_attn_batched that quietly assumes Q and KV use the same dimension. Gemma 4's full-attention layers do not. Here is why one push constant became the bottleneck on a 31B model and what the right shape of the fix looks like.
Yesterday's post left ZINC's RDNA4 DMMV with a compromise: one pipeline, MAX_COLS=32 shared between decode and prefill, because raising it to 64 regressed decode from 78 tok/s to 59 tok/s on Qwen3-8B. Vulkan specialization constants are the quiet fix. One SPIR-V binary, one `VkSpecializationInfo` per call site, five pipelines compiled with five different register budgets. Here is how that works under the hood, why it only costs a quarter second of extra cold build time, and why specialization constants are a more general answer to the next three kernels in the port.
ZINC's per-token RDNA4 prefill on Qwen3-8B runs at 59 tok/s while llama.cpp's Vulkan backend hits 662 tok/s on the same card. The first instinct is to ship a tiled matmul kernel. The real first move is smaller: a column-batched DMMV that reads each weight once and multiplies it by up to 32 prompt tokens at a time. Here is why that shape, not a GEMM, is where an RDNA4 prefill port should start.
Yesterday's post identified a second tax on Metal LLM cold starts: pipeline-state compilation. llama.cpp's current Metal library cache kills that cost across repeated contexts in one process, but does nothing for a fresh CLI invocation. Apple's `MTLBinaryArchive` API is the only mechanism that makes the compiled pipelines survive a process exit. Here is why ZINC needs to ship one, what it actually costs, and where MoE quietly breaks the approach.
Metal prefill in ZINC went from 7.9 tok/s to 298 tok/s on Qwen3 8B with a single gated code path. The batched path also revealed that ZINC's per-token DMMV prefill had been producing subtly different output than llama.cpp all along. Both answers are numerically valid, and matching llama.cpp's answer turned out to require changing the arithmetic, not fixing a bug.
On Apple Silicon the first prompt after a fresh local LLM binary is slow for two independent reasons. One is mmap page faulting. The other is Metal compiling a compute pipeline state object for every unique shader the model will dispatch. llama.cpp pays this cost once at server warmup. ZINC currently pays it on the first user prompt, every launch.
Our own benchmark harness made ZINC look slower than it is. The April 18 Metal suite reported 1.0 tok/s prefill on Qwen3.5-35B-A3B. The April 15 run reported 2.1. Same engine, same model, same machine. The difference was measurement, not regression. Here is why cold-process CLI benchmarks stop working once the model mmaps 21 GiB of weights.
Weight quantization is everywhere in local LLM inference. Activation quantization is not, and that is the reason RDNA4 prefill on Qwen3.5-35B is still reading FP32 through a dequantize-then-multiply pipeline. Here is why Q8_1 and mul_mmq are the biggest unshipped prefill win on consumer AMD, how llama.cpp routes around the problem ZINC still has, and what the numbers say the upside looks like.
Prefill on Qwen3.5-35B on AMD RDNA4 in ZINC is stuck at 25.67 tok/s while decode runs at 73. After 24 optimization cycles, only three moved the number. Here is what the measurements said about why, and what they say about how local inference engines should actually be tuned on consumer AMD.
Qwen3.6-35B-A3B-UD-Q4_K_XL is now a supported ZINC managed model for AMD RDNA4 Vulkan and Apple Silicon Metal local inference, with one-command pull and chat-ready defaults.
OpenAI's GPT-OSS 20B can run locally on Apple Silicon Macs in ZINC, but getting there was not just a loader change. The model forced our Metal backend to learn new quantization formats, new attention semantics, a new chat template, and a different MoE activation path.
Qwen3.6-Plus is exactly the kind of model that matters for local LLM inference: hybrid attention, sparse MoE routing, agentic coding, and a 1M context window. This deep technical post breaks down Qwen 3.6 architecture signals, how they compare with Qwen3 and Qwen3-Next, and what ZINC would need to run a local Qwen 3.6 release on Vulkan and Metal.
Mixture of Experts (MoE) models only run a small set of experts per token, but the routing details decide whether inference is elegant or slow. This post explains how MoE models work, which Qwen and Gemma MoE models ZINC supports today, and how ZINC executes router top-k, batched expert kernels, and shared-expert paths on GPU.
ZINC is a from-scratch LLM inference engine in Zig targeting Vulkan and Metal. This post walks through every major design decision — from 'why not fork llama.cpp' to static compute graphs, hand-tuned shaders, paged KV cache, and a zero-dependency architecture — and explains what we learned, what surprised us, and what we would do again.
Zig is the best systems programming language we have found for GPU work. We used it to build a dual-backend inference engine targeting both Vulkan and Metal from a single codebase. Comptime eliminated runtime dispatch, explicit allocators made memory predictable, and the build system compiled shaders and linked frameworks without a single external tool. Here is why Zig beats C++ for cross-platform GPU programming.
ZINC now runs natively on Apple Silicon through a Metal backend built from scratch. Not a Vulkan translation layer. Not MLX. Hand-tuned MSL shaders, zero-copy model loading, and the same OpenAI-compatible API. This is the story of how we got there.
ZINC used to look stuck at about 7 tok/s on AMD RDNA4. The clean ReleaseFast baseline now measures 33.58 tok/s on Qwen3.5-35B-A3B-UD Q4_K_XL on a Radeon AI PRO R9700. This is what changed, which old numbers were misleading, and what still separates us from llama.cpp.
ZINC could already run Qwen3.5-35B-A3B on AMD RDNA4, but local LLM inference was stuck at 4 tok/s because the Vulkan compute shaders behind SSM and MoE routing were still wrong. This is how we debugged RADV crashes, recurrent state drift, and CPU-GPU round trips to move decode back onto the GPU.
Why Karpathy's autoresearch is viral right now, how a self-improving AI loop actually works, and how ZINC uses a remote GPU verification loop to compress brutal debugging cycles into repeatable overnight search.
How I built the home RDNA4 node behind ZINC — from a Ryzen 9800X3D trading workstation to an AMD Radeon AI PRO R9700 inference rig. Parts list, Noctua fan swap, assembly photos, and why a $1300 GPU turned an overclocking platform into a local AI server.
ZINC is a purpose-built LLM inference engine for AMD RDNA3 and RDNA4 consumer GPUs — RX 9070 XT, RX 7900 XTX, Radeon AI PRO R9700. Hand-tuned Vulkan compute shaders, continuous batching, paged KV cache, TurboQuant compression. Built with Zig. OpenAI-compatible API. No ROCm required.