I'm building ZINC, a local LLM inference engine for AMD GPUs and Apple Silicon. This site is where I write about the work — GPU kernels, Vulkan and Metal compute, quantization, and the systems engineering underneath.
A mixture-of-experts model was decoding at 9 tok/s on an RTX 4090, slower than a dense 27B. This is how ZINC's CUDA backend diagnosed boost-clock starvation, batched its MoE experts into two GPU-side kernels (turning the catalog's slowest models into front-runners that outrun the big dense ones), fused Gemma's attention dispatches, and kept every change token-for-token correct against llama.cpp.
ZINC is a Vulkan and Metal inference engine. On the WSL2 NVIDIA box the only Vulkan device is llvmpipe on the CPU, so ZINC could not touch the RTX 4090 or 5090 at all. This is how we wrote a CUDA backend by mirroring the Metal shim, found a per-head attention-gate bug by diffing residuals against llama.cpp, discovered that per-token sync gaps were starving the GPU boost clock, and shipped an async stream/event ring that pushed Qwen3.5-9B decode past llama.cpp.
ZINC is getting a fourth GPU backend: native CUDA for NVIDIA. The surprise that forced it: on Windows + WSL2, NVIDIA exposes only CUDA, not Vulkan, so the one Vulkan device ZINC can see is a CPU. The reassurance that makes it tractable: ZINC's matmuls are int8 dot products, not tensor-core GEMMs, so they map 1:1 onto __dp4a. This is the plan, what we've already validated on an RTX 5090, and the road to a first token.
Yesterday's RDNA4 post showed ZINC moving dense Qwen3-8B prefill from 42.9 to 207.9 tok/s. The tempting question is why Qwen3.6 35B-A3B cannot simply reuse that path. The answer is the hybrid wall: dense batching solved repeated weight reads for a transformer, while Qwen3.6 35B-A3B needs batched MoE routing and block-resident SSM state before its prefill gate can come down.
The old RDNA4 batched-prefill docs note was really a blog post hiding in the docs tree. This is the cleaned-up version: how ZINC discovered that ZINC_BATCHED_PREFILL was a no-op, used validate mode to prove the forward pass was correct, fixed a stale GPU argmax sampler bug, replaced serial-over-K DMMV with K-parallel Q4_K/Q6_K batched shaders, and moved Qwen3-8B prompt ingestion on the R9700 from 42.9 to 207.9 tok/s.