Stepan Zolotukhin

I'm building ZINC, a local LLM inference engine for AMD GPUs and Apple Silicon. This site is where I write about the work — GPU kernels, Vulkan and Metal compute, quantization, and the systems engineering underneath.

ZINC

Local LLM inference on AMD GPUs and Apple Silicon — hand-tuned Vulkan and Metal shaders, no ROCm, no MLX, one binary. 37.95 tok/s on RDNA4 with a 35B model, native Apple Silicon support via Metal.

Learn more →

Recent posts

How we made CUDA LLM decode up to 5× faster: batched MoE experts and kernel fusion in ZINC

A mixture-of-experts model was decoding at 9 tok/s on an RTX 4090, slower than a dense 27B. This is how ZINC's CUDA backend diagnosed boost-clock starvation, batched its MoE experts into two GPU-side kernels (turning the catalog's slowest models into front-runners that outrun the big dense ones), fused Gemma's attention dispatches, and kept every change token-for-token correct against llama.cpp.

zinc cuda nvidia rtx-4090 rtx-5090 blackwell llm-inference gpu-optimization kernel-fusion mixture-of-experts moe decode-throughput qwen3 gemma llama-cpp cuda-kernels local-llm

How ZINC got a CUDA backend and beat llama.cpp decode on NVIDIA

ZINC is a Vulkan and Metal inference engine. On the WSL2 NVIDIA box the only Vulkan device is llvmpipe on the CPU, so ZINC could not touch the RTX 4090 or 5090 at all. This is how we wrote a CUDA backend by mirroring the Metal shim, found a per-head attention-gate bug by diffing residuals against llama.cpp, discovered that per-token sync gaps were starving the GPU boost clock, and shipped an async stream/event ring that pushed Qwen3.5-9B decode past llama.cpp.

zinc nvidia cuda rtx-5090 rtx-4090 blackwell wsl2 local-llm llm-inference gpu-kernels dmmv nvrtc cuda-streams qwen3 gemma

Bringing ZINC to NVIDIA: a CUDA backend, because WSL2 only speaks CUDA

ZINC is getting a fourth GPU backend: native CUDA for NVIDIA. The surprise that forced it: on Windows + WSL2, NVIDIA exposes only CUDA, not Vulkan, so the one Vulkan device ZINC can see is a CPU. The reassurance that makes it tractable: ZINC's matmuls are int8 dot products, not tensor-core GEMMs, so they map 1:1 onto __dp4a. This is the plan, what we've already validated on an RTX 5090, and the road to a first token.

zinc nvidia cuda rtx-5090 blackwell zig llm-inference gpu-kernels qwen3-6 wsl2

Why Qwen 35B cannot use ZINC's 208 tok/s batched prefill path yet

Yesterday's RDNA4 post showed ZINC moving dense Qwen3-8B prefill from 42.9 to 207.9 tok/s. The tempting question is why Qwen3.6 35B-A3B cannot simply reuse that path. The answer is the hybrid wall: dense batching solved repeated weight reads for a transformer, while Qwen 35B needs batched MoE routing and block-resident SSM state before its prefill gate can come down.

zinc rdna4 amd vulkan prefill qwen3-6 qwen35 moe ssm local-llm llm-inference gpu-kernels

How ZINC's RDNA4 batched prefill went from 42 to 208 tok/s

The old RDNA4 batched-prefill docs note was really a blog post hiding in the docs tree. This is the cleaned-up version: how ZINC discovered that ZINC_BATCHED_PREFILL was a no-op, used validate mode to prove the forward pass was correct, fixed a stale GPU argmax sampler bug, replaced serial-over-K DMMV with K-parallel Q4_K/Q6_K batched shaders, and moved Qwen3-8B prompt ingestion on the R9700 from 42.9 to 207.9 tok/s.

zinc rdna4 amd vulkan prefill qwen3-8b gemma local-llm llm-inference gpu-kernels dmmv q4-k q6-k

All posts →