ZINC — LLM Inference Engine for AMD GPUs & Apple Silicon

Fast local LLM inference on AMD Radeon GPUs (RDNA4 & RDNA3) and Apple Silicon. Hand-tuned Vulkan and Metal shaders — no ROCm, no MLX.

Runs on: Linux (AMD Radeon, Intel Arc Xe2 — experimental) or macOS (Apple Silicon M1–M5). Windows is not supported.

ZINC Chat — streaming LLM inference on an AMD Radeon RDNA4 GPU via Vulkan

A 35B-parameter Qwen3.6 model running locally on a single AMD Radeon GPU — Zig + Vulkan, no ROCm, no MLX

Consumer GPUs have the hardware for fast LLM inference, but the software stack doesn't use it well. AMD's RDNA cards — the Radeon RX 9070 XT, RX 7900 XTX, and Radeon AI PRO R9700 — are ignored by ROCm and get only generic Vulkan in llama.cpp. Apple Silicon runs MLX and llama.cpp Metal, but no engine is built from scratch around Metal's strengths. ZINC targets both platforms with hand-tuned shaders, so you can run LLM inference on an AMD GPU without ROCm.

ZINC is not finished productized infrastructure yet. It is an experimental engine under active development, and the best current experience is still Linux-first, AMD-first, and CLI-first. The point of the docs is to make that path explicit instead of making you reverse-engineer it from technical specs.

ZINC takes a different approach. It is an inference engine written in Zig with hand-tuned Vulkan compute shaders, built specifically for the GPU architecture these cards actually have.

Why not vLLM or llama.cpp?

ZINCvLLMllama.cpp (Vulkan)
RDNA3/RDNA4 supportFirst-class targetNot supported (needs ROCm)Works, no RDNA4 tuning
Continuous batchingRoadmapBuilt-inServer-bolted
Paged KV cachevLLM-style PagedAttentionYesNo (contiguous)
KV cache compressionTurboQuant roadmap / experimental pathNoNo
GPU kernel tuningRDNA4 wave64, coop matrixROCm/CUDA onlyGeneric Vulkan
OpenAI APINativeNativeVia server wrapper

Performance

Every number below is a measured median from the public perf suite, with ZINC and llama.cpp built and run back-to-back on the same machine (zig build -Doptimize=ReleaseFast) — not a projected target. Full breakdown with prefill latency, scenarios, and run provenance lives on the benchmarks dashboard.

109.3 tok/s decode Qwen3.6-35B-A3B · RDNA4
1.00x vs llama.cpp same R9700, decode
2 GPU backends AMD Vulkan + Apple Metal
5 models benchmarked Qwen 3 / 3.5 / 3.6 + Gemma 4

AMD RDNA4 — Radeon AI PRO R9700 32 GB · 576 GB/s · 2026-06-07

Median across the RDNA suite with RADV_PERFTEST=coop_matrix. ZINC leads on the flagship Qwen3.6-35B-A3B MoE and Qwen 3.5 9B decode; prefill remains the active optimization target across the dense and hybrid rows.

Apple Silicon — Mac Studio (M4 Max) 40-core GPU · 64 GB unified · 2026-06-11

Exact machine above, native Metal backend (2026-06-11 refresh) — not a generic "Apple Silicon" average. The Metal path is younger than the RDNA4 one: decode still trails llama.cpp on most models, but ZINC prefill already leads on Qwen3-8B and Gemma 4 31B. Closing the A3B decode gap is the active Metal target.

What ZINC does

Two GPU backends, one engine. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup ops, zero-copy mmap, and Metal pipeline tuning. ZINC picks the right backend at build time.

The API is OpenAI-compatible: POST /v1/chat/completions with SSE streaming, plus a built-in chat UI with thinking mode support. Point your existing client at ZINC and it works. API reference.

Models load from GGUF files via direct memory-map — DMA to GPU VRAM on Vulkan, zero-copy newBufferWithBytesNoCopy on Metal. Supports Q4_K, Q5_K, Q6_K, Q8_0, and F16 quantization. Architectures: Qwen 3 / 3.5 / 3.6 (dense and A3B MoE), Gemma 4, and SSM-hybrid paths.

The whole thing is written in Zig (with a thin Objective-C shim for Metal). No hidden allocations, direct GPU API calls, comptime dispatch tables, single binary.

Architecture

ZINC LLM inference engine architecture — Vulkan compute pipeline for AMD Radeon RDNA GPUs and Metal for Apple Silicon

Supported hardware: AMD Radeon GPUs & Apple Silicon

ZINC targets AMD Radeon GPUs via Vulkan compute and Apple Silicon via Metal. No ROCm, no MLX, no heavyweight framework.

AMD Radeon GPUs — RDNA4 & RDNA3 (Linux)

Any AMD GPU with Vulkan 1.3 and RADV or AMDVLK drivers should work.

Apple Silicon (macOS)

See the full hardware requirements page for details.


Documentation · TurboQuant spec · GitHub