ZINC — LLM Inference Engine for AMD GPUs & Apple Silicon
Fast local LLM inference on AMD Radeon GPUs (RDNA4 & RDNA3) and Apple Silicon. Hand-tuned Vulkan and Metal shaders — no ROCm, no MLX.
Runs on: Linux (AMD Radeon, Intel Arc Xe2 — experimental) or macOS (Apple Silicon M1–M5). Windows is not supported.
A 35B-parameter Qwen3.6 model running locally on a single AMD Radeon GPU — Zig + Vulkan, no ROCm, no MLX
Consumer GPUs have the hardware for fast LLM inference, but the software stack doesn't use it well. AMD's RDNA cards — the Radeon RX 9070 XT, RX 7900 XTX, and Radeon AI PRO R9700 — are ignored by ROCm and get only generic Vulkan in llama.cpp. Apple Silicon runs MLX and llama.cpp Metal, but no engine is built from scratch around Metal's strengths. ZINC targets both platforms with hand-tuned shaders, so you can run LLM inference on an AMD GPU without ROCm.
ZINC is not finished productized infrastructure yet. It is an experimental engine under active development, and the best current experience is still Linux-first, AMD-first, and CLI-first. The point of the docs is to make that path explicit instead of making you reverse-engineer it from technical specs.
ZINC takes a different approach. It is an inference engine written in Zig with hand-tuned Vulkan compute shaders, built specifically for the GPU architecture these cards actually have.
Why not vLLM or llama.cpp?
| ZINC | vLLM | llama.cpp (Vulkan) | |
|---|---|---|---|
| RDNA3/RDNA4 support | First-class target | Not supported (needs ROCm) | Works, no RDNA4 tuning |
| Continuous batching | Roadmap | Built-in | Server-bolted |
| Paged KV cache | vLLM-style PagedAttention | Yes | No (contiguous) |
| KV cache compression | TurboQuant roadmap / experimental path | No | No |
| GPU kernel tuning | RDNA4 wave64, coop matrix | ROCm/CUDA only | Generic Vulkan |
| OpenAI API | Native | Native | Via server wrapper |
Performance
Every number below is a measured median from the public perf suite, with ZINC and llama.cpp built and run back-to-back on the same machine (zig build -Doptimize=ReleaseFast) — not a projected target. Full breakdown with prefill latency, scenarios, and run provenance lives on the benchmarks dashboard.
AMD RDNA4 — Radeon AI PRO R9700 32 GB · 576 GB/s · 2026-06-07
Prefill 311 tok/s ZINC · 405 tok/s llama.cpp
Prefill 176 tok/s ZINC · 554 tok/s llama.cpp
Prefill 49 tok/s ZINC · 186 tok/s llama.cpp
Prefill 89 tok/s ZINC · 502 tok/s llama.cpp
Prefill 42 tok/s ZINC · 203 tok/s llama.cpp
Median across the RDNA suite with RADV_PERFTEST=coop_matrix. ZINC leads on the flagship Qwen3.6-35B-A3B MoE and Qwen 3.5 9B decode; prefill remains the active optimization target across the dense and hybrid rows.
Apple Silicon — Mac Studio (M4 Max) 40-core GPU · 64 GB unified · 2026-06-11
Prefill 97 tok/s ZINC · 96 tok/s llama.cpp
Prefill 350 tok/s ZINC · n/a llama.cpp
Prefill 131 tok/s ZINC · 32 tok/s llama.cpp
Prefill n/a ZINC · 29 tok/s llama.cpp
Exact machine above, native Metal backend (2026-06-11 refresh) — not a generic "Apple Silicon" average. The Metal path is younger than the RDNA4 one: decode still trails llama.cpp on most models, but ZINC prefill already leads on Qwen3-8B and Gemma 4 31B. Closing the A3B decode gap is the active Metal target.
What ZINC does
Two GPU backends, one engine. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup ops, zero-copy mmap, and Metal pipeline tuning. ZINC picks the right backend at build time.
The API is OpenAI-compatible: POST /v1/chat/completions with SSE streaming, plus a built-in chat UI with thinking mode support. Point your existing client at ZINC and it works. API reference.
Models load from GGUF files via direct memory-map — DMA to GPU VRAM on Vulkan, zero-copy newBufferWithBytesNoCopy on Metal. Supports Q4_K, Q5_K, Q6_K, Q8_0, and F16 quantization. Architectures: Qwen 3 / 3.5 / 3.6 (dense and A3B MoE), Gemma 4, and SSM-hybrid paths.
The whole thing is written in Zig (with a thin Objective-C shim for Metal). No hidden allocations, direct GPU API calls, comptime dispatch tables, single binary.
Architecture
Supported hardware: AMD Radeon GPUs & Apple Silicon
ZINC targets AMD Radeon GPUs via Vulkan compute and Apple Silicon via Metal. No ROCm, no MLX, no heavyweight framework.
AMD Radeon GPUs — RDNA4 & RDNA3 (Linux)
- RDNA4: RX 9070, RX 9070 XT, Radeon AI PRO R9700 (primary tuning target)
- RDNA3: RX 7900 XTX, RX 7900 XT, RX 7800 XT, RX 7700 XT, RX 7600
Any AMD GPU with Vulkan 1.3 and RADV or AMDVLK drivers should work.
Apple Silicon (macOS)
- M1 through M5: all variants (base, Pro, Max, Ultra)
- Current public reference box: Mac Studio (M4 Max, 40-core GPU, 64 GB unified memory)
- Native Metal shaders, zero-copy model loading, unified memory
- 16 GB for 2B models, 24+ GB for 35B models
See the full hardware requirements page for details.