ZINC

Benchmarks

ZINC vs llama.cpp on the same machines.

Targets 4
Unique Models 5
Comparisons 19
Updated Jun 12

AMD RDNA bench node

RDNA

Radeon AI PRO R9700 · 32 GB VRAM · 576 GB/s · 3 measured · 1 warmup

Run info
GPU
Radeon AI PRO R9700
Chip
AMD Radeon AI PRO R9700
Memory
32 GB VRAM · 576 GB/s
Artifact
Jun 6, 2026, 7:33 PM
ZINC
e1551bff0d33
llama.cpp
513
Fastest 109.3 tok/s Qwen 3.6 35B A3B UD Q4K XL
Decode 69.5 tok/s average
Overall 108% prefill + decode vs llama.cpp

Quick Chat

5 rows 133.5 tok/s prefill 69.5 tok/s decode 82% overall
Decode 69.5 tok/s 96% decode ratio
Prefill 133.5 tok/s 35% prefill ratio · 41 prompt tok
Fastest Qwen 3.6 35B A3B UD Q4K XL 109.3 tok/s
Overall 82% best row: Qwen 3.6 35B A3B UD Q4K XL

Models

ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 49p · 96out Compared
Prefill 18%
ZINC 89.1 tok/s
llama 502.2 tok/s
Decode 89%
ZINC 91.0 tok/s
llama 102.3 tok/s
65% overall
Gemma 4 31B Q4K M 49p · 96out Compared
Prefill 21%
ZINC 41.8 tok/s
llama 202.6 tok/s
Decode 86%
ZINC 24.6 tok/s
llama 28.6 tok/s
71% overall
Qwen 3.5 9B Q4K M 36p · 96out Compared
Prefill 32%
ZINC 176.0 tok/s
llama 554.2 tok/s
Decode 109%
ZINC 93.0 tok/s
llama 85.6 tok/s
96% overall
Qwen 3.6 27B Dense Q4K M 36p · 96out Compared
Prefill 26%
ZINC 49.1 tok/s
llama 185.9 tok/s
Decode 95%
ZINC 29.4 tok/s
llama 30.8 tok/s
83% overall
Qwen 3.6 35B A3B UD Q4K XL 36p · 96out Compared
Prefill 77%
ZINC 311.3 tok/s
llama 404.7 tok/s
Decode 100%
ZINC 109.3 tok/s
llama 109.0 tok/s
98% overall
Details
Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output
Compared
Prefill 18%
ZINC 89.1 tok/s
llama 502.2 tok/s
Decode 89%
ZINC 91.0 tok/s
llama 102.3 tok/s
Overall 65%
ZINC 90.4 tok/s
llama 139.9 tok/s
Prompt+decode. Cold latency: ZINC 9394 ms · llama.cpp 1165 ms
Gemma 4 31B Q4K M 49 tok prompt · 96 tok output
Compared
Prefill 21%
ZINC 41.8 tok/s
llama 202.6 tok/s
Decode 86%
ZINC 24.6 tok/s
llama 28.6 tok/s
Overall 71%
ZINC 28.6 tok/s
llama 40.3 tok/s
Prompt+decode. Cold latency: ZINC 10859 ms · llama.cpp 3728 ms
Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output
Compared
Prefill 32%
ZINC 176.0 tok/s
llama 554.2 tok/s
Decode 109%
ZINC 93.0 tok/s
llama 85.6 tok/s
Overall 96%
ZINC 106.7 tok/s
llama 111.2 tok/s
Prompt+decode. Cold latency: ZINC 3862 ms · llama.cpp 1302 ms
Qwen 3.6 27B Dense Q4K M 36 tok prompt · 96 tok output
Compared
Prefill 26%
ZINC 49.1 tok/s
llama 185.9 tok/s
Decode 95%
ZINC 29.4 tok/s
llama 30.8 tok/s
Overall 83%
ZINC 33.0 tok/s
llama 39.8 tok/s
Prompt+decode. Cold latency: ZINC 9739 ms · llama.cpp 3454 ms
Qwen 3.6 35B A3B UD Q4K XL 36 tok prompt · 96 tok output
Compared
Prefill 77%
ZINC 311.3 tok/s
llama 404.7 tok/s
Decode 100%
ZINC 109.3 tok/s
llama 109.0 tok/s
Overall 98%
ZINC 132.8 tok/s
llama 136.1 tok/s
Prompt+decode. Cold latency: ZINC 8129 ms · llama.cpp 1106 ms
Raw data
Model Prefill Decode Overall
Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared
Z
89.1 tok/s 550.0 ms
L
502.2 tok/s 101.6 ms
Z
91.0 tok/s 11.0 ms/tok
L
102.3 tok/s 9.8 ms/tok
Z
90.4 tok/s prompt+decode
L
139.9 tok/s prompt+decode
Gemma 4 31B Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared
Z
41.8 tok/s 1171.9 ms
L
202.6 tok/s 251.7 ms
Z
24.6 tok/s 40.6 ms/tok
L
28.6 tok/s 34.9 ms/tok
Z
28.6 tok/s prompt+decode
L
40.3 tok/s prompt+decode
Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared
Z
176.0 tok/s 204.6 ms
L
554.2 tok/s 63.2 ms
Z
93.0 tok/s 10.8 ms/tok
L
85.6 tok/s 11.7 ms/tok
Z
106.7 tok/s prompt+decode
L
111.2 tok/s prompt+decode
Qwen 3.6 27B Dense Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared
Z
49.1 tok/s 733.5 ms
L
185.9 tok/s 188.3 ms
Z
29.4 tok/s 34.1 ms/tok
L
30.8 tok/s 32.5 ms/tok
Z
33.0 tok/s prompt+decode
L
39.8 tok/s prompt+decode
Qwen 3.6 35B A3B UD Q4K XL 36 tok prompt · 96 tok output · 3 runs · Compared
Z
311.3 tok/s 115.6 ms
L
404.7 tok/s 86.5 ms
Z
109.3 tok/s 9.1 ms/tok
L
109.0 tok/s 9.2 ms/tok
Z
132.8 tok/s prompt+decode
L
136.1 tok/s prompt+decode
Methodology ZINC CLI vs llama.cpp on the same RDNA node · Four-scenario matrix with same-file baselines
3 measured · 1 warmup
  • Hardware: one AMD Radeon AI PRO R9700 benchmark node with 32 GB VRAM and 576 GB/s memory bandwidth. ZINC and llama.cpp run on the same Ubuntu host and use the same GGUF file for each model.
  • ZINC path: sync the current source tree to the RDNA node, build with zig build -Doptimize=ReleaseFast, then measure generation through the ZINC CLI with RADV cooperative matrix support enabled.
  • ZINC RDNA backend: vulkan. Published RDNA runs default to Vulkan; zinc_rt is opt-in because it is a separate bring-up runtime.
  • Baseline path: launch llama.cpp against the same model file, preferring one reusable llama-server per model across the full scenario matrix and falling back to llama-cli when the server path is unavailable.
  • Scenarios: Quick Chat, Coding Review, Incident Context, and Long Coding Draft. The prompts are real-world chat, code-review, support-context, and coding-plan workloads instead of synthetic factual completions.
  • Statistics: one warmup pass is discarded, then three measured runs are collected. Published prefill, decode, end-to-end throughput, and latency values are medians.
  • Execution order: the harness records the full ZINC scenario matrix before starting the llama.cpp baseline phase, so only one inference engine owns the GPU during measurement.
  • Run hygiene: published RDNA runs assume a clean node with no competing ZINC, llama.cpp, or other GPU workloads.