ZINC
Benchmarks
ZINC vs llama.cpp on the same machines.
AMD RDNA bench node
RDNA
Radeon AI PRO R9700 · 32 GB VRAM · 576 GB/s · 3 measured · 1 warmup
Run info
- GPU: Radeon AI PRO R9700
- Chip: AMD Radeon AI PRO R9700
- Memory: 32 GB VRAM · 576 GB/s
- Artifact: Jun 6, 2026, 7:33 PM
- ZINC: e1551bff0d33
- llama.cpp: 513
Quick Chat
5 rows · 133.5 tok/s prefill · 69.5 tok/s decode · 82% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 49p · 96out Compared
65% overall
Gemma 4 31B Q4K M 49p · 96out Compared
71% overall
Qwen 3.5 9B Q4K M 36p · 96out Compared
96% overall
Qwen 3.6 27B Dense Q4K M 36p · 96out Compared
83% overall
Qwen 3.6 35B A3B UD Q4K XL 36p · 96out Compared
98% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 49 tok | 96 tok | Compared | 18% | 89% | 65% | 9394 ms | 1165 ms |
| Gemma 4 31B Q4K M | 49 tok | 96 tok | Compared | 21% | 86% | 71% | 10859 ms | 3728 ms |
| Qwen 3.5 9B Q4K M | 36 tok | 96 tok | Compared | 32% | 109% | 96% | 3862 ms | 1302 ms |
| Qwen 3.6 27B Dense Q4K M | 36 tok | 96 tok | Compared | 26% | 95% | 83% | 9739 ms | 3454 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 36 tok | 96 tok | Compared | 77% | 100% | 98% | 8129 ms | 1106 ms |
Raw data (Z = ZINC, L = llama.cpp)
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 89.1 tok/s 550.0 ms L 502.2 tok/s 101.6 ms | Z 91.0 tok/s 11.0 ms/tok L 102.3 tok/s 9.8 ms/tok | Z 90.4 tok/s prompt+decode L 139.9 tok/s prompt+decode |
| Gemma 4 31B Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 41.8 tok/s 1171.9 ms L 202.6 tok/s 251.7 ms | Z 24.6 tok/s 40.6 ms/tok L 28.6 tok/s 34.9 ms/tok | Z 28.6 tok/s prompt+decode L 40.3 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 176.0 tok/s 204.6 ms L 554.2 tok/s 63.2 ms | Z 93.0 tok/s 10.8 ms/tok L 85.6 tok/s 11.7 ms/tok | Z 106.7 tok/s prompt+decode L 111.2 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 49.1 tok/s 733.5 ms L 185.9 tok/s 188.3 ms | Z 29.4 tok/s 34.1 ms/tok L 30.8 tok/s 32.5 ms/tok | Z 33.0 tok/s prompt+decode L 39.8 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 36 tok prompt · 96 tok output · 3 runs · Compared | Z 311.3 tok/s 115.6 ms L 404.7 tok/s 86.5 ms | Z 109.3 tok/s 9.1 ms/tok L 109.0 tok/s 9.2 ms/tok | Z 132.8 tok/s prompt+decode L 136.1 tok/s prompt+decode |
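The percentages in the Details view are not defined on the page, but they are consistent with ZINC throughput divided by llama.cpp throughput for the same row. The scenario tile (133.5 tok/s prefill, 69.5 tok/s decode, 82% overall) is likewise consistent with an arithmetic mean over the rows. The sketch below reproduces both from the table above; the function names are illustrative and the averaging rule is inferred from the numbers, not taken from the ZINC harness.

```python
from statistics import mean

# (model, prefill (Z, L), decode (Z, L), overall (Z, L)) in tok/s,
# copied from the RDNA Quick Chat raw-data table.
ROWS = [
    ("Gemma 4 26B-A4B MoE", (89.1, 502.2), (91.0, 102.3), (90.4, 139.9)),
    ("Gemma 4 31B", (41.8, 202.6), (24.6, 28.6), (28.6, 40.3)),
    ("Qwen 3.5 9B", (176.0, 554.2), (93.0, 85.6), (106.7, 111.2)),
    ("Qwen 3.6 27B Dense", (49.1, 185.9), (29.4, 30.8), (33.0, 39.8)),
    ("Qwen 3.6 35B A3B", (311.3, 404.7), (109.3, 109.0), (132.8, 136.1)),
]


def relative_pct(zinc, llama):
    """ZINC throughput as a percentage of llama.cpp throughput."""
    return 100.0 * zinc / llama


def row_percentages(row):
    """Return rounded (prefill %, decode %, overall %) for one table row."""
    _, prefill, decode, overall = row
    return tuple(round(relative_pct(z, l)) for z, l in (prefill, decode, overall))


def scenario_summary(rows):
    """Assumed tile rule: mean ZINC prefill, mean ZINC decode, mean overall %."""
    mean_prefill = mean(r[1][0] for r in rows)
    mean_decode = mean(r[2][0] for r in rows)
    mean_overall_pct = mean(relative_pct(*r[3]) for r in rows)
    return round(mean_prefill, 1), round(mean_decode, 1), round(mean_overall_pct)


for row in ROWS:
    print(row[0], row_percentages(row))
print(scenario_summary(ROWS))
```

Because the published ratios are presumably computed from unrounded medians, recomputing them from the rounded table values can differ by one percentage point on some rows in other tables.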
Coding Review
5 rows · 397.2 tok/s prefill · 73.5 tok/s decode · 117% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 192p · 160out Compared
67% overall
Gemma 4 31B Q4K M 192p · 160out Compared
103% overall
Qwen 3.5 9B Q4K M 174p · 160out Compared
143% overall
Qwen 3.6 27B Dense Q4K M 174p · 160out Compared
116% overall
Qwen 3.6 35B A3B UD Q4K XL 174p · 160out Compared
157% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 192 tok | 160 tok | Compared | 47% | 90% | 67% | 11534 ms | 1726 ms |
| Gemma 4 31B Q4K M | 192 tok | 160 tok | Compared | 146% | 85% | 103% | 15240 ms | 5886 ms |
| Qwen 3.5 9B Q4K M | 174 tok | 160 tok | Compared | 421% | 110% | 143% | 4921 ms | 1998 ms |
| Qwen 3.6 27B Dense Q4K M | 174 tok | 160 tok | Compared | 261% | 93% | 116% | 12175 ms | 5470 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 174 tok | 160 tok | Compared | 365% | 119% | 157% | 8973 ms | 1613 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 192 tok prompt · 160 tok output · 3 runs · Compared | Z 90.5 tok/s 2120.3 ms L 193.4 tok/s 1002.9 ms | Z 91.7 tok/s 10.9 ms/tok L 101.4 tok/s 9.9 ms/tok | Z 91.1 tok/s prompt+decode L 136.9 tok/s prompt+decode |
| Gemma 4 31B Q4K M 192 tok prompt · 160 tok output · 3 runs · Compared | Z 73.6 tok/s 2607.5 ms L 50.3 tok/s 3856.8 ms | Z 24.1 tok/s 41.4 ms/tok L 28.3 tok/s 35.4 ms/tok | Z 38.1 tok/s prompt+decode L 37.1 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 874.1 tok/s 199.1 ms L 207.7 tok/s 736.5 ms | Z 94.2 tok/s 10.6 ms/tok L 85.4 tok/s 11.7 ms/tok | Z 176.0 tok/s prompt+decode L 123.2 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 194.7 tok/s 893.6 ms L 74.5 tok/s 2054.5 ms | Z 28.4 tok/s 35.3 ms/tok L 30.6 tok/s 32.7 ms/tok | Z 51.1 tok/s prompt+decode L 44.1 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 174 tok prompt · 160 tok output · 3 runs · Compared | Z 753.3 tok/s 231.0 ms L 206.1 tok/s 742.2 ms | Z 128.9 tok/s 7.8 ms/tok L 108.8 tok/s 9.2 ms/tok | Z 226.9 tok/s prompt+decode L 144.3 tok/s prompt+decode |
Incident Context
5 rows · 422.3 tok/s prefill · 73.5 tok/s decode · 139% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 346p · 128out Compared
63% overall
Gemma 4 31B Q4K M 346p · 128out Compared
120% overall
Qwen 3.5 9B Q4K M 322p · 128out Compared
177% overall
Qwen 3.6 27B Dense Q4K M 322p · 128out Compared
140% overall
Qwen 3.6 35B A3B UD Q4K XL 322p · 128out Compared
196% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 346 tok | 128 tok | Compared | 53% | 88% | 63% | 12600 ms | 1423 ms |
| Gemma 4 31B Q4K M | 346 tok | 128 tok | Compared | 160% | 85% | 120% | 15876 ms | 4774 ms |
| Qwen 3.5 9B Q4K M | 322 tok | 128 tok | Compared | 441% | 111% | 177% | 4941 ms | 1642 ms |
| Qwen 3.6 27B Dense Q4K M | 322 tok | 128 tok | Compared | 285% | 93% | 140% | 11858 ms | 4365 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 322 tok | 128 tok | Compared | 370% | 121% | 196% | 8710 ms | 1320 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 346 tok prompt · 128 tok output · 3 runs · Compared | Z 94.4 tok/s 3663.6 ms L 177.8 tok/s 1934.7 ms | Z 88.8 tok/s 11.3 ms/tok L 100.9 tok/s 9.9 ms/tok | Z 92.8 tok/s prompt+decode L 147.4 tok/s prompt+decode |
| Gemma 4 31B Q4K M 346 tok prompt · 128 tok output · 3 runs · Compared | Z 75.8 tok/s 4565.9 ms L 47.2 tok/s 7280.7 ms | Z 23.9 tok/s 41.8 ms/tok L 28.2 tok/s 35.5 ms/tok | Z 47.8 tok/s prompt+decode L 39.9 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 958.8 tok/s 335.8 ms L 217.2 tok/s 1450.4 ms | Z 94.7 tok/s 10.6 ms/tok L 85.4 tok/s 11.7 ms/tok | Z 266.6 tok/s prompt+decode L 150.9 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 217.9 tok/s 1477.5 ms L 76.4 tok/s 4123.3 ms | Z 28.3 tok/s 35.3 ms/tok L 30.6 tok/s 32.7 ms/tok | Z 75.1 tok/s prompt+decode L 53.6 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 322 tok prompt · 128 tok output · 3 runs · Compared | Z 764.8 tok/s 421.0 ms L 206.6 tok/s 1524.7 ms | Z 131.8 tok/s 7.6 ms/tok L 108.9 tok/s 9.2 ms/tok | Z 323.3 tok/s prompt+decode L 164.6 tok/s prompt+decode |
Long Coding Draft
5 rows · 295.2 tok/s prefill · 74.6 tok/s decode · 94% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 70p · 256out Compared
75% overall
Gemma 4 31B Q4K M 70p · 256out Compared
78% overall
Qwen 3.5 9B Q4K M 64p · 256out Compared
110% overall
Qwen 3.6 27B Dense Q4K M 64p · 256out Compared
87% overall
Qwen 3.6 35B A3B UD Q4K XL 64p · 256out Compared
122% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 70 tok | 256 tok | Compared | 14% | 91% | 75% | 10500 ms | 2755 ms |
| Gemma 4 31B Q4K M | 70 tok | 256 tok | Compared | 20% | 86% | 78% | 17737 ms | 9537 ms |
| Qwen 3.5 9B Q4K M | 64 tok | 256 tok | Compared | 92% | 111% | 110% | 5890 ms | 3212 ms |
| Qwen 3.6 27B Dense Q4K M | 64 tok | 256 tok | Compared | 29% | 93% | 87% | 15875 ms | 8799 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 64 tok | 256 tok | Compared | 78% | 125% | 122% | 9550 ms | 2649 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 70 tok prompt · 256 tok output · 3 runs · Compared | Z 94.0 tok/s 744.5 ms L 659.6 tok/s 109.2 ms | Z 92.2 tok/s 10.8 ms/tok L 101.5 tok/s 9.8 ms/tok | Z 92.6 tok/s prompt+decode L 124.1 tok/s prompt+decode |
| Gemma 4 31B Q4K M 70 tok prompt · 256 tok output · 3 runs · Compared | Z 49.5 tok/s 1413.5 ms L 242.4 tok/s 297.1 ms | Z 24.3 tok/s 41.2 ms/tok L 28.3 tok/s 35.4 ms/tok | Z 27.3 tok/s prompt+decode L 34.9 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 800.5 tok/s 80.0 ms L 869.3 tok/s 72.5 ms | Z 94.4 tok/s 10.6 ms/tok L 85.2 tok/s 11.7 ms/tok | Z 114.6 tok/s prompt+decode L 104.0 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 74.5 tok/s 859.3 ms L 256.6 tok/s 245.5 ms | Z 28.3 tok/s 35.4 ms/tok L 30.5 tok/s 32.8 ms/tok | Z 32.3 tok/s prompt+decode L 37.1 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 64 tok prompt · 256 tok output · 3 runs · Compared | Z 457.6 tok/s 139.9 ms L 584.2 tok/s 107.8 ms | Z 134.0 tok/s 7.5 ms/tok L 107.1 tok/s 9.3 ms/tok | Z 156.0 tok/s prompt+decode L 128.1 tok/s prompt+decode |
Methodology
ZINC CLI vs llama.cpp on the same RDNA node · Four-scenario matrix with same-file baselines · 3 measured · 1 warmup
- Hardware: one AMD Radeon AI PRO R9700 benchmark node with 32 GB VRAM and 576 GB/s memory bandwidth. ZINC and llama.cpp run on the same Ubuntu host and use the same GGUF file for each model.
- ZINC path: sync the current source tree to the RDNA node, build with zig build -Doptimize=ReleaseFast, then measure generation through the ZINC CLI with RADV cooperative matrix support enabled.
- ZINC RDNA backend: vulkan. Published RDNA runs default to Vulkan; zinc_rt is opt-in because it is a separate bring-up runtime.
- Baseline path: launch llama.cpp against the same model file, preferring one reusable llama-server per model across the full scenario matrix and falling back to llama-cli when the server path is unavailable.
- Scenarios: Quick Chat, Coding Review, Incident Context, and Long Coding Draft. The prompts are real-world chat, code-review, support-context, and coding-plan workloads instead of synthetic factual completions.
- Statistics: one warmup pass is discarded, then three measured runs are collected. Published prefill, decode, end-to-end throughput, and latency values are medians.
- Execution order: the harness records the full ZINC scenario matrix before starting the llama.cpp baseline phase, so only one inference engine owns the GPU during measurement.
- Run hygiene: published RDNA runs assume a clean node with no competing ZINC, llama.cpp, or other GPU workloads.
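The statistics rule above (discard one warmup pass, publish the median of three measured runs) can be sketched as follows. The function name and the sample values are hypothetical and are not taken from the ZINC harness.

```python
from statistics import median


def published_value(samples, warmup=1):
    """Drop the first `warmup` samples and return the median of the rest."""
    measured = samples[warmup:]
    if not measured:
        raise ValueError("no measured runs left after discarding warmup")
    return median(measured)


# Hypothetical decode throughputs in tok/s: one warmup pass, then three runs.
decode_tok_s = [71.2, 90.8, 91.0, 91.4]
print(published_value(decode_tok_s))  # 91.0
```

With only three measured runs, the median is simply the middle value, so a single slow or fast outlier does not move the published number.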
Mac Studio
Metal
Apple M4 Max GPU · 64 GB · 3 measured · 1 warmup
Run info
- GPU: Apple M4 Max GPU
- Chip: Apple M4 Max
- Memory: 64 GB
- Artifact: Jun 11, 2026, 4:11 PM
- ZINC: ce9f0c8f1efd
- llama.cpp: 607
Quick Chat
5 rows · 151.5 tok/s prefill · 35.3 tok/s decode · 120% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 49p · 96out ZINC Only
n/a overall
Gemma 4 31B Q4K M 49p · 96out Compared
118% overall
Qwen 3.5 9B Q4K M 36p · 96out Compared
66% overall
Qwen 3.6 27B Dense Q4K M 35p · 96out Run Failed
n/a overall
Qwen 3.6 35B A3B UD Q4K XL 36p · 96out Compared
175% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 49 tok | 96 tok | ZINC Only | n/a | n/a | n/a | 8202 ms | n/a |
| Gemma 4 31B Q4K M | 49 tok | 96 tok | Compared | 408% | 108% | 118% | 15606 ms | 12565 ms |
| Qwen 3.5 9B Q4K M | 36 tok | 96 tok | Compared | 80% | 62% | 66% | 7348 ms | 3574 ms |
| Qwen 3.6 27B Dense Q4K M | 35 tok | 96 tok | Run Failed | n/a | n/a | n/a | n/a | 4302 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 36 tok | 96 tok | Compared | 101% | 199% | 175% | 3591 ms | 2352 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output · 3 runs · ZINC Only | Z 350.1 tok/s 140.0 ms L n/a n/a Comparison unavailable Prefill telemetry is missing on one side. | Z 32.1 tok/s 31.1 ms/tok L n/a n/a Comparison unavailable No same-machine decode ratio for this row yet. | Z 46.4 tok/s prompt+decode L n/a prompt+decode Directional only No end-to-end comparison for this row. |
| Gemma 4 31B Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 131.3 tok/s 373.1 ms L 32.2 tok/s 1584.8 ms | Z 9.2 tok/s 109.1 ms/tok L 8.5 tok/s 118.0 ms/tok | Z 13.4 tok/s prompt+decode L 11.3 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 27.3 tok/s 1317.1 ms L 34.1 tok/s 1026.2 ms | Z 17.3 tok/s 57.7 ms/tok L 27.8 tok/s 36.0 ms/tok | Z 19.2 tok/s prompt+decode L 29.3 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 35 tok prompt · 96 tok output · 3 runs · Run Failed | Z n/a n/a L 28.7 tok/s 1219.0 ms Comparison unavailable Prefill telemetry is missing on one side. | Z n/a n/a L 23.1 tok/s 43.3 ms/tok Comparison unavailable No same-machine decode ratio for this row yet. | Z n/a prompt+decode L 24.4 tok/s prompt+decode Directional only No end-to-end comparison for this row. |
| Qwen 3.6 35B A3B UD Q4K XL 36 tok prompt · 96 tok output · 3 runs · Compared | Z 97.2 tok/s 370.4 ms L 96.4 tok/s 363.2 ms | Z 82.5 tok/s 12.1 ms/tok L 41.6 tok/s 24.1 ms/tok | Z 86.1 tok/s prompt+decode L 49.2 tok/s prompt+decode |
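The row labels in this table follow a pattern: rows with results from both engines are "Compared", rows with only ZINC results are "ZINC Only", and rows without ZINC results are "Run Failed" whether or not llama.cpp produced numbers. The sketch below encodes that pattern. The rule is inferred from the published rows rather than from the harness, and the function names are hypothetical.

```python
from typing import Optional


def row_status(zinc_tok_s: Optional[float], llama_tok_s: Optional[float]) -> str:
    """Classify a row from which side published end-to-end throughput."""
    if zinc_tok_s is None:
        return "Run Failed"   # no ZINC result, regardless of the baseline
    if llama_tok_s is None:
        return "ZINC Only"    # ZINC ran, but there is no baseline to compare with
    return "Compared"


def overall_pct(zinc_tok_s: Optional[float], llama_tok_s: Optional[float]) -> Optional[int]:
    """Return ZINC as a percentage of llama.cpp, or None when not comparable."""
    if row_status(zinc_tok_s, llama_tok_s) != "Compared":
        return None
    return round(100 * zinc_tok_s / llama_tok_s)


# End-to-end tok/s values from the Mac Studio Quick Chat table above.
print(row_status(46.4, None), overall_pct(46.4, None))   # Gemma 4 26B-A4B MoE
print(row_status(None, 24.4), overall_pct(None, 24.4))   # Qwen 3.6 27B Dense
print(row_status(86.1, 49.2), overall_pct(86.1, 49.2))   # Qwen 3.6 35B A3B
```

Returning `None` instead of a ratio mirrors the page's "n/a overall" for rows where one side is missing.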
Coding Review
5 rows · 49.8 tok/s prefill · 29.3 tok/s decode · 63% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M n/a prompt · 160out Run Failed
n/a overall
Gemma 4 31B Q4K M n/a prompt · 160out Run Failed
n/a overall
Qwen 3.5 9B Q4K M 174p · 160out Compared
63% overall
Qwen 3.6 27B Dense Q4K M 153p · 160out Run Failed
n/a overall
Qwen 3.6 35B A3B UD Q4K XL 174p · 160out ZINC Only
n/a overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | n/a | 160 tok | Run Failed | n/a | n/a | n/a | n/a | n/a |
| Gemma 4 31B Q4K M | n/a | 160 tok | Run Failed | n/a | n/a | n/a | n/a | n/a |
| Qwen 3.5 9B Q4K M | 174 tok | 160 tok | Compared | 66% | 62% | 63% | 17614 ms | 5893 ms |
| Qwen 3.6 27B Dense Q4K M | 153 tok | 160 tok | Run Failed | n/a | n/a | n/a | n/a | 7104 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 174 tok | 160 tok | ZINC Only | n/a | n/a | n/a | 8098 ms | n/a |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M n/a · 160 tok output · 0 runs · Run Failed | Telemetry pending No prefill metrics were published for this run. | Run missing No decode result was published for this row. | Telemetry pending No end-to-end or latency metrics were published. |
| Gemma 4 31B Q4K M n/a · 160 tok output · 0 runs · Run Failed | Telemetry pending No prefill metrics were published for this run. | Run missing No decode result was published for this row. | Telemetry pending No end-to-end or latency metrics were published. |
| Qwen 3.5 9B Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 22.6 tok/s 7694.4 ms L 34.5 tok/s 4440.5 ms | Z 17.1 tok/s 58.5 ms/tok L 27.7 tok/s 36.1 ms/tok | Z 19.6 tok/s prompt+decode L 30.8 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 153 tok prompt · 160 tok output · 3 runs · Run Failed | Z n/a n/a L 28.7 tok/s 5327.5 ms Comparison unavailable Prefill telemetry is missing on one side. | Z n/a n/a L 23.0 tok/s 43.5 ms/tok Comparison unavailable No same-machine decode ratio for this row yet. | Z n/a prompt+decode L 25.5 tok/s prompt+decode Directional only No end-to-end comparison for this row. |
| Qwen 3.6 35B A3B UD Q4K XL 174 tok prompt · 160 tok output · 3 runs · ZINC Only | Z 76.9 tok/s 2262.1 ms L n/a n/a Comparison unavailable Prefill telemetry is missing on one side. | Z 41.5 tok/s 24.1 ms/tok L n/a n/a Comparison unavailable No same-machine decode ratio for this row yet. | Z 54.6 tok/s prompt+decode L n/a prompt+decode Directional only No end-to-end comparison for this row. |
Incident Context
5 rows · 11.6 tok/s prefill · 10.5 tok/s decode · 42% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M n/a prompt · 128out Run Failed
n/a overall
Gemma 4 31B Q4K M n/a prompt · 128out Run Failed
n/a overall
Qwen 3.5 9B Q4K M n/a prompt · 128out Run Failed
n/a overall
Qwen 3.6 27B Dense Q4K M 322p · 128out Compared
42% overall
Qwen 3.6 35B A3B UD Q4K XL 315p · 128out Run Failed
n/a overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | n/a | 128 tok | Run Failed | n/a | n/a | n/a | n/a | n/a |
| Gemma 4 31B Q4K M | n/a | 128 tok | Run Failed | n/a | n/a | n/a | n/a | n/a |
| Qwen 3.5 9B Q4K M | n/a | 128 tok | Run Failed | n/a | n/a | n/a | n/a | n/a |
| Qwen 3.6 27B Dense Q4K M | 322 tok | 128 tok | Compared | 40% | 46% | 42% | 41281 ms | 5703 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 315 tok | 128 tok | Run Failed | n/a | n/a | n/a | n/a | 3082 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M n/a · 128 tok output · 0 runs · Run Failed | Telemetry pending No prefill metrics were published for this run. | Run missing No decode result was published for this row. | Telemetry pending No end-to-end or latency metrics were published. |
| Gemma 4 31B Q4K M n/a · 128 tok output · 0 runs · Run Failed | Telemetry pending No prefill metrics were published for this run. | Run missing No decode result was published for this row. | Telemetry pending No end-to-end or latency metrics were published. |
| Qwen 3.5 9B Q4K M n/a · 128 tok output · 0 runs · Run Failed | Telemetry pending No prefill metrics were published for this run. | Run missing No decode result was published for this row. | Telemetry pending No end-to-end or latency metrics were published. |
| Qwen 3.6 27B Dense Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 11.6 tok/s 27839.4 ms L 28.6 tok/s 10995.6 ms | Z 10.5 tok/s 95.4 ms/tok L 23.0 tok/s 43.4 ms/tok | Z 11.3 tok/s prompt+decode L 26.8 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 315 tok prompt · 128 tok output · 3 runs · Run Failed | Z n/a n/a L 90.7 tok/s 3471.7 ms Comparison unavailable Prefill telemetry is missing on one side. | Z n/a n/a L 42.1 tok/s 23.8 ms/tok Comparison unavailable No same-machine decode ratio for this row yet. | Z n/a prompt+decode L 68.0 tok/s prompt+decode Directional only No end-to-end comparison for this row. |
Long Coding Draft
5 rows · 159.6 tok/s prefill · 34.3 tok/s decode · 80% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 70p · 256out ZINC Only
n/a overall
Gemma 4 31B Q4K M n/a prompt · 256out Run Failed
n/a overall
Qwen 3.5 9B Q4K M 63p · 256out Run Failed
n/a overall
Qwen 3.6 27B Dense Q4K M 64p · 256out Compared
45% overall
Qwen 3.6 35B A3B UD Q4K XL 64p · 256out Compared
115% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 70 tok | 256 tok | ZINC Only | n/a | n/a | n/a | 11193 ms | n/a |
| Gemma 4 31B Q4K M | n/a | 256 tok | Run Failed | n/a | n/a | n/a | n/a | n/a |
| Qwen 3.5 9B Q4K M | 63 tok | 256 tok | Run Failed | n/a | n/a | n/a | n/a | 9408 ms |
| Qwen 3.6 27B Dense Q4K M | 64 tok | 256 tok | Compared | 40% | 46% | 45% | 31023 ms | 11304 ms |
| Qwen 3.6 35B A3B UD Q4K XL | 64 tok | 256 tok | Compared | 113% | 116% | 115% | 8250 ms | 6277 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 70 tok prompt · 256 tok output · 3 runs · ZINC Only | Z 365.2 tok/s 191.7 ms L n/a n/a Comparison unavailable Prefill telemetry is missing on one side. | Z 44.7 tok/s 22.4 ms/tok L n/a n/a Comparison unavailable No same-machine decode ratio for this row yet. | Z 55.1 tok/s prompt+decode L n/a prompt+decode Directional only No end-to-end comparison for this row. |
| Gemma 4 31B Q4K M n/a · 256 tok output · 0 runs · Run Failed | Telemetry pending No prefill metrics were published for this run. | Run missing No decode result was published for this row. | Telemetry pending No end-to-end or latency metrics were published. |
| Qwen 3.5 9B Q4K M 63 tok prompt · 256 tok output · 3 runs · Run Failed | Z n/a n/a L 34.4 tok/s 1833.6 ms Comparison unavailable Prefill telemetry is missing on one side. | Z n/a n/a L 27.5 tok/s 36.3 ms/tok Comparison unavailable No same-machine decode ratio for this row yet. | Z n/a prompt+decode L 28.7 tok/s prompt+decode Directional only No end-to-end comparison for this row. |
| Qwen 3.6 27B Dense Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 11.6 tok/s 5506.6 ms L 28.7 tok/s 2197.3 ms | Z 10.6 tok/s 94.6 ms/tok L 22.9 tok/s 43.6 ms/tok | Z 10.8 tok/s prompt+decode L 23.9 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 64 tok prompt · 256 tok output · 3 runs · Compared | Z 102.0 tok/s 627.2 ms L 90.1 tok/s 699.5 ms | Z 47.5 tok/s 21.0 ms/tok L 41.1 tok/s 24.3 ms/tok | Z 53.2 tok/s prompt+decode L 46.1 tok/s prompt+decode |
Methodology
ZINC CLI vs llama.cpp on the same Apple Silicon machine · Four-scenario matrix with same-file baselines · 3 measured · 1 warmup
- Hardware: the local Apple Silicon target shown above. ZINC and llama.cpp run on the same machine and use the same GGUF file for each model.
- ZINC path: build the current workspace with zig build -Doptimize=ReleaseFast, then measure generation through the ZINC CLI.
- Baseline path: launch llama.cpp against the same model file, preferring one reusable llama-server per model across the full scenario matrix and falling back to llama-cli when the server path is unavailable.
- Scenarios: Quick Chat, Coding Review, Incident Context, and Long Coding Draft. The prompts are real-world chat, code-review, support-context, and coding-plan workloads instead of synthetic factual completions.
- Statistics: one warmup pass is discarded, then three measured runs are collected. Published prefill, decode, end-to-end throughput, and latency values are medians.
- Model resolution: managed catalog models are loaded from the ZINC model cache when available; explicit GGUF paths are preserved when provided.
- Run hygiene: published local runs assume no competing workload is saturating CPU, memory bandwidth, or GPU.
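The baseline path above prefers `llama-server` and falls back to `llama-cli`. A minimal sketch of that preference order follows, assuming "unavailable" means the binary cannot be found on `PATH`; the real harness may use a different check (for example, a failed server start). The function name and the injectable `which` parameter are hypothetical.

```python
import shutil
from typing import Callable, Optional


def pick_baseline_binary(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the path of the preferred llama.cpp binary.

    Tries llama-server first, then llama-cli, as described in the
    methodology. `which` is injectable so the lookup can be tested.
    """
    for name in ("llama-server", "llama-cli"):
        path = which(name)
        if path:
            return path
    raise FileNotFoundError("neither llama-server nor llama-cli was found")


if __name__ == "__main__":
    try:
        print(pick_baseline_binary())
    except FileNotFoundError as err:
        print(err)
```

Keeping one server process per model, as the methodology describes, avoids paying model load time once per scenario, which is presumably why the server path is preferred.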
Intel Arc bench node
Intel Arc
Intel(R) Graphics (BMG G31) · n/a · 3 measured · 1 warmup
Run info
- GPU: Intel(R) Graphics (BMG G31)
- Chip: Intel Arc Pro B70 / BMG G31
- Memory: n/a
- Artifact: Jun 7, 2026, 2:09 PM
- ZINC: 0e1c7135b67f-dirty
- llama.cpp: 8863
Quick Chat
5 rows · 30.7 tok/s prefill · 26.7 tok/s decode · 67% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 49p · 96out Compared
20% overall
Gemma 4 31B Q4K M 49p · 96out Compared
72% overall
Qwen 3.5 9B Q4K M 36p · 96out Compared
92% overall
Qwen 3.6 27B Q4K M 36p · 96out Compared
84% overall
Qwen 3.6 35B A3B Q4K XL 36p · 2out/96cap Compared
54% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 49 tok | 96 tok | Compared | 6% | 26% | 20% | 61223 ms | 1997 ms |
| Gemma 4 31B Q4K M | 49 tok | 96 tok | Compared | 27% | 93% | 72% | 15542 ms | 6636 ms |
| Qwen 3.5 9B Q4K M | 36 tok | 96 tok | Compared | 48% | 106% | 92% | 4435 ms | 2300 ms |
| Qwen 3.6 27B Q4K M | 36 tok | 96 tok | Compared | 54% | 95% | 84% | 16372 ms | 6261 ms |
| Qwen 3.6 35B A3B Q4K XL | 36 tok | 2 tok (96 tok cap) | Compared | 51% | 140% | 54% | 11442 ms | 1860 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 17.5 tok/s 2794.5 ms L 270.1 tok/s 192.5 ms | Z 16.1 tok/s 62.0 ms/tok L 62.7 tok/s 16.0 ms/tok | Z 16.6 tok/s prompt+decode L 84.7 tok/s prompt+decode |
| Gemma 4 31B Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 17.8 tok/s 2751.0 ms L 65.9 tok/s 789.1 ms | Z 16.0 tok/s 62.6 ms/tok L 17.3 tok/s 58.0 ms/tok | Z 16.5 tok/s prompt+decode L 23.0 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 67.8 tok/s 531.3 ms L 142.2 tok/s 246.1 ms | Z 56.7 tok/s 17.6 ms/tok L 53.4 tok/s 18.7 ms/tok | Z 59.3 tok/s prompt+decode L 64.3 tok/s prompt+decode |
| Qwen 3.6 27B Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 19.8 tok/s 1822.9 ms L 36.9 tok/s 948.7 ms | Z 18.1 tok/s 55.2 ms/tok L 19.1 tok/s 52.4 ms/tok | Z 18.5 tok/s prompt+decode L 22.0 tok/s prompt+decode |
| Qwen 3.6 35B A3B Q4K XL 36 tok prompt · 2 tok output/96 tok cap · 3 runs · Compared | Z 68.0 tok/s 529.4 ms L 132.3 tok/s 264.5 ms | Z 104.0 tok/s 9.6 ms/tok L 74.5 tok/s 13.4 ms/tok | Z 69.3 tok/s prompt+decode L 127.1 tok/s prompt+decode |
Coding Review
5 rows · 48.1 tok/s prefill · 39.6 tok/s decode · 78% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 192p · 160out Compared
69% overall
Gemma 4 31B Q4K M 192p · 160out Compared
81% overall
Qwen 3.5 9B Q4K M 174p · 160out Compared
91% overall
Qwen 3.6 27B Q4K M 174p · 160out Compared
79% overall
Qwen 3.6 35B A3B Q4K XL 174p · 5out/160cap Compared
70% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 192 tok | 160 tok | Compared | 64% | 74% | 69% | 12082 ms | 2908 ms |
| Gemma 4 31B Q4K M | 192 tok | 160 tok | Compared | 68% | 95% | 81% | 27690 ms | 10149 ms |
| Qwen 3.5 9B Q4K M | 174 tok | 160 tok | Compared | 74% | 105% | 91% | 7505 ms | 3380 ms |
| Qwen 3.6 27B Q4K M | 174 tok | 160 tok | Compared | 63% | 95% | 79% | 23829 ms | 8914 ms |
| Qwen 3.6 35B A3B Q4K XL | 174 tok | 5 tok (160 tok cap) | Compared | 70% | 88% | 70% | 10462 ms | 2480 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 192 tok prompt · 160 tok output · 3 runs · Compared | Z 62.8 tok/s 3059.6 ms L 98.5 tok/s 1979.8 ms | Z 44.5 tok/s 22.5 ms/tok L 59.9 tok/s 16.7 ms/tok | Z 52.9 tok/s prompt+decode L 76.1 tok/s prompt+decode |
| Gemma 4 31B Q4K M 192 tok prompt · 160 tok output · 3 runs · Compared | Z 17.5 tok/s 10965.3 ms L 25.9 tok/s 7533.5 ms | Z 15.8 tok/s 63.5 ms/tok L 16.5 tok/s 60.6 ms/tok | Z 16.7 tok/s prompt+decode L 20.6 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 70.3 tok/s 2476.9 ms L 94.5 tok/s 1618.4 ms | Z 55.5 tok/s 18.0 ms/tok L 52.6 tok/s 19.0 ms/tok | Z 62.3 tok/s prompt+decode L 68.4 tok/s prompt+decode |
| Qwen 3.6 27B Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 20.0 tok/s 8688.9 ms L 32.0 tok/s 4774.8 ms | Z 17.9 tok/s 55.9 ms/tok L 18.9 tok/s 52.8 ms/tok | Z 18.9 tok/s prompt+decode L 24.1 tok/s prompt+decode |
| Qwen 3.6 35B A3B Q4K XL 174 tok prompt · 5 tok output/160 tok cap · 3 runs · Compared | Z 70.0 tok/s 2485.1 ms L 100.7 tok/s 1519.3 ms | Z 64.5 tok/s 15.5 ms/tok L 73.6 tok/s 13.6 ms/tok | Z 69.9 tok/s prompt+decode L 99.7 tok/s prompt+decode |
Incident Context
5 rows · 48.2 tok/s prefill · 36.8 tok/s decode · 76% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 346p · 128out Compared
69% overall
Gemma 4 31B Q4K M 346p · 128out Compared
80% overall
Qwen 3.5 9B Q4K M 322p · 128out Compared
88% overall
Qwen 3.6 27B Q4K M 322p · 128out Compared
72% overall
Qwen 3.6 35B A3B Q4K XL 322p · 128out Compared
72% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 346 tok | 128 tok | Compared | 66% | 74% | 69% | 13847 ms | 2470 ms |
| Gemma 4 31B Q4K M | 346 tok | 128 tok | Compared | 73% | 96% | 80% | 34975 ms | 8390 ms |
| Qwen 3.5 9B Q4K M | 322 tok | 128 tok | Compared | 80% | 105% | 88% | 9027 ms | 2717 ms |
| Qwen 3.6 27B Q4K M | 322 tok | 128 tok | Compared | 62% | 94% | 72% | 30135 ms | 7218 ms |
| Qwen 3.6 35B A3B Q4K XL | 322 tok | 128 tok | Compared | 72% | 71% | 72% | 14642 ms | 2032 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 346 tok prompt · 128 tok output · 3 runs · Compared | Z 62.6 tok/s 5526.6 ms L 94.8 tok/s 3639.0 ms | Z 43.7 tok/s 22.9 ms/tok L 59.2 tok/s 16.9 ms/tok | Z 56.1 tok/s prompt+decode L 81.6 tok/s prompt+decode |
| Gemma 4 31B Q4K M 346 tok prompt · 128 tok output · 3 runs · Compared | Z 17.4 tok/s 19856.1 ms L 23.8 tok/s 14499.5 ms | Z 15.6 tok/s 64.3 ms/tok L 16.2 tok/s 61.8 ms/tok | Z 16.9 tok/s prompt+decode L 21.1 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 70.5 tok/s 4567.8 ms L 88.2 tok/s 3571.7 ms | Z 55.0 tok/s 18.2 ms/tok L 52.5 tok/s 19.0 ms/tok | Z 65.3 tok/s prompt+decode L 73.9 tok/s prompt+decode |
| Qwen 3.6 27B Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 20.0 tok/s 16110.9 ms L 32.1 tok/s 9811.9 ms | Z 17.8 tok/s 56.1 ms/tok L 18.9 tok/s 53.0 ms/tok | Z 19.3 tok/s prompt+decode L 26.8 tok/s prompt+decode |
| Qwen 3.6 35B A3B Q4K XL 322 tok prompt · 128 tok output · 3 runs · Compared | Z 70.3 tok/s 4580.4 ms L 97.6 tok/s 3227.9 ms | Z 51.8 tok/s 19.3 ms/tok L 73.4 tok/s 13.6 ms/tok | Z 63.8 tok/s prompt+decode L 89.2 tok/s prompt+decode |
Long Coding Draft
5 rows · 47.9 tok/s prefill · 37.2 tok/s decode · 76% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 70p · 256out Compared
65% overall
Gemma 4 31B Q4K M 70p · 256out Compared
80% overall
Qwen 3.5 9B Q4K M 64p · 256out Compared
91% overall
Qwen 3.6 27B Q4K M 64p · 256out Compared
82% overall
Qwen 3.6 35B A3B Q4K XL 64p · 256out Compared
63% overall
Details
| Model | Prompt | Output | Status | Prefill | Decode | Overall (prompt+decode) | Cold latency, ZINC | Cold latency, llama.cpp |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M | 70 tok | 256 tok | Compared | 19% | 74% | 65% | 12290 ms | 4670 ms |
| Gemma 4 31B Q4K M | 70 tok | 256 tok | Compared | 21% | 94% | 80% | 27118 ms | 16368 ms |
| Qwen 3.5 9B Q4K M | 64 tok | 256 tok | Compared | 24% | 105% | 91% | 7767 ms | 5318 ms |
| Qwen 3.6 27B Q4K M | 64 tok | 256 tok | Compared | 25% | 94% | 82% | 24230 ms | 14621 ms |
| Qwen 3.6 35B A3B Q4K XL | 64 tok | 256 tok | Compared | 26% | 70% | 63% | 13455 ms | 3996 ms |
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 70 tok prompt · 256 tok output · 3 runs · Compared | Z 63.3 tok/s 1106.1 ms L 331.1 tok/s 220.5 ms | Z 44.7 tok/s 22.4 ms/tok L 60.8 tok/s 16.5 ms/tok | Z 47.7 tok/s prompt+decode L 73.7 tok/s prompt+decode |
| Gemma 4 31B Q4K M 70 tok prompt · 256 tok output · 3 runs · Compared | Z 17.7 tok/s 3956.2 ms L 84.5 tok/s 863.7 ms | Z 15.8 tok/s 63.4 ms/tok L 16.8 tok/s 59.7 ms/tok | Z 16.1 tok/s prompt+decode L 20.2 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 69.1 tok/s 926.4 ms L 287.3 tok/s 219.3 ms | Z 55.4 tok/s 18.1 ms/tok L 52.8 tok/s 18.9 ms/tok | Z 57.6 tok/s prompt+decode L 63.1 tok/s prompt+decode |
| Qwen 3.6 27B Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 19.9 tok/s 3217.7 ms L 80.6 tok/s 781.2 ms | Z 17.9 tok/s 56.0 ms/tok L 18.9 tok/s 52.9 ms/tok | Z 18.2 tok/s prompt+decode L 22.3 tok/s prompt+decode |
| Qwen 3.6 35B A3B Q4K XL 64 tok prompt · 256 tok output · 3 runs · Compared | Z 69.3 tok/s 923.3 ms L 271.0 tok/s 232.4 ms | Z 52.1 tok/s 19.2 ms/tok L 74.2 tok/s 13.5 ms/tok | Z 54.8 tok/s prompt+decode L 86.8 tok/s prompt+decode |
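
The percentages in the Details cards appear to be ZINC throughput divided by llama.cpp throughput for the same phase, rounded to a whole percent. The page does not state the formula, so the sketch below is an assumption that happens to reproduce the published figures for the first row of the table above; the helper name `relative_percent` is illustrative only.

```python
def relative_percent(zinc_tok_s: float, llama_tok_s: float) -> int:
    """ZINC throughput as a whole-number percentage of the llama.cpp baseline."""
    if llama_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return round(100.0 * zinc_tok_s / llama_tok_s)


# (ZINC tok/s, llama.cpp tok/s) for Gemma 4 26B-A4B MoE, Long Coding Draft,
# copied from the raw-data table above.
row = {
    "prefill": (63.3, 331.1),
    "decode": (44.7, 60.8),
    "overall": (47.7, 73.7),
}

percentages = {phase: relative_percent(z, l) for phase, (z, l) in row.items()}
print(percentages)
```

Under this reading, values above 100% mean ZINC was faster than llama.cpp in that phase.
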
Methodology
ZINC CLI vs llama.cpp on the same Intel Arc node · Four-scenario matrix with same-file baselines · 3 measured · 1 warmup
- Hardware: one Intel Arc Vulkan benchmark node. ZINC and llama.cpp run on the same Ubuntu host and use the same GGUF file for each model.
- ZINC path: optionally sync the current source tree to the Intel node, build with zig build -Doptimize=ReleaseFast, then measure generation through the ZINC CLI.
- Baseline path: launch llama.cpp against the same model file, preferring one reusable llama-server per model across the full scenario matrix and falling back to llama-cli when the server path is unavailable or fails to start.
- Scenarios: Quick Chat, Coding Review, Incident Context, and Long Coding Draft. The prompts are real-world chat, code-review, support-context, and coding-plan workloads instead of synthetic factual completions.
- Statistics: one warmup pass is discarded, then three measured runs are collected. Published prefill, decode, end-to-end throughput, and latency values are medians.
- Execution order: the harness records the full ZINC scenario matrix before starting the llama.cpp baseline phase, so only one inference engine owns the GPU during measurement.
- Run hygiene: published Intel runs assume a clean node with no competing ZINC, llama.cpp, or other GPU workloads.
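
The statistics rule above (discard one warmup pass, then publish the median of three measured runs) can be sketched as follows. This is a minimal illustration, not the harness itself: the function name `summarize` and the sample throughputs are hypothetical.

```python
from statistics import median


def summarize(samples: list[float], warmup: int = 1) -> float:
    """Drop the leading warmup passes and return the median of the rest."""
    measured = samples[warmup:]
    if not measured:
        raise ValueError("no measured runs left after discarding warmup")
    return median(measured)


# Hypothetical decode throughputs in tok/s: one warmup pass, then three measured runs.
runs = [31.0, 55.1, 55.4, 55.9]
published = summarize(runs)
print(published)
```

A median over three runs ignores a single slow or fast outlier, which is presumably why it is preferred here over a mean.
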
NVIDIA CUDA bench node
RTX 5090
GeForce RTX 5090 · 32 GB VRAM · 1792 GB/s · 3 measured · 1 warmup
Run info
- GPU: GeForce RTX 5090
- Chip: NVIDIA GeForce RTX 5090
- Memory: 32 GB VRAM · 1792 GB/s
- Artifact: Jun 12, 2026, 1:46 PM
- ZINC: dcc7cc01
- llama.cpp: 1
Quick Chat
5 rows · 48.5 tok/s prefill · 51.0 tok/s decode · 64% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 49p · 96out Compared
29% overall
Gemma 4 31B Q4K M 49p · 96out Compared
53% overall
Qwen 3.5 9B Q4K M 36p · 96out Compared
95% overall
Qwen 3.6 27B Dense Q4K M 36p · 96out Compared
96% overall
Qwen 3.6 35B A3B UD Q4K XL 36p · 96out Compared
45% overall
Details
Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output
Compared Prefill 11%
Decode 39%
Overall 29%
Prompt+decode. Cold latency: ZINC 18473 ms · llama.cpp 2300 ms
Gemma 4 31B Q4K M 49 tok prompt · 96 tok output
Compared Prefill 13%
Decode 75%
Overall 53%
Prompt+decode. Cold latency: ZINC 20510 ms · llama.cpp 3403 ms
Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output
Compared Prefill 155%
Decode 74%
Overall 95%
Prompt+decode. Cold latency: ZINC 11115 ms · llama.cpp 2248 ms
Qwen 3.6 27B Dense Q4K M 36 tok prompt · 96 tok output
Compared Prefill 139%
Decode 79%
Overall 96%
Prompt+decode. Cold latency: ZINC 19392 ms · llama.cpp 3258 ms
Qwen 3.6 35B A3B UD Q4K XL 36 tok prompt · 96 tok output
Compared Prefill 71%
Decode 33%
Overall 45%
Prompt+decode. Cold latency: ZINC 22444 ms · llama.cpp 2265 ms
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 40.8 tok/s 1201.4 ms L 382.6 tok/s 135.9 ms | Z 42.4 tok/s 23.6 ms/tok L 109.4 tok/s 9.1 ms/tok | Z 41.9 tok/s prompt+decode L 144.2 tok/s prompt+decode |
| Gemma 4 31B Q4K M 49 tok prompt · 96 tok output · 3 runs · Compared | Z 36.7 tok/s 1336.5 ms L 291.7 tok/s 178.2 ms | Z 39.8 tok/s 25.1 ms/tok L 53.0 tok/s 18.9 ms/tok | Z 38.7 tok/s prompt+decode L 73.3 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 89.3 tok/s 403.3 ms L 57.5 tok/s 608.5 ms | Z 85.5 tok/s 11.7 ms/tok L 115.7 tok/s 8.6 ms/tok | Z 86.5 tok/s prompt+decode L 90.7 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 36 tok prompt · 96 tok output · 3 runs · Compared | Z 41.0 tok/s 878.7 ms L 29.6 tok/s 1183.3 ms | Z 45.4 tok/s 22.1 ms/tok L 57.6 tok/s 17.4 ms/tok | Z 44.1 tok/s prompt+decode L 45.8 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 36 tok prompt · 96 tok output · 3 runs · Compared | Z 34.6 tok/s 1040.3 ms L 48.8 tok/s 717.5 ms | Z 42.0 tok/s 23.8 ms/tok L 128.4 tok/s 7.8 ms/tok | Z 39.7 tok/s prompt+decode L 88.8 tok/s prompt+decode |
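
Each table cell pairs a throughput with a latency, and the two look related in the obvious way: prefill tok/s is prompt tokens over prefill time, decode ms/tok is the reciprocal of decode tok/s, and the prompt+decode figure is total tokens over total time. The page does not spell these formulas out, so the sketch below is an assumption, checked against the ZINC numbers for the first row of the table above. Small mismatches are expected if each published column is an independent median.

```python
def prefill_tok_s(prompt_tokens: int, prefill_ms: float) -> float:
    """Prompt tokens processed per second during prefill."""
    return prompt_tokens / (prefill_ms / 1000.0)


def decode_ms_per_tok(decode_tok_s: float) -> float:
    """Per-token decode latency implied by a decode throughput."""
    return 1000.0 / decode_tok_s


def end_to_end_tok_s(prompt_tokens: int, output_tokens: int,
                     prefill_ms: float, decode_tok_s: float) -> float:
    """Total tokens divided by prefill time plus decode time."""
    total_s = prefill_ms / 1000.0 + output_tokens / decode_tok_s
    return (prompt_tokens + output_tokens) / total_s


# ZINC, Gemma 4 26B-A4B MoE, Quick Chat: 49 tok prompt, 96 tok output,
# 1201.4 ms prefill, 42.4 tok/s decode (values from the table above).
prefill = prefill_tok_s(49, 1201.4)
latency = decode_ms_per_tok(42.4)
overall = end_to_end_tok_s(49, 96, 1201.4, 42.4)
print(f"{prefill:.1f} tok/s prefill, {latency:.1f} ms/tok, {overall:.1f} tok/s overall")
```

The table lists 40.8 tok/s, 23.6 ms/tok, and 41.9 tok/s for this row; the derived overall figure lands slightly below the published one.
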
Coding Review
5 rows · 54.2 tok/s prefill · 53.4 tok/s decode · 73% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 192p · 160out Compared
22% overall
Gemma 4 31B Q4K M 192p · 160out Compared
36% overall
Qwen 3.5 9B Q4K M 174p · 160out Compared
119% overall
Qwen 3.6 27B Dense Q4K M 174p · 160out Compared
128% overall
Qwen 3.6 35B A3B UD Q4K XL 174p · 160out Compared
63% overall
Details
Gemma 4 26B-A4B MoE Q4K M 192 tok prompt · 160 tok output
Compared Prefill 3%
Decode 45%
Overall 22%
Prompt+decode. Cold latency: ZINC 22668 ms · llama.cpp 2908 ms
Gemma 4 31B Q4K M 192 tok prompt · 160 tok output
Compared Prefill 4%
Decode 74%
Overall 36%
Prompt+decode. Cold latency: ZINC 26646 ms · llama.cpp 4644 ms
Qwen 3.5 9B Q4K M 174 tok prompt · 160 tok output
Compared Prefill 154%
Decode 81%
Overall 119%
Prompt+decode. Cold latency: ZINC 13260 ms · llama.cpp 2932 ms
Qwen 3.6 27B Dense Q4K M 174 tok prompt · 160 tok output
Compared Prefill 173%
Decode 82%
Overall 128%
Prompt+decode. Cold latency: ZINC 23038 ms · llama.cpp 4339 ms
Qwen 3.6 35B A3B UD Q4K XL 174 tok prompt · 160 tok output
Compared Prefill 90%
Decode 34%
Overall 63%
Prompt+decode. Cold latency: ZINC 26685 ms · llama.cpp 3095 ms
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 192 tok prompt · 160 tok output · 3 runs · Compared | Z 46.9 tok/s 4097.6 ms L 1456.6 tok/s 133.9 ms | Z 48.0 tok/s 20.8 ms/tok L 107.8 tok/s 9.3 ms/tok | Z 47.4 tok/s prompt+decode L 217.8 tok/s prompt+decode |
| Gemma 4 31B Q4K M 192 tok prompt · 160 tok output · 3 runs · Compared | Z 39.8 tok/s 4828.0 ms L 1006.4 tok/s 193.8 ms | Z 38.6 tok/s 25.9 ms/tok L 51.9 tok/s 19.3 ms/tok | Z 39.2 tok/s prompt+decode L 107.5 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 93.8 tok/s 1855.2 ms L 60.8 tok/s 2515.4 ms | Z 93.0 tok/s 10.8 ms/tok L 115.4 tok/s 8.7 ms/tok | Z 93.4 tok/s prompt+decode L 78.6 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 174 tok prompt · 160 tok output · 3 runs · Compared | Z 48.3 tok/s 3605.2 ms L 28.0 tok/s 5471.4 ms | Z 46.1 tok/s 21.7 ms/tok L 56.5 tok/s 17.7 ms/tok | Z 47.2 tok/s prompt+decode L 36.9 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 174 tok prompt · 160 tok output · 3 runs · Compared | Z 42.2 tok/s 4119.3 ms L 47.2 tok/s 3243.9 ms | Z 41.4 tok/s 24.1 ms/tok L 122.9 tok/s 8.1 ms/tok | Z 41.8 tok/s prompt+decode L 66.9 tok/s prompt+decode |
Incident Context
5 rows · 53.7 tok/s prefill · 51.8 tok/s decode · 110% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 346p · 128out Compared
86% overall
Gemma 4 31B Q4K M 346p · 128out Compared
131% overall
Qwen 3.5 9B Q4K M 322p · 128out Compared
125% overall
Qwen 3.6 27B Dense Q4K M 322p · 128out Compared
135% overall
Qwen 3.6 35B A3B UD Q4K XL 322p · 128out Compared
74% overall
Details
Gemma 4 26B-A4B MoE Q4K M 346 tok prompt · 128 tok output
Compared Prefill 104%
Decode 41%
Overall 86%
Prompt+decode. Cold latency: ZINC 25316 ms · llama.cpp 2779 ms
Gemma 4 31B Q4K M 346 tok prompt · 128 tok output
Compared Prefill 155%
Decode 71%
Overall 131%
Prompt+decode. Cold latency: ZINC 29083 ms · llama.cpp 3971 ms
Qwen 3.5 9B Q4K M 322 tok prompt · 128 tok output
Compared Prefill 144%
Decode 77%
Overall 125%
Prompt+decode. Cold latency: ZINC 14840 ms · llama.cpp 2554 ms
Qwen 3.6 27B Dense Q4K M 322 tok prompt · 128 tok output
Compared Prefill 159%
Decode 79%
Overall 135%
Prompt+decode. Cold latency: ZINC 24795 ms · llama.cpp 3748 ms
Qwen 3.6 35B A3B UD Q4K XL 322 tok prompt · 128 tok output
Compared Prefill 90%
Decode 33%
Overall 74%
Prompt+decode. Cold latency: ZINC 29929 ms · llama.cpp 2372 ms
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 346 tok prompt · 128 tok output · 3 runs · Compared | Z 48.8 tok/s 7092.8 ms L 46.8 tok/s 7372.8 ms | Z 43.9 tok/s 22.8 ms/tok L 107.4 tok/s 9.3 ms/tok | Z 47.4 tok/s prompt+decode L 55.2 tok/s prompt+decode |
| Gemma 4 31B Q4K M 346 tok prompt · 128 tok output · 3 runs · Compared | Z 40.2 tok/s 8615.6 ms L 25.9 tok/s 13330.5 ms | Z 37.5 tok/s 26.7 ms/tok L 53.0 tok/s 18.9 ms/tok | Z 39.4 tok/s prompt+decode L 30.0 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 89.2 tok/s 3610.7 ms L 62.0 tok/s 5080.1 ms | Z 88.8 tok/s 11.3 ms/tok L 115.9 tok/s 8.6 ms/tok | Z 89.1 tok/s prompt+decode L 71.5 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 322 tok prompt · 128 tok output · 3 runs · Compared | Z 48.3 tok/s 6669.0 ms L 30.3 tok/s 10384.4 ms | Z 45.5 tok/s 22.0 ms/tok L 57.8 tok/s 17.3 ms/tok | Z 47.5 tok/s prompt+decode L 35.1 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 322 tok prompt · 128 tok output · 3 runs · Compared | Z 42.0 tok/s 7672.1 ms L 46.7 tok/s 6749.4 ms | Z 43.1 tok/s 23.2 ms/tok L 130.2 tok/s 7.7 ms/tok | Z 42.3 tok/s prompt+decode L 57.1 tok/s prompt+decode |
Long Coding Draft
5 rows · 50.5 tok/s prefill · 53.0 tok/s decode · 65% overall
Models
ZINC llama.cpp
Gemma 4 26B-A4B MoE Q4K M 70p · 256out Compared
34% overall
Gemma 4 31B Q4K M 70p · 256out Compared
61% overall
Qwen 3.5 9B Q4K M 64p · 256out Compared
95% overall
Qwen 3.6 27B Dense Q4K M 64p · 256out Compared
96% overall
Qwen 3.6 35B A3B UD Q4K XL 64p · 256out Compared
41% overall
Details
Gemma 4 26B-A4B MoE Q4K M 70 tok prompt · 256 tok output
Compared Prefill 8%
Decode 42%
Overall 34%
Prompt+decode. Cold latency: ZINC 22351 ms · llama.cpp 3841 ms
Gemma 4 31B Q4K M 70 tok prompt · 256 tok output
Compared Prefill 9%
Decode 75%
Overall 61%
Prompt+decode. Cold latency: ZINC 25310 ms · llama.cpp 6695 ms
Qwen 3.5 9B Q4K M 64 tok prompt · 256 tok output
Compared Prefill 148%
Decode 81%
Overall 95%
Prompt+decode. Cold latency: ZINC 12892 ms · llama.cpp 3794 ms
Qwen 3.6 27B Dense Q4K M 64 tok prompt · 256 tok output
Compared Prefill 146%
Decode 82%
Overall 96%
Prompt+decode. Cold latency: ZINC 22046 ms · llama.cpp 6326 ms
Qwen 3.6 35B A3B UD Q4K XL 64 tok prompt · 256 tok output
Compared Prefill 78%
Decode 31%
Overall 41%
Prompt+decode. Cold latency: ZINC 27139 ms · llama.cpp 3469 ms
Raw data
| Model | Prefill | Decode | Overall |
|---|---|---|---|
| Gemma 4 26B-A4B MoE Q4K M 70 tok prompt · 256 tok output · 3 runs · Compared | Z 42.1 tok/s 1660.7 ms L 516.3 tok/s 141.4 ms | Z 45.9 tok/s 21.8 ms/tok L 109.0 tok/s 9.2 ms/tok | Z 45.0 tok/s prompt+decode L 131.2 tok/s prompt+decode |
| Gemma 4 31B Q4K M 70 tok prompt · 256 tok output · 3 runs · Compared | Z 38.0 tok/s 1840.4 ms L 435.3 tok/s 167.7 ms | Z 39.4 tok/s 25.4 ms/tok L 52.1 tok/s 19.2 ms/tok | Z 39.1 tok/s prompt+decode L 64.3 tok/s prompt+decode |
| Qwen 3.5 9B Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 91.7 tok/s 698.3 ms L 61.8 tok/s 1018.6 ms | Z 93.3 tok/s 10.7 ms/tok L 115.1 tok/s 8.7 ms/tok | Z 92.9 tok/s prompt+decode L 98.2 tok/s prompt+decode |
| Qwen 3.6 27B Dense Q4K M 64 tok prompt · 256 tok output · 3 runs · Compared | Z 42.9 tok/s 1491.7 ms L 29.3 tok/s 2149.9 ms | Z 46.7 tok/s 21.4 ms/tok L 56.9 tok/s 17.6 ms/tok | Z 45.9 tok/s prompt+decode L 47.9 tok/s prompt+decode |
| Qwen 3.6 35B A3B UD Q4K XL 64 tok prompt · 256 tok output · 3 runs · Compared | Z 37.6 tok/s 1701.9 ms L 48.0 tok/s 1312.2 ms | Z 39.5 tok/s 25.3 ms/tok L 127.3 tok/s 7.9 ms/tok | Z 39.1 tok/s prompt+decode L 95.7 tok/s prompt+decode |
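
The scenario headline figures (prefill tok/s, decode tok/s, overall percent) look like plain arithmetic means across the five model rows: the mean ZINC prefill and decode throughput, and the mean ZINC-to-llama.cpp overall ratio. The page does not document the aggregation, so treat the sketch below as an assumption that reproduces the Long Coding Draft headline on this node from the table above.

```python
from statistics import mean

# ZINC prefill and decode tok/s per model, from the table above.
zinc_prefill = [42.1, 38.0, 91.7, 42.9, 37.6]
zinc_decode = [45.9, 39.4, 93.3, 46.7, 39.5]

# (ZINC, llama.cpp) prompt+decode tok/s per model, from the same table.
overall_pairs = [(45.0, 131.2), (39.1, 64.3), (92.9, 98.2), (45.9, 47.9), (39.1, 95.7)]

summary = {
    "prefill_tok_s": round(mean(zinc_prefill), 1),
    "decode_tok_s": round(mean(zinc_decode), 1),
    "overall_pct": round(100.0 * mean(z / l for z, l in overall_pairs)),
}
print(summary)
```

Because this is an unweighted mean, a single model with a very low ratio pulls the headline down as much as a large model would, regardless of how often each model is used in practice.
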
Methodology
ZINC CLI vs llama.cpp on the same CUDA node · Four-scenario matrix with same-file baselines · 3 measured · 1 warmup
- Hardware: one NVIDIA GeForce RTX 5090 (32 GB VRAM, 1792 GB/s) benchmark node. ZINC and llama.cpp run on the same host and use the same GGUF file for each model.
- ZINC path: sync the current source tree to the CUDA node, build with zig build -Doptimize=ReleaseFast -Dbackend=cuda, then measure generation through the ZINC CLI. CUDA kernels are NVRTC-compiled at runtime for the visible GPU.
- GPU selection: the target GPU is pinned by UUID via CUDA_VISIBLE_DEVICES for both engines, because nvidia-smi indices are unreliable on the WSL2 passthrough.
- Baseline path: launch llama.cpp against the same model file, preferring one reusable llama-server per model across the full scenario matrix and falling back to llama-cli when the server path is unavailable.
- Scenarios: Quick Chat, Coding Review, Incident Context, and Long Coding Draft. The prompts are real-world chat, code-review, support-context, and coding-plan workloads instead of synthetic factual completions.
- Statistics: one warmup pass is discarded, then three measured runs are collected. Published prefill, decode, end-to-end throughput, and latency values are medians.
- Execution order: the harness records the full ZINC scenario matrix before starting the llama.cpp baseline phase, so only one inference engine owns the GPU during measurement.
- Run hygiene: published CUDA runs assume a clean node with no competing ZINC, llama.cpp, or other GPU workloads.