Last updated: 2026-06-06

RDNA4 Tuning Guide#

This guide mixes general RDNA4 platform findings, driver/toolchain notes, and external baseline measurements. Treat it as tuning context, not as the current ZINC benchmark leaderboard.

These findings come from extensive profiling of LLM inference on the AMD Radeon AI PRO R9700 (RDNA4, gfx1201). The guide covers two backend paths: the Vulkan backend through Mesa RADV (current production, -Dbackend=vulkan), and the ZINC_RT backend built on direct PM4 submission through amdgpu (M0 scaffolding in tree, -Dbackend=zinc_rt). Every firmware, kernel, and driver setting documented below applies to both backends, because both sit on top of the same amdgpu kernel driver.

Current measured peak (Vulkan backend, Qwen3.6-35B-A3B Q4_K_XL)#

| Metric | ZINC (Vulkan + RADV) | llama.cpp (Vulkan) | Ratio |
|---|---|---|---|
| Decode tok/s | 117.07 | 104.47 | 1.12x |
| Prefill tok/s | 88.08 | 181.95 | 0.48x |
| Bandwidth utilization (decode) | 31% of 576 GB/s | ~28% | |
| Per-token weight traffic | ~1.57 GiB | ~1.57 GiB | |

Decode beats llama.cpp by 12%. Prefill still trails at 0.48x: the hybrid MoE plus SSM architecture exposes more parallelism than the current dispatcher exploits. The dense-model batched-prefill work is covered in "How ZINC's RDNA4 batched prefill went from 42 to 208 tok/s", and the remaining hybrid work lives in the ZINC_RT chunked-prefill plan in docs/ZINC_RT_DESIGN.md §18.7. The 31% bandwidth utilization at peak decode is the headroom number to watch: the theoretical ceiling at 100% bandwidth is ~365 tok/s, and closing the gap no longer comes down to one missing fused kernel.
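The utilization and ceiling figures follow from simple arithmetic on the table. The sketch below is illustrative only: the function names are hypothetical, and it assumes the ~1.57 per-token traffic figure is read in decimal GB, which lands close to the quoted 31% and ~365 tok/s. Reading it as binary GiB shifts both results somewhat.

```python
PEAK_BW_GB_S = 576.0         # R9700 memory bandwidth, from the table above
TRAFFIC_GB_PER_TOKEN = 1.57  # assumption: per-token weight traffic read as decimal GB


def bandwidth_utilization(tok_per_s, gb_per_token=TRAFFIC_GB_PER_TOKEN,
                          peak_gb_s=PEAK_BW_GB_S):
    """Fraction of peak bandwidth used to stream weights at a given decode rate."""
    return tok_per_s * gb_per_token / peak_gb_s


def decode_ceiling(gb_per_token=TRAFFIC_GB_PER_TOKEN, peak_gb_s=PEAK_BW_GB_S):
    """Decode rate if every byte of peak bandwidth went to weight traffic."""
    return peak_gb_s / gb_per_token


zinc_util = bandwidth_utilization(117.07)
llama_util = bandwidth_utilization(104.47)
ceiling = decode_ceiling()
print(f"ZINC {zinc_util:.1%}, llama.cpp {llama_util:.1%}, ceiling {ceiling:.0f} tok/s")
```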

Hardware Specifications#

  • GPU: 64 CUs, wave64, 32KB L0 vector cache/CU, 8MB L2
  • Memory: 32GB GDDR6, 576 GB/s bandwidth
  • Vulkan: VK_KHR_cooperative_matrix 16x16x16
  • Architecture: gfx1201 (detected as AMD_RDNA3 by llama.cpp)

Critical: Disable GPU ECC (GECC)#

RDNA4 enables GECC by default, which silently consumes ~10% of memory bandwidth for error correction. For inference workloads where occasional bit flips are acceptable, disabling it recovers most of that bandwidth.

# Add to /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.ras_enable=0"
# Then: update-grub && reboot

Measured impact: 101 tok/s → 110 tok/s (+9%) on Qwen3.6-35B-A3B Q4_K
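After the reboot it is worth confirming that the parameter actually reached the kernel. The following is a minimal sketch, not part of ZINC: the helper names are hypothetical, and it assumes that when the option appears more than once the last occurrence wins. It only inspects the boot command line, not the hardware ECC state.

```python
def gecc_disabled(cmdline: str) -> bool:
    """Return True if the effective amdgpu.ras_enable= value on the command line is 0."""
    value = None
    for token in cmdline.split():
        if token.startswith("amdgpu.ras_enable="):
            # Assumption: a later occurrence overrides an earlier one.
            value = token.split("=", 1)[1]
    return value == "0"


def current_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line; empty string if unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print("GECC disabled via boot parameter:", gecc_disabled(current_cmdline()))
```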

RADV Driver Configuration#

# Enable cooperative matrix support
export RADV_PERFTEST=coop_matrix

Without this flag, every matmul falls back to scalar shaders, which costs a large share of performance.
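Because a missing flag fails silently, a launcher can check the environment before starting a benchmark. This is a hedged sketch with hypothetical helper names. It relies on RADV_PERFTEST being a comma-separated list of flags.

```python
import os


def perftest_has(flag: str, value: str) -> bool:
    """Check whether a comma-separated RADV_PERFTEST value contains `flag`."""
    return flag in [part.strip() for part in value.split(",") if part.strip()]


def coop_matrix_enabled(env=os.environ) -> bool:
    """True if coop_matrix is present in RADV_PERFTEST in the given environment."""
    return perftest_has("coop_matrix", env.get("RADV_PERFTEST", ""))


if __name__ == "__main__":
    if not coop_matrix_enabled():
        print("warning: RADV_PERFTEST does not include coop_matrix")
```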

Per-Token Decode Profiling#

Profiled with GGML_VK_PERF_LOGGER=1 on Qwen3.6-35B-A3B (Q4_K_XL, SSM+attention hybrid MoE).

Time Breakdown (per token)#

| Component | Time (ms) | % of Total |
|---|---|---|
| Matmul compute | 6.5 | 63% |
| Non-matmul compute | 3.6 | 35% |
| Vulkan dispatch overhead | ~0.1 | <1% |
| Total | ~10.2 | |

Matmul Bandwidth Utilization#

| Operation | BW Utilization | Time/token |
|---|---|---|
| Vocab output (m=248320, k=2048) | 93.2% | 1006 µs |
| Large attention (m=8192, k=2048) | 83.6% | 1481 µs |
| Medium attention (m=4096, k=2048) | 66.1% | 682 µs |
| MoE experts (q4_K, m=512, k=2048) | 59.6% | 1073 µs |
| Small matmul (m=32, k=2048) | 2.7% | 272 µs |

Large matmuls are near bandwidth-optimal (84-93% utilization). Small matmuls cannot saturate the memory subsystem: the m=32 case reaches only 2.7%.
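Bandwidth utilization for a memory-bound matmul is bytes moved, divided by elapsed time, divided by peak bandwidth. The sketch below works through the vocab-output row. It assumes that tensor is stored as Q8_0 (34 bytes per 32 weights), which the table does not state, although that assumption does reproduce the listed 93.2%. The function names are hypothetical.

```python
PEAK_BW_BYTES_S = 576e9  # 576 GB/s from the hardware specifications


def q8_0_bytes(rows, cols):
    """Size of a Q8_0 tensor: each 32-weight block is an fp16 scale plus 32 int8 values."""
    assert cols % 32 == 0, "Q8_0 blocks hold 32 weights"
    return rows * cols // 32 * 34


def utilization(bytes_moved, time_us, peak=PEAK_BW_BYTES_S):
    """Achieved bandwidth as a fraction of peak."""
    return bytes_moved / (time_us * 1e-6) / peak


# Assumption: the vocab output matrix (m=248320, k=2048) is Q8_0.
vocab_bytes = q8_0_bytes(248320, 2048)
vocab_util = utilization(vocab_bytes, 1006)
print(f"{vocab_bytes / 1e6:.1f} MB in 1006 us -> {vocab_util:.1%} of peak")
```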

Q5_K DMMV row packing (recent)#

The Q5_K dequantize-matmul-vector (DMMV) shader was rebuilt to pack two output rows per workgroup. On a 32-row Q5_K matmul this halves the workgroup count, doubles the per-WG accumulator footprint, and reuses the same dequantized block across both rows. The net effect on decode is a gain of a few percent on Q5_K-heavy models like Qwen3-8B, with no correctness changes. The pattern is documented in src/shaders/dmmv_q5k.comp. The same row-packing idea is the next candidate for Q3_K and Q6_K, both of which currently dispatch one row per WG.
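The dispatch geometry of row packing can be modelled on the CPU. This is a simplified sketch on plain floats, not the shader in src/shaders/dmmv_q5k.comp: it skips quantization entirely, and what it shares across the packed rows is the input-vector block, which is only one reading of the reuse described above. All names are hypothetical.

```python
import math


def workgroups(rows, rows_per_wg):
    """Number of workgroups needed when each one produces rows_per_wg output rows."""
    return math.ceil(rows / rows_per_wg)


def packed_matvec(weights, x, rows_per_wg=2, block=32):
    """Matrix-vector product where each simulated workgroup handles several rows."""
    rows = len(weights)
    out = [0.0] * rows
    for wg in range(workgroups(rows, rows_per_wg)):
        first = wg * rows_per_wg
        packed = range(first, min(first + rows_per_wg, rows))
        for start in range(0, len(x), block):
            x_block = x[start:start + block]  # fetched once, used by every packed row
            for r in packed:
                w_block = weights[r][start:start + block]
                out[r] += sum(w * v for w, v in zip(w_block, x_block))
    return out


print(workgroups(32, 1), "workgroups unpacked,", workgroups(32, 2), "with two rows per WG")
```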

Non-Matmul Ops (per token)#

| Op | Dispatches | Total Time |
|---|---|---|
| RMS_NORM_MUL (fused) | 131 | 593 µs |
| MUL (element-wise) | 110 | 365 µs |
| GET_ROWS | 122 | 338 µs |
| SIGMOID | 80 | 267 µs |
| MULTI_ADD (fused) | 80 | 256 µs |
| GLU (fused) | 80 | 250 µs |
| SILU | 60 | 143 µs |
| L2_NORM | 60 | 125 µs |
| SSM_CONV | 30 | 150 µs |
| GATED_DELTA_NET | 30 | 128 µs |

Compute Graph Stats#

  • Total graph nodes: 3728
  • Dispatchable ops: 2356
  • After existing fusions: ~1500 dispatches
  • Dispatch overhead: ~0.1 ms of wall time per token (negligible; measured at 0.057 µs per dispatch wall time and 0.016 µs per dispatch GPU time)

Vulkan Dispatch Overhead (Micro-benchmark)#

Raw Vulkan dispatch cost measured on RDNA4:

| Test | Result |
|---|---|
| Single dispatch (record+submit+wait) | 33 µs |
| 1500 empty dispatches (GPU time) | 24 µs = 0.016 µs/dispatch |
| 1500 dispatches (wall time) | 85 µs = 0.057 µs/dispatch |
| Pre-recorded command buffer replay | 54 µs for 1500 dispatches |

Key insight: Dispatch overhead is negligible. The 2-5µs per "dispatch" seen in profiling is real kernel execution time on small memory-bound tensors.
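The per-dispatch figures and their share of a decode token can be re-derived from the measurements above. This is a small worked sketch with hypothetical function names. The 10.2 ms token time comes from the per-token time breakdown earlier in this guide.

```python
def per_dispatch_us(total_us, dispatches):
    """Average cost of one dispatch in microseconds."""
    return total_us / dispatches


def share_of_token(overhead_us, token_ms):
    """Fraction of one decode token spent on dispatch overhead."""
    return overhead_us / (token_ms * 1000.0)


gpu_cost = per_dispatch_us(24, 1500)    # GPU time for 1500 empty dispatches
wall_cost = per_dispatch_us(85, 1500)   # wall time for 1500 dispatches
overhead_share = share_of_token(85, 10.2)

print(f"GPU {gpu_cost:.3f} us, wall {wall_cost:.3f} us per dispatch, "
      f"{overhead_share:.2%} of a 10.2 ms token")
```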

Concurrent Request Scaling#

| Concurrent Slots | Per-slot tok/s | Aggregate tok/s |
|---|---|---|
| 1 | 110 | 110 |
| 4 | 108 | 432 |

Scaling is near-linear (about 98% efficiency at four slots): the GPU is not saturated by a single decode request. Aggregate scaling beyond four slots requires paged KV, because the current Vulkan backend's flat KV cache runs out of memory at sixteen concurrent slots on Qwen3.6-35B-A3B. The paged KV v2 layout in docs/ZINC_RT_DESIGN.md §19 targets ~2,100 aggregate tok/s at sixteen slots on the same R9700.
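Scaling efficiency here means aggregate throughput divided by what perfectly linear scaling from one slot would give. The sketch below applies that to the table and to the sixteen-slot target. The function name is hypothetical and the numbers are the ones quoted above.

```python
def scaling_efficiency(single_slot_tok_s, slots, aggregate_tok_s):
    """Aggregate throughput relative to perfectly linear scaling from one slot."""
    return aggregate_tok_s / (single_slot_tok_s * slots)


four_slot_eff = scaling_efficiency(110, 4, 432)

# Per-slot rate implied by the ~2,100 tok/s aggregate target at sixteen slots.
target_per_slot = 2100 / 16

print(f"4 slots: {four_slot_eff:.1%} of linear; "
      f"16-slot target implies {target_per_slot:.2f} tok/s per slot")
```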

What Doesn't Help#

| Optimization | Result | Notes |
|---|---|---|
| Wave32 for DMMV | No improvement | Driver's default wave64 is optimal |
| DMMV_WG_SIZE_LARGE (256 threads) | No improvement | Too many idle threads for small K |
| rm_kq > 2 (rows per workgroup) | -75% regression | Wave64 can't handle 4+ rows |
| GPU clock forcing (profile_peak) | -23% regression | Power throttling on memory-bound work |
| f16 KV cache (vs q8_0) | No change | KV ops are negligible |
| Flash attention on/off | No change | Tiny fraction of decode time |
| CPU thread count (1-16) | No change | Workload is 100% GPU-bound |
| THP=always (vs madvise) | Marginal | Model weights are in GPU VRAM |

SPIR-V Toolchain Compatibility#

Critical: Newer versions of shaderc/spirv-tools produce SPIR-V that RADV (ACO compiler) handles poorly. In the measurements below, decode drops from 110 tok/s to 19-25 tok/s, roughly 4-6x slower.

| glslc Version | RADV Compatibility | Performance |
|---|---|---|
| shaderc 2023.8 (Ubuntu 24.04) | Excellent | 110 tok/s |
| shaderc v2026.2-dev | Broken | 19-25 tok/s |

The newer glslc adds NonWritable/NonReadable decorations and different control flow that RADV's ACO optimizer can't handle efficiently.

Recommendation: Use the system-provided glslc from Ubuntu packages, not a custom-built version.

SMU Firmware Compatibility#

Kernel 6.17 ships SMU driver interface (IF) version 0x2e, while the RDNA4 firmware expects version 0x32. This mismatch caps the maximum GPU clock at 2200 MHz instead of 2350 MHz.

Kernel 6.14 or earlier may have a compatible SMU driver version.

Mesa Version Sensitivity#

ACO, RADV's shader compiler in Mesa, is the path from SPIR-V to RDNA4 machine code, so decode performance depends on the Mesa version. Two versions are documented:

| Mesa version | Status on R9700 | Notes |
|---|---|---|
| 25.0.7 | Recommended | Current bench-node version |
| 25.2.8 | ~14% RADV regression | Avoid until upstream ACO regressions are reverted |

The regression is strongest on the Q4_K DMMV path. We pin Mesa via the system package manager and do not auto-upgrade the bench node.
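A benchmark harness can refuse to record numbers on a Mesa build with a known regression. The following is a minimal sketch with hypothetical names. It encodes only the two versions listed above and treats everything else as untested rather than guessing. How the version string is obtained is left to the caller.

```python
# Status of the Mesa versions documented in this guide; anything else is unknown.
KNOWN_MESA = {
    (25, 0, 7): "recommended",
    (25, 2, 8): "regressed",
}


def parse_version(text):
    """Turn a dotted version such as '25.0.7' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def mesa_status(version_text):
    """Look up a Mesa version; versions not documented here report 'untested'."""
    return KNOWN_MESA.get(parse_version(version_text), "untested")


print(mesa_status("25.0.7"), mesa_status("25.2.8"), mesa_status("25.1.0"))
```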

Load-time Q4_0 Re-quantization#

GGUF models commonly ship Q8_0 for tensors the publisher considered precision-sensitive: SSM in-projection (attn_qkv), SSM out-projection, attention gate, and sometimes the router. On Qwen3.6-35B-A3B the SSM attn_qkv is [8192, 2048] per layer in Q8_0, ~17 MiB, streamed every decode token. Across 30 SSM layers that is ~510 MiB of Q8_0 weights traversed per token, well past L3.

On the T-CPU autopilot path, adding attn_qkv to the Q8_0 → Q4_0 re-quantize-at-load list moved decode from 32.8 to 37.6 tok/s and prefill from 28.5 to 32.7 tok/s on the 9800X3D bench, with output staying coherent. The model is already Q4_K everywhere else; per-weight noise on the SSM in-proj is averaged out by the L2-normalized delta-net recurrence one layer downstream. The same lever applies on RDNA4 wherever a tensor is loaded Q8_0 by default and the kernel has a Q4_0 variant. See forward_zinc_rt.zig's q4_candidates list for the current set.
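The traffic figures above can be checked from the GGML block layouts: a Q8_0 block stores 32 weights in 34 bytes (fp16 scale plus 32 int8 values) and a Q4_0 block stores 32 weights in 18 bytes (fp16 scale plus 16 bytes of packed 4-bit values). The sketch below is illustrative and uses hypothetical names. The Q4_0 totals are derived here and are not measurements from the guide.

```python
BLOCK_WEIGHTS = 32
BLOCK_BYTES = {"Q8_0": 34, "Q4_0": 18}  # bytes per 32-weight block
MIB = 1024 ** 2


def tensor_bytes(rows, cols, quant):
    """Storage for a rows x cols tensor in the given block-quantized format."""
    weights = rows * cols
    assert weights % BLOCK_WEIGHTS == 0, "tensor must be a whole number of blocks"
    return weights // BLOCK_WEIGHTS * BLOCK_BYTES[quant]


SSM_LAYERS = 30
q8_bytes = tensor_bytes(8192, 2048, "Q8_0")
q4_bytes = tensor_bytes(8192, 2048, "Q4_0")
q8_per_token_mib = SSM_LAYERS * q8_bytes / MIB
q4_per_token_mib = SSM_LAYERS * q4_bytes / MIB

print(f"attn_qkv per layer: {q8_bytes / MIB:.1f} MiB Q8_0, {q4_bytes / MIB:.1f} MiB Q4_0")
print(f"per token over {SSM_LAYERS} layers: {q8_per_token_mib:.0f} MiB vs {q4_per_token_mib:.0f} MiB")
```

Under these layouts, re-quantizing attn_qkv cuts that tensor's per-token traffic by a little under half.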

ZINC_RT, this guide, and what changes#

The Vulkan-specific advice above is about driver, firmware, and toolchain. All of it still applies under ZINC_RT because ZINC_RT uses the same amdgpu kernel driver. Disable GECC. Stay on Mesa 25.0.7 (when running anything that links libvulkan, including dev tooling and CI shader compilation). Stay on kernel 6.14 if you can. Pin shaderc 2023.8.

What ZINC_RT changes is the userspace layer above the kernel driver. The Vulkan tax this guide measures (33 µs per vkQueueSubmit plus fence, 80 µs command-buffer re-record on a 1500-node graph, the 5x glslc regression risk) does not apply to ZINC_RT's PM4-direct path. ZINC_RT submits via a ring-buffer write plus a doorbell MMIO store, for a total CPU-to-CP latency of around 150-500 ns. The trade-off is that ZINC_RT is at M0 today (T1 KFD smoke dispatch verified on R9700; full IR lowering is M2). The Vulkan path remains the production target until ZINC_RT clears its M1 gate at 140 tok/s decode.

For the long-form story of why ZINC_RT exists at all, the head-to-head architectural comparison vs ROCm and Vulkan, the multitenant batching architecture, and the falsification criteria, see:

  • docs/ZINC_RT_DESIGN.md — the canonical ZINC_RT design
  • The blog post "ROCm vs Vulkan vs ZINC_RT: inside the decision to write our own GPU runtime for local LLM inference on AMD RDNA4" at /blog/inside-the-decision-to-write-our-own-gpu-runtime-for-local-llm-inference

The two backends will both ship indefinitely. The cross-backend logit-equality test in CI is what keeps them honest.