Topic Guide

AMD RDNA4 LLM Inference

RDNA4 is the default hardware story for ZINC: useful consumer and workstation AMD GPUs, strong memory bandwidth, Vulkan support, and no dependence on ROCm for local LLM inference.

Practical Answer

If you want local LLM inference on AMD RDNA4, use Vulkan-first software and treat ROCm as optional rather than required. ZINC targets Radeon AI PRO R9700 and RX 9070-class hardware directly with Vulkan compute. The important performance split is decode versus prefill: decode is mostly memory and scheduling; prefill needs batched kernels, command-buffer discipline, and model-aware routing.

When This Page Helps

Use this page for readers choosing AMD hardware or validating whether Vulkan local inference is real on RDNA4. The page should answer the hardware, driver, benchmark, and baseline questions before sending them to deep posts.

Status Snapshot

Best current answer
RDNA4 local inference is real through Vulkan. The practical question is not "can AMD run LLMs" but which memory class and driver setup make the run useful.
Reader problem
Readers need hardware fit, driver caveats, reproducible benchmark method, and a clear llama.cpp comparison before they trust the result.
Main bottleneck
Decode is usually weight and scheduling bound. Prefill is a different workload and often needs batched kernels plus fewer host-side submits.

Action Plan

  1. Pick the memory budget first The R9700-class 32 GB card changes which Qwen and Gemma models are realistic. RX 9070-class cards share RDNA4 behavior but fit fewer large targets.
  2. Benchmark against llama.cpp Use the same machine, model file, quantization, prompt, output cap, warmup policy, and backend residency. Cross-machine screenshots are not evidence.
  3. Tune the platform before the shader Driver version, GECC, cooperative matrix flags, ASPM, and stale GPU processes can move results enough to hide real kernel work.
  4. Report prefill and decode separately RDNA4 decode can look strong while prefill remains the user-visible latency limiter. A good post shows both phases.

Operator Checklist

  • Record the exact GPU Name the card, VRAM, driver, Mesa version, and Vulkan device. RDNA4 is not a single performance number.
  • Clean the node Stop stale zinc, llama.cpp, and benchmark processes before publishing a comparison.
  • Lock the run shape Use the same model file, quantization, prompt, max tokens, context, warmup, and endpoint across engines.
  • Separate CLI from server CLI decode, raw HTTP completion, and chat completion measure different parts of the stack.

What Matters

  • RDNA4 can run useful local LLMs without ROCm when the engine uses Vulkan compute directly.
  • The R9700 is the strongest ZINC tuning target because 32 GB VRAM changes which Qwen and Gemma models fit.
  • Prefill and decode are different workloads; a fast decode loop does not imply fast time-to-first-token.
  • llama.cpp is the right baseline, but ZINC is deliberately optimizing the AMD path as a first-class target.

What To Measure

Prompt tok/s
Shows whether prefill kernels and command submission are healthy.
Decode tok/s
The headline number, but only meaningful with model, quantization, and context attached.
Latency distribution
For server mode, include TTFT plus p50 and p95 request latency under the stated concurrency.
Platform state
Record driver version, GECC, cooperative matrix flag, ASPM policy, and thermal throttling clues.

Common Traps

  • Making ROCm the story The useful ZINC angle is Vulkan-first AMD inference. Mention ROCm only to clarify that this path does not require it.
  • Comparing dirty runs Background GPU users, cold builds, one-token completions, and changed prompts will swamp the signal.
  • Treating RDNA4 as one number R9700, RX 9070 XT, and future workstation cards share architecture but not memory capacity, clocks, thermals, or deployment fit.

Useful Next Posts

  • RX 9070 XT local LLM guide A 16 GB practical guide for what fits, what needs offload, and what ZINC runs well today.
  • R9700 versus RX 9070 XT A memory-class comparison using the same Qwen and Gemma prompts.
  • RDNA4 benchmark hygiene checklist A reproducibility post covering drivers, stale processes, endpoint choice, and prompt shape.

Run It

Common Questions

Can AMD RDNA4 run local LLMs without ROCm?

Yes. ZINC uses Vulkan compute on AMD RDNA4, so the local inference path does not depend on ROCm support for consumer cards.

Which RDNA4 GPU is the best target for ZINC?

The Radeon AI PRO R9700 is the primary tuning target because it combines RDNA4 behavior with a 32 GB memory budget. RX 9070-class cards share the architecture but fit fewer large models.

Why compare to llama.cpp?

llama.cpp is the strongest widely used local baseline with Vulkan and Metal support. ZINC compares against it on the same machine to avoid misleading cross-hardware numbers.