Last updated: 2026-06-12

Inference Runtime

Kfd


AMDGPU KFD (`/dev/kfd`) bring-up for the T1 PM4-direct tier.

The design's T1 tier submits PM4 packets straight to the AMD command processor. On Linux that means the KFD compute path (`/dev/kfd` + `AMDKFD_IOC_*`), the same userspace ABI ROCm/HSA and tinygrad ride on. It works on every `amdgpu` kernel that ships KFD and does not depend on the experimental user-mode-queue (UMQ / `DRM_IOCTL_AMDGPU_USERQ`) ABI, which on the R9700 (gfx1201, kernel 6.17, `uni_mes` firmware) the kernel rejects with "Usermode queue is not supported for this IP" — see `umq.zig` and §14 of the design doc.

This module brings up the T1 PM4-direct path on the kernel ABI that works:

* open `/dev/kfd` + the render node, `AMDKFD_IOC_GET_VERSION`,
* match the render minor against `/sys/.../kfd/topology/nodes/*`,
* `AMDKFD_IOC_ACQUIRE_VM`, `AMDKFD_IOC_GET_PROCESS_APERTURES_NEW`,
* `bringUpPath`: reserve a VA window inside the GPUVM aperture, `ALLOC_MEMORY_OF_GPU` (GTT, writable) + `mmap` + `MAP_MEMORY_TO_GPU` + a CPU round-trip, then `UNMAP_MEMORY_FROM_GPU` + `FREE_MEMORY_OF_GPU`,
* `createComputeQueueSmokePath`: allocate the ring / wptr / rptr / EOP / CWSR buffer objects the way `kfd_queue_acquire_buffers` validates them, `AMDKFD_IOC_CREATE_QUEUE` (PM4 compute), stage a couple of PM4 NOP packets into the ring, then `AMDKFD_IOC_DESTROY_QUEUE` and tear down.

Ringing the doorbell and retiring a PM4 fence is the next bring-up step.

40 exports 2 methods src/zinc_rt/ring/kfd.zig


constant

default_render_node

pub const default_render_node = "/dev/dri/renderD128"

Default DRM render node used when the caller does not supply one.

The renderD128 minor is the standard single-GPU choice on AMD Linux systems.

src/zinc_rt/ring/kfd.zig:33

constant

kfd_device_node

pub const kfd_device_node = "/dev/kfd"

Path to the KFD compute device used for every AMDKFD_IOC_* ioctl.

src/zinc_rt/ring/kfd.zig:35

constant

topology_nodes_dir

pub const topology_nodes_dir = "/sys/devices/virtual/kfd/kfd/topology/nodes"

Sysfs root that enumerates KFD topology nodes (one subdirectory per node, each carrying `gpu_id` and a `properties` file that drives queue sizing).

src/zinc_rt/ring/kfd.zig:38

constant

min_kfd_major

pub const min_kfd_major: u32 = 1

Minimum KFD ABI major version required for the PM4 compute path; the bring-up fails fast with `kfd_version_too_old` below this number.

src/zinc_rt/ring/kfd.zig:41

constant

ALLOC_MEM_FLAGS_VRAM

pub const ALLOC_MEM_FLAGS_VRAM: u32 = 1 << 0

Allocate the BO out of device VRAM (local frame buffer).

src/zinc_rt/ring/kfd.zig:58

constant

ALLOC_MEM_FLAGS_GTT

pub const ALLOC_MEM_FLAGS_GTT: u32 = 1 << 1

Allocate the BO out of system GTT memory (the path used by every smoke BO).

src/zinc_rt/ring/kfd.zig:60

constant

ALLOC_MEM_FLAGS_USERPTR

pub const ALLOC_MEM_FLAGS_USERPTR: u32 = 1 << 2

Pin a userptr range as the BO backing; not used by the bring-up smoke path.

src/zinc_rt/ring/kfd.zig:62

constant

ALLOC_MEM_FLAGS_DOORBELL

pub const ALLOC_MEM_FLAGS_DOORBELL: u32 = 1 << 3

Allocate a doorbell page so a userspace queue can ring its wptr doorbell.

src/zinc_rt/ring/kfd.zig:64

constant

ALLOC_MEM_FLAGS_COHERENT

pub const ALLOC_MEM_FLAGS_COHERENT: u32 = 1 << 26

Request a CPU-coherent BO (snooped on x86); not used by the bring-up smoke path.

src/zinc_rt/ring/kfd.zig:66

constant

ALLOC_MEM_FLAGS_PUBLIC

pub const ALLOC_MEM_FLAGS_PUBLIC: u32 = 1 << 29

Mark the BO as PCIe-visible / exportable to peer devices.

src/zinc_rt/ring/kfd.zig:68

constant

ALLOC_MEM_FLAGS_EXECUTABLE

pub const ALLOC_MEM_FLAGS_EXECUTABLE: u32 = 1 << 30

Mark the BO as containing GPU-executable code (shader binaries).

src/zinc_rt/ring/kfd.zig:70

constant

ALLOC_MEM_FLAGS_WRITABLE

pub const ALLOC_MEM_FLAGS_WRITABLE: u32 = 1 << 31

Map the BO with write permission on the GPU side (default for smoke BOs).

src/zinc_rt/ring/kfd.zig:72
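
As a minimal sketch of how these bits combine, the snippet below ORs the GTT and WRITABLE flags, the combination the module overview describes for the smoke BOs ("GTT, writable"). The constant values are copied from this page; `smokeBoFlags` and `hasFlag` are hypothetical helper names used only for illustration, not exports of `kfd.zig`.

```zig
const std = @import("std");

// Values copied from the constants documented above.
const ALLOC_MEM_FLAGS_VRAM: u32 = 1 << 0;
const ALLOC_MEM_FLAGS_GTT: u32 = 1 << 1;
const ALLOC_MEM_FLAGS_WRITABLE: u32 = 1 << 31;

// Hypothetical helper: flag word for a GTT-backed, GPU-writable BO.
fn smokeBoFlags() u32 {
    return ALLOC_MEM_FLAGS_GTT | ALLOC_MEM_FLAGS_WRITABLE;
}

// Hypothetical helper: true when every bit of `flag` is set in `flags`.
fn hasFlag(flags: u32, flag: u32) bool {
    return (flags & flag) == flag;
}

pub fn main() void {
    const flags = smokeBoFlags();
    std.debug.print("flags = 0x{x}, vram = {}\n", .{ flags, hasFlag(flags, ALLOC_MEM_FLAGS_VRAM) });
}
```

Keeping the memory domain (VRAM, GTT, USERPTR, DOORBELL) in the low bits and the attributes in the high bits means a single `u32` carries both, so a check like `hasFlag` is enough to tell which domain a BO was requested from.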

constant

QUEUE_TYPE_COMPUTE

pub const QUEUE_TYPE_COMPUTE: u32 = 0x0

PM4 compute queue type passed to `AMDKFD_IOC_CREATE_QUEUE` for MEC pipes.

src/zinc_rt/ring/kfd.zig:76

constant

QUEUE_TYPE_SDMA

pub const QUEUE_TYPE_SDMA: u32 = 0x1

SDMA queue type; copy engine queue, not used by the PM4 compute bring-up.

src/zinc_rt/ring/kfd.zig:78

constant

QUEUE_TYPE_COMPUTE_AQL

pub const QUEUE_TYPE_COMPUTE_AQL: u32 = 0x2

AQL compute queue type (HSA packet processor format) used by ROCm/HSA.

src/zinc_rt/ring/kfd.zig:80

struct

GetVersionArgs

pub const GetVersionArgs = extern struct

`AMDKFD_IOC_GET_VERSION` ioctl args — the KFD ABI version reported by the running kernel.

Matches `struct kfd_ioctl_get_version_args` from uapi.

src/zinc_rt/ring/kfd.zig:84
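
For orientation, the request number behind `AMDKFD_IOC_GET_VERSION` can be derived from this struct's size. The sketch below assumes the generic Linux `_IOC` bit layout (as used on x86-64 and arm64) and the uapi definitions of ioctl base `'K'` and command number `0x01`. The field names follow `struct kfd_ioctl_get_version_args`; `iocRead` is a hypothetical helper, not an export of this module.

```zig
const std = @import("std");

// Mirrors `struct kfd_ioctl_get_version_args` from the Linux uapi header.
const GetVersionArgs = extern struct {
    major_version: u32,
    minor_version: u32,
};

// Direction value for a read-only ioctl in the generic _IOC layout.
const IOC_READ: u32 = 2;

// Hypothetical helper equivalent to the C macro _IOR(type, nr, T):
// dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits).
fn iocRead(comptime T: type, ioc_type: u8, nr: u8) u32 {
    const size: u32 = @sizeOf(T);
    return (IOC_READ << 30) | (size << 16) | (@as(u32, ioc_type) << 8) | @as(u32, nr);
}

pub fn main() void {
    const request = iocRead(GetVersionArgs, 'K', 0x01);
    std.debug.print("AMDKFD_IOC_GET_VERSION = 0x{x}\n", .{request});
}
```

Because the struct size is baked into the request number, an `extern struct` whose layout drifts from the uapi definition produces a different number and the kernel rejects the call, which is one reason to mirror the C layout exactly.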

struct

AcquireVmArgs

pub const AcquireVmArgs = extern struct

`AMDKFD_IOC_ACQUIRE_VM` ioctl args — binds the calling process's GPUVM to the DRM render node fd so subsequent allocations land in the right address space.

src/zinc_rt/ring/kfd.zig:92

struct

ProcessDeviceApertures

pub const ProcessDeviceApertures = extern struct

One device aperture entry returned by `GET_PROCESS_APERTURES_NEW`: the LDS, scratch, and GPUVM windows that this process is allowed to use on the indicated `gpu_id`.

The bring-up only consumes `gpuvm_base` and `gpuvm_limit`.

src/zinc_rt/ring/kfd.zig:100
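
To illustrate how the GPUVM base and limit bound a VA reservation, here is a minimal sketch. The struct mirrors `struct kfd_process_device_apertures` from the uapi header. `pickVaWindow` is a hypothetical helper: the placement policy (first aligned address at or above the base) and the treatment of the limit as an inclusive last address are assumptions, not necessarily what `bringUpPath` does.

```zig
const std = @import("std");

// Mirrors `struct kfd_process_device_apertures` from the Linux uapi header.
const ProcessDeviceApertures = extern struct {
    lds_base: u64,
    lds_limit: u64,
    scratch_base: u64,
    scratch_limit: u64,
    gpuvm_base: u64,
    gpuvm_limit: u64,
    gpu_id: u32,
    pad: u32,
};

// Hypothetical helper: lowest `alignment`-aligned VA at or above gpuvm_base
// such that [va, va + size - 1] stays within the aperture, or null if none fits.
fn pickVaWindow(ap: ProcessDeviceApertures, size: u64, alignment: u64) ?u64 {
    if (size == 0 or alignment == 0) return null;
    const rem = ap.gpuvm_base % alignment;
    const pad: u64 = if (rem == 0) 0 else alignment - rem;
    // Checked adds so a window near the top of the address space cannot wrap.
    const start = std.math.add(u64, ap.gpuvm_base, pad) catch return null;
    const last = std.math.add(u64, start, size - 1) catch return null;
    if (last > ap.gpuvm_limit) return null;
    return start;
}

pub fn main() void {
    // Illustrative aperture values, not read from a real device.
    const ap = ProcessDeviceApertures{
        .lds_base = 0,
        .lds_limit = 0,
        .scratch_base = 0,
        .scratch_limit = 0,
        .gpuvm_base = 0x1000,
        .gpuvm_limit = 0xFFFF_FFFF,
        .gpu_id = 1,
        .pad = 0,
    };
    if (pickVaWindow(ap, 64 * 1024, 64 * 1024)) |va| {
        std.debug.print("64 KiB window at 0x{x}\n", .{va});
    } else {
        std.debug.print("aperture too small\n", .{});
    }
}
```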

struct

GetProcessAperturesNewArgs

pub const GetProcessAperturesNewArgs = extern struct

`AMDKFD_IOC_GET_PROCESS_APERTURES_NEW` ioctl args — caller supplies a pointer to an array of `ProcessDeviceApertures` plus its length; the kernel fills the array and updates `num_of_nodes` with the actual count.

src/zinc_rt/ring/kfd.zig:114

struct

AllocMemoryOfGpuArgs

pub const AllocMemoryOfGpuArgs = extern struct

`AMDKFD_IOC_ALLOC_MEMORY_OF_GPU` ioctl args.

The caller chooses the GPU VA (`va_addr`) and the kernel returns a BO `handle` plus an `mmap_offset` for the DRM render node so the BO can be CPU-mapped.

src/zinc_rt/ring/kfd.zig:123

struct

FreeMemoryOfGpuArgs

pub const FreeMemoryOfGpuArgs = extern struct

`AMDKFD_IOC_FREE_MEMORY_OF_GPU` ioctl args — release the BO referenced by `handle` (must already be unmapped from every GPU it was mapped to).

src/zinc_rt/ring/kfd.zig:134

struct

MapMemoryToGpuArgs

pub const MapMemoryToGpuArgs = extern struct

`AMDKFD_IOC_MAP_MEMORY_TO_GPU` / `UNMAP_MEMORY_FROM_GPU` ioctl args.

Same layout for both directions; the kernel updates `n_success` with the number of devices the BO was successfully (un)mapped on.

src/zinc_rt/ring/kfd.zig:141

struct

CreateQueueArgs

pub const CreateQueueArgs = extern struct

`AMDKFD_IOC_CREATE_QUEUE` ioctl args — describes the PM4 compute queue the kernel should map onto the MEC.

Every BO address (ring/wptr/rptr/EOP/CWSR) must already be allocated and mapped to the same GPU before the ioctl.

src/zinc_rt/ring/kfd.zig:151

struct

DestroyQueueArgs

pub const DestroyQueueArgs = extern struct

`AMDKFD_IOC_DESTROY_QUEUE` ioctl args — release the queue identified by `queue_id` (the value the kernel returned from `CREATE_QUEUE`).

src/zinc_rt/ring/kfd.zig:171

struct

TopologyNode

pub const TopologyNode = struct

One GPU topology node parsed from `topology_nodes_dir`.

Carries the values the queue bring-up needs to validate `CREATE_QUEUE`: `gpu_id`, the GFX IP target version, CU/SIMD counts that drive CWSR sizing, and the canonical `cwsr_size` / `ctl_stack_size` advertised by the kernel.

src/zinc_rt/ring/kfd.zig:190

enum

BringUpStatus

pub const BringUpStatus = enum

Outcome categories for `bringUpPath`.

Every non-`ok` value identifies the exact bring-up step that failed (open, version check, aperture lookup, VA reservation, alloc, map, mmap, CPU round-trip, unmap, free).

src/zinc_rt/ring/kfd.zig:205

struct

BringUpResult

pub const BringUpResult = struct

Full bring-up report produced by `bringUpPath`.

Captures the status plus every observable value the smoke run collected (KFD ABI version, topology info, the GPUVM window, the reserved VA, and the errno of the last failing ioctl when applicable) so callers can render diagnostics without rerunning.

src/zinc_rt/ring/kfd.zig:232

Methods

1

function

lastErrno

pub fn lastErrno() ?linux.E

Returns the errno of the most recent failing ioctl, or `null` if the last ioctl succeeded.

Cleared at the start of each `ioctlChecked` call.

src/zinc_rt/ring/kfd.zig:257

function

renderMinorOf

pub fn renderMinorOf(render_node: []const u8) ?u32

Extract the DRM render minor number from a `/dev/dri/renderD<N>` path.

Parameters

render_node
Absolute path to the DRM render node (e.g. `/dev/dri/renderD128`).

Returns

The parsed minor number (e.g. 128), or `null` if the path does not contain a recognizable `renderD<digits>` suffix.

src/zinc_rt/ring/kfd.zig:277
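
The documented behavior can be sketched as follows. This is an illustrative reimplementation under the assumption that only plain decimal digits after the last `renderD` count as a valid suffix; `renderMinorSketch` is a hypothetical name and the real `renderMinorOf` may differ in its edge cases.

```zig
const std = @import("std");

// Hypothetical stand-in for `renderMinorOf`: parse the digits after "renderD".
fn renderMinorSketch(render_node: []const u8) ?u32 {
    const marker = "renderD";
    const pos = std.mem.lastIndexOf(u8, render_node, marker) orelse return null;
    const digits = render_node[pos + marker.len ..];
    if (digits.len == 0) return null;
    // Reject signs, underscores, and trailing junk that parseInt might tolerate.
    for (digits) |c| {
        if (!std.ascii.isDigit(c)) return null;
    }
    return std.fmt.parseInt(u32, digits, 10) catch null;
}

pub fn main() void {
    if (renderMinorSketch("/dev/dri/renderD128")) |minor| {
        std.debug.print("render minor = {d}\n", .{minor});
    }
}
```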

function

findTopologyNode

pub fn findTopologyNode(render_minor: u32) ?TopologyNode

Scan `topology_nodes_dir` for the GPU topology node whose `drm_render_minor` matches the given render minor, parsing `gpu_id`, `gfx_target_version`, `simd_count`, `cwsr_size`, and related properties from each node's `properties` file.

CPU-only nodes (gpu_id == 0) are skipped.

Parameters

render_minor
DRM render minor to match (e.g. 128 for `renderD128`).

Returns

The matching `TopologyNode`, or `null` if no GPU node matches or if the sysfs directory cannot be opened.

Notes

Always returns `null` on non-Linux targets.

src/zinc_rt/ring/kfd.zig:316
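
A rough sketch of the per-node parsing step, assuming the `properties` file is a list of `name value` lines with decimal values. `NodeProps` and `parseProperties` are hypothetical names, and the sample text is illustrative: `cwsr_size` and `gfx_target_version` reuse the R9700 figures quoted elsewhere on this page, while `simd_count 128` is an assumed value.

```zig
const std = @import("std");

// Hypothetical subset of the fields a topology node carries.
const NodeProps = struct {
    drm_render_minor: ?u32 = null,
    simd_count: ?u32 = null,
    gfx_target_version: ?u32 = null,
    cwsr_size: ?u32 = null,
};

// Parse "name value" lines; unknown keys and malformed values are ignored.
fn parseProperties(text: []const u8) NodeProps {
    var props = NodeProps{};
    var lines = std.mem.tokenizeScalar(u8, text, '\n');
    while (lines.next()) |line| {
        var fields = std.mem.tokenizeScalar(u8, line, ' ');
        const key = fields.next() orelse continue;
        const raw = fields.next() orelse continue;
        const value = std.fmt.parseInt(u32, raw, 10) catch continue;
        if (std.mem.eql(u8, key, "drm_render_minor")) {
            props.drm_render_minor = value;
        } else if (std.mem.eql(u8, key, "simd_count")) {
            props.simd_count = value;
        } else if (std.mem.eql(u8, key, "gfx_target_version")) {
            props.gfx_target_version = value;
        } else if (std.mem.eql(u8, key, "cwsr_size")) {
            props.cwsr_size = value;
        }
    }
    return props;
}

// Illustrative file contents, not captured from a real sysfs node.
const sample =
    \\simd_count 128
    \\gfx_target_version 120001
    \\drm_render_minor 128
    \\cwsr_size 30699520
;

pub fn main() void {
    const props = parseProperties(sample);
    if (props.drm_render_minor) |minor| {
        std.debug.print("node matches renderD{d}\n", .{minor});
    }
}
```

A caller would compare `drm_render_minor` against the requested minor and keep scanning the remaining node directories when it does not match.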

function

reachable

pub fn reachable() bool

Cheaply test whether the KFD PM4 path appears usable on this machine without issuing any ioctls: opens `/dev/kfd`, then verifies that a matching topology node with a non-zero `gpu_id` exists in sysfs for the default render minor.

Returns

`true` when `/dev/kfd` is accessible and a valid topology node is found for the default render node; `false` otherwise or on non-Linux targets.

src/zinc_rt/ring/kfd.zig:354

function

bringUpDefault

pub fn bringUpDefault() BringUpResult

Run `bringUpPath` against `default_render_node` ("/dev/dri/renderD128").

Returns

Status + diagnostics for the GPUVM round-trip.

src/zinc_rt/ring/kfd.zig:365

function

bringUpPath

pub fn bringUpPath(render_node: []const u8) BringUpResult

End-to-end KFD bring-up: open `/dev/kfd` and the supplied render node, check the ABI version, acquire the GPUVM, look up the matching topology aperture, reserve a 64 KiB VA window inside it, allocate + mmap + MAP_MEMORY_TO_GPU a 4 KiB GTT scratch BO, write/read a magic value to verify the round-trip, then unmap and free everything.

Every failure point is captured in the returned `BringUpResult` (status + last errno) so a non-Linux caller or a partial-permissions environment can still get a useful diagnostic without panicking.

Parameters

render_node
Path to the DRM render node (e.g. `/dev/dri/renderD128`).

Returns

A populated `BringUpResult`; `.ok()` is true only on a clean round-trip.

Notes

Off Linux this short-circuits to `unsupported_os` and performs no IO.

src/zinc_rt/ring/kfd.zig:381

enum

ComputeQueueSmokeStatus

pub const ComputeQueueSmokeStatus = enum

Outcome categories for the PM4 compute-queue smoke path.

Each non-`ok` value identifies a specific failure step (BO allocation, map, `CREATE_QUEUE`, `DESTROY_QUEUE`) so callers can render bring-up reports.

src/zinc_rt/ring/kfd.zig:551

struct

ComputeQueueSmokeResult

pub const ComputeQueueSmokeResult = struct

Full report for the PM4 compute-queue smoke run.

Captures the status plus every observable value the run produced (BO VAs, queue id, doorbell offset, initial wptr/rptr, number of PM4 NOP dwords staged, and the errno of the last failing ioctl) so a higher-level CLI can render bring-up output.

src/zinc_rt/ring/kfd.zig:573

Methods

1

method

ComputeQueueSmokeResult.ok

pub fn ok(self: ComputeQueueSmokeResult) bool

True when the queue was created, NOPs staged, and the queue destroyed without any ioctl failure.

src/zinc_rt/ring/kfd.zig:599

function

alignUp

pub fn alignUp(value: u64, alignment: u64) u64

Round `value` up to the next multiple of `alignment`.

Returns `value` unchanged when `alignment` is zero or `value` is already aligned.

Parameters

value
Number to be rounded up.
alignment
Power-of-two (or any non-zero) boundary to align to.

Returns

Smallest multiple of `alignment` that is `>= value`.

src/zinc_rt/ring/kfd.zig:612
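
A minimal sketch consistent with the description above. `alignUpSketch` is a hypothetical name and uses a modulo so that non-power-of-two alignments also work; the real `alignUp` may be implemented differently, and this version does not guard against overflow near the top of the `u64` range.

```zig
const std = @import("std");

// Hypothetical stand-in for `alignUp`: round `value` up to a multiple of `alignment`.
fn alignUpSketch(value: u64, alignment: u64) u64 {
    if (alignment == 0) return value; // documented: zero alignment is a no-op
    const rem = value % alignment;
    if (rem == 0) return value; // already aligned
    return value + (alignment - rem);
}

pub fn main() void {
    std.debug.print("0x{x}\n", .{alignUpSketch(0x1001, 0x1000)});
}
```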

function

computeCwsrBoSize

pub fn computeCwsrBoSize(cwsr_size: u32, simd_count: u32, simd_per_cu_in: u32, num_xcc_in: u32, gfx_target_version: u32) u64

Compute the CWSR buffer-object size that the kernel's `kfd_queue_acquire_buffers` will accept for a given GPU topology.

The formula is `align_up((cwsr_size + debug_memory_size) * num_xcc, PAGE)`, where for gfx ≥ 10.1.x `debug_memory_size = align_up((simd_count / simd_per_cu / num_xcc) * 32 * 32, 64)` and for older IPs `debug_memory_size = 0`. Verified on the R9700 (gfx1201): cwsr_size 0x1d47000, debug 0x10000, resulting BO 0x1d57000.

Parameters

cwsr_size
Raw `cwsr_size` from the topology `properties` file (bytes).
simd_count
Total SIMD units across all XCCs for this GPU.
simd_per_cu_in
SIMD units per CU; if 0, derived from `gfx_target_version` (2 for RDNA, 4 for GCN/CDNA).
num_xcc_in
Number of XCC dies; treated as 1 if 0.
gfx_target_version
Encoded GFX IP version (major×10000 + minor×100 + step).

Returns

Total BO allocation size in bytes, page-aligned.

src/zinc_rt/ring/kfd.zig:632
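
The formula can be worked through with the R9700 numbers quoted above. The sketch below is a hypothetical reimplementation, not the module's code. It assumes a 4 KiB page, reads "gfx ≥ 10.1.x" as `gfx_target_version >= 100100`, treats major version 10 and above as RDNA when defaulting `simd_per_cu`, and assumes `simd_count = 128`, which is consistent with the documented debug size of 0x10000 but is not stated on this page.

```zig
const std = @import("std");

const page_size: u64 = 4096; // assumption: 4 KiB pages

fn alignUp(value: u64, alignment: u64) u64 {
    if (alignment == 0) return value;
    const rem = value % alignment;
    return if (rem == 0) value else value + (alignment - rem);
}

// Hypothetical stand-in for `computeCwsrBoSize`, following the documented formula.
fn cwsrBoSizeSketch(
    cwsr_size: u32,
    simd_count: u32,
    simd_per_cu_in: u32,
    num_xcc_in: u32,
    gfx_target_version: u32,
) u64 {
    const num_xcc: u64 = if (num_xcc_in == 0) 1 else num_xcc_in;
    // Assumption: major >= 10 means RDNA (2 SIMDs per CU), otherwise GCN/CDNA (4).
    const derived: u64 = if (gfx_target_version >= 100000) 2 else 4;
    const simd_per_cu: u64 = if (simd_per_cu_in != 0) simd_per_cu_in else derived;

    var debug_memory_size: u64 = 0;
    if (gfx_target_version >= 100100) {
        const cu_per_xcc = @as(u64, simd_count) / simd_per_cu / num_xcc;
        debug_memory_size = alignUp(cu_per_xcc * 32 * 32, 64);
    }
    return alignUp((@as(u64, cwsr_size) + debug_memory_size) * num_xcc, page_size);
}

pub fn main() void {
    // 128 SIMDs / 2 per CU = 64 CUs; 64 * 32 * 32 = 0x10000 bytes of debug memory.
    const size = cwsrBoSizeSketch(0x1d47000, 128, 0, 1, 120001);
    std.debug.print("CWSR BO size = 0x{x}\n", .{size});
}
```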

function

createComputeQueueSmokeDefault

pub fn createComputeQueueSmokeDefault() ComputeQueueSmokeResult

Run `createComputeQueueSmokePath` against `default_render_node`.

Returns

The compute-queue smoke result; `.ok()` is true on a clean run.

src/zinc_rt/ring/kfd.zig:765

function

createComputeQueueSmokePath

pub fn createComputeQueueSmokePath(render_node: []const u8) ComputeQueueSmokeResult

Stand up a PM4 compute queue end-to-end on the given render node.

Allocates the ring / wptr / rptr / EOP / CWSR buffer objects with the sizing rules `kfd_queue_acquire_buffers` validates, calls `AMDKFD_IOC_CREATE_QUEUE`, stages a couple of PM4 NOP packets into the ring (no doorbell yet), then destroys the queue and tears everything down. Every BO and ioctl failure is captured in the returned report so the bring-up smoke test can render the exact step that failed.

Parameters

render_node
DRM render node backing the target GPU.

Returns

The full smoke report; `.ok()` is true on a clean create/destroy.

Notes

Off Linux this short-circuits to `unsupported_os` without IO.

src/zinc_rt/ring/kfd.zig:779
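
To make "stages a couple of PM4 NOP packets" concrete, the sketch below builds type-3 PM4 headers the way the amdgpu `PACKET3(op, n)` macro does (packet type in bits 31:30, body dword count minus one in bits 29:16, opcode in bits 15:8) with NOP opcode `0x10`. `packet3` and `stageNops` are hypothetical helpers. The one-dword NOP body, the dword-indexed write pointer, and leaving the remaining header bits zero are assumptions, not a description of what `createComputeQueueSmokePath` writes.

```zig
const std = @import("std");

const PACKET3_NOP: u32 = 0x10;

// Hypothetical helper: type-3 header for a packet with `body_dwords` payload dwords.
fn packet3(opcode: u32, body_dwords: u32) u32 {
    std.debug.assert(body_dwords >= 1 and body_dwords <= 0x4000);
    const type3: u32 = 3;
    return (type3 << 30) | (((body_dwords - 1) & 0x3FFF) << 16) | ((opcode & 0xFF) << 8);
}

// Hypothetical helper: write `packets` two-dword NOPs at `wptr` (in dwords),
// wrapping at the ring size, and return the advanced write pointer.
fn stageNops(ring: []u32, wptr: usize, packets: usize) usize {
    var w = wptr;
    var i: usize = 0;
    while (i < packets) : (i += 1) {
        ring[w % ring.len] = packet3(PACKET3_NOP, 1);
        ring[(w + 1) % ring.len] = 0; // padding dword carried by the NOP
        w += 2;
    }
    return w;
}

pub fn main() void {
    var ring = [_]u32{0} ** 16;
    const wptr = stageNops(&ring, 0, 2);
    std.debug.print("staged {d} dwords, first header 0x{x}\n", .{ wptr, ring[0] });
}
```

Nothing is executed by the GPU at this point: until the advanced write pointer is published and the doorbell is rung, the command processor does not fetch the staged packets, which matches the "no doorbell yet" scope of the smoke path.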

function

formatGfxTarget

pub fn formatGfxTarget(buf: []u8, gfx_target_version: u32) []const u8

Format a `gfx_target_version` integer (e.g. 120001) as a GFX target string such as `"gfx1201"`. The encoding is major×10000 + minor×100 + step, matching the value read from the KFD topology `properties` file.

Parameters

buf
Caller-supplied scratch buffer; 16 bytes is sufficient.
gfx_target_version
Encoded GFX IP version, or 0 for an unknown target.

Returns

A slice into `buf` holding the rendered string, or `"gfx?"` when `gfx_target_version` is 0 or the buffer is too small.

src/zinc_rt/ring/kfd.zig:942
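
The decoding can be sketched as below. `formatGfxTargetSketch` is a hypothetical name. Printing the major version in decimal and the minor and step in hexadecimal is an assumption borrowed from the usual LLVM target naming (it reproduces `gfx1201` for 120001); the real function may format the fields differently.

```zig
const std = @import("std");

// Hypothetical stand-in for `formatGfxTarget`.
fn formatGfxTargetSketch(buf: []u8, gfx_target_version: u32) []const u8 {
    if (gfx_target_version == 0) return "gfx?";
    const major = gfx_target_version / 10000;
    const minor = (gfx_target_version / 100) % 100;
    const step = gfx_target_version % 100;
    // bufPrint fails with NoSpaceLeft when `buf` is too small.
    const out = std.fmt.bufPrint(buf, "gfx{d}{x}{x}", .{ major, minor, step }) catch return "gfx?";
    return out;
}

pub fn main() void {
    var buf: [16]u8 = undefined;
    std.debug.print("{s}\n", .{formatGfxTargetSketch(&buf, 120001)});
}
```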