Last updated: 2026-06-12
Inference Runtime
Kfd
AMDGPU KFD (`/dev/kfd`) bring-up for the T1 PM4-direct tier.
The design's T1 tier submits PM4 packets straight to the AMD command processor. On Linux that means the KFD compute path (`/dev/kfd` + `AMDKFD_IOC_*`), the same userspace ABI ROCm/HSA and tinygrad ride on. It works on every `amdgpu` kernel that ships KFD and does not depend on the experimental user-mode-queue (UMQ / `DRM_IOCTL_AMDGPU_USERQ`) ABI, which on the R9700 (gfx1201, kernel 6.17, `uni_mes` firmware) the kernel rejects with "Usermode queue is not supported for this IP" — see `umq.zig` and §14 of the design doc.
This module brings up the T1 PM4-direct path on the kernel ABI that works:
* open `/dev/kfd` + the render node, `AMDKFD_IOC_GET_VERSION`,
* match the render minor against `/sys/.../kfd/topology/nodes/*`,
* `AMDKFD_IOC_ACQUIRE_VM`, `AMDKFD_IOC_GET_PROCESS_APERTURES_NEW`,
* `bringUpPath`: reserve a VA window inside the GPUVM aperture, `ALLOC_MEMORY_OF_GPU` (GTT, writable) + `mmap` + `MAP_MEMORY_TO_GPU` + a CPU round-trip, then `UNMAP_MEMORY_FROM_GPU` + `FREE_MEMORY_OF_GPU`,
* `createComputeQueueSmokePath`: allocate the ring / wptr / rptr / EOP / CWSR buffer objects the way `kfd_queue_acquire_buffers` validates them, `AMDKFD_IOC_CREATE_QUEUE` (PM4 compute), stage a couple of PM4 NOP packets into the ring, then `AMDKFD_IOC_DESTROY_QUEUE` and tear down.
Ringing the doorbell and retiring a PM4 fence is the next bring-up step.
constant
default_render_node
pub const default_render_node = "/dev/dri/renderD128" Default DRM render node used when the caller does not supply one.
The renderD128 minor is the standard single-GPU choice on AMD Linux systems.
constant
kfd_device_node
pub const kfd_device_node = "/dev/kfd" Path to the KFD compute device used for every AMDKFD_IOC_* ioctl.
constant
topology_nodes_dir
pub const topology_nodes_dir = "/sys/devices/virtual/kfd/kfd/topology/nodes" Sysfs root that enumerates KFD topology nodes (one subdirectory per node, each carrying `gpu_id` and a `properties` file that drives queue sizing).
constant
min_kfd_major
pub const min_kfd_major: u32 = 1 Minimum KFD ABI major version required for the PM4 compute path; the bring-up fails fast with `kfd_version_too_old` below this number.
constant
ALLOC_MEM_FLAGS_VRAM
pub const ALLOC_MEM_FLAGS_VRAM: u32 = 1 << 0 Allocate the BO out of device VRAM (local frame buffer).
constant
ALLOC_MEM_FLAGS_GTT
pub const ALLOC_MEM_FLAGS_GTT: u32 = 1 << 1 Allocate the BO out of system GTT memory (the path used by every smoke BO).
constant
ALLOC_MEM_FLAGS_USERPTR
pub const ALLOC_MEM_FLAGS_USERPTR: u32 = 1 << 2 Pin a userptr range as the BO backing; not used by the bring-up smoke path.
constant
ALLOC_MEM_FLAGS_DOORBELL
pub const ALLOC_MEM_FLAGS_DOORBELL: u32 = 1 << 3 Allocate a doorbell page so a userspace queue can ring its wptr doorbell.
constant
ALLOC_MEM_FLAGS_COHERENT
pub const ALLOC_MEM_FLAGS_COHERENT: u32 = 1 << 26 Request a CPU-coherent BO (snooped on x86); not used by the bring-up smoke path.
constant
ALLOC_MEM_FLAGS_PUBLIC
pub const ALLOC_MEM_FLAGS_PUBLIC: u32 = 1 << 29 Mark the BO as PCIe-visible / exportable to peer devices.
constant
ALLOC_MEM_FLAGS_EXECUTABLE
pub const ALLOC_MEM_FLAGS_EXECUTABLE: u32 = 1 << 30 Mark the BO as containing GPU-executable code (shader binaries).
constant
ALLOC_MEM_FLAGS_WRITABLE
pub const ALLOC_MEM_FLAGS_WRITABLE: u32 = 1 << 31 Map the BO with write permission on the GPU side (default for smoke BOs).
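The allocation flags above are single bits that are OR-ed together into the flags word of an allocation request. The following is a minimal sketch of the "GTT, writable" combination that the module overview describes for smoke BOs. The constant values are copied from this page; `smokeBoFlags` and `hasFlags` are hypothetical helper names introduced only for illustration.

```zig
const std = @import("std");

// Flag values copied from the constants documented above.
const ALLOC_MEM_FLAGS_GTT: u32 = 1 << 1;
const ALLOC_MEM_FLAGS_EXECUTABLE: u32 = 1 << 30;
const ALLOC_MEM_FLAGS_WRITABLE: u32 = 1 << 31;

/// Hypothetical helper: flags for a GPU-writable BO backed by system GTT memory.
pub fn smokeBoFlags() u32 {
    return ALLOC_MEM_FLAGS_GTT | ALLOC_MEM_FLAGS_WRITABLE;
}

/// Hypothetical helper: true when `flags` contains every bit set in `mask`.
pub fn hasFlags(flags: u32, mask: u32) bool {
    return (flags & mask) == mask;
}
```

Keeping the combination in one place makes it easy to see that the smoke BOs never request VRAM or executable mappings.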
constant
QUEUE_TYPE_COMPUTE
pub const QUEUE_TYPE_COMPUTE: u32 = 0x0 PM4 compute queue type passed to `AMDKFD_IOC_CREATE_QUEUE` for MEC pipes.
constant
QUEUE_TYPE_SDMA
pub const QUEUE_TYPE_SDMA: u32 = 0x1 SDMA queue type; copy engine queue, not used by the PM4 compute bring-up.
constant
QUEUE_TYPE_COMPUTE_AQL
pub const QUEUE_TYPE_COMPUTE_AQL: u32 = 0x2 AQL compute queue type (HSA packet processor format) used by ROCm/HSA.
struct
GetVersionArgs
pub const GetVersionArgs = extern struct `AMDKFD_IOC_GET_VERSION` ioctl args — the KFD ABI version reported by the running kernel.
Matches `struct kfd_ioctl_get_version_args` from uapi.
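As a rough illustration of how the version gate could look, the sketch below assumes the struct carries two 32-bit fields named `major_version` and `minor_version`, as in the uapi struct it mirrors, and compares the major number against `min_kfd_major`. The field defaults and the `versionOk` helper are assumptions made for this example, not the module's code.

```zig
const std = @import("std");

/// Assumed layout, mirroring `struct kfd_ioctl_get_version_args`:
/// two 32-bit fields that the kernel fills in.
pub const GetVersionArgs = extern struct {
    major_version: u32 = 0,
    minor_version: u32 = 0,
};

/// Value copied from the `min_kfd_major` constant documented above.
pub const min_kfd_major: u32 = 1;

/// Hypothetical helper: the check applied once the ioctl has returned.
pub fn versionOk(v: GetVersionArgs) bool {
    return v.major_version >= min_kfd_major;
}
```

Because the struct is `extern`, its size and field order follow the C ABI, which is what the ioctl number encodes.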
struct
AcquireVmArgs
pub const AcquireVmArgs = extern struct `AMDKFD_IOC_ACQUIRE_VM` ioctl args — binds the calling process's GPUVM to the DRM render node fd so subsequent allocations land in the right address space.
struct
ProcessDeviceApertures
pub const ProcessDeviceApertures = extern struct One device aperture entry returned by `GET_PROCESS_APERTURES_NEW`: the LDS, scratch, and GPUVM windows that this process is allowed to use on the indicated `gpu_id`.
The bring-up only consumes `gpuvm_base/limit`.
struct
GetProcessAperturesNewArgs
pub const GetProcessAperturesNewArgs = extern struct `AMDKFD_IOC_GET_PROCESS_APERTURES_NEW` ioctl args — caller supplies a pointer to an array of `ProcessDeviceApertures` plus its length; the kernel fills the array and updates `num_of_nodes` with the actual count.
struct
AllocMemoryOfGpuArgs
pub const AllocMemoryOfGpuArgs = extern struct `AMDKFD_IOC_ALLOC_MEMORY_OF_GPU` ioctl args.
The caller chooses the GPU VA (`va_addr`) and the kernel returns a BO `handle` plus an `mmap_offset` for the DRM render node so the BO can be CPU-mapped.
struct
FreeMemoryOfGpuArgs
pub const FreeMemoryOfGpuArgs = extern struct `AMDKFD_IOC_FREE_MEMORY_OF_GPU` ioctl args — release the BO referenced by `handle` (must already be unmapped from every GPU it was mapped to).
struct
MapMemoryToGpuArgs
pub const MapMemoryToGpuArgs = extern struct `AMDKFD_IOC_MAP_MEMORY_TO_GPU` / `UNMAP_MEMORY_FROM_GPU` ioctl args.
Same layout for both directions; the kernel updates `n_success` with the number of devices the BO was successfully (un)mapped on.
struct
CreateQueueArgs
pub const CreateQueueArgs = extern struct `AMDKFD_IOC_CREATE_QUEUE` ioctl args — describes the PM4 compute queue the kernel should map onto the MEC.
Every BO address (ring/wptr/rptr/EOP/CWSR) must already be allocated and mapped to the same GPU before the ioctl.
struct
DestroyQueueArgs
pub const DestroyQueueArgs = extern struct `AMDKFD_IOC_DESTROY_QUEUE` ioctl args — release the queue identified by `queue_id` (the value the kernel returned from `CREATE_QUEUE`).
struct
TopologyNode
pub const TopologyNode = struct One GPU topology node parsed from `topology_nodes_dir`.
Carries the values the queue bring-up needs to validate `CREATE_QUEUE`: `gpu_id`, the GFX IP target version, CU/SIMD counts that drive CWSR sizing, and the canonical `cwsr_size` / `ctl_stack_size` advertised by the kernel.
enum
BringUpStatus
pub const BringUpStatus = enum Outcome categories for `bringUpPath`.
Every non-`ok` value identifies the exact bring-up step that failed (open, version check, aperture lookup, VA reservation, alloc, map, mmap, CPU round-trip, unmap, free).
struct
BringUpResult
pub const BringUpResult = struct Full bring-up report produced by `bringUpPath`.
Captures the status plus every observable value the smoke run collected (KFD ABI version, topology info, the GPUVM window, the reserved VA, and the errno of the last failing ioctl when applicable) so callers can render diagnostics without rerunning.
Methods
method
BringUpResult.ok
pub fn ok(self: BringUpResult) bool True when every bring-up step succeeded (`status == .ok`).
function
lastErrno
pub fn lastErrno() ?linux.E Returns the errno of the most recent failing ioctl, or `null` if the last ioctl succeeded.
Cleared at the start of each `ioctlChecked` call.
function
renderMinorOf
pub fn renderMinorOf(render_node: []const u8) ?u32 Extract the DRM render minor number from a `/dev/dri/renderD<N>` path.
Returns `null` when the path does not contain a recognizable `renderD<digits>` suffix.
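One way such a parser could be written is sketched below. It is an illustration under the assumption that only a trailing run of ASCII digits after `renderD` is accepted; `renderMinorOfSketch` is a hypothetical name and the module's own implementation may differ.

```zig
const std = @import("std");

/// Sketch: return N for a path ending in `renderD<N>`, otherwise null.
pub fn renderMinorOfSketch(render_node: []const u8) ?u32 {
    const prefix = "renderD";
    const start = std.mem.lastIndexOf(u8, render_node, prefix) orelse return null;
    const digits = render_node[start + prefix.len ..];
    if (digits.len == 0) return null;
    // Accept plain ASCII digits only; parseInt alone would also allow
    // a sign or underscore separators.
    for (digits) |c| {
        if (!std.ascii.isDigit(c)) return null;
    }
    return std.fmt.parseInt(u32, digits, 10) catch null;
}
```

For example, `renderMinorOfSketch("/dev/dri/renderD128")` yields `128`, while a path such as `/dev/dri/card0` yields `null`.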
function
findTopologyNode
pub fn findTopologyNode(render_minor: u32) ?TopologyNode Scan `topology_nodes_dir` for the GPU topology node whose `drm_render_minor` matches the given render minor, parsing `gpu_id`, `gfx_target_version`, `simd_count`, `cwsr_size`, and related properties from each node's `properties` file.
CPU-only nodes (`gpu_id == 0`) are skipped. Returns `null` when no node matches or the sysfs directory cannot be opened.
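The property lookup can be sketched as a small text scan. The example assumes the `properties` file holds one `name value` pair per line with a decimal value; `propertyValue` is a hypothetical helper, and reading the file from sysfs is left out so the sketch stays self-contained.

```zig
const std = @import("std");

/// Sketch: find `key` in the text of a topology `properties` file,
/// assuming one `name value` pair per line with a decimal value.
pub fn propertyValue(text: []const u8, key: []const u8) ?u64 {
    var rest = text;
    while (rest.len > 0) {
        // Split off the next line (without its trailing newline).
        const eol = std.mem.indexOfScalar(u8, rest, '\n') orelse rest.len;
        const line = rest[0..eol];
        rest = if (eol < rest.len) rest[eol + 1 ..] else rest[rest.len..];

        // The name is everything before the first space; it must match exactly.
        const sep = std.mem.indexOfScalar(u8, line, ' ') orelse continue;
        if (!std.mem.eql(u8, line[0..sep], key)) continue;

        const value = std.mem.trim(u8, line[sep + 1 ..], " \r\t");
        return std.fmt.parseInt(u64, value, 10) catch null;
    }
    return null;
}
```

A caller would compare `propertyValue(text, "drm_render_minor")` with the requested minor and, on a match, read `gfx_target_version`, `simd_count`, and `cwsr_size` the same way.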
function
reachable
pub fn reachable() bool Cheaply test whether the KFD PM4 path appears usable on this machine without issuing any ioctls: opens `/dev/kfd`, then verifies that a matching topology node with a non-zero `gpu_id` exists in sysfs for the default render minor.
Returns `true` when `/dev/kfd` can be opened and a matching GPU node is found for the default render node; `false` otherwise or on non-Linux targets.
function
bringUpDefault
pub fn bringUpDefault() BringUpResult Run `bringUpPath` against `default_render_node` ("/dev/dri/renderD128").
function
bringUpPath
pub fn bringUpPath(render_node: []const u8) BringUpResult End-to-end KFD bring-up: open `/dev/kfd` and the supplied render node, check the ABI version, acquire the GPUVM, look up the matching topology aperture, reserve a 64 KiB VA window inside it, allocate + mmap + MAP_MEMORY_TO_GPU a 4 KiB GTT scratch BO, write/read a magic value to verify the round-trip, then unmap and free everything.
Every failure point is captured in the returned `BringUpResult` (status + last errno) so a non-Linux caller or a partial-permissions environment can still get a useful diagnostic without panicking.
enum
ComputeQueueSmokeStatus
pub const ComputeQueueSmokeStatus = enum Outcome categories for the PM4 compute-queue smoke path.
Each non-`ok` value identifies a specific failure step (BO allocation, map, `CREATE_QUEUE`, `DESTROY_QUEUE`) so callers can render bring-up reports.
struct
ComputeQueueSmokeResult
pub const ComputeQueueSmokeResult = struct Full report for the PM4 compute-queue smoke run.
Captures the status plus every observable value the run produced (BO VAs, queue id, doorbell offset, initial wptr/rptr, number of PM4 NOP dwords staged, and the errno of the last failing ioctl) so a higher-level CLI can render bring-up output.
Methods
method
ComputeQueueSmokeResult.ok
pub fn ok(self: ComputeQueueSmokeResult) bool True when the queue was created, NOPs staged, and the queue destroyed without any ioctl failure.
function
alignUp
pub fn alignUp(value: u64, alignment: u64) u64 Round `value` up to the next multiple of `alignment`.
Returns `value` unchanged when `alignment` is zero or `value` is already aligned.
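A minimal sketch of a rounding helper with the documented behaviour follows. It uses a remainder rather than a bit mask so that it also works for alignments that are not powers of two, which is an assumption about intent; `alignUpSketch` is a hypothetical name, and overflow near the top of the `u64` range is not handled.

```zig
const std = @import("std");

/// Sketch: round `value` up to the next multiple of `alignment`.
/// A zero alignment or an already aligned value is returned unchanged.
pub fn alignUpSketch(value: u64, alignment: u64) u64 {
    if (alignment == 0) return value;
    const rem = value % alignment;
    if (rem == 0) return value;
    return value + (alignment - rem);
}
```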
function
computeCwsrBoSize
pub fn computeCwsrBoSize( cwsr_size: u32, simd_count: u32, simd_per_cu_in: u32, num_xcc_in: u32, gfx_target_version: u32, ) u64 Compute the CWSR buffer-object size that the kernel's `kfd_queue_acquire_buffers` will accept for a given GPU topology.
The formula is `align_up((cwsr_size + debug_memory_size) * num_xcc, PAGE)`, where for gfx ≥ 10.1.x `debug_memory_size = align_up((simd_count / simd_per_cu / num_xcc) * 32 * 32, 64)` and for older IPs `debug_memory_size = 0`. Here `simd_per_cu` is the number of SIMDs per compute unit (2 for RDNA, 4 for GCN/CDNA). Verified on the R9700 (gfx1201): cwsr_size 0x1d47000, debug 0x10000, resulting BO 0x1d57000.
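The formula can be written out as the sketch below. Several details are assumptions made for illustration: a 4 KiB page size, reading "gfx ≥ 10.1.x" as `gfx_target_version >= 100100`, and treating a zero `simd_per_cu` or `num_xcc` as 1 so the divisions stay defined. `cwsrBoSizeSketch` is a hypothetical name.

```zig
const std = @import("std");

const page_size: u64 = 4096; // assumption: 4 KiB pages

fn alignUp(value: u64, alignment: u64) u64 {
    if (alignment == 0) return value;
    const rem = value % alignment;
    return if (rem == 0) value else value + (alignment - rem);
}

/// Sketch of the documented sizing formula.
pub fn cwsrBoSizeSketch(
    cwsr_size: u32,
    simd_count: u32,
    simd_per_cu: u32,
    num_xcc: u32,
    gfx_target_version: u32,
) u64 {
    // Assumption: a zero divisor is treated as 1.
    const per_cu: u64 = if (simd_per_cu == 0) 1 else simd_per_cu;
    const xcc: u64 = if (num_xcc == 0) 1 else num_xcc;

    var debug_memory_size: u64 = 0;
    if (gfx_target_version >= 100100) { // assumed encoding of gfx 10.1.0
        const cus_per_xcc = @as(u64, simd_count) / per_cu / xcc;
        debug_memory_size = alignUp(cus_per_xcc * 32 * 32, 64);
    }
    return alignUp((@as(u64, cwsr_size) + debug_memory_size) * xcc, page_size);
}
```

With the R9700 figures quoted above, a debug size of 0x10000 corresponds to 64 compute units per XCC, so an input of `simd_count = 128`, `simd_per_cu = 2`, `num_xcc = 1` (inferred here, not stated on this page) reproduces the 0x1d57000 result.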
function
createComputeQueueSmokeDefault
pub fn createComputeQueueSmokeDefault() ComputeQueueSmokeResult Run `createComputeQueueSmokePath` against `default_render_node`.
function
createComputeQueueSmokePath
pub fn createComputeQueueSmokePath(render_node: []const u8) ComputeQueueSmokeResult Stand up a PM4 compute queue end-to-end on the given render node.
Allocates the ring / wptr / rptr / EOP / CWSR buffer objects with the sizing rules `kfd_queue_acquire_buffers` validates, calls `AMDKFD_IOC_CREATE_QUEUE`, stages a couple of PM4 NOP packets into the ring (no doorbell yet), then destroys the queue and tears everything down. Every BO and ioctl failure is captured in the returned report so the bring-up smoke test can render the exact step that failed.
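To make the "stage a couple of PM4 NOP packets" step concrete, here is a sketch of how type-3 NOP packets could be written into a ring buffer. The header encoding follows the usual PM4 type-3 layout (type in bits 31:30, count in bits 29:16, opcode in bits 15:8) with NOP as opcode 0x10; `packet3` and `stageNops` are hypothetical helpers, and the module may size or encode its NOPs differently.

```zig
const std = @import("std");

const PACKET3_NOP: u32 = 0x10;

/// Type-3 PM4 header: `count` is the number of body dwords minus one.
pub fn packet3(opcode: u32, count: u32) u32 {
    return (@as(u32, 3) << 30) | ((count & 0x3fff) << 16) | ((opcode & 0xff) << 8);
}

/// Hypothetical staging helper: write up to `n` two-dword NOP packets at
/// the start of `ring` and return the number of dwords written.
pub fn stageNops(ring: []u32, n: usize) usize {
    var written: usize = 0;
    var i: usize = 0;
    while (i < n and written + 2 <= ring.len) : (i += 1) {
        ring[written] = packet3(PACKET3_NOP, 0); // header, one body dword follows
        ring[written + 1] = 0; // body dword, used only as padding
        written += 2;
    }
    return written;
}
```

The returned dword count is the kind of value a report could record as "PM4 NOP dwords staged"; nothing is submitted until the write pointer and doorbell are updated, which this smoke path does not do yet.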
function
formatGfxTarget
pub fn formatGfxTarget(buf: []u8, gfx_target_version: u32) []const u8 Format a `gfx_target_version` integer (e.g. 120001) as a GFX target string such as `"gfx1201"`.
The encoding is major×10000 + minor×100 + step, matching the value read from the KFD topology `properties` file. Returns a fallback value when `gfx_target_version` is 0 or the buffer is too small.
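The decoding can be sketched as follows. Two points are assumptions: the minor and step fields are printed as hexadecimal digits (which is how names such as `gfx90a` arise), and the sketch returns `null` for a zero version or a too-small buffer because the value the real function returns in that case is not shown here. `formatGfxTargetSketch` is a hypothetical name.

```zig
const std = @import("std");

/// Sketch: decode major*10000 + minor*100 + step into a `gfx...` name.
pub fn formatGfxTargetSketch(buf: []u8, gfx_target_version: u32) ?[]const u8 {
    if (gfx_target_version == 0) return null;
    const major = gfx_target_version / 10000;
    const minor = (gfx_target_version / 100) % 100;
    const step = gfx_target_version % 100;
    // Assumption: major is decimal, minor and step are lowercase hex digits.
    const out = std.fmt.bufPrint(buf, "gfx{d}{x}{x}", .{ major, minor, step }) catch return null;
    return out;
}
```

For 120001 the fields are major 12, minor 0, step 1, which gives `gfx1201`.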