Last updated: 2026-06-12
Inference Runtime
Cs
AMDGPU DRM command-submission (CS) path — bring-up of the RADV / radeonsi PM4 submission foundation.
T1 PM4-direct reaches the AMD command processor through three Linux ABIs:
* `DRM_IOCTL_AMDGPU_USERQ`: the user-mode-queue ABI. The bench-node firmware reports zero compute USERQ slots, so it is unusable here (see `umq.zig`).
* `/dev/kfd` `AMDKFD_IOC_CREATE_QUEUE` + a doorbell ring: works to create a raw `QUEUE_TYPE_COMPUTE` queue, but the MES never retires the PM4 we stage in it on this kernel (see `kfd.zig`).
* `DRM_IOCTL_AMDGPU_CS`: the kernel-managed command-submission UAPI that every AMD userspace driver (RADV, radeonsi, amdvlk) rides on. The kernel owns the ring / doorbell / MES bookkeeping; userspace hands it an indirect buffer (IB) of PM4 and waits on the retired fence. This is the reliable foundation the GPU compute dispatch lowers onto.
This module brings up the CS path's first retired PM4 batch as a benchmark-visible gate. It opens the render node, queries the compute HW IP, allocates an amdgpu context, creates a persistent BO list for a GTT indirect-buffer BO plus data/signal/shader BOs, maps them into the GPU VM at low VAs, submits PM4 streams through `DRM_IOCTL_AMDGPU_CS` using the same context and BO list, and waits for the returned fences with `DRM_IOCTL_AMDGPU_WAIT_CS`.
This is not the final T1/T2 ring from the design; it is the kernel-managed CS baseline used to validate packet bytes, BO residency, VM mapping, and fence retirement before lowering real decode slices onto the direct tiers.
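As a reference point for the four ioctls named above, the sketch below derives their request numbers from the generic Linux `_IOWR` encoding and the command indices in `uapi/drm/amdgpu_drm.h`. The helper name `drmIowr` is hypothetical and is not claimed to exist in this module; the argument sizes are the byte sizes of the corresponding kernel unions on a 64-bit build.

```zig
const std = @import("std");

// Linux ioctl encoding (asm-generic): dir[31:30] | size[29:16] | type[15:8] | nr[7:0].
const IOC_WRITE: u32 = 1;
const IOC_READ: u32 = 2;
const DRM_IOCTL_BASE: u32 = 'd';
const DRM_COMMAND_BASE: u32 = 0x40;

// Driver-private command indices from uapi/drm/amdgpu_drm.h.
const DRM_AMDGPU_CTX: u32 = 0x02;
const DRM_AMDGPU_BO_LIST: u32 = 0x03;
const DRM_AMDGPU_CS: u32 = 0x04;
const DRM_AMDGPU_WAIT_CS: u32 = 0x09;

/// Hypothetical helper: DRM_IOWR(DRM_COMMAND_BASE + cmd, <arg of arg_size bytes>).
fn drmIowr(cmd: u32, arg_size: u32) u32 {
    return ((IOC_READ | IOC_WRITE) << 30) | (arg_size << 16) |
        (DRM_IOCTL_BASE << 8) | (DRM_COMMAND_BASE + cmd);
}

pub fn main() void {
    // union drm_amdgpu_cs is 24 bytes: the input struct is its largest member.
    std.debug.print("DRM_IOCTL_AMDGPU_CS = 0x{x}\n", .{drmIowr(DRM_AMDGPU_CS, 24)});
}
```

Computing the number from the union size makes an ABI mismatch visible early: if a Zig `extern union` ends up a different size from the kernel's, the request number changes and the kernel rejects the call.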
36 exports shown
constant
default_render_node
pub const default_render_node = "/dev/dri/renderD128" Default DRM render node used by the CS bring-up gate when no path is provided.
constant
AMDGPU_HW_IP_GFX
pub const AMDGPU_HW_IP_GFX: u32 = 0 amdgpu HW IP block id for the graphics ring (uapi/drm/amdgpu_drm.h).
constant
AMDGPU_HW_IP_COMPUTE
pub const AMDGPU_HW_IP_COMPUTE: u32 = 1 amdgpu HW IP block id for the async compute ring used by ZINC submissions.
constant
AMDGPU_CTX_OP_ALLOC_CTX
pub const AMDGPU_CTX_OP_ALLOC_CTX: u32 = 1 `DRM_AMDGPU_CTX` op selector for allocating a new submission context.
constant
AMDGPU_CTX_OP_FREE_CTX
pub const AMDGPU_CTX_OP_FREE_CTX: u32 = 2 `DRM_AMDGPU_CTX` op selector for releasing a previously allocated context.
constant
AMDGPU_BO_LIST_OP_CREATE
pub const AMDGPU_BO_LIST_OP_CREATE: u32 = 0 `DRM_AMDGPU_BO_LIST` op selector to create a residency BO list handle.
constant
AMDGPU_BO_LIST_OP_DESTROY
pub const AMDGPU_BO_LIST_OP_DESTROY: u32 = 1 `DRM_AMDGPU_BO_LIST` op selector to destroy a previously created BO list.
constant
AMDGPU_CHUNK_ID_IB
pub const AMDGPU_CHUNK_ID_IB: u32 = 0x01 CS chunk id for an indirect-buffer descriptor (`DrmAmdgpuCsChunkIb`).
constant
AMDGPU_CHUNK_ID_BO_HANDLES
pub const AMDGPU_CHUNK_ID_BO_HANDLES: u32 = 0x06 CS chunk id for an inline BO-handles list, an alternative to a pre-created BO list.
constant
AMDGPU_IB_FLAG_EMIT_MEM_SYNC
pub const AMDGPU_IB_FLAG_EMIT_MEM_SYNC: u32 = 1 << 6 IB flag instructing the kernel to emit a memory-sync packet around the IB so writes from the BO list reach DRAM before/after the dispatch.
struct
DrmAmdgpuCtxIn
pub const DrmAmdgpuCtxIn = extern struct Input payload of `DRM_IOCTL_AMDGPU_CTX`: selects an op and carries the caller-supplied `ctx_id` and submission priority for that op.
struct
DrmAmdgpuCtxOutAlloc
pub const DrmAmdgpuCtxOutAlloc = extern struct Output payload of `AMDGPU_CTX_OP_ALLOC_CTX`: the kernel-assigned context id returned in the same `DrmAmdgpuCtx` union after a successful allocation.
struct
DrmAmdgpuCtxOutState
pub const DrmAmdgpuCtxOutState = extern struct Output payload of the `AMDGPU_CTX_OP_QUERY_STATE` op: GPU reset state and hang counter for the queried context (unused on the bring-up path).
union
DrmAmdgpuCtx
pub const DrmAmdgpuCtx = extern union Union passed to `DRM_IOCTL_AMDGPU_CTX`, overlaying the input request and the two output shapes (alloc / query-state). As an `extern union` it carries no tag; the op that was issued determines which member is valid.
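A minimal layout sketch of the context payloads, assuming the field order of `struct drm_amdgpu_ctx_in` and `union drm_amdgpu_ctx_out` in the kernel header. The short type names and the union member names (`request`, `alloc`, `state`) are hypothetical; the module's real field names are not shown in this reference.

```zig
const std = @import("std");

const AMDGPU_CTX_OP_ALLOC_CTX: u32 = 1;

// Field order follows struct drm_amdgpu_ctx_in.
const CtxIn = extern struct {
    op: u32,
    flags: u32 = 0,
    ctx_id: u32 = 0,
    priority: i32 = 0, // 0 is AMDGPU_CTX_PRIORITY_NORMAL
};
const CtxOutAlloc = extern struct {
    ctx_id: u32,
    _pad: u32,
};
const CtxOutState = extern struct {
    flags: u64,
    hangs: u32,
    reset_status: u32,
};
// Member names are illustrative. The union has no tag: all members share storage.
const Ctx = extern union {
    request: CtxIn,
    alloc: CtxOutAlloc,
    state: CtxOutState,
};

comptime {
    // The kernel union is 16 bytes; a mismatch here would change the ioctl number.
    std.debug.assert(@sizeOf(Ctx) == 16);
}

fn allocRequest() Ctx {
    return Ctx{ .request = CtxIn{ .op = AMDGPU_CTX_OP_ALLOC_CTX } };
}

pub fn main() void {
    var arg = allocRequest();
    std.debug.print("op = {d}\n", .{arg.request.op});
    // A real caller would pass &arg to the ioctl here. This line only imitates
    // the kernel overwriting the same storage with the alloc result.
    arg.alloc = CtxOutAlloc{ .ctx_id = 7, ._pad = 0 };
    std.debug.print("ctx_id = {d}\n", .{arg.alloc.ctx_id});
}
```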
struct
DrmAmdgpuBoListIn
pub const DrmAmdgpuBoListIn = extern struct Input payload of `DRM_IOCTL_AMDGPU_BO_LIST`: op selector plus a pointer to an array of `DrmAmdgpuBoListEntry` describing the BOs the submission must keep resident.
struct
DrmAmdgpuBoListOut
pub const DrmAmdgpuBoListOut = extern struct Output payload of `AMDGPU_BO_LIST_OP_CREATE`: the kernel-assigned BO list handle referenced from subsequent CS submissions.
union
DrmAmdgpuBoList
pub const DrmAmdgpuBoList = extern union Untagged `extern union` passed to `DRM_IOCTL_AMDGPU_BO_LIST`, overlaying input and output.
struct
DrmAmdgpuBoListEntry
pub const DrmAmdgpuBoListEntry = extern struct Single residency entry inside a BO list: the GEM handle to make resident and a kernel-visible priority hint for eviction.
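The create request can be sketched as follows, assuming the field layout of `struct drm_amdgpu_bo_list_in` and `struct drm_amdgpu_bo_list_entry` from the kernel header. `createRequest` and the GEM handle values are hypothetical and only illustrate how the entry array is referenced by address and element size.

```zig
const std = @import("std");

const AMDGPU_BO_LIST_OP_CREATE: u32 = 0;

const BoListEntry = extern struct {
    bo_handle: u32,
    bo_priority: u32,
};
const BoListIn = extern struct {
    operation: u32,
    list_handle: u32,
    bo_number: u32,
    bo_info_size: u32,
    bo_info_ptr: u64,
};

/// Hypothetical helper: describe `entries` to the kernel for a CREATE op.
/// The entries must stay alive until the ioctl returns.
fn createRequest(entries: []const BoListEntry) BoListIn {
    return BoListIn{
        .operation = AMDGPU_BO_LIST_OP_CREATE,
        .list_handle = 0, // ignored on create; the kernel returns the handle
        .bo_number = @intCast(entries.len),
        .bo_info_size = @sizeOf(BoListEntry),
        .bo_info_ptr = @intFromPtr(entries.ptr),
    };
}

pub fn main() void {
    // Made-up GEM handles standing in for the IB, input, output, signal and shader BOs.
    const entries = [_]BoListEntry{
        .{ .bo_handle = 1, .bo_priority = 0 },
        .{ .bo_handle = 2, .bo_priority = 0 },
        .{ .bo_handle = 3, .bo_priority = 0 },
        .{ .bo_handle = 4, .bo_priority = 0 },
        .{ .bo_handle = 5, .bo_priority = 0 },
    };
    const req = createRequest(&entries);
    std.debug.print("bo_number = {d}, entry size = {d}\n", .{ req.bo_number, req.bo_info_size });
}
```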
struct
DrmAmdgpuCsChunk
pub const DrmAmdgpuCsChunk = extern struct One chunk inside a `DRM_IOCTL_AMDGPU_CS` submission: a typed sub-payload (`chunk_id`, length in dwords, pointer to the chunk data).
struct
DrmAmdgpuCsIn
pub const DrmAmdgpuCsIn = extern struct Input payload of `DRM_IOCTL_AMDGPU_CS`: binds a context, BO list and an array of typed chunks (the IB descriptor lives in one of those chunks).
struct
DrmAmdgpuCsOut
pub const DrmAmdgpuCsOut = extern struct Output payload of `DRM_IOCTL_AMDGPU_CS`: the fence handle the caller waits on via `DRM_IOCTL_AMDGPU_WAIT_CS` for the submission to retire.
union
DrmAmdgpuCs
pub const DrmAmdgpuCs = extern union Untagged `extern union` passed to `DRM_IOCTL_AMDGPU_CS`, overlaying input and output.
struct
DrmAmdgpuCsChunkIb
pub const DrmAmdgpuCsChunkIb = extern struct Chunk payload for `AMDGPU_CHUNK_ID_IB`: describes the indirect-buffer VA, its size in bytes, the target IP type/ring, and submission flags such as `AMDGPU_IB_FLAG_EMIT_MEM_SYNC`.
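How the three CS structures nest can be sketched as below, assuming the kernel layouts of `drm_amdgpu_cs_chunk_ib`, `drm_amdgpu_cs_chunk` and `drm_amdgpu_cs_in`. The type names are shortened, `ibChunk` is a hypothetical helper, and the VA, context id and BO list handle are placeholder values. The kernel expects `chunks` to point at an array of 64-bit pointers, one per chunk.

```zig
const std = @import("std");

const AMDGPU_HW_IP_COMPUTE: u32 = 1;
const AMDGPU_CHUNK_ID_IB: u32 = 0x01;
const AMDGPU_IB_FLAG_EMIT_MEM_SYNC: u32 = 1 << 6;

const CsChunkIb = extern struct {
    _pad: u32 = 0,
    flags: u32,
    va_start: u64, // GPU VA of the indirect buffer
    ib_bytes: u32, // IB size in bytes
    ip_type: u32,
    ip_instance: u32 = 0,
    ring: u32 = 0,
};
const CsChunk = extern struct {
    chunk_id: u32,
    length_dw: u32, // payload length in dwords
    chunk_data: u64, // user pointer to the payload
};
const CsIn = extern struct {
    ctx_id: u32,
    bo_list_handle: u32,
    num_chunks: u32,
    flags: u32 = 0,
    chunks: u64, // user pointer to an array of u64 chunk pointers
};

/// Hypothetical helper: wrap an IB descriptor in a typed chunk.
fn ibChunk(ib: *const CsChunkIb) CsChunk {
    return CsChunk{
        .chunk_id = AMDGPU_CHUNK_ID_IB,
        .length_dw = @sizeOf(CsChunkIb) / 4,
        .chunk_data = @intFromPtr(ib),
    };
}

pub fn main() void {
    const ib = CsChunkIb{
        .flags = AMDGPU_IB_FLAG_EMIT_MEM_SYNC,
        .va_start = 0x10_0000, // placeholder low GPU VA
        .ib_bytes = 16 * 4,
        .ip_type = AMDGPU_HW_IP_COMPUTE,
    };
    const chunk = ibChunk(&ib);
    const chunk_ptrs = [_]u64{@intFromPtr(&chunk)};
    const req = CsIn{
        .ctx_id = 1, // placeholder ids; real ones come from the CTX and BO_LIST ioctls
        .bo_list_handle = 1,
        .num_chunks = chunk_ptrs.len,
        .chunks = @intFromPtr(&chunk_ptrs),
    };
    std.debug.print("chunks = {d}, ib length_dw = {d}\n", .{ req.num_chunks, chunk.length_dw });
}
```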
struct
DrmAmdgpuWaitCsIn
pub const DrmAmdgpuWaitCsIn = extern struct Input payload of `DRM_IOCTL_AMDGPU_WAIT_CS`: identifies the fence to wait on (by `handle`/`ctx_id` against a specific IP/ring) and the timeout.
struct
DrmAmdgpuWaitCsOut
pub const DrmAmdgpuWaitCsOut = extern struct Output payload of `DRM_IOCTL_AMDGPU_WAIT_CS`: zero on successful retirement, nonzero on timeout or fence error.
union
DrmAmdgpuWaitCs
pub const DrmAmdgpuWaitCs = extern union Untagged `extern union` passed to `DRM_IOCTL_AMDGPU_WAIT_CS`, overlaying input and output.
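A sketch of the wait payload, assuming the field order of `struct drm_amdgpu_wait_cs_in` in the kernel header. The type names, the union member names and the `retired` helper are hypothetical. The all-ones timeout is the value libdrm calls `AMDGPU_TIMEOUT_INFINITE`; the fence handle and context id are placeholders.

```zig
const std = @import("std");

const AMDGPU_HW_IP_COMPUTE: u32 = 1;
const TIMEOUT_INFINITE: u64 = std.math.maxInt(u64);

const WaitCsIn = extern struct {
    handle: u64, // fence handle returned by DRM_IOCTL_AMDGPU_CS
    timeout: u64,
    ip_type: u32,
    ip_instance: u32 = 0,
    ring: u32 = 0,
    ctx_id: u32,
};
const WaitCsOut = extern struct {
    status: u64,
};
// Member names are illustrative; the union is untagged.
const WaitCs = extern union {
    request: WaitCsIn,
    result: WaitCsOut,
};

/// Hypothetical helper: a zero status means the fence retired.
fn retired(out: WaitCsOut) bool {
    return out.status == 0;
}

pub fn main() void {
    const arg = WaitCs{ .request = WaitCsIn{
        .handle = 42, // placeholder fence handle
        .timeout = TIMEOUT_INFINITE,
        .ip_type = AMDGPU_HW_IP_COMPUTE,
        .ctx_id = 1, // placeholder context id
    } };
    std.debug.print("waiting on fence {d}\n", .{arg.request.handle});
    // After the ioctl returns, the same storage holds the result.
    const done = WaitCsOut{ .status = 0 };
    std.debug.print("retired = {}\n", .{retired(done)});
}
```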
struct
ArgmaxRangeResult
pub const ArgmaxRangeResult = struct Result produced by the ordered-score argmax row-range kernel.
struct
DmmvArgmaxResult
pub const DmmvArgmaxResult = struct Result produced by a quantized DMMV row-range kernel that performs its own in-kernel argmax over the computed rows.
enum
SmokeStatus
pub const SmokeStatus = enum Outcome classification for the CS bring-up smoke gate.
Each variant maps to a specific failure point in the open → submit → wait pipeline, so the benchmark UI can attribute a regression to render-node access, kernel ABI mismatch, BO/VA setup, submission, or fence retirement.
struct
SmokeResult
pub const SmokeResult = struct Structured result returned by the CS bring-up smoke gate.
Captures the rendezvous addresses, kernel-assigned handles, observed signal value, fence handles, and the final `SmokeStatus` so benchmark output can surface a precise failure mode without re-running the path.
Methods (1)
method
SmokeResult.ok
pub fn ok(self: SmokeResult) bool Returns true when both PM4 submissions retired and the signal BO read back the expected sentinel value.
struct
TokenBoundary
pub const TokenBoundary = struct Per-token CS submission context for the PM4 bring-up tiers.
Owns the long-lived amdgpu context, BO list, and the GPU-mapped indirect-buffer / input / output / signal / shader buffers used by the `copyU32`, `argmaxTop2`, `rmsNormElement0` and `dmmvF32RowRange` dispatches. Reused across many submissions so each decode step only re-records PM4 into the existing IB and re-submits via `DRM_IOCTL_AMDGPU_CS`.
Methods (16)
method
TokenBoundary.initDefault
pub fn initDefault() !TokenBoundary Open the canonical render node (`default_render_node`) and finish the full CS bring-up: context, BO list, IB / input / output / signal / shader buffers, all mapped into a low GPU VA range.
method
TokenBoundary.initPath
pub fn initPath(render_node: []const u8) !TokenBoundary Open the given render node and bring up the full CS submission state.
Allocates an amdgpu context, creates GTT-backed BOs for the indirect buffer, input scratch (~2 MiB), output, signal and shader pages, maps each into a fixed low GPU VA so the kernel does not need to re-bind them per submission, uploads the gfx1201 PM4 kernels into the shader page, and creates a persistent BO list referencing all five BOs.
method
TokenBoundary.deinit
pub fn deinit(self: *TokenBoundary) void Tear down every kernel resource the `init*` paths created: destroy the BO list, free the amdgpu context, `munmap` each CPU mapping, and close the render-node file descriptor.
method
TokenBoundary.copyU32
pub fn copyU32(self: *TokenBoundary, value: u32) !u32 Round-trip one `u32` through the GPU as the simplest end-to-end gate: PM4 `COPY_DATA` from the input page to the output page, plus a `WRITE_DATA` of a per-submission sentinel into the signal page.
method
TokenBoundary.produceToken
pub fn produceToken(self: *TokenBoundary, token_id: u32) !u32 Alias for `copyU32` framed as the per-token decode pulse: prove the GPU produced a token by round-tripping its id through a real PM4 submission and fence wait.
method
TokenBoundary.argmaxTop2
pub fn argmaxTop2( self: *TokenBoundary, token0: u32, score0: f32, token1: u32, score1: f32, ) !u32 Dispatch the gfx1201 top-2 argmax kernel and return the selected token.
Loads the argmax program into the compute SGPRs, packs the output VA, two ordered scores, and two token ids into `compute_user_data_2..7`, fires one workgroup, then waits on the signal sentinel before reading the kernel-chosen token from the output page.
method
TokenBoundary.argmaxF32Range
pub fn argmaxF32Range( self: *TokenBoundary, scores: []const f32, start_row: u32, ) !ArgmaxRangeResult Dispatch the gfx1201 ordered-score row-range argmax kernel.
Converts `scores` into sortable u32 keys, copies them into the shared input page, then lets the compute ring select the max row. The returned token id is absolute: `start_row + local_best`.
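The exact key encoding this method uses is not documented here. The sketch below shows the common order-preserving transform as an assumption (flip every bit of negative floats, set the sign bit of non-negative ones), together with a CPU reference for the absolute row index. `orderedKey` and `argmaxRange` are hypothetical names.

```zig
const std = @import("std");

/// Map an f32 to a u32 whose unsigned order matches the float order
/// (for non-NaN inputs).
fn orderedKey(x: f32) u32 {
    const bits: u32 = @bitCast(x);
    return if ((bits & 0x8000_0000) != 0) ~bits else bits | 0x8000_0000;
}

/// CPU reference: absolute row of the largest score. `scores` must be non-empty.
/// Ties keep the lowest row, since only a strictly larger key replaces the best.
fn argmaxRange(scores: []const f32, start_row: u32) u32 {
    var best: u32 = 0;
    var best_key: u32 = orderedKey(scores[0]);
    var i: u32 = 1;
    while (i < scores.len) : (i += 1) {
        const key = orderedKey(scores[i]);
        if (key > best_key) {
            best_key = key;
            best = i;
        }
    }
    return start_row + best;
}

pub fn main() void {
    const scores = [_]f32{ -2.5, 0.0, 7.25, -0.0, 3.0 };
    std.debug.print("best row = {d}\n", .{argmaxRange(&scores, 100)});
}
```

With keys in this form the GPU side can compare plain unsigned integers, which avoids float comparisons in the selection loop.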
method
TokenBoundary.rmsNormElement0
pub fn rmsNormElement0( self: *TokenBoundary, hidden0: f32, inv_rms: f32, weight0: f32, ) !f32 Dispatch the single-element gfx1201 final-RMS-norm kernel.
Stores `hidden0 * inv_rms * weight0` into `output_map[0]` via a real PM4 dispatch on the compute ring, with a signal sentinel verifying retirement.
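A CPU reference for the value this dispatch stores can be sketched as follows. How the caller obtains `inv_rms` is not stated in this reference; the `invRms` helper assumes the usual RMSNorm definition, and its `eps` parameter is likewise an assumption.

```zig
const std = @import("std");

/// Assumed RMSNorm scale: 1 / sqrt(mean(h^2) + eps). `hidden` must be non-empty.
fn invRms(hidden: []const f32, eps: f32) f32 {
    var sum_sq: f32 = 0;
    for (hidden) |h| sum_sq += h * h;
    const n: f32 = @floatFromInt(hidden.len);
    return 1.0 / @sqrt(sum_sq / n + eps);
}

/// The product the kernel is documented to store into the first output element.
fn rmsNormElement0Ref(hidden0: f32, inv_rms: f32, weight0: f32) f32 {
    return hidden0 * inv_rms * weight0;
}

pub fn main() void {
    const hidden = [_]f32{ 2.0, 2.0, 2.0, 2.0 };
    const scale = invRms(&hidden, 0.0);
    std.debug.print("element0 = {d}\n", .{rmsNormElement0Ref(hidden[0], scale, 3.0)});
}
```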
method
TokenBoundary.dmmvF32RowRange
pub fn dmmvF32RowRange( self: *TokenBoundary, input: []const f32, weights_f32: []const u8, rows: u32, cols: u32, output: []f32, ) !void Dispatch the gfx1201 row-range f32 dense matrix-vector kernel.
Copies the input vector and the row-major f32 weight block into the shared input page (64-byte aligned), records PM4 that points the kernel at the input/weights/output pages and the `rows`/`cols` arguments, and waits on the signal sentinel before reading `output`. This is the first row-oriented dense compute kernel the CS path runs; `cols` must be a multiple of 64 and `output` must hold at least `rows` elements.
method
TokenBoundary.dmmvQ4_0RowRange
pub fn dmmvQ4_0RowRange( self: *TokenBoundary, input: []const f32, weights_q4_0: []const u8, rows: u32, cols: u32, output: []f32, ) !void Dispatch the gfx1201 row-range Q4_0 matrix-vector kernel.
Copies the input vector and raw GGML Q4_0 rows into the shared input page, records PM4 for one serial workitem over `rows`, and reads back one f32 result per row. This intentionally validates real quantized model bytes through the native CS path while the full K-parallel DMMV kernel is still under construction.
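For checking GPU results on the host, a CPU reference over the same bytes can be sketched as below. It assumes the standard GGML Q4_0 block: 32 weights in 18 bytes, a little-endian f16 scale followed by 16 bytes of packed nibbles, with the low nibbles holding weights 0..15 and the high nibbles weights 16..31, each offset by -8. All function names are hypothetical.

```zig
const std = @import("std");

const QK4_0 = 32; // weights per block
const BLOCK_BYTES = 18; // 2-byte f16 scale + 16 bytes of packed nibbles

fn readScale(block: []const u8) f32 {
    const bits: u16 = @as(u16, block[0]) | (@as(u16, block[1]) << 8);
    const d: f16 = @bitCast(bits);
    return @as(f32, d);
}

/// Dot product of one Q4_0 row with `input`; input.len must be a multiple of 32.
fn dotRowQ4_0(row: []const u8, input: []const f32) f32 {
    var acc: f32 = 0;
    var b: usize = 0;
    while (b * QK4_0 < input.len) : (b += 1) {
        const block = row[b * BLOCK_BYTES ..][0..BLOCK_BYTES];
        const d = readScale(block);
        const x = input[b * QK4_0 ..][0..QK4_0];
        var i: usize = 0;
        while (i < 16) : (i += 1) {
            const byte = block[2 + i];
            const lo: i32 = @as(i32, byte & 0x0F) - 8;
            const hi: i32 = @as(i32, byte >> 4) - 8;
            acc += d * @as(f32, @floatFromInt(lo)) * x[i];
            acc += d * @as(f32, @floatFromInt(hi)) * x[i + 16];
        }
    }
    return acc;
}

/// Serial row loop, one f32 result per row.
fn dmmvQ4_0Ref(input: []const f32, weights: []const u8, rows: usize, output: []f32) void {
    const row_bytes = input.len / QK4_0 * BLOCK_BYTES;
    var r: usize = 0;
    while (r < rows) : (r += 1) {
        output[r] = dotRowQ4_0(weights[r * row_bytes ..][0..row_bytes], input);
    }
}

/// Test data: 2 rows x 64 cols, scale 1.0; row 0 weights are +1, row 1 weights are -1.
fn exampleWeights() [4 * BLOCK_BYTES]u8 {
    var w: [4 * BLOCK_BYTES]u8 = undefined;
    var b: usize = 0;
    while (b < 4) : (b += 1) {
        const block = w[b * BLOCK_BYTES ..][0..BLOCK_BYTES];
        block[0] = 0x00; // f16 1.0 is 0x3C00, stored little-endian
        block[1] = 0x3C;
        @memset(block[2..], if (b < 2) @as(u8, 0x99) else @as(u8, 0x77));
    }
    return w;
}

pub fn main() void {
    const weights = exampleWeights();
    const input = [_]f32{2.0} ** 64;
    var output: [2]f32 = undefined;
    dmmvQ4_0Ref(&input, &weights, 2, &output);
    std.debug.print("row0 = {d}, row1 = {d}\n", .{ output[0], output[1] });
}
```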
method
TokenBoundary.dmmvQ4_0RowRangeParallel
pub fn dmmvQ4_0RowRangeParallel( self: *TokenBoundary, input: []const f32, weights_q4_0: []const u8, rows: u32, cols: u32, output: []f32, ) !void Dispatch the wave-lane gfx1201 Q4_0 matrix-vector kernel for exactly 64 rows in parallel.
Stages the same source-format rows as `dmmvQ4_0RowRange`, but launches one wave64 workgroup where each lane computes one row. Intended for 64-row LM-head prefix/window ranges where the GPU row values participate in choosing the sampled token.
method
TokenBoundary.dmmvQ4_0TwoRows
pub fn dmmvQ4_0TwoRows( self: *TokenBoundary, input: []const f32, row_a_q4_0: []const u8, row_b_q4_0: []const u8, cols: u32, output: []f32, ) !void Dispatch Q4_0 DMMV for two arbitrary model rows staged back-to-back.
The caller supplies two individual source-format rows, which are packed into the shared staging page as a compact two-row matrix. This lets the current forward path obtain both LM-head top-2 scores from one real DMMV row-range submission even when the rows are not adjacent in vocab.
method
TokenBoundary.dmmvQ4_0ArgmaxRowRange
pub fn dmmvQ4_0ArgmaxRowRange( self: *TokenBoundary, input: []const f32, weights_q4_0: []const u8, rows: u32, cols: u32, ) !DmmvArgmaxResult Dispatch the gfx1201 Q4_0 row-range DMMV kernel that performs argmax in the same submission.
The method stages the exact same source-format input and Q4_0 rows as `dmmvQ4_0RowRange`, but the kernel only stores the local best row and score. The forward path uses this for LM-head prefix/window candidates so a GPU-produced model value can directly participate in sampling without a follow-up direct argmax dispatch over copied logits.
method
TokenBoundary.dmmvQ8_0RowRange
pub fn dmmvQ8_0RowRange( self: *TokenBoundary, input: []const f32, weights_q8_0: []const u8, rows: u32, cols: u32, output: []f32, ) !void Dispatch the gfx1201 row-range Q8_0 matrix-vector kernel.
Copies the input vector and raw GGML Q8_0 rows into the shared input page, records PM4 for one serial workitem over `rows`, and reads back one f32 result per row. This keeps source-format Q8_0 model-slice validation exact while the final K-parallel DMMV kernel is still under construction.
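The matching host-side check for Q8_0 can be sketched the same way, assuming the standard GGML Q8_0 block: 32 weights in 34 bytes, a little-endian f16 scale followed by 32 signed 8-bit quants. The function names are hypothetical.

```zig
const std = @import("std");

const QK8_0 = 32; // weights per block
const BLOCK_BYTES = 34; // 2-byte f16 scale + 32 int8 quants

/// Dot product of one Q8_0 row with `input`; input.len must be a multiple of 32.
fn dotRowQ8_0(row: []const u8, input: []const f32) f32 {
    var acc: f32 = 0;
    var b: usize = 0;
    while (b * QK8_0 < input.len) : (b += 1) {
        const block = row[b * BLOCK_BYTES ..][0..BLOCK_BYTES];
        const bits: u16 = @as(u16, block[0]) | (@as(u16, block[1]) << 8);
        const d16: f16 = @bitCast(bits);
        const d = @as(f32, d16);
        const x = input[b * QK8_0 ..][0..QK8_0];
        var i: usize = 0;
        while (i < QK8_0) : (i += 1) {
            const q: i8 = @bitCast(block[2 + i]);
            acc += d * @as(f32, @floatFromInt(q)) * x[i];
        }
    }
    return acc;
}

/// Serial row loop, one f32 result per row.
fn dmmvQ8_0Ref(input: []const f32, weights: []const u8, rows: usize, output: []f32) void {
    const row_bytes = input.len / QK8_0 * BLOCK_BYTES;
    var r: usize = 0;
    while (r < rows) : (r += 1) {
        output[r] = dotRowQ8_0(weights[r * row_bytes ..][0..row_bytes], input);
    }
}

/// Test data: one 32-column row, scale 0.5, every quant -2, so every weight is -1.
fn exampleRow() [BLOCK_BYTES]u8 {
    var row: [BLOCK_BYTES]u8 = undefined;
    row[0] = 0x00; // f16 0.5 is 0x3800, stored little-endian
    row[1] = 0x38;
    @memset(row[2..], 0xFE); // 0xFE reinterpreted as i8 is -2
    return row;
}

pub fn main() void {
    const row = exampleRow();
    const input = [_]f32{1.0} ** 32;
    var output: [1]f32 = undefined;
    dmmvQ8_0Ref(&input, &row, 1, &output);
    std.debug.print("row0 = {d}\n", .{output[0]});
}
```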
method
TokenBoundary.dmmvQ8_0TwoRowRanges
pub fn dmmvQ8_0TwoRowRanges( self: *TokenBoundary, input: []const f32, weights_a_q8_0: []const u8, rows_a: u32, weights_b_q8_0: []const u8, rows_b: u32, cols: u32, output: []f32, ) !void Dispatch one gfx1201 Q8_0 DMMV kernel over two adjacent logical row ranges that share the same input vector.
The method packs `weights_a` followed by `weights_b` into the staging page, then runs the same compact Q8_0 row-range kernel over `rows_a + rows_b` rows. The output slice receives A's rows first and B's rows second. This is used by the M1 bridge to consume paired SSM alpha/beta projections without paying two CS submissions for the same activation vector.
method
TokenBoundary.dmmvQ8_0TwoRowRangesParallel64
pub fn dmmvQ8_0TwoRowRangesParallel64( self: *TokenBoundary, input: []const f32, weights_a_q8_0: []const u8, rows_a: u32, weights_b_q8_0: []const u8, rows_b: u32, cols: u32, output: []f32, ) !void Dispatch one wave64 Q8_0 DMMV kernel over two packed row ranges totalling exactly 64 rows.
This is the row-parallel companion to `dmmvQ8_0TwoRowRanges` for the current SSM alpha+beta shape: 32 alpha rows plus 32 beta rows. Each lane computes one row from the packed staging block, eliminating the serial per-row loop of the scalar variant.
function
lastErrno
pub fn lastErrno() ?linux.E Errno captured from the most recent `ioctl` issued by this module, or null if the call succeeded.
Useful for surfacing a precise reason after a `SmokeResult.status` indicates a kernel-side failure.
function
setupSmokeDefault
pub fn setupSmokeDefault() SmokeResult Run the bring-up smoke gate against `default_render_node`.
function
setupSmokePath
pub fn setupSmokePath(render_node: []const u8) SmokeResult Run the bring-up smoke gate against the given DRM render node path.
function
submitNopSmokeDefault
pub fn submitNopSmokeDefault() SmokeResult Backwards-compatible alias for `setupSmokeDefault` named for the underlying PM4 NOP+WRITE_DATA stream that exercises the CS path.
function
submitNopSmokePath
pub fn submitNopSmokePath(render_node: []const u8) SmokeResult Full bring-up smoke implementation: open the render node, query compute IP, allocate a context, create GTT-backed IB + signal BOs and map them at fixed low GPU VAs, build a PM4 NOP + `WRITE_DATA` stream, submit it twice through `DRM_IOCTL_AMDGPU_CS`, and verify each fence retires with the expected signal sentinel in the signal BO.
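A PM4 NOP + `WRITE_DATA` stream of the kind described above can be encoded as in the sketch below. The type-3 header layout, opcodes and `WRITE_DATA` control bits follow the amdgpu kernel driver's packet definitions. The exact dwords this module emits are not documented here, so `packet3`, `buildSmokeIb`, the signal VA and the sentinel are illustrative assumptions.

```zig
const std = @import("std");

const PACKET3_NOP: u32 = 0x10;
const PACKET3_WRITE_DATA: u32 = 0x37;
const WRITE_DATA_DST_SEL_MEM: u32 = 5 << 8; // destination: memory
const WRITE_DATA_WR_CONFIRM: u32 = 1 << 20; // wait for the write to land

/// Type-3 header: type[31:30] = 3, count[29:16] = body dwords - 1, opcode[15:8].
/// `body_dwords` must be at least 1.
fn packet3(opcode: u32, body_dwords: u32) u32 {
    return (@as(u32, 3) << 30) | (((body_dwords - 1) & 0x3FFF) << 16) | ((opcode & 0xFF) << 8);
}

/// Record a NOP followed by a WRITE_DATA of `sentinel` to `signal_va`.
/// Returns the number of dwords written; `buf` must hold at least 7.
fn buildSmokeIb(buf: []u32, signal_va: u64, sentinel: u32) usize {
    buf[0] = packet3(PACKET3_NOP, 1);
    buf[1] = 0; // NOP payload, ignored by the command processor
    buf[2] = packet3(PACKET3_WRITE_DATA, 4);
    buf[3] = WRITE_DATA_DST_SEL_MEM | WRITE_DATA_WR_CONFIRM;
    buf[4] = @truncate(signal_va); // destination address, low 32 bits
    buf[5] = @truncate(signal_va >> 32); // destination address, high 32 bits
    buf[6] = sentinel;
    return 7;
}

pub fn main() void {
    var ib: [16]u32 = undefined;
    const n = buildSmokeIb(&ib, 0x1_0000_2000, 0xC0FFEE01);
    std.debug.print("{d} dwords, WRITE_DATA header = 0x{x}\n", .{ n, ib[2] });
}
```

The `ib_bytes` field of the IB chunk would then be the returned dword count times four.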