Last updated: 2026-06-12
Shader Dispatch
DMMV
Wrap the decode-time matrix-vector shader family used for projection ops.
This helper selects quantization-specific DMMV pipelines and records the push constants and workgroup sizes needed for single-token decode.
24 exports shown
struct
DmmvPushConstants
pub const DmmvPushConstants = extern struct Push constants for DMMV shaders (must match GLSL layout).
struct
BatchDmmvPushConstants
pub const BatchDmmvPushConstants = extern struct Push constants for batch DMMV shaders (prefill: multiple columns).
struct
MoeDmmvPushConstants
pub const MoeDmmvPushConstants = extern struct Push constants for MoE DMMV shaders (must match GLSL layout).
Batched expert dispatch: workgroup Y dimension selects expert slot.
struct
MoeBatchedDmmvPushConstants
pub const MoeBatchedDmmvPushConstants = extern struct Push constants for the cross-token batched MoE DMMV (src/shaders/dmmv_q4k_moe_batched.comp).
Each WG handles one (row_pair, expert_slot, token_idx) triple; the routing buffer is flattened [n_tokens × n_experts_used] of expert IDs.
struct
MoeColsDmmvPushConstants
pub const MoeColsDmmvPushConstants = extern struct Push constants for grouped route-column MoE DMMV shaders.
struct
MoeFusedDownAccPushConstants
pub const MoeFusedDownAccPushConstants = extern struct Push constants for the fused MoE down + weighted_acc shader (src/shaders/dmmv_q4k_moe_fused_down_acc.comp).
Same layout as MoeDmmvPushConstants plus n_used (the expert loop is internal to the shader so the dispatch grid drops the Y=n_experts_used dim).
struct
MoeGateUpGegluPushConstants
pub const MoeGateUpGegluPushConstants = extern struct Push constants for Gemma's packed Q4_K MoE gate+up+GEGLU shader.
The shader reads expert ids from the routing buffer, so `expert_stride` spans the full packed gate+up expert and `up_offset` selects the up half.
struct
GemmaTop1MoeBatchPushConstants
pub const GemmaTop1MoeBatchPushConstants = extern struct Push constants for Gemma short-prefill token-batched top-1 MoE shaders.
struct
GemmaTop1GateUpColsPushConstants
pub const GemmaTop1GateUpColsPushConstants = extern struct Push constants for route-packed Gemma short-prefill top-1 gate/up+GEGLU.
struct
OprojMergePushConstants
pub const OprojMergePushConstants = extern struct Push constants for the fused split-K merge + o_proj DMMV-acc shader (src/shaders/dmmv_q4k_o_proj_merge.comp).
Adds the merge-pass parameters to the standard DmmvPushConstants so a single dispatch reads partials, computes per-head LSE merge weights, stages attn_out into LDS, and runs the Q4_K matmul with residual accumulation.
struct
DmmvSigmoidAccPushConstants
pub const DmmvSigmoidAccPushConstants = extern struct Push constants for the fused Q8_0 DMMV + sigmoid-gated scale-accumulate shader (src/shaders/dmmv_q8_0_sigmoid_acc.comp).
Same prefix as DmmvPushConstants but `acc_mode` is replaced by `gate_offset` (the shader always accumulates and always sigmoid-gates; gate_offset selects which f32 in the gate buffer holds the shexp_gate scalar — typically 0).
struct
DmmvGateUpGegluPushConstants
pub const DmmvGateUpGegluPushConstants = extern struct Push constants for the Gemma CPU-MoE fused gate+up+GEGLU shader.
Gate/up offsets are byte offsets into the same packed Q4_K expert tensor; x/y offsets follow the standard DMMV byte convention.
struct
DmmvScaleAccPushConstants
pub const DmmvScaleAccPushConstants = extern struct Push constants for quantized DMMV fused with `y += scale * dot(W, x)`.
struct
DmmvQ8PairPushConstants
pub const DmmvQ8PairPushConstants = extern struct Push constants for the fused Q8_0 pair DMMV shader.
Computes two independent Q8_0 matvecs that share one F32 input vector.
struct
QuantizeQ8_1Push
pub const QuantizeQ8_1Push = extern struct Push constants for the quantize_q8_1 shader.
`ne` = number of f32 input elements (must be a multiple of 32). `num_blocks` = ne / 32. Pass explicitly so the shader does not have to divide.
struct
CountExpertsPush
pub const CountExpertsPush = extern struct Push constants for `count_experts.comp` (effort-6 Step 3 helper). Mirrors llama.cpp's count_experts push so the shader is structurally identical to the upstream version. All strides are in u32 units (not bytes).
For the prefill routing capture buffer with layout slot(token, layer) = (token * n_layers + layer) * (2 * n_experts_used), where the first n_experts_used u32s are expert IDs and the second n_experts_used u32s are f32 weights, configure as:
- ne00 = n_experts_used (cells per token row)
- ne01 = n_tokens (number of token rows)
- nb00 = 1 (consecutive within slot)
- nb01 = 2 * n_experts_used * n_layers (skip n_layers slots per row step)
- a_offset = layer * 2 * n_experts_used (jump to layer's slot in token 0)
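The configuration above can be written out as a small helper. This is only a sketch: `CountExpertsPushSketch` and `countExpertsPushFor` are hypothetical names, and the struct carries just the five fields named in the comment. The real `CountExpertsPush` may have more fields or a different field order.

```zig
const std = @import("std");

/// Hypothetical stand-in holding only the fields named in the doc comment.
/// The real CountExpertsPush layout is not shown on this page and may differ.
const CountExpertsPushSketch = extern struct {
    ne00: u32, // cells per token row
    ne01: u32, // number of token rows
    nb00: u32, // stride between cells, in u32 units
    nb01: u32, // stride between token rows, in u32 units
    a_offset: u32, // start of this layer's slot in token 0, in u32 units
};

fn countExpertsPushFor(n_tokens: u32, n_layers: u32, layer: u32, n_experts_used: u32) CountExpertsPushSketch {
    const slot_u32s = 2 * n_experts_used; // expert IDs followed by f32 weights
    return .{
        .ne00 = n_experts_used,
        .ne01 = n_tokens,
        .nb00 = 1,
        .nb01 = slot_u32s * n_layers,
        .a_offset = layer * slot_u32s,
    };
}

test "push for layer 3 of 40 with top-8 routing" {
    const p = countExpertsPushFor(16, 40, 3, 8);
    try std.testing.expect(p.ne00 == 8);
    try std.testing.expect(p.ne01 == 16);
    try std.testing.expect(p.nb01 == 640);
    try std.testing.expect(p.a_offset == 48);
}
```

All values are in u32 units, matching the stride convention stated for this struct.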
struct
MoeRoutePackPush
pub const MoeRoutePackPush = extern struct Push constants for `moe_route_pack` — turns per-token expert routing into expert-grouped work lists (per-expert counts, packed token IDs, the active-block list, and the indirect dispatch args consumed by the MoE columns kernels).
struct
MulMmQ4KPush
pub const MulMmQ4KPush = extern struct Push constants for `mul_mm_q4k.comp` (effort-6 Step 1 of 5 foundation: tiled Q4_K dense GEMM).
Mirrors the dispatch-side argument shape needed to address an M×K Q4_K weight tensor against a K×N f32 activation tile. All offsets follow the existing dmmv convention: `a_offset` is in BYTES (the shader divides by 4 to index a_u32[]), `b_offset` and `d_offset` are in FLOATS. Layout for B/D is column-major: B[col][k] = data_b[b_offset + col*stride_b + k].
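The mixed offset units are easy to get wrong, so here is a minimal sketch of the two indexing rules stated above. The helper names `bIndex` and `aWordIndex` are hypothetical, and the 4-byte alignment assert is an assumption that the source does not state.

```zig
const std = @import("std");

/// Flat f32 index of B[col][k] in a column-major tile:
/// data_b[b_offset + col*stride_b + k], with b_offset counted in floats.
fn bIndex(b_offset: u32, stride_b: u32, col: u32, k: u32) u32 {
    return b_offset + col * stride_b + k;
}

/// a_offset is a byte offset; the shader divides it by 4 to index a_u32[].
fn aWordIndex(a_offset_bytes: u32) u32 {
    std.debug.assert(a_offset_bytes % 4 == 0); // assumption: offsets are u32-aligned
    return a_offset_bytes / 4;
}

test "byte offsets for A, float offsets for B" {
    try std.testing.expect(bIndex(128, 4096, 2, 5) == 8325);
    try std.testing.expect(aWordIndex(1024) == 256);
}
```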
struct
MulMmQ6KDp4aPush
pub const MulMmQ6KDp4aPush = extern struct Push constants for the int8 DP4a full-tile Q6_K dense-down GEMM.
struct
QuantizeActPush
pub const QuantizeActPush = extern struct Push constants for the one-shot per-32-block activation int8 quantizer.
struct
MulMmQ4KGateUpDp4aPush
pub const MulMmQ4KGateUpDp4aPush = extern struct Push constants for the int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM.
Same fields as `MulMmQ6KDp4aPush` but `stride_b_scale` counts vec2 entries (one per 32-block) so the shader can fetch (scale, dsum) in one read.
struct
MulMmQ4KGateUpDp4aQ8Push
pub const MulMmQ4KGateUpDp4aQ8Push = extern struct Push constants for the int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM that emits Q8_0-style packed activations directly (fused SwiGLU + quantize for the Qwen3.6-27B dense-down DP4a path).
constant
Q8_1_BLOCK_BYTES
pub const Q8_1_BLOCK_BYTES: u32 = 36 Size in bytes of a single Q8_1 output block (32 int8 values + f16 d + f16 d*sum).
struct
DmmvDispatch
pub const DmmvDispatch = struct Manages DMMV pipelines for different quantization types.
Methods (42)
method
DmmvDispatch.init
pub fn init( instance: *const Instance, gpu_config: *const GpuConfig, shader_dir: []const u8, hidden_dim: u32, allocator: std.mem.Allocator, ) !DmmvDispatch Create the DMMV dispatch wrapper and load the supported quantized pipelines.
method
DmmvDispatch.pipelineForType
pub fn pipelineForType(self: *const DmmvDispatch, quant_type: GGMLType) ?*const Pipeline Select the quantization-specific pipeline used for a weight matrix format.
method
DmmvDispatch.moePipelineForType
pub fn moePipelineForType(self: *const DmmvDispatch, quant_type: GGMLType) ?*const Pipeline Select the MoE-specific pipeline for the given weight format (4 bindings: A, x, y, routing).
method
DmmvDispatch.recordMoeDispatch
pub fn recordMoeDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, quant_type: GGMLType, descriptor_set: vk.c.VkDescriptorSet, M: u32, K: u32, expert_stride: u32, n_experts_y: u32, x_expert_stride: u32, x_offset: u32, y_offset: u32, ) !void Record a batched MoE DMMV dispatch where all experts run in parallel via Y workgroups.
method
DmmvDispatch.recordMoeBatchedDispatch
pub fn recordMoeBatchedDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, routing_buf: vk.c.VkBuffer, routing_size: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, n_experts_used: u32, n_tokens: u32, x_token_stride: u32, y_token_stride: u32, ) !void Record a cross-token batched MoE DMMV dispatch.
Reads N tokens' inputs from `X_batch[N × K]`, dispatches one WG per (row_pair, expert_slot, token_idx), routes via flattened `routing[N × n_experts_used]`, writes to `Y_batch[N × n_experts_used × M]`. Q4_K only for now.
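As a rough illustration of the flattened layouts named above, the sketch below computes where one (token, expert slot) pair lands. It assumes a dense row-major flatten with no padding, i.e. that `y_token_stride` equals `n_experts_used * M` elements. The source does not state the stride units, so treat `yRowStart` in particular as an assumption; both helper names are hypothetical.

```zig
const std = @import("std");

/// Index into the flattened routing[n_tokens x n_experts_used] buffer of expert IDs.
fn routingIndex(token: u32, slot: u32, n_experts_used: u32) u32 {
    return token * n_experts_used + slot;
}

/// First element of the M-long output row for (token, slot) in
/// Y_batch[N x n_experts_used x M], assuming a dense layout without padding.
fn yRowStart(token: u32, slot: u32, n_experts_used: u32, M: u32) u32 {
    return routingIndex(token, slot, n_experts_used) * M;
}

test "token 3, expert slot 1, top-8 routing" {
    try std.testing.expect(routingIndex(3, 1, 8) == 25);
    try std.testing.expect(yRowStart(3, 1, 8, 2048) == 51200);
}
```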
method
DmmvDispatch.recordMoeRoutePack
pub fn recordMoeRoutePack( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, routing_buf: vk.c.VkBuffer, routing_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_count_buf: vk.c.VkBuffer, active_count_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, dispatch_args_buf: vk.c.VkBuffer, dispatch_args_size: vk.c.VkDeviceSize, n_tokens: u32, n_experts: u32, k: u32, routing_stride: u32, ids_stride: u32, gate_up_workgroups_x: u32, down_workgroups_x: u32, routing_token_base: u32, ) !void Record the single-workgroup `moe_route_pack` dispatch that compacts per-token expert routing into expert-grouped work lists for the MoE columns kernels.
Reads `routing_buf`/`ids_buf` and writes the per-expert `counts_buf`, the packed `active_blocks_buf` + `active_count_buf`, and `dispatch_args_buf` for the subsequent indirect MoE dispatch. Each buffer is paired with its byte size. Returns `error.InvalidArgument` if any token, expert, k, stride, or workgroup count is zero.
method
DmmvDispatch.recordMoeColsDispatchIndirect
pub fn recordMoeColsDispatchIndirect( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, quant_type: GGMLType, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, indirect_buf: vk.c.VkBuffer, indirect_offset: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, ids_stride: u32, x_route_divisor: u32, a_offset: u32, x_offset: u32, y_offset: u32, accumulate: bool, ) !void Record an **indirect** MoE columns DMMV — the per-expert weight×activation matvec whose workgroup count is read from `indirect_buf` rather than the host.
Indirect form: the active-block count is GPU-resident (produced by `recordMoeRoutePack`), so the host never stalls on a readback. Each storage buffer is paired with its byte size; `indirect_buf` takes a byte offset (`indirect_offset`) instead. Returns `error.InvalidArgument` if M, K, or ids_stride is zero, or if K is not a multiple of 256.
method
DmmvDispatch.recordGemmaTop1GateUpGegluColsDispatchIndirect
pub fn recordGemmaTop1GateUpGegluColsDispatchIndirect( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, indirect_buf: vk.c.VkBuffer, indirect_offset: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, up_offset: u32, ids_stride: u32, x_route_divisor: u32, a_offset: u32, y_offset: u32, ) !void No public doc comment yet.
method
DmmvDispatch.recordMoeColsDispatch
pub fn recordMoeColsDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, quant_type: GGMLType, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, active_block_count: u32, ids_stride: u32, x_route_divisor: u32, a_offset: u32, x_offset: u32, y_offset: u32, ) !void Record a **direct** MoE columns DMMV using a host-known `active_block_count` (the non-indirect sibling of `recordMoeColsDispatchIndirect`).
Prefer this when the active block count is already known on the CPU; otherwise use the indirect form to avoid a GPU→CPU readback. This variant always overwrites `y_buf` (no accumulate). Each buffer is paired with its byte size. Returns `error.InvalidArgument` if M, K, active_block_count, or ids_stride is zero, or if K is not a multiple of 256.
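A caller can run the same argument checks up front, before recording anything. The sketch below mirrors only the conditions listed above; `checkMoeColsArgs` and `K_ALIGN` are hypothetical names, not part of this module.

```zig
const std = @import("std");

const K_ALIGN: u32 = 256; // K must be 256-aligned, per the rule above

/// Hypothetical mirror of the documented checks; not the library's own code.
fn checkMoeColsArgs(M: u32, K: u32, active_block_count: u32, ids_stride: u32) error{InvalidArgument}!void {
    if (M == 0 or K == 0 or active_block_count == 0 or ids_stride == 0) return error.InvalidArgument;
    if (K % K_ALIGN != 0) return error.InvalidArgument;
}

test "zero counts and misaligned K are rejected" {
    try checkMoeColsArgs(2048, 4096, 4, 8);
    try std.testing.expectError(error.InvalidArgument, checkMoeColsArgs(2048, 4000, 4, 8));
    try std.testing.expectError(error.InvalidArgument, checkMoeColsArgs(2048, 4096, 0, 8));
}
```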
method
DmmvDispatch.recordDispatch
pub fn recordDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, quant_type: GGMLType, descriptor_set: vk.c.VkDescriptorSet, M: u32, K: u32, a_offset: u32, x_offset: u32, y_offset: u32, ) !void Record a decode-time matrix-vector multiply dispatch.
method
DmmvDispatch.recordBatchDispatchPush
pub fn recordBatchDispatchPush( self: *const DmmvDispatch, cmd: *CommandBuffer, quant_type: GGMLType, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, M: u32, K: u32, a_offset: u32, x_offset: u32, y_offset: u32, num_cols: u32, ) !void Record a push-descriptor batch DMMV dispatch covering `num_cols` token columns.
Bindings: 0 = A weight matrix, 1 = X_batch (K × num_cols, column-major), 2 = Y_batch (M × num_cols, column-major).
method
DmmvDispatch.recordQuantizeQ8_1
pub fn recordQuantizeQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, ne: u32, ) !void Record a dispatch that quantizes `ne` f32 elements from `a_buf` into Q8_1 blocks (36 bytes each) in `d_buf`.
Foundation for mul_mmq — no production callers yet. Requires `ne` to be a multiple of 32. Returns `error.PipelineNotLoaded` when the shader is unavailable, `error.InvalidArgument` when ne is not a multiple of 32.
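The sizing arithmetic implied above (32 elements per block, 36 bytes per output block) can be sketched as follows. `Q8_1Plan` and `planQuantizeQ8_1` are hypothetical helpers for illustration; the constant is redeclared locally so the block stands alone.

```zig
const std = @import("std");

const Q8_1_BLOCK_BYTES: u32 = 36; // 32 int8 values + f16 d + f16 d*sum
const Q8_1_BLOCK_ELEMS: u32 = 32;

const Q8_1Plan = struct {
    num_blocks: u32, // value to pass as `num_blocks` in the push constants
    out_bytes: u64, // minimum bytes needed in d_buf
};

/// Hypothetical helper: validates `ne` and derives the block count and output size.
fn planQuantizeQ8_1(ne: u32) error{InvalidArgument}!Q8_1Plan {
    if (ne % Q8_1_BLOCK_ELEMS != 0) return error.InvalidArgument;
    const num_blocks = ne / Q8_1_BLOCK_ELEMS;
    return .{
        .num_blocks = num_blocks,
        .out_bytes = @as(u64, num_blocks) * Q8_1_BLOCK_BYTES,
    };
}

test "4096 floats become 128 blocks of 36 bytes" {
    const plan = try planQuantizeQ8_1(4096);
    try std.testing.expect(plan.num_blocks == 128);
    try std.testing.expect(plan.out_bytes == 4608);
    try std.testing.expectError(error.InvalidArgument, planQuantizeQ8_1(100));
}
```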
method
DmmvDispatch.recordCountExperts
pub fn recordCountExperts( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, routing_buf: vk.c.VkBuffer, routing_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, n_tokens: u32, n_layers: u32, layer: u32, n_experts_used: u32, n_experts: u32, d_offset_bytes: vk.c.VkDeviceSize, ) !void Effort 6 Step 3: dispatch the count_experts shader. For one layer, scan a routing buffer that stores per-(token, layer) topk expert IDs and produce a `[n_experts]` u32 count buffer.
Layout assumed for `routing_buf`: slot(token, layer) starts at byte offset `(token * n_layers + layer) * (2 * n_experts_used) * 4`; the first n_experts_used u32s are expert IDs, the next n_experts_used u32s are f32 weights (mirrors router_output_buf packing in forward.zig:5316+).
`counts_buf` must be sized for at least `n_experts * sizeof(u32)`. Caller is responsible for clearing or overwriting it (the shader writes one element per expert, indexed by gl_WorkGroupID.x).
Returns `error.PipelineNotLoaded` if the count_experts shader is unavailable, `error.InvalidArgument` for zero token counts.
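For sizing buffers on the host, the byte-offset formula and the minimum `counts_buf` size stated above work out as in this sketch. The function names are hypothetical, and u64 arithmetic is used here only to sidestep overflow on large buffers.

```zig
const std = @import("std");

/// Byte offset of slot(token, layer) in the routing buffer, per the layout above.
fn routingSlotByteOffset(token: u64, layer: u64, n_layers: u64, n_experts_used: u64) u64 {
    return (token * n_layers + layer) * (2 * n_experts_used) * @sizeOf(u32);
}

/// Minimum size of counts_buf: one u32 per expert.
fn minCountsBytes(n_experts: u64) u64 {
    return n_experts * @sizeOf(u32);
}

test "slot offset for token 2, layer 3 of 40, top-8 routing" {
    try std.testing.expect(routingSlotByteOffset(2, 3, 40, 8) == 5312);
    try std.testing.expect(minCountsBytes(128) == 512);
}
```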
method
DmmvDispatch.recordMulMmQ4K
pub fn recordMulMmQ4K( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Effort-6 Step 1: dispatch the tiled Q4_K dense GEMM (`mul_mm_q4k.comp`). Computes D[M, N] = A[M, K] (Q4_K) × B[K, N] (f32), where B and D are column-major (B[col][k] = data_b[b_offset + col*stride_b + k], analogously for D).
Tile shape: WG = 64 threads producing a 32 × 32 output tile. Dispatch grid: ((M+31)/32) × ((N+31)/32) × 1.
Constraints:
- K must be a multiple of 256 (Q4_K super-block size).
- `a_offset` is in BYTES; `b_offset` and `d_offset` are in FLOATS.
- Caller is responsible for any preceding clear of D.
Returns `error.PipelineNotLoaded` if mul_mm_q4k.spv isn't loaded, `error.InvalidArgument` for K-misaligned inputs.
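The dispatch grid and the K constraint above amount to a ceiling division per output dimension. A minimal sketch, with the hypothetical names `Grid` and `mulMmQ4KGrid`:

```zig
const std = @import("std");

const TILE: u32 = 32; // each workgroup produces a 32 x 32 output tile
const Q4_K_SUPERBLOCK: u32 = 256;

const Grid = struct { x: u32, y: u32, z: u32 };

/// Hypothetical helper: ((M+31)/32) x ((N+31)/32) x 1, rejecting misaligned K.
fn mulMmQ4KGrid(M: u32, N: u32, K: u32) error{InvalidArgument}!Grid {
    if (K % Q4_K_SUPERBLOCK != 0) return error.InvalidArgument;
    return .{
        .x = (M + TILE - 1) / TILE,
        .y = (N + TILE - 1) / TILE,
        .z = 1,
    };
}

test "70 token columns need three column tiles" {
    const g = try mulMmQ4KGrid(4096, 70, 4096);
    try std.testing.expect(g.x == 128);
    try std.testing.expect(g.y == 3);
    try std.testing.expect(g.z == 1);
    try std.testing.expectError(error.InvalidArgument, mulMmQ4KGrid(4096, 70, 4100));
}
```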
method
DmmvDispatch.recordMulMmQ4KGateUpSwiglu
pub fn recordMulMmQ4KGateUpSwiglu( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Tiled Q4_K batched dense FFN front-end: computes silu(gate_weight * B) * (up_weight * B) directly into D.
Ragged (M or N not multiples of 32) shapes are handled by boundary checks in the shader.
method
DmmvDispatch.recordMulMmQ4KTail8
pub fn recordMulMmQ4KTail8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void BN=8 Q4_K GEMM for narrow token tails after the generic 32-column path.
Requires row-aligned M; N may be 1..8 and is bounds-checked in-shader.
method
DmmvDispatch.recordMulMmQ4KGateUpGeglu
pub fn recordMulMmQ4KGateUpGeglu( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Tiled Q4_K batched Gemma dense FFN front-end: computes gelu(gate_weight * B) * (up_weight * B) directly into D.
Same binding and push layout as recordMulMmQ4KGateUpSwiglu.
method
DmmvDispatch.recordMulMmQ4KGateUpGegluFull
pub fn recordMulMmQ4KGateUpGegluFull( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Branchless full-tile Q4_K gate/up/GEGLU GEMM for Gemma dense prefill.
Requires M and N to be multiples of 32; ragged token tails use the checked recordMulMmQ4KGateUpGeglu path.
method
DmmvDispatch.recordMulMmQ4KGateUpGegluTail8
pub fn recordMulMmQ4KGateUpGegluTail8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void BN=8 Q4_K gate/up/GEGLU GEMM for narrow token tails after the 32-column full-tile path.
Requires row-aligned M; N may be 1..8.
method
DmmvDispatch.recordMulMmQ4KGateUpSwigluFull
pub fn recordMulMmQ4KGateUpSwigluFull( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Branchless full-tile Q4_K gate/up/SwiGLU GEMM.
Requires M and N to be multiples of 32; ragged token tails use the checked recordMulMmQ4KGateUpSwiglu path.
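On the host side, the choice between the branchless and checked variants reduces to an alignment test. The sketch below shows one way a call site might express it; `GateUpPath` and `chooseGateUpPath` are hypothetical, and the real call site may instead split a prompt into an aligned body plus a tail.

```zig
const std = @import("std");

const GateUpPath = enum {
    full_tile, // recordMulMmQ4KGateUpSwigluFull: M and N both multiples of 32
    checked, // recordMulMmQ4KGateUpSwiglu: boundary checks handle ragged shapes
};

/// Hypothetical chooser based only on the alignment rule stated above.
fn chooseGateUpPath(M: u32, N: u32) GateUpPath {
    if (M % 32 == 0 and N % 32 == 0) return .full_tile;
    return .checked;
}

test "ragged token counts fall back to the checked path" {
    try std.testing.expect(chooseGateUpPath(4096, 64) == .full_tile);
    try std.testing.expect(chooseGateUpPath(4096, 70) == .checked);
}
```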
method
DmmvDispatch.recordMulMmQ6K
pub fn recordMulMmQ6K( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Tiled Q6_K dense GEMM.
Same push/layout as recordMulMmQ4K. Used by Qwen3.6-27B layer-major prefill for dense-down and SSM wqkv.
method
DmmvDispatch.recordMulMmQ6KTail8
pub fn recordMulMmQ6KTail8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void BN=8 Q6_K GEMM for narrow token tails after the 32-column full-tile path.
Requires row-aligned M; N may be 1..8 and is bounds-checked in-shader.
method
DmmvDispatch.recordMulMmQ5K
pub fn recordMulMmQ5K( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Tiled Q5_K dense GEMM.
Same push/layout as recordMulMmQ4K/recordMulMmQ6K. Used by Qwen3.6-27B layer-major prefill for the SSM out projection, which otherwise falls through to the dmmv_q5k one-WG-per-row batched path.
method
DmmvDispatch.recordMulMmQ8_0
pub fn recordMulMmQ8_0( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Tiled Q8_0 dense GEMM for Qwen3.6 A3B layer-major prefill (SSM out projection).
Uses the same `MulMmQ4KPush` layout as `recordMulMmQ4K`, but K must be a multiple of 32 (not 256).
method
DmmvDispatch.recordMulMmQ8_0FullDp4a
pub fn recordMulMmQ8_0FullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void Record an int8 DP4a full-tile Q8_0 GEMM.
The activation must already be quantized with recordQuantizeActQ8. Ragged token tails stay on the f32 recordMulMmQ8_0 path at the call site.
method
DmmvDispatch.recordMulMmQ6KFull
pub fn recordMulMmQ6KFull( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void Branchless full-tile Q6_K GEMM for 32-aligned token counts; the host routes only 32-aligned M/N tiles here.
method
DmmvDispatch.recordQuantizeActQ8
pub fn recordQuantizeActQ8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, src_buf: vk.c.VkBuffer, src_size: vk.c.VkDeviceSize, dst_packed_buf: vk.c.VkBuffer, dst_packed_size: vk.c.VkDeviceSize, dst_scale_buf: vk.c.VkBuffer, dst_scale_size: vk.c.VkDeviceSize, n_tokens: u32, K: u32, ) !void Quantize an f32 activation matrix to packed int8 + per-32-block scales (one shot, no per-tile redundancy) for the int8 DP4a dense-down GEMM.
method
DmmvDispatch.recordMulMmQ6KFullDp4a
pub fn recordMulMmQ6KFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void Record the int8 DP4a full-tile Q6_K dense-down GEMM.
Weights are Q6_K; the activation arrives pre-quantized (packed int8 + per-32-block f32 scale) from recordQuantizeActQ8. Output is token-major f32 [N][M].
method
DmmvDispatch.recordMulMmQ6KRagged72Dp4a
pub fn recordMulMmQ6KRagged72Dp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void Guarded BN=72 int8 DP4a Q6_K dense-down GEMM for Gemma 4 31B's 65-72 token public prompt shape.
This covers the 64-column body and <=8-column ragged tail in one pass over the K=21504 down weights.
method
DmmvDispatch.recordMulMmQ6KTail8Dp4a
pub fn recordMulMmQ6KTail8Dp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_offset: u32, d_offset: u32, ) !void BN=8 int8 DP4a Q6_K dense-down tail for Gemma 4 31B short prompts.
The packed/scaled activation descriptors are offset to the first tail token, while the output uses d_offset so the token-major destination layout remains contiguous with the full 64-token prefix.
method
DmmvDispatch.recordMulMmQ6KFullDp4aQ8_1
pub fn recordMulMmQ6KFullDp4aQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void Q8_1-input variant: same Q6_K DP4a GEMM but the activation scale buffer is `vec2 b_scale_dsum[]` per 32-block (Q4_K-style layout).
The shader reads `.x` only — `.y` (dsum) is unused since Q6_K weights have no per-block bias term. Used by the Qwen3.6-27B SSM wqkv projection so it can share a single Q8_1 quantize of scratch_norm with the Q4_K z projection. Push constant stride_b_scale = K/32 (number of vec2 entries per token), same numeric value as the Q8_0 variant since the indexing is in typed-element units.
method
DmmvDispatch.recordQuantizeActQ8_1
pub fn recordQuantizeActQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, src_buf: vk.c.VkBuffer, src_size: vk.c.VkDeviceSize, dst_packed_buf: vk.c.VkBuffer, dst_packed_size: vk.c.VkDeviceSize, dst_scale_dsum_buf: vk.c.VkBuffer, dst_scale_dsum_size: vk.c.VkDeviceSize, n_tokens: u32, K: u32, ) !void Q8_1-style activation quantize: packed int8 + per-32-block (scale, dsum) for the DP4a Q4_K gate+up GEMM bias-correction term.
method
DmmvDispatch.recordMulMmQ4KGateUpSwigluFullDp4a
pub fn recordMulMmQ4KGateUpSwigluFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM for Qwen3.6-27B dense FFN prefill.
Activations arrive pre-quantized from recordQuantizeActQ8_1 (packed int8 + per-32-block (scale, dsum)). Output is silu(gate)*up, token-major f32 [N][M].
method
DmmvDispatch.recordMulMmQ4KGateUpGegluFullDp4a
pub fn recordMulMmQ4KGateUpGegluFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void int8 DP4a full-tile Q4_K gate+up+GEGLU GEMM for Gemma dense FFN prefill.
Activations arrive pre-quantized from recordQuantizeActQ8_1 (packed int8 + per-32-block (scale, dsum)). Output is gelu(gate)*up, token-major f32 [N][M].
method
DmmvDispatch.recordMulMmQ4KGateUpSwigluFullDp4aQ8
pub fn recordMulMmQ4KGateUpSwigluFullDp4aQ8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_buf: vk.c.VkBuffer, d_scale_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_packed_offset: u32, d_scale_offset: u32, ) !void int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM that emits Q8_0-style packed activation directly.
Output layout matches quantize_act_q8.comp (per-token packed int8 + per-32-block scale), so the downstream dense-down DP4a kernel can consume it directly without the standalone quantize_act_q8 dispatch + barrier. The f32 SwiGLU intermediate is never written to global memory.
method
DmmvDispatch.recordMulMmQ4KGateUpGegluFullDp4aQ8
pub fn recordMulMmQ4KGateUpGegluFullDp4aQ8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_buf: vk.c.VkBuffer, d_scale_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_offset: u32, d_packed_offset: u32, d_scale_offset: u32, ) !void Gemma GEGLU variant of recordMulMmQ4KGateUpSwigluFullDp4aQ8.
Emits Q8_0-style packed GEGLU activation for Q6_K dense-down DP4a.
method
DmmvDispatch.recordMulMmQ4KGateUpSwigluFullDp4aQ8_1
pub fn recordMulMmQ4KGateUpSwigluFullDp4aQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_dsum_buf: vk.c.VkBuffer, d_scale_dsum_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_packed_offset: u32, d_scale_dsum_offset: u32, ) !void Q4_K-down sibling of recordMulMmQ4KGateUpSwigluFullDp4aQ8.
Same fused gate+up+SwiGLU DP4a GEMM, but emits Q8_1-style activation (packed int8 + per-32-block (scale, dsum) vec2) so the downstream mul_mm_q4k_full_dp4a (Q4_K-down) consumer can skip the standalone quantize_act_q8_1 dispatch + barrier. dsum = scale * sum(int8_lanes) is computed inside the kernel via subgroupClusteredAdd cluster_size=TPR_M=8, so there's no LDS round-trip beyond the GEMM's existing barriers. Caller is responsible for sizing d_scale_dsum_buf as 2x the Q8_0 scale buffer (per-block vec2 instead of per-block float).
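The 2x sizing rule follows from the per-block entry type: one f32 for Q8_0-style scales versus a vec2 (two f32) for Q8_1-style (scale, dsum). The sketch below assumes tightly packed buffers with no padding; the function names are hypothetical, and `elems_per_token` stands for whichever dimension is split into 32-element blocks (M for the output buffers of this kernel).

```zig
const std = @import("std");

/// Number of 32-element blocks per token, i.e. scale entries per token.
fn blocksPerToken(elems_per_token: u32) u32 {
    return elems_per_token / 32;
}

/// Bytes for a Q8_0-style scale buffer: one f32 per block.
fn q8_0ScaleBytes(n_tokens: u64, elems_per_token: u32) u64 {
    return n_tokens * blocksPerToken(elems_per_token) * @sizeOf(f32);
}

/// Bytes for a Q8_1-style buffer: one vec2 (scale, dsum) per block.
fn q8_1ScaleDsumBytes(n_tokens: u64, elems_per_token: u32) u64 {
    return n_tokens * blocksPerToken(elems_per_token) * (2 * @sizeOf(f32));
}

test "Q8_1 scale buffer is twice the Q8_0 one" {
    try std.testing.expect(blocksPerToken(5120) == 160);
    try std.testing.expect(q8_0ScaleBytes(64, 5120) == 40960);
    try std.testing.expect(q8_1ScaleDsumBytes(64, 5120) == 81920);
}
```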
method
DmmvDispatch.recordMulMmQ4KGateUpGegluFullDp4aQ8_1
pub fn recordMulMmQ4KGateUpGegluFullDp4aQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_dsum_buf: vk.c.VkBuffer, d_scale_dsum_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_offset: u32, d_packed_offset: u32, d_scale_dsum_offset: u32, ) !void Gemma GEGLU variant of recordMulMmQ4KGateUpSwigluFullDp4aQ8_1.
Emits Q8_1-style packed GEGLU activation for Q4_K dense-down DP4a.
method
DmmvDispatch.recordMulMmQ5KFullDp4a
pub fn recordMulMmQ5KFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void int8 DP4a full-tile single Q5_K GEMM (no fused activation).
Used by the Qwen3.6-27B SSM out prefill projection (M=hidden_dim, K=d_inner). Activations arrive pre-quantized (packed int8 + per-32-block (scale, dsum)) from recordQuantizeActQ8_1. Output is token-major f32 [N][M]. Same push/binding shape as recordMulMmQ4KFullDp4a — the only difference is the 5-bit weight unpack inside the kernel.
method
DmmvDispatch.recordMulMmQ4KFullDp4a
pub fn recordMulMmQ4KFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void int8 DP4a full-tile single Q4_K GEMM (no fused activation).
Used by the Qwen3.6-27B SSM z prefill projection. Activations arrive pre-quantized (packed int8 + per-32-block (scale, dsum)) from recordQuantizeActQ8_1. Output is token-major f32 [N][M].
method
DmmvDispatch.recordMulMmQ4KTail8Dp4a
pub fn recordMulMmQ4KTail8Dp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_dsum_offset: u32, d_offset: u32, ) !void BN=8 int8 DP4a Q4_K dense-down tail for Gemma 4 31B short prompts.
Descriptor offsets select the tail token slice of the pre-quantized Q8_1 activation buffers; d_offset keeps the f32 output token-major.
method
DmmvDispatch.deinit
pub fn deinit(self: *DmmvDispatch) void Destroy the loaded pipelines and descriptor pool.