Last updated: 2026-06-12

Shader Dispatch

DMMV


Wrap the decode-time matrix-vector shader family used for projection ops.

This helper selects quantization-specific DMMV pipelines and records the push constants and workgroup sizes needed for single-token decode.

24 exports 42 methods src/compute/dmmv.zig


struct

DmmvPushConstants

pub const DmmvPushConstants = extern struct

Push constants for DMMV shaders (must match GLSL layout).

src/compute/dmmv.zig:21
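
Because these structs are `extern`, their field order and padding follow the C ABI, which is what allows them to line up with a GLSL push-constant block. A minimal sketch of how such a layout could be pinned down at compile time follows. `ExamplePush` and its fields are hypothetical stand-ins borrowed from parameter names used elsewhere on this page; this is not the actual definition of `DmmvPushConstants`.

```zig
const std = @import("std");

// Hypothetical stand-in, NOT the real DmmvPushConstants definition.
const ExamplePush = extern struct {
    M: u32,
    K: u32,
    a_offset: u32,
    x_offset: u32,
    y_offset: u32,
    acc_mode: u32,
};

comptime {
    // Six u32 fields pack tightly: 24 bytes, no padding.
    std.debug.assert(@sizeOf(ExamplePush) == 24);
    std.debug.assert(@offsetOf(ExamplePush, "y_offset") == 16);
    // Vulkan guarantees at least 128 bytes of push-constant space.
    std.debug.assert(@sizeOf(ExamplePush) <= 128);
}
```

A compile-time check like this fails the build as soon as a field is added on one side without updating the other.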

struct

BatchDmmvPushConstants

pub const BatchDmmvPushConstants = extern struct

Push constants for batch DMMV shaders (prefill: multiple columns).

src/compute/dmmv.zig:31

struct

MoeDmmvPushConstants

pub const MoeDmmvPushConstants = extern struct

Push constants for MoE DMMV shaders (must match GLSL layout).

Batched expert dispatch: workgroup Y dimension selects expert slot.

src/compute/dmmv.zig:42

struct

MoeBatchedDmmvPushConstants

pub const MoeBatchedDmmvPushConstants = extern struct

Push constants for the cross-token batched MoE DMMV (src/shaders/dmmv_q4k_moe_batched.comp).

Each WG handles one (row_pair, expert_slot, token_idx) triple; the routing buffer is a flattened [n_tokens × n_experts_used] array of expert IDs.

src/compute/dmmv.zig:55

struct

MoeColsDmmvPushConstants

pub const MoeColsDmmvPushConstants = extern struct

Push constants for grouped route-column MoE DMMV shaders.

src/compute/dmmv.zig:65

struct

MoeFusedDownAccPushConstants

pub const MoeFusedDownAccPushConstants = extern struct

Push constants for the fused MoE down + weighted_acc shader (src/shaders/dmmv_q4k_moe_fused_down_acc.comp).

Same layout as MoeDmmvPushConstants plus n_used (the expert loop is internal to the shader so the dispatch grid drops the Y=n_experts_used dim).

src/compute/dmmv.zig:81

struct

MoeGateUpGegluPushConstants

pub const MoeGateUpGegluPushConstants = extern struct

Push constants for Gemma's packed Q4_K MoE gate+up+GEGLU shader.

The shader reads expert ids from the routing buffer, so `expert_stride` spans the full packed gate+up expert and `up_offset` selects the up half.

src/compute/dmmv.zig:95

struct

GemmaTop1MoeBatchPushConstants

pub const GemmaTop1MoeBatchPushConstants = extern struct

Push constants for Gemma short-prefill token-batched top-1 MoE shaders.

src/compute/dmmv.zig:105

struct

GemmaTop1GateUpColsPushConstants

pub const GemmaTop1GateUpColsPushConstants = extern struct

Push constants for route-packed Gemma short-prefill top-1 gate/up+GEGLU.

src/compute/dmmv.zig:114

struct

OprojMergePushConstants

pub const OprojMergePushConstants = extern struct

Push constants for the fused split-K merge + o_proj DMMV-acc shader (src/shaders/dmmv_q4k_o_proj_merge.comp).

Adds the merge-pass parameters to the standard DmmvPushConstants so a single dispatch reads partials, computes per-head LSE merge weights, stages attn_out into LDS, and runs the Q4_K matmul with residual accumulation.

src/compute/dmmv.zig:130

struct

DmmvSigmoidAccPushConstants

pub const DmmvSigmoidAccPushConstants = extern struct

Push constants for the fused Q8_0 DMMV + sigmoid-gated scale-accumulate shader (src/shaders/dmmv_q8_0_sigmoid_acc.comp).

Same prefix as DmmvPushConstants but `acc_mode` is replaced by `gate_offset` (the shader always accumulates and always sigmoid-gates; gate_offset selects which f32 in the gate buffer holds the shexp_gate scalar — typically 0).

src/compute/dmmv.zig:148
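
The gated accumulate described for `DmmvSigmoidAccPushConstants` can be written as a scalar CPU reference for one output row. This is only a sketch under assumptions: the weight row is taken as already dequantized to f32, and `sigmoidAccRow` is a hypothetical helper, not part of the module. It assumes Zig 0.11 or later.

```zig
const std = @import("std");

fn sigmoid(v: f32) f32 {
    return 1.0 / (1.0 + @exp(-v));
}

/// y[row] += sigmoid(gate[gate_offset]) * dot(w_row, x).
/// The weight row is plain f32 here purely for illustration.
fn sigmoidAccRow(
    y: []f32,
    row: usize,
    w_row: []const f32,
    x: []const f32,
    gate: []const f32,
    gate_offset: usize,
) void {
    std.debug.assert(w_row.len == x.len);
    var dot: f32 = 0.0;
    for (w_row, x) |w, xv| dot += w * xv;
    // Always accumulates and always gates, matching the description above.
    y[row] += sigmoid(gate[gate_offset]) * dot;
}
```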

struct

DmmvGateUpGegluPushConstants

pub const DmmvGateUpGegluPushConstants = extern struct

Push constants for the Gemma CPU-MoE fused gate+up+GEGLU shader.

Gate/up offsets are byte offsets into the same packed Q4_K expert tensor; x/y offsets follow the standard DMMV byte convention.

src/compute/dmmv.zig:160

struct

DmmvScaleAccPushConstants

pub const DmmvScaleAccPushConstants = extern struct

Push constants for quantized DMMV fused with `y += scale * dot(W, x)`.

src/compute/dmmv.zig:170

struct

DmmvQ8PairPushConstants

pub const DmmvQ8PairPushConstants = extern struct

Push constants for the fused Q8_0 pair DMMV shader.

Computes two independent Q8_0 matvecs that share one F32 input vector.

src/compute/dmmv.zig:181

struct

QuantizeQ8_1Push

pub const QuantizeQ8_1Push = extern struct

Push constants for the quantize_q8_1 shader.

`ne` is the number of f32 input elements (must be a multiple of 32) and `num_blocks` = ne / 32. `num_blocks` is passed explicitly so the shader does not have to divide.

src/compute/dmmv.zig:195

struct

CountExpertsPush

pub const CountExpertsPush = extern struct

Push constants for `count_experts.comp` (effort-6 Step 3 helper). Mirrors llama.cpp's count_experts push so the shader is structurally identical to the upstream version. All strides are in u32 units (not bytes).

For the prefill routing capture buffer with layout slot(token, layer) = (token * n_layers + layer) * (2 * n_experts_used), where the first n_experts_used u32s are expert IDs and the second n_experts_used u32s are f32 weights, configure as:
ne00 = n_experts_used (cells per token row)
ne01 = n_tokens (number of token rows)
nb00 = 1 (consecutive within slot)
nb01 = 2 * n_experts_used * n_layers (skip n_layers slots per row step)
a_offset = layer * 2 * n_experts_used (jump to layer's slot in token 0)

src/compute/dmmv.zig:213
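
The `CountExpertsPush` configuration listed above can be derived mechanically from the routing-buffer shape. The sketch below does that arithmetic; `CountExpertsParams` and `countExpertsParams` are hypothetical names, and the real `CountExpertsPush` may carry fields beyond the five shown.

```zig
const std = @import("std");

// Hypothetical holder for the five values named in the text.
const CountExpertsParams = struct {
    ne00: u32,
    ne01: u32,
    nb00: u32,
    nb01: u32,
    a_offset: u32,
};

/// All strides and offsets are in u32 units, not bytes.
fn countExpertsParams(
    n_tokens: u32,
    n_layers: u32,
    layer: u32,
    n_experts_used: u32,
) CountExpertsParams {
    std.debug.assert(layer < n_layers);
    // One slot holds n_experts_used IDs followed by n_experts_used weights.
    const slot_u32s = 2 * n_experts_used;
    return .{
        .ne00 = n_experts_used,
        .ne01 = n_tokens,
        .nb00 = 1,
        .nb01 = slot_u32s * n_layers,
        .a_offset = layer * slot_u32s,
    };
}
```

For example, 16 tokens, 40 layers, layer 3, and 8 experts per token give `nb01 = 640` and `a_offset = 48`.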

struct

MoeRoutePackPush

pub const MoeRoutePackPush = extern struct

Push constants for `moe_route_pack` — turns per-token expert routing into expert-grouped work lists (per-expert counts, packed token IDs, the active-block list, and the indirect dispatch args consumed by the MoE columns kernels).

src/compute/dmmv.zig:224

struct

MulMmQ4KPush

pub const MulMmQ4KPush = extern struct

Push constants for `mul_mm_q4k.comp` (effort-6 Step 1 of 5 foundation: tiled Q4_K dense GEMM).

Mirrors the dispatch-side argument shape needed to address an M×K Q4_K weight tensor against a K×N f32 activation tile. All offsets follow the existing dmmv convention: `a_offset` is in BYTES (the shader divides by 4 to index a_u32[]), `b_offset` and `d_offset` are in FLOATS. Layout for B/D is column-major: B[col][k] = data_b[b_offset + col*stride_b + k].

src/compute/dmmv.zig:242
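
The mixed units in `MulMmQ4KPush` (bytes for A, floats for B and D) are easy to get wrong, so a small sketch of the addressing may help. `bIndex` and `aWordIndex` are hypothetical helpers that restate the formulas from the description; they are not functions in the module.

```zig
const std = @import("std");

/// Float index of B[col][k] under the column-major convention:
/// data_b[b_offset + col*stride_b + k].
fn bIndex(b_offset: u32, col: u32, stride_b: u32, k: u32) u32 {
    std.debug.assert(k < stride_b);
    return b_offset + col * stride_b + k;
}

/// Index into a u32 view of A for a byte offset, mirroring the
/// shader-side divide by 4.
fn aWordIndex(a_offset_bytes: u32) u32 {
    std.debug.assert(a_offset_bytes % 4 == 0);
    return a_offset_bytes / 4;
}
```

The same `bIndex` arithmetic applies to D with `d_offset` and `stride_d`.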

struct

MulMmQ6KDp4aPush

pub const MulMmQ6KDp4aPush = extern struct

Push constants for the int8 DP4a full-tile Q6_K dense-down GEMM.

src/compute/dmmv.zig:254

struct

QuantizeActPush

pub const QuantizeActPush = extern struct

Push constants for the one-shot per-32-block activation int8 quantizer.

src/compute/dmmv.zig:266

struct

MulMmQ4KGateUpDp4aPush

pub const MulMmQ4KGateUpDp4aPush = extern struct

Push constants for the int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM.

Same fields as `MulMmQ6KDp4aPush` but `stride_b_scale` counts vec2 entries (one per 32-block) so the shader can fetch (scale, dsum) in one read.

src/compute/dmmv.zig:276

struct

MulMmQ4KGateUpDp4aQ8Push

pub const MulMmQ4KGateUpDp4aQ8Push = extern struct

Push constants for the int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM that emits Q8_0-style packed activations directly (fused SwiGLU + quantize for the Qwen3.6-27B dense-down DP4a path).

src/compute/dmmv.zig:290

constant

Q8_1_BLOCK_BYTES

pub const Q8_1_BLOCK_BYTES: u32 = 36

Size in bytes of a single Q8_1 output block (32 int8 values + f16 d + f16 d*sum).

src/compute/dmmv.zig:306
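
The 36-byte figure follows from the block contents (32 one-byte values plus two 2-byte f16 fields), and the output buffer size follows from `ne / 32` blocks. A minimal sketch of that sizing, where `q8_1OutputBytes` is a hypothetical helper rather than part of the module:

```zig
const std = @import("std");

const VALUES_PER_BLOCK: u32 = 32;
// 32 int8 values + f16 d + f16 d*sum = 36 bytes.
const Q8_1_BLOCK_BYTES: u32 = VALUES_PER_BLOCK + 2 + 2;

/// Bytes needed to hold the Q8_1 output for `ne` f32 inputs.
fn q8_1OutputBytes(ne: u32) error{InvalidArgument}!u64 {
    if (ne % VALUES_PER_BLOCK != 0) return error.InvalidArgument;
    const num_blocks = ne / VALUES_PER_BLOCK;
    return @as(u64, num_blocks) * Q8_1_BLOCK_BYTES;
}
```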

struct

DmmvDispatch

pub const DmmvDispatch = struct

Manages DMMV pipelines for different quantization types.

src/compute/dmmv.zig:309

Methods

42

method

DmmvDispatch.init

pub fn init( instance: *const Instance, gpu_config: *const GpuConfig, shader_dir: []const u8, hidden_dim: u32, allocator: std.mem.Allocator, ) !DmmvDispatch

Create the DMMV dispatch wrapper and load the supported quantized pipelines.

Parameters
instance
Active Vulkan instance and logical device.
gpu_config
Derived GPU tuning parameters (wave size, push-descriptor support).
shader_dir
Directory containing compiled SPIR-V shader binaries (`.spv` files).
hidden_dim
Maximum K value used by the Q4_K and F32 shaders' shared-memory array; must be at least as large as the largest of the model's hidden_dim, inter_dim, q_dim, and d_inner.
allocator
Allocator used for temporary pipeline creation state.
Returns

A fully initialised DmmvDispatch ready to record projection work; pipelines whose shaders are missing are silently left as null.

src/compute/dmmv.zig:648

method

DmmvDispatch.pipelineForType

pub fn pipelineForType(self: *const DmmvDispatch, quant_type: GGMLType) ?*const Pipeline

Select the quantization-specific pipeline used for a weight matrix format.

Parameters
self
Dispatch wrapper containing the loaded DMMV pipelines.
quant_type
GGML quantization format for the weight matrix.
Returns

A pipeline pointer when that quantization format has a loaded shader implementation.

Notes

Unsupported or unloaded formats return `null` so callers can surface `error.UnsupportedQuantType`.

src/compute/dmmv.zig:1625
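
The optional return of `pipelineForType` pairs naturally with `orelse` at the call site. The sketch below shows that pattern with hypothetical stand-ins (`QuantType`, `Pipeline`, `Table`, `requirePipeline`) in place of the real `GGMLType`, `Pipeline`, and `DmmvDispatch` types, whose definitions are not shown on this page.

```zig
const std = @import("std");

// Hypothetical stand-ins for GGMLType, Pipeline, and DmmvDispatch.
const QuantType = enum { q4_k, q6_k };
const Pipeline = struct { name: []const u8 };

const Table = struct {
    q4_k: ?Pipeline = null,
    q6_k: ?Pipeline = null,

    /// Returns null when no shader was loaded for the format.
    fn pipelineForType(self: *const Table, t: QuantType) ?*const Pipeline {
        return switch (t) {
            .q4_k => if (self.q4_k) |*p| p else null,
            .q6_k => if (self.q6_k) |*p| p else null,
        };
    }
};

/// Caller-side pattern: turn a missing pipeline into an error.
fn requirePipeline(
    table: *const Table,
    t: QuantType,
) error{UnsupportedQuantType}!*const Pipeline {
    return table.pipelineForType(t) orelse return error.UnsupportedQuantType;
}
```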

method

DmmvDispatch.moePipelineForType

pub fn moePipelineForType(self: *const DmmvDispatch, quant_type: GGMLType) ?*const Pipeline

Select the MoE-specific pipeline for the given weight format (4 bindings: A, x, y, routing).

Parameters
self
Dispatch wrapper containing the loaded DMMV pipelines.
quant_type
GGML quantization format for the MoE expert weight matrix.
Returns

A pipeline pointer when a MoE shader is loaded for the format, or null for unsupported types.

src/compute/dmmv.zig:1644

method

DmmvDispatch.recordMoeDispatch

pub fn recordMoeDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, quant_type: GGMLType, descriptor_set: vk.c.VkDescriptorSet, M: u32, K: u32, expert_stride: u32, n_experts_y: u32, x_expert_stride: u32, x_offset: u32, y_offset: u32, ) !void

Record a batched MoE DMMV dispatch where all experts run in parallel via Y workgroups.

Parameters
cmd
Command buffer to record into.
quant_type
Weight quantization; resolved via `moePipelineForType`.
descriptor_set
Descriptor set with bindings A, x, y, routing.
M
Output rows (weight rows per expert).
K
Contraction width (shared across all experts).
expert_stride
Byte stride between consecutive experts in the stacked weight tensor.
n_experts_y
Number of experts to process; becomes the Y workgroup dimension.
x_expert_stride
Element stride between consecutive experts' input vectors (0 = shared input; K = per-expert).
x_offset
Element offset into the input buffer.
y_offset
Element offset into the output buffer.
Returns

`error.UnsupportedQuantType` when no MoE pipeline is loaded for `quant_type`.

src/compute/dmmv.zig:1667

method

DmmvDispatch.recordMoeBatchedDispatch

pub fn recordMoeBatchedDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, routing_buf: vk.c.VkBuffer, routing_size: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, n_experts_used: u32, n_tokens: u32, x_token_stride: u32, y_token_stride: u32, ) !void

Record a cross-token batched MoE DMMV dispatch.

Reads N tokens' inputs from `X_batch[N × K]`, dispatches one WG per (row_pair, expert_slot, token_idx), routes via flattened `routing[N × n_experts_used]`, writes to `Y_batch[N × n_experts_used × M]`. Q4_K only for now.

src/compute/dmmv.zig:1704
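
The flattened shapes used by `recordMoeBatchedDispatch` imply simple index arithmetic. The sketch below assumes row-major, tightly packed buffers (that is, `y_token_stride` equal to `n_experts_used * M`); that packing is an assumption, and `routingIndex` and `yIndex` are hypothetical helpers.

```zig
const std = @import("std");

/// Index into routing[N × n_experts_used], token outermost.
fn routingIndex(token: u32, slot: u32, n_experts_used: u32) u32 {
    std.debug.assert(slot < n_experts_used);
    return token * n_experts_used + slot;
}

/// Index of output row `row` in Y_batch[N × n_experts_used × M],
/// assuming no padding between tokens or expert slots.
fn yIndex(token: u32, slot: u32, row: u32, n_experts_used: u32, m: u32) u32 {
    std.debug.assert(row < m);
    return routingIndex(token, slot, n_experts_used) * m + row;
}
```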

method

DmmvDispatch.recordMoeRoutePack

pub fn recordMoeRoutePack( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, routing_buf: vk.c.VkBuffer, routing_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_count_buf: vk.c.VkBuffer, active_count_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, dispatch_args_buf: vk.c.VkBuffer, dispatch_args_size: vk.c.VkDeviceSize, n_tokens: u32, n_experts: u32, k: u32, routing_stride: u32, ids_stride: u32, gate_up_workgroups_x: u32, down_workgroups_x: u32, routing_token_base: u32, ) !void

Record the single-workgroup `moe_route_pack` dispatch that compacts per-token expert routing into expert-grouped work lists for the MoE columns kernels.

Reads `routing_buf`/`ids_buf` and writes the per-expert `counts_buf`, the packed `active_blocks_buf` + `active_count_buf`, and `dispatch_args_buf` for the subsequent indirect MoE dispatch. Each buffer is paired with its byte size. Returns `error.InvalidArgument` on a zero token, expert, k, stride, or workgroup count.

Parameters
cmd
Command buffer to record into.
push_desc_fn
Optional push-descriptor function (null uses bound descriptor sets).
n_tokens
Number of tokens being routed.
n_experts
Total expert count in the layer.
k
Experts selected per token (top-k).
routing_stride
Per-token stride (elements) into the routing buffer.
ids_stride
Per-token stride into the packed IDs buffer.
gate_up_workgroups_x
Workgroups-X to encode into the gate/up columns dispatch args.
down_workgroups_x
Workgroups-X to encode into the down columns dispatch args.
routing_token_base
First token offset within the routing buffer.
Returns

`error.PipelineNotLoaded` if the route-pack pipeline is absent, or `error.InvalidArgument` for the zero-valued arguments described above.

src/compute/dmmv.zig:1770

method

DmmvDispatch.recordMoeColsDispatchIndirect

pub fn recordMoeColsDispatchIndirect( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, quant_type: GGMLType, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, indirect_buf: vk.c.VkBuffer, indirect_offset: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, ids_stride: u32, x_route_divisor: u32, a_offset: u32, x_offset: u32, y_offset: u32, accumulate: bool, ) !void

Record an **indirect** MoE columns DMMV — the per-expert weight×activation matvec whose workgroup count is read from `indirect_buf` rather than the host.

Indirect form: the active-block count is GPU-resident (produced by `recordMoeRoutePack`), so the host never stalls on a readback. Each buffer is paired with its byte size. Returns `error.InvalidArgument` when M, K, or ids_stride is zero, or when K is not a multiple of 256.

Parameters
cmd
Command buffer to record into.
push_desc_fn
Optional push-descriptor function (null uses bound descriptor sets).
quant_type
Weight quantization; only `q4_k` and `q5_k` are supported here.
indirect_buf
Buffer holding the `VkDispatchIndirectCommand` workgroup dims.
indirect_offset
Byte offset of the dispatch args within `indirect_buf`.
M
Output rows (expert weight rows).
K
Contraction width; must be a multiple of 256.
expert_stride
Per-expert stride (elements) into the weight buffer.
ids_stride
Per-token stride into the packed IDs buffer.
x_route_divisor
Divisor mapping output rows back to source activation rows.
accumulate
When true, add into `y_buf` instead of overwriting it.
Returns

`error.UnsupportedQuantType` for non-q4_k/q5_k weights, or `error.InvalidArgument` for the zero or misaligned dimensions described above.

src/compute/dmmv.zig:1846
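
For reference, Vulkan's `VkDispatchIndirectCommand` is three consecutive u32 workgroup counts, and `vkCmdDispatchIndirect` requires its offset to be a multiple of 4. The sketch below mirrors that layout; the idea that `indirect_buf` holds a tightly packed array of such commands is an assumption for illustration, and `indirectOffset` is a hypothetical helper.

```zig
const std = @import("std");

// Field-for-field mirror of Vulkan's VkDispatchIndirectCommand.
const DispatchIndirectCommand = extern struct {
    x: u32,
    y: u32,
    z: u32,
};

comptime {
    std.debug.assert(@sizeOf(DispatchIndirectCommand) == 12);
}

/// Byte offset of the `index`-th command in a tightly packed array
/// (hypothetical layout; the real buffer layout is not documented here).
fn indirectOffset(index: u64) u64 {
    const off = index * @sizeOf(DispatchIndirectCommand);
    // vkCmdDispatchIndirect requires a 4-byte-aligned offset.
    std.debug.assert(off % 4 == 0);
    return off;
}
```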

method

DmmvDispatch.recordGemmaTop1GateUpGegluColsDispatchIndirect

pub fn recordGemmaTop1GateUpGegluColsDispatchIndirect( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, indirect_buf: vk.c.VkBuffer, indirect_offset: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, up_offset: u32, ids_stride: u32, x_route_divisor: u32, a_offset: u32, y_offset: u32, ) !void

No public doc comment yet.

src/compute/dmmv.zig:1911

method

DmmvDispatch.recordMoeColsDispatch

pub fn recordMoeColsDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, quant_type: GGMLType, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, ids_buf: vk.c.VkBuffer, ids_size: vk.c.VkDeviceSize, active_blocks_buf: vk.c.VkBuffer, active_blocks_size: vk.c.VkDeviceSize, M: u32, K: u32, expert_stride: u32, active_block_count: u32, ids_stride: u32, x_route_divisor: u32, a_offset: u32, x_offset: u32, y_offset: u32, ) !void

Record a **direct** MoE columns DMMV using a host-known `active_block_count` (the non-indirect sibling of `recordMoeColsDispatchIndirect`).

Prefer this when the active block count is already known on the CPU; otherwise use the indirect form to avoid a GPU→CPU readback. This variant always overwrites `y_buf` (no accumulate). Each buffer is paired with its byte size. Returns `error.InvalidArgument` when M, K, active_block_count, or ids_stride is zero, or when K is not a multiple of 256.

Parameters
cmd
Command buffer to record into.
push_desc_fn
Optional push-descriptor function (null uses bound descriptor sets).
quant_type
Weight quantization; only `q4_k` and `q5_k` are supported here.
M
Output rows (expert weight rows).
K
Contraction width; must be a multiple of 256.
expert_stride
Per-expert stride (elements) into the weight buffer.
active_block_count
Number of active expert blocks to dispatch (workgroups-Y).
ids_stride
Per-token stride into the packed IDs buffer.
x_route_divisor
Divisor mapping output rows back to source activation rows.
Returns

`error.UnsupportedQuantType` for non-q4_k/q5_k weights, or `error.InvalidArgument` for the zero or misaligned dimensions described above.

src/compute/dmmv.zig:1986

method

DmmvDispatch.recordDispatch

pub fn recordDispatch( self: *const DmmvDispatch, cmd: *CommandBuffer, quant_type: GGMLType, descriptor_set: vk.c.VkDescriptorSet, M: u32, K: u32, a_offset: u32, x_offset: u32, y_offset: u32, ) !void

Record a decode-time matrix-vector multiply dispatch.

Parameters
self
Dispatch wrapper containing the quantization-specific pipelines.
cmd
Command buffer currently being recorded.
quant_type
GGML quantization format for the weight matrix.
descriptor_set
Descriptor set containing matrix, input vector, and output buffers.
M
Output row count.
K
Input feature width.
a_offset
Byte offset for the weight matrix.
x_offset
Byte offset for the input vector.
y_offset
Byte offset for the output vector.
Returns

`error.UnsupportedQuantType` when no pipeline is available for `quant_type`.

Notes

The helper uses one workgroup per 2 output rows for most quantized formats.

src/compute/dmmv.zig:2062
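
The note on `recordDispatch` about one workgroup per 2 output rows amounts to a rounded-up division. A minimal sketch, where `workgroupCount` is a hypothetical helper and the rows-per-workgroup value for any particular format is an assumption to be checked against the shader:

```zig
const std = @import("std");

/// Number of workgroups needed to cover `m` output rows when each
/// workgroup produces `rows_per_workgroup` rows (rounded up).
fn workgroupCount(m: u32, rows_per_workgroup: u32) u32 {
    std.debug.assert(rows_per_workgroup > 0);
    return (m + rows_per_workgroup - 1) / rows_per_workgroup;
}
```

With 2 rows per workgroup, an odd `M` still gets a final workgroup for its last row, so the shader has to guard the out-of-range second row.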

method

DmmvDispatch.recordBatchDispatchPush

pub fn recordBatchDispatchPush( self: *const DmmvDispatch, cmd: *CommandBuffer, quant_type: GGMLType, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, x_buf: vk.c.VkBuffer, x_size: vk.c.VkDeviceSize, y_buf: vk.c.VkBuffer, y_size: vk.c.VkDeviceSize, M: u32, K: u32, a_offset: u32, x_offset: u32, y_offset: u32, num_cols: u32, ) !void

Record a push-descriptor batch DMMV dispatch covering `num_cols` token columns.

Bindings: 0 = A weight matrix, 1 = X_batch (K × num_cols, column-major), 2 = Y_batch (M × num_cols, column-major).

Parameters
cmd
Command buffer to record into.
quant_type
Weight quantization; only Q4_K and Q6_K batch shaders are supported.
push_desc_fn
Push-descriptor function pointer (null falls back to bound descriptor sets).
M
Output row count (weight matrix rows).
K
Contraction width (hidden dimension).
num_cols
Number of token columns to process in this batch.
Returns

`error.UnsupportedQuantType` if no batch shader is loaded for `quant_type`.

src/compute/dmmv.zig:2142

method

DmmvDispatch.recordQuantizeQ8_1

pub fn recordQuantizeQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, ne: u32, ) !void

Record a dispatch that quantizes `ne` f32 elements from `a_buf` into Q8_1 blocks (36 bytes each) in `d_buf`.

Foundation for mul_mmq — no production callers yet. Requires `ne` to be a multiple of 32. Returns `error.PipelineNotLoaded` when the shader is unavailable, `error.InvalidArgument` when ne is not a multiple of 32.

src/compute/dmmv.zig:2195

method

DmmvDispatch.recordCountExperts

pub fn recordCountExperts( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, routing_buf: vk.c.VkBuffer, routing_size: vk.c.VkDeviceSize, counts_buf: vk.c.VkBuffer, counts_size: vk.c.VkDeviceSize, n_tokens: u32, n_layers: u32, layer: u32, n_experts_used: u32, n_experts: u32, d_offset_bytes: vk.c.VkDeviceSize, ) !void

Effort 6 Step 3: dispatch the count_experts shader. For one layer, scan a routing buffer that stores per-(token, layer) top-k expert IDs and produce a `[n_experts]` u32 count buffer.

Layout assumed for `routing_buf`: slot(token, layer) starts at byte offset (token * n_layers + layer) * (2 * n_experts_used) * 4; the first n_experts_used u32s are expert IDs, and the next n_experts_used u32s are f32 weights (mirrors router_output_buf packing in forward.zig:5316+).

`counts_buf` must be sized for at least `n_experts * sizeof(u32)`. Caller is responsible for clearing or overwriting it (the shader writes one element per expert, indexed by gl_WorkGroupID.x).

Returns `error.PipelineNotLoaded` if the count_experts shader is unavailable, `error.InvalidArgument` for zero token counts.

src/compute/dmmv.zig:2243
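
A CPU reference for what `recordCountExperts` is described as producing can be useful when validating the shader output. The sketch below follows the routing layout stated above, with offsets in u32 units; `countExpertsRef` is a hypothetical helper, and it assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// counts[e] = number of (token, slot) pairs in `layer` routed to expert e.
/// `routing` is the buffer viewed as u32 words.
fn countExpertsRef(
    routing: []const u32,
    counts: []u32,
    n_tokens: usize,
    n_layers: usize,
    layer: usize,
    n_experts_used: usize,
) void {
    @memset(counts, 0);
    var token: usize = 0;
    while (token < n_tokens) : (token += 1) {
        // Slot start in u32 units; the IDs occupy the first half of the slot.
        const base = (token * n_layers + layer) * (2 * n_experts_used);
        for (routing[base .. base + n_experts_used]) |expert_id| {
            counts[expert_id] += 1;
        }
    }
}
```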

method

DmmvDispatch.recordMulMmQ4K

pub fn recordMulMmQ4K( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Effort-6 Step 1: dispatch the tiled Q4_K dense GEMM (`mul_mm_q4k.comp`). Computes D[M, N] = A[M, K] (Q4_K) × B[K, N] (f32), where B and D are column-major (B[col][k] = data_b[b_offset + col*stride_b + k], analogously for D).

Tile shape: WG = 64 threads producing a 32 × 32 output tile. Dispatch grid: ((M+31)/32) × ((N+31)/32) × 1.

Constraints:
- K must be a multiple of 256 (Q4_K super-block size).
- `a_offset` is in BYTES; `b_offset` and `d_offset` are in FLOATS.
- Caller is responsible for any preceding clear of D.

Returns `error.PipelineNotLoaded` if mul_mm_q4k.spv isn't loaded, `error.InvalidArgument` for K-misaligned inputs.

src/compute/dmmv.zig:2307
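
The dispatch grid and the K constraint for `recordMulMmQ4K` can be sketched as a small host-side helper. `Grid` and `mulMmQ4KGrid` are hypothetical names; the sketch checks only the K alignment that the text documents.

```zig
const std = @import("std");

const Grid = struct { x: u32, y: u32, z: u32 };

/// Grid for 32 × 32 output tiles: ((M+31)/32) × ((N+31)/32) × 1.
fn mulMmQ4KGrid(m: u32, n: u32, k: u32) error{InvalidArgument}!Grid {
    // K must cover whole Q4_K super-blocks of 256 values.
    if (k % 256 != 0) return error.InvalidArgument;
    return .{ .x = (m + 31) / 32, .y = (n + 31) / 32, .z = 1 };
}
```

For M = 4096 and N = 70 this yields a 128 × 3 × 1 grid; the third column of tiles is only partly filled and relies on in-shader bounds checks.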

method

DmmvDispatch.recordMulMmQ4KGateUpSwiglu

pub fn recordMulMmQ4KGateUpSwiglu( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Tiled Q4_K batched dense FFN front-end: computes silu(gate_weight * B) * (up_weight * B) directly into D.

Ragged (M or N not multiples of 32) shapes are handled by boundary checks in the shader.

Parameters
M
Output rows (gate/up weight rows, i.e. inter_dim).
N
Token batch size (number of columns).
K
Contraction width; must be a multiple of 256.
Returns

`error.PipelineNotLoaded` if the gate+up+SwiGLU pipeline is absent, or `error.InvalidArgument` for zero/misaligned K or zero M/N.

src/compute/dmmv.zig:2364
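
Per output element, the fused epilogue of `recordMulMmQ4KGateUpSwiglu` reduces to a scalar formula applied to the two dot products. A minimal CPU reference, assuming the standard definition silu(v) = v / (1 + exp(-v)); `silu` and `swiglu` are hypothetical helpers, not functions in the module.

```zig
const std = @import("std");

/// silu(v) = v * sigmoid(v) = v / (1 + exp(-v)).
fn silu(v: f32) f32 {
    return v / (1.0 + @exp(-v));
}

/// Combine one gate dot product and one up dot product:
/// silu(gate_weight * B) * (up_weight * B) for a single (row, column).
fn swiglu(gate_dot: f32, up_dot: f32) f32 {
    return silu(gate_dot) * up_dot;
}
```

Fusing this step into the GEMM avoids writing the separate gate and up results to memory before combining them.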

method

DmmvDispatch.recordMulMmQ4KTail8

pub fn recordMulMmQ4KTail8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

BN=8 Q4_K GEMM for narrow token tails after the generic 32-column path.

Requires row-aligned M; N may be 1..8 and is bounds-checked in-shader.

src/compute/dmmv.zig:2419

method

DmmvDispatch.recordMulMmQ4KGateUpGeglu

pub fn recordMulMmQ4KGateUpGeglu( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Tiled Q4_K batched Gemma dense FFN front-end: computes gelu(gate_weight * B) * (up_weight * B) directly into D.

Same binding and push layout as recordMulMmQ4KGateUpSwiglu.

src/compute/dmmv.zig:2469

method

DmmvDispatch.recordMulMmQ4KGateUpGegluFull

pub fn recordMulMmQ4KGateUpGegluFull( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Branchless full-tile Q4_K gate/up/GEGLU GEMM for Gemma dense prefill.

Requires M and N to be multiples of 32; ragged token tails use the checked recordMulMmQ4KGateUpGeglu path.

src/compute/dmmv.zig:2525
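
The split between the full-tile, Tail8, and checked variants suggests a host-side routing step over the token columns. The sketch below shows one possible policy; it is a hypothetical illustration, not the call-site logic actually used, and `TailPath`, `ColumnSplit`, and `splitColumns` are invented names.

```zig
const std = @import("std");

const TailPath = enum { none, tail8, checked };

const ColumnSplit = struct {
    full_cols: u32, // columns handled by the branchless full-tile path
    tail_cols: u32, // leftover columns
    tail_path: TailPath,
};

/// Hypothetical policy: send whole 32-column tiles to the full-tile
/// kernel, tails of 1..8 columns to the BN=8 kernel, and anything
/// else to the bounds-checked kernel.
fn splitColumns(n: u32) ColumnSplit {
    const tail = n % 32;
    const path: TailPath = if (tail == 0)
        .none
    else if (tail <= 8)
        .tail8
    else
        .checked;
    return .{ .full_cols = n - tail, .tail_cols = tail, .tail_path = path };
}
```

The full-tile kernels additionally require M to be a multiple of 32, which a real caller would need to verify before taking this route.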

method

DmmvDispatch.recordMulMmQ4KGateUpGegluTail8

pub fn recordMulMmQ4KGateUpGegluTail8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

BN=8 Q4_K gate/up/GEGLU GEMM for narrow token tails after the 32-column full-tile path.

Requires row-aligned M; N may be 1..8.

src/compute/dmmv.zig:2578

method

DmmvDispatch.recordMulMmQ4KGateUpSwigluFull

pub fn recordMulMmQ4KGateUpSwigluFull( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Branchless full-tile Q4_K gate/up/SwiGLU GEMM.

Requires M and N to be multiples of 32; ragged token tails use the checked recordMulMmQ4KGateUpSwiglu path.

src/compute/dmmv.zig:2632

method

DmmvDispatch.recordMulMmQ6K

pub fn recordMulMmQ6K( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Tiled Q6_K dense GEMM.

Same push/layout as recordMulMmQ4K. Used by Qwen3.6-27B layer-major prefill for dense-down and SSM wqkv.

src/compute/dmmv.zig:2685

method

DmmvDispatch.recordMulMmQ6KTail8

pub fn recordMulMmQ6KTail8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

BN=8 Q6_K GEMM for narrow token tails after the 32-column full-tile path.

Requires row-aligned M; N may be 1..8 and is bounds-checked in-shader.

src/compute/dmmv.zig:2737

method

DmmvDispatch.recordMulMmQ5K

pub fn recordMulMmQ5K( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Tiled Q5_K dense GEMM.

Same push/layout as recordMulMmQ4K/recordMulMmQ6K. Used by Qwen3.6-27B layer-major prefill for the SSM out projection, which otherwise falls through to the dmmv_q5k one-WG-per-row batched path.

src/compute/dmmv.zig:2788

method

DmmvDispatch.recordMulMmQ8_0

pub fn recordMulMmQ8_0( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Tiled Q8_0 dense GEMM for Qwen3.6 A3B layer-major prefill (SSM out projection).

Uses the same `MulMmQ4KPush` layout as `recordMulMmQ4K`, but K must be a multiple of 32 (not 256).

Returns

`error.PipelineNotLoaded` if the Q8_0 GEMM pipeline is absent, or `error.InvalidArgument` for K not a multiple of 32 or zero M/N.

src/compute/dmmv.zig:2841

method

DmmvDispatch.recordMulMmQ8_0FullDp4a

pub fn recordMulMmQ8_0FullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

Record an int8 DP4a full-tile Q8_0 GEMM.

The activation must already be quantized with recordQuantizeActQ8. Ragged token tails stay on the f32 recordMulMmQ8_0 path at the call site.

src/compute/dmmv.zig:2892
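
The DP4a operation that these int8 paths rely on is a 4-lane signed dot product with 32-bit accumulation. A scalar reference is sketched below under assumptions: lane 0 sits in the low byte of each word, and overflow is not handled the way hardware might handle it. `dp4a` here is a hypothetical helper for illustration and assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// acc + sum over 4 lanes of (signed 8-bit lane of a) * (signed 8-bit lane of b).
/// Lane 0 is assumed to be the low byte of each packed word.
fn dp4a(a: u32, b: u32, acc: i32) i32 {
    var sum = acc;
    var i: u5 = 0;
    while (i < 4) : (i += 1) {
        const shift: u5 = i * 8;
        // Extract one byte and reinterpret it as a signed lane.
        const la: i8 = @bitCast(@as(u8, @truncate(a >> shift)));
        const lb: i8 = @bitCast(@as(u8, @truncate(b >> shift)));
        sum += @as(i32, la) * @as(i32, lb);
    }
    return sum;
}
```

The integer sum is then scaled by the weight and activation block scales to recover the floating-point result.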

method

DmmvDispatch.recordMulMmQ6KFull

pub fn recordMulMmQ6KFull( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_buf: vk.c.VkBuffer, b_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, stride_b: u32, stride_d: u32, a_offset: u32, b_offset: u32, d_offset: u32, ) !void

Branchless full-tile Q6_K GEMM for 32-aligned token counts; the host routes only 32-aligned M/N tiles here.

Returns

`error.PipelineNotLoaded` if the full-tile pipeline is absent, or `error.InvalidArgument` for misaligned or zero dimensions.

Notes

M and N must both be multiples of 32; ragged tails must use `recordMulMmQ6K` instead.

src/compute/dmmv.zig:2943

method

DmmvDispatch.recordQuantizeActQ8

pub fn recordQuantizeActQ8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, src_buf: vk.c.VkBuffer, src_size: vk.c.VkDeviceSize, dst_packed_buf: vk.c.VkBuffer, dst_packed_size: vk.c.VkDeviceSize, dst_scale_buf: vk.c.VkBuffer, dst_scale_size: vk.c.VkDeviceSize, n_tokens: u32, K: u32, ) !void

Quantize an f32 activation matrix to packed int8 + per-32-block scales (one shot, no per-tile redundancy) for the int8 DP4a dense-down GEMM.

Parameters
src_buf
Token-major f32 activation [n_tokens][K].
dst_packed_buf
Token-major packed int8 [n_tokens][K/4] (4 lanes/uint).
dst_scale_buf
Token-major f32 scale [n_tokens][K/32].

src/compute/dmmv.zig:2996
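
The buffer shapes listed for `recordQuantizeActQ8` translate directly into byte sizes. A minimal sizing sketch follows; `ActQ8Sizes` and `actQ8Sizes` are hypothetical names, and the rejection of K values that are not a multiple of 32 is an assumption implied by the per-32-block scales rather than a documented error.

```zig
const std = @import("std");

const ActQ8Sizes = struct {
    packed_bytes: u64, // [n_tokens][K/4] u32 words, 4 int8 lanes each
    scale_bytes: u64, // [n_tokens][K/32] f32 scales
};

fn actQ8Sizes(n_tokens: u32, k: u32) error{InvalidArgument}!ActQ8Sizes {
    // Assumed constraint: K must cover whole 32-value blocks.
    if (k % 32 != 0) return error.InvalidArgument;
    const tokens: u64 = n_tokens;
    return .{
        .packed_bytes = tokens * (k / 4) * @sizeOf(u32),
        .scale_bytes = tokens * (k / 32) * @sizeOf(f32),
    };
}
```

For 64 tokens and K = 4096 this gives 262144 bytes of packed values and 32768 bytes of scales.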

method

DmmvDispatch.recordMulMmQ6KFullDp4a

pub fn recordMulMmQ6KFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

Record the int8 DP4a full-tile Q6_K dense-down GEMM.

Weights are Q6_K; the activation arrives pre-quantized (packed int8 + per-32-block f32 scale) from recordQuantizeActQ8. Output is token-major f32 [N][M].

src/compute/dmmv.zig:3038

method

DmmvDispatch.recordMulMmQ6KRagged72Dp4a

#
pub fn recordMulMmQ6KRagged72Dp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

Guarded BN=72 int8 DP4a Q6_K dense-down GEMM for Gemma 4 31B's 65-72 token public prompt shape.

This covers the 64-column body and <=8-column ragged tail in one pass over the K=21504 down weights.
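The body/tail split described above can be sketched on the host as follows. `RaggedSplit` and `splitRagged72` are hypothetical names used for illustration; the actual guard inside dmmv.zig may differ.

```zig
const std = @import("std");

/// Hypothetical split of a ragged prompt into the two parts named above.
const RaggedSplit = struct {
    body: u32, // columns covered by the 64-column body
    tail: u32, // remaining columns, at most 8
};

/// Returns null when n_tokens is outside the documented 65-72 range.
fn splitRagged72(n_tokens: u32) ?RaggedSplit {
    if (n_tokens < 65 or n_tokens > 72) return null;
    return RaggedSplit{ .body = 64, .tail = n_tokens - 64 };
}

test "ragged72 split" {
    const s = splitRagged72(70).?;
    try std.testing.expectEqual(@as(u32, 64), s.body);
    try std.testing.expectEqual(@as(u32, 6), s.tail);
    try std.testing.expect(splitRagged72(64) == null);
    try std.testing.expect(splitRagged72(73) == null);
}
```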

src/compute/dmmv.zig:3097

method

DmmvDispatch.recordMulMmQ6KTail8Dp4a

#
pub fn recordMulMmQ6KTail8Dp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_buf: vk.c.VkBuffer, b_scale_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_offset: u32, d_offset: u32, ) !void

BN=8 int8 DP4a Q6_K dense-down tail for Gemma 4 31B short prompts.

The packed and scale activation descriptors are offset to the first tail token, while the output uses `d_offset` so that the token-major destination layout stays contiguous with the full 64-token prefix.
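A sketch of where the first tail token starts in each buffer, assuming the Q8_0 activation layouts documented for recordQuantizeActQ8 and an output of M floats per token. Offsets are in 32-bit elements; whether the API expects elements or bytes is not stated here, so treat the units as an assumption. `TailOffsets` and `tailOffsets` are hypothetical names.

```zig
const std = @import("std");

/// Hypothetical offsets (in 32-bit elements) of the first tail token.
const TailOffsets = struct {
    b_packed: u64, // into [n_tokens][K/4] packed int8
    b_scale: u64, // into [n_tokens][K/32] f32 scales
    d: u64, // into [n_tokens][M] f32 output
};

/// prefix_tokens is the number of tokens already handled (64 in the text above).
fn tailOffsets(prefix_tokens: u32, m: u32, k: u32) TailOffsets {
    const p: u64 = prefix_tokens;
    const mm: u64 = m;
    const kk: u64 = k;
    return TailOffsets{
        .b_packed = p * (kk / 4),
        .b_scale = p * (kk / 32),
        .d = p * mm,
    };
}

test "tail offsets after a 64-token prefix" {
    const o = tailOffsets(64, 8, 64);
    try std.testing.expectEqual(@as(u64, 1024), o.b_packed);
    try std.testing.expectEqual(@as(u64, 128), o.b_scale);
    try std.testing.expectEqual(@as(u64, 512), o.d);
}
```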

src/compute/dmmv.zig:3150

method

DmmvDispatch.recordMulMmQ6KFullDp4aQ8_1

#
pub fn recordMulMmQ6KFullDp4aQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

Q8_1-input variant: same Q6_K DP4a GEMM but the activation scale buffer is `vec2 b_scale_dsum[]` per 32-block (Q4_K-style layout).

The shader reads only `.x`; `.y` (dsum) is unused because Q6_K weights have no per-block bias term. Used by the Qwen3.6-27B SSM wqkv projection so that it can share a single Q8_1 quantize of scratch_norm with the Q4_K z projection. The push constant stride_b_scale is K/32 (the number of vec2 entries per token), the same numeric value as in the Q8_0 variant, because indexing is in typed-element units.
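The stride point can be illustrated with a small sketch: the per-token stride in typed elements is K/32 for both layouts, while the byte footprint doubles for the vec2 layout. The names below are hypothetical, and the 8-byte vec2 element size assumes a tightly packed (std430-style) storage buffer.

```zig
const std = @import("std");

/// Hypothetical tag for the two activation scale layouts.
const ScaleLayout = enum { q8_0, q8_1 };

/// Stride in typed elements (float or vec2) per token: identical for both layouts.
fn scaleStrideElems(k: u32) u32 {
    return k / 32;
}

/// Bytes of scale data per token. Assumes vec2 occupies 8 bytes (std430-style packing).
fn scaleBytesPerToken(k: u32, layout: ScaleLayout) u32 {
    const elem_bytes: u32 = switch (layout) {
        .q8_0 => @sizeOf(f32),
        .q8_1 => 2 * @sizeOf(f32),
    };
    return scaleStrideElems(k) * elem_bytes;
}

test "same stride, double the bytes" {
    try std.testing.expectEqual(@as(u32, 2), scaleStrideElems(64));
    try std.testing.expectEqual(@as(u32, 8), scaleBytesPerToken(64, .q8_0));
    try std.testing.expectEqual(@as(u32, 16), scaleBytesPerToken(64, .q8_1));
}
```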

src/compute/dmmv.zig:3216

method

DmmvDispatch.recordQuantizeActQ8_1

#
pub fn recordQuantizeActQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, src_buf: vk.c.VkBuffer, src_size: vk.c.VkDeviceSize, dst_packed_buf: vk.c.VkBuffer, dst_packed_size: vk.c.VkDeviceSize, dst_scale_dsum_buf: vk.c.VkBuffer, dst_scale_dsum_size: vk.c.VkDeviceSize, n_tokens: u32, K: u32, ) !void

Q8_1-style activation quantize: packed int8 + per-32-block (scale, dsum) for the DP4a Q4_K gate+up GEMM bias-correction term.
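A CPU-side reference sketch of one 32-element block, for illustration only. It assumes symmetric scaling with scale = max|x| / 127, which is a common convention but is not confirmed for this shader; dsum is taken as scale * sum(int8 lanes). `Q8_1Block` and `quantizeBlockQ8_1` are hypothetical names.

```zig
const std = @import("std");

const block_len = 32;

/// Hypothetical CPU-side mirror of one quantized block.
const Q8_1Block = struct {
    q: [block_len]i8,
    scale: f32,
    dsum: f32, // scale * sum(q)
};

fn quantizeBlockQ8_1(x: [block_len]f32) Q8_1Block {
    // Largest magnitude in the block.
    var amax: f32 = 0;
    for (x) |v| {
        const a = if (v < 0) -v else v;
        if (a > amax) amax = a;
    }
    // Assumption: symmetric scaling into the int8 range [-127, 127].
    const scale: f32 = amax / 127.0;
    const inv: f32 = if (scale > 0) 1.0 / scale else 0;

    var out = Q8_1Block{ .q = undefined, .scale = scale, .dsum = 0 };
    var sum: i32 = 0;
    for (x, 0..) |v, i| {
        // Clamp guards against float rounding pushing a value past 127.
        const r = std.math.clamp(@round(v * inv), -127.0, 127.0);
        const qi: i8 = @intFromFloat(r);
        out.q[i] = qi;
        sum += qi;
    }
    out.dsum = scale * @as(f32, @floatFromInt(sum));
    return out;
}

test "all-ones block" {
    const ones = [_]f32{1.0} ** block_len;
    const b = quantizeBlockQ8_1(ones);
    try std.testing.expectEqual(@as(i8, 127), b.q[0]);
    try std.testing.expectApproxEqAbs(@as(f32, 32.0), b.dsum, 1e-3);
}
```

Because dsum approximates the sum of the original block values, a consumer with a per-block bias term can fold that term in without revisiting the individual lanes.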

src/compute/dmmv.zig:3266

method

DmmvDispatch.recordMulMmQ4KGateUpSwigluFullDp4a

#
pub fn recordMulMmQ4KGateUpSwigluFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM for Qwen3.6-27B dense FFN prefill.

Activations arrive pre-quantized from recordQuantizeActQ8_1 (packed int8 + per-32-block (scale, dsum)). Output is silu(gate)*up, token-major f32 [N][M].
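For reference, the per-element output silu(gate)*up can be written as a scalar sketch. This mirrors only the formula stated above, not the kernel's tiling or quantized arithmetic; the function names are illustrative.

```zig
const std = @import("std");

/// silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
fn silu(x: f32) f32 {
    return x / (1.0 + @exp(-x));
}

/// Per-element SwiGLU output: silu(gate) * up.
fn swiglu(gate: f32, up: f32) f32 {
    return silu(gate) * up;
}

test "swiglu scalar reference" {
    try std.testing.expectEqual(@as(f32, 0), swiglu(0, 5));
    try std.testing.expectApproxEqAbs(@as(f32, 1.4621172), swiglu(1, 2), 1e-5);
}
```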

src/compute/dmmv.zig:3309

method

DmmvDispatch.recordMulMmQ4KGateUpGegluFullDp4a

#
pub fn recordMulMmQ4KGateUpGegluFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

int8 DP4a full-tile Q4_K gate+up+GEGLU GEMM for Gemma dense FFN prefill.

Activations arrive pre-quantized from recordQuantizeActQ8_1 (packed int8 + per-32-block (scale, dsum)). Output is gelu(gate)*up, token-major f32 [N][M].
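A scalar sketch of gelu(gate)*up. Which GELU form the shader uses is not stated here; the tanh approximation below is an assumption chosen for illustration, and the function names are hypothetical.

```zig
const std = @import("std");

/// GELU, tanh approximation (assumed; the shader may use a different form).
fn geluTanh(x: f32) f32 {
    const c: f32 = 0.7978845608; // sqrt(2 / pi)
    const inner = c * (x + 0.044715 * x * x * x);
    return 0.5 * x * (1.0 + std.math.tanh(inner));
}

/// Per-element GEGLU output: gelu(gate) * up.
fn geglu(gate: f32, up: f32) f32 {
    return geluTanh(gate) * up;
}

test "geglu scalar reference" {
    try std.testing.expectEqual(@as(f32, 0), geglu(0, 3));
    try std.testing.expectApproxEqAbs(@as(f32, 0.8412), geluTanh(1), 1e-3);
}
```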

src/compute/dmmv.zig:3364

method

DmmvDispatch.recordMulMmQ4KGateUpSwigluFullDp4aQ8

#
pub fn recordMulMmQ4KGateUpSwigluFullDp4aQ8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_buf: vk.c.VkBuffer, d_scale_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_packed_offset: u32, d_scale_offset: u32, ) !void

int8 DP4a full-tile Q4_K gate+up+SwiGLU GEMM that emits Q8_0-style packed activation directly.

Output layout matches quantize_act_q8.comp (per-token packed int8 + per-32-block scale), so the downstream dense-down DP4a kernel can consume it directly without the standalone quantize_act_q8 dispatch + barrier. The f32 SwiGLU intermediate is never written to global memory.

src/compute/dmmv.zig:3421

method

DmmvDispatch.recordMulMmQ4KGateUpGegluFullDp4aQ8

#
pub fn recordMulMmQ4KGateUpGegluFullDp4aQ8( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_buf: vk.c.VkBuffer, d_scale_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_offset: u32, d_packed_offset: u32, d_scale_offset: u32, ) !void

Gemma GEGLU variant of recordMulMmQ4KGateUpSwigluFullDp4aQ8.

Emits Q8_0-style packed GEGLU activation for Q6_K dense-down DP4a.

src/compute/dmmv.zig:3491

method

DmmvDispatch.recordMulMmQ4KGateUpSwigluFullDp4aQ8_1

#
pub fn recordMulMmQ4KGateUpSwigluFullDp4aQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_dsum_buf: vk.c.VkBuffer, d_scale_dsum_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_packed_offset: u32, d_scale_dsum_offset: u32, ) !void

Q4_K-down sibling of recordMulMmQ4KGateUpSwigluFullDp4aQ8.

Same fused gate+up+SwiGLU DP4a GEMM, but it emits Q8_1-style activation (packed int8 + per-32-block (scale, dsum) vec2), so the downstream mul_mm_q4k_full_dp4a (Q4_K-down) consumer can skip the standalone quantize_act_q8_1 dispatch and barrier. dsum = scale * sum(int8_lanes) is computed inside the kernel via subgroupClusteredAdd with cluster size TPR_M = 8, so there is no LDS round-trip beyond the GEMM's existing barriers. The caller is responsible for sizing d_scale_dsum_buf at 2x the Q8_0 scale buffer (per-block vec2 instead of per-block float).

src/compute/dmmv.zig:3584

method

DmmvDispatch.recordMulMmQ4KGateUpGegluFullDp4aQ8_1

#
pub fn recordMulMmQ4KGateUpGegluFullDp4aQ8_1( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, gate_buf: vk.c.VkBuffer, gate_size: vk.c.VkDeviceSize, up_buf: vk.c.VkBuffer, up_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_packed_buf: vk.c.VkBuffer, d_packed_size: vk.c.VkDeviceSize, d_scale_dsum_buf: vk.c.VkBuffer, d_scale_dsum_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_offset: u32, d_packed_offset: u32, d_scale_dsum_offset: u32, ) !void

Gemma GEGLU variant of recordMulMmQ4KGateUpSwigluFullDp4aQ8_1.

Emits Q8_1-style packed GEGLU activation for Q4_K dense-down DP4a.

src/compute/dmmv.zig:3654

method

DmmvDispatch.recordMulMmQ5KFullDp4a

#
pub fn recordMulMmQ5KFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

int8 DP4a full-tile single Q5_K GEMM (no fused activation).

Used by the Qwen3.6-27B SSM out prefill projection (M=hidden_dim, K=d_inner). Activations arrive pre-quantized (packed int8 + per-32-block (scale, dsum)) from recordQuantizeActQ8_1. Output is token-major f32 [N][M]. Same push/binding shape as recordMulMmQ4KFullDp4a; the only difference is the 5-bit weight unpack inside the kernel.

src/compute/dmmv.zig:3744

method

DmmvDispatch.recordMulMmQ4KFullDp4a

#
pub fn recordMulMmQ4KFullDp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, d_offset: u32, ) !void

int8 DP4a full-tile single Q4_K GEMM (no fused activation).

Used by the Qwen3.6-27B SSM z prefill projection. Activations arrive pre-quantized (packed int8 + per-32-block (scale, dsum)) from recordQuantizeActQ8_1. Output is token-major f32 [N][M].

src/compute/dmmv.zig:3796

method

DmmvDispatch.recordMulMmQ4KTail8Dp4a

#
pub fn recordMulMmQ4KTail8Dp4a( self: *const DmmvDispatch, cmd: *CommandBuffer, push_desc_fn: ?PushDescriptorFn, a_buf: vk.c.VkBuffer, a_size: vk.c.VkDeviceSize, b_packed_buf: vk.c.VkBuffer, b_packed_size: vk.c.VkDeviceSize, b_scale_dsum_buf: vk.c.VkBuffer, b_scale_dsum_size: vk.c.VkDeviceSize, d_buf: vk.c.VkBuffer, d_size: vk.c.VkDeviceSize, M: u32, N: u32, K: u32, a_offset: u32, b_packed_offset: u32, b_scale_dsum_offset: u32, d_offset: u32, ) !void

BN=8 int8 DP4a Q4_K dense-down tail for Gemma 4 31B short prompts.

Descriptor offsets select the tail token slice of the pre-quantized Q8_1 activation buffers; d_offset keeps the f32 output token-major.

src/compute/dmmv.zig:3858

method

DmmvDispatch.deinit

#
pub fn deinit(self: *DmmvDispatch) void

Destroy the loaded pipelines and descriptor pool.

Parameters
self
Dispatch wrapper to tear down in place.

src/compute/dmmv.zig:3918