Last updated: 2026-06-12
Inference Runtime
Dequant
Shared scalar GGML dequantization helpers for T-CPU kernels.
These helpers intentionally mirror the Vulkan backend's CPU diagnostic dequantization so M0 can compare host-side ZINC_RT ops against it.
function
row
pub fn row(raw_data: []const u8, row_index: u32, cols: u32, tensor_type: GGMLType, output: []f32) !void
Dequantize one row of a GGML tensor into f32 lanes.
Dispatches on `tensor_type` and writes the first `cols` entries of `output`. Supports the formats used by ZINC weights today: `.f32`, `.f16`, `.bf16`, `.q8_0`, `.q4_0`, `.q5_1`, `.q4_k`, `.q5_k`, `.q6_k`, and `.mxfp4`. Block-quantized formats require `cols` to be a multiple of their block size (32 for q4_0/q5_1/q8_0/mxfp4, 256 for q4_k/q5_k/q6_k). Returns `error.InputTooSmall` when the row would overrun `raw_data`, `error.UnsupportedShape` on bad alignment, or `error.UnsupportedTensorType` for formats not handled here.
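The shape checks described above can be sketched as follows. This is a minimal illustration, not the ZINC implementation: `SketchType`, `BlockShape`, `blockShape`, and `rowSpan` are hypothetical names, only three formats are covered, and the byte sizes are the standard GGML block sizes. It also assumes that "bad alignment" means `cols` is not a multiple of the block size.

```zig
const std = @import("std");

// Hypothetical stand-in for the real GGMLType; only three formats are sketched.
const SketchType = enum { q4_0, q8_0, q4_k };

const BlockShape = struct { elems: usize, bytes: usize };

/// Elements and bytes per block, using the standard GGML block sizes.
fn blockShape(t: SketchType) BlockShape {
    return switch (t) {
        .q4_0 => .{ .elems = 32, .bytes = 18 }, // f16 scale + 16 nibble bytes
        .q8_0 => .{ .elems = 32, .bytes = 34 }, // f16 scale + 32 int8 weights
        .q4_k => .{ .elems = 256, .bytes = 144 }, // d, dmin, 12-byte header, 128 nibble bytes
    };
}

/// Byte range [start, end) of one row, or one of the two shape errors.
fn rowSpan(
    raw_len: usize,
    row_index: usize,
    cols: usize,
    t: SketchType,
) error{ InputTooSmall, UnsupportedShape }![2]usize {
    const shape = blockShape(t);
    // Assumed meaning of "bad alignment": a partial block at the end of the row.
    if (cols % shape.elems != 0) return error.UnsupportedShape;
    const row_bytes = (cols / shape.elems) * shape.bytes;
    const start = row_index * row_bytes;
    const end = start + row_bytes;
    if (end > raw_len) return error.InputTooSmall;
    return [2]usize{ start, end };
}
```

Checking the shape before touching the data is what lets an inner loop run without per-element bounds checks.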
function
dotRow
pub fn dotRow(raw_data: []const u8, row_index: u32, cols: u32, tensor_type: GGMLType, input: []const f32, scratch: []f32) !f32
Dot one quantized row against an f32 input vector, dispatching on tensor type.
Hot formats (`f32`, `f16`, `bf16`, `q4_0`, `q8_0`, `q4_k`, `q5_k`, `q6_k`) take a fused vectorized path that streams weights and folds the dequant scales into the FMAs. Every other format falls back to dequantizing into `scratch` first and then dotting.
function
dotF32Row
pub fn dotF32Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one f32-packed row against `input` using a 16-wide AVX-512-friendly inner loop.
Uses four independent accumulators driven by a 4-way unroll so the FP-add chain stays short.
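A minimal sketch of the four-accumulator pattern, assuming plain `[]const f32` slices instead of a raw byte row. `dotF32Sketch`, `lanes`, and `Vec` are hypothetical names, and the scalar tail handling is an illustrative choice rather than a description of the real kernel.

```zig
const std = @import("std");

const lanes = 16; // 16 x f32 = 512 bits, one AVX-512 register
const Vec = @Vector(lanes, f32);

/// Dot product with four independent 16-lane accumulators (hypothetical helper).
fn dotF32Sketch(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    const zero: Vec = @splat(0.0);
    var acc0 = zero;
    var acc1 = zero;
    var acc2 = zero;
    var acc3 = zero;

    var i: usize = 0;
    // Main loop: 4 x 16 = 64 elements per iteration. Each accumulator only
    // depends on its own previous value, so the four add chains can overlap.
    while (i + 4 * lanes <= a.len) : (i += 4 * lanes) {
        const a0: Vec = a[i..][0..lanes].*;
        const b0: Vec = b[i..][0..lanes].*;
        const a1: Vec = a[i + lanes ..][0..lanes].*;
        const b1: Vec = b[i + lanes ..][0..lanes].*;
        const a2: Vec = a[i + 2 * lanes ..][0..lanes].*;
        const b2: Vec = b[i + 2 * lanes ..][0..lanes].*;
        const a3: Vec = a[i + 3 * lanes ..][0..lanes].*;
        const b3: Vec = b[i + 3 * lanes ..][0..lanes].*;
        acc0 += a0 * b0;
        acc1 += a1 * b1;
        acc2 += a2 * b2;
        acc3 += a3 * b3;
    }

    // Combine the accumulators pairwise, then reduce across lanes.
    var sum: f32 = @reduce(.Add, (acc0 + acc1) + (acc2 + acc3));
    // Scalar tail for lengths that are not a multiple of 64.
    while (i < a.len) : (i += 1) sum += a[i] * b[i];
    return sum;
}
```

With a single accumulator, every add would wait on the previous one; splitting the sum four ways shortens that dependency chain at the cost of a slightly different floating-point summation order.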
function
dotF16Row
pub fn dotF16Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one f16-packed row against an f32 input vector, promoting each weight to f32 on the fly.
function
dotBf16Row
pub fn dotBf16Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one bf16-packed row against an f32 input vector by widening each weight into the upper 16 bits of an f32 (the low 16 mantissa bits are zero-filled).
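The bf16 widening is a 16-bit left shift followed by a bit reinterpretation. The sketch below assumes little-endian storage of the 16-bit values; `bf16ToF32` and `dotBf16Sketch` are hypothetical names, and the loop is scalar for clarity.

```zig
const std = @import("std");

/// Widen one bf16 bit pattern to f32: bf16 is the top half of an f32,
/// so the bits move to the upper 16 and the lower 16 become zero.
fn bf16ToF32(bits: u16) f32 {
    return @bitCast(@as(u32, bits) << 16);
}

/// Scalar dot of a bf16-packed byte row against f32 inputs (hypothetical helper).
/// Assumes `raw` holds at least `2 * input.len` bytes, little-endian.
fn dotBf16Sketch(raw: []const u8, input: []const f32) f32 {
    var sum: f32 = 0.0;
    for (input, 0..) |x, i| {
        const bits = std.mem.readInt(u16, raw[2 * i ..][0..2], .little);
        sum += bf16ToF32(bits) * x;
    }
    return sum;
}
```

For example, the bit pattern `0x3F80` widens to `0x3F800000`, which is `1.0` as an f32.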
function
dotQ8_0Row
pub fn dotQ8_0Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one Q8_0-packed row against an f32 input vector.
Q8_0 stores 32 signed-int8 weights per block with one f16 scale; this entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.
function
dotQ4_0Row
pub fn dotQ4_0Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one Q4_0-packed row against an f32 input vector.
Q4_0 stores 32 nibble weights with a `-8` bias per block plus an f16 scale; this entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.
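A scalar sketch of one Q4_0 block dot, to make the `-8` bias and the single scale concrete. `dotQ4_0Block` is a hypothetical helper. The element order (low nibble of byte `j` is weight `j`, high nibble is weight `j + 16`) follows the GGML reference layout and is an assumption about ZINC's data, as is little-endian storage of the f16 scale.

```zig
const std = @import("std");

const q4_0_block_bytes = 18; // 2-byte f16 scale + 16 nibble bytes

/// Dot one 18-byte Q4_0 block against 32 inputs (hypothetical scalar helper).
fn dotQ4_0Block(block: *const [q4_0_block_bytes]u8, x: *const [32]f32) f32 {
    // f16 widens to f32 implicitly.
    const d: f32 = @as(f16, @bitCast(std.mem.readInt(u16, block[0..2], .little)));
    var sum: f32 = 0.0;
    for (0..16) |j| {
        const byte = block[2 + j];
        // Each nibble is in [0, 15]; subtracting 8 recenters it to [-8, 7].
        const lo: f32 = @floatFromInt(@as(i32, byte & 0x0F) - 8);
        const hi: f32 = @floatFromInt(@as(i32, byte >> 4) - 8);
        sum += lo * x[j] + hi * x[j + 16];
    }
    // The scale is applied once per block rather than once per weight.
    return d * sum;
}
```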
function
quantizeRowToQ4_0
pub fn quantizeRowToQ4_0(src: []const f32, dst: []u8) void
Quantize one row of f32 weights into the GGML `Q4_0` block layout.
Each 32-element block is stored as one f16 scale followed by 16 packed nibble pairs (low nibble = first weight of the pair, high nibble = second), where each nibble encodes a value in [0, 15] representing the quantized weight offset by +8. Mirrors llama.cpp's `quantize_row_q4_0_ref`.
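A single-block sketch of the quantization step, written from my reading of llama.cpp's `quantize_row_q4_0_ref`; treat the details as assumptions rather than a statement of what ZINC does. In that reference the scale is the signed largest-magnitude weight divided by `-8`, and byte `j` pairs weight `j` (low nibble) with weight `j + 16` (high nibble). `quantizeBlockQ4_0` and `quantNibble` are hypothetical names.

```zig
const std = @import("std");

/// Map a pre-scaled weight in roughly [-8, 8] to a nibble in [0, 15].
fn quantNibble(scaled: f32) u8 {
    // Adding 8.5 and truncating rounds to nearest after the +8 offset.
    const q: i32 = @intFromFloat(scaled + 8.5);
    return @intCast(@min(q, 15));
}

/// Quantize 32 f32 weights into one 18-byte Q4_0 block (hypothetical helper).
fn quantizeBlockQ4_0(src: *const [32]f32, dst: *[18]u8) void {
    // Find the weight with the largest magnitude, keeping its sign.
    var amax: f32 = 0.0;
    var max: f32 = 0.0;
    for (0..32) |i| {
        const w = src[i];
        if (@abs(w) > amax) {
            amax = @abs(w);
            max = w;
        }
    }
    const d: f32 = max / -8.0;
    const inv: f32 = if (d != 0.0) 1.0 / d else 0.0;

    const d_bits: u16 = @bitCast(@as(f16, @floatCast(d)));
    std.mem.writeInt(u16, dst[0..2], d_bits, .little);

    for (0..16) |j| {
        const lo = quantNibble(src[j] * inv);
        const hi = quantNibble(src[j + 16] * inv);
        dst[2 + j] = lo | (hi << 4);
    }
}
```

Dividing by `-8` maps the largest-magnitude weight to the nibble value 0, which is the one end of the [0, 15] range that reaches a full 8 steps from the center.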
function
quantizeRowToQ8_0
pub fn quantizeRowToQ8_0(src: []const f32, dst: []u8) void
Quantize one row of f32 weights into the GGML `Q8_0` block layout.
Each 32-element block is stored as one f16 scale followed by 32 signed int8 values clamped to [-127, 127]; the scale is `max(|w|) / 127`. Mirrors llama.cpp's `quantize_row_q8_0_ref`.
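The Q8_0 rule above can be sketched for a single block as follows. `quantizeBlockQ8_0` is a hypothetical helper, and little-endian storage of the f16 scale is an assumption; the real function presumably loops this over every 32-element block of the row.

```zig
const std = @import("std");

/// Quantize 32 f32 weights into one 34-byte Q8_0 block (hypothetical helper).
fn quantizeBlockQ8_0(src: *const [32]f32, dst: *[34]u8) void {
    // scale = max(|w|) / 127
    var amax: f32 = 0.0;
    for (0..32) |i| amax = @max(amax, @abs(src[i]));
    const d: f32 = amax / 127.0;
    // An all-zero block gets a zero scale and all-zero weights.
    const inv: f32 = if (d != 0.0) 1.0 / d else 0.0;

    const d_bits: u16 = @bitCast(@as(f16, @floatCast(d)));
    std.mem.writeInt(u16, dst[0..2], d_bits, .little);

    for (0..32) |i| {
        const r = std.math.clamp(@round(src[i] * inv), -127.0, 127.0);
        const q: i8 = @intFromFloat(r);
        dst[2 + i] = @bitCast(q);
    }
}
```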
function
dotQ4KRow
pub fn dotQ4KRow(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one Q4_K-packed row against an f32 input vector.
Q4_K stores 256 weights per super-block as 8 sub-blocks of 32 nibbles, each with its own 6-bit scale and min packed into a 12-byte header (plus block-level f16 `d`/`dmin`). This entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.
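Eight 6-bit scales and eight 6-bit mins fit into 12 bytes because the upper two bits of the first eight header bytes carry the high bits of sub-blocks 4 to 7. The sketch below follows the bit layout of llama.cpp's `get_scale_min_k4` as I understand it; `ScaleMin` and `scaleMinK4` are hypothetical names, and `j` must be in 0..7.

```zig
const std = @import("std");

const ScaleMin = struct { scale: u8, min: u8 };

/// Unpack the 6-bit scale and min of sub-block `j` (0..7) from the
/// 12-byte Q4_K header (hypothetical helper).
fn scaleMinK4(j: usize, q: *const [12]u8) ScaleMin {
    std.debug.assert(j < 8);
    if (j < 4) {
        // Sub-blocks 0..3: the low 6 bits of bytes 0..3 (scale) and 4..7 (min).
        return .{ .scale = q[j] & 63, .min = q[j + 4] & 63 };
    }
    // Sub-blocks 4..7: low 4 bits come from bytes 8..11, the high 2 bits
    // from the top of bytes 0..3 (scale) and 4..7 (min).
    return .{
        .scale = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4),
        .min = (q[j + 4] >> 4) | ((q[j] >> 6) << 4),
    };
}
```

Under this layout the effective weight of element `i` in sub-block `j` is `d * scale_j * nibble_i - dmin * min_j`.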
function
fillInputSum32
pub fn fillInputSum32(input: []const f32, sums: []f32) void
Precompute per-32-element sums of an input vector for the `WithSum32` Q4_K/Q5_K dot paths.
Those paths fold the asymmetric min subtraction `-m * sum(x_block)` out of the inner loop, so the caller fills `sums[i] = sum(input[i*32 .. (i+1)*32])` once and reuses it across many rows.
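The folding relies on the identity `sum((s * q_i - m) * x_i) = s * sum(q_i * x_i) - m * sum(x_i)`, where `s` and `m` are the effective scale and min of one sub-block. A minimal sketch, with `fillSums32` and `subBlockDot` as hypothetical names and the quantized values passed as plain bytes for simplicity:

```zig
const std = @import("std");

/// sums[b] = sum(input[b*32 .. (b+1)*32]); assumes input.len == 32 * sums.len.
fn fillSums32(input: []const f32, sums: []f32) void {
    std.debug.assert(input.len == sums.len * 32);
    for (sums, 0..) |*s, b| {
        var acc: f32 = 0.0;
        for (0..32) |i| acc += input[b * 32 + i];
        s.* = acc;
    }
}

/// One sub-block's contribution with the min term folded out of the loop.
/// `scale` and `min` are the effective f32 values (block scale times sub-scale).
fn subBlockDot(scale: f32, min: f32, q: *const [32]u8, x: *const [32]f32, x_sum: f32) f32 {
    var acc: f32 = 0.0;
    for (0..32) |i| acc += @as(f32, @floatFromInt(q[i])) * x[i];
    // The inner loop never touches `min`; it is applied once via the cached sum.
    return scale * acc - min * x_sum;
}
```

Since `x_sum` depends only on the input vector, a matrix-vector product can compute the sums once and reuse them for every row.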
function
dotQ5KRow
pub fn dotQ5KRow(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one Q5_K-packed row against an f32 input vector.
Q5_K extends Q4_K with a 5th high-bit plane stored as 32 bytes (one bit per weight, eight 32-element sub-blocks). This entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.
function
dotQ6KRow
pub fn dotQ6KRow(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32
Dot one Q6_K-packed row against an f32 input vector.
Q6_K packs 256 6-bit weights as a low-nibble plane plus a 2-bit-per-weight high plane, with one f16 super-block scale and sixteen signed-int8 sub-scales, one per 16 weights; weights are recentered by `-32`. This entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.
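Reassembling one Q6_K weight from the two planes can be sketched as below. `q6Weight` is a hypothetical helper that takes the already-extracted low nibble and high 2-bit field; locating those fields inside the block is omitted here because it depends on the plane interleaving.

```zig
const std = @import("std");

/// Combine a 4-bit low part and a 2-bit high part into a signed 6-bit weight.
fn q6Weight(low4: u8, high2: u8) i8 {
    // The raw value is in [0, 63]; subtracting 32 recenters it to [-32, 31].
    const raw: u8 = (low4 & 0x0F) | ((high2 & 0x03) << 4);
    return @as(i8, @intCast(raw)) - 32;
}
```

The dequantized value is then the super-block scale times the matching sub-scale times this signed weight.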