Last updated: 2026-06-12

Inference Runtime

Dequant


Shared scalar GGML dequantization helpers for T-CPU kernels.

These helpers intentionally mirror the Vulkan backend's CPU diagnostic dequantization so M0 can compare host-side ZINC_RT ops against it.

13 exports shown

function

row

pub fn row(raw_data: []const u8, row_index: u32, cols: u32, tensor_type: GGMLType, output: []f32) !void

Dequantize one row of a GGML tensor into f32 lanes.

Dispatches on `tensor_type` and writes the first `cols` entries of `output`. Supports the formats used by ZINC weights today: `.f32`, `.f16`, `.bf16`, `.q8_0`, `.q4_0`, `.q5_1`, `.q4_k`, `.q5_k`, `.q6_k`, and `.mxfp4`.

Parameters

raw_data
Raw tensor bytes for the full matrix.
row_index
Zero-based row to materialize.
cols
Number of columns per row; quantized formats require this to be a multiple of their block size (32 for q4_0/q5_1/q8_0/mxfp4, 256 for q4_k/q5_k/q6_k).
tensor_type
GGML quantization tag selecting the decode path.
output
Destination slice; must be at least `cols` long.

Returns

`error.OutputTooSmall` when `output.len < cols`, `error.EmptyInput` when `cols == 0`, `error.InputTooSmall` when the row would overrun `raw_data`, `error.UnsupportedShape` on bad alignment, or `error.UnsupportedTensorType` for formats not handled here.

src/zinc_rt/isa/cpu_zig/dequant.zig:33
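
The shape and bounds checks described for `row` can be sketched as a small helper that locates one row's bytes in a block-quantized matrix. This is an illustration only, not the module's code: `ByteRange` and `rowByteRange` are hypothetical names, the caller passes the block geometry explicitly (for example 32 elements and 34 bytes for Q8_0), and the error names are borrowed from the list above.

```zig
const std = @import("std");

/// Hypothetical result type: half-open byte range [start, end) of one row.
pub const ByteRange = struct { start: usize, end: usize };

/// Hypothetical helper: find the bytes of `row_index` in a matrix whose rows
/// are packed as whole blocks of `block_elems` weights in `block_bytes` bytes.
pub fn rowByteRange(
    raw_len: usize,
    row_index: u32,
    cols: u32,
    block_elems: u32,
    block_bytes: u32,
) !ByteRange {
    if (cols == 0) return error.EmptyInput;
    // A row must consist of whole blocks.
    if (cols % block_elems != 0) return error.UnsupportedShape;
    const row_bytes: usize = @as(usize, cols / block_elems) * block_bytes;
    const start = @as(usize, row_index) * row_bytes;
    const end = start + row_bytes;
    // The row must not overrun the raw tensor bytes.
    if (end > raw_len) return error.InputTooSmall;
    return ByteRange{ .start = start, .end = end };
}
```

With this geometry, row 1 of a two-row Q8_0 matrix with 32 columns occupies bytes 34 through 67.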

function

dotRow

pub fn dotRow( raw_data: []const u8, row_index: u32, cols: u32, tensor_type: GGMLType, input: []const f32, scratch: []f32, ) !f32

Dot one quantized row against an f32 input vector, dispatching on tensor type.

Hot formats (`f32`, `f16`, `bf16`, `q4_0`, `q8_0`, `q4_k`, `q5_k`, `q6_k`) take a fused vectorized path that streams weights and folds the dequant scales into the FMAs. Every other format falls back to dequantizing into `scratch` first and then dotting.

Parameters

raw_data
Raw tensor bytes for the full matrix.
row_index
Zero-based row to dot.
cols
Number of columns per row; subject to the same block-alignment constraints as `row`.
tensor_type
GGML quantization tag selecting the decode path.
input
f32 input vector of length `>= cols`.
scratch
Caller-owned scratch of length `>= cols`; only consumed on the fallback path.

Returns

The f32 dot product, or an error matching `row`'s shape and size diagnostics.

src/zinc_rt/isa/cpu_zig/dequant.zig:301

function

dotF32Row

pub fn dotF32Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one f32-packed row against `input` using a 16-wide AVX-512-friendly inner loop.

Uses four independent accumulators driven by a 4-way unroll so the FP-add chain stays short.

Parameters

raw_data
Raw f32 row-major tensor bytes.
row_index
Zero-based row to dot.
cols
Number of f32 columns in the row.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.InputTooSmall` when `input` or `raw_data` is shorter than expected.

src/zinc_rt/isa/cpu_zig/dequant.zig:337
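
The four-accumulator, 4-way-unrolled pattern described for `dotF32Row` can be sketched with Zig vectors as follows. This is a minimal illustration, assuming Zig 0.11 or newer builtins; `dotF32Sketch` is a hypothetical name, it takes already-typed `f32` slices rather than raw bytes, and it is not the module's actual inner loop.

```zig
const std = @import("std");

const lanes = 16; // 16 x f32 = 512 bits, one AVX-512 register
const Vec = @Vector(lanes, f32);

/// Hypothetical sketch: dot product with four independent vector accumulators.
pub fn dotF32Sketch(weights: []const f32, input: []const f32) f32 {
    std.debug.assert(input.len >= weights.len);
    const zero: Vec = @splat(0.0);
    var acc = [4]Vec{ zero, zero, zero, zero };
    var i: usize = 0;
    // Main loop: 4 vectors (64 floats) per iteration, one accumulator each,
    // so consecutive FMAs do not depend on each other's results.
    while (i + 4 * lanes <= weights.len) : (i += 4 * lanes) {
        inline for (0..4) |k| {
            const off = i + k * lanes;
            const w: Vec = weights[off..][0..lanes].*;
            const x: Vec = input[off..][0..lanes].*;
            acc[k] = @mulAdd(Vec, w, x, acc[k]);
        }
    }
    // Combine the accumulators pairwise, then reduce across lanes.
    var sum: f32 = @reduce(.Add, (acc[0] + acc[1]) + (acc[2] + acc[3]));
    // Scalar tail for the remaining elements.
    while (i < weights.len) : (i += 1) sum += weights[i] * input[i];
    return sum;
}
```

Because the accumulation order differs from a plain scalar loop, results can differ in the last bits for general floating-point data.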

function

dotF16Row

pub fn dotF16Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one f16-packed row against an f32 input vector, promoting each weight to f32 on the fly.

Parameters

raw_data
Raw f16 row-major tensor bytes.
row_index
Zero-based row to dot.
cols
Number of f16 columns in the row.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.InputTooSmall` when the input or row bytes are too short.

src/zinc_rt/isa/cpu_zig/dequant.zig:400

function

dotBf16Row

pub fn dotBf16Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one bf16-packed row against an f32 input vector, widening each weight to f32 by placing its 16 bits in the upper half of the word and zero-filling the low 16 bits.

Parameters

raw_data
Raw bf16 row-major tensor bytes.
row_index
Zero-based row to dot.
cols
Number of bf16 columns in the row.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.InputTooSmall` when the input or row bytes are too short.

src/zinc_rt/isa/cpu_zig/dequant.zig:422
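
Since bf16 is the upper 16 bits of an IEEE f32, the widening step in `dotBf16Row` amounts to a 16-bit left shift followed by a bit cast. A scalar sketch, assuming little-endian storage and using the hypothetical names `bf16ToF32` and `dotBf16Sketch` (the real function also takes `row_index` and `cols` and is vectorized):

```zig
const std = @import("std");

/// Widen one bf16 bit pattern to f32: the 16 bits become the high half,
/// and the low 16 mantissa bits are zero.
pub fn bf16ToF32(bits: u16) f32 {
    return @bitCast(@as(u32, bits) << 16);
}

/// Hypothetical scalar sketch: dot one row of little-endian bf16 weights.
pub fn dotBf16Sketch(row_bytes: []const u8, input: []const f32) f32 {
    const cols = row_bytes.len / 2;
    std.debug.assert(input.len >= cols);
    var sum: f32 = 0.0;
    for (0..cols) |i| {
        // Assemble the 16-bit pattern from two little-endian bytes.
        const bits = @as(u16, row_bytes[i * 2]) | (@as(u16, row_bytes[i * 2 + 1]) << 8);
        sum += bf16ToF32(bits) * input[i];
    }
    return sum;
}
```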

function

dotQ8_0Row

pub fn dotQ8_0Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one Q8_0-packed row against an f32 input vector.

Q8_0 stores 32 signed-int8 weights per block with one f16 scale; this entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.

Parameters

raw_data
Raw Q8_0 tensor bytes (34 bytes per 32-element block).
row_index
Zero-based row to dot.
cols
Number of columns; must be a multiple of 32.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.UnsupportedShape` / `error.InputTooSmall` on misuse.

src/zinc_rt/isa/cpu_zig/dequant.zig:446
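
For reference, the Q8_0 block layout described for `dotQ8_0Row` (a 2-byte f16 scale followed by 32 int8 weights) leads to the scalar dot below. This is a sketch under stated assumptions: the scale is stored little-endian, the slice already covers exactly one row, and `readF16` and `dotQ8_0Sketch` are hypothetical names rather than the module's vectorized loop.

```zig
const std = @import("std");

const block_elems = 32;
const block_bytes = 34; // 2-byte f16 scale + 32 int8 weights

/// Decode a little-endian f16 and widen it to f32.
fn readF16(lo: u8, hi: u8) f32 {
    const bits = @as(u16, lo) | (@as(u16, hi) << 8);
    const h: f16 = @bitCast(bits);
    return @as(f32, h);
}

/// Hypothetical scalar sketch: dot one Q8_0 row against `input`.
pub fn dotQ8_0Sketch(row_bytes: []const u8, input: []const f32) !f32 {
    if (row_bytes.len % block_bytes != 0) return error.UnsupportedShape;
    const blocks = row_bytes.len / block_bytes;
    if (input.len < blocks * block_elems) return error.InputTooSmall;
    var sum: f32 = 0.0;
    for (0..blocks) |b| {
        const blk = row_bytes[b * block_bytes ..][0..block_bytes];
        const d = readF16(blk[0], blk[1]);
        var block_sum: f32 = 0.0;
        for (0..block_elems) |j| {
            const q: i8 = @bitCast(blk[2 + j]);
            block_sum += @as(f32, @floatFromInt(q)) * input[b * block_elems + j];
        }
        // The scale is applied once per block instead of once per weight.
        sum += d * block_sum;
    }
    return sum;
}
```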

function

dotQ4_0Row

pub fn dotQ4_0Row(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one Q4_0-packed row against an f32 input vector.

Q4_0 stores 32 nibble weights with a `-8` bias per block plus an f16 scale; this entry point validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.

Parameters

raw_data
Raw Q4_0 tensor bytes (18 bytes per 32-element block).
row_index
Zero-based row to dot.
cols
Number of columns; must be a multiple of 32.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.UnsupportedShape` / `error.InputTooSmall` on misuse.

src/zinc_rt/isa/cpu_zig/dequant.zig:526
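
The nibble unpacking and `-8` bias described for `dotQ4_0Row` can be sketched in scalar form. The sketch assumes the llama.cpp reference layout, in which byte `j` of a block holds weight `j` in its low nibble and weight `j + 16` in its high nibble, and a little-endian f16 scale; `readF16` and `dotQ4_0Sketch` are hypothetical names, not the module's code.

```zig
const std = @import("std");

const block_elems = 32;
const block_bytes = 18; // 2-byte f16 scale + 16 bytes of packed nibbles

/// Decode a little-endian f16 and widen it to f32.
fn readF16(lo: u8, hi: u8) f32 {
    const bits = @as(u16, lo) | (@as(u16, hi) << 8);
    const h: f16 = @bitCast(bits);
    return @as(f32, h);
}

/// Hypothetical scalar sketch: dot one Q4_0 row against `input`.
pub fn dotQ4_0Sketch(row_bytes: []const u8, input: []const f32) !f32 {
    if (row_bytes.len % block_bytes != 0) return error.UnsupportedShape;
    const blocks = row_bytes.len / block_bytes;
    if (input.len < blocks * block_elems) return error.InputTooSmall;
    var sum: f32 = 0.0;
    for (0..blocks) |b| {
        const blk = row_bytes[b * block_bytes ..][0..block_bytes];
        const d = readF16(blk[0], blk[1]);
        const x = input[b * block_elems ..][0..block_elems];
        var block_sum: f32 = 0.0;
        for (0..16) |j| {
            const byte = blk[2 + j];
            // Stored nibbles are in [0, 15]; subtracting 8 recenters to [-8, 7].
            const lo: i32 = @as(i32, byte & 0x0F) - 8;
            const hi: i32 = @as(i32, byte >> 4) - 8;
            block_sum += @as(f32, @floatFromInt(lo)) * x[j];
            block_sum += @as(f32, @floatFromInt(hi)) * x[j + 16];
        }
        sum += d * block_sum;
    }
    return sum;
}
```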

function

quantizeRowToQ4_0

pub fn quantizeRowToQ4_0(src: []const f32, dst: []u8) void

Quantize one row of f32 weights into the GGML `Q4_0` block layout.

Each 32-element block is stored as one f16 scale followed by 16 packed nibble pairs (low nibble = first weight, high nibble = second weight), where each nibble holds a value in [0, 15]: the scaled, rounded weight offset by +8. Mirrors llama.cpp's `quantize_row_q4_0_ref`.

Parameters

src
Source f32 values; length must be a positive multiple of 32.
dst
Destination byte buffer; must be at least `(src.len / 32) * 18` bytes.

Notes

Asserts (debug builds only) that alignment and size preconditions hold.

src/zinc_rt/isa/cpu_zig/dequant.zig:801
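
A single-block sketch of the Q4_0 encoding, following the arithmetic of llama.cpp's reference quantizer as I understand it: the scale is the largest-magnitude value divided by -8, and each weight is scaled, offset by 8.5, truncated, and capped at 15. The name `quantizeBlockQ4_0Sketch` is hypothetical, the pairing of weight `j` with weight `j + 16` in one byte is an assumption taken from llama.cpp, and the real `quantizeRowToQ4_0` processes a whole row.

```zig
const std = @import("std");

/// Hypothetical sketch: quantize one 32-element block into 18 bytes.
pub fn quantizeBlockQ4_0Sketch(src: []const f32, dst: []u8) void {
    std.debug.assert(src.len == 32 and dst.len >= 18);
    // Find the signed value with the largest magnitude.
    var amax: f32 = 0.0;
    var max: f32 = 0.0;
    for (src) |v| {
        const a = if (v < 0.0) -v else v;
        if (a > amax) {
            amax = a;
            max = v;
        }
    }
    const d: f32 = max / -8.0;
    const inv_d: f32 = if (d != 0.0) 1.0 / d else 0.0;
    // Store the scale as a little-endian f16.
    const bits: u16 = @bitCast(@as(f16, @floatCast(d)));
    dst[0] = @truncate(bits);
    dst[1] = @truncate(bits >> 8);
    for (0..16) |j| {
        // Scale, offset by +8 (plus 0.5 for rounding), truncate, cap at 15.
        const lo: u8 = @intFromFloat(@min(@as(f32, 15.0), src[j] * inv_d + 8.5));
        const hi: u8 = @intFromFloat(@min(@as(f32, 15.0), src[j + 16] * inv_d + 8.5));
        dst[2 + j] = lo | (hi << 4);
    }
}
```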

function

quantizeRowToQ8_0

pub fn quantizeRowToQ8_0(src: []const f32, dst: []u8) void

Quantize one row of f32 weights into the GGML `Q8_0` block layout.

Each 32-element block is stored as one f16 scale followed by 32 signed int8 values clamped to [-127, 127]; the scale is `max(|w|) / 127`. Mirrors llama.cpp's `quantize_row_q8_0_ref`.

Parameters

src
Source f32 values; length must be a positive multiple of 32.
dst
Destination byte buffer; must be at least `(src.len / 32) * 34` bytes.

Notes

Asserts (debug builds only) that alignment and size preconditions hold.

src/zinc_rt/isa/cpu_zig/dequant.zig:840
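
The Q8_0 encoding described above (scale `max(|w|) / 127`, then rounding each weight to int8) can be sketched for a single block. `quantizeBlockQ8_0Sketch` is a hypothetical name, the scale is assumed to be stored little-endian, and the real `quantizeRowToQ8_0` loops over every block in the row.

```zig
const std = @import("std");

/// Hypothetical sketch: quantize one 32-element block into 34 bytes.
pub fn quantizeBlockQ8_0Sketch(src: []const f32, dst: []u8) void {
    std.debug.assert(src.len == 32 and dst.len >= 34);
    // Largest magnitude in the block determines the scale.
    var amax: f32 = 0.0;
    for (src) |v| {
        const a = if (v < 0.0) -v else v;
        amax = @max(amax, a);
    }
    const d: f32 = amax / 127.0;
    const inv_d: f32 = if (d != 0.0) 1.0 / d else 0.0;
    // Store the scale as a little-endian f16.
    const bits: u16 = @bitCast(@as(f16, @floatCast(d)));
    dst[0] = @truncate(bits);
    dst[1] = @truncate(bits >> 8);
    for (src, 0..) |v, j| {
        // v * inv_d lies in [-127, 127], so the rounded value fits in i8.
        const q: i8 = @intFromFloat(@round(v * inv_d));
        dst[2 + j] = @bitCast(q);
    }
}
```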

function

dotQ4KRow

pub fn dotQ4KRow(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one Q4_K-packed row against an f32 input vector.

Q4_K stores 256 weights per super-block as 8 sub-blocks of 32 nibbles, each with its own 6-bit scale and min packed into a 12-byte header (plus block-level f16 `d`/`dmin`); validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.

Parameters

raw_data
Raw Q4_K tensor bytes (144 bytes per 256-element super-block).
row_index
Zero-based row to dot.
cols
Number of columns; must be a multiple of 256.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.UnsupportedShape` / `error.InputTooSmall` on misuse.

src/zinc_rt/isa/cpu_zig/dequant.zig:875
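
The 12-byte header mentioned for Q4_K packs eight 6-bit scales and eight 6-bit mins. The sketch below unpacks sub-block `j` using the same bit layout as llama.cpp's `get_scale_min_k4`; `ScaleMin` and `scaleMinK4` are hypothetical names, and whether this module unpacks the header the same way is an assumption. With the block-level `d` and `dmin`, a weight then decodes as `d * scale * q - dmin * min`.

```zig
const std = @import("std");

/// Hypothetical result type: 6-bit scale and min of one 32-weight sub-block.
pub const ScaleMin = struct { scale: u8, min: u8 };

/// Hypothetical sketch: unpack the 6-bit scale and min for sub-block `j` (0..7).
pub fn scaleMinK4(j: usize, q: *const [12]u8) ScaleMin {
    std.debug.assert(j < 8);
    if (j < 4) {
        // Sub-blocks 0..3: low 6 bits of bytes 0..3 (scale) and 4..7 (min).
        return ScaleMin{ .scale = q[j] & 63, .min = q[j + 4] & 63 };
    }
    // Sub-blocks 4..7: low 4 bits come from bytes 8..11, the top 2 bits
    // are borrowed from the upper bits of bytes 0..3 (scale) and 4..7 (min).
    return ScaleMin{
        .scale = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4),
        .min = (q[j + 4] >> 4) | ((q[j] >> 6) << 4),
    };
}
```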

function

fillInputSum32

pub fn fillInputSum32(input: []const f32, sums: []f32) void

Precompute per-32-element sums of an input vector for the `WithSum32` Q4_K/Q5_K dot paths.

Those paths fold the asymmetric min subtraction `-m * sum(x_block)` out of the inner loop, so the caller fills `sums[i] = sum(input[i*32 .. (i+1)*32])` once and reuses it across many rows.

Parameters

input
Input vector whose length must be a positive multiple of 32.
sums
Destination of length `>= input.len / 32`; lane `i` receives the sum of input block `i`.

src/zinc_rt/isa/cpu_zig/dequant.zig:968
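
The folding that `fillInputSum32` enables rests on the identity `sum((a*q[j] - m) * x[j]) = a * sum(q[j]*x[j]) - m * sum(x[j])`, where the last sum depends only on the input. A minimal sketch, with hypothetical names `fillSums32` and `foldedSubBlockDot`, and with `a` and `m` standing in for the already-combined scale and min of one sub-block:

```zig
const std = @import("std");

/// Hypothetical sketch: sums[i] = sum(input[i*32 .. (i+1)*32]).
pub fn fillSums32(input: []const f32, sums: []f32) void {
    std.debug.assert(input.len > 0 and input.len % 32 == 0);
    std.debug.assert(sums.len >= input.len / 32);
    for (0..input.len / 32) |i| {
        var s: f32 = 0.0;
        for (input[i * 32 .. (i + 1) * 32]) |v| s += v;
        sums[i] = s;
    }
}

/// Folded form of one sub-block dot: the min term uses the precomputed
/// input sum instead of being subtracted inside the loop.
pub fn foldedSubBlockDot(a: f32, m: f32, q: []const u8, x: []const f32, x_sum: f32) f32 {
    var qx: f32 = 0.0;
    for (q, x) |qj, xj| qx += @as(f32, @floatFromInt(qj)) * xj;
    return a * qx - m * x_sum;
}
```

Because the sums depend only on the input vector, they can be computed once and reused for every row of a matrix-vector product.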

function

dotQ5KRow

pub fn dotQ5KRow(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one Q5_K-packed row against an f32 input vector.

Q5_K extends Q4_K with a 5th high-bit plane stored as 32 bytes (one bit per weight, eight 32-element sub-blocks); validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.

Parameters

raw_data
Raw Q5_K tensor bytes (176 bytes per 256-element super-block).
row_index
Zero-based row to dot.
cols
Number of columns; must be a multiple of 256.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.UnsupportedShape` / `error.InputTooSmall` on misuse.

src/zinc_rt/isa/cpu_zig/dequant.zig:1071

function

dotQ6KRow

pub fn dotQ6KRow(raw_data: []const u8, row_index: u32, cols: u32, input: []const f32) !f32

Dot one Q6_K-packed row against an f32 input vector.

Q6_K packs 256 6-bit weights as a low-nibble plane (128 bytes) plus a 2-bit-per-weight high plane (64 bytes), with one f16 super-block scale and sixteen signed-int8 sub-scales, one per 16 weights; weights are recentered by `-32`. Validates block alignment and bounds, then delegates to the unchecked vectorized inner loop.

Parameters

raw_data
Raw Q6_K tensor bytes (210 bytes per 256-element super-block).
row_index
Zero-based row to dot.
cols
Number of columns; must be a multiple of 256.
input
f32 input vector of length `>= cols`.

Returns

The f32 dot product, or `error.UnsupportedShape` / `error.InputTooSmall` on misuse.

src/zinc_rt/isa/cpu_zig/dequant.zig:1252
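
The 6-bit reconstruction described for Q6_K combines a low nibble with two high bits and then recenters by 32. A small sketch of that step for one weight, with hypothetical names `q6Value` and `q6Weight`; which bytes supply the nibble and the two high bits for a given weight index is left out, since that indexing is not spelled out here.

```zig
const std = @import("std");

/// Combine a 4-bit low part and a 2-bit high part into a signed value in [-32, 31].
pub fn q6Value(low4: u8, high2: u8) i8 {
    const code: u8 = (low4 & 0x0F) | ((high2 & 0x03) << 4);
    return @as(i8, @intCast(code)) - 32;
}

/// Decode one weight from the super-block scale `d`, its int8 sub-scale,
/// and the two stored bit fields.
pub fn q6Weight(d: f32, sub_scale: i8, low4: u8, high2: u8) f32 {
    const q: f32 = @floatFromInt(q6Value(low4, high2));
    const sc: f32 = @floatFromInt(sub_scale);
    return d * sc * q;
}
```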