Last updated: 2026-06-12

Inference Runtime

Forward Zinc Rt

ZINC_RT forward-pass bring-up.

M1 wires a scalar forward path for the hybrid Qwen MoE+SSM model used by the RDNA migration harness, plus the first AMDGPU-CS-produced token boundary. The old no-layer tail remains as a smoke fallback for unsupported shapes.

11 exports, 11 methods (src/compute/forward_zinc_rt.zig)

constant

m0_max_decode_tokens_default

pub const m0_max_decode_tokens_default: u32 = 8

Decode-token budget used by the no-layer smoke tail (`generateNoLayer`) and by benchmarks that want a short, bounded run on the scalar reference path.

The full M1 forward paths (`generateScalarHybrid`, `generateScalarDense`) pass the caller-supplied `max_tokens` through unchanged and never consult this constant. The default can be overridden via `ZINC_RT_MAX_DECODE_TOKENS`.

src/compute/forward_zinc_rt.zig:21

struct

Model

pub const Model = struct

Loaded GGUF model plus all the load-time scratch derived from it.

Owns the mmap'd file, the resolved per-layer tensor table, the cached F32 dequants of small norm/SSM-side tensors, and the re-quantized weight blobs the decode path streams in place of the larger source-format weights. A `Model` is built once with `load` and consumed by `generate` / `generateWithOptions`; the GPU rings get the same handle so they can read the underlying bytes too.

src/compute/forward_zinc_rt.zig:57

Methods

3

method

Model.load

pub fn load(path: []const u8, allocator: std.mem.Allocator) !Model

Memory-map a GGUF model file, parse its metadata, and pre-build the load-time caches the scalar decode path relies on (small F32 norm/SSM tensors, Q8_0→Q4_0 and F32→Q8_0 re-quantizations of the heavier per-token weight streams, RoPE inverse-frequency table, attention sinks, SSM conv1d kernels).

The returned handle must eventually be released with `deinit`.

Parameters
path
Filesystem path to the GGUF model.
allocator
Owns every secondary allocation reachable from `Model`.
Returns

A fully-initialised `Model`, or an error if the file is unreadable, the tensors don't match the expected shapes, or the GGUF metadata is malformed.

src/compute/forward_zinc_rt.zig:132

method

Model.deinit

pub fn deinit(self: *Model) void

Release every allocation reachable from `Model`: the re-quantized weight blobs, the small-tensor F32 cache, the SSM/RoPE/attention-sink scratch, the GGUF parse state, the mmap, and the file handle.

Poisons `self`; the handle must not be reused afterwards.

src/compute/forward_zinc_rt.zig:421
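
A minimal sketch of the load/release lifecycle, based only on the two signatures above. The import path and the model path are placeholders (assumptions, not taken from the source), and the error handling shown is just one reasonable choice.

```zig
const std = @import("std");
// Assumption: the actual import path depends on the build layout.
const zrt = @import("forward_zinc_rt.zig");

pub fn main() void {
    const allocator = std.heap.page_allocator;

    // "model.gguf" is a placeholder path. Per the docs above, `load` fails
    // on an unreadable file, mismatched tensor shapes, or malformed metadata.
    var model = zrt.Model.load("model.gguf", allocator) catch |err| {
        std.debug.print("model load failed: {s}\n", .{@errorName(err)});
        return;
    };
    // `deinit` poisons the handle, so nothing may touch `model` after this runs.
    defer model.deinit();

    // ... hand `&model` to `generate` / `generateWithOptions` here ...
}
```

Pairing `load` with an immediate `defer model.deinit()` keeps the mmap and file handle from leaking on any early return.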

method

Model.validateDecodeGraph

pub fn validateDecodeGraph(self: *const Model, allocator: std.mem.Allocator) !DecodeGraphSummary

Emit the per-token decode IR for this model's shape and run the validator on it, returning a node/layer count summary the caller can log or assert against.

Used as an admission gate before a real decode run so a malformed graph fails loud and early instead of corrupting the scalar kernels mid-token.

Parameters
allocator
Used for the throwaway IR graph; released before return.
Returns

Summary of the emitted-and-validated decode graph.

src/compute/forward_zinc_rt.zig:498
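
As a rough illustration of the admission-gate usage, the helper below validates first and only then lets the caller proceed. The helper name and the import path are hypothetical. The fields of `DecodeGraphSummary` are not listed in this section, so the sketch prints the struct with the generic formatter instead of naming them.

```zig
const std = @import("std");
// Assumption: the actual import path depends on the build layout.
const zrt = @import("forward_zinc_rt.zig");

/// Hypothetical helper: fail early if the decode graph does not validate.
fn admitDecode(model: *const zrt.Model, allocator: std.mem.Allocator) !void {
    // Any validator error propagates to the caller before decoding starts.
    const summary = try model.validateDecodeGraph(allocator);
    // Field names are not documented here, so print the whole struct.
    std.debug.print("decode graph ok: {any}\n", .{summary});
}
```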

struct

DecodeGraphSummary

pub const DecodeGraphSummary = struct

Counts produced by `Model.validateDecodeGraph` — how many IR nodes were emitted, and how the per-layer mix (full attention, SSM, MoE) breaks down.

Useful for asserting the lowered graph matches the model's expected shape.

src/compute/forward_zinc_rt.zig:1101

enum

DirectComputeKind

pub const DirectComputeKind = enum

Tag identifying which direct-compute shortcut the active tier executed for the current decode step.

`none` means the scalar host path retired the token; the other variants name the kernel the GPU ring actually ran (a first-element RMSNorm, an argmax, an argmax composed with that RMSNorm, or a row-range dequantized matvec).

src/compute/forward_zinc_rt.zig:1114

struct

BenchmarkShortcutFlags

pub const BenchmarkShortcutFlags = struct

Flags marking which benchmark-only fast-paths influenced the run.

The performance suite consults these to decide whether a number is comparable to the reference scalar path or whether a measurement-only shortcut was in effect (top-k forced to zero, LM-head row count capped, decode budget clamped).

src/compute/forward_zinc_rt.zig:1139

Methods

1

struct

GenerateResult

pub const GenerateResult = struct

Output of a `generate` / `generateWithOptions` call: the produced token stream, prefill / decode wall-clock splits, the originally-requested and effective decode budgets, and a set of direct-tier instrumentation counters that report how much of the per-token work was actually retired by the GPU ring versus by the scalar fallback.

The token slice is allocator-owned and must be released with `deinit`.

src/compute/forward_zinc_rt.zig:1156

Methods

1

method

GenerateResult.deinit

pub fn deinit(self: *GenerateResult, allocator: std.mem.Allocator) void

Free the produced token slice and poison the handle.

Parameters
allocator
Same allocator that was passed to `generate`.

src/compute/forward_zinc_rt.zig:1178

struct

GenerateOptions

pub const GenerateOptions = struct

Knobs threaded through to `generateWithOptions`.

Lets callers opt out of the per-token direct-tier admission validation when they only want the scalar reference numbers.

src/compute/forward_zinc_rt.zig:1187

function

generate

pub fn generate(
    model: *const Model,
    prompt_tokens: []const u32,
    max_tokens: u32,
    eos_token_id: u32,
    allocator: std.mem.Allocator,
) !GenerateResult

Run the full ZINC_RT forward pass against `model` with default `GenerateOptions`: prefill the prompt, validate the per-token decode IR, and decode at most `max_tokens` tokens (stopping early on `eos_token_id`).

Tries the scalar hybrid path, then the scalar dense path, then falls back to a no-layer smoke tail. Equivalent to `generateWithOptions(..., .{})`.

Parameters

model
Loaded GGUF model.
prompt_tokens
Tokenised prompt; must be non-empty.
max_tokens
Upper bound on tokens to produce after prefill.
eos_token_id
Stop-token id.
allocator
Owns the returned `GenerateResult.tokens`.

Returns

A `GenerateResult` the caller must release via its `deinit`.

src/compute/forward_zinc_rt.zig:1206
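
An end-to-end sketch that strings the documented calls together: load, tokenize, generate, decode. The import path, model path, prompt, token budget, and the hard-coded EOS id are all assumptions made for illustration; only the function signatures and the `tokens` field come from this page.

```zig
const std = @import("std");
// Assumption: the actual import path depends on the build layout.
const zrt = @import("forward_zinc_rt.zig");

pub fn main() !void {
    // An arena reclaims tokenizer and prompt allocations in one shot; the
    // tokenizer's own release method is not shown in this section.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    var model = try zrt.Model.load("model.gguf", allocator); // placeholder path
    defer model.deinit();

    const tokenizer = try zrt.initTokenizer(&model, allocator);
    const prompt = try tokenizer.encodePrompt("Hello", allocator);

    // Assumption: 2 mirrors the default `Tokenizer.init` falls back to when
    // the GGUF has no EOS key. A real caller should pass the model's EOS id.
    const eos_token_id: u32 = 2;

    var result = try zrt.generate(&model, prompt, 16, eos_token_id, allocator);
    defer result.deinit(allocator);

    // Render each produced token back to text.
    var buf: [256]u8 = undefined;
    for (result.tokens) |id| {
        std.debug.print("{s}", .{tokenizer.decodeToken(id, &buf)});
    }
    std.debug.print("\n", .{});
}
```

The `defer` statements run in reverse order, so the result and the model are released before the arena backing them is torn down.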

function

generateWithOptions

pub fn generateWithOptions(
    model: *const Model,
    prompt_tokens: []const u32,
    max_tokens: u32,
    eos_token_id: u32,
    allocator: std.mem.Allocator,
    options: GenerateOptions,
) !GenerateResult

Full ZINC_RT forward pass with caller-supplied `GenerateOptions`.

Logs the validated decode-graph summary, then picks among three paths: the scalar hybrid MoE+SSM path (for Qwen 3.6-shaped models whose tensors fully resolve via `canRunScalarHybrid`), the scalar dense attention path (for Gemma 4-shaped models via `canRunScalarDense`), and the no-layer smoke tail (everything else). The hybrid and dense paths consume the GPU ring's direct-compute results; the smoke tail is CPU-only.

Parameters

model
Loaded GGUF model.
prompt_tokens
Tokenised prompt; must be non-empty.
max_tokens
Upper bound on tokens to produce after prefill.
eos_token_id
Stop-token id.
allocator
Owns the returned `GenerateResult.tokens` and the throwaway decode IR.
options
Per-call configuration; see `GenerateOptions`.

Returns

A `GenerateResult` the caller must release via its `deinit`.

src/compute/forward_zinc_rt.zig:1231
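
The three-way path selection described above can be pictured as a simple priority chain. This is a self-contained sketch, not the module's code: `ForwardPath` and `pickPath` are hypothetical names, and the two booleans stand in for the results of `canRunScalarHybrid` and `canRunScalarDense`.

```zig
const std = @import("std");

/// Hypothetical tag for the three documented paths (not a type from the module).
const ForwardPath = enum { scalar_hybrid, scalar_dense, no_layer_smoke };

/// Priority order as documented: hybrid first, then dense, then the smoke tail.
/// The arguments stand in for `canRunScalarHybrid` / `canRunScalarDense`.
fn pickPath(can_hybrid: bool, can_dense: bool) ForwardPath {
    if (can_hybrid) return .scalar_hybrid;
    if (can_dense) return .scalar_dense;
    return .no_layer_smoke;
}
```

Keeping the smoke tail as the unconditional last branch means an unsupported model shape still produces a bounded run rather than an error.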

function

initTokenizer

pub fn initTokenizer(model: *const Model, allocator: std.mem.Allocator) !Tokenizer

Build a `Tokenizer` from `model`'s GGUF vocab metadata.

Convenience wrapper around `Tokenizer.init` so callers don't have to reach into the `Model`'s parse state.

Parameters

model
Loaded GGUF model whose vocab table is consulted.
allocator
Owns the resulting tokenizer's vocab and id table.

Returns

A ready-to-use `Tokenizer`.

src/compute/forward_zinc_rt.zig:7252

struct

Tokenizer

pub const Tokenizer = struct

Minimal longest-match BPE-ish tokenizer that reads the vocab and EOS id straight out of a GGUF file.

Encodes prompts via the GPT-2 byte-to-unicode mapping and falls back to byte 0 on misses so the scalar M1 forward path always sees a well-formed token stream.

src/compute/forward_zinc_rt.zig:7260
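
For readers unfamiliar with the GPT-2 byte-to-unicode mapping mentioned above, here is a self-contained sketch of the standard table. It assumes the module uses the usual GPT-2 definition, which this page does not confirm; `isSelfMapped` and `byteToUnicode` are illustrative names.

```zig
const std = @import("std");

/// True for bytes the GPT-2 table maps to themselves:
/// '!'..'~', 0xA1..0xAC, and 0xAE..0xFF.
fn isSelfMapped(b: u8) bool {
    return (b >= 0x21 and b <= 0x7E) or (b >= 0xA1 and b <= 0xAC) or b >= 0xAE;
}

/// Codepoint that stands in for raw byte `b` in a GPT-2 style vocab.
fn byteToUnicode(b: u8) u21 {
    if (isSelfMapped(b)) return b;
    // Every other byte is numbered in ascending order and shifted past 0xFF.
    var n: u21 = 0;
    var i: u8 = 0;
    while (i < b) : (i += 1) {
        if (!isSelfMapped(i)) n += 1;
    }
    return 256 + n;
}
```

Under this table a space (0x20) becomes U+0120 (Ġ) and a newline becomes U+010A (Ċ), which is why those characters appear in GPT-2 style vocab entries.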

Methods

6

method

Tokenizer.init

pub fn init(gf: *const gguf.GGUFFile, allocator: std.mem.Allocator) !Tokenizer

Build a tokenizer by reading `tokenizer.ggml.tokens` and `tokenizer.ggml.eos_token_id` out of `gf`.

Defaults the EOS id to `2` when the metadata key is missing.

Parameters
gf
GGUF file whose tokenizer metadata is consulted.
allocator
Owns the vocab slice and the id hash table.
Returns

A `Tokenizer` ready for `encodePrompt` / `decodeToken`.

src/compute/forward_zinc_rt.zig:7278

method

Tokenizer.encodeGemmaChat

pub fn encodeGemmaChat(self: *const Tokenizer, user_text: []const u8, allocator: std.mem.Allocator) !?[]u32

Wrap the user prompt with Gemma's chat-turn special tokens so the instruction-tuned model has the expected scaffolding.

Tries the Gemma 4 tokens (`<|turn>` / `<turn|>`) first, then falls back to Gemma 2/3 tokens (`<start_of_turn>` / `<end_of_turn>`). Returns null when neither set is present in the vocab (i.e. the model isn't Gemma-templated).

src/compute/forward_zinc_rt.zig:7337
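
Because the return type is an optional, callers need a fallback for non-Gemma vocabs. One possible pattern, sketched with a hypothetical helper name and an assumed import path, is to drop back to `encodePrompt` on null:

```zig
const std = @import("std");
// Assumption: the actual import path depends on the build layout.
const zrt = @import("forward_zinc_rt.zig");

/// Hypothetical helper: prefer the Gemma chat scaffolding when the vocab has it.
fn encodeForChat(
    tok: *const zrt.Tokenizer,
    user_text: []const u8,
    allocator: std.mem.Allocator,
) ![]u32 {
    if (try tok.encodeGemmaChat(user_text, allocator)) |ids| return ids;
    // Null means neither Gemma token set is present in the vocab.
    return tok.encodePrompt(user_text, allocator);
}
```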

method

Tokenizer.encodePrompt

pub fn encodePrompt(self: *const Tokenizer, text: []const u8, allocator: std.mem.Allocator) ![]u32

Encode `text` into a token id stream using a longest-match scan over the vocab.

GPT-2-flavour tokenizers (Qwen) use the byte-to-unicode mapping; SentencePiece-flavour tokenizers (Gemma) substitute spaces with ▁ and pass raw UTF-8 through. Prepends BOS when `add_bos` is set. Unmatched single bytes fall back to token id 0 so the output is always well-formed.

Parameters
text
Raw UTF-8 prompt bytes.
allocator
Owns the returned token slice.
Returns

Token ids ready to feed into `generate`.

src/compute/forward_zinc_rt.zig:7433
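
The longest-match scan with the id-0 fallback can be sketched on a toy vocab as follows. This is a simplified illustration, not the module's implementation: it treats a token's index as its id, scans the vocab linearly, writes into a caller-provided buffer, and skips the byte-to-unicode and ▁ substitution steps. The function name is hypothetical and the syntax assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Greedy longest-match over `vocab`, where a token's id is its index.
/// Writes ids into `out` and returns the filled prefix; stops when `out` is full.
fn encodeLongestMatch(vocab: []const []const u8, text: []const u8, out: []u32) []u32 {
    var n: usize = 0;
    var pos: usize = 0;
    while (pos < text.len and n < out.len) {
        var best_id: u32 = 0; // fallback id for an unmatched byte
        var best_len: usize = 0;
        for (vocab, 0..) |entry, id| {
            // Keep the longest vocab entry that prefixes the remaining text.
            if (entry.len > best_len and std.mem.startsWith(u8, text[pos..], entry)) {
                best_id = @intCast(id);
                best_len = entry.len;
            }
        }
        out[n] = best_id;
        n += 1;
        // An unmatched byte still advances the cursor, so the loop terminates.
        pos += if (best_len == 0) 1 else best_len;
    }
    return out[0..n];
}
```

With the vocab `{"<unk>", "a", "ab", "abc", "b"}`, the text `abcab?` is split as `abc`, `ab`, and one unmatched byte, which takes the fallback id.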

method

Tokenizer.decodeToken

pub fn decodeToken(self: *const Tokenizer, token_id: u32, buf: []u8) []const u8

Render one token id back to its UTF-8 byte form into `buf`.

For GPT-2-flavour vocabs the GPT-2 byte-to-unicode mapping is reversed; for SentencePiece-flavour vocabs the SPIECE underline (▁, U+2581) is mapped to a plain space and remaining codepoints are copied as-is. Truncates instead of erroring when `buf` is too small; returns an empty slice for out-of-range ids.

Parameters
token_id
Token id produced by `generate` or `encodePrompt`.
buf
Scratch buffer the decoded bytes are written into.
Returns

The prefix of `buf` containing the decoded bytes.

src/compute/forward_zinc_rt.zig:7476
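
The SentencePiece branch described above (▁ to space, truncate on a short buffer) might look roughly like this in isolation. It is a self-contained sketch with a hypothetical function name; it operates on a vocab string directly rather than a token id, and it does not cover the GPT-2 branch.

```zig
const std = @import("std");

const spiece_underline = "\u{2581}"; // "▁", three bytes in UTF-8

/// Copies `piece` into `buf`, turning each ▁ into a plain space.
/// Truncates instead of erroring when `buf` is full; returns the written prefix.
fn decodeSpiece(piece: []const u8, buf: []u8) []const u8 {
    var src: usize = 0;
    var dst: usize = 0;
    while (src < piece.len and dst < buf.len) {
        if (std.mem.startsWith(u8, piece[src..], spiece_underline)) {
            buf[dst] = ' ';
            src += spiece_underline.len;
        } else {
            // All other bytes are copied through unchanged.
            buf[dst] = piece[src];
            src += 1;
        }
        dst += 1;
    }
    return buf[0..dst];
}
```

Note that this byte-wise copy can cut a multi-byte codepoint in half when the buffer runs out, so callers of a truncating decoder should size `buf` generously.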