Last updated: 2026-06-12
Inference Runtime
Forward Zinc Rt
ZINC_RT forward-pass bring-up.
M1 wires a scalar forward path for the hybrid Qwen MoE+SSM model used by the RDNA migration harness, plus the first AMDGPU-CS-produced token boundary. The old no-layer tail remains as a smoke fallback for unsupported shapes.
constant
m0_max_decode_tokens_default
pub const m0_max_decode_tokens_default: u32 = 8 Decode-token budget used by the no-layer smoke tail (`generateNoLayer`) and by benchmarks that want a short, bounded run on the scalar reference path.
The full M1 forward paths (`generateScalarHybrid`, `generateScalarDense`) pass the caller-supplied `max_tokens` through unchanged and never consult this constant. The default can be overridden via `ZINC_RT_MAX_DECODE_TOKENS`.
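As a rough illustration of how such an override could be resolved, the sketch below parses the raw value of `ZINC_RT_MAX_DECODE_TOKENS` and falls back to the default. The helper name `resolveDecodeBudget`, the assumption that the override is an environment variable passed in as an optional string, and the fall-back-on-garbage policy are all hypothetical and not taken from the runtime.

```zig
const std = @import("std");

/// Mirrors the documented default of 8 decode tokens.
pub const m0_max_decode_tokens_default: u32 = 8;

/// Hypothetical helper: resolve the smoke-tail decode budget from the raw
/// value of ZINC_RT_MAX_DECODE_TOKENS (null when the override is unset).
/// Falling back to the default on unparseable input is an assumption.
pub fn resolveDecodeBudget(env_value: ?[]const u8) u32 {
    const raw = env_value orelse return m0_max_decode_tokens_default;
    const trimmed = std.mem.trim(u8, raw, " \t\r\n");
    return std.fmt.parseInt(u32, trimmed, 10) catch m0_max_decode_tokens_default;
}
```

Keeping the parser a pure function of an optional string means it can be tested without touching the process environment.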
struct
Model
pub const Model = struct Loaded GGUF model plus all the load-time scratch derived from it.
Owns the mmap'd file, the resolved per-layer tensor table, the cached F32 dequants of small norm/SSM-side tensors, and the re-quantized weight blobs the decode path streams in place of the larger source-format weights. A `Model` is built once with `load` and consumed by `generate` / `generateWithOptions`; the GPU rings get the same handle so they can read the underlying bytes too.
Methods (3)
method
Model.load
pub fn load(path: []const u8, allocator: std.mem.Allocator) !Model Memory-map a GGUF model file, parse its metadata, and pre-build the load-time caches the scalar decode path relies on (small F32 norm/SSM tensors, Q8_0→Q4_0 and F32→Q8_0 re-quantizations of the heavier per-token weight streams, RoPE inverse-frequency table, attention sinks, SSM conv1d kernels).
The returned handle must eventually be released with `deinit`. Returns an error if the file is unreadable, the tensors don't match the expected shapes, or the GGUF metadata is malformed.
method
Model.deinit
pub fn deinit(self: *Model) void Release every allocation reachable from `Model`: the re-quantized weight blobs, the small-tensor F32 cache, the SSM/RoPE/attention-sink scratch, the GGUF parse state, the mmap, and the file handle.
Poisons `self`; the handle must not be reused afterwards.
method
Model.validateDecodeGraph
pub fn validateDecodeGraph(self: *const Model, allocator: std.mem.Allocator) !DecodeGraphSummary Emit the per-token decode IR for this model's shape and run the validator on it, returning a node/layer count summary the caller can log or assert against.
Used as an admission gate before a real decode run so a malformed graph fails loud and early instead of corrupting the scalar kernels mid-token.
struct
DecodeGraphSummary
pub const DecodeGraphSummary = struct Counts produced by `Model.validateDecodeGraph` — how many IR nodes were emitted, and how the per-layer mix (full attention, SSM, MoE) breaks down.
Useful for asserting the lowered graph matches the model's expected shape.
enum
DirectComputeKind
pub const DirectComputeKind = enum Tag identifying which direct-compute shortcut the active tier executed for the current decode step.
`none` means the scalar host path retired the token; the other variants name the kernel the GPU ring actually ran (a first-element RMSNorm, an argmax, an argmax composed with that RMSNorm, or a row-range dequantized matvec).
struct
BenchmarkShortcutFlags
pub const BenchmarkShortcutFlags = struct Flags marking which benchmark-only fast-paths influenced the run.
The performance suite consults these to decide whether a number is comparable to the reference scalar path or whether a measurement-only shortcut was in effect (top-k forced to zero, LM-head row count capped, decode budget clamped).
Methods (1)
method
BenchmarkShortcutFlags.any
pub fn any(self: BenchmarkShortcutFlags) bool True if any benchmark shortcut was applied during the run.
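A minimal sketch of what `any` amounts to, assuming one boolean per shortcut named in the description above. The field names are hypothetical; only the struct name and the `any` signature come from this page.

```zig
const std = @import("std");

/// Sketch only: field names are hypothetical, chosen to match the three
/// shortcuts the description mentions.
pub const BenchmarkShortcutFlags = struct {
    top_k_forced_zero: bool = false,
    lm_head_rows_capped: bool = false,
    decode_budget_clamped: bool = false,

    /// True if any benchmark shortcut was applied during the run.
    pub fn any(self: BenchmarkShortcutFlags) bool {
        return self.top_k_forced_zero or
            self.lm_head_rows_capped or
            self.decode_budget_clamped;
    }
};
```

A performance harness could then treat a result as comparable to the reference scalar path only when `flags.any()` is false.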
struct
GenerateResult
pub const GenerateResult = struct Output of a `generate` / `generateWithOptions` call: the produced token stream, prefill / decode wall-clock splits, the originally-requested and effective decode budgets, and a set of direct-tier instrumentation counters that report how much of the per-token work was actually retired by the GPU ring versus by the scalar fallback.
The token slice is allocator-owned and must be released with `deinit`.
Methods (1)
method
GenerateResult.deinit
pub fn deinit(self: *GenerateResult, allocator: std.mem.Allocator) void Free the produced token slice and poison the handle.
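The free-then-poison idiom described here can be sketched as follows. This is a stand-in struct, not the real `GenerateResult`: the `tokens` field name is an assumption, and the timing and counter fields are omitted.

```zig
const std = @import("std");

/// Sketch only: stand-in for the real GenerateResult, reduced to the
/// allocator-owned token slice. The field name `tokens` is hypothetical.
pub const GenerateResult = struct {
    tokens: []u32,

    pub fn deinit(self: *GenerateResult, allocator: std.mem.Allocator) void {
        allocator.free(self.tokens);
        self.* = undefined; // poison: any later use of the handle is a bug
    }
};

test "token slice is released by deinit" {
    const allocator = std.testing.allocator; // reports leaks at test exit
    var result = GenerateResult{ .tokens = try allocator.alloc(u32, 4) };
    defer result.deinit(allocator);
    for (result.tokens) |*t| t.* = 0;
    try std.testing.expectEqual(@as(usize, 4), result.tokens.len);
}
```

Pairing the call that produces the result with an immediate `defer result.deinit(allocator)` keeps the release next to the acquisition, which is the usual Zig convention.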
struct
GenerateOptions
pub const GenerateOptions = struct Knobs threaded through to `generateWithOptions`.
Lets callers opt out of the per-token direct-tier admission validation when they only want the scalar reference numbers.
function
generate
pub fn generate( model: *const Model, prompt_tokens: []const u32, max_tokens: u32, eos_token_id: u32, allocator: std.mem.Allocator, ) !GenerateResult Run the full ZINC_RT forward pass against `model` with default `GenerateOptions`: prefill the prompt, validate the per-token decode IR, and decode at most `max_tokens` tokens (stopping early on `eos_token_id`).
Tries the scalar hybrid path, then the scalar dense path, then falls back to a no-layer smoke tail. Equivalent to `generateWithOptions(..., .{})`.
function
generateWithOptions
pub fn generateWithOptions( model: *const Model, prompt_tokens: []const u32, max_tokens: u32, eos_token_id: u32, allocator: std.mem.Allocator, options: GenerateOptions, ) !GenerateResult Full ZINC_RT forward pass with caller-supplied `GenerateOptions`.
Logs the validated decode-graph summary, then picks among three paths: the scalar hybrid MoE+SSM path (for Qwen 3.6-shaped models whose tensors fully resolve via `canRunScalarHybrid`), the scalar dense attention path (for Gemma 4-shaped models via `canRunScalarDense`), and the no-layer smoke tail (everything else). The hybrid and dense paths consume the GPU ring's direct-compute results; the smoke tail is CPU-only.
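The three-way selection can be pictured as a simple ordered fallback. In the sketch below, `ForwardPath` and `selectForwardPath` are hypothetical names, and the two booleans stand in for the results of `canRunScalarHybrid` and `canRunScalarDense`, whose real signatures are not shown on this page.

```zig
const std = @import("std");

/// Hypothetical tag for the three documented paths.
pub const ForwardPath = enum { scalar_hybrid, scalar_dense, no_layer };

/// Ordered fallback: hybrid first, then dense, then the smoke tail.
/// The booleans are stand-ins for the two capability checks.
pub fn selectForwardPath(can_hybrid: bool, can_dense: bool) ForwardPath {
    if (can_hybrid) return .scalar_hybrid;
    if (can_dense) return .scalar_dense;
    return .no_layer;
}
```

The order matters only if a model could satisfy both checks; this sketch assumes the hybrid path wins in that case, matching the order in which `generate` is described as trying the paths.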
function
initTokenizer
pub fn initTokenizer(model: *const Model, allocator: std.mem.Allocator) !Tokenizer Build a `Tokenizer` from `model`'s GGUF vocab metadata.
Convenience wrapper around `Tokenizer.init` so callers don't have to reach into the `Model`'s parse state.
struct
Tokenizer
pub const Tokenizer = struct Minimal longest-match BPE-ish tokenizer that reads the vocab and EOS id straight out of a GGUF file.
Encodes prompts via the GPT-2 byte-to-unicode mapping and falls back to token id 0 on misses so the scalar M1 forward path always sees a well-formed token stream.
Methods (6)
method
Tokenizer.init
pub fn init(gf: *const gguf.GGUFFile, allocator: std.mem.Allocator) !Tokenizer Build a tokenizer by reading `tokenizer.ggml.tokens` and `tokenizer.ggml.eos_token_id` out of `gf`.
Defaults the EOS id to `2` when the metadata key is missing.
method
Tokenizer.deinit
pub fn deinit(self: *Tokenizer) void Release the vocab slice and the id hash table, then poison the handle.
method
Tokenizer.eosId
pub fn eosId(self: *const Tokenizer) u32 Return the stop token id the caller should pass to `generate`.
method
Tokenizer.encodeGemmaChat
pub fn encodeGemmaChat(self: *const Tokenizer, user_text: []const u8, allocator: std.mem.Allocator) !?[]u32 Wrap the user prompt with Gemma's chat-turn special tokens so the instruction-tuned model has the expected scaffolding.
Tries the Gemma 4 tokens (`<|turn>` / `<turn|>`) first, then falls back to Gemma 2/3 tokens (`<start_of_turn>` / `<end_of_turn>`). Returns null when neither set is present in the vocab (i.e. the model isn't Gemma-templated).
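The try-newer-then-older lookup might look roughly like this. The sketch uses a linear scan over a plain slice of vocab strings instead of the tokenizer's id hash table, and the names `TurnTokens`, `inVocab`, and `pickTurnTokens` are hypothetical; only the four special-token strings come from the description above.

```zig
const std = @import("std");

pub const TurnTokens = struct { open: []const u8, close: []const u8 };

const gemma4 = TurnTokens{ .open = "<|turn>", .close = "<turn|>" };
const gemma23 = TurnTokens{ .open = "<start_of_turn>", .close = "<end_of_turn>" };

/// Linear membership test; a real tokenizer would use its hash table.
fn inVocab(vocab: []const []const u8, piece: []const u8) bool {
    for (vocab) |entry| {
        if (std.mem.eql(u8, entry, piece)) return true;
    }
    return false;
}

/// Return the first token pair fully present in `vocab`, or null when the
/// model does not appear to be Gemma-templated.
pub fn pickTurnTokens(vocab: []const []const u8) ?TurnTokens {
    const candidates = [_]TurnTokens{ gemma4, gemma23 };
    for (candidates) |c| {
        if (inVocab(vocab, c.open) and inVocab(vocab, c.close)) return c;
    }
    return null;
}
```

Requiring both the opening and closing token before accepting a set is an assumption of this sketch; it avoids emitting a half-formed chat turn.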
method
Tokenizer.encodePrompt
pub fn encodePrompt(self: *const Tokenizer, text: []const u8, allocator: std.mem.Allocator) ![]u32 Encode `text` into a token id stream using a longest-match scan over the vocab.
GPT-2-flavour tokenizers (Qwen) use the byte-to-unicode mapping; SentencePiece-flavour tokenizers (Gemma) substitute spaces with ▁ and pass raw UTF-8 through. Prepends BOS when `add_bos` is set. Unmatched single bytes fall back to token id 0 so the output is always well-formed.
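To make the longest-match scan concrete, here is a small sketch of the scan alone. It deliberately leaves out the byte-to-unicode mapping, the ▁ substitution, and BOS handling, writes into a caller-provided buffer instead of allocating, and uses a brute-force search over the vocab; the function name `encodeLongestMatch` is hypothetical.

```zig
const std = @import("std");

/// Greedy longest-match scan over `vocab` (slice index = token id).
/// Writes ids into `out` and returns the filled prefix. A byte with no
/// matching piece emits token id 0 and advances by one byte. Encoding
/// stops silently when `out` is full (a simplification of this sketch).
pub fn encodeLongestMatch(
    vocab: []const []const u8,
    text: []const u8,
    out: []u32,
) []u32 {
    var pos: usize = 0;
    var n: usize = 0;
    while (pos < text.len and n < out.len) {
        var best_id: u32 = 0;
        var best_len: usize = 0;
        for (vocab, 0..) |piece, id| {
            // Only strictly longer matches replace the current best,
            // which also skips empty vocab entries.
            if (piece.len > best_len and std.mem.startsWith(u8, text[pos..], piece)) {
                best_id = @intCast(id);
                best_len = piece.len;
            }
        }
        out[n] = best_id;
        n += 1;
        pos += if (best_len == 0) 1 else best_len;
    }
    return out[0..n];
}
```

With a vocab of `"<unk>", "a", "ab", "abc", "c"`, the text `abcab?` is split as `abc`, `ab`, and one unmatched byte, which maps to id 0.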
method
Tokenizer.decodeToken
pub fn decodeToken(self: *const Tokenizer, token_id: u32, buf: []u8) []const u8 Render one token id back to its UTF-8 byte form into `buf`.
For GPT-2-flavour vocabs the GPT-2 byte-to-unicode mapping is reversed; for SentencePiece-flavour vocabs the SPIECE underline (▁, U+2581) is mapped to a plain space and remaining codepoints are copied as-is. Truncates instead of erroring when `buf` is too small; returns an empty slice for out-of-range ids.
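The SentencePiece branch of this behaviour can be sketched as a byte-wise copy that swaps each U+2581 for a space and stops when the buffer is full. The function name `renderSpiece` is hypothetical, and the sketch takes the vocab piece directly rather than a token id, so the out-of-range check is not shown.

```zig
const std = @import("std");

const spiece_underline = "\u{2581}"; // "▁", three bytes (E2 96 81) in UTF-8

/// Copy `piece` into `buf`, turning each U+2581 into a plain space.
/// Stops early (truncates) when `buf` is full and returns the written
/// prefix. Truncation is byte-wise, so it may cut a multi-byte codepoint.
pub fn renderSpiece(piece: []const u8, buf: []u8) []const u8 {
    var src: usize = 0;
    var dst: usize = 0;
    while (src < piece.len and dst < buf.len) {
        if (std.mem.startsWith(u8, piece[src..], spiece_underline)) {
            buf[dst] = ' ';
            src += spiece_underline.len;
        } else {
            buf[dst] = piece[src];
            src += 1;
        }
        dst += 1;
    }
    return buf[0..dst];
}
```

Because the output is never longer than the input, a buffer of `piece.len` bytes is always large enough to avoid truncation in this sketch.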