Last updated: 2026-06-12
Inference Runtime
Forward
Run the inference runtime: decode state, pipeline ownership, and token generation.
This module ties together model state, compute graphs, dispatch helpers, and greedy token sampling for a single active inference engine.
struct
DecodeState
pub const DecodeState = struct
Runtime state for the decode loop.
struct
SamplingParams
pub const SamplingParams = struct
Token sampling controls shared by the decode loop and HTTP server.
Methods (1)
method
SamplingParams.requiresLogitsReadback
pub fn requiresLogitsReadback(self: @This()) bool
Return whether the current sampling settings require CPU-visible logits, i.e. whether any sampling path needs more than just the argmax token index.
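As a rough illustration of this check, here is a minimal sketch. The field names and their "disabled" defaults are assumptions made for the example; only the type name and the method signature come from the reference above, and the real predicate may differ.

```zig
const std = @import("std");

/// Hypothetical field layout: only the struct and method names are documented.
const SamplingParams = struct {
    temperature: f32 = 0.0, // assumed: 0 selects greedy argmax
    top_p: f32 = 1.0, // assumed: 1 disables nucleus filtering
    top_k: u32 = 0, // assumed: 0 disables top-k filtering
    repetition_penalty: f32 = 1.0, // assumed: 1 disables the penalty

    /// True when the settings need more than the argmax token index.
    /// In this sketch top_p and top_k only matter once temperature is nonzero,
    /// while a repetition penalty can change the argmax even when greedy.
    pub fn requiresLogitsReadback(self: @This()) bool {
        return self.temperature > 0.0 or self.repetition_penalty != 1.0;
    }
};

pub fn main() void {
    const greedy = SamplingParams{};
    const creative = SamplingParams{ .temperature = 0.8, .top_k = 40 };
    std.debug.print("greedy: {}, creative: {}\n", .{
        greedy.requiresLogitsReadback(),
        creative.requiresLogitsReadback(),
    });
}
```

Keeping this decision in one predicate lets both the decode loop and the HTTP server skip the host readback whenever plain argmax is enough.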
struct
InferenceEngine
pub const InferenceEngine = struct
Central inference engine that owns the GPU resources for one active model.
Holds the loaded model, all Vulkan pipelines and intermediate activation buffers, KV-cache page pool, SSM recurrent state, and per-request profiling counters. One engine instance maps to one GPU device; it is not thread-safe.
Methods (11)
method
InferenceEngine.init
pub fn init(model: *Model, instance: *const Instance, gpu_config: GpuConfig, shader_dir: []const u8, allocator: std.mem.Allocator) !InferenceEngine
Create the runtime objects needed to execute decode-time work on the GPU.
method
InferenceEngine.enableProfiling
pub fn enableProfiling(self: *InferenceEngine) !void
Enable full GPU + CPU profiling.
The timestamp query pool is created in `init`, so this just flips the runtime flag. Returns an error if pool creation failed.
method
InferenceEngine.enableValidationDiagnostics
pub fn enableValidationDiagnostics(self: *InferenceEngine) void
Enable the expensive CPU-vs-GPU validation readbacks used for debugging kernel correctness.
method
InferenceEngine.enableLogitsReadback
pub fn enableLogitsReadback(self: *InferenceEngine) void
Preserve full logits on the host for debug dumps and diagnostic inspection.
method
InferenceEngine.recordProfilingSample
pub fn recordProfilingSample(self: *InferenceEngine) void
Read back all timestamps for the current token and fold them into request-wide profiling stats.
method
InferenceEngine.decodeStep
pub fn decodeStep(self: *InferenceEngine, state: *DecodeState, token_id: u32, collect_output: bool) !void
Run a single decode step for one token through all transformer layers.
embed → [per-layer: norm → QKV → RoPE → KV write → attention → O proj → residual → FFN norm → MoE routing → expert DMMVs → residual] → final norm → LM head → logits diagnostic or GPT-OSS embedding-collection paths.
method
InferenceEngine.prefillBatched
pub fn prefillBatched(self: *InferenceEngine, state: *DecodeState, prompt_tokens: []const u32) !void
Experimental batched prompt prefill for the RDNA/Vulkan backend. Gated by `ZINC_BATCHED_PREFILL=1`. This is the Vulkan analogue of `forward_metal.InferenceEngine.prefillBatched`.
Routes to `prefillA3bProduction` or `prefillQwen36DenseFfnPrefix` when the model and prompt length match those specialized paths; otherwise falls back to the `canUseBatchedPrefillRdna`-gated batched body or `prefillBatch` (per-token serial path). Set `ZINC_BATCHED_PREFILL=0` to force the serial fallback or `=validate` to run both paths and diff the last-token logits. Intel Arc devices require `ZINC_INTEL_BATCHED_PREFILL=1` to opt in.
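The three documented values of `ZINC_BATCHED_PREFILL` can be pictured as a small mode switch. The sketch below is illustrative only: `PrefillMode` and `parsePrefillMode` are hypothetical names, and treating an unrecognized value like an unset variable is an assumption, not documented behavior.

```zig
const std = @import("std");

/// Hypothetical enum: the reference only names the variable and its values.
const PrefillMode = enum { unset, serial, batched, validate };

/// Map the raw value of ZINC_BATCHED_PREFILL to a mode.
/// `raw` is null when the variable is not set. Unrecognized values are
/// treated like an unset variable here, which is an assumption.
fn parsePrefillMode(raw: ?[]const u8) PrefillMode {
    const value = raw orelse return .unset;
    if (std.mem.eql(u8, value, "1")) return .batched;
    if (std.mem.eql(u8, value, "0")) return .serial;
    if (std.mem.eql(u8, value, "validate")) return .validate;
    return .unset;
}

pub fn main() void {
    // Literal inputs keep the sketch independent of how the environment is read.
    std.debug.print("{s}\n", .{@tagName(parsePrefillMode("validate"))});
    std.debug.print("{s}\n", .{@tagName(parsePrefillMode(null))});
}
```

What the engine does in the `unset` case is left to the caller in this sketch, since the reference does not spell out the default.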
method
InferenceEngine.prefillBatch
pub fn prefillBatch(self: *InferenceEngine, state: *DecodeState, prompt_tokens: []const u32) !void
Process all prompt tokens sequentially (one token per GPU submission) to populate the KV cache and SSM state before the first decode step.
an active KV-page allocation for continuation prefill.
method
InferenceEngine.sampleGreedy
pub fn sampleGreedy(self: *const InferenceEngine) u32
Sample a token greedily.
Uses GPU argmax when available; otherwise falls back to a CPU scan of the logits.
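The CPU fallback amounts to a linear argmax over the logits. A minimal sketch follows, assuming the logits are available as a host-side `[]const f32`; `argmaxCpu` is a hypothetical helper name, and breaking ties in favor of the lowest index is a choice made for this example.

```zig
const std = @import("std");

/// Return the index of the largest logit. Ties resolve to the first maximum
/// because the comparison is strict.
fn argmaxCpu(logits: []const f32) u32 {
    std.debug.assert(logits.len > 0);
    var best: usize = 0;
    for (logits, 0..) |value, i| {
        if (value > logits[best]) best = i;
    }
    return @intCast(best);
}

pub fn main() void {
    const logits = [_]f32{ 0.1, 2.5, -1.0, 2.5 };
    std.debug.print("argmax index: {d}\n", .{argmaxCpu(&logits)});
}
```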
method
InferenceEngine.sample
pub fn sample(self: *const InferenceEngine, state: *const DecodeState, params: SamplingParams, random: std.Random) u32
Sample the next token using greedy argmax or stochastic sampling depending on `params`.
Delegates to `sampleGreedy` when no logit readback is needed; otherwise reads staged logits from the host and applies temperature, top-p, top-k, and repetition penalty.
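To make the stochastic path more concrete, the sketch below covers only the temperature step followed by a draw from the resulting distribution. Top-p, top-k, and the repetition penalty are left out, and both helper names are hypothetical. The uniform draw `u` is passed in explicitly so the example stays deterministic; in practice it could come from `random.float(f32)`.

```zig
const std = @import("std");

/// Convert logits to probabilities in place with a temperature-scaled softmax.
/// Subtracting the maximum first keeps the exponentials from overflowing.
fn softmaxWithTemperature(logits: []f32, temperature: f32) void {
    std.debug.assert(logits.len > 0 and temperature > 0.0);
    var max_logit = logits[0];
    for (logits) |v| max_logit = @max(max_logit, v);
    var sum: f32 = 0.0;
    for (logits) |*v| {
        v.* = @exp((v.* - max_logit) / temperature);
        sum += v.*;
    }
    for (logits) |*v| v.* /= sum;
}

/// Pick an index by walking the cumulative distribution until it exceeds `u`,
/// where `u` is a uniform draw in [0, 1).
fn pickFromProbs(probs: []const f32, u: f32) u32 {
    std.debug.assert(probs.len > 0);
    var cumulative: f32 = 0.0;
    for (probs, 0..) |p, i| {
        cumulative += p;
        if (u < cumulative) return @intCast(i);
    }
    // Rounding can leave the total slightly below 1; fall back to the last index.
    return @intCast(probs.len - 1);
}

pub fn main() void {
    var logits = [_]f32{ 1.0, 2.0, 3.0 };
    softmaxWithTemperature(&logits, 0.7);
    std.debug.print("picked token {d}\n", .{pickFromProbs(&logits, 0.5)});
}
```

Lower temperatures concentrate probability on the largest logit, which is why the greedy path can be viewed as the limiting case and does not need the full logits on the host.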
method
InferenceEngine.deinit
pub fn deinit(self: *InferenceEngine) void
Release GPU buffers, graphs, command objects, and dispatch helpers owned by the engine.
function
generate
pub fn generate(engine: *InferenceEngine, prompt_tokens: []const u32, max_tokens: u32, eos_token_id: u32, allocator: std.mem.Allocator) ![]u32
Run single-request inference: prefill the prompt, decode greedily, and return generated token IDs.
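The prefill, sample, decode cycle described here can be sketched against a toy stand-in for the engine. Everything below is hypothetical: `ToyEngine` simply predicts "last token + 1", `generateSketch` writes into a caller-provided buffer instead of allocating, and leaving the EOS token out of the output is an assumption about behavior the reference does not specify.

```zig
const std = @import("std");

/// Hypothetical stand-in for InferenceEngine that "predicts" last token + 1.
const ToyEngine = struct {
    last_token: u32 = 0,

    fn prefillBatch(self: *ToyEngine, prompt_tokens: []const u32) void {
        for (prompt_tokens) |t| self.last_token = t;
    }
    fn decodeStep(self: *ToyEngine, token_id: u32) void {
        self.last_token = token_id;
    }
    fn sampleGreedy(self: *const ToyEngine) u32 {
        return self.last_token + 1;
    }
};

/// Prefill the prompt, then alternate sampling and decoding until EOS is
/// sampled, `max_tokens` is reached, or `out` is full.
/// Returns the number of tokens written; EOS itself is not written (assumption).
fn generateSketch(
    engine: *ToyEngine,
    prompt_tokens: []const u32,
    max_tokens: u32,
    eos_token_id: u32,
    out: []u32,
) usize {
    engine.prefillBatch(prompt_tokens);
    var count: usize = 0;
    while (count < max_tokens and count < out.len) {
        const next = engine.sampleGreedy();
        if (next == eos_token_id) break;
        out[count] = next;
        count += 1;
        engine.decodeStep(next); // feed the sampled token back in
    }
    return count;
}

pub fn main() void {
    var engine = ToyEngine{};
    var out: [8]u32 = undefined;
    const n = generateSketch(&engine, &[_]u32{ 5, 6, 7 }, 8, 11, &out);
    std.debug.print("generated {d} tokens\n", .{n});
}
```

The documented `generate` takes an allocator and returns an owned `[]u32`, so a caller of the real function would presumably free that slice; the fixed buffer here only keeps the sketch short.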