Last updated: 2026-06-12

Inference Runtime

Forward


Run the inference runtime: decode state, pipeline ownership, and token generation.

This module ties together model state, compute graphs, dispatch helpers, and greedy token sampling for a single active inference engine.

4 exports, 14 methods, src/compute/forward.zig


struct

DecodeState

pub const DecodeState = struct

Runtime state for the decode loop.

src/compute/forward.zig:129

Methods

2

method

DecodeState.init

pub fn init(allocator: std.mem.Allocator) DecodeState

Initialize decode state for a fresh generation request.

Parameters
allocator
Allocator used for the generated token list.
Returns

A DecodeState positioned at token index zero with an empty output buffer.

src/compute/forward.zig:142

method

DecodeState.deinit

pub fn deinit(self: *DecodeState) void

Release the generated token buffer owned by the decode state.

Parameters
self
Decode state to tear down in place.
Notes

After this call the state is invalid and should not be reused.

src/compute/forward.zig:154

struct

SamplingParams

pub const SamplingParams = struct

Token sampling controls shared by the decode loop and HTTP server.

src/compute/forward.zig:161

Methods

1

method

SamplingParams.requiresLogitsReadback

pub fn requiresLogitsReadback(self: @This()) bool

Return whether the current sampling settings require CPU-visible logits, i.e. any path that needs more than just the argmax token index.

Returns

`true` when temperature, top-p, or repetition penalty is active.

src/compute/forward.zig:170
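
The check described above can be sketched as follows. This is a minimal illustration, not the code in `forward.zig`: the struct name `SamplingParamsSketch`, its field names, and the neutral values (temperature 0, top-p 1, penalty 1) are all assumptions made for the example.

```zig
const std = @import("std");

// Hypothetical stand-in for SamplingParams; field names and the
// "inactive" values below are assumptions, not the real definition.
const SamplingParamsSketch = struct {
    temperature: f32 = 0.0, // 0 is treated as "greedy" in this sketch
    top_p: f32 = 1.0, // 1 disables nucleus filtering
    repetition_penalty: f32 = 1.0, // 1 disables the penalty

    // True when any control needs the full logit vector on the host,
    // i.e. more than the argmax token index.
    pub fn requiresLogitsReadback(self: @This()) bool {
        return self.temperature > 0.0 or
            self.top_p < 1.0 or
            self.repetition_penalty != 1.0;
    }
};
```

With all defaults the sketch reports `false`, which corresponds to the greedy path that only needs the argmax index.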

struct

InferenceEngine

pub const InferenceEngine = struct

Central inference engine that owns the GPU resources for one active model.

Holds the loaded model, all Vulkan pipelines and intermediate activation buffers, KV-cache page pool, SSM recurrent state, and per-request profiling counters. One engine instance maps to one GPU device; it is not thread-safe.

src/compute/forward.zig:992

Methods

11

method

InferenceEngine.init

pub fn init(model: *Model, instance: *const Instance, gpu_config: GpuConfig, shader_dir: []const u8, allocator: std.mem.Allocator) !InferenceEngine

Create the runtime objects needed to execute decode-time work on the GPU.

Parameters
model
Loaded model weights and metadata.
instance
Active Vulkan instance and logical device.
gpu_config
Derived GPU tuning parameters for the selected device.
shader_dir
Directory containing compiled SPIR-V shader binaries.
allocator
Allocator used for graphs, staging state, and temporary setup structures.
Returns

An initialized inference engine ready to prefill prompts and run decode steps.

Notes

This allocates shared descriptor pools, staging buffers, intermediate activations, and dispatch wrappers up front.

src/compute/forward.zig:1564

method

InferenceEngine.enableProfiling

pub fn enableProfiling(self: *InferenceEngine) !void

Enable full GPU + CPU profiling.

The timestamp query pool is created in `init`, so this just flips the runtime flag. Returns an error if pool creation failed.

src/compute/forward.zig:3369

method

InferenceEngine.enableValidationDiagnostics

pub fn enableValidationDiagnostics(self: *InferenceEngine) void

Enable the expensive CPU-vs-GPU validation readbacks used for debugging kernel correctness.

src/compute/forward.zig:3376

method

InferenceEngine.enableLogitsReadback

pub fn enableLogitsReadback(self: *InferenceEngine) void

Preserve full logits on the host for debug dumps and diagnostic inspection.

src/compute/forward.zig:3381

method

InferenceEngine.recordProfilingSample

pub fn recordProfilingSample(self: *InferenceEngine) void

Read back all timestamps for the current token and fold them into request-wide profiling stats.

src/compute/forward.zig:3681

method

InferenceEngine.decodeStep

pub fn decodeStep(self: *InferenceEngine, state: *DecodeState, token_id: u32, collect_output: bool) !void

Run a single decode step for one token through all transformer layers.

embed → [per-layer: norm → QKV → RoPE → KV write → attention → O proj → residual → FFN norm → MoE routing → expert DMMVs → residual] → final norm → LM head → logits.

Parameters
state
Decode state carrying the current token position and generated token history.
token_id
Vocabulary index of the token to embed and feed forward.
collect_output
When `true`, the engine accumulates layer outputs needed by diagnostic or GPT-OSS embedding-collection paths.
Returns

`error.ContextLengthExceeded` when `state.position` is at capacity.

src/compute/forward.zig:6255
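
The capacity check behind `error.ContextLengthExceeded` might look roughly like the guard below. The helper name `checkCapacity` and the parameter `max_context` are hypothetical; the source only states that the error is returned when `state.position` is at capacity.

```zig
const std = @import("std");

const DecodeError = error{ContextLengthExceeded};

// `max_context` is a hypothetical name for the engine's token capacity.
// A position equal to the capacity has no free slot left, so it is rejected.
fn checkCapacity(position: u32, max_context: u32) DecodeError!void {
    if (position >= max_context) return error.ContextLengthExceeded;
}
```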

method

InferenceEngine.prefillBatched

pub fn prefillBatched(self: *InferenceEngine, state: *DecodeState, prompt_tokens: []const u32) !void

Experimental batched prompt prefill for the RDNA/Vulkan backend. Gated by `ZINC_BATCHED_PREFILL=1`. This is the Vulkan analogue of `forward_metal.InferenceEngine.prefillBatched`.

Routes to `prefillA3bProduction` or `prefillQwen36DenseFfnPrefix` when the model and prompt length match those specialized paths; otherwise falls back to the `canUseBatchedPrefillRdna`-gated batched body or to `prefillBatch` (the per-token serial path). Set `ZINC_BATCHED_PREFILL=0` to force the serial fallback, or `ZINC_BATCHED_PREFILL=validate` to run both paths and diff the last-token logits. Intel Arc devices require `ZINC_INTEL_BATCHED_PREFILL=1` to opt in.

Parameters
state
Decode state for the current request.
prompt_tokens
Tokenized prompt sequence to prefill.

src/compute/forward.zig:22214
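
The three documented values of `ZINC_BATCHED_PREFILL` can be modeled as a small enum. The sketch below is illustrative only: `PrefillMode` and `parsePrefillMode` are hypothetical names, and because the source does not say what happens when the variable is unset or unrecognized, the sketch returns `null` in those cases and leaves the default to the caller.

```zig
const std = @import("std");

// Hypothetical enum covering the three documented settings.
const PrefillMode = enum { batched, serial, validate };

// Map the raw environment value to a mode. `value` is null when the
// variable is unset; unset and unknown values both yield null here.
fn parsePrefillMode(value: ?[]const u8) ?PrefillMode {
    const v = value orelse return null;
    if (std.mem.eql(u8, v, "1")) return .batched;
    if (std.mem.eql(u8, v, "0")) return .serial;
    if (std.mem.eql(u8, v, "validate")) return .validate;
    return null;
}
```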

method

InferenceEngine.prefillBatch

pub fn prefillBatch(self: *InferenceEngine, state: *DecodeState, prompt_tokens: []const u32) !void

Process all prompt tokens sequentially (one token per GPU submission) to populate the KV cache and SSM state before the first decode step.

Parameters
state
Decode state; must start at position 0 for a fresh request or have an active KV-page allocation for continuation prefill.
prompt_tokens
Tokenized input sequence to prefill. No-op when empty.
Notes

This is the per-token serial path. For the experimental batched variant see `prefillBatched`.

src/compute/forward.zig:22930

method

InferenceEngine.sampleGreedy

pub fn sampleGreedy(self: *const InferenceEngine) u32

Sample a token greedily.

Uses GPU argmax when available, otherwise falls back to CPU scan.

Returns

The vocabulary index of the highest-logit token.

src/compute/forward.zig:23579
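
For reference, a CPU scan of the kind the fallback implies could be as simple as the following. The function name `argmaxCpu` and the tie-breaking rule (lowest index wins) are assumptions of this sketch, not confirmed behavior of `sampleGreedy`.

```zig
const std = @import("std");

// Return the index of the largest logit. Ties resolve to the lowest
// index because only a strictly greater value replaces the current best.
fn argmaxCpu(logits: []const f32) u32 {
    std.debug.assert(logits.len > 0);
    var best: usize = 0;
    for (logits, 0..) |value, i| {
        if (value > logits[best]) best = i;
    }
    return @intCast(best);
}
```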

method

InferenceEngine.sample

pub fn sample(self: *const InferenceEngine, state: *const DecodeState, params: SamplingParams, random: std.Random) u32

Sample the next token using greedy argmax or stochastic sampling depending on `params`.

Delegates to `sampleGreedy` when no logit readback is needed; otherwise reads staged logits from the host and applies temperature, top-p, top-k, and repetition penalty.

Parameters
state
Decode state supplying the generated-token history for repetition penalty.
params
Sampling hyper-parameters controlling temperature and nucleus filtering.
random
Random source for stochastic sampling; unused on the greedy path.
Returns

The sampled vocabulary token index.

src/compute/forward.zig:23607
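
To illustrate the stochastic path, here is a minimal temperature-only sampler using inverse-CDF selection. It deliberately omits top-p, top-k, and repetition penalty, and it takes a pre-drawn uniform value `u` instead of a `std.Random` so the result is deterministic; the function name, the `scratch` buffer, and this simplified structure are assumptions rather than the engine's implementation.

```zig
const std = @import("std");

// Temperature sampling by inverse CDF.
// `scratch` receives unnormalized probabilities and must match `logits` in length.
// `u` is a uniform draw in [0, 1) supplied by the caller.
fn sampleTemperature(logits: []const f32, scratch: []f32, temperature: f32, u: f32) u32 {
    std.debug.assert(logits.len > 0 and scratch.len == logits.len);
    std.debug.assert(temperature > 0.0);

    // Subtract the max logit before exponentiating for numerical stability.
    var max_logit = logits[0];
    for (logits) |v| max_logit = @max(max_logit, v);

    var total: f32 = 0.0;
    for (logits, scratch) |v, *p| {
        p.* = @exp((v - max_logit) / temperature);
        total += p.*;
    }

    // Walk the cumulative mass until it exceeds the scaled draw.
    const target = u * total;
    var cumulative: f32 = 0.0;
    for (scratch, 0..) |p, i| {
        cumulative += p;
        if (target < cumulative) return @intCast(i);
    }
    return @intCast(logits.len - 1); // guard against rounding at the tail
}
```

Lower temperatures sharpen the distribution toward the argmax token, which is why the greedy path can skip the logits readback entirely.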

method

InferenceEngine.deinit

pub fn deinit(self: *InferenceEngine) void

Release GPU buffers, graphs, command objects, and dispatch helpers owned by the engine.

src/compute/forward.zig:24096

function

generate

pub fn generate(engine: *InferenceEngine, prompt_tokens: []const u32, max_tokens: u32, eos_token_id: u32, allocator: std.mem.Allocator) ![]u32

Run single-request inference: prefill the prompt, decode greedily, and return generated token IDs.

Parameters

engine
Initialized inference engine.
prompt_tokens
Tokenized prompt that seeds the prefill pass.
max_tokens
Maximum number of decode tokens to emit after prefill.
eos_token_id
Token ID that stops generation early when sampled.
allocator
Allocator used for transient decode state and the returned token slice.

Returns

A heap-allocated slice containing only the generated continuation tokens.

Notes

Generation stops early when the sampled token equals `eos_token_id`.

src/compute/forward.zig:24265
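
The prefill, sample, decode loop described for `generate` can be sketched against a stand-in engine. Everything here is hypothetical: `FakeEngine` replaces the real `InferenceEngine` (its methods drop the `DecodeState` and `collect_output` arguments and cannot fail), and the choices to sample before the first decode step and to exclude the EOS token from the output are assumptions, not documented behavior.

```zig
const std = @import("std");

// Hypothetical stand-in for the engine: it "predicts" previous token + 1.
const FakeEngine = struct {
    last: u32 = 0,

    fn prefill(self: *FakeEngine, prompt_tokens: []const u32) void {
        if (prompt_tokens.len > 0) self.last = prompt_tokens[prompt_tokens.len - 1];
    }
    fn decodeStep(self: *FakeEngine, token_id: u32) void {
        self.last = token_id;
    }
    fn sampleGreedy(self: *const FakeEngine) u32 {
        return self.last + 1;
    }
};

fn generateSketch(
    engine: *FakeEngine,
    prompt_tokens: []const u32,
    max_tokens: u32,
    eos_token_id: u32,
    allocator: std.mem.Allocator,
) ![]u32 {
    // Worst-case scratch buffer; the result is copied out at its final length.
    const buf = try allocator.alloc(u32, max_tokens);
    defer allocator.free(buf);

    engine.prefill(prompt_tokens);

    var count: usize = 0;
    while (count < max_tokens) {
        const token = engine.sampleGreedy();
        if (token == eos_token_id) break; // assumption: EOS is not emitted
        buf[count] = token;
        count += 1;
        engine.decodeStep(token); // feed the token back for the next step
    }
    // The caller owns the returned slice and frees it with `allocator`.
    return allocator.dupe(u32, buf[0..count]);
}
```

As with the documented function, the returned slice holds only continuation tokens, never the prompt.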