Last updated: 2026-06-12

API Server

Runtime

Backend-specific server runtime aliases and wrappers.

This module keeps the HTTP/routes layer shared across the Vulkan and Metal backends.

20 exports, 0 methods. Source: src/server/runtime.zig

constant

is_metal

pub const is_metal = gpu.is_metal

Whether the active GPU backend is Apple Metal.

src/server/runtime.zig:8

constant

is_vulkan

pub const is_vulkan = gpu.is_vulkan

Whether the active GPU backend is Vulkan.

src/server/runtime.zig:10

constant

supports_model_management

pub const supports_model_management = gpu.is_vulkan or gpu.is_metal

Whether the backend supports loading/unloading models at runtime.

src/server/runtime.zig:12

constant

supports_sampling_controls

pub const supports_sampling_controls = gpu.is_vulkan or gpu.is_metal

Whether the backend supports temperature, top-p, top-k, and repetition penalty.

src/server/runtime.zig:14

constant

supports_runtime_profiling

pub const supports_runtime_profiling = gpu.is_vulkan or gpu.is_metal

Whether the backend supports GPU kernel profiling during inference.

src/server/runtime.zig:16

constant

tokenizer_mod

pub const tokenizer_mod = @import("../model/tokenizer.zig")

Tokenizer module (shared across all backends).

src/server/runtime.zig:19

constant

forward_mod

pub const forward_mod = if (gpu.is_metal) @import("../compute/forward_metal.zig") else @import("../compute/forward.zig")

Forward-pass module, selected by the active backend.

src/server/runtime.zig:21

constant

loader_mod

pub const loader_mod = if (gpu.is_metal) @import("../model/loader_metal.zig") else @import("../model/loader.zig")

Model-loading module, selected by the active backend.

src/server/runtime.zig:23

constant

model_manager_mod

pub const model_manager_mod = if (gpu.is_metal) @import("model_manager_metal.zig") else @import("model_manager.zig")

Model-manager module, selected by the active backend.

src/server/runtime.zig:25
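
The three `*_mod` constants above rely on Zig evaluating an `if` on a compile-time-known condition at comptime, so only the chosen module is referenced by the rest of the server. The following is a minimal, self-contained sketch of that pattern, not the project's code: the `gpu` namespace, the OS-based backend choice, and the `backend_name` declaration are all assumptions made for illustration, and inline structs stand in for the real `@import` targets.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical stand-in for the project's `gpu` module. How the real
// flags are derived is not documented here; an OS check is assumed.
const gpu = struct {
    pub const is_metal = builtin.os.tag == .macos;
    pub const is_vulkan = !is_metal;
};

// Inline namespaces stand in for forward.zig and forward_metal.zig.
const forward_vulkan = struct {
    pub const InferenceEngine = struct {
        pub const backend_name = "vulkan";
    };
};
const forward_metal = struct {
    pub const InferenceEngine = struct {
        pub const backend_name = "metal";
    };
};

// The condition is comptime-known, so this resolves to exactly one namespace.
pub const forward_mod = if (gpu.is_metal) forward_metal else forward_vulkan;

// Callers use the alias and never name a backend directly.
pub const InferenceEngine = forward_mod.InferenceEngine;

pub fn main() void {
    std.debug.print("backend: {s}\n", .{InferenceEngine.backend_name});
}
```

Because the aliases have the same names on both backends, code in the HTTP/routes layer can be written once against `InferenceEngine`, `Model`, and so on.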

constant

InferenceEngine

pub const InferenceEngine = forward_mod.InferenceEngine

Backend-specific inference engine that runs the forward pass.

src/server/runtime.zig:28

constant

DecodeState

pub const DecodeState = forward_mod.DecodeState

Per-sequence decode state (KV cache position, token history, etc.).

src/server/runtime.zig:30

constant

Model

pub const Model = loader_mod.Model

Loaded model handle (weights, hyperparams, GGUF metadata).

src/server/runtime.zig:32

constant

ModelManager

pub const ModelManager = model_manager_mod.ModelManager

Manages loading, unloading, and switching between models at runtime.

src/server/runtime.zig:34

constant

SamplingParams

pub const SamplingParams = forward_mod.SamplingParams

Token sampling parameters (shared across Vulkan and Metal backends).

src/server/runtime.zig:37

function

enableLogitsReadback

pub fn enableLogitsReadback(_engine: *InferenceEngine) void

Enable logits readback from the GPU so that sampling can inspect raw logits.

On Metal, unified memory (UMA) keeps logits CPU-accessible at all times, so this is a no-op.

Parameters

_engine
Inference engine whose readback mode is updated.

src/server/runtime.zig:42

function

logitsReadbackEnabled

pub fn logitsReadbackEnabled(_engine: *const InferenceEngine) bool

Return whether logits readback is currently enabled on the engine.

Always returns `true` on Metal because UMA makes logits CPU-accessible without an explicit readback step.

Parameters

_engine
Inference engine to query.

Returns

`true` if the engine will expose logits after each decode step.

src/server/runtime.zig:54

function

setLogitsReadbackEnabled

pub fn setLogitsReadbackEnabled(_engine: *InferenceEngine, _enabled: bool) void

Set the logits-readback intent flag. This applies to backends that can elide full logit materialization.

src/server/runtime.zig:64
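
The readback wrappers described above behave differently per backend: on Vulkan they toggle a flag, while on Metal they have nothing to do. A rough sketch of how such wrappers could be written follows. It is an illustration under assumptions, not the module's actual source: the `logits_readback` field and the hard-coded `is_metal` value are hypothetical.

```zig
const std = @import("std");

// Hypothetical backend flag; the real module takes this from `gpu.is_metal`.
const is_metal = false;

// Hypothetical engine type. The field name `logits_readback` is an assumption.
const InferenceEngine = struct {
    logits_readback: bool = false,
};

pub fn enableLogitsReadback(engine: *InferenceEngine) void {
    // With unified memory there is no copy to switch on, so do nothing.
    if (is_metal) return;
    engine.logits_readback = true;
}

pub fn logitsReadbackEnabled(engine: *const InferenceEngine) bool {
    // Logits are always CPU-visible on Metal.
    if (is_metal) return true;
    return engine.logits_readback;
}

pub fn setLogitsReadbackEnabled(engine: *InferenceEngine, enabled: bool) void {
    // Only backends that track the flag need to store the intent.
    if (is_metal) return;
    engine.logits_readback = enabled;
}

pub fn main() void {
    var engine = InferenceEngine{};
    enableLogitsReadback(&engine);
    std.debug.print("readback enabled: {}\n", .{logitsReadbackEnabled(&engine)});
}
```

Since `is_metal` is known at compile time, the untaken branch costs nothing at runtime.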

function

enableProfiling

pub fn enableProfiling(_engine: *InferenceEngine) !void

Enable GPU kernel profiling on the inference engine.

Supported on both Vulkan and Metal backends; calls the backend's own `enableProfiling` method.

Parameters

_engine
Inference engine to configure for profiling.

Returns

An error if the backend's profiling setup fails (e.g. out of GPU resources).

src/server/runtime.zig:76
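
A caller would typically check `supports_runtime_profiling` before calling `enableProfiling` and propagate any setup error. The sketch below shows one way to do that. Everything inside the `runtime` namespace is a hypothetical stand-in so the example compiles on its own, including the `gpu_has_free_timers` field and the `error.OutOfGpuResources` name; `maybeEnableProfiling` is likewise an illustrative helper, not part of the module.

```zig
const std = @import("std");

// Hypothetical stand-ins for the exports of src/server/runtime.zig.
const runtime = struct {
    pub const supports_runtime_profiling = true;

    pub const InferenceEngine = struct {
        profiling: bool = false,
        gpu_has_free_timers: bool = true, // assumed resource check
    };

    pub fn enableProfiling(engine: *InferenceEngine) !void {
        if (!engine.gpu_has_free_timers) return error.OutOfGpuResources;
        engine.profiling = true;
    }
};

/// Turn profiling on only when it was requested and the backend supports it.
fn maybeEnableProfiling(engine: *runtime.InferenceEngine, requested: bool) !void {
    if (!requested) return;
    if (comptime !runtime.supports_runtime_profiling) {
        std.log.warn("profiling requested but not supported by this backend", .{});
        return;
    }
    try runtime.enableProfiling(engine);
}

pub fn main() !void {
    var engine = runtime.InferenceEngine{};
    try maybeEnableProfiling(&engine, true);
    std.debug.print("profiling: {}\n", .{engine.profiling});
}
```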

function

decodeStep

pub fn decodeStep(_engine: *InferenceEngine, _state: *DecodeState, _token_id: u32, _collect_output: bool) !void

Run a single autoregressive decode step, advancing the KV cache by one token.

Parameters

_engine
Inference engine that owns the model weights and KV cache.
_state
Per-sequence decode state tracking position and token history.
_token_id
Input token to feed into the model for this step.
_collect_output
When `true`, copy output logits to CPU (Vulkan only; ignored on Metal where logits are always accessible).

Returns

An error if the GPU submission or synchronisation fails.

src/server/runtime.zig:90

function

sample

pub fn sample(_engine: *const InferenceEngine, _state: *const DecodeState, _params: SamplingParams, _random: std.Random) u32

Sample the next token from the model's logit distribution.

Parameters

_engine
Inference engine holding the current logits.
_state
Decode state used to retrieve generated-token history for repetition penalty.
_params
Sampling configuration (temperature, top-p, top-k, repetition penalty, etc.).
_random
PRNG source for stochastic sampling.

Returns

The sampled token ID.

src/server/runtime.zig:109
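
Taken together, `decodeStep` and `sample` form the inner loop of text generation: feed a token, sample the next one, repeat. The sketch below illustrates that loop under stated assumptions. The `runtime` namespace is a stub with the same function signatures as documented above, but its struct fields and its "next integer" sampling behaviour are invented purely so the example runs without a GPU. `generate` and the end-of-sequence parameter `eos` are hypothetical helpers.

```zig
const std = @import("std");

// Stub runtime: signatures mirror the documented ones, bodies are fake.
const runtime = struct {
    pub const InferenceEngine = struct { last_token: u32 = 0 };
    pub const DecodeState = struct { position: u32 = 0 };
    pub const SamplingParams = struct { temperature: f32 = 1.0 };

    pub fn decodeStep(engine: *InferenceEngine, state: *DecodeState, token_id: u32, collect_output: bool) !void {
        _ = collect_output;
        engine.last_token = token_id;
        state.position += 1; // one KV-cache slot per step
    }

    pub fn sample(engine: *const InferenceEngine, state: *const DecodeState, params: SamplingParams, random: std.Random) u32 {
        _ = state;
        _ = params;
        _ = random;
        return engine.last_token + 1; // stub: "predict" the next integer
    }
};

/// Generate tokens into `out` until `eos` is sampled or `out` is full.
/// Returns the number of tokens written.
fn generate(
    engine: *runtime.InferenceEngine,
    state: *runtime.DecodeState,
    first_token: u32,
    eos: u32,
    out: []u32,
    random: std.Random,
) !usize {
    var token = first_token;
    var n: usize = 0;
    while (n < out.len) : (n += 1) {
        // Request logits on every step because each step is sampled from.
        try runtime.decodeStep(engine, state, token, true);
        token = runtime.sample(engine, state, .{}, random);
        if (token == eos) break;
        out[n] = token;
    }
    return n;
}

pub fn main() !void {
    var prng = std.Random.DefaultPrng.init(42);
    var engine = runtime.InferenceEngine{};
    var state = runtime.DecodeState{};
    var buf: [8]u32 = undefined;
    const n = try generate(&engine, &state, 1, 5, &buf, prng.random());
    std.debug.print("generated {d} tokens\n", .{n});
}
```

In a real server the prompt would first be run through the model (with `_collect_output` set only where logits are needed), and `SamplingParams` would come from the request rather than defaults.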