Last updated: 2026-06-12

Inference Runtime

Memory Plan


Shared runtime memory accounting helpers for Vulkan and Metal backends.

The helpers in this module turn model dimensions plus backend-specific runtime characteristics into a comparable memory budget so diagnostics, server load policy, and inference engines size context and KV consistently.

12 exports 9 methods src/gpu/memory_plan.zig


struct

RuntimeMemoryProfile

#
pub const RuntimeMemoryProfile = struct

Backend-agnostic breakdown of runtime memory as:
- fixed bytes that do not scale with context length
- bytes that scale linearly per token

src/gpu/memory_plan.zig:15
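
The fixed-plus-linear split described for `RuntimeMemoryProfile` can be illustrated with a minimal sketch. The struct name `ProfileSketch` and all four field names are hypothetical stand-ins; the real field layout in `memory_plan.zig` is not shown on this page and may differ.

```zig
const std = @import("std");

// Hypothetical stand-in for RuntimeMemoryProfile; field names are assumptions.
const ProfileSketch = struct {
    fixed_device_local_bytes: u64,
    fixed_host_visible_bytes: u64,
    per_token_device_local_bytes: u64,
    per_token_host_visible_bytes: u64,

    // Context-scaled part only: grows linearly with the token count.
    fn deviceLocalContextBytes(self: @This(), context_tokens: u32) u64 {
        return self.per_token_device_local_bytes * context_tokens;
    }

    // Fixed overhead plus the context-scaled part.
    fn runtimeDeviceLocalBytes(self: @This(), context_tokens: u32) u64 {
        return self.fixed_device_local_bytes + self.deviceLocalContextBytes(context_tokens);
    }

    // Weights are not part of the profile, so the caller passes them in.
    fn totalDeviceLocalBytes(self: @This(), weights_bytes: u64, context_tokens: u32) u64 {
        return weights_bytes + self.runtimeDeviceLocalBytes(context_tokens);
    }
};

test "linear memory model" {
    const p = ProfileSketch{
        .fixed_device_local_bytes = 100,
        .fixed_host_visible_bytes = 0,
        .per_token_device_local_bytes = 10,
        .per_token_host_visible_bytes = 0,
    };
    // 100 fixed + 10 bytes/token * 8 tokens
    try std.testing.expectEqual(@as(u64, 180), p.runtimeDeviceLocalBytes(8));
}
```

Under this model, doubling the context length doubles only the per-token term, which is what makes the budget math for the methods below invertible.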

Methods

9

method

RuntimeMemoryProfile.deviceLocalContextBytes

#
pub fn deviceLocalContextBytes(self: @This(), context_tokens: u32) u64

Return device-local bytes consumed by context-scaled runtime state.

src/gpu/memory_plan.zig:28

method

RuntimeMemoryProfile.hostVisibleContextBytes

#
pub fn hostVisibleContextBytes(self: @This(), context_tokens: u32) u64

Return host-visible bytes consumed by context-scaled runtime state.

src/gpu/memory_plan.zig:33

method

RuntimeMemoryProfile.runtimeDeviceLocalBytes

#
pub fn runtimeDeviceLocalBytes(self: @This(), context_tokens: u32) u64

Return total device-local runtime bytes for the requested context length.

src/gpu/memory_plan.zig:38

method

RuntimeMemoryProfile.runtimeHostVisibleBytes

#
pub fn runtimeHostVisibleBytes(self: @This(), context_tokens: u32) u64

Return total host-visible runtime bytes for the requested context length.

src/gpu/memory_plan.zig:43

method

RuntimeMemoryProfile.runtimeUnifiedBytes

#
pub fn runtimeUnifiedBytes(self: @This(), context_tokens: u32) u64

Return total unified-memory runtime bytes for the requested context length.

src/gpu/memory_plan.zig:48

method

RuntimeMemoryProfile.totalDeviceLocalBytes

#
pub fn totalDeviceLocalBytes(self: @This(), weights_bytes: u64, context_tokens: u32) u64

Return weights plus device-local runtime bytes for the requested context length.

src/gpu/memory_plan.zig:53

method

RuntimeMemoryProfile.totalUnifiedBytes

#
pub fn totalUnifiedBytes(self: @This(), weights_bytes: u64, context_tokens: u32) u64

Return weights plus unified runtime bytes for the requested context length.

src/gpu/memory_plan.zig:58

method

RuntimeMemoryProfile.maxContextTokensForDeviceLocalBudget

#
pub fn maxContextTokensForDeviceLocalBudget( self: @This(), weights_bytes: u64, budget_bytes: u64, ceiling: u32, ) u32

Return the largest context that fits within a device-local memory budget.

Parameters
weights_bytes
Size of model weights already placed on the device.
budget_bytes
Total device-local memory budget available.
ceiling
Architectural context-length ceiling (e.g. from GGUF).
Returns

Token count clamped to both the budget and `ceiling`.

src/gpu/memory_plan.zig:68

method

RuntimeMemoryProfile.maxContextTokensForUnifiedBudget

#
pub fn maxContextTokensForUnifiedBudget( self: @This(), weights_bytes: u64, budget_bytes: u64, ceiling: u32, ) u32

Return the largest context that fits within a unified-memory budget.

Combines device-local and host-visible costs (both fixed and per-token) when computing available room, suitable for backends with a single unified address space such as Metal.

Parameters
weights_bytes
Size of model weights counted against the budget.
budget_bytes
Total unified-memory budget available.
ceiling
Architectural context-length ceiling.
Returns

Token count clamped to both the budget and `ceiling`.

src/gpu/memory_plan.zig:92
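
One plausible way to invert the linear cost model for the two `maxContextTokensFor…Budget` methods is sketched below. It assumes Zig 0.11 or newer (single-argument `@intCast`). The function name and the flattened `fixed_bytes` / `per_token_bytes` parameters are hypothetical; for a unified budget they would stand for the combined device-local and host-visible costs. How the real methods treat a zero per-token cost or an over-committed budget is not documented here, so those branches are assumptions.

```zig
const std = @import("std");

// Hypothetical helper: largest token count whose linear cost fits the budget.
fn maxContextTokensSketch(
    weights_bytes: u64,
    fixed_bytes: u64,
    per_token_bytes: u64,
    budget_bytes: u64,
    ceiling: u32,
) u32 {
    // Saturating subtraction: an over-committed budget leaves zero room.
    const room = budget_bytes -| weights_bytes -| fixed_bytes;
    // Assumption: with no per-token cost, only the ceiling applies.
    if (per_token_bytes == 0) return ceiling;
    const fit = room / per_token_bytes;
    // The result is at most `ceiling`, so it always fits in u32.
    return @intCast(@min(fit, @as(u64, ceiling)));
}

test "budget-limited context" {
    // (2000 - 1000 - 200) / 10 = 80 tokens, below the ceiling of 100.
    try std.testing.expectEqual(@as(u32, 80), maxContextTokensSketch(1000, 200, 10, 2000, 100));
}
```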

function

effectiveContextCeiling

#
pub fn effectiveContextCeiling(config: ModelConfig, requested_context_length: ?u32) u32

Clamp the requested context length against the model's declared context limit.

Parameters

config
Model configuration supplying `context_length` as the hard ceiling.
requested_context_length
Optional caller-supplied cap; `null` means use the model ceiling.

Returns

The smaller of `config.context_length` and the requested cap, or `config.context_length` when no cap is supplied.

src/gpu/memory_plan.zig:113
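
The clamp in `effectiveContextCeiling` can be sketched with Zig's `orelse`. `ConfigSketch` is a hypothetical one-field stand-in for `ModelConfig`, which has more fields than shown here.

```zig
const std = @import("std");

// Hypothetical minimal config; the real ModelConfig carries many more fields.
const ConfigSketch = struct {
    context_length: u32,
};

fn effectiveContextCeilingSketch(config: ConfigSketch, requested: ?u32) u32 {
    // `null` falls back to the model ceiling, making the @min a no-op.
    const cap = requested orelse config.context_length;
    return @min(config.context_length, cap);
}

test "requested cap lowers but never raises the ceiling" {
    const cfg = ConfigSketch{ .context_length = 8192 };
    try std.testing.expectEqual(@as(u32, 4096), effectiveContextCeilingSketch(cfg, 4096));
    try std.testing.expectEqual(@as(u32, 8192), effectiveContextCeilingSketch(cfg, 32768));
}
```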

function

applyRequestedContextLimit

#
pub fn applyRequestedContextLimit(config: *ModelConfig, requested_context_length: ?u32) void

Apply the requested context cap directly to a mutable model config.

Mutates `config.context_length` in place so that downstream code reading the config sees the clamped value without needing to carry a separate cap.

Parameters

config
Config to mutate; `context_length` is lowered if necessary.
requested_context_length
Optional cap; ignored when larger than the current ceiling.

src/gpu/memory_plan.zig:123

function

requestedContextTokens

#
pub fn requestedContextTokens(config: ModelConfig, requested_context_length: ?u32, backend_cap: u32) u32

Return the runtime context target after applying both model and backend caps.

Applies the model ceiling first (`effectiveContextCeiling`), then further clamps to the backend's hardware-derived limit.

Parameters

config
Model configuration providing the architectural ceiling.
requested_context_length
Optional user-supplied context length cap.
backend_cap
Hardware or driver limit reported by the backend.

Returns

Final context length clamped to all three bounds: the model ceiling, the requested cap, and the backend cap.

src/gpu/memory_plan.zig:135
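
As an illustration of the ordering described for `requestedContextTokens`, the sketch below applies the model-side clamp first and the backend cap second. It takes the model ceiling as a plain `u32` instead of a `ModelConfig`, which is a simplification made for this example.

```zig
const std = @import("std");

// Hypothetical simplification: the model ceiling is passed directly.
fn requestedContextTokensSketch(model_ceiling: u32, requested: ?u32, backend_cap: u32) u32 {
    // Step 1: model ceiling combined with the optional caller cap.
    const model_bound = @min(model_ceiling, requested orelse model_ceiling);
    // Step 2: the backend's hardware-derived limit.
    return @min(model_bound, backend_cap);
}

test "backend cap is the tightest bound" {
    // Model allows 32768, caller asks for 16384, backend supports 8192.
    try std.testing.expectEqual(@as(u32, 8192), requestedContextTokensSketch(32768, 16384, 8192));
}
```

Because `@min` is associative, the order of the two steps does not change the result; it only mirrors the description above.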

function

remainingContextTokens

#
pub fn remainingContextTokens(used_context_tokens: u32, context_capacity_tokens: u32) u32

Return how many context slots remain available for a request.

Uses saturating subtraction so the result is 0 rather than wrapping when `used_context_tokens` exceeds the capacity.

Parameters

used_context_tokens
Tokens already committed in the current context window.
context_capacity_tokens
Total context window size in tokens.

Returns

Remaining free slots, clamped to 0 when usage exceeds capacity.

src/gpu/memory_plan.zig:146
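
Zig has a built-in saturating subtraction operator, `-|`, which matches the behavior described for `remainingContextTokens`. The following is a sketch of that idea, not the module's actual body.

```zig
const std = @import("std");

fn remainingSketch(used_context_tokens: u32, context_capacity_tokens: u32) u32 {
    // `-|` clamps at 0; plain `-` would be an integer overflow
    // whenever usage is larger than capacity.
    return context_capacity_tokens -| used_context_tokens;
}

test "saturates at zero" {
    try std.testing.expectEqual(@as(u32, 90), remainingSketch(10, 100));
    try std.testing.expectEqual(@as(u32, 0), remainingSketch(120, 100));
}
```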

function

clampedCompletionTokens

#
pub fn clampedCompletionTokens( used_context_tokens: u32, requested_completion_tokens: u32, context_capacity_tokens: u32, ) u32

Clamp requested completion tokens against the remaining context budget.

Parameters

used_context_tokens
Tokens already in the context window.
requested_completion_tokens
Caller-requested number of new tokens to generate.
context_capacity_tokens
Total context capacity.

Returns

Actual completion quota, never exceeding the remaining room.

src/gpu/memory_plan.zig:156

function

requestContextTarget

#
pub fn requestContextTarget( used_context_tokens: u32, requested_completion_tokens: u32, context_capacity_tokens: u32, ) u32

Return the total context target needed for prompt plus completion work.

Adds the clamped completion quota to the tokens already used, then caps the result at `context_capacity_tokens` to prevent over-allocation.

Parameters

used_context_tokens
Tokens already occupying the context window.
requested_completion_tokens
Tokens the caller wants to generate.
context_capacity_tokens
Hard upper bound on the context window size.

Returns

Total tokens to allocate, bounded by capacity.

src/gpu/memory_plan.zig:172
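
The relationship between `clampedCompletionTokens` and `requestContextTarget` can be sketched as two small functions. The `…Sketch` names are hypothetical, and the use of a saturating add for the intermediate sum is an assumption made here to keep the example overflow-free.

```zig
const std = @import("std");

fn clampedCompletionSketch(used: u32, requested: u32, capacity: u32) u32 {
    // Never grant more than the room left in the window.
    return @min(requested, capacity -| used);
}

fn requestContextTargetSketch(used: u32, requested: u32, capacity: u32) u32 {
    const completion = clampedCompletionSketch(used, requested, capacity);
    // Saturating add guards the sum; the final @min caps it at capacity.
    return @min(used +| completion, capacity);
}

test "completion is trimmed to fit the window" {
    // 3000 tokens used, 2000 requested, 4096 capacity: only 1096 fit.
    try std.testing.expectEqual(@as(u32, 1096), clampedCompletionSketch(3000, 2000, 4096));
    try std.testing.expectEqual(@as(u32, 4096), requestContextTargetSketch(3000, 2000, 4096));
}
```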

struct

RequestBudget

#
pub const RequestBudget = struct

Completion-token budget and resulting context target for one request.

src/gpu/memory_plan.zig:184

function

requestBudget

#
pub fn requestBudget( used_context_tokens: u32, requested_completion_tokens: u32, context_capacity_tokens: u32, ) RequestBudget

Compute the clamped completion budget and resulting context target for one request.

Combines `clampedCompletionTokens` and `requestContextTarget` into a single call so callers get both values in one pass.

Parameters

used_context_tokens
Tokens already committed in the context window.
requested_completion_tokens
Tokens the caller wants to generate.
context_capacity_tokens
Total context capacity.

Returns

`RequestBudget` with the actual completion quota and the total context window size to allocate for this request.

src/gpu/memory_plan.zig:198
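
A combined result like `RequestBudget` might look like the sketch below. The field names `completion_tokens` and `context_target` are hypothetical, since the struct's fields are not listed on this page.

```zig
const std = @import("std");

// Hypothetical field names; the real RequestBudget fields are not documented here.
const RequestBudgetSketch = struct {
    completion_tokens: u32,
    context_target: u32,
};

fn requestBudgetSketch(used: u32, requested: u32, capacity: u32) RequestBudgetSketch {
    // Completion quota: the request trimmed to the remaining room.
    const completion = @min(requested, capacity -| used);
    return .{
        .completion_tokens = completion,
        // Context target: prompt plus quota, capped at capacity.
        .context_target = @min(used +| completion, capacity),
    };
}

test "both values from one call" {
    const b = requestBudgetSketch(1000, 500, 4096);
    try std.testing.expectEqual(@as(u32, 500), b.completion_tokens);
    try std.testing.expectEqual(@as(u32, 1500), b.context_target);
}
```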

function

profile

#
pub fn profile(config: ModelConfig) RuntimeMemoryProfile

Build the backend-agnostic runtime memory profile for a normalized model config.

Computes all fixed and per-token memory costs from model dimensions, including attention buffers, FFN/MoE scratch buffers, SSM convolution and state buffers, and KV-cache scaling. The returned profile does not include model weights.

Parameters

config
Model configuration with dimensions, expert counts, and SSM parameters.

Returns

`RuntimeMemoryProfile` capturing fixed overhead and per-token KV cost.

src/gpu/memory_plan.zig:225
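
To make the per-token KV term concrete, the sketch below uses the standard attention KV-cache size formula: one key and one value vector per layer and KV head. The struct and its field names are hypothetical, and the real `profile` function also accounts for costs this example ignores (FFN/MoE scratch, SSM state, and so on).

```zig
const std = @import("std");

// Hypothetical subset of model dimensions relevant to the KV cache.
const KvDimsSketch = struct {
    n_layers: u32,
    n_kv_heads: u32,
    head_dim: u32,
    bytes_per_element: u32,
};

fn kvBytesPerTokenSketch(dims: KvDimsSketch) u64 {
    // Factor 2: one key vector and one value vector per layer and KV head.
    const elements = 2 * @as(u64, dims.n_layers) * dims.n_kv_heads * dims.head_dim;
    return elements * dims.bytes_per_element;
}

test "per-token KV cost for illustrative dimensions" {
    const dims = KvDimsSketch{
        .n_layers = 32,
        .n_kv_heads = 8,
        .head_dim = 128,
        .bytes_per_element = 2, // e.g. a 16-bit cache element
    };
    // 2 * 32 * 8 * 128 * 2 = 131072 bytes (128 KiB) per token
    try std.testing.expectEqual(@as(u64, 131072), kvBytesPerTokenSketch(dims));
}
```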

constant

auto_context_floor_tokens

#
pub const auto_context_floor_tokens: u32 = 4096

vLLM-style floor for auto-sized context: never fall below this even if the memory math suggests less, because the server would otherwise become unusable.

src/gpu/memory_plan.zig:298

function

autoContextTokensForDeviceBudget

#
pub fn autoContextTokensForDeviceBudget( profile_value: RuntimeMemoryProfile, weights_bytes: u64, device_budget_bytes: u64, architectural_ceiling: u32, ) u32

Derive a context length from an available device-memory budget.

Analogous to vLLM's `determine_available_memory` → `get_num_blocks` flow, adapted for a contiguous (non-paged) KV cache:

1. Reserve a utilization fraction of the device budget.
2. Subtract model weights and fixed runtime overhead.
3. Divide the leftover by the per-token KV cost.
4. Clamp to the architectural ceiling from GGUF and floor at 4096 so tiny or very-tight budgets still produce a usable server.

src/gpu/memory_plan.zig:324
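
The four steps described for `autoContextTokensForDeviceBudget` can be sketched as follows, assuming Zig 0.11 or newer. The 90% utilization fraction is a placeholder chosen for this example (the actual fraction is not stated on this page), and the profile is flattened into hypothetical `fixed_bytes` and `per_token_bytes` parameters. The sketch applies the floor last, so a ceiling below 4096 would be overridden; whether the real function behaves that way is not documented here.

```zig
const std = @import("std");

const auto_context_floor_tokens: u32 = 4096;
// Hypothetical utilization fraction; the real value is an unknown here.
const utilization_percent: u64 = 90;

fn autoContextSketch(
    fixed_bytes: u64,
    per_token_bytes: u64,
    weights_bytes: u64,
    device_budget_bytes: u64,
    architectural_ceiling: u32,
) u32 {
    // 1. Reserve a utilization fraction of the device budget.
    const usable = device_budget_bytes / 100 * utilization_percent;
    // 2. Subtract weights and fixed runtime overhead (saturating at 0).
    const room = usable -| weights_bytes -| fixed_bytes;
    // 3. Divide the leftover by the per-token KV cost.
    const fit: u64 = if (per_token_bytes == 0) architectural_ceiling else room / per_token_bytes;
    // 4. Clamp to the ceiling, then apply the floor.
    const clamped: u32 = @intCast(@min(fit, @as(u64, architectural_ceiling)));
    return @max(clamped, auto_context_floor_tokens);
}

test "tight budget falls back to the floor" {
    // 90% of 5 GB is fully consumed by weights plus fixed overhead.
    try std.testing.expectEqual(
        @as(u32, 4096),
        autoContextSketch(500_000_000, 131072, 4_000_000_000, 5_000_000_000, 32768),
    );
}
```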