Last updated: 2026-06-12
Inference Runtime
Memory Plan
Shared runtime memory accounting helpers for Vulkan and Metal backends.
The helpers in this module turn model dimensions plus backend-specific runtime characteristics into a comparable memory budget so diagnostics, server load policy, and inference engines size context and KV consistently.
struct
RuntimeMemoryProfile
pub const RuntimeMemoryProfile = struct Backend-agnostic breakdown of runtime memory as:
- fixed bytes that do not scale with context length
- bytes that scale linearly per token
Methods (9)
method
RuntimeMemoryProfile.deviceLocalContextBytes
pub fn deviceLocalContextBytes(self: @This(), context_tokens: u32) u64 Return device-local bytes consumed by context-scaled runtime state.
method
RuntimeMemoryProfile.hostVisibleContextBytes
pub fn hostVisibleContextBytes(self: @This(), context_tokens: u32) u64 Return host-visible bytes consumed by context-scaled runtime state.
method
RuntimeMemoryProfile.runtimeDeviceLocalBytes
pub fn runtimeDeviceLocalBytes(self: @This(), context_tokens: u32) u64 Return total device-local runtime bytes for the requested context length.
method
RuntimeMemoryProfile.runtimeHostVisibleBytes
pub fn runtimeHostVisibleBytes(self: @This(), context_tokens: u32) u64 Return total host-visible runtime bytes for the requested context length.
method
RuntimeMemoryProfile.runtimeUnifiedBytes
pub fn runtimeUnifiedBytes(self: @This(), context_tokens: u32) u64 Return total unified-memory runtime bytes for the requested context length.
method
RuntimeMemoryProfile.totalDeviceLocalBytes
pub fn totalDeviceLocalBytes(self: @This(), weights_bytes: u64, context_tokens: u32) u64 Return weights plus device-local runtime bytes for the requested context length.
method
RuntimeMemoryProfile.totalUnifiedBytes
pub fn totalUnifiedBytes(self: @This(), weights_bytes: u64, context_tokens: u32) u64 Return weights plus unified runtime bytes for the requested context length.
method
RuntimeMemoryProfile.maxContextTokensForDeviceLocalBudget
pub fn maxContextTokensForDeviceLocalBudget(self: @This(), weights_bytes: u64, budget_bytes: u64, ceiling: u32) u32 Return the largest context that fits within a device-local memory budget.
method
RuntimeMemoryProfile.maxContextTokensForUnifiedBudget
pub fn maxContextTokensForUnifiedBudget(self: @This(), weights_bytes: u64, budget_bytes: u64, ceiling: u32) u32 Return the largest context that fits within a unified-memory budget.
Combines device-local and host-visible costs (both fixed and per-token) when computing available room, suitable for backends with a single unified address space such as Metal.
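The fixed-plus-per-token model and the unified-budget inversion described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `ProfileSketch` and its four field names are hypothetical (the real fields of `RuntimeMemoryProfile` are not listed on this page), and returning 0 when weights alone exceed the budget is an assumption.

```zig
const std = @import("std");

/// Hypothetical stand-in for RuntimeMemoryProfile; the field names are assumptions.
const ProfileSketch = struct {
    fixed_device_local_bytes: u64,
    fixed_host_visible_bytes: u64,
    device_local_bytes_per_token: u64,
    host_visible_bytes_per_token: u64,

    /// Fixed plus context-scaled bytes, summed across both memory classes.
    fn runtimeUnifiedBytes(self: @This(), context_tokens: u32) u64 {
        const fixed = self.fixed_device_local_bytes +| self.fixed_host_visible_bytes;
        const per_token = self.device_local_bytes_per_token +| self.host_visible_bytes_per_token;
        return fixed +| (per_token *| @as(u64, context_tokens));
    }

    /// Largest context that fits after weights and fixed costs, capped at `ceiling`.
    fn maxContextTokensForUnifiedBudget(
        self: @This(),
        weights_bytes: u64,
        budget_bytes: u64,
        ceiling: u32,
    ) u32 {
        const fixed = self.fixed_device_local_bytes +| self.fixed_host_visible_bytes;
        const per_token = self.device_local_bytes_per_token +| self.host_visible_bytes_per_token;
        // Saturating subtraction: an over-committed budget leaves 0 bytes of room.
        const room = budget_bytes -| weights_bytes -| fixed;
        // Assumption: if nothing scales with context, only the ceiling limits it.
        if (per_token == 0) return ceiling;
        const fit: u64 = room / per_token;
        const capped: u64 = @min(fit, @as(u64, ceiling));
        return @intCast(capped);
    }
};
```

With 150 fixed bytes, 15 bytes per token, 1000 bytes of weights, and a 2650-byte budget, the sketch leaves 1500 bytes of room, which fits 100 tokens.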
function
effectiveContextCeiling
pub fn effectiveContextCeiling(config: ModelConfig, requested_context_length: ?u32) u32 Clamp the requested context length against the model's declared context limit.
function
applyRequestedContextLimit
pub fn applyRequestedContextLimit(config: *ModelConfig, requested_context_length: ?u32) void Apply the requested context cap directly to a mutable model config.
Mutates `config.context_length` in place so that downstream code reading the config sees the clamped value without needing to carry a separate cap.
function
requestedContextTokens
pub fn requestedContextTokens(config: ModelConfig, requested_context_length: ?u32, backend_cap: u32) u32 Return the runtime context target after applying both model and backend caps.
Applies the model ceiling first (`effectiveContextCeiling`), then further clamps to the backend's hardware-derived limit.
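The clamping chain formed by `effectiveContextCeiling`, `applyRequestedContextLimit`, and `requestedContextTokens` might look like the sketch below. The `ModelConfig` here is a hypothetical one-field stand-in (only `context_length` is named on this page), and treating a null request as "use the model's declared limit" is an assumption.

```zig
const std = @import("std");

/// Hypothetical minimal ModelConfig; the real struct has more fields.
const ModelConfig = struct {
    context_length: u32,
};

fn effectiveContextCeiling(config: ModelConfig, requested_context_length: ?u32) u32 {
    // Assumption: no request means the model's declared limit applies unchanged.
    const requested = requested_context_length orelse return config.context_length;
    return @min(requested, config.context_length);
}

fn applyRequestedContextLimit(config: *ModelConfig, requested_context_length: ?u32) void {
    // Write the clamped value back so later readers of the config see the cap.
    config.context_length = effectiveContextCeiling(config.*, requested_context_length);
}

fn requestedContextTokens(
    config: ModelConfig,
    requested_context_length: ?u32,
    backend_cap: u32,
) u32 {
    // Model ceiling first, then the backend's hardware-derived limit.
    return @min(effectiveContextCeiling(config, requested_context_length), backend_cap);
}
```

In this sketch a request larger than the model limit is silently reduced rather than rejected; whether the real helpers report such a reduction is not stated here.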
function
remainingContextTokens
pub fn remainingContextTokens(used_context_tokens: u32, context_capacity_tokens: u32) u32 Return how many context slots remain available for a request.
Uses saturating subtraction so the result is 0 rather than wrapping when `used_context_tokens` exceeds the capacity.
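Zig spells saturating subtraction as the `-|` operator, so a one-line sketch of the behavior described above (not necessarily the module's exact body) could be:

```zig
const std = @import("std");

fn remainingContextTokens(used_context_tokens: u32, context_capacity_tokens: u32) u32 {
    // `-|` clamps at 0 instead of wrapping or tripping an overflow safety check.
    return context_capacity_tokens -| used_context_tokens;
}
```

A plain `-` on `u32` would be checked illegal behavior in safe build modes whenever usage exceeds capacity, which is why the saturating form suits budget arithmetic.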
function
clampedCompletionTokens
pub fn clampedCompletionTokens(used_context_tokens: u32, requested_completion_tokens: u32, context_capacity_tokens: u32) u32 Clamp requested completion tokens against the remaining context budget.
function
requestContextTarget
pub fn requestContextTarget(used_context_tokens: u32, requested_completion_tokens: u32, context_capacity_tokens: u32) u32 Return the total context target needed for prompt plus completion work.
Adds the clamped completion quota to the tokens already used, then caps the result at `context_capacity_tokens` to prevent over-allocation.
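The arithmetic behind `clampedCompletionTokens` and `requestContextTarget` can be sketched as below. The bodies are inferred from the descriptions on this page, so treat them as an illustration; in particular, the use of saturating addition is an assumption.

```zig
const std = @import("std");

fn clampedCompletionTokens(
    used_context_tokens: u32,
    requested_completion_tokens: u32,
    context_capacity_tokens: u32,
) u32 {
    // Completion can never exceed the slots still free in the context window.
    const remaining = context_capacity_tokens -| used_context_tokens;
    return @min(requested_completion_tokens, remaining);
}

fn requestContextTarget(
    used_context_tokens: u32,
    requested_completion_tokens: u32,
    context_capacity_tokens: u32,
) u32 {
    const completion = clampedCompletionTokens(
        used_context_tokens,
        requested_completion_tokens,
        context_capacity_tokens,
    );
    // Cap at capacity so an oversized prompt cannot drive over-allocation.
    return @min(used_context_tokens +| completion, context_capacity_tokens);
}
```

For example, with 4000 tokens used, 512 requested, and a 4096-token capacity, the completion quota is 96 and the context target is 4096.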
struct
RequestBudget
pub const RequestBudget = struct Completion-token budget and resulting context target for one request.
function
requestBudget
pub fn requestBudget(used_context_tokens: u32, requested_completion_tokens: u32, context_capacity_tokens: u32) RequestBudget Compute the clamped completion budget and resulting context target for one request.
Combines `clampedCompletionTokens` and `requestContextTarget` into a single call so callers get both values in one pass: the clamped completion budget and the total context window size to allocate for this request.
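A combined result of this kind might be shaped like the sketch below. The struct and function are renamed with a `Sketch` suffix because the field names `completion_tokens` and `context_target` are hypothetical; the actual fields of `RequestBudget` are not shown on this page.

```zig
const std = @import("std");

/// Hypothetical shape for RequestBudget; field names are assumptions.
const RequestBudgetSketch = struct {
    completion_tokens: u32,
    context_target: u32,
};

fn requestBudgetSketch(
    used_context_tokens: u32,
    requested_completion_tokens: u32,
    context_capacity_tokens: u32,
) RequestBudgetSketch {
    // Clamp the completion quota once and reuse it for the context target.
    const remaining = context_capacity_tokens -| used_context_tokens;
    const completion = @min(requested_completion_tokens, remaining);
    return .{
        .completion_tokens = completion,
        .context_target = @min(used_context_tokens +| completion, context_capacity_tokens),
    };
}
```

Returning both values from one call keeps them consistent: a caller cannot clamp the completion quota against one capacity and size the context against another.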
function
profile
pub fn profile(config: ModelConfig) RuntimeMemoryProfile Build the backend-agnostic runtime memory profile for a normalized model config.
Computes all fixed and per-token memory costs from model dimensions, including attention buffers, FFN/MoE scratch buffers, SSM convolution and state buffers, and KV-cache scaling. The returned profile does not include model weights.
constant
auto_context_floor_tokens
pub const auto_context_floor_tokens: u32 = 4096 vLLM-style floor for auto-sized context: never fall below this value, even if the memory math suggests less, because the server would otherwise become unusable.
function
autoContextTokensForDeviceBudget
pub fn autoContextTokensForDeviceBudget(profile_value: RuntimeMemoryProfile, weights_bytes: u64, device_budget_bytes: u64, architectural_ceiling: u32) u32 Derive a context length from an available device-memory budget.
Analogous to vLLM's `determine_available_memory` → `get_num_blocks` flow, adapted for a contiguous (non-paged) KV cache:
1. Reserve a utilization fraction of the device budget.
2. Subtract model weights and fixed runtime overhead.
3. Divide the leftover by the per-token KV cost.
4. Clamp to the architectural ceiling from GGUF and floor at 4096 so tiny or very-tight budgets still produce a usable server.
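The four steps above can be sketched as follows. Several details are assumptions made for illustration: the 90% utilization fraction (the real value is not stated here), the flat `fixed_bytes` and `per_token_bytes` parameters standing in for `RuntimeMemoryProfile`, and applying the floor after the ceiling clamp.

```zig
const std = @import("std");

const auto_context_floor_tokens: u32 = 4096;

/// Hypothetical flattened variant of autoContextTokensForDeviceBudget.
fn autoContextSketch(
    fixed_bytes: u64,
    per_token_bytes: u64,
    weights_bytes: u64,
    device_budget_bytes: u64,
    architectural_ceiling: u32,
) u32 {
    // Step 1: reserve a utilization fraction (assumed 90% for this sketch).
    const usable = device_budget_bytes / 10 * 9;
    // Step 2: subtract weights and fixed overhead, saturating at 0.
    const room = usable -| weights_bytes -| fixed_bytes;
    // Step 3: divide the leftover by the per-token cost.
    const fit: u64 = if (per_token_bytes == 0)
        @as(u64, architectural_ceiling)
    else
        room / per_token_bytes;
    // Step 4: clamp to the ceiling, then apply the floor.
    const capped: u64 = @min(fit, @as(u64, architectural_ceiling));
    const floored: u64 = @max(capped, @as(u64, auto_context_floor_tokens));
    return @intCast(floored);
}
```

Note that applying the floor last means the result can exceed the memory-derived fit, and in this sketch even an architectural ceiling below 4096; whether the real function orders the two clamps this way is not specified on this page.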