Last updated: 2026-06-12
Model Format & Loading
Loader
Build runtime model state from GGUF metadata and GPU-resident tensor buffers.
This module translates an on-disk GGUF file into the normalized model configuration and uploaded tensors consumed by the inference runtime.
constant
Architecture
pub const Architecture = config_mod.Architecture
Supported model architectures (re-exported from config.zig).
constant
ModelConfig
pub const ModelConfig = config_mod.ModelConfig
Normalized model dimensions and hyperparameters (re-exported from config.zig).
struct
ModelInspection
pub const ModelInspection = struct
Summary returned by `inspectModel`: config plus file and tensor size statistics.
struct
LoadedTensor
pub const LoadedTensor = struct
A tensor descriptor paired with the GPU buffer that stores its contents.
function
isMoEExpertTensor
pub fn isMoEExpertTensor(name: []const u8) bool
Return true if this GGUF tensor name designates a fused MoE expert weight tensor.
Matches the four tensor-name suffixes used in GGUF files for sparse-MoE architectures: `ffn_gate_exps.weight`, `ffn_up_exps.weight`, `ffn_down_exps.weight`, and `ffn_down_exps_scale.weight` (Q4_K_M variants only). Dense tensors and non-expert MoE tensors (router gate, attention, embeddings, etc.) are not matched.
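The suffix check described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: `isExpertTensorSketch` and `expert_suffixes` are hypothetical names, and the `blk.N.` tensor-name prefixes in the test are illustrative examples assumed for this sketch.

```zig
const std = @import("std");

// The four suffixes listed in the documentation above.
const expert_suffixes = [_][]const u8{
    "ffn_gate_exps.weight",
    "ffn_up_exps.weight",
    "ffn_down_exps.weight",
    "ffn_down_exps_scale.weight",
};

// Hypothetical re-implementation; the real `isMoEExpertTensor` may differ.
fn isExpertTensorSketch(name: []const u8) bool {
    for (expert_suffixes) |suffix| {
        if (std.mem.endsWith(u8, name, suffix)) return true;
    }
    return false;
}

test "only expert weight suffixes match" {
    // Illustrative tensor names, assumed for this example.
    try std.testing.expect(isExpertTensorSketch("blk.3.ffn_gate_exps.weight"));
    try std.testing.expect(!isExpertTensorSketch("blk.3.ffn_gate_inp.weight"));
    try std.testing.expect(!isExpertTensorSketch("blk.3.attn_q.weight"));
}
```

Matching on the suffix rather than the full name keeps the check independent of the layer index embedded in each tensor name.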
function
computeOffloadDecision
pub fn computeOffloadDecision(
    override: OffloadOverride,
    total_tensor_bytes: u64,
    offloadable_tensor_bytes: u64,
    vram_budget_bytes: u64,
) bool
Compute whether MoE expert tensors should be offloaded to host RAM, without mutating any global state.
Suitable for unit tests and for use by the side-effecting `decideOffloadForLoad` wrapper.
function
decideOffloadForLoad
pub fn decideOffloadForLoad(total_tensor_bytes: u64, offloadable_tensor_bytes: u64, vram_budget_bytes: u64) bool
Decide whether to offload MoE expert tensors for the next model load and cache the decision in `offload_state`.
Honors `ZINC_OFFLOAD_MOE_EXPERTS` if set; otherwise auto-decides:
- If the full model fits in VRAM (with headroom for KV/runtime): no offload.
- If the model only fits with MoE experts in host RAM: enable offload.
- If neither fits: don't enable (let allocation fail with a clear OOM instead of pretending to fit).
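The three auto-decision rules above might look roughly like the pure function below. This is a sketch under stated assumptions: the shape of `OverrideSketch` (the real `OffloadOverride` is not shown on this page), the 1 GiB `headroom_bytes` value, and the name `decideSketch` are all hypothetical. Reading `ZINC_OFFLOAD_MOE_EXPERTS` and caching the result are left out.

```zig
const std = @import("std");

// Assumed shape of the override; the real `OffloadOverride` is not documented here.
const OverrideSketch = enum { auto, force_on, force_off };

// Assumed headroom for KV cache and runtime buffers (illustrative value only).
const headroom_bytes: u64 = 1 << 30;

// Hypothetical pure decision function: no global state is read or written.
fn decideSketch(
    override: OverrideSketch,
    total_tensor_bytes: u64,
    offloadable_tensor_bytes: u64,
    vram_budget_bytes: u64,
) bool {
    switch (override) {
        .force_on => return true,
        .force_off => return false,
        .auto => {},
    }
    // Saturating arithmetic avoids overflow and underflow on extreme inputs.
    const full_need = total_tensor_bytes +| headroom_bytes;
    if (full_need <= vram_budget_bytes) return false; // the full model fits: no offload

    const resident_bytes = total_tensor_bytes -| offloadable_tensor_bytes;
    const offload_need = resident_bytes +| headroom_bytes;
    // True only if the model fits once experts live in host RAM;
    // false if neither layout fits, so allocation can fail with a clear OOM.
    return offload_need <= vram_budget_bytes;
}

test "auto decision follows the three rules" {
    const gib: u64 = 1 << 30;
    try std.testing.expect(!decideSketch(.auto, 8 * gib, 6 * gib, 16 * gib));
    try std.testing.expect(decideSketch(.auto, 20 * gib, 16 * gib, 8 * gib));
    try std.testing.expect(!decideSketch(.auto, 40 * gib, 10 * gib, 8 * gib));
}
```

Keeping the decision free of side effects is what makes it easy to unit-test; a wrapper can then resolve the override and store the result.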
function
offloadEnabled
pub fn offloadEnabled() bool
Return whether MoE expert tensors should be in host-visible memory for the currently-loaded model.
Reads the cached decision from `decideOffloadForLoad`. Returns false until a load has happened.
function
shouldOffloadToHost
pub fn shouldOffloadToHost(name: []const u8) bool
Return true if this tensor should be allocated in host-visible memory rather than device-local VRAM.
Returns true only when MoE expert offload is enabled (see `offloadEnabled`) and the tensor name matches an expert weight suffix.
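As an illustration of how a loader could use this predicate when choosing where to allocate each tensor, the sketch below maps a tensor name and the offload flag to a placement. The `Placement` enum and `placementFor` are hypothetical names introduced for this example; the flag is passed as a parameter here, whereas the documented function reads the cached decision.

```zig
const std = @import("std");

// Hypothetical placement choice for a tensor buffer.
const Placement = enum { device_local, host_visible };

// Expert weight suffixes as listed for `isMoEExpertTensor`.
const expert_suffixes = [_][]const u8{
    "ffn_gate_exps.weight",
    "ffn_up_exps.weight",
    "ffn_down_exps.weight",
    "ffn_down_exps_scale.weight",
};

// Host-visible only when offload is enabled AND the name is an expert weight.
fn placementFor(name: []const u8, offload_enabled: bool) Placement {
    if (!offload_enabled) return .device_local;
    for (expert_suffixes) |suffix| {
        if (std.mem.endsWith(u8, name, suffix)) return .host_visible;
    }
    return .device_local;
}

test "placement requires both conditions" {
    // Illustrative tensor names, assumed for this example.
    try std.testing.expectEqual(Placement.host_visible, placementFor("blk.0.ffn_up_exps.weight", true));
    try std.testing.expectEqual(Placement.device_local, placementFor("blk.0.ffn_up_exps.weight", false));
    try std.testing.expectEqual(Placement.device_local, placementFor("blk.0.attn_k.weight", true));
}
```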
struct
Model
pub const Model = struct
Runtime model state backed by a memory-mapped GGUF file and uploaded tensor buffers.
Methods
1 method
Model.deinit
pub fn deinit(self: *Model, instance: *const Instance) void
Release tensor buffers, GGUF metadata, and the backing file mapping owned by the model.
function
inspectConfig
pub fn inspectConfig(path: []const u8, allocator: std.mem.Allocator) !ModelConfig
Inspect a GGUF file and extract only the normalized model configuration.
function
inspectModel
pub fn inspectModel(path: []const u8, allocator: std.mem.Allocator) !ModelInspection
Inspect a GGUF file and return a `ModelInspection` containing the normalized model config, the on-disk file size, total tensor byte count, offloadable (MoE expert) tensor byte count, tensor count, and metadata key count.
Does not allocate GPU resources or upload tensors.
function
load
pub fn load(
    path: []const u8,
    instance: *const Instance,
    cmd_pool: *const CommandPool,
    allocator: std.mem.Allocator,
) !Model
Load a GGUF model: memory-map the file, parse headers, and DMA tensors to GPU VRAM.