Last updated: 2026-06-12

Model Format & Loading

Loader

Build runtime model state from GGUF metadata and GPU-resident tensor buffers.

This module translates an on-disk GGUF file into the normalized model configuration and uploaded tensors consumed by the inference runtime.

13 exports, 1 method. Source: src/model/loader.zig

constant

Architecture

pub const Architecture = config_mod.Architecture

Supported model architectures (re-exported from config.zig).

src/model/loader.zig:17

constant

ModelConfig

pub const ModelConfig = config_mod.ModelConfig

Normalized model dimensions and hyperparameters (re-exported from config.zig).

src/model/loader.zig:20

struct

ModelInspection

pub const ModelInspection = struct

Summary returned by `inspectModel`: config plus file and tensor size statistics.

src/model/loader.zig:23

struct

LoadedTensor

pub const LoadedTensor = struct

A tensor descriptor paired with the GPU buffer that stores its contents.

src/model/loader.zig:36

function

isMoEExpertTensor

pub fn isMoEExpertTensor(name: []const u8) bool

Return true if this GGUF tensor name designates a fused MoE expert weight tensor.

Matches the four suffixes used in GGUF files for sparse-MoE architectures: `ffn_gate_exps.weight`, `ffn_up_exps.weight`, `ffn_down_exps.weight`, and `ffn_down_exps_scale.weight` (Q4_K_M variants only). Dense tensors and the non-expert tensors of MoE models (router gate, attention, embeddings, etc.) are not matched.
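The suffix rule described above can be illustrated with a short, self-contained sketch. This is not the code in `src/model/loader.zig`: `isExpertName` and `expert_suffixes` are hypothetical names, and the sketch assumes the match is a plain "ends with" comparison against the four documented suffixes. The tensor names in the test are illustrative only.

```zig
const std = @import("std");

// The four suffixes listed in the description of isMoEExpertTensor.
const expert_suffixes = [_][]const u8{
    "ffn_gate_exps.weight",
    "ffn_up_exps.weight",
    "ffn_down_exps.weight",
    "ffn_down_exps_scale.weight",
};

/// Hypothetical stand-in for `isMoEExpertTensor`: returns true when `name`
/// ends with one of the fused expert suffixes (assumed matching rule).
fn isExpertName(name: []const u8) bool {
    for (expert_suffixes) |suffix| {
        if (std.mem.endsWith(u8, name, suffix)) return true;
    }
    return false;
}

test "expert suffix classification" {
    // Fused expert weights match.
    try std.testing.expect(isExpertName("blk.3.ffn_gate_exps.weight"));
    try std.testing.expect(isExpertName("blk.3.ffn_down_exps.weight"));
    // Router gate, attention, and embedding tensors do not.
    try std.testing.expect(!isExpertName("blk.3.ffn_gate_inp.weight"));
    try std.testing.expect(!isExpertName("blk.3.attn_q.weight"));
    try std.testing.expect(!isExpertName("token_embd.weight"));
}
```

Running `zig test` on a file containing this block exercises both the matching and the non-matching cases.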

src/model/loader.zig:49

function

computeOffloadDecision

pub fn computeOffloadDecision(override: OffloadOverride, total_tensor_bytes: u64, offloadable_tensor_bytes: u64, vram_budget_bytes: u64) bool

Compute whether MoE expert tensors should be offloaded to host RAM, without mutating any global state.

Suitable for unit tests and for use by the side-effecting `decideOffloadForLoad` wrapper.

Parameters

override
Explicit environment-variable override (`.force_on` or `.force_off`), or `.auto` to decide from the sizes below.
total_tensor_bytes
Total size of all model tensors in bytes.
offloadable_tensor_bytes
Bytes belonging to MoE expert tensors only.
vram_budget_bytes
Reported VRAM capacity in bytes (from the Vulkan device).

Returns

true if expert tensors should be placed in host-visible memory.

src/model/loader.zig:87

function

decideOffloadForLoad

pub fn decideOffloadForLoad(total_tensor_bytes: u64, offloadable_tensor_bytes: u64, vram_budget_bytes: u64) bool

Decide whether to offload MoE expert tensors for the next model load and cache the decision in `offload_state`.

Honors `ZINC_OFFLOAD_MOE_EXPERTS` if set; otherwise auto-decides:

- If the full model fits in VRAM (with headroom for KV/runtime): no offload.
- If the model only fits with MoE experts in host RAM: enable offload.
- If neither fits: don't enable (let allocation fail with a clear OOM instead of pretending to fit).

Parameters

total_tensor_bytes
Total size of all model tensors in bytes.
offloadable_tensor_bytes
Bytes belonging to MoE expert tensors only.
vram_budget_bytes
Reported VRAM capacity in bytes (from the Vulkan device).

Returns

true if expert tensors will be placed in host-visible memory.
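The auto-decision rule (no offload when the whole model fits, offload when it fits only with experts in host RAM, no offload when neither fits) can be sketched as a pure function. This is a minimal illustration, not the loader's implementation. `sketchOffloadDecision` and `assumed_headroom_bytes` are hypothetical, the 1 GiB headroom is a placeholder because the real reserve for KV cache and runtime buffers is not stated here, and it is an assumption that a forced override wins regardless of sizes.

```zig
const std = @import("std");

/// Mirrors the override values named in the parameter documentation.
const OffloadOverride = enum { auto, force_on, force_off };

/// ASSUMPTION: placeholder headroom reserved for KV cache and runtime
/// buffers. The value used by the real loader is not documented here.
const assumed_headroom_bytes: u64 = 1 << 30;

/// Hypothetical sketch of the documented rule. It has no side effects.
fn sketchOffloadDecision(
    override: OffloadOverride,
    total_tensor_bytes: u64,
    offloadable_tensor_bytes: u64,
    vram_budget_bytes: u64,
) bool {
    // ASSUMPTION: an explicit override short-circuits the size checks.
    switch (override) {
        .force_on => return true,
        .force_off => return false,
        .auto => {},
    }
    // Saturating subtraction keeps the sketch well-defined for tiny budgets.
    const usable = vram_budget_bytes -| assumed_headroom_bytes;
    // Case 1: the whole model fits, so nothing needs to move.
    if (total_tensor_bytes <= usable) return false;
    // Case 2: offload only if the non-expert remainder then fits.
    // Case 3: otherwise report false and let allocation fail loudly.
    const resident = total_tensor_bytes -| offloadable_tensor_bytes;
    return offloadable_tensor_bytes > 0 and resident <= usable;
}

test "auto decision covers the three documented cases" {
    const gib: u64 = 1 << 30;
    try std.testing.expect(!sketchOffloadDecision(.auto, 10 * gib, 6 * gib, 16 * gib));
    try std.testing.expect(sketchOffloadDecision(.auto, 20 * gib, 14 * gib, 16 * gib));
    try std.testing.expect(!sketchOffloadDecision(.auto, 40 * gib, 10 * gib, 16 * gib));
}
```

Keeping the rule in a pure function, with the environment lookup and caching in a thin wrapper, is what makes each case testable without a GPU.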

src/model/loader.zig:119

function

offloadEnabled

pub fn offloadEnabled() bool

Return whether MoE expert tensors should be in host-visible memory for the currently-loaded model.

Reads the cached decision from `decideOffloadForLoad`. Returns false until a load has happened.

src/model/loader.zig:138

function

shouldOffloadToHost

pub fn shouldOffloadToHost(name: []const u8) bool

Return true if this tensor should be allocated in host-visible memory rather than device-local VRAM.

Returns true only when MoE expert offload is enabled (see `offloadEnabled`) and the tensor name matches an expert weight suffix.

Parameters

name
GGUF tensor name to classify.

Returns

true if the tensor belongs to a MoE expert and offload is active.
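How the cached decision and the name check combine can be shown with a small sketch. Everything here is hypothetical scaffolding: `offload_enabled_cache` stands in for the module's `offload_state`, `recordDecision` stands in for the caching done by `decideOffloadForLoad`, and the suffix comparison is assumed to be a plain "ends with" match.

```zig
const std = @import("std");

// Hypothetical stand-in for the cached `offload_state`; false until a load
// records a decision.
var offload_enabled_cache: bool = false;

// Hypothetical helper playing the role of the caching step in
// `decideOffloadForLoad`.
fn recordDecision(enabled: bool) void {
    offload_enabled_cache = enabled;
}

const expert_suffixes = [_][]const u8{
    "ffn_gate_exps.weight",
    "ffn_up_exps.weight",
    "ffn_down_exps.weight",
    "ffn_down_exps_scale.weight",
};

fn isExpertName(name: []const u8) bool {
    for (expert_suffixes) |suffix| {
        if (std.mem.endsWith(u8, name, suffix)) return true;
    }
    return false;
}

/// Sketch of the placement rule: host-visible memory only when offload is
/// active AND the tensor is an expert weight.
fn sketchShouldOffloadToHost(name: []const u8) bool {
    return offload_enabled_cache and isExpertName(name);
}

test "placement follows the cached decision" {
    defer recordDecision(false); // leave the sketch state clean
    const expert = "blk.0.ffn_up_exps.weight";
    recordDecision(false);
    try std.testing.expect(!sketchShouldOffloadToHost(expert));
    recordDecision(true);
    try std.testing.expect(sketchShouldOffloadToHost(expert));
    try std.testing.expect(!sketchShouldOffloadToHost("blk.0.attn_k.weight"));
}
```

Under these assumptions, a non-expert tensor stays in device-local memory whether or not offload is active.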

src/model/loader.zig:147

struct

Model

pub const Model = struct

Runtime model state backed by a memory-mapped GGUF file and uploaded tensor buffers.

src/model/loader.zig:152

Methods

method

Model.deinit

pub fn deinit(self: *Model, instance: *const Instance) void

Release tensor buffers, GGUF metadata, and the backing file mapping owned by the model.

Parameters

self
Model instance to tear down in place.
instance
Unused; accepted for call-site symmetry with other deinit patterns.

src/model/loader.zig:169

function

inspectConfig

pub fn inspectConfig(path: []const u8, allocator: std.mem.Allocator) !ModelConfig

Inspect a GGUF file and extract only the normalized model configuration.

Parameters

path
Path to the GGUF file on disk.
allocator
Allocator used for the parsed metadata structures.

Returns

A ModelConfig derived from GGUF metadata without uploading tensors to the GPU.

src/model/loader.zig:460

function

inspectModel

pub fn inspectModel(path: []const u8, allocator: std.mem.Allocator) !ModelInspection

Inspect a GGUF file and return a `ModelInspection` containing the normalized model config, the on-disk file size, total tensor byte count, offloadable (MoE expert) tensor byte count, tensor count, and metadata key count.

Does not allocate GPU resources or upload tensors.

Parameters

path
Path to the GGUF file on disk.
allocator
Allocator used for the parsed GGUF metadata structures.

Returns

A `ModelInspection` with config and tensor/file size statistics.

src/model/loader.zig:491

function

load

pub fn load(path: []const u8, instance: *const Instance, cmd_pool: *const CommandPool, allocator: std.mem.Allocator) !Model

Load a GGUF model: memory-map the file, parse its headers, and DMA the tensors into GPU VRAM.

Parameters

path
Path to the GGUF file on disk.
instance
Active Vulkan instance used for buffer allocation.
cmd_pool
Command pool used for staging copy operations.
allocator
Allocator used for metadata, tensor lists, and temporary state.

Returns

A fully populated Model with parsed metadata and uploaded tensors.

src/model/loader.zig:536