Last updated: 2026-06-12
Model Format & Loading
Loader Metal
Metal-specific model loading — zero-copy via mmap + newBufferWithBytesNoCopy.
This replaces the Vulkan loader's staging-buffer DMA with direct mmap wrapping.
struct
ModelInspection
pub const ModelInspection = struct
Summary returned by `inspectModel`: config plus file and tensor size statistics.
struct
LoadedTensor
pub const LoadedTensor = struct
A tensor descriptor paired with a Metal buffer holding its weight data (mmap-wrapped or copied).
struct
Model
pub const Model = struct
Runtime model state backed by a memory-mapped GGUF file and zero-copy Metal buffers.
Methods
Model.deinit
pub fn deinit(self: *Model) void
Release Metal buffers, GGUF metadata, and the backing file mapping.
function
residentWeightBytes
pub fn residentWeightBytes(model: *const Model) u64
Returns the total byte count of model weights that are resident as Metal resources.
Copied tensor arenas replace their mmap-backed tensors in the GPU-visible working set, so arena bytes are counted once and aliased per-tensor handles are skipped to avoid double-counting.
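The counting rule described above can be sketched as follows. This is a minimal illustration, not the loader's actual code: `Tensor`, its `arena_index` field, and `residentBytes` are hypothetical stand-ins, and it assumes a tensor copied into an arena can be recognised by a non-null arena index.

```zig
const std = @import("std");

// Hypothetical, simplified stand-in for the loader's tensor record.
const Tensor = struct {
    size_bytes: u64,
    // Index of the arena this tensor was copied into, or null if it is
    // still backed by its own mmap-wrapped buffer (assumed layout).
    arena_index: ?usize,
};

// Sum arena sizes once, then add only tensors that do not alias an arena.
fn residentBytes(arena_sizes: []const u64, tensors: []const Tensor) u64 {
    var total: u64 = 0;
    for (arena_sizes) |size| total += size;
    for (tensors) |t| {
        // Arena-backed tensors are already covered by the arena total.
        if (t.arena_index != null) continue;
        total += t.size_bytes;
    }
    return total;
}

test "arena bytes are counted once" {
    const arenas = [_]u64{4096};
    const tensors = [_]Tensor{
        .{ .size_bytes = 1024, .arena_index = 0 },
        .{ .size_bytes = 2048, .arena_index = 0 },
        .{ .size_bytes = 512, .arena_index = null },
    };
    // 4096 (arena) + 512 (mmap-backed tensor) = 4608
    try std.testing.expectEqual(@as(u64, 4608), residentBytes(&arenas, &tensors));
}
```

Summing per-tensor sizes instead would count the two arena-backed tensors a second time, which is the double-counting the description warns about.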
function
inspectConfig
pub fn inspectConfig(path: []const u8, allocator: std.mem.Allocator) !ModelConfig
Parse a GGUF file's metadata and return the derived `ModelConfig` without touching the GPU.
The file is memory-mapped and unmapped before returning; no Metal resources are created.
function
inspectModel
pub fn inspectModel(path: []const u8, allocator: std.mem.Allocator) !ModelInspection
Parse a GGUF file and return a `ModelInspection` with size statistics and the derived config.
Computes the total raw byte size of all tensor payloads stored in the file. No GPU resources are created; the file mapping is released before returning.
function
load
pub fn load(path: []const u8, metal_ctx: ?*shim.MetalCtx, allocator: std.mem.Allocator) !Model
Load a GGUF model file and return a `Model` backed by zero-copy Metal buffers.
Each tensor's data is wrapped in a `newBufferWithBytesNoCopy` Metal buffer over the mmap'd file region. For model architectures that benefit from it (e.g. dense Gemma layers), select tensors are copied into pre-allocated Metal arenas to avoid UMA pressure from mixed mmap/Metal page-fault patterns. All weight buffers are registered with an `MTLResidencySet` on macOS 15+ to prevent paging between layers.
Supported architectures: qwen2, qwen2_moe, qwen35, mistral, mamba, jamba, gemma.
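Metal's no-copy buffers need a page-aligned base pointer and a page-multiple length, while tensor payloads in a GGUF file generally start at arbitrary offsets. One way to reconcile the two is to wrap the enclosing page-aligned span and keep the tensor's offset inside it. The sketch below shows only that arithmetic; `AlignedSpan` and `alignedSpan` are hypothetical names, and whether this loader handles alignment this way is an assumption.

```zig
const std = @import("std");

// Hypothetical result type describing the region handed to Metal.
const AlignedSpan = struct {
    // Page-aligned file offset where the wrapped region begins.
    start: u64,
    // Length of the wrapped region, a multiple of the page size.
    len: u64,
    // Offset of the tensor's first byte inside the wrapped region.
    inner_offset: u64,
};

// Round the tensor's byte range outward to page boundaries.
// Assumes page_size is non-zero.
fn alignedSpan(tensor_offset: u64, tensor_len: u64, page_size: u64) AlignedSpan {
    const start = tensor_offset - (tensor_offset % page_size);
    const end = tensor_offset + tensor_len;
    const rem = end % page_size;
    const aligned_end = if (rem == 0) end else end + (page_size - rem);
    return .{
        .start = start,
        .len = aligned_end - start,
        .inner_offset = tensor_offset - start,
    };
}
```

With a 16384-byte page, a 1000-byte tensor at file offset 20000 yields `start = 16384`, `len = 16384`, and `inner_offset = 3616`. The rounded-up end can extend past the tensor's own bytes, so a real loader would also need to confirm the span stays within the mapped region.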