Last updated: 2026-06-12
API Server
Model Manager Metal
Metal-backed active-model runtime state for the HTTP server.
The server uses this manager to load one Metal model at a time, track the tokenizer and runtime allocations that belong to it, and project fit/status information back into the OpenAI-compatible model-management endpoints.
5 exports shown
struct
LoadSpec
pub const LoadSpec = struct Identifies a model to load: a GGUF file path and optional managed-catalog id.
struct
ModelSummary
pub const ModelSummary = struct Compact view of one catalog entry for the HTTP `/v1/models` response.
struct
ModelCatalogView
pub const ModelCatalogView = struct Snapshot of the full model catalog, filtered by the current GPU profile.
Methods
1method
ModelCatalogView.deinit
pub fn deinit(self: *ModelCatalogView, allocator: std.mem.Allocator) void Free the owned catalog summary slice.
struct
LoadedResources
pub const LoadedResources = struct All GPU and host resources for a loaded model: weights, tokenizer, and inference engine.
struct
ModelManager
pub const ModelManager = struct Thread-safe manager for the currently active model on the Metal backend.
Handles loading, hot-swapping, catalog queries, and VRAM budget enforcement.
Methods
11method
ModelManager.init
pub fn init( spec: LoadSpec, device: *const MetalDevice, allocator: std.mem.Allocator, ) !ModelManager Create a manager and eagerly load the model described by `spec`.
Acquires the per-device GPU process lock before touching Metal resources.
method
ModelManager.initEmpty
pub fn initEmpty( device: *const MetalDevice, requested_context_length: ?u32, allocator: std.mem.Allocator, ) ModelManager Create an idle manager with no model currently loaded.
The GPU process lock is not acquired until the first model is activated. `null` lets the memory planner auto-size the context window.
method
ModelManager.deinit
pub fn deinit(self: *ModelManager) void Tear down the active model, if any, and release the GPU process lock.
method
ModelManager.currentResources
pub fn currentResources(self: *ModelManager) ?*LoadedResources Return the currently loaded resource bundle, or null when idle.
method
ModelManager.activeDisplayName
pub fn activeDisplayName(self: *ModelManager) []const u8 Return the active model display name, or `"none"` when no model is loaded.
method
ModelManager.catalogProfile
pub fn catalogProfile(self: *const ModelManager) []const u8 Return the catalog profile string used for the active Metal device.
method
ModelManager.currentMemoryUsage
pub fn currentMemoryUsage(self: *ModelManager) MemoryUsage Snapshot memory usage for the active model, or zeroes when idle.
method
ModelManager.collectCatalogView
pub fn collectCatalogView(self: *ModelManager, allocator: std.mem.Allocator, include_all: bool) !ModelCatalogView Build a catalog view annotated with install, fit, and active-model status.
Entries are filtered to those that are supported on the current GPU profile and fit within the VRAM budget unless `include_all` is true. The currently-active model is always included even if its static VRAM estimate exceeds the live budget. Unrecognised loaded models (raw GGUF files with no catalog entry) appear as a synthetic entry with `managed = false`.
method
ModelManager.supportsManagedEntry
pub fn supportsManagedEntry(self: *ModelManager, entry: catalog_mod.CatalogEntry, allocator: std.mem.Allocator) bool Return whether a managed catalog entry is supported and fits on the active device.
method
ModelManager.activateManagedModel
pub fn activateManagedModel(self: *ModelManager, model_id: []const u8, persist_active: bool) !void Load the specified catalog model and make it the active inference target.
Hot-swaps the previous model if one is loaded. Validates that the model is installed, supported on this GPU profile, and fits within the VRAM budget before touching Metal.
method
ModelManager.removeManagedModel
pub fn removeManagedModel(self: *ModelManager, model_id: []const u8, force: bool) !RemoveResult Unload and delete a managed model from both GPU memory and the model store on disk.
If the model is currently loaded and `force` is false, returns `error.ModelLoadedInGpu`. With `force` true the model is evicted from GPU memory before deletion.