Last updated: 2026-06-12
Primary Reference
ZINC Zig API
This is the main documentation surface for the codebase: generated from src/**/*.zig, grouped by
functionality, and linked with stable module and symbol anchors.
Find Code
Search modules, symbols, and methods
API Section
CLI & Entrypoints
Startup, argument parsing, and the top-level process path that wires model loading, tokenization, and generation together.
Build Info
Build metadata exported by `build.zig` for CLI version reporting.
- version
- commit
- target
- optimize
CLI
CLI entrypoints for configuring ZINC and starting local inference.
- is_debug_mode
- std_options
- myLogFn
- Config
Zig-struct-analyzer
Generated struct-layout probe used by the site Zig API docs.
- main
CLI
ZINC_RT backend entrypoint.
- std_options
- main
API Section
Model Format & Loading
GGUF parsing, metadata normalization, and the runtime structures that move weights from disk into GPU-resident buffers.
Config
Platform-independent model types shared by Vulkan and Metal backends.
- Architecture
- ModelConfig
- parseArchitecture
GGUF
Parse GGUF container files and expose the metadata needed by the loader.
- GGUF_MAGIC
- GGUFVersion
- GGUFType
- GGMLType
Loader Cuda
CUDA-specific model loading — mmap the GGUF, upload every tensor to the NVIDIA device, and expose them to the CUDA forward pass.
- LoadedTensor
- Model
- inspectConfig
- dequantRow
Loader Metal
Metal-specific model loading — zero-copy via mmap + newBufferWithBytesNoCopy.
- ModelInspection
- LoadedTensor
- Model
- residentWeightBytes
Loader
Build runtime model state from GGUF metadata and GPU-resident tensor buffers.
- Architecture
- ModelConfig
- ModelInspection
- LoadedTensor
API Section
Tokenization
Prompt and output text conversion between UTF-8 strings and token IDs used by the decode loop.
API Section
Decode Planning
Static graph construction and dependency ordering for the per-token compute work that the runtime records and submits.
Graph
Represent decode work as a dependency graph that can be topologically ordered.
- ExecDomain
- BottleneckKind
- HardwareInfo
- OpType
Architecture
Build static decode graphs for the supported model families.
- buildDecodeGraph
- buildDecodeGraphDetailed
Graph
Shape-static ZINC_RT IR graph builder.
- BufferId
- NodeId
- max_bindings
- BindingList
Op
ZINC_RT IR opcode definitions.
- Stage
- Milestone
- Opcode
- Info
Verify
ZINC_RT IR verifier entrypoints.
- graph
API Section
Inference Runtime
Decode state, pipeline ownership, command recording, and token sampling inside the active inference loop.
Bench Hot Decode
Hot-path decode kernel microbenchmarks.
- main
Bench Support
Shared helpers for benchmark and standalone runner entrypoints.
- metal_device
- metal_loader
- metal_buffer
- metal_command
Forward Cuda Gemma
CUDA forward pass for the dense gemma4 transformer (Gemma 4 31B-it).
- ForwardGemma
Forward Cuda
CUDA forward pass for the dense `qwen35` hybrid-SSM model (Qwen 3.5 9B).
- ForwardCuda
Forward Metal
Metal inference engine — decode loop for Apple Silicon.
- CommandEncoderMode
- runtime_context_cap
- DecodeState
- GenerateMetrics
Forward Zinc Rt
ZINC_RT forward-pass bring-up.
- m0_max_decode_tokens_default
- Model
- DecodeGraphSummary
- DirectComputeKind
Forward
Run the inference runtime: decode state, pipeline ownership, and token generation.
- DecodeState
- SamplingParams
- InferenceEngine
- generate
Interface
GPU backend abstraction — comptime-resolved, zero runtime overhead.
- is_metal
- is_cuda
- is_vulkan
- backend
Memory Plan
Shared runtime memory accounting helpers for Vulkan and Metal backends.
- RuntimeMemoryProfile
- effectiveContextCeiling
- applyRequestedContextLimit
- requestedContextTokens
Process Lock
Cross-process GPU reservation lock keyed by backend and selected device.
- Backend
- ProcessLock
- AcquireError
- lockPath
Engine
ZINC_RT — the ZINC Runtime.
- Tier
- Options
- Engine
- parseTier
Fast Pool
Low-overhead worker pool for the T-CPU decode matvec fan-out.
- max_workers
- Task
- FastPool
Dequant
Shared scalar GGML dequantization helpers for T-CPU kernels.
- row
- dotRow
- dotF32Row
- dotF16Row
Embed
T-CPU EMBED implementation.
- Params
- run
Flash Attn
T-CPU flash attention (single-query decode) implementation.
- Params
- run
Lm Head
T-CPU LM_HEAD implementation.
- Params
- run
Matvec
T-CPU matrix-vector projection implementation.
- Params
- run
Mod
Pure Zig T-CPU opcode implementations.
- rms_norm
- swiglu
- argmax
- dequant
Moe Gate Topk
T-CPU MOE_GATE_TOPK implementation.
- RoutingRule
- Params
- run
Residual Rms Norm
T-CPU residual add + RMS norm implementation.
- Params
- run
Rms Norm
T-CPU RMS_NORM implementation.
- Params
- run
Rope
T-CPU RoPE (Rotary Positional Embedding) implementation.
- Params
- run
Sigmoid Mul
T-CPU sigmoid-gated multiply implementation.
- Params
- run
Swiglu
T-CPU SwiGLU implementation.
- Params
- run
Vadd
T-CPU element-wise vector addition implementation.
- Params
- run
Kmd
Thin AMDGPU kernel-driver queries used by direct ZINC_RT tiers.
- QueryStatus
- ComputeUserqInfo
- QueryResult
- AMDGPU_HW_IP_COMPUTE
Lib
ZINC_RT reference-runtime module.
- engine
- ir_op
- ir_graph
- kmd
Cpu
T-CPU ring backend.
- CpuRing
Cs
AMDGPU DRM command-submission (CS) path — bring-up of the RADV / radeonsi PM4 submission foundation.
- default_render_node
- AMDGPU_HW_IP_GFX
- AMDGPU_HW_IP_COMPUTE
- AMDGPU_CTX_OP_ALLOC_CTX
Kfd
AMDGPU KFD (`/dev/kfd`) bring-up for the T1 PM4-direct tier.
- default_render_node
- kfd_device_node
- topology_nodes_dir
- min_kfd_major
Mod
Backend-neutral packet batch types for ZINC_RT rings.
- Packet
- PacketBatch
Packet List
Dynamic packet list for building per-token decode sequences.
- PacketList
Packet
PM4 packet builder shared by direct AMD ZINC_RT tiers.
- Error
- sh_reg_num_thread_x
- sh_reg_pgm_lo
- sh_reg_pgm_rsrc1
Umq
AMDGPU user-mode queue (T2) availability and create/free smoke gate.
- min_linux_major
- min_linux_minor
- default_render_node
- user_queue_param_path
API Section
Sampling
Logit post-processing, argmax helpers, and token-selection controls layered on top of the decode runtime.
API Section
Shader Dispatch
Typed wrappers around the compute shaders that prepare push constants, descriptor layouts, and per-op dispatch dimensions.
Attention
Wrap the flash-attention compute shader and its dispatch parameters.
- FlashAttnPush
- FlashAttnBatchedPush
- FlashAttnSplitMergePush
- AttentionDispatch
DMMV
Wrap the decode-time matrix-vector shader family used for projection ops.
- DmmvPushConstants
- BatchDmmvPushConstants
- MoeDmmvPushConstants
- MoeBatchedDmmvPushConstants
Elementwise
Wrap the fused element-wise shader family used by the decode loop.
- RmsNormPush
- SwigluPush
- DeinterleavePush
- SigmoidMulPush
Runtime Assets
Runtime asset discovery for installed and source-tree ZINC layouts.
- ShaderKind
- resolveShaderDir
- resolveShaderDirFrom
API Section
Hardware Detection
Vendor and architecture heuristics that translate raw Vulkan properties into tuning defaults for AMD, NVIDIA, and Intel GPUs.
Diagnostics Metal
Apple Silicon diagnostics and managed-model fit reporting for Metal.
- Options
- ManagedModelInfo
- UnifiedFitEstimate
- run
Diagnostics
Vulkan system diagnostics (`zinc --check`).
- Options
- ManagedModelInfo
- FitEstimate
- run
GPU Detect
Inspect the selected Vulkan device and derive architecture-specific tuning defaults.
- GpuVendor
- GpuConfig
- detect
API Section
Vulkan Runtime
Low-level Vulkan setup, memory allocation, buffers, pipelines, and command submission primitives used throughout the engine.
Buffer
Allocate Vulkan buffers used by weights, intermediates, and staging copies.
- Buffer
- copyBuffer
Command
Create reusable compute command pools and command buffers.
- CommandPool
- CommandBuffer
Instance
Initialize Vulkan, select a compute-capable device, and expose memory utilities.
- PushDescriptorFn
- DeviceCapabilities
- Instance
Pipeline
Load SPIR-V compute shaders into Vulkan pipelines.
- Pipeline
- SpecConst
- PipelineOptions
- createFromSpirv
API Section
Metal Runtime
Low-level Metal device discovery, buffers, pipelines, and command submission primitives used by the Apple Silicon backend.
Buffer
Metal buffer wrapper — shared-mode GPU buffers with zero-copy mmap support.
- MetalBuffer
- createBuffer
- createPrivateBuffer
- wrapMmap
C
Shared C import for the Metal shim — all Metal modules import from here to ensure type identity across compilation units.
- shim
Command
Metal command buffer wrapper — dispatch recording and GPU synchronization.
- CommandEncoderMode
- MetalCommand
- beginCommand
- beginCommandWithMode
Device
Metal device wrapper — macOS Apple Silicon GPU backend.
- GpuFamily
- MetalCapabilities
- MetalDevice
Kernel Timing
Per-kernel Metal dispatch timing probe — default-off, env-flag-gated.
- enabled
- Entry
- enable
- reset
Pipeline
Metal compute pipeline wrapper — MSL source or precompiled metallib.
- MetalPipeline
- createPipeline
- createPipelineFromLib
- freePipeline
API Section
Managed Models
Catalog metadata, cache management, model downloads, and active-selection helpers used by the CLI and server.
Catalog
Curated catalog of ZINC-supported managed GGUF models.
- CatalogStatus
- CatalogEntry
- FitState
- apple_silicon_profile
Managed
Managed model cache, active-model selection, and download helpers.
- RuntimePaths
- ModelFit
- ActiveSelection
- CachedGpuProfile
Model Manager
Managed active-model runtime state for the HTTP server and CLI startup.
- LoadSpec
- ModelSummary
- ModelCatalogView
- LoadedResources
API Section
Scheduler
Continuous batching scheduler, paged KV cache management, and request lifecycle tracking for concurrent inference serving.
KV Cache
Paged KV cache manager for concurrent request serving.
- KvPage
- KvPagePool
Request
Request lifecycle management for concurrent inference serving.
- RequestState
- GenerationParams
- Request
Scheduler
Continuous-batching scheduler groundwork for concurrent inference requests.
- Scheduler
API Section
API Server
OpenAI-compatible HTTP server, route dispatch, SSE streaming, and session management for serving inference over the network.
Http
Minimal HTTP/1.1 server for the OpenAI-compatible inference API.
- Connection
- Method
- Server
Model Manager Metal
Metal-backed active-model runtime state for the HTTP server.
- LoadSpec
- ModelSummary
- ModelCatalogView
- LoadedResources
Model Manager Runtime
Backend-selected model manager for the HTTP server.
- LoadSpec
- ModelManager
Routes
Route dispatcher and endpoint handlers for the OpenAI-compatible API.
- toolCallingEnabled
- ServerState
- handleConnection
Runtime
Backend-specific server runtime aliases and wrappers.
- is_metal
- is_vulkan
- supports_model_management
- supports_sampling_controls
API Section
Tool Calling
Tool-use protocol helpers: chat-template-aware tool definitions, argument parsing, and response formatting for function-calling-capable models.
API Section
CUDA Runtime
CUDA device discovery, context management, device buffers, NVRTC-compiled pipelines, and stream-based command submission for the NVIDIA backend.
Buffer
CUDA buffer wrapper — device-local allocations with optional pinned staging.
- CudaBuffer
- createBuffer
- createBufferStaged
- uploadMmap
C
Shared C import for the CUDA shim — all CUDA backend modules import from here to ensure type identity across compilation units (mirrors src/metal/c.zig).
- shim
Command
CUDA command wrapper — kernel dispatch and stream/event synchronization (mirrors src/metal/command.zig).
- CudaCommand
- beginCommand
Device
CUDA device wrapper — NVIDIA GPU backend (mirrors src/metal/device.zig).
- CudaCapabilities
- CudaDevice
Pipeline
CUDA compute pipeline wrapper — NVRTC-compiled CUfunction (mirrors src/metal/pipeline.zig).
- CudaPipeline
- createPipeline
- createPipelineFromImage
- setMaxDynamicShared
Smoke
Standalone smoke test for the ZINC CUDA backend Zig wrapper layer. Drives the GPU entirely through device.zig / buffer.zig / pipeline.zig / command.zig (which wrap cuda_shim.c) — proving the Zig<->CUDA seam: device select, staged buffers + H2D/D2H, NVRTC runtime compile, the buffers+push dispatch ABI, and both sync and async commit paths.
- main
Dbg Cuda
CUDA forward-pass debug harness for the qwen35 hybrid-SSM model. Two modes:
- main
Loadtest Cuda
Standalone load-test for src/model/loader_cuda.zig.
- main
Run Cuda
Standalone CUDA greedy-decode driver for the qwen35 forward pass.
- main
Developer Entry Points
Start with the hot paths
Agent Access
Machine-Readable Entry Points
Agents should prefer the generated Zig API exports instead of scraping HTML. The JSON export is the canonical
structured surface; llms.txt points callers at the right docs in the right order.
JSON Export
Structured machine-readable dump of sections, modules, symbols, methods, doc tags, anchors, and source links.
Plain Text Export
Single text artifact of the Zig API for retrieval-oriented agents and lightweight ingestion.
LLMs Index
Root agent entrypoint that tells clients to use the Zig API first and points to supplemental docs second.
Supplemental
Architecture and Narrative Docs
These pages support the API reference with higher-level specs, tuning notes, and protocol descriptions. They are intentionally secondary to the generated Zig reference.
Technical Spec
Architecture notes, scheduling model, model support, and system design context behind the code reference.
HTTP API Reference
OpenAI-compatible request and response shapes for the serving interface built on top of the Zig runtime.
RDNA4 Tuning
Architecture-specific performance findings and profiling notes that explain the GPU assumptions in the implementation.
TurboQuant
Background on the KV-cache compression approach that the runtime is designed to accommodate.