Last updated: 2026-05-24
ZINC Technical Specification#
Last updated: 2026-04-17
ZINC is a local-first LLM inference engine written primarily in Zig. It reads GGUF directly, runs single-model inference through a CLI and OpenAI-compatible HTTP API, and targets two GPU backends:
- Vulkan on Linux, primarily for AMD RDNA3/RDNA4
- Metal on macOS for Apple Silicon
This page is a living architecture document for the current implementation. Where the repository contains prototype code or forward-looking work, that is called out explicitly instead of being presented as already shipped behavior.
1. Current State At A Glance#
| Area | Current state |
|---|---|
| Backend selection | Compile-time: Linux builds the Vulkan path, macOS builds the Metal path |
| Model format | Native GGUF parsing and loading |
| Active model policy | One model loaded into runtime memory at a time |
| Supported serving | CLI, built-in chat UI, /v1/chat/completions, /v1/completions, /v1/models, /v1/models/pull, /v1/models/activate, /v1/models/remove, /health |
| Concurrency model | HTTP is concurrent, but generation is still serialized behind one engine lock |
| Scheduler status | src/scheduler/* contains groundwork for continuous batching and page allocation, but it is not the main serving hot path today |
| KV cache | Vulkan uses a paged 16-token layout; Metal currently favors a contiguous fast path with the same conceptual interface |
| Context planning | Runtime memory is budgeted through a shared planner; today's engine is still effectively tuned around a 4096 token context cap |
| KV compression | Metal has a q8_0 KV option in the runtime; TurboQuant is specified elsewhere but is not integrated into the main runtime yet |
| Graph tooling | Decode graphs can be exported as JSON and DOT with per-op cost annotations |
2. Design Goals#
ZINC is optimized around a small number of strong constraints:
- keep the runtime mostly Zig
- own GGUF parsing and model configuration directly
- keep the decode path explicit enough to profile and tune at kernel level
- keep the higher-level UX stable across backends
- make performance work observable through graph export, diagnostics, and microbenchmarks
That leads to a split architecture:
- shared upper layers for tokenization, GGUF parsing, model catalog, HTTP routes, diagnostics, and memory planning
- backend-specific substrate for device discovery, buffers, pipeline compilation, command submission, and kernel code
- model-family-specific decode planning driven from normalized GGUF metadata instead of hardcoded per-model scripts
3. System Layout#
CLI / HTTP server / chat UI
-> tokenizer + chat templates
-> managed model catalog + model manager
-> GGUF parser + model config
-> decode graph builder + graph export
-> backend runtime
-> Vulkan: SPIR-V compute shaders, device-local VRAM, command buffers
-> Metal: MSL kernels, unified memory, Objective-C shim
-> sampling + streaming + health / memory reporting
3.1 Shared modules#
The shared layers are centered in these files:
src/model/gguf.zigparses metadata and tensor layouts from GGUFsrc/model/config.zignormalizes architecture-specific metadata into a runtime configsrc/model/tokenizer.zigowns vocab, merges, chat templates, and thinking-toggle handlingsrc/model/catalog.zigdefines the curated managed-model catalogsrc/model/managed.zighandles downloads, cache layout, active selection, and fit checkssrc/server/routes.zigserves the HTTP API and built-in chat UIsrc/gpu/memory_plan.zigcomputes fixed and per-token runtime memory costs for both backends
3.2 Backend-specific substrate#
The backend switch happens in src/gpu/interface.zig at compile time:
- Linux => Vulkan backend
- macOS => Metal backend
This keeps the inactive backend out of the build and lets the upper layers call one runtime interface without paying for a large runtime abstraction.
4. Backend Architecture#
4.1 Vulkan Path#
The Vulkan runtime lives primarily in:
src/vulkan/instance.zigsrc/vulkan/buffer.zigsrc/vulkan/pipeline.zigsrc/vulkan/command.zigsrc/compute/forward.zigsrc/shaders/*.comp
Key properties of the Vulkan path:
- weights are uploaded into device-local VRAM
- compute kernels are handwritten GLSL 460 shaders compiled to SPIR-V
- decode work is recorded against explicit command buffers
- per-dispatch variability is carried through push constants and descriptor bindings
- AMD-specific tuning is built around wave64 and bandwidth-sensitive decode kernels
The Vulkan engine is where the most explicit static-graph and paged-KV machinery exists today.
4.2 Metal Path#
The Metal runtime lives primarily in:
src/metal/device.zigsrc/metal/buffer.zigsrc/metal/pipeline.zigsrc/metal/command.zigsrc/metal/shim.msrc/model/loader_metal.zigsrc/compute/forward_metal.zigsrc/shaders/metal/*.metal
Key properties of the Metal path:
- model files are wrapped into
MTLBuffers through zero-copymmap - the Objective-C boundary is isolated to one shim file
- MSL sources are compiled into compute pipelines at startup
- unified memory changes the buffer strategy and removes the Vulkan-style upload path
- Metal keeps the same high-level inference model while using backend-specific kernels and buffer policy
On Apple Silicon, logits are CPU-visible through UMA, which simplifies sampling and debug readback compared with Vulkan.
4.3 Shared Runtime Behavior#
Despite the backend split, both paths share the same user-facing model:
- same CLI entrypoint
- same HTTP routes
- same managed-model catalog and active-model selection
- same tokenizer and chat template logic
- same high-level concepts: prompt prefill, token-by-token decode, KV cache tracking, sampling, health reporting
5. Model Loading And Managed Models#
5.1 GGUF Loading#
ZINC reads GGUF directly instead of depending on an external runtime. The loader stack is responsible for:
- parsing tensor metadata and offsets
- inferring model architecture and normalized dimensions
- resolving tokenizer assets from GGUF metadata
- mapping tensor names to runtime operators
At a high level:
GGUF file
-> parse header + metadata + tensor table
-> derive ModelConfig
-> expose TensorInfo for runtime lookup
-> construct backend-specific weight storage
5.2 Vulkan Loader#
The Vulkan loader in src/model/loader.zig memory-maps the GGUF file and uploads tensors into device-local buffers. The runtime then reads weights from VRAM during decode.
This path is optimized for discrete GPUs where explicit upload cost is worth paying in exchange for device-local execution.
5.3 Metal Loader#
The Metal loader in src/model/loader_metal.zig maps the GGUF file and wraps those regions directly with newBufferWithBytesNoCopy. That fits Apple Silicon's unified-memory model better than mirroring the Vulkan upload flow.
5.4 Managed Model Catalog#
ZINC now ships a curated managed-model catalog in src/model/catalog.zig. Each entry includes:
- stable short id
- display name
- release date
- download URL and optional sha256 pin
- estimated VRAM requirement
- tested GPU profiles
- whether the model's thinking toggle is stable enough to expose in the UI
The managed-model flow is backed by src/model/managed.zig and is used by both CLI and server model management. It provides:
zinc model listzinc model pull <id>zinc model use <id>zinc model activezinc model rm <id>
The HTTP server exposes the same lifecycle through /v1/models, /v1/models/pull, /v1/models/activate, and /v1/models/remove.
5.5 One-Model-At-A-Time Policy#
The runtime model manager (src/server/model_manager.zig and src/server/model_manager_metal.zig) keeps one active model bundle loaded at a time:
- model
- tokenizer
- inference engine
- memory-usage accounting
Hot-swapping models is supported, but swaps are serialized with generation because the engine is not yet multi-tenant in the serving hot path.
6. Supported Model Families#
The current runtime is designed around the model families ZINC is actively validating:
- Qwen3 / Qwen3.5 / Qwen3.6
- Gemma 4
At the execution-model level, that means ZINC handles:
- dense transformer layers
- MoE feed-forward blocks
- SSM-hybrid paths used by Qwen3.5-style models and related experimental families
- model-specific MoE routing rules such as per-expert scale weighting
The architecture-normalization layer is in src/model/architecture.zig and src/model/config.zig.
7. Decode Planning And Graph IR#
ZINC does not treat decode as an opaque loop. It has an explicit graph/planning layer:
src/compute/graph.zigsrc/model/architecture.zig
The graph builder emits a logical decode graph with per-node metadata such as:
- operation type
- layer index
- execution domain
- workgroup counts
- estimated read/write/weight bytes
- approximate FLOPs
- host synchronization requirements
This graph is used for:
- understanding what a model family will execute
- exporting JSON and DOT reports
- bottleneck analysis and hotspot ranking
- validating whether a change actually alters the planned decode structure
The graph is not just a visualization artifact. It is the design-level representation of the decode pipeline.
7.1 Graph Families#
The graph builder currently emits different logical plans for:
- standard transformer decode
- MoE decode
- SSM-hybrid decode
- Gemma-specific dense and MoE variants
7.2 Exported Artifacts#
The CLI can export:
- JSON graph reports
- Graphviz DOT
The JSON report is rich enough for downstream tooling such as tools/render_graph_report.ts, which groups nodes into hotspots, op mix, bottleneck mix, and critical-path summaries.
8. Inference Runtime#
The runtime is centered around InferenceEngine:
src/compute/forward.zigfor Vulkansrc/compute/forward_metal.zigfor Metal
The engine owns:
- decode buffers and scratch space
- KV cache storage
- per-layer pipelines and dispatch helpers
- model-family-specific fast paths
- sampling and logits readback support
8.1 Decode State#
Per-request decode state tracks:
- current token position
- generated tokens
- stop/completion progress
- backend-specific runtime state needed to advance the sequence
8.2 High-Level Token Flow#
For a single generated token, the logical flow is:
token id
-> embedding lookup / dequant
-> per-layer input normalization
-> attention or SSM body
-> FFN or MoE body
-> residual accumulation
-> final normalization
-> LM head projection
-> logits readback or greedy argmax
-> next token
8.3 Attention Layers#
Attention layers perform the usual decode-time sequence:
- project Q, K, and V
- apply RoPE to the rotated dimensions
- write the new K/V vectors into cache storage
- run flash attention over the cached sequence
- apply output projection
- add residual
Grouped-query attention is supported in the flash-attention path.
8.4 SSM / Hybrid Layers#
The hybrid runtime includes SSM-oriented operators such as:
ssm_conv1dssm_delta_netssm_gated_norm
These are used in the Qwen3.5-style hybrid path and related architecture variants parsed by the loader.
8.5 FFN And MoE Execution#
For MoE layers, the runtime supports a split between routing and expert execution:
- gate projection produces router logits
- the runtime selects top experts and normalized weights
- gate/up projections run for selected experts
- SwiGLU or model-specific activation is applied
- down projections run
- outputs are accumulated back into the hidden state
Important detail: ZINC currently supports both a fast GPU-routed path and CPU-assisted fallbacks depending on backend and tensor format availability.
Examples:
- Vulkan can keep more of the MoE route fully on GPU when quant-specific kernels plus
softmax_topkand weighted accumulation are available - otherwise it falls back to router readback plus CPU
topKSoftmax - Metal has a batched MoE path and quant-specific expert kernels, but still falls back where coverage is incomplete or validation requires it
8.6 Sampling#
The runtime supports:
- greedy decoding
- temperature
- top-p
- top-k
- repetition penalty
On Vulkan, logits readback is explicitly controlled. On Metal, UMA makes logits CPU-visible by default.
9. GPU Kernel Library#
9.1 DMMV Is The Decode Workhorse#
The most important kernel family is decode matmul-vector:
Q4_KQ5_KQ6_KQ8_0F16F32
These kernels back:
- attention projections
- FFN projections
- LM head
- MoE expert projections
The kernel library also includes MoE-specialized variants and batch-oriented variants for paths where multiple expert outputs are accumulated together.
9.2 Fused Elementwise Operators#
The elementwise library exists to avoid turning decode into a long chain of tiny memory-bound kernels. Current fused or dedicated operators include:
- RMS norm
- SwiGLU
- RoPE
- deinterleave
- sigmoid-mul
- scale-accumulate
- sigmoid-scale-accumulate
- softmax-topk
- argmax
- KV cache write
- SSM conv / delta / gated norm
On Metal, additional batched operators exist for MoE accumulation and SwiGLU. On Vulkan, the equivalent strategy is expressed through GLSL compute shaders and explicit command-buffer recording.
9.3 Flash Attention#
Flash attention is implemented as a dedicated kernel family rather than a naïve attention loop. The runtime handles:
- single-token decode attention
- paged or contiguous KV traversal depending on backend path
- grouped-query attention
- online softmax-style accumulation
9.4 Execution Strategy Differences By Backend#
Vulkan#
The Vulkan runtime tries to keep a token step inside as few submissions as possible. Important properties:
- explicit command-buffer recording
- push constants for per-dispatch parameters
- stable decode structure from a preplanned graph
- host sync only at unavoidable boundaries such as CPU-assisted MoE routing or sampling
Metal#
The Metal runtime uses:
- runtime MSL compilation
- unified-memory buffers
- batched MoE fast paths for supported quant formats
- backend-specific pipeline capability queries for threadgroup tuning
The abstraction is similar, but the memory model and optimization priorities differ sharply from Vulkan.
10. KV Cache And Memory Planning#
10.1 Vulkan KV Cache#
The Vulkan runtime uses a paged KV layout with 16-token pages. The core pieces are:
- per-layer K/V storage
- a page table buffer used by flash attention
- a page-pool allocator
This is the most explicit realization of ZINC's paged-KV design today.
10.2 Metal KV Cache#
Metal keeps the same conceptual contract but currently favors a contiguous fast path. The flash-attention interface still understands page-table inputs, but the current engine can select a contiguous traversal mode for better practicality on Apple Silicon.
Metal also includes a q8_0 KV-cache option for model families where the quality/performance tradeoff is acceptable.
10.3 Shared Memory Planner#
src/gpu/memory_plan.zig is the shared source of truth for runtime memory budgeting. It splits memory into:
- fixed runtime bytes
- bytes per context token
- device-local / private bytes
- host-visible bytes
That planner feeds:
- diagnostics
- managed-model fit estimation
- active-model load policy
/health/v1/models
This matters because the numbers shown by the CLI and server are intended to come from the same accounting model, not a collection of ad hoc estimates.
10.4 Current Context Policy#
Even though some GGUFs advertise much larger theoretical contexts, the current public runtime is still effectively planned around a 4096 token operating cap. That affects:
- memory reservation
- health reporting
- managed-model fit estimates
- practical serving behavior
11. Serving Layer#
The HTTP server is implemented in:
src/server/http.zigsrc/server/routes.zigsrc/server/runtime.zigsrc/server/model_manager.zigsrc/server/model_manager_metal.zigsrc/server/session.zig
11.1 Current Endpoints#
| Method | Path | Status |
|---|---|---|
GET |
/ and /chat |
Built-in chat UI |
GET |
/health |
Implemented |
GET |
/v1/models |
Implemented |
POST |
/v1/models/pull |
Implemented |
POST |
/v1/models/activate |
Implemented |
POST |
/v1/models/remove |
Implemented |
POST |
/v1/chat/completions |
Implemented |
POST |
/v1/completions |
Implemented |
ZINC does not currently expose /v1/embeddings, so older drafts of this spec that listed it were ahead of the implementation.
11.2 Generation Concurrency#
The server accepts overlapping requests and reports queue depth, but generation is still serialized behind ServerState.generation_mutex.
That means:
- multiple clients can connect concurrently
- queued work is tracked and reported
/healthremains useful during load- decode itself is still one-active-generation-at-a-time
This is the most important serving limitation to understand today.
11.3 Chat Template And Thinking Control#
The chat route is not a thin transport wrapper. It also owns:
- system-prompt insertion when needed
- model-specific chat templating
- thinking-toggle handling
- normalization and sanitization of assistant output
- streaming stop detection
Managed catalog entries additionally carry a thinking_stable bit so the UI and API can avoid exposing a toggle that produces poor results on a given model.
11.4 Session Reuse Cache#
The chat server now includes a small session reuse cache keyed by session id and model path. Its goal is to avoid redoing the full transcript prefill when a new request extends an already-known conversation prefix.
This is not continuous batching, but it does give the server a concrete form of incremental prompt reuse.
11.5 Health And Model Introspection#
/health and /v1/models expose real runtime counters such as:
- active requests
- queued requests
- active context tokens
- memory used vs budget
- weights bytes
- runtime bytes
- reserved context bytes
- active model information
- managed-model download progress
That makes the HTTP surface useful for both user UX and performance/debug workflows.
12. Scheduler Status#
The repository contains:
src/scheduler/scheduler.zigsrc/scheduler/kv_cache.zigsrc/scheduler/request.zig
These modules describe the intended shape of:
- request-slot management
- prefilling vs decoding phases
- KV-page allocation
But today they should be understood as groundwork, not as the main runtime scheduling mechanism used by the production server path.
In other words:
- ZINC already has serving
- ZINC does not yet have full production continuous batching
- the scheduler code exists to support that direction
13. Deliberate Non-Claims And Near-Term Work#
The repository now has more real implementation detail than it did when this page was first written, but some areas are still explicitly incomplete:
- no
/v1/embeddingsendpoint yet - no production continuous batching in the serving hot path yet
- no multi-GPU execution path
- no fully integrated TurboQuant runtime yet, despite CLI parsing and graph/placeholders for it
- Metal prefill is functional but still not the same thing as a separately optimized large-batch prefill kernel family
Those are active engineering directions, not hidden features.
14. Related Docs#
- Running ZINC for CLI, server mode, and managed-model usage
- API Reference for HTTP request/response details
- Development Guide for build, test, graph export, and profiling workflow
- Apple Silicon Metal Enablement for the full Metal port narrative
- TurboQuant Spec for the forward-looking KV compression design
- RDNA4 Tuning Guide for AMD-specific performance work
- Intel GPU Reference for Arc B-series hardware, memory-bandwidth, and opcode notes