Last updated: 2026-05-18
Apple Metal Reference#
Last updated: 2026-03-31
This document is the backend-facing Metal reference for ZINC on Apple Silicon. It focuses on the supported public API surface we should optimize against in shipping code, the runtime queries that matter, and the practical kernel and submission rules that follow from Apple GPU behavior.
Related docs:
docs/APPLE_SILICON_REFERENCE.mdcovers chip generations, MLX/TensorOps context, public capability families, and opcode-surface boundaries.docs/APPLE_SILICON_METAL_ENABLEMENT.mdcovers the implementation history and file-by-file bring-up details of the current Metal backend.
Scope#
This document is about the public Metal contract that matters for ZINC:
- the runtime objects we build around
- the capability and tuning queries we should trust
- the memory and submission model on Apple Silicon
- the kernel-level features worth using for inference
- the profiling and validation loop that keeps optimization work safe
It is not a guide to AGX machine code, ANE internals, or MLX APIs.
Mental Model#
The simplest durable model is:
- Metal is the low-level GPU runtime and shader interface.
- MSL is the shader language we write kernels in.
- MPS / TensorOps / Metal 4 tensor APIs are higher-level public acceleration surfaces that may matter on newer families, especially Apple10 / M5.
- MLX is a higher-level array/runtime layer that can target Metal, but it is not the backend contract ZINC is built on.
For ZINC, shipping backend work should target:
MTLDeviceMTLCommandQueueMTLCommandBufferMTLComputeCommandEncoderMTLBufferMTLComputePipelineState- Metal capability tables and pipeline/device properties
Runtime Objects That Matter#
These are the runtime objects worth reasoning about directly in backend code:
| Metal object | Why ZINC cares |
|---|---|
MTLDevice |
device identity, family support, unified-memory behavior, threadgroup limits, working-set hints |
MTLCommandQueue |
submission path for decode and prefill work |
MTLCommandBuffer |
batching boundary; too many of these can materially hurt throughput |
MTLComputeCommandEncoder |
where dependent compute dispatches are recorded and ordered |
MTLBuffer |
all important model/runtime data paths are buffer-centric |
MTLComputePipelineState |
thread execution width, max threads, compiled kernel validity, specialization surface |
In this repo, these concepts are wrapped through:
src/metal/shim.hsrc/metal/shim.msrc/metal/device.zigsrc/metal/buffer.zigsrc/metal/pipeline.zigsrc/metal/command.zig
Runtime Queries We Should Trust#
These are the minimum runtime signals we should use for backend decisions:
MTLDevice.supportsFamily(.apple7/.apple8/.apple9/.apple10)MTLDevice.supportsFamily(.mac2)MTLDevice.hasUnifiedMemoryMTLDevice.maxThreadgroupMemoryLengthMTLDevice.recommendedMaxWorkingSetSizeMTLDevice.supportsRaytracingMTLComputePipelineState.threadExecutionWidthMTLComputePipelineState.maxTotalThreadsPerThreadgroupMTLComputePipelineState.staticThreadgroupMemoryLength
Practical interpretation:
- Apple9 means M3/M4-class behavior for public GPU-family gating.
- Apple10 means M5-class behavior and justifies TensorOps / cooperative-tensor investigation.
- Unified memory means we should not cargo-cult discrete-GPU staging patterns.
- Pipeline properties are more trustworthy than chip-name guesses for threadgroup sizing.
Family-Level Guidance#
Apple7 / Apple8#
- Treat these as earlier Apple Silicon Metal families with the same broad compute model.
- Prefer straightforward compute kernels, shared buffers, and conservative threadgroup sizing.
- Do not assume newer tensor-specific public acceleration paths exist.
Apple9#
- Treat this as the main M3/M4-class inference target.
- Favor tuned Metal compute kernels, 32-lane simdgroup reductions, and careful
threadgroupstaging. - Do not try to distinguish M3 from M4 by public Metal family alone.
Apple10#
- Treat this as the first family where TensorOps / Metal 4 tensor resources deserve dedicated attention.
- Large GEMMs, batched expert matmuls, and prefill-heavy paths are the first candidates for a second fast path.
- Decode remains heavily bandwidth-sensitive; TensorOps are not a blanket replacement for every kernel.
Memory Model on Apple Silicon#
The Apple Silicon Metal path should assume:
- unified CPU/GPU memory
- cheap CPU visibility of shared buffers
- buffer-centric compute rather than texture-centric tricks
That leads to different implementation defaults than Vulkan on Linux:
- prefer shared buffers for runtime state and readback-visible data
- avoid inventing separate staging buffers unless they are measurably needed
- use
newBufferWithBytesNoCopystyle wrapping where it materially reduces unnecessary copies - treat
recommendedMaxWorkingSetSizeas a budget hint for model and scratch residency
Practical consequences for ZINC:
- logits and router buffers can often be read directly without a dedicated readback staging path
- GGUF loading should lean into wrapped or shared memory rather than discrete upload choreography
- KV cache and expert scratch planning should be working-set-aware, not copied from a discrete-GPU design
Submission Model#
For inference throughput, command-submission shape matters a lot on Metal.
The rule of thumb is:
- batch dependent work into as few command buffers as correctness allows
- keep related dependent dispatches inside one compute encoder when possible
- use in-encoder barriers when buffer visibility is the real dependency
- avoid per-expert or per-small-op fragmentation if a fused or batched path is possible
The failure mode to watch for is not just slow kernels. It is:
- too many command buffers
- too many commits
- too many synchronization points between otherwise small compute steps
For ZINC, this matters especially for:
- MoE routing and expert execution
- decode-step pipelines with many small dependent kernels
- server mode, where request scheduling can amplify submission overhead
Kernel Design Guidance#
The public MSL features most worth using for inference are:
threadgroupmemory for staging vectors, tiles, and accumulatorsthreadgroup_barrier(...)for ordered shared-memory phases- simdgroup collectives such as
simd_sum,simd_max, shuffle, and broadcast operations [[thread_position_in_grid]][[thread_position_in_threadgroup]][[thread_index_in_simdgroup]][[simdgroup_index_in_threadgroup]]halfis the active precision path;bfloatis supported by MSL on Apple-family GPUs but ZINC's Metal shaders do not currently emit bf16 kernelssimdgroup_matrix/ TensorOps-style paths on the families where the public stack supports them
Useful kernel heuristics:
- start from workgroup sizes that are multiples of
threadExecutionWidth - on Apple9/Apple10, one-simdgroup-per-row is a strong first try for decode-style DMMV
- use
maxTotalThreadsPerThreadgroupas a hard limit, not a target - account for
staticThreadgroupMemoryLengthbefore adding more dynamic threadgroup storage - if threadgroup memory is tight, favor cache-resident or simdgroup-centric designs over large shared-memory tiles
Features To De-Prioritize#
These are public features, but they are not where inference work should go first:
- hardware ray tracing
- mesh shading
- render-pipeline-specific tuning
- texture-centric strategies unless a kernel is actually texture-backed
They can be useful as generation clues, but not as first-order LLM acceleration paths.
ZINC-Specific Guidance#
For this codebase, the Metal backend should keep the following priorities:
- keep backend selection at comptime
- keep Objective-C isolated to the shim layer
- exploit unified memory instead of mimicking Vulkan staging
- tune submission count as aggressively as kernel math
- validate fast paths against CPU-reference outputs
- only branch on Apple10-specific tensor paths when they actually win
The code areas most relevant to this reference are:
src/metal/shim.hsrc/metal/shim.msrc/metal/device.zigsrc/metal/buffer.zigsrc/metal/pipeline.zigsrc/metal/command.zigsrc/model/loader_metal.zigsrc/compute/forward_metal.zigsrc/shaders/metal/*.metalsrc/diagnostics_metal.zig
Profiling and Validation#
Metal optimization work is only useful if it stays measurable and correct.
Minimum loop:
- inspect capability logs and pipeline properties
- profile command-buffer count, commits, and kernel timings
- compare output against CPU-reference or known-good paths
- keep changes only if they improve the intended metric without breaking correctness
Useful repo touchpoints:
src/diagnostics_metal.zigbenchmarks/metal_inference.zigtools/benchmark_api.mjs- the Metal-oriented tests run under
zig build test
Note:
- negative pipeline tests may print Metal compiler errors to stderr during
zig build testeven when the suite passes
Boundaries#
Always:
- optimize against public Metal/MSL behavior first
- rely on measured pipeline/device properties rather than chip marketing names
- keep CPU-reference kernel tests close to performance-sensitive changes
Ask first:
- backend model changes that break parity with shared higher-level runtime paths
- new Apple10-only tensor paths that materially complicate maintenance or portability
Do not:
- treat AGX shader ISA as a stable product contract
- assume MLX implies a specific low-level execution path
- assume M3 and M4 are distinct public Metal families