Last updated: 2026-06-12

Inference Runtime

Fast Pool

Low-overhead worker pool for the T-CPU decode matvec fan-out.

Replaces std.Thread.Pool's spawnWg/waitAndWork pattern for the matvec hot path. Each dispatch posts up to N typed tasks to per-worker atomic slots and spins on a per-slot done-sequence — no heap allocation, no mutex, no condvar. Workers are persistent and spin (with brief yields) between dispatches; total CPU footprint is bounded by `n_workers` cores.
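The handshake described above can be sketched with a single slot and a single worker. This is a minimal illustration, not the module's code: the `Slot` layout, the `dispatch` helper, and the field names other than `fn_` and `ctx` are assumptions, it spins without the brief yields the real workers use, and it assumes Zig 0.12 or later (`std.atomic.Value`, lowercase ordering names).

```zig
const std = @import("std");

const TaskFn = *const fn (ctx: *anyopaque) void;

// Hypothetical one-worker slot; a pool would keep one of these per worker.
const Slot = struct {
    // Bumped by the dispatcher to publish work (or a shutdown request).
    seq: std.atomic.Value(u64) align(64) = std.atomic.Value(u64).init(0),
    // Set by the worker to the `seq` value it has finished running.
    done: std.atomic.Value(u64) = std.atomic.Value(u64).init(0),
    shutdown: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    fn_: TaskFn = undefined,
    ctx: *anyopaque = undefined,
};

fn workerMain(slot: *Slot) void {
    var seen: u64 = 0;
    while (true) {
        // Spin until the dispatcher publishes a new sequence number.
        var cur = slot.seq.load(.acquire);
        while (cur == seen) {
            std.atomic.spinLoopHint();
            cur = slot.seq.load(.acquire);
        }
        seen = cur;
        if (slot.shutdown.load(.acquire)) return;
        slot.fn_(slot.ctx);
        // Release store: the task's writes are visible once `done` matches.
        slot.done.store(seen, .release);
    }
}

// Post one task and spin until the worker reports it done.
// No allocation, mutex, or condvar is involved.
fn dispatch(slot: *Slot, f: TaskFn, ctx: *anyopaque) void {
    slot.fn_ = f;
    slot.ctx = ctx;
    const next = slot.seq.fetchAdd(1, .release) + 1;
    while (slot.done.load(.acquire) != next) std.atomic.spinLoopHint();
}

fn addOne(ctx: *anyopaque) void {
    const counter: *u32 = @ptrCast(@alignCast(ctx));
    counter.* += 1;
}

test "two dispatches, then shutdown" {
    var slot = Slot{};
    const worker = try std.Thread.spawn(.{}, workerMain, .{&slot});
    var counter: u32 = 0;
    dispatch(&slot, &addOne, &counter);
    dispatch(&slot, &addOne, &counter);
    try std.testing.expectEqual(@as(u32, 2), counter);
    // Raise the flag first, then bump `seq` so the spinning worker sees it.
    slot.shutdown.store(true, .release);
    _ = slot.seq.fetchAdd(1, .release);
    worker.join();
}
```

The release store to `seq` paired with the worker's acquire load is what makes the plain writes to `fn_` and `ctx` safe to read on the worker side; the `done` store plays the same role in the opposite direction.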

Why bother: the Qwen3.6-MoE+SSM decode dispatches ~200 matvec/fan-out barriers per token through std.Thread.Pool. Each spawnWg does a heap allocation on the global smp_allocator (closure container), takes the pool mutex twice, and signals a condvar; each waitAndWork takes the mutex and waits on a ResetEvent futex. The atomic-only path here is orders of magnitude cheaper per barrier (~tens of ns vs ~µs).

API: build a small `Task` array and call `dispatchAndRun(&tasks)`. The caller runs `tasks[0]` on its own thread; each remaining `tasks[i]` (i ≥ 1) is posted to worker `i - 1`. The call returns once every task has completed.
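A usage sketch of that flow might look as follows. Several details are assumptions rather than facts from this page: the import path, the exact field types of `Task` (taken here to be `fn_: *const fn (*anyopaque) void` and `ctx: *anyopaque`), and the `SumCtx`/`sumChunk` helpers, which are hypothetical. It needs the real `fast_pool.zig` next to it to build.

```zig
const std = @import("std");
// Assumption: import path relative to the caller; adjust to your build layout.
const fast_pool = @import("fast_pool.zig");

// Hypothetical per-task context: each task sums one chunk of a slice.
const SumCtx = struct {
    chunk: []const f32,
    result: f32 = 0,
};

// Assumed task signature: receives the opaque `ctx` pointer from the Task.
fn sumChunk(ctx: *anyopaque) void {
    const c: *SumCtx = @ptrCast(@alignCast(ctx));
    var acc: f32 = 0;
    for (c.chunk) |x| acc += x;
    c.result = acc;
}

pub fn main() !void {
    // The pool is initialized in place, so it can live on the stack.
    var pool: fast_pool.FastPool = undefined;
    try pool.init(std.heap.page_allocator, 3);
    defer pool.deinit();

    const data = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8 };
    // One context per task; they must outlive dispatchAndRun.
    var ctxs = [_]SumCtx{
        .{ .chunk = data[0..2] },
        .{ .chunk = data[2..4] },
        .{ .chunk = data[4..6] },
        .{ .chunk = data[6..8] },
    };

    // 3 workers + the calling thread = 4 tasks at most.
    var tasks: [4]fast_pool.Task = undefined;
    for (&tasks, &ctxs) |*t, *c| t.* = .{ .fn_ = &sumChunk, .ctx = c };

    pool.dispatchAndRun(&tasks);

    var total: f32 = 0;
    for (ctxs) |c| total += c.result;
    std.debug.print("total = {d}\n", .{total});
}
```

Keeping the contexts in a stack array satisfies the lifetime rule without any allocation: they stay alive until `dispatchAndRun` returns, after which the results can be read directly.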

3 exports 4 methods src/zinc_rt/fast_pool.zig

constant

max_workers

pub const max_workers: usize = 16

Upper bound on worker threads supported by a single `FastPool`.

The slot table is sized to this constant so dispatches stay branch-free and cache-friendly; this matches the decode matvec scheduler's current maximum so high-core hosts do not silently fall back to std.Thread.Pool.

src/zinc_rt/fast_pool.zig:27

struct

Task

pub const Task = struct

Single unit of work posted into the pool.

`fn_` is invoked with `ctx` on either the calling thread (task 0) or a worker thread (tasks 1..). The caller owns the storage `ctx` points at and must keep it alive until `dispatchAndRun` returns.

src/zinc_rt/fast_pool.zig:47

struct

FastPool

pub const FastPool = struct

Persistent worker pool that fans matvec barriers out across N threads.

Slots are cache-line aligned and communicated via release/acquire atomics; see the module doc for the rationale and benchmark numbers.

src/zinc_rt/fast_pool.zig:55

Methods

4

method

FastPool.init

pub fn init(self: *Self, allocator: std.mem.Allocator, n_workers: usize) !void

Spawn `n_workers` persistent worker threads bound to this pool.

Initializes the slot table, then launches each worker on `workerMain`.

Parameters
self
Pool storage; written in place so callers can keep it on the stack.
allocator
Used for the `threads` array only.
n_workers
Worker thread count; must be in `1..=max_workers`.
Returns

`error.InvalidWorkerCount` when `n_workers` is out of range, or a spawn error from `std.Thread.spawn`.

src/zinc_rt/fast_pool.zig:72

method

FastPool.deinit

pub fn deinit(self: *Self) void

Signal shutdown, wake every worker, join all threads, and free state.

Bumps each slot's `seq` after raising the shutdown flag, so workers spinning on their slot observe the change, leave the spin loop, and exit promptly.

src/zinc_rt/fast_pool.zig:106

method

FastPool.dispatchAndRun

pub fn dispatchAndRun(self: *Self, tasks: []const Task) void

Post `tasks[1..]` to worker slots and run `tasks[0]` on the calling thread.

Spins on each worker's done-sequence until all tasks have completed, then returns. Passing an empty slice is a no-op.

Parameters
tasks
Slice of tasks to execute. `tasks[0]` runs on the caller; `tasks[1..n]` are posted to workers 0..n-1 via atomic slot writes.
Notes

`tasks.len` must be `<= n_workers + 1`; the assertion is a hard assert, not a returned error.

src/zinc_rt/fast_pool.zig:125

method

FastPool.executorCount

pub fn executorCount(self: *const Self) usize

Return the total number of execution contexts available: worker threads plus the calling thread.

Use this value to size task arrays passed to `dispatchAndRun`.

Returns

`n_workers + 1`.

src/zinc_rt/fast_pool.zig:174
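One way to use the executor count when sizing a task array is to split the matvec output rows into that many contiguous ranges, one per task. The sketch below is illustrative only: `Range` and `rowRange` are hypothetical helpers, not part of `fast_pool.zig`, and the even split with the remainder going to the first ranges is just one reasonable policy.

```zig
const std = @import("std");

const Range = struct { start: usize, end: usize };

// Hypothetical helper: half-open row range for executor `i` out of
// `executors`. The first `total_rows % executors` ranges get one extra row.
fn rowRange(total_rows: usize, executors: usize, i: usize) Range {
    const base = total_rows / executors;
    const extra = total_rows % executors;
    const start = i * base + @min(i, extra);
    const len = base + @intFromBool(i < extra);
    return .{ .start = start, .end = start + len };
}

test "ranges tile the rows without gaps or overlap" {
    // 10 rows over 4 executors (e.g. 3 workers + the caller).
    var next: usize = 0;
    for (0..4) |i| {
        const r = rowRange(10, 4, i);
        try std.testing.expectEqual(next, r.start);
        next = r.end;
    }
    try std.testing.expectEqual(@as(usize, 10), next);
}
```

With `executors` set to the value returned by `executorCount()`, the number of ranges never exceeds the `n_workers + 1` limit that `dispatchAndRun` asserts on.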