Last updated: 2026-06-12
Inference Runtime
Fast Pool
Low-overhead worker pool for the T-CPU decode matvec fan-out.
Replaces std.Thread.Pool's spawnWg/waitAndWork pattern for the matvec hot path. Each dispatch posts up to N typed tasks to per-worker atomic slots and spins on a per-slot done-sequence — no heap allocation, no mutex, no condvar. Workers are persistent and spin (with brief yields) between dispatches; total CPU footprint is bounded by `n_workers` cores.
Why bother: the Qwen3.6-MoE+SSM decode dispatches ~200 matvec/fan-out barriers per token through std.Thread.Pool. Each spawnWg does a heap allocation on the global smp_allocator (closure container), takes the pool mutex twice, and signals a condvar; each waitAndWork takes the mutex and waits on a ResetEvent futex. The atomic-only path here is orders of magnitude cheaper per barrier (~tens of ns vs ~µs).
API: build a small `Task` array, call `dispatchAndRun(&tasks)`. The caller runs `tasks[0]` on the main thread; `tasks[1..n]` are posted to workers 0..n-1. Returns when all tasks have completed.
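The handshake described above can be sketched with a single slot and a single worker. This is a minimal illustration, not the FastPool source: the `Slot` layout, the `dispatch` helper, the `Sum` context, and the function-pointer type of `fn_` are all assumptions made for the example, and it assumes a Zig release where `std.atomic.Value` and lowercase ordering names (`.acquire`, `.release`) are available. The worker here spins without the brief yields the real pool uses.

```zig
const std = @import("std");

// Hypothetical single-slot model of the handshake; field names are assumptions.
const Slot = struct {
    // Bumped by the dispatcher to publish a new task (or to wake for shutdown).
    seq: std.atomic.Value(u64) align(std.atomic.cache_line) = std.atomic.Value(u64).init(0),
    // Set by the worker to the sequence number it has finished.
    done: std.atomic.Value(u64) align(std.atomic.cache_line) = std.atomic.Value(u64).init(0),
    shutdown: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    fn_: *const fn (*anyopaque) void = undefined,
    ctx: *anyopaque = undefined,
};

fn workerMain(slot: *Slot) void {
    var seen: u64 = 0;
    while (true) {
        // Spin until the dispatcher publishes a new sequence number.
        var cur = slot.seq.load(.acquire);
        while (cur == seen) {
            std.atomic.spinLoopHint();
            cur = slot.seq.load(.acquire);
        }
        seen = cur;
        if (slot.shutdown.load(.acquire)) return;
        slot.fn_(slot.ctx);
        // The release store makes the task's writes visible to the dispatcher.
        slot.done.store(cur, .release);
    }
}

fn dispatch(slot: *Slot, f: *const fn (*anyopaque) void, ctx: *anyopaque) void {
    // Plain writes are published by the release store on `seq` below.
    slot.fn_ = f;
    slot.ctx = ctx;
    const next = slot.seq.load(.monotonic) + 1;
    slot.seq.store(next, .release);
    // Barrier: wait for the worker to report this sequence number as done.
    while (slot.done.load(.acquire) != next) std.atomic.spinLoopHint();
}

// Example context; the caller owns it and keeps it alive until dispatch returns.
const Sum = struct { xs: []const u32, out: u64 = 0 };

fn sumTask(p: *anyopaque) void {
    const s: *Sum = @ptrCast(@alignCast(p));
    var acc: u64 = 0;
    for (s.xs) |x| acc += x;
    s.out = acc;
}

pub fn main() !void {
    var slot = Slot{};
    const worker = try std.Thread.spawn(.{}, workerMain, .{&slot});

    const xs = [_]u32{ 1, 2, 3, 4 };
    var job = Sum{ .xs = &xs };
    dispatch(&slot, sumTask, &job);
    std.debug.assert(job.out == 10);

    // Shutdown: raise the flag first, then bump `seq` so the spinning worker wakes.
    slot.shutdown.store(true, .release);
    _ = slot.seq.fetchAdd(1, .release);
    worker.join();
}
```

No allocation, mutex, or condvar appears on the dispatch path; the only synchronization is the release/acquire pair on `seq` and on `done`.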
constant
max_workers
pub const max_workers: usize = 16 Upper bound on worker threads supported by a single `FastPool`.
The slot table is sized to this constant so dispatches stay branch-free and cache-friendly; this matches the decode matvec scheduler's current maximum so high-core hosts do not silently fall back to std.Thread.Pool.
struct
Task
pub const Task = struct Single unit of work posted into the pool.
`fn_` is invoked with `ctx` on either the calling thread (task 0) or a worker thread (tasks 1..). The caller owns the storage `ctx` points at and must keep it alive until `dispatchAndRun` returns.
struct
FastPool
pub const FastPool = struct Persistent worker pool that fans matvec barriers out across N threads.
Slots are cache-line aligned and communicated via release/acquire atomics; see the module doc for the rationale and benchmark numbers.
Methods
method
FastPool.init
pub fn init(self: *Self, allocator: std.mem.Allocator, n_workers: usize) !void Spawn `n_workers` persistent worker threads bound to this pool.
Initializes the slot table, then launches each worker on `workerMain`. Fails with the underlying error if setup does not complete, for example a spawn error from `std.Thread.spawn`.
method
FastPool.deinit
pub fn deinit(self: *Self) void Signal shutdown, wake every worker, join all threads, and free state.
Bumps each slot's `seq` after raising the shutdown flag so that workers spinning on their slot observe the change, leave the spin loop, and exit promptly.
method
FastPool.dispatchAndRun
pub fn dispatchAndRun(self: *Self, tasks: []const Task) void Post `tasks[1..]` to worker slots and run `tasks[0]` on the calling thread.
Spins on each worker's done-sequence until all tasks have completed, then returns. `tasks[1..n]` are posted to workers 0..n-1 via atomic slot writes. Violating a precondition trips an assert rather than producing a returned error. Passing an empty slice is a no-op.
method
FastPool.executorCount
pub fn executorCount(self: *const Self) usize Return the total number of execution contexts available: worker threads plus the calling thread.
Use this value to size task arrays passed to `dispatchAndRun`.
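One way to use the executor count is to split a matvec's output rows into one contiguous chunk per executor and build one task per chunk. The sketch below shows only the partitioning step; `Range` and `partition` are hypothetical helpers for illustration, not part of the module, and the value 4 stands in for whatever `executorCount()` returns (for example 3 workers plus the calling thread).

```zig
const std = @import("std");

// Hypothetical half-open row range [start, end) handled by one executor.
const Range = struct { start: usize, end: usize };

// Split `rows` into at most `executors` contiguous chunks whose sizes differ
// by at most one row. Returns the filled prefix of `out`.
fn partition(rows: usize, executors: usize, out: []Range) []Range {
    std.debug.assert(executors > 0 and out.len >= executors);
    // Never create more chunks than there are rows.
    const n = @min(rows, executors);
    const base = if (n == 0) 0 else rows / n;
    const extra = if (n == 0) 0 else rows % n;
    var start: usize = 0;
    for (out[0..n], 0..) |*r, i| {
        // The first `extra` chunks take one additional row.
        const len = base + @intFromBool(i < extra);
        r.* = .{ .start = start, .end = start + len };
        start += len;
    }
    return out[0..n];
}

pub fn main() void {
    // Sized for max_workers (16) plus the calling thread.
    var buf: [17]Range = undefined;
    const chunks = partition(10, 4, &buf);
    for (chunks, 0..) |c, i| {
        std.debug.print("executor {d}: rows {d}..{d}\n", .{ i, c.start, c.end });
    }
}
```

With 10 rows and 4 executors this yields the ranges 0..3, 3..6, 6..8, and 8..10; chunk 0 would become `tasks[0]` and run on the calling thread.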