Last updated: 2026-06-12

Inference Runtime

Flash Attn


T-CPU flash attention (single-query decode) implementation.

Computes scaled dot-product attention over a KV cache for one query token.


struct

Params

pub const Params = struct

Inputs and outputs for one single-query flash attention call.

Parameters

q
Query vector packed `[n_heads, head_dim]` for the current decode token.
kv_k
Key cache packed `[seq_len, n_kv_heads, head_dim]`.
kv_v
Value cache packed `[seq_len, n_kv_heads, head_dim]`.
output
Attention output packed `[n_heads, head_dim]`.
n_heads
Number of query heads.
n_kv_heads
Number of key/value heads (GQA: `n_heads / n_kv_heads` query heads share each KV head).
head_dim
Per-head feature dimension.
seq_len
Number of cached key/value positions to attend over.
attn_sinks
Per-head sink logits added to the softmax denominator; NaN disables a head's sink.
scratch_scores
Caller-owned scratch of length `>= seq_len` for raw scores.
scratch_probs
Caller-owned scratch of length `>= seq_len` for exponentiated weights.

src/zinc_rt/isa/cpu_zig/flash_attn.zig:18
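
The packed shapes above imply simple flat offsets into each buffer. The following is a minimal sketch, assuming row-major packing (with `head_dim` contiguous) and assuming that consecutive query heads are grouped onto the same KV head. The helper names `qOffset`, `kvOffset`, and `kvHeadFor` are hypothetical and are not part of the module; the code assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Flat offset of query/output head `h` in a `[n_heads, head_dim]` buffer.
/// Hypothetical helper; assumes row-major packing.
fn qOffset(h: usize, head_dim: usize) usize {
    return h * head_dim;
}

/// Flat offset of the row for cache position `t` and KV head `kv_h`
/// in a `[seq_len, n_kv_heads, head_dim]` buffer (row-major assumed).
fn kvOffset(t: usize, kv_h: usize, n_kv_heads: usize, head_dim: usize) usize {
    return (t * n_kv_heads + kv_h) * head_dim;
}

/// KV head that serves query head `h` under GQA.
/// Assumes consecutive query heads share one KV head.
fn kvHeadFor(h: usize, n_heads: usize, n_kv_heads: usize) usize {
    const q_per_kv = n_heads / n_kv_heads;
    return h / q_per_kv;
}

test "layout offsets for 8 query heads, 2 KV heads, head_dim 4" {
    // Query head 3 starts at element 3 * 4 = 12.
    try std.testing.expectEqual(@as(usize, 12), qOffset(3, 4));
    // q_per_kv = 4, so query head 5 reads KV head 1.
    try std.testing.expectEqual(@as(usize, 1), kvHeadFor(5, 8, 2));
    // Position 2, KV head 1 starts at (2 * 2 + 1) * 4 = 20.
    try std.testing.expectEqual(@as(usize, 20), kvOffset(2, 1, 2, 4));
}
```

Under these assumptions, the key vector for position `t` seen by query head `h` would be the `head_dim` elements of `kv_k` starting at `kvOffset(t, kvHeadFor(h, n_heads, n_kv_heads), n_kv_heads, head_dim)`.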

function

run

pub fn run(params: Params) !void

Compute single-query scaled dot-product attention with optional per-head softmax sinks.

For each query head, `run` dots `q` against the cached keys, scales the scores by `1/sqrt(head_dim)`, applies a max-subtracted softmax (folding in the sink if it is finite), then writes the value-weighted sum to the matching slot of `output`. GQA is supported via `q_per_kv = n_heads / n_kv_heads`.

Parameters

params
Query, KV cache, attention sinks, scratch buffers, and output slice; see `Params`.

Returns

`error.EmptyInput` when query or output is empty, `error.ShapeMismatch` when scratch slots are smaller than `seq_len` or `head_dim` is zero, otherwise void.

src/zinc_rt/isa/cpu_zig/flash_attn.zig:39
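
To make the per-head steps described for `run` concrete, here is a minimal reference sketch for a single query head. It is not the module's code: the function name `attendOneHead`, the `f32` element type, and the per-head `[seq_len, head_dim]` views of the keys and values are all assumptions made for illustration. It also assumes `head_dim > 0`, at least one cached position or a finite sink, and Zig 0.11 or later. The sink is modeled as one extra `exp(sink - max)` term in the softmax denominator that contributes no value vector.

```zig
const std = @import("std");

/// Hypothetical single-head reference; names and types are assumptions.
/// `keys` and `values` hold this head's rows, flattened as [seq_len, head_dim].
/// A non-finite `sink` (for example NaN) disables the sink for this head.
fn attendOneHead(
    q: []const f32,
    keys: []const f32,
    values: []const f32,
    sink: f32,
    scores: []f32, // scratch, length >= seq_len
    probs: []f32, // scratch, length >= seq_len
    out: []f32, // length == head_dim
) void {
    const head_dim = q.len;
    const seq_len = keys.len / head_dim;
    const scale = 1.0 / @sqrt(@as(f32, @floatFromInt(head_dim)));

    // 1. Scaled dot products, tracking the maximum (sink included if finite).
    var max_score: f32 = -std.math.inf(f32);
    if (std.math.isFinite(sink)) max_score = sink;
    for (0..seq_len) |t| {
        var dot: f32 = 0;
        for (0..head_dim) |d| dot += q[d] * keys[t * head_dim + d];
        scores[t] = dot * scale;
        max_score = @max(max_score, scores[t]);
    }

    // 2. Max-subtracted exponentials; the sink only enlarges the denominator.
    var denom: f32 = 0;
    if (std.math.isFinite(sink)) denom += @exp(sink - max_score);
    for (0..seq_len) |t| {
        probs[t] = @exp(scores[t] - max_score);
        denom += probs[t];
    }

    // 3. Value-weighted sum written to this head's output slot.
    @memset(out, 0);
    for (0..seq_len) |t| {
        const w = probs[t] / denom;
        for (0..head_dim) |d| out[d] += w * values[t * head_dim + d];
    }
}

test "equal scores average the values" {
    const q = [_]f32{ 1, 0 };
    const keys = [_]f32{ 1, 0, 1, 0 }; // two identical keys
    const values = [_]f32{ 2, 4, 6, 8 };
    var scores: [2]f32 = undefined;
    var probs: [2]f32 = undefined;
    var out: [2]f32 = undefined;

    // Sink disabled with NaN: both positions get weight 0.5.
    attendOneHead(&q, &keys, &values, std.math.nan(f32), &scores, &probs, &out);
    try std.testing.expectApproxEqAbs(@as(f32, 4), out[0], 1e-5);
    try std.testing.expectApproxEqAbs(@as(f32, 6), out[1], 1e-5);
}
```

Because the sink adds to the denominator without adding a value vector, a finite sink shrinks the output toward zero rather than toward any cached value. Subtracting the maximum before exponentiating does not change the result mathematically; it only keeps `@exp` from overflowing on large scores.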