Last updated: 2026-06-12
Inference Runtime
Flash Attn
T-CPU flash attention (single-query decode) implementation.
Computes scaled dot-product attention over a KV cache for one query token.
struct
Params
pub const Params = struct
Inputs and outputs for one single-query flash attention call.
function
run
pub fn run(params: Params) !void
Compute single-query scaled dot-product attention with optional per-head softmax sinks.
For each query head, `run` dots `q` against the cached keys, scales the scores by `1/sqrt(head_dim)`, applies a max-subtracted softmax (folding in the sink if it is finite), and writes the value-weighted sum to the matching slot of `output`. Grouped-query attention (GQA) is supported via `q_per_kv = n_heads / n_kv_heads`. Returns an error if the slots are smaller than `seq_len` or `head_dim` is zero; otherwise returns void.
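The per-head steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code. The page does not list the fields of `Params`, so `SketchParams`, its field names, the `[seq_len][n_kv_heads][head_dim]` cache layout, the `InvalidShape` error, and the mapping of `q_per_kv` consecutive query heads onto one KV head are all hypothetical. The sketch also assumes a running-max (online) softmax and that a finite sink adds only to the softmax denominator.

```zig
const std = @import("std");

/// Hypothetical stand-in for `Params`; the real fields are not shown on this page.
const SketchParams = struct {
    q: []const f32, // [n_heads * head_dim]
    k_cache: []const f32, // assumed layout: [seq_len][n_kv_heads][head_dim]
    v_cache: []const f32, // same assumed layout as k_cache
    sinks: ?[]const f32, // optional: one sink logit per query head
    output: []f32, // [n_heads * head_dim]
    n_heads: usize,
    n_kv_heads: usize,
    head_dim: usize,
    seq_len: usize,
};

/// Sketch of single-query attention; `InvalidShape` is a placeholder error name.
fn runSketch(p: SketchParams) error{InvalidShape}!void {
    if (p.head_dim == 0 or p.n_kv_heads == 0) return error.InvalidShape;
    if (p.n_heads % p.n_kv_heads != 0) return error.InvalidShape;
    const kv_stride = p.n_kv_heads * p.head_dim;
    if (p.k_cache.len < p.seq_len * kv_stride) return error.InvalidShape;
    if (p.v_cache.len < p.seq_len * kv_stride) return error.InvalidShape;
    if (p.q.len < p.n_heads * p.head_dim) return error.InvalidShape;
    if (p.output.len < p.n_heads * p.head_dim) return error.InvalidShape;
    if (p.sinks) |sk| {
        if (sk.len < p.n_heads) return error.InvalidShape;
    }

    const q_per_kv = p.n_heads / p.n_kv_heads;
    const scale = 1.0 / @sqrt(@as(f32, @floatFromInt(p.head_dim)));
    const neg_inf = -std.math.inf(f32);

    for (0..p.n_heads) |h| {
        const kv_h = h / q_per_kv; // GQA: consecutive query heads share a KV head (assumed)
        const q = p.q[h * p.head_dim ..][0..p.head_dim];
        const out = p.output[h * p.head_dim ..][0..p.head_dim];
        @memset(out, 0);

        const sink: f32 = if (p.sinks) |sk| sk[h] else neg_inf;
        const has_sink = std.math.isFinite(sink);
        // A finite sink seeds the running max and contributes exp(sink - max) = 1.
        var running_max: f32 = if (has_sink) sink else neg_inf;
        var denom: f32 = if (has_sink) 1.0 else 0.0;

        for (0..p.seq_len) |t| {
            const base = t * kv_stride + kv_h * p.head_dim;
            const k = p.k_cache[base..][0..p.head_dim];
            const v = p.v_cache[base..][0..p.head_dim];

            var score: f32 = 0;
            for (q, k) |qi, ki| {
                score += qi * ki;
            }
            score *= scale;

            // Online softmax: rescale earlier accumulations when the max grows.
            const new_max = @max(running_max, score);
            const corr = @exp(running_max - new_max);
            const w = @exp(score - new_max);
            denom = denom * corr + w;
            for (out, v) |*o, vi| {
                o.* = o.* * corr + w * vi;
            }
            running_max = new_max;
        }

        if (denom > 0) {
            for (out) |*o| {
                o.* /= denom;
            }
        }
    }
}
```

Keeping a running maximum lets the loop visit each cached position once without storing the full score vector, which is the usual motivation for the flash-attention formulation in single-token decode.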