Last updated: 2026-06-12
Scheduler
Request
Request lifecycle management for concurrent inference serving.
Each incoming API request maps to a Request that tracks its state through prefill, decode, and completion phases.
enum
RequestState
pub const RequestState = enum
Request processing state machine.
Valid transitions: pending → prefilling → decoding → completed, with cancelled reachable from any active state and failed reachable from prefilling or decoding.
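The transition rules above can be sketched as a small validity check. This is a minimal illustration, not the module's actual implementation: `isValidTransition` is a hypothetical helper, the enum variants are taken from the state names listed above, and "active" is assumed to mean any non-terminal state (pending, prefilling, or decoding).

```zig
const std = @import("std");

// Variant names follow the states named above; the real enum may differ.
const RequestState = enum { pending, prefilling, decoding, completed, cancelled, failed };

/// Hypothetical helper: returns true when `from -> to` is a documented transition.
/// Assumption: "active" covers pending, prefilling, and decoding.
fn isValidTransition(from: RequestState, to: RequestState) bool {
    return switch (to) {
        // pending is the initial state; nothing transitions back into it.
        .pending => false,
        .prefilling => from == .pending,
        .decoding => from == .prefilling,
        .completed => from == .decoding,
        .cancelled => from == .pending or from == .prefilling or from == .decoding,
        .failed => from == .prefilling or from == .decoding,
    };
}
```

Switching on the target state keeps each rule on one line and lets the compiler flag any variant that is added later without a rule.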
struct
GenerationParams
pub const GenerationParams = struct
Generation parameters from the API request.
struct
Request
pub const Request = struct
A single inference request with its lifecycle state.
Methods (5)
method
Request.init
pub fn init(allocator: std.mem.Allocator, id: u64, prompt_tokens: []const u32, params: GenerationParams) Request
Create a new request in the pending state with the given prompt and parameters.
method
Request.transition
pub fn transition(self: *Request, new_state: RequestState) !void
Advance the request through the state machine.
method
Request.appendToken
pub fn appendToken(self: *Request, token: u32) !void
Append a generated token and record the first-token timestamp if unset.
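To illustrate the "record the first-token timestamp if unset" behavior, here is a rough sketch using a hypothetical stand-in type, `TokenLog`. Its field names, the doubling growth policy, and the explicit `now_ns` argument are all assumptions made to keep the example deterministic; the documented method takes only the token and presumably reads a clock internally.

```zig
const std = @import("std");

// Hypothetical stand-in for Request; only the fields this sketch needs.
const TokenLog = struct {
    allocator: std.mem.Allocator,
    tokens: []u32 = &[_]u32{}, // backing buffer (capacity = tokens.len)
    len: usize = 0, // number of tokens written so far
    first_token_ns: ?i128 = null, // unset until the first append

    /// Appends a token, growing the buffer when full (assumed doubling policy).
    /// The timestamp is stored only on the first successful append.
    fn appendToken(self: *TokenLog, token: u32, now_ns: i128) !void {
        if (self.len == self.tokens.len) {
            const new_cap: usize = if (self.tokens.len == 0) 8 else self.tokens.len * 2;
            self.tokens = try self.allocator.realloc(self.tokens, new_cap);
        }
        self.tokens[self.len] = token;
        self.len += 1;
        if (self.first_token_ns == null) self.first_token_ns = now_ns;
    }

    /// Releases the token buffer, mirroring the documented deinit.
    fn deinit(self: *TokenLog) void {
        self.allocator.free(self.tokens);
    }
};
```

Storing the timestamp as an optional makes "unset" explicit, so later appends cannot overwrite the time-to-first-token measurement.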
method
Request.shouldStop
pub fn shouldStop(self: *const Request, eos_token_id: u32) bool
Check if generation should stop (max_tokens reached or EOS token emitted).
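The two stop conditions can be sketched as a free function. This is an illustration under stated assumptions rather than the method's real body: it assumes `max_tokens` counts generated tokens only (not the prompt) and that EOS is detected by inspecting the most recently generated token. The parameter names `generated` and `max_tokens` are hypothetical.

```zig
const std = @import("std");

/// Hypothetical free-function form of the documented check.
/// Assumptions: max_tokens bounds generated tokens only, and EOS is
/// detected on the last generated token.
fn shouldStop(generated: []const u32, max_tokens: usize, eos_token_id: u32) bool {
    // Budget exhausted (also covers max_tokens == 0).
    if (generated.len >= max_tokens) return true;
    // Nothing generated yet, so there is no token to compare against EOS.
    if (generated.len == 0) return false;
    return generated[generated.len - 1] == eos_token_id;
}
```

A decode loop would typically call this after each appended token and move the request to the completed state once it returns true.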
method
Request.deinit
pub fn deinit(self: *Request) void
Release the generated token buffer owned by this request.