Last updated: 2026-06-12
Scheduler
Continuous-batching scheduler groundwork for concurrent inference requests.
Today this module owns request slot accounting only. The HTTP serving hot path still serializes generation behind `ServerState.generation_mutex`; the batched prefill/decode dispatch loop is not wired in yet.
struct
Scheduler
pub const Scheduler = struct
Fixed-capacity pool of request slots used to track concurrent inference requests.
Each slot holds at most one active `Request`; slots are reused once released.
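The slot-pool model can be pictured as a slice of optionals in which `null` marks a free slot. This is a minimal sketch, not the module's code: `SlotPool`, `State`, and the `Request` stand-in are hypothetical, and it assumes Zig 0.12 or later (run with `zig test`).

```zig
const std = @import("std");

// Hypothetical request states, mirroring the prefill/decode split in the docs.
const State = enum { prefilling, decoding };

// Hypothetical stand-in for the module's `Request` type.
const Request = struct {
    state: State,
};

// Fixed-capacity pool: `null` = free slot, non-null = active request.
const SlotPool = struct {
    allocator: std.mem.Allocator,
    slots: []?Request,

    fn init(allocator: std.mem.Allocator, max_parallel: u32) !SlotPool {
        const slots = try allocator.alloc(?Request, max_parallel);
        @memset(slots, null); // every slot starts free
        return .{ .allocator = allocator, .slots = slots };
    }

    fn deinit(self: *SlotPool) void {
        self.allocator.free(self.slots);
        self.* = undefined;
    }
};

test "a fresh pool has max_parallel free slots" {
    var pool = try SlotPool.init(std.testing.allocator, 4);
    defer pool.deinit();
    try std.testing.expectEqual(@as(usize, 4), pool.slots.len);
    for (pool.slots) |slot| try std.testing.expect(slot == null);
}
```

Because the capacity is fixed at init, the pool never reallocates while requests come and go; a full pool has to reject or defer new work instead.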
Methods (8)
method
Scheduler.init
pub fn init(allocator: std.mem.Allocator, max_parallel: u32) !Scheduler
Initialize the scheduler with `max_parallel` concurrent request slots.
method
Scheduler.submit
pub fn submit(self: *Scheduler, prompt_tokens: []const u32, params: GenerationParams) !u32
Submit a new request and assign it to the first free slot.
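First-free-slot assignment amounts to a first-fit scan from slot 0. The sketch below assumes the returned `u32` is the slot index and that the request keeps its own copy of the prompt; `Pool`, the `Request` stand-in, and `error.NoFreeSlot` are hypothetical, and `GenerationParams` is omitted for brevity.

```zig
const std = @import("std");

// Hypothetical error name; the real `submit` may signal a full pool differently.
const SubmitError = error{NoFreeSlot} || std.mem.Allocator.Error;

// Hypothetical stand-in: the request owns a copy of the caller's prompt.
const Request = struct { prompt_tokens: []u32 };

const Pool = struct {
    allocator: std.mem.Allocator,
    slots: []?Request,

    // First-fit: take the lowest-numbered empty slot.
    fn submit(self: *Pool, prompt_tokens: []const u32) SubmitError!u32 {
        for (self.slots, 0..) |*slot, i| {
            if (slot.* == null) {
                const owned = try self.allocator.dupe(u32, prompt_tokens);
                slot.* = Request{ .prompt_tokens = owned };
                const id: u32 = @intCast(i);
                return id;
            }
        }
        return error.NoFreeSlot;
    }
};

test "submit takes the first free slot and fails when full" {
    // The arena frees everything at the end, so the test needs no release.
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    const slots = try allocator.alloc(?Request, 2);
    @memset(slots, null);
    var pool = Pool{ .allocator = allocator, .slots = slots };

    const prompt = [_]u32{ 1, 2, 3 };
    try std.testing.expectEqual(@as(u32, 0), try pool.submit(&prompt));
    try std.testing.expectEqual(@as(u32, 1), try pool.submit(&prompt));
    try std.testing.expectError(error.NoFreeSlot, pool.submit(&prompt));
}
```

First-fit keeps low slot IDs busy and makes slot reuse predictable, which is convenient when slot IDs double as indices into per-slot buffers.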
method
Scheduler.isFull
pub fn isFull(self: *const Scheduler) bool
Check whether all slots are occupied.
method
Scheduler.activeCount
pub fn activeCount(self: *const Scheduler) u32
Get the number of active (non-null) requests.
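Both occupancy queries can be derived from a scan over the slots. This minimal sketch assumes slots are stored as optionals, as in the hypothetical `Pool` below; the real methods may instead keep a running counter.

```zig
const std = @import("std");

// Hypothetical minimal request; only occupancy matters here.
const Request = struct { id: u32 };

const Pool = struct {
    slots: []?Request,

    // Count non-null slots: O(max_parallel) per call.
    fn activeCount(self: *const Pool) u32 {
        var n: u32 = 0;
        for (self.slots) |slot| {
            if (slot != null) n += 1;
        }
        return n;
    }

    // Full as soon as no empty slot exists; stops at the first free one.
    fn isFull(self: *const Pool) bool {
        for (self.slots) |slot| {
            if (slot == null) return false;
        }
        return true;
    }
};

test "activeCount and isFull track occupancy" {
    var storage = [_]?Request{ Request{ .id = 7 }, null, Request{ .id = 9 } };
    const pool = Pool{ .slots = &storage };
    try std.testing.expectEqual(@as(u32, 2), pool.activeCount());
    try std.testing.expect(!pool.isFull());

    pool.slots[1] = Request{ .id = 8 };
    try std.testing.expectEqual(@as(u32, 3), pool.activeCount());
    try std.testing.expect(pool.isFull());
}
```

With a small, fixed slot count the linear scan is cheap; a counter updated in `submit` and `release` would make both checks O(1) at the cost of keeping it in sync.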
method
Scheduler.pendingPrefill
pub fn pendingPrefill(self: *Scheduler) []u32
Return slot IDs of requests in the prefilling state.
method
Scheduler.activeDecoding
pub fn activeDecoding(self: *Scheduler) []u32
Return slot IDs of requests in the decoding state.
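Both state queries filter occupied slots by request state and return slot IDs. The docs do not say who owns the returned `[]u32`; this sketch assumes a per-pool scratch buffer (one plausible reason for the mutable `*Scheduler` receiver). `Pool`, `State`, `scratch`, and `slotsInState` are hypothetical.

```zig
const std = @import("std");

// Hypothetical request states.
const State = enum { prefilling, decoding };
const Request = struct { state: State };

const Pool = struct {
    slots: []?Request,
    // Hypothetical scratch buffer with one entry per slot. Returned slices
    // point into it and are only valid until the next query.
    scratch: []u32,

    // Collect IDs of occupied slots whose request is in `want`.
    fn slotsInState(self: *Pool, want: State) []u32 {
        var n: usize = 0;
        for (self.slots, 0..) |slot, i| {
            const req = slot orelse continue; // skip free slots
            if (req.state == want) {
                self.scratch[n] = @intCast(i);
                n += 1;
            }
        }
        return self.scratch[0..n];
    }

    fn pendingPrefill(self: *Pool) []u32 {
        return self.slotsInState(.prefilling);
    }

    fn activeDecoding(self: *Pool) []u32 {
        return self.slotsInState(.decoding);
    }
};

test "slot IDs are grouped by request state" {
    var slots = [_]?Request{
        Request{ .state = .decoding },
        null,
        Request{ .state = .prefilling },
        Request{ .state = .decoding },
    };
    var scratch: [4]u32 = undefined;
    var pool = Pool{ .slots = &slots, .scratch = &scratch };
    try std.testing.expectEqualSlices(u32, &[_]u32{2}, pool.pendingPrefill());
    try std.testing.expectEqualSlices(u32, &[_]u32{ 0, 3 }, pool.activeDecoding());
}
```

Reusing one buffer avoids an allocation per scheduling step, but a caller that needs both lists at once must copy the first before making the second query.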
method
Scheduler.release
pub fn release(self: *Scheduler, slot_id: u32) void
Release a completed or cancelled request's slot, freeing its resources.
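Releasing a slot frees the request's owned memory and marks the slot empty so a later submission can reuse it. This minimal sketch assumes each request owns a copy of its prompt tokens; `Pool` and the `Request` stand-in are hypothetical, and treating a repeated release as a no-op is an assumption (the real method may assert instead).

```zig
const std = @import("std");

// Hypothetical stand-in: a request that owns a copy of its prompt tokens.
const Request = struct { prompt_tokens: []u32 };

const Pool = struct {
    allocator: std.mem.Allocator,
    slots: []?Request,

    // Free owned memory, then empty the slot. An already-empty slot is left
    // as-is; an out-of-range `slot_id` trips Zig's bounds check in safe builds.
    fn release(self: *Pool, slot_id: u32) void {
        if (self.slots[slot_id]) |req| {
            self.allocator.free(req.prompt_tokens);
        }
        self.slots[slot_id] = null;
    }
};

test "release frees the request and empties the slot" {
    const allocator = std.testing.allocator; // reports leaks on failure
    const slots = try allocator.alloc(?Request, 2);
    defer allocator.free(slots);
    @memset(slots, null);
    var pool = Pool{ .allocator = allocator, .slots = slots };

    pool.slots[1] = Request{ .prompt_tokens = try allocator.dupe(u32, &[_]u32{ 5, 6 }) };
    pool.release(1);
    try std.testing.expect(pool.slots[1] == null);
    pool.release(1); // no-op in this sketch
}
```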
method
Scheduler.deinit
pub fn deinit(self: *Scheduler) void
Tear down all active requests and free the slot array.
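Teardown order matters: still-active requests live inside the slot array, so they must be freed before the array itself. This is a minimal sketch assuming each request owns its prompt tokens; `Pool` and the `Request` stand-in are hypothetical, and `std.testing.allocator` makes the test fail if anything leaks.

```zig
const std = @import("std");

// Hypothetical stand-in: a request that owns a copy of its prompt tokens.
const Request = struct { prompt_tokens: []u32 };

const Pool = struct {
    allocator: std.mem.Allocator,
    slots: []?Request,

    // Free every active request first; reading `slots` after freeing it
    // would be a use-after-free. Then free the slot array itself.
    fn deinit(self: *Pool) void {
        for (self.slots) |slot| {
            if (slot) |req| self.allocator.free(req.prompt_tokens);
        }
        self.allocator.free(self.slots);
        self.* = undefined;
    }
};

test "deinit frees active requests and the slot array" {
    const allocator = std.testing.allocator;
    const slots = try allocator.alloc(?Request, 3);
    @memset(slots, null);
    var pool = Pool{ .allocator = allocator, .slots = slots };
    defer pool.deinit();

    // Leave two requests active; deinit must clean them up.
    pool.slots[0] = Request{ .prompt_tokens = try allocator.dupe(u32, &[_]u32{ 1, 2 }) };
    pool.slots[2] = Request{ .prompt_tokens = try allocator.dupe(u32, &[_]u32{3}) };
}
```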