Last updated: 2026-06-12
KV Cache
Paged KV cache manager for concurrent request serving.
Manages a pool of fixed-size pages that are allocated per-request and freed on completion or cancellation. Each page maps to a contiguous region of the GPU KV cache buffer, giving each request non-overlapping token storage.
struct
KvPage
pub const KvPage = struct
A single page in the KV cache pool. Each page maps to a contiguous region of the GPU KV buffer.
struct
KvPagePool
pub const KvPagePool = struct
Pool-based allocator for KV cache pages. Tracks which pages are free and which are owned by active requests.
Methods (7)
method
KvPagePool.init
pub fn init(allocator: std.mem.Allocator, total_pages: u32, page_size: u32) !KvPagePool
Initialize a page pool with the given number of pages and tokens per page.
method
KvPagePool.allocPages
pub fn allocPages(self: *KvPagePool, request_id: u64, count: u32) ![]u32
Allocate `count` pages for a request and stamp them with `request_id`.
method
KvPagePool.freePages
pub fn freePages(self: *KvPagePool, request_id: u64) void
Free all pages owned by a request, returning them to the free list. Performs a linear scan over all pages; O(total_pages).
method
KvPagePool.positionBase
pub fn positionBase(self: *const KvPagePool, page_ids: []const u32) u32
Return the token position base for a request's first allocated page. Computed as `page_ids[0] * page_size`, which guarantees non-overlapping token storage across requests since each page_id maps to a disjoint range.
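The arithmetic behind `positionBase` can be illustrated with a small stand-in. This is a sketch, not the library's code: the free function below takes `page_size` as a parameter instead of reading it from the pool, and the example assumes each request's pages are contiguous and ascending, which this reference does not state.

```zig
const std = @import("std");

// Stand-in for KvPagePool.positionBase (hypothetical free-function form):
// base = first page id * tokens per page.
// Like any `page_ids[0]` access, this panics on an empty slice in safe builds.
fn positionBase(page_ids: []const u32, page_size: u32) u32 {
    return page_ids[0] * page_size;
}

test "position bases of two requests do not overlap" {
    const page_size: u32 = 16;
    const req_a = [_]u32{ 0, 1 }; // token positions 0..31
    const req_b = [_]u32{ 2, 3 }; // token positions 32..63
    const base_a = positionBase(&req_a, page_size);
    const base_b = positionBase(&req_b, page_size);
    try std.testing.expectEqual(@as(u32, 0), base_a);
    try std.testing.expectEqual(@as(u32, 32), base_b);
    // Request A's last position (two pages past its base, minus one)
    // ends before request B's base.
    try std.testing.expect(base_a + 2 * page_size - 1 < base_b);
}
```

If a request's pages were scattered rather than contiguous, positions counted up from the base could run into pages held by another request, so the contiguity assumption matters for this example.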
method
KvPagePool.maxContext
pub fn maxContext(self: *const KvPagePool, page_count: u32) u32
Maximum context length (in tokens) that fits in `page_count` allocated pages.
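A minimal sketch of the sizing arithmetic, assuming `maxContext` is simply `page_count * page_size` (the reference implies this but does not give the formula). The inverse helper `pagesNeeded` is hypothetical and not part of the documented API; it shows how a caller might choose `count` for `allocPages`.

```zig
const std = @import("std");

// Assumed formula: every allocated page holds `page_size` tokens.
// u32 multiplication panics on overflow in safe builds.
fn maxContext(page_count: u32, page_size: u32) u32 {
    return page_count * page_size;
}

// Hypothetical helper: smallest page count whose capacity covers `tokens`
// (ceiling division). Requires page_size > 0.
fn pagesNeeded(tokens: u32, page_size: u32) u32 {
    return (tokens + page_size - 1) / page_size;
}

test "a 100-token request needs 7 pages of 16 tokens" {
    const pages = pagesNeeded(100, 16);
    try std.testing.expectEqual(@as(u32, 7), pages);
    try std.testing.expectEqual(@as(u32, 112), maxContext(pages, 16));
}
```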
method
KvPagePool.freeCount
pub fn freeCount(self: *const KvPagePool) u32
Number of free pages currently available for allocation.
method
KvPagePool.deinit
pub fn deinit(self: *KvPagePool) void
Release the page array and free list back to the allocator.
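The methods above compose into an allocate, use, free lifecycle. The sketch below is a self-contained stand-in named `MiniPool`, not the real `KvPagePool`. Its field names, the `error.OutOfPages` error, the stack-based free list, and the rule that the caller frees the returned id slice are all assumptions made for illustration. Only the method signatures mirror this reference.

```zig
const std = @import("std");

// Hypothetical stand-in for KvPagePool; internals are assumptions.
const MiniPool = struct {
    allocator: std.mem.Allocator,
    owners: []?u64, // owners[i] is the request holding page i, null if free
    free_list: []u32, // stack of free page ids; first `free_len` entries are valid
    free_len: u32,
    page_size: u32,

    fn init(allocator: std.mem.Allocator, total_pages: u32, page_size: u32) !MiniPool {
        const owners = try allocator.alloc(?u64, total_pages);
        errdefer allocator.free(owners);
        const free_list = try allocator.alloc(u32, total_pages);
        var i: u32 = 0;
        while (i < total_pages) : (i += 1) {
            owners[i] = null;
            // Store ids in reverse so page 0 is popped first.
            free_list[i] = total_pages - 1 - i;
        }
        return MiniPool{
            .allocator = allocator,
            .owners = owners,
            .free_list = free_list,
            .free_len = total_pages,
            .page_size = page_size,
        };
    }

    // Pops `count` ids off the free stack and stamps them with `request_id`.
    // In this sketch the caller owns the returned slice.
    fn allocPages(self: *MiniPool, request_id: u64, count: u32) ![]u32 {
        if (count > self.free_len) return error.OutOfPages;
        const ids = try self.allocator.alloc(u32, count);
        var i: u32 = 0;
        while (i < count) : (i += 1) {
            self.free_len -= 1;
            const id = self.free_list[self.free_len];
            self.owners[id] = request_id;
            ids[i] = id;
        }
        return ids;
    }

    // Linear scan over every page, O(total_pages), as the reference describes.
    fn freePages(self: *MiniPool, request_id: u64) void {
        var i: u32 = 0;
        while (i < self.owners.len) : (i += 1) {
            if (self.owners[i]) |owner| {
                if (owner == request_id) {
                    self.owners[i] = null;
                    self.free_list[self.free_len] = i;
                    self.free_len += 1;
                }
            }
        }
    }

    fn freeCount(self: *const MiniPool) u32 {
        return self.free_len;
    }

    fn deinit(self: *MiniPool) void {
        self.allocator.free(self.owners);
        self.allocator.free(self.free_list);
    }
};

test "allocate, exhaust, then free" {
    var pool = try MiniPool.init(std.testing.allocator, 4, 16);
    defer pool.deinit();

    const ids = try pool.allocPages(42, 3);
    defer std.testing.allocator.free(ids);
    try std.testing.expectEqual(@as(u32, 1), pool.freeCount());

    // Only one page is left, so a request for two pages fails.
    try std.testing.expectError(error.OutOfPages, pool.allocPages(43, 2));

    pool.freePages(42);
    try std.testing.expectEqual(@as(u32, 4), pool.freeCount());
}
```

Tracking ownership per page keeps `freePages` independent of whatever id slice the caller kept, at the cost of the full scan noted for `freePages`.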