Last updated: 2026-05-28
ZINC Serving API Reference#
ZINC exposes a local OpenAI-compatible HTTP API. Point OpenAI-style clients at the server by changing the base URL.
# Start the server. Plain server mode defaults to port 8080.
./zig-out/bin/zinc --model-id qwen35-9b-q4k-m -p 8080
export OPENAI_BASE_URL=http://localhost:8080/v1
zinc chat is a convenience command for the built-in chat UI and defaults to
port 9090 unless -p is provided.
Status & known gaps (read first)#
Before you integrate, know these limits up front. Each is a deliberate state of the engine today, not a bug to file:
- Decode is serialized. Concurrent HTTP requests are accepted, but generation runs behind a single engine lock — request N+1 queues until request N finishes. Throughput scales with one stream, not N. Continuous batching is on the roadmap; until then, do not architect around parallel streams.
/v1/embeddingsis not implemented. ZINC is a chat/completion engine today; there is no embedding endpoint and no plan to fake one. If you need embeddings for RAG, run them in a separate process (llama.cpp'sllama-embedding, a hosted API, etc.) and only point chat at ZINC./v1/completionsis non-streaming and ignores sampling controls. Use/v1/chat/completionsfor anything that needs streaming,temperature,top_p, orenable_thinking.- Client-supplied
stopsequences are not honored. Chat generation stops on the model/template end markers (<|im_end|>, etc.) only. - No
/v1/audio/*, no image inputs, no fine-tuning endpoints, no batch API. These are out of scope; if you need them, ZINC is not the right engine.
Everything else listed in the endpoint table below works as documented.
Compatibility Notes#
The API intentionally follows the OpenAI response shapes where practical, but ZINC is still a local single-process engine:
/v1/chat/completionssupports streaming SSE, non-streaming responses,temperature,top_p,enable_thinking, optionalsession_idprompt reuse, and OpenAI-style tool calling for ChatML/Qwen-family templates./v1/completionsis currently a raw, non-streaming text-completion path. It accepts common compatibility fields, but sampling controls andstreamare not applied on this endpoint yet.- Request-provided
stopsequences are not implemented yet. Chat generation still stops on model/template end markers such as<|im_end|>. - Generation runs under one engine lock. Concurrent HTTP requests are accepted, but decode is serialized.
- The
modelrequest field is honored for managed models: if it names an installed catalog model different from the active model, ZINC attempts to activate it before generation.
Endpoints#
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions |
Chat inference, streaming and non-streaming |
| POST | /v1/completions |
Raw text completion, non-streaming |
| GET | /v1/models |
Managed catalog, install state, active model, memory status |
| POST | /v1/models/pull |
Download an installed-compatible managed model asynchronously |
| POST | /v1/models/activate |
Activate an installed managed model |
| POST | /v1/models/remove |
Remove a cached managed model, optionally unloading it first |
| GET | /health |
Health, queue counters, uptime, GPU memory/context status |
| GET | / or /chat |
Built-in chat UI |
POST /v1/chat/completions#
Generate a chat completion from a message list.
Request#
{
"model": "qwen35-9b-q4k-m",
"session_id": "chat-123",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0.9,
"enable_thinking": false,
"stream": true
}
| Field | Type | Default | Description |
|---|---|---|---|
model |
string | current model | Managed model id to use. If installed and different from the active model, the server activates it before generation. |
session_id |
string | empty | Optional prompt-reuse key. Reusing a session id lets ZINC reuse the matching prefix for append-only chat histories. |
messages |
array | required | OpenAI-style messages. content may be a string, null, or an array of text content blocks. Non-text blocks are ignored. |
max_tokens |
integer | 256 | Maximum generated tokens before context-budget clamping. |
temperature |
float | 0.0 | Sampling temperature. 0 uses greedy decoding. Values are clamped to 0..2. |
top_p |
float | 1.0 | Nucleus sampling threshold, clamped to 0..1. |
enable_thinking |
boolean | model default | For templates that support it, request an open thinking block. Catalog entries with unstable thinking can force this off. |
stream |
boolean | false | Enable Server-Sent Events streaming. |
tools |
array | empty | OpenAI function-tool definitions. Rendered for ChatML/Qwen-style templates when tool calling is enabled. |
tool_choice |
string or object | auto |
"none" suppresses tool injection. "auto" and forced function-object choices are currently treated as automatic tool choice. |
Supported message roles:
| Role | Description |
|---|---|
system |
System prompt. |
user |
User message. |
assistant |
Previous assistant response. May include historical tool_calls. |
tool |
Tool result message with tool_call_id; rendered back into ChatML tool-response blocks for Qwen-style templates. |
Tool Calling#
Tool calling is enabled by default and can be disabled with:
ZINC_TOOL_CALLING=0 ./zig-out/bin/zinc --model-id qwen35-9b-q4k-m
ZINC never executes tools. It only renders tool definitions into the prompt,
parses model-emitted <tool_call>...</tool_call> blocks, returns structured
tool_calls, and accepts subsequent role: "tool" result messages from the
client.
Example request:
{
"model": "qwen35-9b-q4k-m",
"messages": [{"role": "user", "content": "What time is it in UTC?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_time",
"description": "Return the current time for a timezone.",
"parameters": {
"type": "object",
"properties": {
"timezone": {"type": "string"}
},
"required": ["timezone"]
}
}
}
],
"tool_choice": "auto"
}
Non-streaming tool-call response shape:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1711500000,
"model": "Qwen 3.5 9B Q4_K_M",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_0",
"type": "function",
"function": {
"name": "get_time",
"arguments": "{\"timezone\":\"UTC\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 128,
"completion_tokens": 12,
"total_tokens": 140
}
}
Response#
Non-streaming responses use the OpenAI chat-completion shape:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1711500000,
"model": "Qwen 3.5 9B Q4_K_M",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 8,
"total_tokens": 32
}
}
When stream: true, the server responds with Content-Type: text/event-stream. The first chunk includes the assistant role, content chunks
follow, and the final chunk carries finish_reason.
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711500000,"model":"Qwen 3.5 9B Q4_K_M","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711500000,"model":"Qwen 3.5 9B Q4_K_M","choices":[{"index":0,"delta":{"content":"Paris"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711500000,"model":"Qwen 3.5 9B Q4_K_M","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Streaming tool calls are emitted atomically as delta.tool_calls chunks, with
finish_reason: "tool_calls" in the final chunk when a tool call was produced.
curl#
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen35-9b-q4k-m",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 64
}'
curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen35-9b-q4k-m",
"messages": [{"role": "user", "content": "Explain gravity briefly."}],
"max_tokens": 128,
"stream": true
}'
POST /v1/completions#
Generate from a raw text prompt without applying a chat template.
Request#
{
"model": "qwen35-9b-q4k-m",
"prompt": "The capital of France is",
"max_tokens": 64
}
| Field | Type | Default | Description |
|---|---|---|---|
model |
string | current model | Managed model id to use; may activate an installed model before generation. |
prompt |
string | required | Raw text prompt. |
max_tokens |
integer | 256 | Maximum generated tokens before context-budget clamping. |
temperature |
float | accepted | Parsed for compatibility but not applied by this endpoint yet. |
stream |
boolean | accepted | Parsed for compatibility but ignored; responses are non-streaming JSON. |
Response#
{
"id": "cmpl-abc123",
"object": "text_completion",
"created": 1711500000,
"model": "Qwen 3.5 9B Q4_K_M",
"choices": [
{
"index": 0,
"text": " Paris.",
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 6,
"completion_tokens": 2,
"total_tokens": 8
}
}
GET /v1/models#
List the managed catalog for the server's current GPU profile, including install state, active model, fit decisions, download progress, and active GPU memory usage. ZINC loads one model into memory at a time.
Response#
{
"object": "list",
"profile": "amd-rdna4-32gb",
"active_memory_used_bytes": 22300000000,
"active_memory_budget_bytes": 34359738368,
"active_memory_weights_bytes": 21000000000,
"active_memory_runtime_bytes": 1300000000,
"active_context_reserved_bytes": 536870912,
"active_context_active_bytes": 131072,
"active_context_tokens": 128,
"active_context_capacity_tokens": 4096,
"data": [
{
"id": "qwen36-35b-a3b-q4k-xl",
"object": "model",
"created": 1711500000,
"owned_by": "zinc",
"context_length": 4096,
"display_name": "Qwen3.6 35B-A3B UD Q4_K_XL",
"release_date": "2026-04-15",
"homepage_url": "https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF",
"family": "Qwen 3.6",
"quantization": "Q4_K_XL",
"size_bytes": 21000000000,
"installed": true,
"active": true,
"managed": true,
"supported_on_current_gpu": true,
"fits_current_gpu": true,
"required_vram_bytes": 23106019926,
"required_vram_with_offload_bytes": 23106019926,
"requires_offload_to_fit": false,
"fit_source": "catalog",
"status": "supported",
"supports_thinking_toggle": false,
"downloading": false,
"download_phase": "idle",
"downloaded_bytes": 0,
"download_total_bytes": 0,
"download_error": ""
}
]
}
POST /v1/models/pull#
Download a supported managed model into the local cache. Downloads run in a
background thread; poll GET /v1/models for progress.
Request#
{
"model": "qwen35-9b-q4k-m"
}
Responses#
Already installed:
{
"object": "model.pull",
"id": "qwen35-9b-q4k-m",
"state": "installed"
}
Started download:
{
"object": "model.pull",
"id": "qwen35-9b-q4k-m",
"state": "downloading",
"downloaded_bytes": 0,
"download_total_bytes": 0
}
If the same model is already downloading, the server returns 202 with the
current state, downloaded_bytes, and download_total_bytes.
POST /v1/models/activate#
Activate an installed managed model in a running server. The model must exist in the local managed cache and fit the current GPU budget.
Request#
{
"model": "qwen35-9b-q4k-m"
}
Response#
{
"object": "model.activation",
"id": "qwen35-9b-q4k-m",
"active": true
}
POST /v1/models/remove#
Remove an installed managed model from the local cache. If the target model is
currently loaded, the request fails by default. Set force: true to unload it
first and then remove the cached files.
Request#
{
"model": "qwen35-9b-q4k-m",
"force": false
}
Response#
{
"object": "model.remove",
"id": "qwen35-9b-q4k-m",
"removed": true,
"unloaded_from_gpu": false,
"cleared_active_selection": true,
"deleted_model": true,
"deleted_manifest": true,
"removed_dir": true
}
Without force, a loaded target returns 409 Conflict and the cached files
remain untouched. After a forced remove, the server may have no model loaded;
generation endpoints then return 503 until another model is activated or the
server is restarted with a model.
GET /health#
Server health check for monitoring and local UI state.
Response#
{
"status": "ok",
"model": "Qwen 3.5 9B Q4_K_M",
"active_requests": 1,
"queued_requests": 0,
"uptime_seconds": 3600,
"gpu_memory_used_bytes": 10100000000,
"gpu_memory_budget_bytes": 34359738368,
"gpu_memory_weights_bytes": 8900000000,
"gpu_memory_runtime_bytes": 1200000000,
"gpu_context_reserved_bytes": 536870912,
"gpu_context_active_bytes": 131072,
"gpu_context_tokens": 128,
"gpu_context_capacity_tokens": 4096
}
model is "none" when the server is running without a loaded model.
Error Responses#
Errors use an OpenAI-style envelope:
{
"error": {
"message": "Field 'messages' is required",
"type": "invalid_request_error",
"code": 400
}
}
| Status | Type | When |
|---|---|---|
| 400 | invalid_request_error |
Malformed JSON, missing required fields, invalid parameters, unknown model id for management operations, unsupported or non-fitting model activation. |
| 400 | context_length_exceeded |
Prompt exceeds the active model context capacity. |
| 404 | model_not_found |
Generation requested an unknown or not-installed managed model. |
| 404 | not_found |
Unknown endpoint. |
| 409 | invalid_request_error |
GPU already reserved, another model download is in progress, or removing the loaded model without force: true. |
| 501 | unsupported_operation |
Model-management endpoint used on a build/backend without management support. |
| 503 | service_unavailable |
Generation requested while no model is loaded. |
| 500 | internal_error |
Tokenization, GPU, inference, or response-size failure. |
CORS And Connection Handling#
- All endpoints include
Access-Control-Allow-Origin: *. OPTIONSrequests return200 {}for CORS preflight.- Streaming responses use HTTP/1.1 chunked transfer encoding (
Content-Type: text/event-stream,Cache-Control: no-cache,Connection: keep-alive,Transfer-Encoding: chunked). - Client disconnection during streaming stops generation promptly.
- Non-streaming JSON responses send
Connection: close. The HTML routes (/,/chat) also sendConnection: close.
Server Configuration#
./zig-out/bin/zinc [options]
-m, --model <path> Path to GGUF model file
--model-id <id> Managed model id from the built-in catalog/cache
-p, --port <port> Server port (default: 8080; `zinc chat` defaults to 9090)
-d, --device <id> GPU device index (default: 0)
-c, --context <size> Context length (default: 4096)
--parallel <n> Max queued/concurrent HTTP requests (default: 4)
--kv-quant <bits> TurboQuant KV cache compression: 0/2/3/4 (default: 0=off)
--prompt <text> Single prompt mode (CLI only; does not start the server)
-n, --max-tokens <n> CLI generation token cap
--chat Apply the model chat template in CLI prompt mode
--raw Do not auto-apply chat templates in CLI prompt mode
Supported Models#
ZINC loads GGUF models and supports the quantization families used by the managed catalog:
| Format | Supported | Notes |
|---|---|---|
| Q4_K | Yes | Primary K-quant path |
| Q5_K | Yes | K-quant path |
| Q6_K | Yes | K-quant path |
| Q8_0 | Yes | Common attention and projection path |
| F16 | Yes | Small tensors and some model data |
| F32 | Yes | Baseline/scalar tensor path |
Run ./zig-out/bin/zinc model list or GET /v1/models for the exact catalog
entries that are supported and fit on the current machine.