Architecture
Request Lifecycle
Consumer            Coordinator :8210                           GPU
│                           │                                    │
│ POST /v1/chat/completions │                                    │
│ + x_client_id, x_priority │                                    │
│──────────────────────────▶│                                    │
│                           │                                    │
│                       proxy.py                                 │
│                           │ strip x_* fields                   │
│                           │ resolve chat template              │
│                           │ resolve client profile             │
│                           ▼                                    │
│                    InferenceQueue                              │
│                           │ sort by priority                   │
│                           │ check can_load (VRAM gate)         │
│                           │ warm-model promotion               │
│                           ▼                                    │
│                       ModelPool                                │
│                           │ get_or_load(model_id)              │
│                           │ evict if needed (priority-aware)   │
│                           │ _create_backend(entry)             │
│                           ▼                                    │
│                       ModelSlot                                │
│                           │ acquire VRAM lease (GPUBoss)       │
│                           │ backend.start(port, gpu_indices)   │
│                           │ backend.handle_request(body)       │
│                           ▼                                    │
│                   InferenceBackend                             │
│                     (subprocess)                               │
│                           │                                    │
│                           │ CUDA_VISIBLE_DEVICES=N             │
│                           │───────────────────────────────────▶│
│                           │             inference              │
│                           │◀───────────────────────────────────│
│                           │                                    │
│◀──────────────────────────│                                    │
│ response                  │                                    │
Synchronous text vs async jobs — cold-load counts against the caller's timeout
Text inference (POST /v1/chat/completions) is synchronous: the consumer's
HTTP call blocks until inference returns. There is no submit/poll job variant for
chat — submit_*_job / poll_*_job (and identity_shoot_async) exist only
for diffusion, TTS, and identity workloads.
The catch: when the requested model isn't resident, the cold-load happens inside
that blocking call (ModelPool.get_or_load → backend start, ~47s for a 35B-class
GGUF). That load time counts against both the consumer's client-side request
timeout and the task's server-side budget_s. If either is shorter than
cold_load + inference, the request is aborted mid-load — the consumer sees a
network-abort / 5xx and gets nothing, not a slow success. "Past timeout" means
failed, not delayed.
This is the trap behind "the model has to be kept warm or classification breaks." Warm-pinning is one fix, but it is the heaviest one — it parks the model's full VRAM on a shared box permanently. Prefer, in order:
- Size the timeout to cover a cold start. Give the call a client timeout (e.g. chatJson({ timeoutMs })) and a task budget_s ≥ cold_load + p99 inference. A fire-and-forget / background consumer (e.g. an inbound-NOTIFY classifier dispatched with void) does not block a user, so a slow first request that completes is strictly better than a fast one that aborts.
- keep_alive_s (in tasks.yaml): holds the model warm for N seconds after a request, so only the first item in a bulk pass pays the cold-load. Right for batch/periodic consumers (drift gate, full-roster rescore).
- pin_primary: true (in tasks.yaml): when the primary's quality is load-bearing (strict-JSON atom extraction), keeps the resolver from silently swapping to a warm but lower-quality fallback. See inference/router.py _pick_best_candidate(pin_first=...).
- Manifest pin: true (see Model Pinning): never evicted, permanently resident. Last resort, and only justified for small, always-needed, latency-critical models, not as a way to dodge cold-load for a background job that could instead tolerate it via a longer timeout.
Rule of thumb: only a synchronous, latency-sensitive caller genuinely needs a model pre-warmed. Background and queued consumers should tolerate the cold-load with a timeout that fits it.
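The timeout rule above can be written down as a small check. This is only a sketch: the helper names, the margin_s padding, and the 12 s p99 figure in the usage note are illustrative assumptions, while the 47 s cold-load figure comes from the text.

```python
def min_safe_timeout_s(cold_load_s: float, p99_inference_s: float,
                       margin_s: float = 5.0) -> float:
    """Smallest timeout that still covers a cold start plus a slow inference.

    margin_s is an assumed safety pad, not a value taken from the service.
    """
    return cold_load_s + p99_inference_s + margin_s


def survives_cold_start(client_timeout_s: float, budget_s: float,
                        cold_load_s: float, p99_inference_s: float) -> bool:
    """True only if BOTH limits cover cold_load + inference.

    Either the client-side timeout or the server-side budget_s can abort
    the call, so the shorter of the two decides.
    """
    needed_s = cold_load_s + p99_inference_s
    return min(client_timeout_s, budget_s) >= needed_s
```

With a 47 s cold-load and an assumed 12 s p99 inference, a 30 s client timeout fails the check even when budget_s is 90 s, because the shorter limit wins.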
Cloud-fallback guard
claude:* models that wrap output in markdown fences will break a strict-JSON
consumer if a cold/unservable local primary silently degrades to one. The
cloud-fallback guard prevents that: when off, _pick_best_candidate
(inference/router.py, the single chokepoint shared by ModelRouter.resolve
and TaskRegistry.resolve) strips claude:* from the fallback positions
(candidates[1:]). The primary (candidates[0]) is always preserved — an
explicitly-configured cloud primary, or a direct model="claude:sonnet"
request, is a deliberate choice and is never blocked. The list therefore can
never be emptied by the guard.
It is a runtime-modifiable flag (RuntimeConfig), seeded from
MODEL_BOSS_ALLOW_CLOUD_FALLBACK (default true) and persisted in Redis so it
survives restarts:
GET /api/v1/config → {"allowCloudFallback": true}
PUT /api/v1/config {"allowCloudFallback": false}
Note: under the current preference scoring a claude:* fallback (position ≥1)
rarely out-scores a position-0 local primary anyway, so today the guard is
mostly defensive / future-proofing — no tasks.yaml ladder currently lists
a cloud fallback.
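The guard's effect on a candidate list can be sketched as follows. This is not the code in inference/router.py; the function name and the local model names are hypothetical, and only the behavior described above (keep candidates[0], strip claude:* from candidates[1:] when the flag is off) is modelled.

```python
def apply_cloud_fallback_guard(candidates: list[str],
                               allow_cloud_fallback: bool) -> list[str]:
    """Return the candidate ladder after applying the cloud-fallback guard."""
    if allow_cloud_fallback or not candidates:
        return list(candidates)
    # The primary is a deliberate choice and is always preserved.
    primary, fallbacks = candidates[0], candidates[1:]
    # Only fallback positions lose their claude:* entries.
    return [primary] + [m for m in fallbacks if not m.startswith("claude:")]
```

Because position 0 is never filtered, a non-empty input always yields a non-empty result, which matches the claim that the guard cannot empty the list.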
Component Responsibilities
ModelSlot (slot.py)
The pool's unit of management. Owns:
- VRAM leases (single-GPU or multi-GPU tensor split)
- Lifecycle state: IDLE → LOADING → READY → STOPPING → IDLE
- Eviction metadata: last_used, last_priority, stay_warm_s, unload_at, pinned
- Port allocation
Does NOT own model-specific logic — delegates to InferenceBackend.
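The lifecycle cycle listed above can be modelled as a tiny state machine. This is a sketch under assumptions: the enum and helper names are invented here, and only the happy-path cycle from the text is encoded (a real slot may also need a failure path such as LOADING back to IDLE).

```python
from enum import Enum


class SlotState(Enum):
    IDLE = "IDLE"
    LOADING = "LOADING"
    READY = "READY"
    STOPPING = "STOPPING"


# Happy-path cycle only: IDLE -> LOADING -> READY -> STOPPING -> IDLE.
_NEXT_STATE = {
    SlotState.IDLE: SlotState.LOADING,
    SlotState.LOADING: SlotState.READY,
    SlotState.READY: SlotState.STOPPING,
    SlotState.STOPPING: SlotState.IDLE,
}


def can_transition(current: SlotState, target: SlotState) -> bool:
    """True if target is the next state in the cycle."""
    return _NEXT_STATE[current] is target


def transition(current: SlotState, target: SlotState) -> SlotState:
    """Return the new state, or raise if the move skips a step."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```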
InferenceBackend (backend.py)
Protocol for model-serving strategies. Each implementation knows how to:
- start(port, gpu_indices, settings): spawn server process
- stop(): terminate process
- health_check(): verify liveness
- handle_request(body, endpoint): execute inference
The slot calls these; the queue calls slot.handle_request().
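One way to express such a protocol in Python is typing.Protocol. The sketch below assumes synchronous methods and guesses the parameter and return types; the actual backend.py may be async and typed differently. EchoBackend is a hypothetical stand-in used only to show structural conformance.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class InferenceBackend(Protocol):
    """Method names follow the text; types and sync-ness are assumptions."""

    def start(self, port: int, gpu_indices: list[int],
              settings: dict[str, Any]) -> None: ...

    def stop(self) -> None: ...

    def health_check(self) -> bool: ...

    def handle_request(self, body: dict[str, Any],
                       endpoint: str) -> dict[str, Any]: ...


class EchoBackend:
    """Hypothetical in-process backend; it spawns nothing."""

    def __init__(self) -> None:
        self.running = False

    def start(self, port: int, gpu_indices: list[int],
              settings: dict[str, Any]) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def health_check(self) -> bool:
        return self.running

    def handle_request(self, body: dict[str, Any],
                       endpoint: str) -> dict[str, Any]:
        return {"endpoint": endpoint, "echo": body}
```

Note that runtime_checkable only verifies that the methods exist, not their signatures, so it is a sanity check rather than a contract test.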
ModelPool (pool.py)
Manages a dict of ModelSlot instances with:
- Backend factory: _create_backend() selects backend from _BACKEND_REGISTRY based on manifest backend or category
- LRU eviction: Priority-aware, skips pinned slots, skips slots with active requests
- VRAM checks: Uses both GPUBoss lease tracking and nvidia-smi actual free VRAM
- Port allocation: Sequential from configured range
- Concurrency: asyncio.Lock for slot creation/eviction, released during CUDA reclaim sleep
InferenceQueue (queue.py)
Priority queue with:
- Sort key: (priority, warm_bump, submitted_at); warm models are promoted to avoid cold starts
- can_load gate: Requests wait in queue until VRAM available (no blocking on model load)
- Per-category stay_warm: Diffusion 900s, LLM 300s, vision 60s
- Requestor registry: Tracks per-client request patterns, cooldowns
- Background loop: Wakes on new submissions or every 5s to re-check GPU state
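The sort key can be illustrated with a heap. This sketch assumes that a lower priority number means higher priority (consistent with the eviction rules later in this document) and that warm_bump is 0 for a resident model and 1 otherwise; the real encoding in queue.py may differ, and the class and helper names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedItem:
    # Comparison order mirrors the sort key: (priority, warm_bump, submitted_at).
    priority: int
    warm_bump: int
    submitted_at: float
    model_id: str = field(compare=False)


def make_item(model_id: str, priority: int, submitted_at: float,
              warm_models: set[str]) -> QueuedItem:
    """Assumed encoding: warm models get bump 0 and sort ahead of cold ones."""
    warm_bump = 0 if model_id in warm_models else 1
    return QueuedItem(priority, warm_bump, submitted_at, model_id)


def drain_order(items: list[QueuedItem]) -> list[str]:
    """Return model ids in the order the queue would serve them."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap).model_id for _ in range(len(heap))]
```

Within the same priority tier a warm model is served before an older cold request, but it never jumps ahead of a higher-priority tier.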
Proxy (proxy.py)
FastAPI router handling:
- POST /v1/chat/completions: LLM chat, Claude CLI proxy, client profile routing
- POST /v1/images/generations: OpenAI DALL-E compatible diffusion
- Extension field stripping (x_* → QueuedRequest metadata)
- Chat template injection for non-ChatML models
- Thinking mode injection from manifest
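Extension field stripping amounts to partitioning the request body by key prefix. A minimal sketch, assuming the x_* fields live at the top level of the JSON body; the function name is hypothetical and the real proxy.py may handle nested or header-borne fields as well.

```python
from typing import Any


def split_extension_fields(
    body: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate x_* extension fields from the payload forwarded to the backend."""
    payload = {k: v for k, v in body.items() if not k.startswith("x_")}
    metadata = {k: v for k, v in body.items() if k.startswith("x_")}
    return payload, metadata
```

The payload half is what a backend would receive; the metadata half is what would be attached to the queued request.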
Backend Registry
_BACKEND_REGISTRY: dict[str, type] = {
    "llama-server": LlamaServerBackend,
    "diffusers": DiffusersBackend,
}
Adding a new backend:
- Create inference/backends/my_backend.py implementing InferenceBackend
- Add to _BACKEND_REGISTRY in pool.py
- Add manifest entries with backend: "my-backend" (or auto-infer from category)
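The steps above can be sketched end to end with stub classes. Everything beyond the registry shape is an assumption: the stub backends, the _CATEGORY_DEFAULTS mapping, and the create_backend helper are hypothetical and only illustrate how an explicit backend key could take precedence over a category-based default.

```python
from typing import Any


class LlamaServerBackend: ...
class DiffusersBackend: ...
class MyBackend: ...  # the new backend being registered


_BACKEND_REGISTRY: dict[str, type] = {
    "llama-server": LlamaServerBackend,
    "diffusers": DiffusersBackend,
    "my-backend": MyBackend,
}

# Assumed category -> backend defaults; the real inference rules may differ.
_CATEGORY_DEFAULTS: dict[str, str] = {
    "llm": "llama-server",
    "diffusion": "diffusers",
}


def create_backend(entry: dict[str, Any]) -> object:
    """Pick a backend class from the manifest entry and instantiate it."""
    name = entry.get("backend") or _CATEGORY_DEFAULTS.get(entry.get("category", ""))
    if name not in _BACKEND_REGISTRY:
        raise ValueError(f"no backend registered for manifest entry: {entry!r}")
    return _BACKEND_REGISTRY[name]()
```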
LlamaServerBackend
- Spawns llama-server subprocess with GGUF model
- Sets CUDA_VISIBLE_DEVICES, --ctx-size, --n-gpu-layers, --flash-attn
- Health polls GET /health until {"status": "ok"}
- handle_request forwards to /v1/chat/completions or /completion (for Alpaca/raw templates)
- Handles SSE streaming passthrough
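The health-polling step can be sketched as a loop with a deadline. To keep the example self-contained, the HTTP call is injected as a callable; the function name, the timeout and interval defaults, and the "starting" status in the usage note are assumptions rather than values from the backend.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(fetch_health: Callable[[], Optional[dict]],
                       timeout_s: float = 60.0,
                       interval_s: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the health payload reports {"status": "ok"} or time runs out.

    fetch_health should return the parsed JSON body of GET /health,
    or None while the server is not yet answering.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        payload = fetch_health()
        if payload is not None and payload.get("status") == "ok":
            return True
        sleep(interval_s)
    return False
```

A caller would pass a small function that issues the GET and returns None on connection errors, so that a server answering with an interim status such as "starting" keeps the loop waiting.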
DiffusersBackend
- Spawns diffusers_worker.py subprocess (uvicorn + FastAPI)
- Worker loads pipeline via model_boss_loaders.DiffusersLoader
- Health polls GET /health
- handle_request forwards to /generate
- Supports SDXL, FLUX, SD3.5 pipeline types
VRAM Management
Estimation Chain
- Manifest vram_mb field (explicit)
- ModelResolution.vram_mb from path resolution
- File size heuristic (GGUF: ×1.1, safetensors: ×1.3)
- Name pattern heuristic (estimate_vram_from_name)
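The chain is a first-match fallback, which the following sketch makes explicit. The function signature is hypothetical, the name-pattern step is represented by a precomputed value rather than a call to estimate_vram_from_name, and matching the file type by suffix is an assumption.

```python
from typing import Optional

# Multipliers from the text: file size -> estimated VRAM.
_FILE_SIZE_FACTORS = {".gguf": 1.1, ".safetensors": 1.3}


def estimate_vram_mb(manifest_vram_mb: Optional[int],
                     resolved_vram_mb: Optional[int],
                     filename: str,
                     file_size_mb: Optional[float],
                     name_estimate_mb: Optional[int]) -> Optional[int]:
    """Walk the estimation chain and return the first available estimate."""
    if manifest_vram_mb:           # 1. explicit manifest value
        return manifest_vram_mb
    if resolved_vram_mb:           # 2. value from path resolution
        return resolved_vram_mb
    if file_size_mb:               # 3. file size heuristic
        for suffix, factor in _FILE_SIZE_FACTORS.items():
            if filename.endswith(suffix):
                return round(file_size_mb * factor)
    return name_estimate_mb        # 4. name pattern heuristic (may be None)
```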
Eviction Order
- Filter: READY state, not pinned, zero active requests, lower priority than incoming
- Sort: Lowest priority first (highest number), then LRU within same tier
- Evict one at a time, sleep 2s for CUDA reclaim, re-check
- Lock released during sleep to avoid blocking concurrent loads
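The filter and sort steps can be sketched as a pure function over slot metadata. The SlotInfo dataclass and function name are hypothetical; the 2 s reclaim sleep and lock handling are left out, since they depend on the pool's internals.

```python
from dataclasses import dataclass


@dataclass
class SlotInfo:
    model_id: str
    state: str            # "IDLE", "LOADING", "READY" or "STOPPING"
    pinned: bool
    active_requests: int
    last_priority: int    # higher number = lower priority
    last_used: float      # timestamp; smaller = less recently used


def eviction_candidates(slots: list[SlotInfo],
                        incoming_priority: int) -> list[SlotInfo]:
    """Return evictable slots, best victim first."""
    eligible = [
        s for s in slots
        if s.state == "READY"
        and not s.pinned
        and s.active_requests == 0
        and s.last_priority > incoming_priority   # strictly lower priority
    ]
    # Lowest priority (highest number) first, then least recently used.
    return sorted(eligible, key=lambda s: (-s.last_priority, s.last_used))
```

The pool would evict the first candidate, wait for CUDA to reclaim memory, and re-check before taking the next one.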
Model Pinning
Manifest pin: true → slot is never evicted. Used for small, always-needed models (e.g., SigLIP2 at 2GB). Pinned slots also skip unload_at timer updates.
Multi-GPU
Models exceeding single-GPU VRAM are automatically split:
- Proportional VRAM allocation based on each GPU's effective free space
- Safety margin: 512MB per GPU for CUDA/driver overhead
- Leases acquired in GPU index order (deadlock prevention)
- CUDA_VISIBLE_DEVICES set to all assigned GPUs
- Backend handles tensor splitting internally (llama-server: automatic, diffusers: device_map="balanced")
GPUBoss
Redis-backed VRAM lease coordinator. Used by:
- Coordinator: ModelSlot acquires leases for inference backends
- Training services: Direct GPUBoss.acquire() for training jobs
- Vision services (current): Direct leases for in-process models
All lease holders share the same Redis instance, so the coordinator sees training leases when making eviction decisions.