model-boss/docs/ARCHITECTURE.md
autocommit 4a3cf3a994 docs(docs): 📝 Add architectural documentation for cloud-fallback guard components and integration
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-06-09 03:12:53 -07:00

11 KiB
Raw Permalink Blame History

Architecture

Request Lifecycle

Consumer                  Coordinator :8210                    GPU
   │                           │                                │
   │  POST /v1/chat/completions│                                │
   │  + x_client_id, x_priority│                                │
   │──────────────────────────▶│                                │
   │                           │                                │
   │                     proxy.py                               │
   │                       │ strip x_* fields                   │
   │                       │ resolve chat template              │
   │                       │ resolve client profile             │
   │                       ▼                                    │
   │                 InferenceQueue                             │
   │                       │ sort by priority                   │
   │                       │ check can_load (VRAM gate)         │
   │                       │ warm-model promotion               │
   │                       ▼                                    │
   │                   ModelPool                                │
   │                       │ get_or_load(model_id)              │
   │                       │ evict if needed (priority-aware)   │
   │                       │ _create_backend(entry)             │
   │                       ▼                                    │
   │                   ModelSlot                                │
   │                       │ acquire VRAM lease (GPUBoss)       │
   │                       │ backend.start(port, gpu_indices)   │
   │                       │ backend.handle_request(body)       │
   │                       ▼                                    │
   │              InferenceBackend                              │
   │              (subprocess)                                  │
   │                       │                                    │
   │                       │  CUDA_VISIBLE_DEVICES=N            │
   │                       │──────────────────────────────────▶ │
   │                       │          inference                 │
   │                       │◀──────────────────────────────────│
   │                       │                                    │
   │◀──────────────────────│                                    │
   │      response         │                                    │

Synchronous text vs async jobs — cold-load counts against the caller's timeout

Text inference (POST /v1/chat/completions) is synchronous: the consumer's HTTP call blocks until inference returns. There is no submit/poll job variant for chat — submit_*_job / poll_*_job (and identity_shoot_async) exist only for diffusion, TTS, and identity workloads.

The catch: when the requested model isn't resident, the cold-load happens inside that blocking call (ModelPool.get_or_load → backend start, ~47s for a 35B-class GGUF). That load time counts against both the consumer's client-side request timeout and the task's server-side budget_s. If either is shorter than cold_load + inference, the request is aborted mid-load — the consumer sees a network-abort / 5xx and gets nothing, not a slow success. "Past timeout" means failed, not delayed.

This is the trap behind "the model has to be kept warm or classification breaks." Warm-pinning is one fix, but it is the heaviest one — it parks the model's full VRAM on a shared box permanently. Prefer, in order:

  1. Size the timeout to cover a cold start. Give the call a client timeout (e.g. chatJson({ timeoutMs })) and a task budget_scold_load + p99 inference. A fire-and-forget / background consumer (e.g. an inbound-NOTIFY classifier dispatched with void) does not block a user, so a slow first request that completes is strictly better than a fast one that aborts.
  2. keep_alive_s (in tasks.yaml) — holds the model warm for N seconds after a request, so only the first item in a bulk pass pays the cold-load. Right for batch/periodic consumers (drift gate, full-roster rescore).
  3. pin_primary: true (in tasks.yaml) — when the primary's quality is load-bearing (strict-JSON atom extraction), keeps the resolver from silently swapping to a warm but lower-quality fallback. See inference/router.py _pick_best_candidate(pin_first=...).
  4. Manifest pin: true (see Model Pinning) — never evicted, permanently resident. Last resort, and only justified for small, always-needed, latency-critical models — not as a way to dodge cold-load for a background job that could instead tolerate it via a longer timeout.

Rule of thumb: only a synchronous, latency-sensitive caller genuinely needs a model pre-warmed. Background and queued consumers should tolerate the cold-load with a timeout that fits it.

Cloud-fallback guard

claude:* models that wrap output in markdown fences will break a strict-JSON consumer if a cold/unservable local primary silently degrades to one. The cloud-fallback guard prevents that: when off, _pick_best_candidate (inference/router.py, the single chokepoint shared by ModelRouter.resolve and TaskRegistry.resolve) strips claude:* from the fallback positions (candidates[1:]). The primary (candidates[0]) is always preserved — an explicitly-configured cloud primary, or a direct model="claude:sonnet" request, is a deliberate choice and is never blocked. The list therefore can never be emptied by the guard.

It is a runtime-modifiable flag (RuntimeConfig), seeded from MODEL_BOSS_ALLOW_CLOUD_FALLBACK (default true) and persisted in Redis so it survives restarts:

GET  /api/v1/config                              → {"allowCloudFallback": true}
PUT  /api/v1/config  {"allowCloudFallback": false}

Note: under the current preference scoring a claude:* fallback (position ≥1) rarely out-scores a position-0 local primary anyway, so today the guard is mostly defensive / future-proofing — no tasks.yaml ladder currently lists a cloud fallback.

Component Responsibilities

ModelSlot (slot.py)

The pool's unit of management. Owns:

  • VRAM leases (single-GPU or multi-GPU tensor split)
  • Lifecycle state: IDLE → LOADING → READY → STOPPING → IDLE
  • Eviction metadata: last_used, last_priority, stay_warm_s, unload_at, pinned
  • Port allocation

Does NOT own model-specific logic — delegates to InferenceBackend.

InferenceBackend (backend.py)

Protocol for model-serving strategies. Each implementation knows how to:

  • start(port, gpu_indices, settings) — spawn server process
  • stop() — terminate process
  • health_check() — verify liveness
  • handle_request(body, endpoint) — execute inference

The slot calls these; the queue calls slot.handle_request().

ModelPool (pool.py)

Manages a dict of ModelSlot instances with:

  • Backend factory: _create_backend() selects backend from _BACKEND_REGISTRY based on manifest backend or category
  • LRU eviction: Priority-aware, skips pinned slots, skips slots with active requests
  • VRAM checks: Uses both GPUBoss lease tracking and nvidia-smi actual free VRAM
  • Port allocation: Sequential from configured range
  • Concurrency: asyncio.Lock for slot creation/eviction, released during CUDA reclaim sleep

InferenceQueue (queue.py)

Priority queue with:

  • Sort key: (priority, warm_bump, submitted_at) — warm models promoted to avoid cold starts
  • can_load gate: Requests wait in queue until VRAM available (no blocking on model load)
  • Per-category stay_warm: Diffusion 900s, LLM 300s, vision 60s
  • Requestor registry: Tracks per-client request patterns, cooldowns
  • Background loop: Wakes on new submissions or every 5s to re-check GPU state

Proxy (proxy.py)

FastAPI router handling:

  • POST /v1/chat/completions — LLM chat, Claude CLI proxy, client profile routing
  • POST /v1/images/generations — OpenAI DALL-E compatible diffusion
  • Extension field stripping (x_*QueuedRequest metadata)
  • Chat template injection for non-ChatML models
  • Thinking mode injection from manifest

Backend Registry

_BACKEND_REGISTRY: dict[str, type] = {
    "llama-server": LlamaServerBackend,
    "diffusers": DiffusersBackend,
}

Adding a new backend:

  1. Create inference/backends/my_backend.py implementing InferenceBackend
  2. Add to _BACKEND_REGISTRY in pool.py
  3. Add manifest entries with backend: "my-backend" (or auto-infer from category)

LlamaServerBackend

  • Spawns llama-server subprocess with GGUF model
  • Sets CUDA_VISIBLE_DEVICES, --ctx-size, --n-gpu-layers, --flash-attn
  • Health polls GET /health until {"status": "ok"}
  • handle_request forwards to /v1/chat/completions or /completion (for Alpaca/raw templates)
  • Handles SSE streaming passthrough

DiffusersBackend

  • Spawns diffusers_worker.py subprocess (uvicorn + FastAPI)
  • Worker loads pipeline via model_boss_loaders.DiffusersLoader
  • Health polls GET /health
  • handle_request forwards to /generate
  • Supports SDXL, FLUX, SD3.5 pipeline types

VRAM Management

Estimation Chain

  1. Manifest vram_mb field (explicit)
  2. ModelResolution.vram_mb from path resolution
  3. File size heuristic (GGUF: ×1.1, safetensors: ×1.3)
  4. Name pattern heuristic (estimate_vram_from_name)

Eviction Order

  1. Filter: READY state, not pinned, zero active requests, lower priority than incoming
  2. Sort: Lowest priority first (highest number), then LRU within same tier
  3. Evict one at a time, sleep 2s for CUDA reclaim, re-check
  4. Lock released during sleep to avoid blocking concurrent loads

Model Pinning

Manifest pin: true → slot is never evicted. Used for small, always-needed models (e.g., SigLIP2 at 2GB). Pinned slots also skip unload_at timer updates.

Multi-GPU

Models exceeding single-GPU VRAM are automatically split:

  1. Proportional VRAM allocation based on each GPU's effective free space
  2. Safety margin: 512MB per GPU for CUDA/driver overhead
  3. Leases acquired in GPU index order (deadlock prevention)
  4. CUDA_VISIBLE_DEVICES set to all assigned GPUs
  5. Backend handles tensor splitting internally (llama-server: automatic, diffusers: device_map="balanced")

GPUBoss

Redis-backed VRAM lease coordinator. Used by:

  • Coordinator: ModelSlot acquires leases for inference backends
  • Training services: Direct GPUBoss.acquire() for training jobs
  • Vision services (current): Direct leases for in-process models

All lease holders share the same Redis instance, so the coordinator sees training leases when making eviction decisions.