Model Boss 4.0
Unified GPU resource controller for all ML workloads.
Model Boss is the centralized coordinator for GPU inference across the Lilith platform. Every model type — LLM, diffusion, vision, embedding, audio — goes through a single priority queue with VRAM lease management, LRU eviction, and multi-backend support.
Architecture
Consumers (28 services)
        │
        │  POST /v1/chat/completions    (LLM)
        │  POST /v1/images/generations  (diffusion)
        │  x_client_id, x_priority, x_stay_warm, x_cooldown
        │
        ▼
┌─────────────────── Coordinator :8210 ───────────────────┐
│                                                         │
│  InferenceQueue (priority-sorted, warm-model promotion) │
│    urgent(1) > high(5) > normal(10) > low(20) > batch   │
│                                                         │
│  ModelPool (LRU eviction, VRAM management)              │
│  ┌─ ModelSlot ────────────────────────────────────┐     │
│  │ VRAM lease │ eviction state │ InferenceBackend │     │
│  └────────────────────────────────────────────────┘     │
│                                                         │
│  Backend Registry:                                      │
│    llama-server → LlamaServerBackend (subprocess)       │
│    diffusers    → DiffusersBackend (subprocess worker)  │
│                                                         │
└─────────────────────────────────────────────────────────┘
        │                           │
        ▼                           ▼
  GPU 0 (24GB)                GPU 1 (24GB)
 GPUBoss leases              GPUBoss leases
        │                           │
        └────── Redis 6379 ─────────┘
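The queue ordering in the diagram can be sketched with a binary heap. This is a minimal illustration, not the coordinator's actual InferenceQueue: the class name `SketchQueue`, the client names, and the numeric value for `batch` (the diagram gives none, so 50 is a placeholder) are all assumptions, and warm-model promotion is left out.

```python
import heapq
import itertools

# Priority values from the diagram; a lower number is served first.
# The diagram gives no number for "batch", so 50 is an assumed placeholder.
PRIORITY = {"urgent": 1, "high": 5, "normal": 10, "low": 20, "batch": 50}


class SketchQueue:
    """Hypothetical queue: lowest priority number first, FIFO within a level."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, client_id, priority="normal"):
        # x_priority may be a name or an integer.
        value = PRIORITY[priority] if isinstance(priority, str) else int(priority)
        heapq.heappush(self._heap, (value, next(self._counter), client_id))

    def pop(self):
        _, _, client_id = heapq.heappop(self._heap)
        return client_id


queue = SketchQueue()
queue.push("nightly-report", "batch")
queue.push("chat-ui", "urgent")
queue.push("indexer", "normal")
queue.push("search", 10)  # integer form, same level as "normal"
order = [queue.pop() for _ in range(4)]
print(order)
```

The counter matters: without it, two requests at the same priority would be compared by client name rather than by arrival order.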
Packages
| Package | Description |
|---|---|
| lilith-model-boss (packages/core-py) | Python SDK: ModelBoss, InferenceClient, GPUBoss, CLI |
| model-boss-coordinator (services/coordinator) | HTTP coordinator service with pool, queue, backends |
| lilith-model-boss-loaders (packages/loaders-py) | Direct model loaders (GGUF, diffusers, HF, ONNX, whisper, PuLID) |
Quick Start
SDK Consumer (recommended)
from model_boss import ModelBoss

async with ModelBoss(model_id="ministral-3b-instruct") as boss:
    response = await boss.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        x_client_id="my-service",
    )
Multi-Model Consumer
from model_boss.client import InferenceClient

async with InferenceClient() as client:
    # Route to different models through the same coordinator
    analysis = await client.chat(
        model="ministral-14b-reasoning",
        messages=[{"role": "user", "content": "Analyze this code..."}],
    )
    summary = await client.chat(
        model="ministral-3b-instruct",
        messages=[{"role": "user", "content": "Summarize..."}],
    )
HTTP Consumer (any language)
curl http://localhost:8210/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ministral-3b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "x_client_id": "my-service",
    "x_priority": "normal"
  }'
Image Generation
curl http://localhost:8210/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "animagine-xl-3.1",
    "prompt": "a cat astronaut, anime style",
    "width": 1024,
    "height": 1024,
    "x_client_id": "my-service"
  }'
Queue Extension Fields
All requests support these x_* fields (stripped before forwarding to backends):
| Field | Type | Default | Description |
|---|---|---|---|
| x_client_id | string | "anonymous" | Consumer identity for tracking and cooldowns |
| x_priority | string/int | "normal" | Queue priority: urgent, high, normal, low, batch |
| x_stay_warm | float | per-category | Seconds to keep model loaded after last request |
| x_cooldown | float | 0 | Minimum seconds between consecutive requests from this client |
SDK consumers pass these as kwargs to .chat():
response = await boss.chat(
    messages=[...],
    x_client_id="auto-commit-service",
    x_priority="batch",
    x_stay_warm=0,
    x_cooldown=60,
)
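To illustrate what "stripped before forwarding to backends" means, here is a minimal sketch of splitting a request body into queue options and the payload a backend would see. The helper name `split_extension_fields` is hypothetical and this is not the coordinator's actual code; the defaults are the ones listed in the table above.

```python
# Defaults from the table above. x_stay_warm defaults per category,
# so it is deliberately left unset here.
DEFAULTS = {"x_client_id": "anonymous", "x_priority": "normal", "x_cooldown": 0}


def split_extension_fields(payload):
    """Hypothetical helper: separate x_* queue options from the backend payload."""
    queue_opts = dict(DEFAULTS)
    backend_payload = {}
    for key, value in payload.items():
        if key.startswith("x_"):
            queue_opts[key] = value
        else:
            backend_payload[key] = value
    return backend_payload, queue_opts


request = {
    "model": "ministral-3b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "x_client_id": "auto-commit-service",
    "x_priority": "batch",
}
backend_payload, queue_opts = split_extension_fields(request)
print(backend_payload)
print(queue_opts)
```

The backend receives only `model` and `messages`, so the extension fields can be added to any request without the backend needing to know about them.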
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | LLM chat (OpenAI-compatible) |
| /v1/images/generations | POST | Image generation (OpenAI DALL-E compatible) |
| /v1/models | GET | List available models |
| /v1/models/{id} | GET | Model details |
| /v1/queue | GET | Current queue state |
| /v1/requestors | GET | Registered client profiles |
| /v1/pool/status | GET | Pool slot status |
| /api/v1/gpu/status | GET | GPU VRAM status |
| /api/v1/diffusion/generate | POST | Legacy diffusion endpoint (routes through queue) |
Model Manifest
Models are registered in manifest.json with auto-detection of backend type:
{
  "ministral-3b-instruct": {
    "name": "Ministral 3B Instruct",
    "path": "lmstudio-community/Ministral-3-3B-Instruct/model.gguf",
    "category": "llm",
    "vram_mb": 4000,
    "chatTemplate": "chatml",
    "context_size": 8192
  },
  "animagine-xl-3.1": {
    "name": "Animagine XL 3.1",
    "path": "models/diffusion/animagine-xl-3.1.safetensors",
    "category": "diffusion",
    "backend": "diffusers",
    "pipeline_type": "sdxl",
    "vram_mb": 10000,
    "dtype": "float16",
    "pin": false
  }
}
Manifest Fields
| Field | Type | Description |
|---|---|---|
| path | string | Relative path from cache root |
| category | string | llm, diffusion, vision, embedding, audio |
| backend | string | llama-server, diffusers (auto-inferred from category if omitted) |
| pipeline_type | string | For diffusers: sdxl, flux, sd35, sd15 |
| dtype | string | float16, bfloat16, float32, auto |
| vram_mb | int | VRAM requirement (auto-estimated from file size if omitted) |
| endpoints | list | Supported endpoints: chat, completion, generate-image, embed |
| pin | bool | If true, model is loaded at startup and never evicted |
| chatTemplate | string | chatml (default), alpaca, raw |
| context_size | int | Per-model context window override |
| thinking | bool | Enable chain-of-thought for reasoning models |
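The "auto-inferred from category if omitted" rule for `backend` can be sketched as a simple lookup. This is an illustration under assumptions, not the coordinator's resolver: the helper name `resolve_backend` is hypothetical, and the mapping covers only the categories that the Backends table in this README assigns to a backend.

```python
# Category -> backend mapping, read off the Backends table in this README.
# Categories with no listed backend (vision, audio) are left unmapped.
CATEGORY_BACKEND = {
    "llm": "llama-server",
    "embedding": "llama-server",
    "diffusion": "diffusers",
}


def resolve_backend(entry):
    """Hypothetical helper: an explicit "backend" wins, else infer from "category"."""
    if "backend" in entry:
        return entry["backend"]
    category = entry.get("category")
    try:
        return CATEGORY_BACKEND[category]
    except KeyError:
        raise ValueError(f"no default backend for category {category!r}") from None


manifest = {
    "ministral-3b-instruct": {"category": "llm", "vram_mb": 4000},
    "animagine-xl-3.1": {"category": "diffusion", "backend": "diffusers"},
}
backends = {model_id: resolve_backend(entry) for model_id, entry in manifest.items()}
print(backends)
```

Raising on an unmapped category, rather than guessing, keeps a typo in the manifest from silently selecting the wrong backend.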
Backends
The coordinator manages models via pluggable backends:
| Backend | Subprocess | Model Types | Manifest category |
|---|---|---|---|
| LlamaServerBackend | llama-server | GGUF LLMs, embeddings | llm, embedding |
| DiffusersBackend | Python worker | SDXL, FLUX, SD3.5 | diffusion |
Each backend runs as an isolated subprocess with CUDA_VISIBLE_DEVICES for GPU isolation and prctl(PR_SET_PDEATHSIG) for cleanup.
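The isolation pattern described above can be sketched with the standard library alone. This is a minimal Linux-oriented illustration, not the backends' actual launcher: the function names are hypothetical, and the "worker" is a stand-in that only prints the GPU it was assigned.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def _die_with_parent():
    # Runs in the child between fork and exec (Linux only): ask the kernel
    # to send SIGTERM to this process when its parent exits.
    ctypes.CDLL(None).prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM))


def spawn_worker(argv, gpu_index):
    """Hypothetical launcher: one subprocess that can see exactly one GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    preexec = _die_with_parent if sys.platform.startswith("linux") else None
    return subprocess.Popen(
        argv, env=env, preexec_fn=preexec, stdout=subprocess.PIPE, text=True
    )


# Stand-in worker that only reports which GPU it was given.
child = spawn_worker(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    gpu_index=1,
)
visible, _ = child.communicate()
print(visible.strip())
```

Inside the worker, the single visible GPU appears as device 0 regardless of its physical index, which is why the environment variable is enough to isolate it.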
VRAM Management
- LRU eviction: Idle models evicted when VRAM needed for higher-priority requests
- Priority-aware: Batch models evicted before normal; normal before high
- Model pinning: pin: true prevents eviction (for always-needed small models)
- Per-category stay_warm: Diffusion 15 min, LLM 5 min, vision 1 min
- Multi-GPU: Large models auto-split across GPUs via tensor parallelism
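The eviction rules above can be combined into a small planning function. This is a sketch under assumptions, not the ModelPool implementation: the names `Slot` and `pick_evictions` and the example model ids are hypothetical, and treating a model as evictable only after its stay-warm window has elapsed is one possible reading of the rules.

```python
from dataclasses import dataclass

# Stay-warm defaults from the list above, in seconds.
STAY_WARM = {"diffusion": 15 * 60, "llm": 5 * 60, "vision": 60}


@dataclass
class Slot:
    model_id: str
    category: str
    vram_mb: int
    priority: int     # priority number of the last request; larger = less important
    last_used: float  # time of the last request, in seconds
    pin: bool = False


def pick_evictions(slots, needed_mb, free_mb, now):
    """Hypothetical planner: model ids to evict so needed_mb fits, or None."""
    idle = [
        s for s in slots
        if not s.pin and now - s.last_used >= STAY_WARM.get(s.category, 0)
    ]
    # Least important first (batch before normal before high), then least recently used.
    idle.sort(key=lambda s: (-s.priority, s.last_used))
    evicted = []
    for slot in idle:
        if free_mb >= needed_mb:
            break
        evicted.append(slot.model_id)
        free_mb += slot.vram_mb
    return evicted if free_mb >= needed_mb else None


slots = [
    Slot("pinned-embedder", "llm", 2000, 10, last_used=0, pin=True),
    Slot("sdxl", "diffusion", 10000, 20, last_used=0),
    Slot("chat-3b", "llm", 4000, 10, last_used=100),
    Slot("chat-14b", "llm", 9000, 5, last_used=50),
]
plan = pick_evictions(slots, needed_mb=12000, free_mb=1000, now=1000)
print(plan)
```

The pinned slot is never considered, and the high-priority `chat-14b` survives because evicting the two less important models already frees enough VRAM.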
Service Discovery
Consumers resolve the coordinator URL via lilith-service-addresses:
from lilith_service_addresses import get_service_url
url = get_service_url("model-boss", "coordinator") # → http://localhost:8210
Override via environment variable: COORDINATOR_URL=http://custom:8210
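A consumer that wants the override to take effect in its own code could resolve the URL roughly as follows. This is a sketch: `resolve_coordinator_url` is a hypothetical helper, and whether `get_service_url` already honors `COORDINATOR_URL` on its own is not stated here, so the precedence shown is an assumption.

```python
import os

DEFAULT_COORDINATOR_URL = "http://localhost:8210"


def resolve_coordinator_url(lookup=None):
    """Hypothetical resolver: COORDINATOR_URL wins, then the lookup, then the default."""
    override = os.environ.get("COORDINATOR_URL")
    if override:
        return override.rstrip("/")
    if lookup is not None:
        # In a real consumer, pass lilith_service_addresses.get_service_url here.
        return lookup("model-boss", "coordinator")
    return DEFAULT_COORDINATOR_URL


os.environ.pop("COORDINATOR_URL", None)
default_url = resolve_coordinator_url()

os.environ["COORDINATOR_URL"] = "http://custom:8210/"
custom_url = resolve_coordinator_url()
print(default_url, custom_url)
```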
GPU Coordination (Low-Level)
For workloads that need direct GPU access (training, adversarial perturbation):
from model_boss import GPUBoss, Priority

async with GPUBoss() as boss:
    async with boss.acquire(vram_mb=8000, priority=Priority.NORMAL) as lease:
        device = f"cuda:{lease.gpu_index}"
        # Load model, run training, etc.
CLI
model-boss gpu status # GPU status and active leases
model-boss gpu drain # Request all models to unload
model-boss gpu cleanup # Clean up stale leases
model-boss model list # List manifest models
model-boss queue status # Queue and requestor state
See CLI Reference for complete documentation.
Documentation
- CLI Reference
- Architecture
- Consumers — All 28 platform consumers
- Oracle Routing — Complexity-aware model selection
- Python SDK
Installation
# SDK only
pip install lilith-model-boss
# With model loaders (dev/testing)
pip install "lilith-model-boss-loaders[diffusers]"
pip install "lilith-model-boss-loaders[all]"