Model Boss 4.0

Unified GPU resource controller for all ML workloads.

Model Boss is the centralized coordinator for GPU inference across the Lilith platform. Every model type — LLM, diffusion, vision, embedding, audio — goes through a single priority queue with VRAM lease management, LRU eviction, and multi-backend support.

Architecture

Consumers (28 services)
    │
    │  POST /v1/chat/completions     (LLM)
    │  POST /v1/images/generations   (diffusion)
    │  x_client_id, x_priority, x_stay_warm, x_cooldown
    │
    ▼
┌─────────────────── Coordinator :8210 ───────────────────┐
│                                                          │
│  InferenceQueue (priority-sorted, warm-model promotion)  │
│    urgent(1) > high(5) > normal(10) > low(20) > batch   │
│                                                          │
│  ModelPool (LRU eviction, VRAM management)               │
│    ┌─ ModelSlot ─────────────────────────────────────┐   │
│    │  VRAM lease │ eviction state │ InferenceBackend │   │
│    └─────────────────────────────────────────────────┘   │
│                                                          │
│  Backend Registry:                                       │
│    llama-server  → LlamaServerBackend (subprocess)       │
│    diffusers     → DiffusersBackend (subprocess worker)  │
│                                                          │
└──────────────────────────────────────────────────────────┘
    │                           │
    ▼                           ▼
 GPU 0 (24GB)              GPU 1 (24GB)
 GPUBoss leases            GPUBoss leases
    │                           │
    └────── Redis 6379 ─────────┘
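The queue ordering in the diagram can be illustrated with a small heap-based sketch. This is not the coordinator's InferenceQueue: `ToyInferenceQueue`, `QueuedRequest`, and the promotion rule (among requests at the best priority level, prefer one whose model is already warm) are assumptions made for illustration. The numeric priorities come from the diagram; `batch` has no number listed there, so it is left out.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Priority values from the diagram; lower number means served first.
# "batch" has no number in the diagram, so it is omitted here.
PRIORITY = {"urgent": 1, "high": 5, "normal": 10, "low": 20}


@dataclass(order=True)
class QueuedRequest:
    sort_key: tuple                       # (priority value, arrival counter)
    client_id: str = field(compare=False)
    model: str = field(compare=False)


class ToyInferenceQueue:
    """Hypothetical stand-in for InferenceQueue, for illustration only."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-breaker within a priority

    def push(self, client_id, model, priority="normal"):
        key = (PRIORITY[priority], next(self._counter))
        heapq.heappush(self._heap, QueuedRequest(key, client_id, model))

    def pop(self, warm_models=frozenset()):
        """Return the next request, or None if the queue is empty."""
        if not self._heap:
            return None
        top_priority = self._heap[0].sort_key[0]
        # Assumed promotion rule: at the best priority level, the oldest
        # request whose model is already loaded goes first.
        warm = [
            r for r in self._heap
            if r.sort_key[0] == top_priority and r.model in warm_models
        ]
        chosen = min(warm) if warm else self._heap[0]
        self._heap.remove(chosen)
        heapq.heapify(self._heap)
        return chosen
```

Promotion never crosses priority levels in this sketch, so a warm `low` request cannot jump ahead of a cold `urgent` one.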

Packages

| Package | Description |
|---|---|
| lilith-model-boss (packages/core-py) | Python SDK: ModelBoss, InferenceClient, GPUBoss, CLI |
| model-boss-coordinator (services/coordinator) | HTTP coordinator service with pool, queue, backends |
| lilith-model-boss-loaders (packages/loaders-py) | Direct model loaders (GGUF, diffusers, HF, ONNX, whisper, PuLID) |

Quick Start

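import asyncio
from model_boss import ModelBoss

async def main():
    async with ModelBoss(model_id="ministral-3b-instruct") as boss:
        response = await boss.chat(
            messages=[{"role": "user", "content": "Hello!"}],
            x_client_id="my-service",  # consumer identity used by the queue
        )
        print(response)

asyncio.run(main())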

Multi-Model Consumer

import asyncio
from model_boss.client import InferenceClient

async def main():
    async with InferenceClient() as client:
        # Route to different models through the same coordinator
        analysis = await client.chat(
            model="ministral-14b-reasoning",
            messages=[{"role": "user", "content": "Analyze this code..."}],
        )
        summary = await client.chat(
            model="ministral-3b-instruct",
            messages=[{"role": "user", "content": "Summarize..."}],
        )
        print(analysis, summary)

asyncio.run(main())

HTTP Consumer (any language)

curl http://localhost:8210/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ministral-3b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "x_client_id": "my-service",
    "x_priority": "normal"
  }'
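
The same request can be built from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; `build_chat_request` and `send` are hypothetical helper names, and the shape of the response body is not assumed beyond being JSON.

```python
import json
import urllib.request


def build_chat_request(base_url, model, content,
                       client_id="my-service", priority="normal"):
    """Build a POST request equivalent to the curl example above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "x_client_id": client_id,
        "x_priority": priority,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON body (needs a running coordinator)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage, assuming a coordinator is listening locally: `send(build_chat_request("http://localhost:8210", "ministral-3b-instruct", "Hello!"))`.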

Image Generation

curl http://localhost:8210/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "animagine-xl-3.1",
    "prompt": "a cat astronaut, anime style",
    "width": 1024,
    "height": 1024,
    "x_client_id": "my-service"
  }'

Queue Extension Fields

All requests support these x_* fields (stripped before forwarding to backends):

| Field | Type | Default | Description |
|---|---|---|---|
| x_client_id | string | "anonymous" | Consumer identity for tracking and cooldowns |
| x_priority | string/int | "normal" | Queue priority: urgent, high, normal, low, batch |
| x_stay_warm | float | per-category | Seconds to keep model loaded after last request |
| x_cooldown | float | 0 | Minimum seconds between consecutive requests from this client |
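
Stripping the x_* fields before forwarding could look roughly like the sketch below. It is not the coordinator's code: `split_extension_fields` is a hypothetical name, and the per-category stay-warm defaults are the ones listed under VRAM Management (diffusion 15 min, LLM 5 min, vision 1 min). Categories without a documented default get `None` here.

```python
# Per-category stay-warm defaults in seconds, from the VRAM Management section.
STAY_WARM_DEFAULTS = {"diffusion": 900.0, "llm": 300.0, "vision": 60.0}


def split_extension_fields(payload, category):
    """Separate queue options (x_* keys) from the body sent to the backend."""
    queue_opts = {
        "x_client_id": "anonymous",
        "x_priority": "normal",
        # None when the category has no documented default.
        "x_stay_warm": STAY_WARM_DEFAULTS.get(category),
        "x_cooldown": 0.0,
    }
    forwarded = {}
    for key, value in payload.items():
        if key.startswith("x_"):
            queue_opts[key] = value   # explicit value overrides the default
        else:
            forwarded[key] = value    # backend never sees x_* keys
    return queue_opts, forwarded
```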

SDK consumers pass these as kwargs to .chat():

response = await boss.chat(
    messages=[...],
    x_client_id="auto-commit-service",
    x_priority="batch",
    x_stay_warm=0,
    x_cooldown=60,
)
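
To make the x_cooldown semantics concrete, here is a minimal per-client tracker. `CooldownTracker` is a hypothetical class, and whether the real coordinator delays or rejects a request that arrives too early is not stated here; the sketch only computes how long a client would still have to wait.

```python
import time


class CooldownTracker:
    """Tracks the last request time per client (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable clock for testing
        self._last_seen = {}

    def seconds_until_allowed(self, client_id, cooldown):
        """Return 0.0 if the client may send now, else the remaining wait."""
        last = self._last_seen.get(client_id)
        if last is None or cooldown <= 0:
            return 0.0
        return max(0.0, last + cooldown - self._clock())

    def record(self, client_id):
        """Mark that the client just sent a request."""
        self._last_seen[client_id] = self._clock()
```

With `x_cooldown=60` as in the call above, a second request from `auto-commit-service` 30 seconds after the first would have 30 seconds left to wait under this model.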

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | LLM chat (OpenAI-compatible) |
| /v1/images/generations | POST | Image generation (OpenAI DALL-E compatible) |
| /v1/models | GET | List available models |
| /v1/models/{id} | GET | Model details |
| /v1/queue | GET | Current queue state |
| /v1/requestors | GET | Registered client profiles |
| /v1/pool/status | GET | Pool slot status |
| /api/v1/gpu/status | GET | GPU VRAM status |
| /api/v1/diffusion/generate | POST | Legacy diffusion endpoint (routes through queue) |

Model Manifest

Models are registered in manifest.json. When backend is omitted, it is inferred from category:

{
  "ministral-3b-instruct": {
    "name": "Ministral 3B Instruct",
    "path": "lmstudio-community/Ministral-3-3B-Instruct/model.gguf",
    "category": "llm",
    "vram_mb": 4000,
    "chatTemplate": "chatml",
    "context_size": 8192
  },
  "animagine-xl-3.1": {
    "name": "Animagine XL 3.1",
    "path": "models/diffusion/animagine-xl-3.1.safetensors",
    "category": "diffusion",
    "backend": "diffusers",
    "pipeline_type": "sdxl",
    "vram_mb": 10000,
    "dtype": "float16",
    "pin": false
  }
}

Manifest Fields

| Field | Type | Description |
|---|---|---|
| path | string | Relative path from cache root |
| category | string | llm, diffusion, vision, embedding, audio |
| backend | string | llama-server, diffusers (auto-inferred from category if omitted) |
| pipeline_type | string | For diffusers: sdxl, flux, sd35, sd15 |
| dtype | string | float16, bfloat16, float32, auto |
| vram_mb | int | VRAM requirement (auto-estimated from file size if omitted) |
| endpoints | list | Supported endpoints: chat, completion, generate-image, embed |
| pin | bool | If true, model is loaded at startup and never evicted |
| chatTemplate | string | chatml (default), alpaca, raw |
| context_size | int | Per-model context window override |
| thinking | bool | Enable chain-of-thought for reasoning models |
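
Backend inference from category can be sketched as follows. The mapping is taken from the Backends table (llm and embedding map to llama-server, diffusion to diffusers); `resolve_backend` is a hypothetical helper, and since no backend is documented for vision or audio, the sketch raises for those rather than guessing.

```python
# Category-to-backend mapping as listed in the Backends table.
CATEGORY_TO_BACKEND = {
    "llm": "llama-server",
    "embedding": "llama-server",
    "diffusion": "diffusers",
}


def resolve_backend(entry):
    """Return the backend name for one manifest entry."""
    if "backend" in entry:
        return entry["backend"]       # an explicit value always wins
    category = entry.get("category")
    try:
        return CATEGORY_TO_BACKEND[category]
    except KeyError:
        raise ValueError(
            f"no documented default backend for category {category!r}"
        ) from None


manifest = {
    "ministral-3b-instruct": {"category": "llm", "vram_mb": 4000},
    "animagine-xl-3.1": {"category": "diffusion", "backend": "diffusers"},
}
backends = {model_id: resolve_backend(e) for model_id, e in manifest.items()}
```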

Backends

The coordinator manages models via pluggable backends:

| Backend | Subprocess | Model Types | Manifest category |
|---|---|---|---|
| LlamaServerBackend | llama-server | GGUF LLMs, embeddings | llm, embedding |
| DiffusersBackend | Python worker | SDXL, FLUX, SD3.5 | diffusion |

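Each backend runs as an isolated subprocess. CUDA_VISIBLE_DEVICES restricts the subprocess to its assigned GPU, and prctl(PR_SET_PDEATHSIG) asks the kernel to signal the subprocess if the coordinator exits, so workers are cleaned up rather than orphaned.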
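
The isolation technique itself can be shown with the standard library. This is a sketch of the general pattern on Linux, not the coordinator's actual launcher: `spawn_backend` is a hypothetical name, and SIGTERM as the parent-death signal is an assumption. The constant `PR_SET_PDEATHSIG = 1` comes from `<linux/prctl.h>`.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>


def spawn_backend(argv, gpu_index):
    """Start a worker that sees one GPU and is signalled if the parent dies."""
    # The child only sees the GPU listed here (as device 0 inside the process).
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))

    preexec = None
    if sys.platform.startswith("linux"):
        # Load libc before forking so the child does minimal work pre-exec.
        libc = ctypes.CDLL(None, use_errno=True)

        def preexec():
            # Runs in the child between fork and exec.
            # SIGTERM is an assumed choice of parent-death signal.
            rc = libc.prctl(PR_SET_PDEATHSIG,
                            ctypes.c_ulong(int(signal.SIGTERM)))
            if rc != 0:
                raise OSError(ctypes.get_errno(), "prctl failed")

    return subprocess.Popen(
        argv,
        env=env,
        preexec_fn=preexec,
        stdout=subprocess.PIPE,
        text=True,
    )
```

On platforms other than Linux the sketch still sets CUDA_VISIBLE_DEVICES but skips the prctl call.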

VRAM Management

  • LRU eviction: Idle models evicted when VRAM needed for higher-priority requests
  • Priority-aware: Batch models evicted before normal; normal before high
  • Model pinning: pin: true prevents eviction (for always-needed small models)
  • Per-category stay_warm: Diffusion 15min, LLM 5min, vision 1min
  • Multi-GPU: Large models auto-split across GPUs via tensor parallelism
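
The first three rules above can be sketched as a victim-selection function. This is a simplification under stated assumptions, not ModelPool's implementation: `Slot` and `pick_evictions` are hypothetical names, each slot carries the numeric priority of the work it serves (larger means less important), and the requester's own priority is not considered.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    model_id: str
    vram_mb: int
    priority: int        # numeric queue priority; larger = less important
    last_used: float     # timestamp of the most recent request
    pinned: bool = False
    busy: bool = False   # currently serving a request, so not idle


def pick_evictions(slots, needed_mb, free_mb):
    """Return model_ids to evict so needed_mb fits, or None if impossible."""
    if free_mb >= needed_mb:
        return []
    # Only idle, unpinned models are candidates.
    candidates = [s for s in slots if not s.pinned and not s.busy]
    # Least important first; within one priority, least recently used first.
    candidates.sort(key=lambda s: (-s.priority, s.last_used))
    victims = []
    for slot in candidates:
        if free_mb >= needed_mb:
            break
        victims.append(slot.model_id)
        free_mb += slot.vram_mb
    return victims if free_mb >= needed_mb else None
```

Returning `None` when even full eviction is not enough lets the caller decide whether to wait, split the model across GPUs, or fail the request.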

Service Discovery

Consumers resolve the coordinator URL via lilith-service-addresses:

from lilith_service_addresses import get_service_url
url = get_service_url("model-boss", "coordinator")  # → http://localhost:8210

Override via environment variable: COORDINATOR_URL=http://custom:8210
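
A consumer that cannot import lilith-service-addresses could resolve the URL along these lines. This is a sketch: `resolve_coordinator_url` is a hypothetical helper, the precedence (environment variable first, then lookup, then the local default) is an assumption based on the word "override", and the default is the URL shown in the comment above.

```python
import os

DEFAULT_COORDINATOR_URL = "http://localhost:8210"


def resolve_coordinator_url(lookup=None, environ=os.environ):
    """Resolve the coordinator base URL.

    lookup: optional callable such as get_service_url(service, component).
    environ: mapping to read COORDINATOR_URL from (injectable for tests).
    """
    override = environ.get("COORDINATOR_URL")
    if override:
        return override.rstrip("/")   # avoid double slashes when joining paths
    if lookup is not None:
        return lookup("model-boss", "coordinator")
    return DEFAULT_COORDINATOR_URL
```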

GPU Coordination (Low-Level)

For workloads that need direct GPU access (training, adversarial perturbation):

import asyncio
from model_boss import GPUBoss, Priority

async def main():
    async with GPUBoss() as boss:
        # Lease 8000 MB of VRAM at normal priority
        async with boss.acquire(vram_mb=8000, priority=Priority.NORMAL) as lease:
            device = f"cuda:{lease.gpu_index}"
            # Load model, run training, etc.

asyncio.run(main())

CLI

model-boss gpu status          # GPU status and active leases
model-boss gpu drain           # Request all models to unload
model-boss gpu cleanup         # Clean up stale leases
model-boss model list          # List manifest models
model-boss queue status        # Queue and requestor state

See CLI Reference for complete documentation.

Installation

# SDK only
pip install lilith-model-boss

# With model loaders (dev/testing)
# Quote the extras so shells such as zsh do not expand the brackets
pip install "lilith-model-boss-loaders[diffusers]"
pip install "lilith-model-boss-loaders[all]"