
Model Boss 4.0

Unified GPU resource controller for all ML workloads.

Model Boss is the centralized coordinator for GPU inference across the Lilith platform. Every model type — LLM, diffusion, vision, embedding, audio — goes through a single priority queue with VRAM lease management, LRU eviction, and multi-backend support.

Architecture

Consumers (28 services)
    │
    │  POST /v1/chat/completions     (LLM)
    │  POST /v1/images/generations   (diffusion)
    │  x_client_id, x_priority, x_stay_warm, x_cooldown
    │
    ▼
┌─────────────────── Coordinator :8210 ───────────────────┐
│                                                          │
│  InferenceQueue (priority-sorted, warm-model promotion)  │
│    urgent(1) > high(5) > normal(10) > low(20) > batch    │
│                                                          │
│  ModelPool (LRU eviction, VRAM management)               │
│    ┌─ ModelSlot ─────────────────────────────────────┐   │
│    │  VRAM lease │ eviction state │ InferenceBackend │   │
│    └─────────────────────────────────────────────────┘   │
│                                                          │
│  Backend Registry:                                       │
│    llama-server  → LlamaServerBackend (subprocess)       │
│    diffusers     → DiffusersBackend (subprocess worker)  │
│                                                          │
└──────────────────────────────────────────────────────────┘
    │                           │
    ▼                           ▼
 GPU 0 (24GB)              GPU 1 (24GB)
 GPUBoss leases            GPUBoss leases
    │                           │
    └────── Redis 6379 ─────────┘
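The queue ordering in the diagram can be illustrated with a small in-memory sketch. It is not the coordinator's InferenceQueue. It assumes that "warm-model promotion" means that, within one priority level, a request for an already-loaded model is served before a request that would trigger a load. The diagram gives no number for `batch`, so the sketch uses 100 as a placeholder that sorts after `low`. The class name `InferenceQueueSketch` is hypothetical.

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass

# Levels from the diagram; lower number = served first.
# "batch" has no number in the diagram: 100 is a placeholder assumption.
PRIORITY_LEVELS = {"urgent": 1, "high": 5, "normal": 10, "low": 20, "batch": 100}


@dataclass
class PendingRequest:
    level: int
    seq: int          # arrival order, used as the final tie-breaker
    model: str
    client_id: str


class InferenceQueueSketch:
    """Hypothetical priority queue with warm-model promotion."""

    def __init__(self) -> None:
        self._pending: list[PendingRequest] = []
        self._seq = itertools.count()

    def push(self, model: str, client_id: str, priority: str | int = "normal") -> None:
        # x_priority may be given as a name or as an integer level.
        level = PRIORITY_LEVELS[priority] if isinstance(priority, str) else int(priority)
        self._pending.append(PendingRequest(level, next(self._seq), model, client_id))

    def pop(self, warm_models: set[str]) -> PendingRequest:
        # Warmth is evaluated at dispatch time, because models load and unload
        # while requests wait. Order: priority level, then warm-before-cold, then FIFO.
        best = min(
            self._pending,
            key=lambda r: (r.level, 0 if r.model in warm_models else 1, r.seq),
        )
        self._pending.remove(best)
        return best

    def __len__(self) -> int:
        return len(self._pending)
```

A linear scan keeps the sketch simple. A heap would have to be re-keyed every time the set of loaded models changes, because warmth is only known at pop time.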

Packages

Package	Description
lilith-model-boss (`packages/core-py`)	Python SDK — `ModelBoss`, `InferenceClient`, `GPUBoss`, CLI
model-boss-coordinator (`services/coordinator`)	HTTP coordinator service with pool, queue, backends
lilith-model-boss-loaders (`packages/loaders-py`)	Direct model loaders (GGUF, diffusers, HF, ONNX, whisper, PuLID)

Quick Start

SDK Consumer (recommended)

import asyncio
from model_boss import ModelBoss

async def main() -> None:
    async with ModelBoss(model_id="ministral-3b-instruct") as boss:
        response = await boss.chat(
            messages=[{"role": "user", "content": "Hello!"}],
            x_client_id="my-service",
        )
        print(response)

asyncio.run(main())

Multi-Model Consumer

import asyncio
from model_boss.client import InferenceClient

async def main() -> None:
    async with InferenceClient() as client:
        # Route to different models through the same coordinator
        analysis = await client.chat(
            model="ministral-14b-reasoning",
            messages=[{"role": "user", "content": "Analyze this code..."}],
        )
        summary = await client.chat(
            model="ministral-3b-instruct",
            messages=[{"role": "user", "content": "Summarize..."}],
        )

asyncio.run(main())

HTTP Consumer (any language)

curl http://localhost:8210/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ministral-3b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "x_client_id": "my-service",
    "x_priority": "normal"
  }'

Image Generation

curl http://localhost:8210/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "animagine-xl-3.1",
    "prompt": "a cat astronaut, anime style",
    "width": 1024,
    "height": 1024,
    "x_client_id": "my-service"
  }'
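The same requests can be sent from Python without the SDK. The following is a minimal sketch that uses only the standard library. The base URL is the default port from the architecture diagram. The helper names (`build_request`, `post_json`, `chat_text`) are hypothetical. `chat_text` assumes the standard OpenAI chat response shape, because the chat endpoint is described as OpenAI-compatible. The image endpoint's response format is not specified here, so the sketch does not parse it.

```python
import json
import urllib.request

COORDINATOR_URL = "http://localhost:8210"  # default from the architecture diagram


def build_request(path: str, payload: dict, base_url: str = COORDINATOR_URL) -> urllib.request.Request:
    """Build a JSON POST for a coordinator endpoint without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 300.0) -> dict:
    # Generous timeout: the request may wait in the queue or for a model
    # load before it is served (an assumption about typical latency).
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read())


def chat_text(response: dict) -> str:
    # OpenAI-style chat responses put the reply in choices[0].message.content.
    return response["choices"][0]["message"]["content"]
```

For example, `chat_text(post_json("/v1/chat/completions", {"model": "ministral-3b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "x_client_id": "my-service"}))` mirrors the curl call above.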

Queue Extension Fields

All requests support these x_* fields (stripped before forwarding to backends):

Field	Type	Default	Description
`x_client_id`	string	`"anonymous"`	Consumer identity for tracking and cooldowns
`x_priority`	string/int	`"normal"`	Queue priority: `urgent`, `high`, `normal`, `low`, `batch`
`x_stay_warm`	float	per-category	Seconds to keep model loaded after last request
`x_cooldown`	float	`0`	Minimum seconds between consecutive requests from this client
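Stripping these fields can be pictured as splitting each request body into the part that is forwarded to the backend and the queue settings. This is a sketch, not the coordinator's code. It assumes that every `x_`-prefixed key is stripped, and that a missing `x_stay_warm` is represented as `None`, meaning "use the category default". The function name `split_extensions` is hypothetical.

```python
from typing import Any

# Defaults from the table above. None for x_stay_warm stands for
# "use the per-category default" (a representation chosen for this sketch).
EXTENSION_DEFAULTS = {
    "x_client_id": "anonymous",
    "x_priority": "normal",
    "x_stay_warm": None,
    "x_cooldown": 0.0,
}


def split_extensions(body: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (body to forward to the backend, queue extension settings)."""
    forward = {k: v for k, v in body.items() if not k.startswith("x_")}
    ext = dict(EXTENSION_DEFAULTS)
    ext.update({k: v for k, v in body.items() if k.startswith("x_")})
    return forward, ext
```

Keeping the backend payload free of `x_*` keys lets an OpenAI-compatible backend such as llama-server receive a request it recognizes.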

SDK consumers pass these as kwargs to .chat():

response = await boss.chat(
    messages=[...],
    x_client_id="auto-commit-service",
    x_priority="batch",
    x_stay_warm=0,
    x_cooldown=60,
)
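`x_cooldown=60` means consecutive requests from `auto-commit-service` are spaced at least 60 seconds apart. The following is a minimal bookkeeping sketch of that rule with an injectable clock. The README does not say whether the coordinator delays or rejects an early request, so the sketch only reports how long the client still has to wait. `CooldownTracker` is a hypothetical name.

```python
import time
from typing import Callable


class CooldownTracker:
    """Hypothetical per-client cooldown bookkeeping for x_cooldown."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_request: dict[str, float] = {}

    def remaining(self, client_id: str, cooldown_s: float) -> float:
        """Seconds this client must still wait before its next request is allowed."""
        last = self._last_request.get(client_id)
        if last is None or cooldown_s <= 0:
            return 0.0
        return max(0.0, cooldown_s - (self._clock() - last))

    def record(self, client_id: str) -> None:
        # Call when a request from this client is accepted.
        self._last_request[client_id] = self._clock()
```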

API Endpoints

Endpoint	Method	Description
`/v1/chat/completions`	POST	LLM chat (OpenAI-compatible)
`/v1/images/generations`	POST	Image generation (OpenAI DALL-E compatible)
`/v1/models`	GET	List available models
`/v1/models/{id}`	GET	Model details
`/v1/queue`	GET	Current queue state
`/v1/requestors`	GET	Registered client profiles
`/v1/pool/status`	GET	Pool slot status
`/api/v1/gpu/status`	GET	GPU VRAM status
`/api/v1/diffusion/generate`	POST	Legacy diffusion endpoint (routes through queue)
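The read-only endpoints can be queried in the same way as the POST endpoints. This sketch assumes that `/v1/models` returns the OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`), which the README implies but does not confirm. The helper names are hypothetical.

```python
from __future__ import annotations

import json
import urllib.request


def get_json(path: str, base_url: str = "http://localhost:8210", timeout: float = 10.0) -> dict:
    """GET a coordinator endpoint such as /v1/queue or /v1/pool/status."""
    with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
        return json.loads(resp.read())


def model_ids(models_payload: dict) -> list[str]:
    # Assumes the OpenAI list shape; returns [] if "data" is absent.
    return [entry["id"] for entry in models_payload.get("data", [])]
```

Against a running coordinator, `model_ids(get_json("/v1/models"))` would list the registered model IDs.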

Model Manifest

Models are registered in manifest.json. When an entry omits backend, the backend type is inferred from its category:

{
  "ministral-3b-instruct": {
    "name": "Ministral 3B Instruct",
    "path": "lmstudio-community/Ministral-3-3B-Instruct/model.gguf",
    "category": "llm",
    "vram_mb": 4000,
    "chatTemplate": "chatml",
    "context_size": 8192
  },
  "animagine-xl-3.1": {
    "name": "Animagine XL 3.1",
    "path": "models/diffusion/animagine-xl-3.1.safetensors",
    "category": "diffusion",
    "backend": "diffusers",
    "pipeline_type": "sdxl",
    "vram_mb": 10000,
    "dtype": "float16",
    "pin": false
  }
}

Manifest Fields

Field	Type	Description
`path`	string	Relative path from cache root
`category`	string	`llm`, `diffusion`, `vision`, `embedding`, `audio`
`backend`	string	`llama-server`, `diffusers` (auto-inferred from category if omitted)
`pipeline_type`	string	For diffusers: `sdxl`, `flux`, `sd35`, `sd15`
`dtype`	string	`float16`, `bfloat16`, `float32`, `auto`
`vram_mb`	int	VRAM requirement (auto-estimated from file size if omitted)
`endpoints`	list	Supported endpoints: `chat`, `completion`, `generate-image`, `embed`
`pin`	bool	If true, model is loaded at startup and never evicted
`chatTemplate`	string	`chatml` (default), `alpaca`, `raw`
`context_size`	int	Per-model context window override
`thinking`	bool	Enable chain-of-thought for reasoning models
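A manifest can be checked against this table before it is deployed. The following validator is a sketch that only enforces the allowed values and types listed above. It treats every field as optional, because the table does not say which fields are required. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Allowed values copied from the Manifest Fields table.
ALLOWED = {
    "category": {"llm", "diffusion", "vision", "embedding", "audio"},
    "backend": {"llama-server", "diffusers"},
    "pipeline_type": {"sdxl", "flux", "sd35", "sd15"},
    "dtype": {"float16", "bfloat16", "float32", "auto"},
    "chatTemplate": {"chatml", "alpaca", "raw"},
}
ENDPOINTS = {"chat", "completion", "generate-image", "embed"}


def validate_entry(model_id: str, entry: dict[str, Any]) -> list[str]:
    """Return problems found in one manifest entry; [] means it looks valid."""
    problems: list[str] = []
    if "path" in entry and not isinstance(entry["path"], str):
        problems.append(f"{model_id}: path must be a string")
    for field, allowed in ALLOWED.items():
        value = entry.get(field)
        if value is not None and (not isinstance(value, str) or value not in allowed):
            problems.append(f"{model_id}: {field}={value!r} is not one of {sorted(allowed)}")
    for endpoint in entry.get("endpoints", []):
        if endpoint not in ENDPOINTS:
            problems.append(f"{model_id}: unknown endpoint {endpoint!r}")
    for field in ("vram_mb", "context_size"):
        value = entry.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, int) or value <= 0):
            problems.append(f"{model_id}: {field} must be a positive integer")
    for field in ("pin", "thinking"):
        if field in entry and not isinstance(entry[field], bool):
            problems.append(f"{model_id}: {field} must be true or false")
    return problems


def validate_manifest(manifest: dict[str, dict[str, Any]]) -> list[str]:
    return [p for model_id, entry in manifest.items() for p in validate_entry(model_id, entry)]
```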

Backends

The coordinator manages models via pluggable backends:

Backend	Subprocess	Model Types	Manifest `category`
`LlamaServerBackend`	`llama-server`	GGUF LLMs, embeddings	`llm`, `embedding`
`DiffusersBackend`	Python worker	SDXL, FLUX, SD3.5	`diffusion`
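Backend auto-detection can be read directly off this table: an explicit `backend` wins, and otherwise the category decides. The table does not list a backend for `vision` or `audio`, so this sketch raises an error for those categories instead of guessing. `resolve_backend` is a hypothetical name.

```python
# Category -> backend, taken from the Backends table. vision and audio are
# deliberately absent because the table does not map them.
CATEGORY_BACKEND = {
    "llm": "llama-server",
    "embedding": "llama-server",
    "diffusion": "diffusers",
}


def resolve_backend(entry: dict) -> str:
    explicit = entry.get("backend")
    if explicit:
        return explicit
    try:
        return CATEGORY_BACKEND[entry["category"]]
    except KeyError:
        raise ValueError(f"cannot infer backend for category {entry.get('category')!r}") from None
```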

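Each backend runs as an isolated subprocess. CUDA_VISIBLE_DEVICES restricts it to its assigned GPU(s), and prctl(PR_SET_PDEATHSIG) makes the kernel signal it when the coordinator exits, so orphaned workers do not keep holding VRAM.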
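The two isolation mechanisms can be combined in a few lines on Linux. This sketch sets `CUDA_VISIBLE_DEVICES` in the child's environment and calls `prctl(PR_SET_PDEATHSIG, SIGTERM)` through ctypes in `preexec_fn`. It assumes glibc, where `ctypes.CDLL(None)` exposes libc's `prctl`. The choice of SIGTERM and the name `spawn_backend` are assumptions, not the coordinator's code.

```python
import ctypes
import os
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>


def _die_with_parent() -> None:
    # Runs in the child after fork, before exec. The death signal survives
    # exec (except for set-user-ID binaries).
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_backend(argv: list, gpu_index: int) -> subprocess.Popen:
    """Start a backend process that only sees one GPU and dies with its parent."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(
        argv,
        env=env,
        preexec_fn=_die_with_parent,
        stdout=subprocess.PIPE,
        text=True,
    )
```

The kernel delivers the death signal when the *thread* that forked the child exits, not only when the whole process exits. Spawn from a long-lived thread. The Python docs also warn that `preexec_fn` is unsafe in multithreaded programs.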

VRAM Management

LRU eviction: idle models are evicted when VRAM is needed for higher-priority requests
Priority-aware: batch models are evicted before normal ones, and normal before high
Model pinning: pin: true prevents eviction (for small models that are always needed)
Per-category stay_warm: diffusion 15 min, LLM 5 min, vision 1 min
Multi-GPU: large models are split across GPUs automatically via tensor parallelism
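These rules combine into a simple planning step: skip busy and pinned slots, evict the least important priority first, break ties by least recent use, and stop once the request fits. The following is a sketch under stated assumptions, not the ModelPool implementation. Each slot is assumed to carry the priority of the requests it serves. The `batch` level is the same placeholder 100 used for the queue. Categories without a listed stay_warm value (embedding, audio) are never expired here. `Slot`, `plan_eviction`, and `stay_warm_expired` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass

# Lower number = more important. "batch" = 100 is a placeholder assumption.
PRIORITY_LEVELS = {"urgent": 1, "high": 5, "normal": 10, "low": 20, "batch": 100}
# Per-category stay_warm defaults from the list above, in seconds.
STAY_WARM_S = {"diffusion": 15 * 60, "llm": 5 * 60, "vision": 60}


@dataclass
class Slot:
    model: str
    category: str
    vram_mb: int
    priority: str
    last_used: float  # monotonic seconds
    busy: bool = False
    pinned: bool = False


def stay_warm_expired(slot: Slot, now: float) -> bool:
    ttl = STAY_WARM_S.get(slot.category)
    return ttl is not None and not slot.busy and now - slot.last_used >= ttl


def plan_eviction(slots: list[Slot], free_mb: int, needed_mb: int) -> list[Slot] | None:
    """Choose slots to unload so needed_mb fits; None if that is impossible."""
    if free_mb >= needed_mb:
        return []
    candidates = [s for s in slots if not s.busy and not s.pinned]
    # Least important priority first, then least recently used.
    candidates.sort(key=lambda s: (-PRIORITY_LEVELS[s.priority], s.last_used))
    chosen: list[Slot] = []
    for slot in candidates:
        chosen.append(slot)
        free_mb += slot.vram_mb
        if free_mb >= needed_mb:
            return chosen
    return None  # evicting every candidate would still not free enough VRAM
```

Returning `None` instead of a partial list avoids unloading models when the request cannot be placed anyway.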

Service Discovery

Consumers resolve the coordinator URL via lilith-service-addresses:

from lilith_service_addresses import get_service_url
url = get_service_url("model-boss", "coordinator")  # → http://localhost:8210

Override via environment variable: COORDINATOR_URL=http://custom:8210
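The following sketch shows the precedence between the override and discovery. It assumes that a non-empty `COORDINATOR_URL` always wins and that trailing slashes should be dropped. The discovery call is injected as a function, so `get_service_url` from lilith-service-addresses can be passed in without this sketch depending on it. `resolve_coordinator_url` is a hypothetical name, and the README does not say whether `get_service_url` already honors the variable itself.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Mapping


def resolve_coordinator_url(
    discover: Callable[[], str],
    env: Mapping[str, str] | None = None,
) -> str:
    """COORDINATOR_URL wins if set and non-empty; otherwise defer to discovery."""
    env = os.environ if env is None else env
    override = env.get("COORDINATOR_URL", "").strip()
    return override.rstrip("/") if override else discover()
```

In a consumer, this would be called as `resolve_coordinator_url(lambda: get_service_url("model-boss", "coordinator"))`.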

GPU Coordination (Low-Level)

For workloads that need direct GPU access (training, adversarial perturbation):

import asyncio
from model_boss import GPUBoss, Priority

async def main() -> None:
    async with GPUBoss() as boss:
        async with boss.acquire(vram_mb=8000, priority=Priority.NORMAL) as lease:
            device = f"cuda:{lease.gpu_index}"
            # Load model, run training, etc.

asyncio.run(main())
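What a lease represents can be modeled with an in-process table. The table reserves VRAM on a GPU that has room and returns it when the block exits. This sketch is synchronous and ignores priority. It picks the GPU that leaves the least VRAM spare (best fit), which is an assumption about placement. The diagram shows the real leases backed by Redis. `LocalLeaseTable` and `Lease` are hypothetical names, and the capacities in the usage below match the two 24 GB GPUs in the diagram.

```python
from __future__ import annotations

import itertools
from collections.abc import Iterator
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    lease_id: int
    gpu_index: int
    vram_mb: int


class LocalLeaseTable:
    """In-process stand-in for GPU VRAM leases (no Redis, no priorities)."""

    def __init__(self, capacity_mb: list[int]) -> None:
        self._free = list(capacity_mb)
        self._ids = itertools.count(1)

    def _pick_gpu(self, vram_mb: int) -> int:
        fits = [i for i, free in enumerate(self._free) if free >= vram_mb]
        if not fits:
            raise RuntimeError(f"no GPU has {vram_mb} MB free")
        # Best fit: leave the smallest remainder; ties go to the lowest index.
        return min(fits, key=lambda i: self._free[i] - vram_mb)

    @contextmanager
    def acquire(self, vram_mb: int) -> Iterator[Lease]:
        gpu = self._pick_gpu(vram_mb)
        self._free[gpu] -= vram_mb
        try:
            yield Lease(next(self._ids), gpu, vram_mb)
        finally:
            self._free[gpu] += vram_mb  # always release, even on error

    def free_mb(self) -> list[int]:
        return list(self._free)
```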

CLI

model-boss gpu status          # GPU status and active leases
model-boss gpu drain           # Request all models to unload
model-boss gpu cleanup         # Clean up stale leases
model-boss model list          # List manifest models
model-boss queue status        # Queue and requestor state

See CLI Reference for complete documentation.

Documentation

CLI Reference
Architecture
Consumers — All 28 platform consumers
Oracle Routing — Complexity-aware model selection
Python SDK

Installation

# SDK only
pip install lilith-model-boss

# With model loaders (dev/testing); quote the extras so shells such as zsh do not glob the brackets
pip install "lilith-model-boss-loaders[diffusers]"
pip install "lilith-model-boss-loaders[all]"