
Model Boss

Unified GPU/VRAM lease coordinator and model management system for ML workloads.

Model Boss provides Redis-based coordination for GPU/VRAM resources across multiple ML processes, preventing VRAM contention and OOM errors. It features automatic VRAM estimation, request queueing with priority levels, preemption support, and a unified inference API.

Features

  • Single Manifest: Unified manifest for all model types (GGUF, safetensors, diffusion models)
  • VRAM Coordination: Redis-backed lease system preventing GPU memory contention
  • Auto-Loader Selection: Automatically chooses the right loader based on model format
  • Priority Queueing: Request queue with HIGH/NORMAL/LOW priority levels
  • Preemption System: Higher priority requests can preempt lower priority leases
  • Path Resolution: Resolves model IDs to filesystem paths and handles sharded models
  • RAM Coordination: Separate coordination for system RAM to prevent thrashing
  • CLI Tools: Comprehensive command-line interface for monitoring and management
  • Auto-Start Services: Automatically starts Redis and other required services when needed

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Model Boss 3.0                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────┐          ┌─────────────────────┐     │
│  │   Python Package    │          │  TypeScript Package │     │
│  │   model-boss        │          │  @lilith/model-boss │     │
│  └─────────────────────┘          └─────────────────────┘     │
│           │                                   │                 │
│           ├─── GPU Boss ─────────────────────┤                 │
│           │    - VRAM leases                 │                 │
│           │    - Priority queue              │                 │
│           │    - Preemption                  │                 │
│           │                                  │                 │
│           ├─── RAM Boss ─────────────────────┤                 │
│           │    - RAM leases                  │                 │
│           │    - Memory analysis             │                 │
│           │    - Cache cleanup               │                 │
│           │                                  │                 │
│           └─── Path Loader ──────────────────┤                 │
│                - Model manifest              │                 │
│                - Path resolution             │                 │
│                - Sharded models              │                 │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Redis Backend                         │   │
│  │   - Lease tracking     - Queue management              │   │
│  │   - GPU status         - Heartbeat monitoring          │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Packages

Package             Language    Description
------------------  ----------  -----------------------------------------------------------
model-boss          Python      Core library with GPU/RAM coordination, model loaders, CLI
@lilith/model-boss  TypeScript  Core library with GPU/RAM coordination, path resolution

Quick Start

Python

from model_boss import ModelBoss

# High-level API: automatic VRAM management
# Redis is auto-started if not running
async with ModelBoss(model_id="mistral-7b-instruct") as boss:
    response = await boss.model.chat([
        {"role": "user", "content": "Hello!"}
    ])
    print(response)

# Low-level GPU coordination
from model_boss import GPUBoss, Priority

async with GPUBoss() as boss:
    async with boss.acquire(vram_mb=8000, priority=Priority.NORMAL) as lease:
        # VRAM reserved, load your model here
        await load_model()
        await run_inference()
    # Auto-released when context exits

TypeScript

import { GPUBoss, Priority } from '@lilith/model-boss';

const boss = new GPUBoss();
await boss.connect();

// Acquire VRAM lease
const lease = await boss.acquire({
  vramMb: 8000,
  modelId: 'llama-7b',
  priority: Priority.NORMAL,
});

// Handle preemption
lease.onPreempt(async (reason) => {
  console.log(`Preempted: ${reason}`);
  await unloadModel();
});

// Use the GPU
await loadModel();

// Release when done
await lease.release();
await boss.close();

Installation

Python

# Basic installation
pip install model-boss

# With optional dependencies
pip install model-boss[torch]      # PyTorch support
pip install model-boss[llama]      # llama.cpp support
pip install model-boss[diffusers]  # Diffusion models
pip install model-boss[all]        # All optional dependencies

TypeScript

npm install @lilith/model-boss
# or
pnpm add @lilith/model-boss

CLI Usage

Model Boss includes a comprehensive CLI for monitoring and managing GPU/RAM resources.

# GPU commands
model-boss gpu status              # Show GPU status and active leases
model-boss gpu list                # List requests waiting in the queue
model-boss gpu kill <lease-id>     # Kill a specific lease
model-boss gpu drain               # Request all models to unload
model-boss gpu cleanup             # Clean up stale leases
model-boss gpu diagnose            # Diagnose GPU coordination issues

# RAM commands
model-boss ram status              # Show RAM usage and leases
model-boss ram analyze             # Detailed memory analysis
model-boss ram clear auto          # Clear caches based on pressure
model-boss ram cleanup             # Clean up stale RAM leases

See CLI Documentation for complete reference.
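
The cleanup commands above target stale leases, but this README does not define what makes a lease stale. A common heartbeat-based convention, assumed here, is that a lease is stale once its last heartbeat is older than the lease timeout. The sketch below illustrates that rule in isolation; find_stale_leases is a hypothetical helper, not part of the model-boss CLI or API.

```python
import time
from typing import Dict, List, Optional


def find_stale_leases(
    last_heartbeat: Dict[str, float],
    timeout_s: float = 60.0,
    now: Optional[float] = None,
) -> List[str]:
    """Return the IDs of leases whose last heartbeat is older than timeout_s.

    last_heartbeat maps a lease ID to a Unix timestamp. The 60 second default
    mirrors the GPU_BOSS_LEASE_TIMEOUT example value; treating that value as
    the staleness threshold is an assumption.
    """
    current = time.time() if now is None else now
    return sorted(
        lease_id
        for lease_id, stamp in last_heartbeat.items()
        if current - stamp > timeout_s
    )
```

Passing now explicitly keeps the check deterministic, which is convenient when testing cleanup logic.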

Configuration

Model Boss is configured through environment variables and config files.

Environment Variables

# Redis connection
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0

# GPU settings
GPU_BOSS_GRACE_PERIOD=30         # Preemption grace period (seconds)
GPU_BOSS_HEARTBEAT_INTERVAL=5    # Heartbeat interval (seconds)
GPU_BOSS_LEASE_TIMEOUT=60        # Lease timeout (seconds)

# Model paths
MODEL_BOSS_MODELS_DIR=/path/to/models
MODEL_BOSS_MANIFEST_PATH=/path/to/manifest.yaml
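
As a rough illustration of how an application might consume the connection and timing variables above, the sketch below reads them into a plain dataclass. EnvSettings and load_env_settings are hypothetical names for this example, not part of the model-boss API, and using the values listed above as fallbacks is an assumption about the library's defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical container for the Redis and GPU timing variables."""

    redis_host: str
    redis_port: int
    redis_db: int
    grace_period_s: int
    heartbeat_interval_s: int
    lease_timeout_s: int

    @property
    def redis_url(self) -> str:
        # Same URL shape as the redis_url used in the Python configuration.
        return f"redis://{self.redis_host}:{self.redis_port}/{self.redis_db}"


def load_env_settings(env: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Build settings from env (os.environ when omitted).

    Fallback values copy the example values above (assumed defaults).
    """
    source = os.environ if env is None else env
    return EnvSettings(
        redis_host=source.get("REDIS_HOST", "localhost"),
        redis_port=int(source.get("REDIS_PORT", "6379")),
        redis_db=int(source.get("REDIS_DB", "0")),
        grace_period_s=int(source.get("GPU_BOSS_GRACE_PERIOD", "30")),
        heartbeat_interval_s=int(source.get("GPU_BOSS_HEARTBEAT_INTERVAL", "5")),
        lease_timeout_s=int(source.get("GPU_BOSS_LEASE_TIMEOUT", "60")),
    )
```

Accepting a mapping instead of reading os.environ directly makes the loader easy to exercise with a plain dict.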

Python Configuration

from model_boss import ModelBossConfig

config = ModelBossConfig(
    redis_url="redis://localhost:6379/0",
    models_dir="/path/to/models",
    manifest_path="/path/to/manifest.yaml",
)

TypeScript Configuration

import { GPUBoss } from '@lilith/model-boss';

const boss = new GPUBoss({
  redis: {
    host: 'localhost',
    port: 6379,
    db: 0,
  },
  gracePeriod: 30,
  heartbeatInterval: 5,
});

Service Auto-Start

Model Boss can automatically start required services (such as Redis) when they are not running, so most use cases need no manual service setup.

Python

from model_boss import GPUBoss

# Redis auto-starts if not running (default behavior)
async with GPUBoss() as boss:
    lease = await boss.acquire(vram_mb=8000)

# Disable auto-start if you manage Redis yourself
async with GPUBoss(auto_start_services=False) as boss:
    lease = await boss.acquire(vram_mb=8000)
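
How the library decides that Redis is "not running" is not described here. One simple approach, shown purely as a sketch, is to attempt a TCP connection to the configured host and port before deciding whether to start anything. port_is_open is a hypothetical helper and is not taken from model_boss.services.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    This only shows that a listener exists; it does not prove the listener
    is Redis. A stricter check would send a PING and expect PONG.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A caller could run port_is_open("localhost", 6379) and start Redis only when it returns False.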

Manual Service Management

from model_boss.services import ServiceManager, ensure_services

# Check and start services manually
async with ServiceManager() as manager:
    status = await manager.get_status()
    print(f"Redis: {status['redis'].status}")

# Or use convenience function
status = await ensure_services()
if status['redis'].status == 'running':
    print("Redis is ready!")

Auto-Start Configuration

# Disable auto-start via environment variable
export MODEL_BOSS_AUTO_START_SERVICES=false

# Custom Redis port for auto-start
export MODEL_BOSS_REDIS_PORT=6380

Use Cases

Shared GPU Server

Multiple users running different models on the same GPU:

# User 1: Low priority background task
async with GPUBoss() as boss:
    async with boss.acquire(vram_mb=4000, priority=Priority.LOW) as lease:
        await train_model()

# User 2: High priority interactive task
async with GPUBoss() as boss:
    async with boss.acquire(vram_mb=8000, priority=Priority.HIGH) as lease:
        # This will preempt User 1's lease if needed
        await run_interactive_session()
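
The preemption decision in this scenario can be sketched without Redis. The example below is only an illustration of one plausible policy, not the library's implementation: it assumes a request may preempt only strictly lower-priority leases, and that lower priorities and larger leases are evicted first. The Priority enum here is a local stand-in with assumed numeric values, and Lease and pick_victims are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Priority(IntEnum):
    # Local stand-in; the numeric values are an assumption.
    LOW = 0
    NORMAL = 1
    HIGH = 2


@dataclass(frozen=True)
class Lease:
    lease_id: str
    vram_mb: int
    priority: Priority


def pick_victims(
    active: List[Lease],
    total_vram_mb: int,
    request_mb: int,
    request_priority: Priority,
) -> Optional[List[Lease]]:
    """Choose leases to preempt so that request_mb fits.

    Returns [] if the request already fits, a list of leases to preempt if
    eviction makes room, or None if the request must wait in the queue.
    """
    free = total_vram_mb - sum(lease.vram_mb for lease in active)
    if free >= request_mb:
        return []
    # Only strictly lower-priority leases are candidates (assumed policy).
    candidates = sorted(
        (lease for lease in active if lease.priority < request_priority),
        key=lambda lease: (lease.priority, -lease.vram_mb),
    )
    victims: List[Lease] = []
    for lease in candidates:
        victims.append(lease)
        free += lease.vram_mb
        if free >= request_mb:
            return victims
    return None
```

With User 1 holding a 4000 MB LOW lease on a hypothetical 10000 MB GPU, User 2's 8000 MB HIGH request selects User 1's lease for preemption, while the same request at LOW priority would have to wait.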

Multi-Model Services

Running multiple models that need coordination:

# Service A: SDXL diffusion
async with ModelBoss(model_id="sdxl-turbo") as boss:
    image = await boss.model.generate("cat on a keyboard")

# Service B: LLM chat
async with ModelBoss(model_id="mistral-7b-instruct") as boss:
    response = await boss.model.chat([
        {"role": "user", "content": "Describe this image"}
    ])

Model Manifest

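Model Boss uses a YAML manifest to map model IDs to filesystem paths. Each entry also records the model's format, category, and VRAM figure in MB, and sharded models add a shard count.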

models:
  mistral-7b-instruct:
    path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
    format: gguf
    category: llm
    vram_mb: 4500

  sdxl-turbo:
    path: models/stable-diffusion-xl-turbo
    format: safetensors
    category: diffusion
    vram_mb: 6800

  llama-70b:
    path: models/llama-70b-sharded
    format: gguf
    category: llm
    sharded: true
    shard_count: 8
    vram_mb: 42000
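
Once parsed (for example with a YAML library), the manifest is a nested mapping keyed by model ID. The sketch below works on such a mapping, written out as a Python dict so it runs without extra dependencies. resolve_path and total_vram_mb are hypothetical helpers, and joining the relative path entries onto a base directory is an assumption about how the paths are interpreted.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Iterable

# The manifest above, as it would look after parsing.
MANIFEST: Dict[str, Dict[str, Dict[str, Any]]] = {
    "models": {
        "mistral-7b-instruct": {
            "path": "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
            "format": "gguf", "category": "llm", "vram_mb": 4500,
        },
        "sdxl-turbo": {
            "path": "models/stable-diffusion-xl-turbo",
            "format": "safetensors", "category": "diffusion", "vram_mb": 6800,
        },
        "llama-70b": {
            "path": "models/llama-70b-sharded",
            "format": "gguf", "category": "llm",
            "sharded": True, "shard_count": 8, "vram_mb": 42000,
        },
    }
}


def resolve_path(manifest: Dict[str, Any], model_id: str, base_dir: str) -> PurePosixPath:
    """Map a model ID to a path under base_dir (assumed resolution rule)."""
    models = manifest["models"]
    if model_id not in models:
        raise KeyError(f"unknown model id: {model_id!r}")
    return PurePosixPath(base_dir) / models[model_id]["path"]


def total_vram_mb(manifest: Dict[str, Any], model_ids: Iterable[str]) -> int:
    """Sum the vram_mb figures for the given model IDs."""
    return sum(manifest["models"][model_id]["vram_mb"] for model_id in model_ids)
```

For instance, total_vram_mb(MANIFEST, ["mistral-7b-instruct", "sdxl-turbo"]) adds 4500 and 6800, which gives a quick check of whether two models could share one GPU.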

Documentation

Development

# Clone repository
git clone https://forge.nasty.sh/lilith/model-boss
cd model-boss

# Install Python package in development mode
cd packages/core-py
pip install -e ".[dev]"
pytest

# Build and test the TypeScript package (a sibling of core-py)
cd ../core-ts
pnpm install
pnpm build
pnpm test

License

MIT License; see the LICENSE file for details.

Contributing

Contributions welcome! Please ensure:

  • Code follows existing style (Ruff for Python, ESLint for TypeScript)
  • All tests pass
  • New features include tests and documentation
  • Breaking changes are clearly documented

Support