Lilith 468287773e ci: test workflow - deps available, builds locally ✓		2026-01-10 01:05:49 -08:00
.forgejo/workflows	feat: add GPUBoss VRAM coordination (mandatory)	2026-01-09 11:34:32 -08:00
src/lilith_llama_service	fix(@ml/llama-service): 🐛 resolve model loading issues in multi-model support	2026-01-10 00:30:25 -08:00
.gitignore	feat: add GPUBoss VRAM coordination (mandatory)	2026-01-09 11:34:32 -08:00
Dockerfile	feat: add GPUBoss VRAM coordination (mandatory)	2026-01-09 11:34:32 -08:00
pyproject.toml	feat: add GPUBoss VRAM coordination (mandatory)	2026-01-09 11:34:32 -08:00
README.md	feat: add GPUBoss VRAM coordination (mandatory)	2026-01-09 11:34:32 -08:00

README.md

lilith-llama-service

Local LLM inference service using llama.cpp with GPUBoss VRAM coordination.

Overview

A FastAPI-based microservice providing llama.cpp inference with:

Streaming responses: Server-Sent Events for real-time generation
GPU acceleration: Configurable GPU layer offloading
GPUBoss coordination: Redis-based VRAM lease management prevents race conditions between ML services sharing a GPU
Health checks: Model availability monitoring

Installation

# Core package
pip install lilith-llama-service

# Development
pip install "lilith-llama-service[dev]"  # quoted so shells such as zsh do not expand the brackets

Configuration

Set environment variables:

# Model path
export LLAMA_SERVICE_MODEL_PATH="/path/to/model.gguf"

# Inference settings
export LLAMA_SERVICE_CONTEXT_SIZE=8192
export LLAMA_SERVICE_MAX_TOKENS=2048
export LLAMA_SERVICE_TEMPERATURE=0.7
export LLAMA_SERVICE_N_GPU_LAYERS=-1  # -1 = all layers on GPU

# GPUBoss coordination (mandatory - Redis required)
export LLAMA_SERVICE_GPUBOSS_REDIS_URL="redis://localhost:6379/0"
export LLAMA_SERVICE_GPUBOSS_PRIORITY="normal"  # low, normal, high, critical
export LLAMA_SERVICE_GPUBOSS_LEASE_TIMEOUT_MS=30000
export LLAMA_SERVICE_GPUBOSS_MODEL_ID="llama-service"
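
The variables above could be read into a typed settings object along these lines. This is a minimal sketch, not the service's actual configuration code: `LlamaSettings` and `load_settings` are hypothetical names, and the defaults simply mirror the example values shown above, which the README does not confirm as the real defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "LLAMA_SERVICE_"
PRIORITIES = {"low", "normal", "high", "critical"}


@dataclass(frozen=True)
class LlamaSettings:
    """Hypothetical settings container; the real service may differ."""

    model_path: str
    context_size: int
    max_tokens: int
    temperature: float
    n_gpu_layers: int
    gpuboss_redis_url: str
    gpuboss_priority: str
    gpuboss_lease_timeout_ms: int
    gpuboss_model_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> LlamaSettings:
    """Build settings from LLAMA_SERVICE_* variables in `env`."""

    def get(name: str, default: Optional[str] = None) -> Optional[str]:
        return env.get(PREFIX + name, default)

    model_path = get("MODEL_PATH")
    if not model_path:
        raise ValueError("LLAMA_SERVICE_MODEL_PATH is not set")

    priority = get("GPUBOSS_PRIORITY", "normal")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown GPUBoss priority: {priority!r}")

    # Fallback values below are assumptions copied from the README examples.
    return LlamaSettings(
        model_path=model_path,
        context_size=int(get("CONTEXT_SIZE", "8192")),
        max_tokens=int(get("MAX_TOKENS", "2048")),
        temperature=float(get("TEMPERATURE", "0.7")),
        n_gpu_layers=int(get("N_GPU_LAYERS", "-1")),  # -1 = all layers on GPU
        gpuboss_redis_url=get("GPUBOSS_REDIS_URL", "redis://localhost:6379/0"),
        gpuboss_priority=priority,
        gpuboss_lease_timeout_ms=int(get("GPUBOSS_LEASE_TIMEOUT_MS", "30000")),
        gpuboss_model_id=get("GPUBOSS_MODEL_ID", "llama-service"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.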

GPUBoss Integration

GPUBoss VRAM coordination is mandatory, so this service requires Redis. The service:

Connects to Redis on startup to coordinate with other ML services
Acquires a VRAM lease before loading the model, which prevents race conditions
Sends heartbeats to keep the lease active during inference
Releases the lease on shutdown so other services can use the VRAM

This prevents multiple ML services from competing for VRAM, which can cause OOM errors or segfaults.

Prerequisites: Redis must be running before starting this service.
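
The acquire, heartbeat, and release steps described above can be illustrated with a small in-memory model. This README does not document GPUBoss's actual Redis protocol, so everything here is an assumption made for illustration: `InMemoryLeaseStore` is a hypothetical stand-in for Redis, and the key `gpu0` is invented. Only the timeout and the owner name echo the configuration example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryLeaseStore:
    """Hypothetical stand-in for Redis: one expiring lease per key."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (owner, expiry time in seconds on the clock)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def _live_owner(self, key: str) -> Optional[str]:
        """Return the current owner, dropping the lease if it has expired."""
        entry = self._leases.get(key)
        if entry is None:
            return None
        owner, expires_at = entry
        if expires_at <= self._clock():
            del self._leases[key]
            return None
        return owner

    def acquire(self, key: str, owner: str, timeout_ms: int) -> bool:
        """Take the lease if it is free, expired, or already ours."""
        if self._live_owner(key) not in (None, owner):
            return False
        self._leases[key] = (owner, self._clock() + timeout_ms / 1000)
        return True

    def heartbeat(self, key: str, owner: str, timeout_ms: int) -> bool:
        """Extend the lease; fails if we no longer hold it."""
        if self._live_owner(key) != owner:
            return False
        self._leases[key] = (owner, self._clock() + timeout_ms / 1000)
        return True

    def release(self, key: str, owner: str) -> bool:
        """Give up the lease so another service can acquire it."""
        if self._live_owner(key) != owner:
            return False
        del self._leases[key]
        return True
```

In this model a service that stops sending heartbeats loses its lease once the timeout elapses, so a crashed process cannot hold the lease indefinitely. Whether GPUBoss behaves the same way is not stated here.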

Running

# As module
python -m lilith_llama_service

# Or with uvicorn
uvicorn lilith_llama_service.app:create_llama_service --factory --host 0.0.0.0 --port 8000

API

Health Check

GET /health

Response:

{
  "status": "ok",
  "version": "0.1.0",
  "model_loaded": true
}
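
A client might poll this endpoint before sending requests. The sketch below uses only the Python standard library. The base URL assumes the uvicorn command shown earlier, and treating the service as ready only when `status` is `"ok"` and `model_loaded` is `true` is an assumption about how the two fields relate, not documented behaviour.

```python
import json
import urllib.request


def is_ready(payload: dict) -> bool:
    """Interpret a /health payload; readiness criteria are an assumption."""
    return payload.get("status") == "ok" and payload.get("model_loaded") is True


def fetch_health(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> dict:
    """GET {base_url}/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)
```

With the service running, `is_ready(fetch_health())` gives a single boolean suitable for a startup wait loop.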

Chat Completion

POST /chat
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "system_prompt": "You are a helpful assistant.",
  "stream": true
}

Streaming response (SSE):

event: message
data: {"type": "start"}

event: message
data: {"type": "chunk", "content": "Hello"}

event: message
data: {"type": "chunk", "content": "!"}

event: message
data: {"type": "done"}
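
The stream above can be consumed by reading `data:` lines and concatenating the `content` of each `chunk` event. The following is a minimal sketch based only on the event shapes shown in this README. The function names are hypothetical, and any event types or fields beyond `start`, `chunk`, and `done` are not handled.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE event in `lines`."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
    if data_parts:
        # Flush a final event that was not followed by a blank line.
        yield json.loads("\n".join(data_parts))


def collect_reply(lines: Iterable[str]) -> str:
    """Join the content of all chunk events up to the done event."""
    parts: List[str] = []
    for event in iter_sse_data(lines):
        if event.get("type") == "chunk":
            parts.append(event.get("content", ""))
        elif event.get("type") == "done":
            break
    return "".join(parts)
```

When reading from an HTTP response, each line arrives as bytes and would need decoding (for example with `line.decode("utf-8")`) before being passed to `collect_reply`.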

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=lilith_llama_service