2026-01-10 01:05:49 -08:00
.forgejo/workflows feat: add GPUBoss VRAM coordination (mandatory) 2026-01-09 11:34:32 -08:00
src/lilith_llama_service fix(@ml/llama-service): 🐛 resolve model loading issues in multi-model support 2026-01-10 00:30:25 -08:00
.gitignore feat: add GPUBoss VRAM coordination (mandatory) 2026-01-09 11:34:32 -08:00
Dockerfile feat: add GPUBoss VRAM coordination (mandatory) 2026-01-09 11:34:32 -08:00
pyproject.toml feat: add GPUBoss VRAM coordination (mandatory) 2026-01-09 11:34:32 -08:00
README.md feat: add GPUBoss VRAM coordination (mandatory) 2026-01-09 11:34:32 -08:00

lilith-llama-service

Local LLM inference service using llama.cpp with GPUBoss VRAM coordination.

Overview

A FastAPI-based microservice providing llama.cpp inference with:

  • Streaming responses: Server-Sent Events for real-time generation
  • GPU acceleration: Configurable GPU layer offloading
  • GPUBoss coordination: Redis-based VRAM lease management prevents race conditions
  • Health checks: Model availability monitoring

Installation

# Core package
pip install lilith-llama-service

# Development (quote the extras so shells such as zsh do not expand the brackets)
pip install "lilith-llama-service[dev]"

Configuration

Set environment variables:

# Model path
export LLAMA_SERVICE_MODEL_PATH="/path/to/model.gguf"

# Inference settings
export LLAMA_SERVICE_CONTEXT_SIZE=8192
export LLAMA_SERVICE_MAX_TOKENS=2048
export LLAMA_SERVICE_TEMPERATURE=0.7
export LLAMA_SERVICE_N_GPU_LAYERS=-1  # -1 = all layers on GPU

# GPUBoss coordination (mandatory - Redis required)
export LLAMA_SERVICE_GPUBOSS_REDIS_URL="redis://localhost:6379/0"
export LLAMA_SERVICE_GPUBOSS_PRIORITY="normal"  # low, normal, high, critical
export LLAMA_SERVICE_GPUBOSS_LEASE_TIMEOUT_MS=30000
export LLAMA_SERVICE_GPUBOSS_MODEL_ID="llama-service"
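
As a rough sketch of how these variables could be consumed, the snippet below reads them into a dataclass. `LlamaSettings` and `load_settings` are hypothetical names used for illustration, not the service's actual configuration code. The fallback defaults simply mirror the example values above, and treating the model path and Redis URL as required is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "LLAMA_SERVICE_"
VALID_PRIORITIES = ("low", "normal", "high", "critical")


@dataclass(frozen=True)
class LlamaSettings:
    """Hypothetical container for the variables documented above."""
    model_path: str
    context_size: int
    max_tokens: int
    temperature: float
    n_gpu_layers: int
    gpuboss_redis_url: str
    gpuboss_priority: str
    gpuboss_lease_timeout_ms: int
    gpuboss_model_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> LlamaSettings:
    """Read LLAMA_SERVICE_* variables from a mapping (defaults are assumed)."""

    def get(name: str, default: Optional[str] = None) -> str:
        value = env.get(PREFIX + name, default)
        if value is None:
            raise ValueError(f"{PREFIX}{name} must be set")
        return value

    priority = get("GPUBOSS_PRIORITY", "normal")
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"invalid priority {priority!r}")

    return LlamaSettings(
        model_path=get("MODEL_PATH"),                # assumed required
        context_size=int(get("CONTEXT_SIZE", "8192")),
        max_tokens=int(get("MAX_TOKENS", "2048")),
        temperature=float(get("TEMPERATURE", "0.7")),
        n_gpu_layers=int(get("N_GPU_LAYERS", "-1")),  # -1 = all layers on GPU
        gpuboss_redis_url=get("GPUBOSS_REDIS_URL"),   # Redis is mandatory
        gpuboss_priority=priority,
        gpuboss_lease_timeout_ms=int(get("GPUBOSS_LEASE_TIMEOUT_MS", "30000")),
        gpuboss_model_id=get("GPUBOSS_MODEL_ID", "llama-service"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.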

GPUBoss Integration

GPUBoss VRAM coordination is mandatory, so this service requires Redis. During its lifecycle the service:

  1. Connects to Redis on startup to coordinate with other ML services
  2. Acquires a VRAM lease before loading the model, which prevents race conditions
  3. Sends heartbeats to keep the lease active during inference
  4. Releases the lease on shutdown so other services can use the VRAM

This prevents multiple ML services from competing for VRAM, which can cause OOM errors or segfaults.

Prerequisites: Redis must be running before starting this service.
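
The acquire, heartbeat, and release steps above can be illustrated with a generic Redis lease pattern. This is only a sketch and does not describe GPUBoss's real protocol: the `VramLease` class, the key name, and the token format are all hypothetical. It works with any client exposing redis-py style `set`, `get`, `pexpire`, and `delete` methods.

```python
import uuid


class VramLease:
    """Illustrative lease over a redis-py style client (not GPUBoss's protocol)."""

    KEY = "gpuboss:vram-lease"  # hypothetical key name

    def __init__(self, client, model_id: str, timeout_ms: int = 30000):
        self.client = client
        self.timeout_ms = timeout_ms
        # A unique token so only the current holder can extend or release.
        self.token = f"{model_id}:{uuid.uuid4().hex}"

    def acquire(self) -> bool:
        # SET key token NX PX timeout: succeeds only if no lease exists.
        return bool(
            self.client.set(self.KEY, self.token, nx=True, px=self.timeout_ms)
        )

    def heartbeat(self) -> bool:
        # Push the expiry forward, but only while we still hold the lease.
        if not self._held():
            return False
        return bool(self.client.pexpire(self.KEY, self.timeout_ms))

    def release(self) -> None:
        # Delete the key only if it is ours, so we never drop another
        # service's lease after our own has expired.
        if self._held():
            self.client.delete(self.KEY)

    def _held(self) -> bool:
        value = self.client.get(self.KEY)
        if isinstance(value, bytes):  # redis-py returns bytes by default
            value = value.decode("utf-8")
        return value == self.token
```

With redis-py installed, a client could be created with `redis.Redis.from_url(...)` using the `LLAMA_SERVICE_GPUBOSS_REDIS_URL` value. Note that the check-then-act in `heartbeat` and `release` is not atomic; a production implementation would typically do the comparison inside a Lua script.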

Running

# As module
python -m lilith_llama_service

# Or with uvicorn
uvicorn lilith_llama_service.app:create_llama_service --factory --host 0.0.0.0 --port 8000

API

Health Check

GET /health

Response:

{
  "status": "ok",
  "version": "0.1.0",
  "model_loaded": true
}
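
A client can use this response as a readiness probe. The sketch below uses only the standard library; the function names are illustrative, the base URL assumes the uvicorn command shown under Running, and it assumes the endpoint answers with HTTP 200 even while the model is still loading.

```python
import json
import urllib.request


def is_ready(body: str) -> bool:
    """Return True when a /health response body reports a loaded model."""
    payload = json.loads(body)
    return payload.get("status") == "ok" and payload.get("model_loaded") is True


def check_health(base_url: str = "http://localhost:8000",
                 timeout: float = 5.0) -> bool:
    """Fetch /health and interpret it (raises on connection or HTTP errors)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return is_ready(resp.read().decode("utf-8"))
```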

Chat Completion

POST /chat
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "system_prompt": "You are a helpful assistant.",
  "stream": true
}
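
For reference, the same request can be built from Python with the standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper, and the base URL assumes the host and port from the uvicorn command under Running.

```python
import json
import urllib.request


def build_chat_request(user_text: str,
                       system_prompt: str = "You are a helpful assistant.",
                       base_url: str = "http://localhost:8000",
                       stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request with the documented body."""
    payload = {
        "messages": [{"role": "user", "content": user_text}],
        "system_prompt": system_prompt,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Pass the returned object to `urllib.request.urlopen` to send it; with `"stream": true` the response body arrives as Server-Sent Events.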

Streaming response (SSE):

event: message
data: {"type": "start"}

event: message
data: {"type": "chunk", "content": "Hello"}

event: message
data: {"type": "chunk", "content": "!"}

event: message
data: {"type": "done"}
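
To reassemble the generated text on the client side, the `chunk` events can be concatenated until `done` arrives. The sketch below parses the event format shown above from any iterable of text lines; the function names are illustrative, and it assumes every `data:` payload is a JSON object with a `type` field, as in the example.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE event."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):  # SSE drops one leading space
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields ("event:", "id:", comments) are ignored here.
    if data_parts:  # stream ended without a trailing blank line
        yield json.loads("\n".join(data_parts))


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content of all chunk events up to the done event."""
    parts = []
    for event in iter_sse_data(lines):
        if event.get("type") == "chunk":
            parts.append(event.get("content", ""))
        elif event.get("type") == "done":
            break
    return "".join(parts)
```

When reading from an HTTP response object, decode each line first, for example `collect_text(line.decode("utf-8") for line in resp)`.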

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=lilith_llama_service