lilith-llama-service
Local LLM inference service using llama.cpp with GPUBoss VRAM coordination.
Overview
A FastAPI-based microservice providing llama.cpp inference with:
- Streaming responses: Server-Sent Events for real-time generation
- GPU acceleration: Configurable GPU layer offloading
- GPUBoss coordination: Redis-based VRAM lease management prevents race conditions
- Health checks: Model availability monitoring
Installation
# Core package
pip install lilith-llama-service
# Development
pip install "lilith-llama-service[dev]" # quoted so shells such as zsh do not expand the brackets
Configuration
Set environment variables:
# Model path
export LLAMA_SERVICE_MODEL_PATH="/path/to/model.gguf"
# Inference settings
export LLAMA_SERVICE_CONTEXT_SIZE=8192
export LLAMA_SERVICE_MAX_TOKENS=2048
export LLAMA_SERVICE_TEMPERATURE=0.7
export LLAMA_SERVICE_N_GPU_LAYERS=-1 # -1 = all layers on GPU
# GPUBoss coordination (mandatory - Redis required)
export LLAMA_SERVICE_GPUBOSS_REDIS_URL="redis://localhost:6379/0"
export LLAMA_SERVICE_GPUBOSS_PRIORITY="normal" # low, normal, high, critical
export LLAMA_SERVICE_GPUBOSS_LEASE_TIMEOUT_MS=30000
export LLAMA_SERVICE_GPUBOSS_MODEL_ID="llama-service"
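To illustrate how these variables map onto typed settings, here is a minimal sketch of a loader. It is not the service's actual configuration code: the names `LlamaSettings` and `load_settings` are hypothetical, and the defaults are simply assumed from the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "LLAMA_SERVICE_"
PRIORITIES = {"low", "normal", "high", "critical"}


@dataclass(frozen=True)
class LlamaSettings:
    """Hypothetical settings container; defaults mirror the README examples."""
    model_path: str
    context_size: int = 8192
    max_tokens: int = 2048
    temperature: float = 0.7
    n_gpu_layers: int = -1  # -1 = all layers on GPU
    gpuboss_redis_url: str = "redis://localhost:6379/0"
    gpuboss_priority: str = "normal"
    gpuboss_lease_timeout_ms: int = 30000
    gpuboss_model_id: str = "llama-service"


def load_settings(env: Optional[Mapping[str, str]] = None) -> LlamaSettings:
    """Build settings from LLAMA_SERVICE_* variables (os.environ by default)."""
    env = os.environ if env is None else env

    def get(name: str, default: Optional[str] = None) -> Optional[str]:
        return env.get(PREFIX + name, default)

    model_path = get("MODEL_PATH")
    if not model_path:
        raise ValueError("LLAMA_SERVICE_MODEL_PATH must be set")

    priority = get("GPUBOSS_PRIORITY", "normal")
    if priority not in PRIORITIES:
        raise ValueError(f"invalid GPUBoss priority: {priority!r}")

    return LlamaSettings(
        model_path=model_path,
        context_size=int(get("CONTEXT_SIZE", "8192")),
        max_tokens=int(get("MAX_TOKENS", "2048")),
        temperature=float(get("TEMPERATURE", "0.7")),
        n_gpu_layers=int(get("N_GPU_LAYERS", "-1")),
        gpuboss_redis_url=get("GPUBOSS_REDIS_URL", "redis://localhost:6379/0"),
        gpuboss_priority=priority,
        gpuboss_lease_timeout_ms=int(get("GPUBOSS_LEASE_TIMEOUT_MS", "30000")),
        gpuboss_model_id=get("GPUBOSS_MODEL_ID", "llama-service"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.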
GPUBoss Integration
GPUBoss VRAM coordination is mandatory, so this service requires a reachable Redis instance. The service:
- Connects to Redis on startup to coordinate with other ML services
- Acquires a VRAM lease before loading the model, which prevents race conditions
- Sends heartbeats to keep the lease active during inference
- Releases the lease on shutdown so other services can use the VRAM
This keeps multiple ML services from competing for GPU VRAM, which can otherwise cause OOM errors or segfaults.
Prerequisite: Redis must be running before this service starts.
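The acquire, heartbeat, and release cycle described above can be sketched with a toy in-memory lease store. This is only an illustration of the lease pattern, not GPUBoss itself: the class `InMemoryLeaseStore` and its methods are hypothetical, it uses a dict instead of Redis, and it ignores priorities and VRAM sizes.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryLeaseStore:
    """Toy stand-in for a Redis-backed lease table: resource -> (owner, expiry)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}

    def _current(self, resource: str) -> Optional[Tuple[str, float]]:
        """Return the live lease for a resource, dropping it if it has expired."""
        lease = self._leases.get(resource)
        if lease is not None and lease[1] <= self._clock():
            del self._leases[resource]
            return None
        return lease

    def acquire(self, resource: str, owner: str, timeout_ms: int) -> bool:
        """Take the lease unless another owner currently holds it."""
        lease = self._current(resource)
        if lease is not None and lease[0] != owner:
            return False
        self._leases[resource] = (owner, self._clock() + timeout_ms / 1000)
        return True

    def heartbeat(self, resource: str, owner: str, timeout_ms: int) -> bool:
        """Extend the lease; fails if the owner no longer holds it."""
        lease = self._current(resource)
        if lease is None or lease[0] != owner:
            return False
        self._leases[resource] = (owner, self._clock() + timeout_ms / 1000)
        return True

    def release(self, resource: str, owner: str) -> bool:
        """Give the lease back so another service can acquire it."""
        lease = self._current(resource)
        if lease is None or lease[0] != owner:
            return False
        del self._leases[resource]
        return True
```

The expiry is what makes heartbeats necessary: a holder that crashes stops extending its lease, so the resource frees itself after the timeout instead of staying locked.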
Running
# As module
python -m lilith_llama_service
# Or with uvicorn
uvicorn lilith_llama_service.app:create_llama_service --factory --host 0.0.0.0 --port 8000
API
Health Check
GET /health
Response:
{
  "status": "ok",
  "version": "0.1.0",
  "model_loaded": true
}
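A client can use this endpoint as a readiness probe. The sketch below uses only the standard library; the helper names `model_ready` and `check_health` are hypothetical, and the base URL assumes the service listens on port 8000 as in the uvicorn command above.

```python
import json
from urllib.request import urlopen


def model_ready(payload: dict) -> bool:
    """True when the health payload reports status "ok" and a loaded model."""
    return payload.get("status") == "ok" and payload.get("model_loaded") is True


def check_health(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Fetch GET /health and interpret the JSON body (assumed shape shown above)."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return model_ready(json.load(resp))
```

Keeping the payload check separate from the HTTP call lets the interpretation logic be tested without a running service.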
Chat Completion
POST /chat
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "system_prompt": "You are a helpful assistant.",
  "stream": true
}
Streaming response (SSE):
event: message
data: {"type": "start"}

event: message
data: {"type": "chunk", "content": "Hello"}

event: message
data: {"type": "chunk", "content": "!"}

event: message
data: {"type": "done"}
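On the client side, the stream can be reassembled by reading the `data:` lines and concatenating the `chunk` contents until `done` arrives. The following is a minimal sketch, assuming each event carries a single-line JSON payload as in the example above; `iter_sse_data` and `collect_reply` are hypothetical helper names, not part of this package.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


def collect_reply(lines: Iterable[str]) -> str:
    """Concatenate chunk contents, stopping at the first `done` event."""
    parts = []
    for event in iter_sse_data(lines):
        if event.get("type") == "chunk":
            parts.append(event.get("content", ""))
        elif event.get("type") == "done":
            break
    return "".join(parts)


sample = [
    "event: message",
    'data: {"type": "start"}',
    "",
    "event: message",
    'data: {"type": "chunk", "content": "Hello"}',
    "",
    "event: message",
    'data: {"type": "chunk", "content": "!"}',
    "",
    "event: message",
    'data: {"type": "done"}',
]
print(collect_reply(sample))  # prints: Hello!
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read incrementally so chunks can be shown as they arrive.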
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=lilith_llama_service