Use preexec_fn to set PR_SET_PDEATHSIG on subprocess so llama-server
dies when llama-http dies. This prevents orphaned processes consuming
VRAM after crashes or restarts.
Changes:
- Add ctypes import for libc.prctl call
- Replace start_new_session=True with preexec_fn=set_pdeathsig
- Simplify stop() to use process.terminate() instead of killpg()
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Made main() async and properly awaited create_app()
- Changed from factory mode to manual app creation with await
- Fixed TypeError that caused 500 Internal Server Error on all requests
- Service now starts correctly and responds to health checks
This resolves the critical bug that prevented llama-http from serving any HTTP requests.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
HTTP API service wrapping native llama-server for GGUF model inference
with GPU acceleration. Solves llama-cpp-python compatibility issues.
Features:
- Subprocess management for native llama-server binary
- OpenAI-compatible chat completions API (/v1/chat/completions)
- Model resolution via lilith-model-boss
- GPU tests verifying [THINK] chain-of-thought reasoning
- Streaming support via SSE
Supported models:
- ministral-3b-instruct (3.4GB, fast)
- ministral-14b-reasoning (7.7GB, chain-of-thought)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>