llama-http
HTTP API service wrapping native llama-server for GGUF model inference with GPU acceleration.
Why This Service Exists
llama-cpp-python (the Python bindings) often lags behind the native llama.cpp library. When new model architectures are added to llama.cpp (like Mistral-family models), the Python bindings may not support them for months.
This service solves that by:
- Using the native llama-server binary (always up-to-date)
- Managing the subprocess lifecycle automatically
- Exposing an OpenAI-compatible API for easy integration
Supported Models
Any GGUF model supported by llama.cpp, including:
| Model | ID | Size | Use Case |
|---|---|---|---|
| Ministral 3B Instruct | ministral-3b-instruct | 3.4 GB | Fast responses, simple tasks |
| Ministral 14B Reasoning | ministral-14b-reasoning | 7.7 GB | Chain-of-thought, complex reasoning |
The reasoning model emits its chain-of-thought between [THINK] and [/THINK] tokens before giving the final answer.
Quick Start
cd ~/Code/@applications/@ml/llama-http
source .venv/bin/activate
# Default: ministral-3b-instruct
python -m llama_http
# Or with 14B reasoning model
LLAMA_HTTP_MODEL_ID=ministral-14b-reasoning python -m llama_http
Service runs on http://localhost:10010.
API Endpoints
Health Check
curl http://localhost:10010/health
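When scripting against the service, it can help to wait for the health endpoint before sending requests, since model loading can take some time. A minimal standard-library sketch; the function name `wait_for_health`, the timeout values, and the assumption that a healthy service answers `/health` with HTTP 200 are illustrative and not taken from the llama-http code:

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:  # assumption: 200 means "ready"
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet, or the server answered with an error status
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_health("http://localhost:10010")` returns `True` once the endpoint responds and `False` if the deadline passes first.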
Chat Completions (OpenAI-compatible)
curl -X POST http://localhost:10010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
Streaming
# -N disables curl's output buffering so chunks print as they arrive
curl -N -X POST http://localhost:10010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "stream": true}'
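With `"stream": true`, OpenAI-compatible servers send the response as server-sent events: `data: {...}` lines carrying JSON chunks, ending with `data: [DONE]`. The sketch below extracts the text from such lines. It assumes llama-http follows that chunk shape (`choices[0].delta.content`), which this README does not spell out, and `iter_stream_text` is a hypothetical helper name:

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Canned lines standing in for a real response stream
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With httpx, the lines could come from `response.iter_lines()` inside a `client.stream("POST", ...)` block.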
Configuration
Environment variables (prefix: LLAMA_HTTP_):
| Variable | Default | Description |
|---|---|---|
| LLAMA_HTTP_MODEL_ID | ministral-3b-instruct | Model ID from model-boss |
| LLAMA_HTTP_MODEL_PATH | (none) | Direct path override (bypasses model-boss) |
| LLAMA_HTTP_CONTEXT_SIZE | 4096 | Context window size |
| LLAMA_HTTP_N_GPU_LAYERS | -1 | GPU layers (-1 = all) |
| LLAMA_HTTP_PORT | 10010 | FastAPI service port (via lilith-service-addresses) |
| LLAMA_HTTP_LLAMA_SERVER_PORT | 10009 | Internal llama-server port |
| LLAMA_HTTP_FLASH_ATTN | true | Enable flash attention |
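To make the prefix convention concrete, the sketch below reads the variables from the table into a typed settings object with the documented defaults. It uses only the standard library; the actual service may well load its configuration differently, and the `Settings` class and its field names are illustrative:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "LLAMA_HTTP_"


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the table above
    model_id: str = "ministral-3b-instruct"
    model_path: Optional[str] = None
    context_size: int = 4096
    n_gpu_layers: int = -1
    port: int = 10010
    llama_server_port: int = 10009
    flash_attn: bool = True

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        env = os.environ if env is None else env
        d = cls()  # instance holding the documented defaults

        def raw(name: str) -> Optional[str]:
            return env.get(PREFIX + name)

        def as_int(name: str, default: int) -> int:
            value = raw(name)
            return default if value is None else int(value)

        flash = raw("FLASH_ATTN")
        return cls(
            model_id=raw("MODEL_ID") or d.model_id,
            model_path=raw("MODEL_PATH") or None,
            context_size=as_int("CONTEXT_SIZE", d.context_size),
            n_gpu_layers=as_int("N_GPU_LAYERS", d.n_gpu_layers),
            port=as_int("PORT", d.port),
            llama_server_port=as_int("LLAMA_SERVER_PORT", d.llama_server_port),
            flash_attn=d.flash_attn if flash is None
            else flash.strip().lower() in {"1", "true", "yes", "on"},
        )


settings = Settings.from_env({
    "LLAMA_HTTP_MODEL_ID": "ministral-14b-reasoning",
    "LLAMA_HTTP_CONTEXT_SIZE": "8192",
})
print(settings.model_id, settings.context_size, settings.port)
```

Passing a plain dict instead of `os.environ` keeps the parsing easy to test without touching the process environment.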
Architecture
┌─────────────────────────────────────────────────────────┐
│                       llama-http                        │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │  FastAPI App    │───▶│  LlamaServerManager         │ │
│  │  (port 10010)   │    │  - Subprocess lifecycle     │ │
│  └────────┬────────┘    │  - Health monitoring        │ │
│           │             └──────────────┬──────────────┘ │
│           ▼                            ▼                │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │  LlamaClient    │───▶│  llama-server (native)      │ │
│  │  (HTTP proxy)   │    │  (port 10009, CUDA)         │ │
│  └─────────────────┘    └──────────────┬──────────────┘ │
│                                        ▼                │
│                         ┌─────────────────────────────┐ │
│                         │  GGUF Model (GPU VRAM)      │ │
│                         │  via model-boss resolution  │ │
│                         └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
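The manager box in the diagram boils down to starting llama-server as a child process and stopping it on shutdown. A rough sketch of that idea, not the actual `LlamaServerManager`: the helper names and the example paths are hypothetical, and the flags are llama-server's standard `--model`, `--ctx-size`, `--n-gpu-layers`, `--host`, and `--port` options. How the service maps the `-1` GPU-layer default or the flash-attention setting onto llama-server flags is not covered here.

```python
import subprocess
from contextlib import contextmanager
from typing import Iterator, List


def build_command(binary: str, model_path: str, *, port: int,
                  context_size: int, n_gpu_layers: int) -> List[str]:
    """Assemble the llama-server argument list (no shell involved)."""
    return [
        binary,
        "--model", model_path,
        "--ctx-size", str(context_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--host", "127.0.0.1",  # assumption: internal server stays on loopback
        "--port", str(port),
    ]


@contextmanager
def running_server(command: List[str], stop_timeout_s: float = 10.0) -> Iterator[subprocess.Popen]:
    """Start the child process and make sure it is stopped on exit."""
    proc = subprocess.Popen(command)
    try:
        yield proc
    finally:
        proc.terminate()  # polite stop first
        try:
            proc.wait(timeout=stop_timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # force stop if the process ignores the request
            proc.wait()


# Illustrative values only; the layer count and paths are placeholders
cmd = build_command("build/bin/llama-server", "model.gguf",
                    port=10009, context_size=4096, n_gpu_layers=32)
print(" ".join(cmd))
```

Wrapping the request-handling loop in `with running_server(cmd):` ties the child's lifetime to the parent, so a crash or shutdown in the wrapper does not leave llama-server holding GPU memory.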
Dependencies
- model-boss: Model path resolution
- lilith-fastapi-service-base: FastAPI bootstrapping
- Native llama-server: Built from ~/Code/github-clones/llama.cpp
Building llama-server
cd ~/Code/github-clones/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build -j$(nproc)
# Binary at: build/bin/llama-server
Testing
# Run all tests (requires GPU)
pytest tests/ -v
# Run GPU reasoning tests specifically
pytest tests/test_reasoning_gpu.py -v -s
Tests verify:
- [THINK] token generation for reasoning models
- GPU memory allocation (>4GB for 14B model)
- SSE streaming responses
- Token usage reporting
- model-boss path resolution
Integration Examples
Python (httpx)
import httpx


async def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "http://localhost:10010/v1/chat/completions",
            json={
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
            timeout=60.0,
        )
        r.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        return r.json()["choices"][0]["message"]["content"]
TypeScript (fetch)
const response = await fetch('http://localhost:10010/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [{ role: 'user', content: 'Hello' }],
    max_tokens: 100,
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);
Reasoning Model Output
The ministral-14b-reasoning model uses [THINK] tokens for chain-of-thought reasoning:
User: What is 15 * 23?
Assistant: [THINK]I need to calculate 15 multiplied by 23. Let me break this down:
- 15 × 20 = 300
- 15 × 3 = 45
- 300 + 45 = 345[/THINK]
The answer is **345**.
This is useful for:
- Complex reasoning tasks
- Prompt enhancement for image generation
- Multi-step problem solving
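Callers that only need the final answer can strip the reasoning span from the message content. A small sketch, assuming the reasoning arrives as a single `[THINK]...[/THINK]` span inside the content as in the example above; `split_reasoning` is a hypothetical helper, not part of the service:

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(content: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no [THINK] span exists."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the span is treated as the answer
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "[THINK]15 * 20 = 300, 15 * 3 = 45, 300 + 45 = 345[/THINK]\nThe answer is **345**."
)
print(answer)
```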
Related Services
- imajin/imagegen-assistant: Uses llama-http for prompt enhancement with Ministral 14B
- conversation-assistant: Can use llama-http as LLM backend
License
MIT