Lilith 1d8ab202dc chore(config): 🔧 Update configuration setup in README.md and config.py

2026-01-17 14:57:21 -08:00

llama-http

HTTP API service wrapping native llama-server for GGUF model inference with GPU acceleration.

Why This Service Exists

llama-cpp-python (the Python bindings) often lags behind the native llama.cpp library. When new model architectures are added to llama.cpp (like Mistral-family models), the Python bindings may not support them for months.

This service solves that by:

Using the native llama-server binary (as current as your latest llama.cpp build)
Managing the subprocess lifecycle automatically
Exposing an OpenAI-compatible API for easy integration

Supported Models

Any GGUF model supported by llama.cpp, including:

Model	ID	Size	Use Case
Ministral 3B Instruct	`ministral-3b-instruct`	3.4 GB	Fast responses, simple tasks
Ministral 14B Reasoning	`ministral-14b-reasoning`	7.7 GB	Chain-of-thought, complex reasoning

The reasoning model wraps its chain-of-thought in [THINK] ... [/THINK] tokens before giving the final answer.

Quick Start

cd ~/Code/@applications/@ml/llama-http
source .venv/bin/activate

# Default: ministral-3b-instruct
python -m llama_http

# Or with 14B reasoning model
LLAMA_HTTP_MODEL_ID=ministral-14b-reasoning python -m llama_http

The service listens on http://localhost:10010 by default (set LLAMA_HTTP_PORT to change it).

API Endpoints

Health Check

curl http://localhost:10010/health
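A client may want to wait for the service before sending its first request. The sketch below polls the health endpoint from Python using only the standard library. It assumes that an HTTP 200 response means "ready"; the response body is not inspected, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(
    url: str = "http://localhost:10010/health",
    timeout: float = 120.0,
    interval: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `url` until the probe succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The `probe` parameter exists so the loop can be exercised without a running service.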

Chat Completions (OpenAI-compatible)

curl -X POST http://localhost:10010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

Streaming

curl -X POST http://localhost:10010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "stream": true}'
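With "stream": true the response arrives as server-sent events. The following is a minimal sketch of extracting the text fragments, assuming the stream follows the OpenAI chunk format (`data: {json}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`). The helper name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk often carries only the role, with no content.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

With httpx, the lines could come from `response.iter_lines()` inside an `httpx.stream("POST", url, json=payload)` context.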

Configuration

Environment variables (prefix: LLAMA_HTTP_):

Variable	Default	Description
`LLAMA_HTTP_MODEL_ID`	`ministral-3b-instruct`	Model ID from model-boss
`LLAMA_HTTP_MODEL_PATH`	(none)	Direct path override (bypasses model-boss)
`LLAMA_HTTP_CONTEXT_SIZE`	`4096`	Context window size
`LLAMA_HTTP_N_GPU_LAYERS`	`-1`	GPU layers (-1 = all)
`LLAMA_HTTP_PORT`	`10010`	FastAPI service port (via lilith-service-addresses)
`LLAMA_HTTP_LLAMA_SERVER_PORT`	`10009`	Internal llama-server port
`LLAMA_HTTP_FLASH_ATTN`	`true`	Enable flash attention
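To illustrate how the prefix and defaults in the table fit together, here is a sketch that reads the variables with the standard library. It is not the project's config.py: the `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional

PREFIX = "LLAMA_HTTP_"


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the table above.
    model_id: str = "ministral-3b-instruct"
    model_path: Optional[str] = None
    context_size: int = 4096
    n_gpu_layers: int = -1
    port: int = 10010
    llama_server_port: int = 10009
    flash_attn: bool = True


def _parse_bool(value: str) -> bool:
    # Assumed spellings for "true"; anything else counts as false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from LLAMA_HTTP_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = Settings()

    def read(name: str, cast: Callable[[str], Any], default: Any) -> Any:
        raw = env.get(PREFIX + name)
        return default if raw is None or raw.strip() == "" else cast(raw)

    return Settings(
        model_id=read("MODEL_ID", str, d.model_id),
        model_path=read("MODEL_PATH", str, d.model_path),
        context_size=read("CONTEXT_SIZE", int, d.context_size),
        n_gpu_layers=read("N_GPU_LAYERS", int, d.n_gpu_layers),
        port=read("PORT", int, d.port),
        llama_server_port=read("LLAMA_SERVER_PORT", int, d.llama_server_port),
        flash_attn=read("FLASH_ATTN", _parse_bool, d.flash_attn),
    )
```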

Architecture

┌─────────────────────────────────────────────────────────┐
│                     llama-http                          │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │   FastAPI App   │───▶│  LlamaServerManager         │ │
│  │  (port 10010)   │    │  - Subprocess lifecycle     │ │
│  └────────┬────────┘    │  - Health monitoring        │ │
│           │             └──────────────┬──────────────┘ │
│           ▼                            ▼                │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │  LlamaClient    │───▶│  llama-server (native)      │ │
│  │  (HTTP proxy)   │    │  (port 10009, CUDA)         │ │
│  └─────────────────┘    └──────────────┬──────────────┘ │
│                                        ▼                │
│                         ┌─────────────────────────────┐ │
│                         │  GGUF Model (GPU VRAM)      │ │
│                         │  via model-boss resolution  │ │
│                         └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
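The subprocess lifecycle handled by LlamaServerManager can be pictured roughly as follows. This is a sketch, not the project's implementation: the function names are hypothetical, and the command uses only the long-form llama-server options `--model`, `--host`, `--port`, `--ctx-size`, and `--n-gpu-layers`, with values passed through as configured. The flash attention option is omitted; check `llama-server --help` for the form your build accepts.

```python
import subprocess
from typing import List


def build_llama_server_command(
    binary: str,
    model_path: str,
    port: int = 10009,
    context_size: int = 4096,
    n_gpu_layers: int = -1,
) -> List[str]:
    """Assemble the argument list for the native llama-server binary."""
    return [
        binary,
        "--model", model_path,
        "--host", "127.0.0.1",  # internal port, so bind to loopback only
        "--port", str(port),
        "--ctx-size", str(context_size),
        "--n-gpu-layers", str(n_gpu_layers),
    ]


def start_llama_server(command: List[str]) -> subprocess.Popen:
    """Launch the server as a child process."""
    return subprocess.Popen(command)


def stop_llama_server(process: subprocess.Popen, grace: float = 10.0) -> None:
    """Ask the child to exit, then force-kill it if it ignores the request."""
    if process.poll() is not None:
        return  # already exited
    process.terminate()
    try:
        process.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
```

Terminating before killing gives the server a chance to shut down cleanly before it is forced to exit.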

Dependencies

model-boss: Model path resolution
lilith-fastapi-service-base: FastAPI bootstrapping
Native llama-server: Built from ~/Code/github-clones/llama.cpp

Building llama-server

cd ~/Code/github-clones/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build -j$(nproc)
# Binary at: build/bin/llama-server

Testing

# Run all tests (requires GPU)
pytest tests/ -v

# Run GPU reasoning tests specifically
pytest tests/test_reasoning_gpu.py -v -s

Tests verify:

[THINK] token generation for reasoning models
GPU memory allocation (> 4 GB for 14B model)
SSE streaming responses
Token usage reporting
model-boss path resolution
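As an illustration of the kind of check listed above for token usage reporting, the sketch below validates the `usage` object of a chat completion response. It assumes the OpenAI field names (`prompt_tokens`, `completion_tokens`, `total_tokens`). The helper is hypothetical and is not taken from the repository's test suite.

```python
def usage_is_consistent(response: dict) -> bool:
    """Return True if the response reports a coherent token usage block."""
    usage = response.get("usage")
    if not isinstance(usage, dict):
        return False
    try:
        prompt = int(usage["prompt_tokens"])
        completion = int(usage["completion_tokens"])
        total = int(usage["total_tokens"])
    except (KeyError, TypeError, ValueError):
        return False
    # Counts must be non-negative and the total must add up.
    return prompt >= 0 and completion >= 0 and total == prompt + completion
```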

Integration Examples

Python (httpx)

import httpx

async def chat(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "http://localhost:10010/v1/chat/completions",
            json={
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
            timeout=60.0,
        )
        r.raise_for_status()  # surface HTTP errors instead of failing on a missing key
        return r.json()["choices"][0]["message"]["content"]

TypeScript (fetch)

const response = await fetch('http://localhost:10010/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [{ role: 'user', content: 'Hello' }],
    max_tokens: 100,
  }),
});
if (!response.ok) throw new Error(`llama-http returned ${response.status}`);
const data = await response.json();
console.log(data.choices[0].message.content);

Reasoning Model Output

The ministral-14b-reasoning model uses [THINK] tokens for chain-of-thought reasoning:

User: What is 15 * 23?
Assistant: [THINK]I need to calculate 15 multiplied by 23. Let me break this down:
- 15 × 20 = 300
- 15 × 3 = 45
- 300 + 45 = 345[/THINK]

The answer is **345**.

This is useful for:

Complex reasoning tasks
Prompt enhancement for image generation
Multi-step problem solving
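Callers that only want the final answer can strip the reasoning span. This is a minimal sketch, assuming the [THINK] and [/THINK] tokens arrive inline in the message content as in the example above. `split_reasoning` is a hypothetical helper.

```python
import re
from typing import Tuple

# Non-greedy match so multiple [THINK] blocks are handled separately.
_THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a completion containing [THINK] blocks."""
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```

For prompt enhancement, the second element of the tuple is typically the only part passed downstream.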

Projects that use llama-http:

imajin/imagegen-assistant: Uses llama-http for prompt enhancement with Ministral 14B
conversation-assistant: Can use llama-http as LLM backend

License

MIT
