llama-http

HTTP API service wrapping native llama-server for GGUF model inference with GPU acceleration.

Why This Service Exists

llama-cpp-python (the Python bindings) often lags behind the native llama.cpp library. When new model architectures are added to llama.cpp (like Mistral-family models), the Python bindings may not support them for months.

This service solves that by:

  1. Using the native llama-server binary, built directly from the llama.cpp source, so new architectures are available as soon as llama.cpp supports them
  2. Managing the subprocess lifecycle automatically
  3. Exposing an OpenAI-compatible API for easy integration
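
As a rough illustration of the first two points, a wrapper could assemble the llama-server command line from its settings before launching the subprocess. The helper name and its defaults below are hypothetical and not taken from this repository. The flags (`--model`, `--host`, `--port`, `--ctx-size`, `--n-gpu-layers`) are regular llama-server options.

```python
from pathlib import Path


def build_llama_server_command(
    binary: Path,
    model_path: Path,
    port: int = 10009,
    context_size: int = 4096,
    n_gpu_layers: int = -1,
) -> list[str]:
    """Return the argv list for launching llama-server (hypothetical helper)."""
    return [
        str(binary),
        "--model", str(model_path),
        "--host", "127.0.0.1",  # assumption: the native server is only reached locally
        "--port", str(port),
        "--ctx-size", str(context_size),
        "--n-gpu-layers", str(n_gpu_layers),
    ]
```

The resulting list can be passed straight to `subprocess.Popen`, which avoids shell quoting problems with model paths.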

Supported Models

Any GGUF model supported by llama.cpp, including:

| Model | ID | Size | Use Case |
|-------|----|------|----------|
| Ministral 3B Instruct | ministral-3b-instruct | 3.4 GB | Fast responses, simple tasks |
| Ministral 14B Reasoning | ministral-14b-reasoning | 7.7 GB | Chain-of-thought, complex reasoning |

The reasoning model wraps its chain of thought in [THINK] ... [/THINK] tokens before giving the final answer.

Quick Start

cd ~/Code/@applications/@ml/llama-http
source .venv/bin/activate

# Default: ministral-3b-instruct
python -m llama_http

# Or with 14B reasoning model
LLAMA_HTTP_MODEL_ID=ministral-14b-reasoning python -m llama_http

The service listens on http://localhost:10010.

API Endpoints

Health Check

curl http://localhost:10010/health
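
Because the model has to load before the service can answer, a client script may want to poll the health endpoint before sending requests. The sketch below is one way to do that with the standard library. Both function names are hypothetical, and it assumes only that a healthy service answers `/health` with HTTP 200 (the response body is not inspected).

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as "not ready".
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 1.0,
) -> bool:
    """Call probe repeatedly until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A typical call would be `wait_until_healthy(lambda: http_ok("http://localhost:10010/health"))`. Passing the probe as a callable keeps the retry loop easy to test without a running server.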

Chat Completions (OpenAI-compatible)

curl -X POST http://localhost:10010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

Streaming

curl -X POST http://localhost:10010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "stream": true}'
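
With "stream": true the response arrives as server-sent events. The following sketch shows how a client might turn those lines back into text. It assumes the usual OpenAI-style framing (each event is a `data: {json}` line carrying `choices[].delta.content`, and the stream ends with `data: [DONE]`); the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            text = delta.get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

The input can be any iterable of decoded lines, for example the lines of a streamed httpx response.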

Configuration

Environment variables (prefix: LLAMA_HTTP_):

| Variable | Default | Description |
|----------|---------|-------------|
| LLAMA_HTTP_MODEL_ID | ministral-3b-instruct | Model ID from model-boss |
| LLAMA_HTTP_MODEL_PATH | (none) | Direct path override (bypasses model-boss) |
| LLAMA_HTTP_CONTEXT_SIZE | 4096 | Context window size |
| LLAMA_HTTP_N_GPU_LAYERS | -1 | GPU layers (-1 = all) |
| LLAMA_HTTP_PORT | 10010 | FastAPI service port (via lilith-service-addresses) |
| LLAMA_HTTP_LLAMA_SERVER_PORT | 10009 | Internal llama-server port |
| LLAMA_HTTP_FLASH_ATTN | true | Enable flash attention |
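
To make the prefix convention concrete, here is a minimal standard-library sketch of reading these variables with the defaults listed above. It is not the project's config.py, which may well use a settings library instead; the `Settings` class and `load_settings` function are hypothetical names, and the set of accepted boolean spellings is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "LLAMA_HTTP_"


@dataclass(frozen=True)
class Settings:
    model_id: str = "ministral-3b-instruct"
    model_path: Optional[str] = None
    context_size: int = 4096
    n_gpu_layers: int = -1
    port: int = 10010
    llama_server_port: int = 10009
    flash_attn: bool = True


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from LLAMA_HTTP_* variables, falling back to the defaults."""
    defaults = Settings()

    def as_int(name: str, default: int) -> int:
        value = env.get(PREFIX + name)
        return int(value) if value is not None else default

    flash = env.get(PREFIX + "FLASH_ATTN")
    return Settings(
        model_id=env.get(PREFIX + "MODEL_ID") or defaults.model_id,
        model_path=env.get(PREFIX + "MODEL_PATH") or None,
        context_size=as_int("CONTEXT_SIZE", defaults.context_size),
        n_gpu_layers=as_int("N_GPU_LAYERS", defaults.n_gpu_layers),
        port=as_int("PORT", defaults.port),
        llama_server_port=as_int("LLAMA_SERVER_PORT", defaults.llama_server_port),
        # Assumption: these spellings count as "true"; anything else is false.
        flash_attn=defaults.flash_attn
        if flash is None
        else flash.strip().lower() in {"1", "true", "yes", "on"},
    )
```

Taking the environment as a mapping argument lets tests pass a plain dict instead of mutating `os.environ`.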

Architecture

┌─────────────────────────────────────────────────────────┐
│                     llama-http                          │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │   FastAPI App   │───▶│  LlamaServerManager         │ │
│  │  (port 10010)   │    │  - Subprocess lifecycle     │ │
│  └────────┬────────┘    │  - Health monitoring        │ │
│           │             └──────────────┬──────────────┘ │
│           ▼                            ▼                │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │  LlamaClient    │───▶│  llama-server (native)      │ │
│  │  (HTTP proxy)   │    │  (port 10009, CUDA)         │ │
│  └─────────────────┘    └──────────────┬──────────────┘ │
│                                        ▼                │
│                         ┌─────────────────────────────┐ │
│                         │  GGUF Model (GPU VRAM)      │ │
│                         │  via model-boss resolution  │ │
│                         └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
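
The "Subprocess lifecycle" box in the diagram boils down to starting the native server and guaranteeing it is stopped when the wrapper exits. The sketch below shows that pattern with a context manager. It is not the actual LlamaServerManager; `managed_process` is a hypothetical name, and the demo launches a sleeping Python process as a stand-in for llama-server.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Iterator, Sequence


@contextmanager
def managed_process(
    command: Sequence[str], stop_timeout: float = 10.0
) -> Iterator[subprocess.Popen]:
    """Start command as a child process and make sure it is stopped on exit."""
    process = subprocess.Popen(command)
    try:
        yield process
    finally:
        if process.poll() is None:  # child is still running
            process.terminate()  # ask politely first (SIGTERM on POSIX)
            try:
                process.wait(timeout=stop_timeout)
            except subprocess.TimeoutExpired:
                process.kill()  # force stop if terminate was ignored
                process.wait()


if __name__ == "__main__":
    # Stand-in child process; a real caller would pass the llama-server argv.
    stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
    with managed_process(stand_in) as proc:
        print("running:", proc.poll() is None)
    print("stopped:", proc.poll() is not None)
```

Terminating before killing gives the native server a chance to release GPU memory cleanly.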

Dependencies

  • model-boss: Model path resolution
  • lilith-fastapi-service-base: FastAPI bootstrapping
  • Native llama-server: Built from ~/Code/github-clones/llama.cpp

Building llama-server

cd ~/Code/github-clones/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build -j$(nproc)
# Binary at: build/bin/llama-server

Testing

# Run all tests (requires GPU)
pytest tests/ -v

# Run GPU reasoning tests specifically
pytest tests/test_reasoning_gpu.py -v -s

Tests verify:

  • [THINK] token generation for reasoning models
  • GPU memory allocation (>4GB for 14B model)
  • SSE streaming responses
  • Token usage reporting
  • model-boss path resolution

Integration Examples

Python (httpx)

import httpx

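async def chat(prompt: str) -> str:
    """Send a single-turn chat request and return the assistant's reply."""
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "http://localhost:10010/v1/chat/completions",
            json={
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
            timeout=60.0,
        )
        # Surface HTTP errors instead of failing later on a missing "choices" key
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]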

TypeScript (fetch)

const response = await fetch('http://localhost:10010/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [{ role: 'user', content: 'Hello' }],
    max_tokens: 100,
  }),
});
if (!response.ok) {
  throw new Error(`llama-http request failed: ${response.status}`);
}
const data = await response.json();
console.log(data.choices[0].message.content);

Reasoning Model Output

The ministral-14b-reasoning model uses [THINK] tokens for chain-of-thought reasoning:

User: What is 15 * 23?
Assistant: [THINK]I need to calculate 15 multiplied by 23. Let me break this down:
- 15 × 20 = 300
- 15 × 3 = 45
- 300 + 45 = 345[/THINK]

The answer is **345**.
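
Callers that only want the final answer need to separate it from the reasoning. A minimal sketch, assuming the [THINK] ... [/THINK] markers arrive inline in the message content exactly as in the example above; `split_reasoning` is a hypothetical helper, not part of llama-http.

```python
import re
from typing import Tuple

# Non-greedy match so several [THINK] blocks in one response are handled separately.
THINK_BLOCK = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain [THINK] blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer
```

A response without any [THINK] block comes back with an empty reasoning string and the text unchanged apart from surrounding whitespace.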

The reasoning model is useful for:

  • Complex reasoning tasks
  • Prompt enhancement for image generation
  • Multi-step problem solving

Services that use llama-http:

  • imajin/imagegen-assistant: Uses llama-http for prompt enhancement with Ministral 14B
  • conversation-assistant: Can use llama-http as LLM backend

License

MIT