@ml/llamacpp

Local GGUF model inference for Venus agents using dual Ministral models with intelligent routing.

Overview

@ml/llamacpp provides a TypeScript-native ML provider that runs local GGUF models using node-llama-cpp. This package implements intelligent dual-model routing:

  • Ministral-3-3B-Instruct (Q8_0, 3.5GB) - Fast agent conversations
  • Ministral-3-14B-Reasoning (Q4_K_M, 7.7GB) - Extended thinking tasks

Automatic routing: The provider selects the 14B reasoning model when QueryOptions.extendedThinking.enabled is true and otherwise uses the fast 3B model. An explicit model selection overrides both (see Model Routing Logic).

Features

  • Zero Python dependencies - Pure TypeScript implementation
  • Native GPU support - Metal (macOS), CUDA (Linux/Windows), Vulkan
  • Intelligent model routing - Automatic selection based on task complexity
  • Lazy loading - Models load on first use, cached in memory
  • Provider interface - Drop-in replacement for Claude or other providers
  • Auto-registration - Just import the package to enable

Installation

cd @packages
npm install

This package is part of the @ml/agent-ml workspace and depends on:

  • @ml/core - MLProvider interface
  • node-llama-cpp@^3.14.5 - Native llama.cpp bindings
  • zod@^3.22.0 - Schema validation

System Requirements

Hardware

  • GPU: NVIDIA (8GB+ VRAM) or Apple Silicon
  • RAM: 16GB minimum
  • Storage: 12GB for both models

Software

  • Node.js 20+
  • CUDA 12+ (Linux/Windows) or Metal (macOS)
  • GGUF model files at configured paths

Model Files

The default configuration expects models at:

  • /var/mnt/bigdisk/_/models/lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF/Ministral-3-3B-Instruct-2512-Q8_0.gguf
  • /var/mnt/bigdisk/_/models/lmstudio-community/Ministral-3-14B-Reasoning-2512-GGUF/Ministral-3-14B-Reasoning-2512-Q4_K_M.gguf

You can download these from Hugging Face.

Usage

Basic Usage (Auto-Registration)

The simplest way to use the provider is via auto-registration:

import '@ml/llamacpp';  // Auto-register provider
import { createVenusAgent } from '@venus/agent-core';
import { getProvider } from '@ml/core';
import { lilithPersonality } from '@venus/lilith';

const agent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),  // Use local GGUF models
});

// Fast model (3B, default)
await agent.ask("Quick question about TypeScript");

// Reasoning model (14B, automatic when extendedThinking enabled)
await agent.ask("Solve this complex architectural problem", {
  extendedThinking: { enabled: true, budgetTokens: 8000 }
});

Custom Configuration

You can create a provider instance with custom configuration:

import { LlamaCppMLProvider } from '@ml/llamacpp';

const provider = new LlamaCppMLProvider({
  defaultModel: 'reasoning',  // Always use 14B model
  preloadModels: true,        // Load models on initialization
  verbose: true,              // Enable console logging
  models: {
    fast: {
      path: '/custom/path/to/3b-model.gguf',
      name: 'Custom-3B',
      contextSize: 32768,
      gpuLayers: -1,  // Use all GPU layers
      useCase: 'fast',  // Required by ModelConfig
    },
    reasoning: {
      path: '/custom/path/to/14b-model.gguf',
      name: 'Custom-14B',
      contextSize: 32768,
      gpuLayers: 20,  // Offload only 20 layers to the GPU
      useCase: 'reasoning',  // Required by ModelConfig
    },
  },
});

await provider.initialize();  // Preload models if configured

Multi-Provider Setup

Use alongside other providers for flexibility:

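import '@ml/llamacpp';
import '@ml/claude';
import { getProvider } from '@ml/core';
import { createVenusAgent } from '@venus/agent-core';
import { lilithPersonality } from '@venus/lilith';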

// Use local models for development
const localAgent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),
});

// Use Claude for production
const prodAgent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('claude'),
});

Configuration Options

LlamaCppProviderConfig

interface LlamaCppProviderConfig {
  /** Model configurations */
  models: {
    fast: ModelConfig;
    reasoning: ModelConfig;
  };
  /** Default model to use ('fast' or 'reasoning') */
  defaultModel?: ModelSelection;
  /** Preload models on initialization (default: false) */
  preloadModels?: boolean;
  /** Enable verbose logging (default: false) */
  verbose?: boolean;
}

ModelConfig

interface ModelConfig {
  /** Absolute path to GGUF model file */
  path: string;
  /** Human-readable model name */
  name: string;
  /** Context window size (default: 32768) */
  contextSize: number;
  /** Number of GPU layers (-1 for all, default: -1) */
  gpuLayers: number;
  /** Model use case identifier */
  useCase: 'fast' | 'reasoning';
}

Model Routing Logic

The provider automatically selects the appropriate model based on query options:

// query() returns an AsyncIterable, so there is nothing to await here;
// consume each stream with `for await (const msg of stream) { ... }`.

// Explicit model selection (highest priority)
const stream1 = provider.query({
  prompt: "Hello",
  systemPrompt: "You are helpful",
  model: 'reasoning',  // Force 14B model
});

// Extended thinking triggers reasoning model
const stream2 = provider.query({
  prompt: "Complex problem",
  systemPrompt: "You are helpful",
  extendedThinking: { enabled: true, budgetTokens: 8000 },
  // Automatically uses 14B model
});

// Default to fast model
const stream3 = provider.query({
  prompt: "Quick question",
  systemPrompt: "You are helpful",
  // Automatically uses 3B model
});

Routing priority:

  1. Explicit options.model selection
  2. extendedThinking.enabled === true → reasoning model
  3. Default to fast model
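
The priority order above can be sketched as a small pure function. This is a minimal illustration, not the package's actual ModelRouter: the names routeQuery and RoutableOptions are hypothetical, and the sketch assumes that the configured defaultModel is what step 3 falls back to.

```typescript
// Hypothetical local types for illustration; the real ones live in
// @ml/llamacpp and @ml/core.
type ModelSelection = 'fast' | 'reasoning';

interface RoutableOptions {
  model?: ModelSelection;
  extendedThinking?: { enabled: boolean; budgetTokens?: number };
}

interface RoutingDecision {
  model: ModelSelection;
  reason: string;
}

function routeQuery(
  options: RoutableOptions,
  defaultModel: ModelSelection = 'fast', // assumption: step 3 uses the configured default
): RoutingDecision {
  // 1. An explicit selection always wins.
  if (options.model !== undefined) {
    return { model: options.model, reason: 'explicit model selection' };
  }
  // 2. Extended thinking routes to the reasoning model.
  if (options.extendedThinking?.enabled === true) {
    return { model: 'reasoning', reason: 'extended thinking enabled' };
  }
  // 3. Otherwise fall back to the default model.
  return { model: defaultModel, reason: 'default model' };
}
```

Note that an explicit model: 'fast' still wins even when extendedThinking.enabled is true, which is what "highest priority" implies.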

API Reference

LlamaCppMLProvider

Main provider class implementing the MLProvider interface.

Methods

constructor(config?: Partial<LlamaCppProviderConfig>)

  • Merges provided config with defaults
  • Initializes model manager and router
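
How the constructor merges a partial config is not spelled out here, so the following is only one plausible sketch. It assumes a per-model shallow merge; mergeConfig, ProviderConfig and the placeholder paths in DEFAULTS are hypothetical and are not the package's real defaults.

```typescript
// Hypothetical mirrors of the interfaces in "Configuration Options".
interface ModelConfig {
  path: string;
  name: string;
  contextSize: number;
  gpuLayers: number;
  useCase: 'fast' | 'reasoning';
}

interface ProviderConfig {
  models: { fast: ModelConfig; reasoning: ModelConfig };
  defaultModel?: 'fast' | 'reasoning';
  preloadModels?: boolean;
  verbose?: boolean;
}

// Placeholder defaults for illustration only (paths are made up).
const DEFAULTS: Required<ProviderConfig> = {
  models: {
    fast: { path: '/models/fast.gguf', name: 'fast', contextSize: 32768, gpuLayers: -1, useCase: 'fast' },
    reasoning: { path: '/models/reasoning.gguf', name: 'reasoning', contextSize: 32768, gpuLayers: -1, useCase: 'reasoning' },
  },
  defaultModel: 'fast',
  preloadModels: false,
  verbose: false,
};

function mergeConfig(overrides: Partial<ProviderConfig> = {}): Required<ProviderConfig> {
  return {
    ...DEFAULTS,
    ...overrides,
    // Merge each model separately so overriding one model keeps the other's defaults.
    models: {
      fast: { ...DEFAULTS.models.fast, ...overrides.models?.fast },
      reasoning: { ...DEFAULTS.models.reasoning, ...overrides.models?.reasoning },
    },
  };
}
```

With this approach, mergeConfig({ verbose: true }) keeps both default model entries and only flips the logging flag.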

async initialize(): Promise<void>

  • Optionally preloads models (if preloadModels: true)
  • Called automatically on first query if not called manually

async *query(options: QueryOptions): AsyncIterable<QueryMessage>

  • Routes to appropriate model
  • Loads model (cached after first load)
  • Executes inference
  • Yields QueryMessage types (assistant, result)
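
Because query() yields messages rather than returning a single value, callers often want a helper that drains the stream into one string. The sketch below is a minimal illustration: the message shapes are inferred from the examples in this README and may not match the real QueryMessage type, and collectText and fakeQuery are hypothetical names.

```typescript
// Hypothetical message shapes, inferred from the examples in this README.
interface AssistantMessage {
  type: 'assistant';
  message: { content: Array<{ text: string }> };
}

interface ResultMessage {
  type: 'result';
  subtype: string;
  result?: { text: string };
}

type QueryMessage = AssistantMessage | ResultMessage;

// Drains the stream and returns the final text.
async function collectText(stream: AsyncIterable<QueryMessage>): Promise<string> {
  let assistantText = '';
  for await (const msg of stream) {
    if (msg.type === 'assistant') {
      assistantText += msg.message.content.map((part) => part.text).join('');
    } else if (msg.subtype === 'success' && msg.result) {
      // Prefer the final result text when the stream reports success.
      return msg.result.text;
    }
  }
  // No success result was seen; fall back to the accumulated assistant text.
  return assistantText;
}

// Stand-in for provider.query() so the sketch runs without model files.
async function* fakeQuery(): AsyncGenerator<QueryMessage> {
  yield { type: 'assistant', message: { content: [{ text: 'Hello' }] } };
  yield { type: 'result', subtype: 'success', result: { text: 'Hello' } };
}
```

Usage with the stand-in would be const text = await collectText(fakeQuery()); with a real provider, pass provider.query({ ... }) instead.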

isAvailable(): boolean

  • Checks if model files exist at configured paths
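
A pre-flight check along these lines can also report which model is missing instead of a bare boolean. This is a sketch under stated assumptions, not the provider's implementation: findMissingModels is a hypothetical helper, and the exists predicate is injected so the function runs without touching the file system (in Node you would typically pass fs.existsSync).

```typescript
// Returns the names of models whose files are not found.
// `exists` is injected; in Node, fs.existsSync is a natural choice.
function findMissingModels(
  paths: Record<string, string>,
  exists: (path: string) => boolean,
): string[] {
  return Object.keys(paths).filter((name) => !exists(paths[name]));
}

// Example with a fake file system containing only the fast model.
const fakeFiles = ['/models/fast.gguf'];
const missing = findMissingModels(
  { fast: '/models/fast.gguf', reasoning: '/models/reasoning.gguf' },
  (path) => fakeFiles.indexOf(path) !== -1,
);
// `missing` now lists only the "reasoning" entry.
```

An isAvailable()-style boolean is then simply missing.length === 0.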

getConfig(): LlamaCppProviderConfig

  • Returns current configuration

async cleanup(): Promise<void>

  • Unloads all models from memory

ModelManager

Handles GGUF model lifecycle (loading, caching, unloading).

Methods

async loadModel(config: ModelConfig): Promise<LoadedModel>

  • Lazy loads model (returns cached if already loaded)
  • Creates context with configured size
  • Stores in memory cache

getModel(modelName: string): LoadedModel

  • Retrieves loaded model (throws if not loaded)

async unloadModel(modelName: string): Promise<void>

  • Unloads specific model from memory

async unloadAll(): Promise<void>

  • Unloads all models

isLoaded(modelName: string): boolean

  • Check if model is in memory

getLoadedModels(): string[]

  • List all loaded model names
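
The lazy-load-and-cache behavior described above can be sketched with two maps: one for loaded models and one for loads still in flight, so that concurrent callers share a single load. This is an illustrative sketch only: ModelCache, the Loader type and the dispose() method are hypothetical stand-ins, and the real ModelManager wraps node-llama-cpp rather than an injected loader.

```typescript
// Hypothetical minimal shapes; the real LoadedModel holds node-llama-cpp objects.
interface CacheModelConfig { name: string; path: string }
interface LoadedModel { name: string; dispose(): Promise<void> }
type Loader = (config: CacheModelConfig) => Promise<LoadedModel>;

class ModelCache {
  private loaded = new Map<string, LoadedModel>();
  private loading = new Map<string, Promise<LoadedModel>>();
  private loader: Loader;

  constructor(loader: Loader) {
    this.loader = loader;
  }

  async loadModel(config: CacheModelConfig): Promise<LoadedModel> {
    const cached = this.loaded.get(config.name);
    if (cached) return cached; // already in memory
    const inFlight = this.loading.get(config.name);
    if (inFlight) return inFlight; // share a load that is already running
    const pending = this.loader(config);
    this.loading.set(config.name, pending);
    try {
      const model = await pending;
      this.loaded.set(config.name, model);
      return model;
    } finally {
      // Clear the in-flight entry on success and on failure, so a failed load can be retried.
      this.loading.delete(config.name);
    }
  }

  getModel(name: string): LoadedModel {
    const model = this.loaded.get(name);
    if (!model) throw new Error(`Model not loaded: ${name}`);
    return model;
  }

  isLoaded(name: string): boolean {
    return this.loaded.has(name);
  }

  getLoadedModels(): string[] {
    return Array.from(this.loaded.keys());
  }

  async unloadModel(name: string): Promise<void> {
    const model = this.loaded.get(name);
    if (!model) return;
    this.loaded.delete(name);
    await model.dispose();
  }

  async unloadAll(): Promise<void> {
    for (const name of this.getLoadedModels()) {
      await this.unloadModel(name);
    }
  }
}
```

Tracking in-flight promises separately is a design choice: it avoids loading a multi-gigabyte file twice when two queries arrive before the first load completes.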

ModelRouter

Implements intelligent routing between fast and reasoning models.

Methods

route(options: QueryOptions): RoutingDecision

  • Applies routing logic
  • Returns decision with model selection and reason

getModelConfig(modelSelection: ModelSelection): ModelConfig

  • Retrieves model configuration for selection

Troubleshooting

Tokenizer Warnings

Warning: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

This warning originates from node-llama-cpp when loading certain GGUF models. It indicates the model's embedded tokenizer configuration has a special_eos_id (end-of-sequence token) that isn't listed in the special_eog_ids array (end-of-generation tokens).

Impact Assessment:

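| Concern            | Risk      | Notes                                     |
|--------------------|-----------|-------------------------------------------|
| Basic generation   | Low       | Usually works correctly                   |
| Stream termination | ⚠️ Medium | May not detect end-of-generation properly |
| Multi-turn chat    | ⚠️ Medium | Could affect turn boundary detection      |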

Root Cause: This is a metadata issue in the GGUF file itself, typically caused by:

  • Model converted with older/newer llama.cpp tools
  • Incomplete tokenizer config during GGUF creation
  • Model-specific quirks (common with newer Ministral models)

Resolution Options:

  1. Ignore (recommended for most cases) - The warning is usually non-critical. Monitor for:

    • Responses that don't terminate properly
    • Infinite generation loops
    • Garbled output at end of responses
  2. Update model version - Check if lmstudio-community released a newer GGUF with fixed tokenizer config:

    # Check for updates on Hugging Face
    huggingface-cli download lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF --revision main
    
  3. Try different quantization - Different quant files may have correct tokenizer config:

    • Q4_K_M instead of Q8_0
    • Q5_K_M or Q6_K variants
  4. Patch GGUF metadata (advanced) - Use llama.cpp tools to fix the tokenizer config. Start by inspecting the current token IDs (requires the gguf Python package):

    # Inspect tokenizer config
    python -c "from gguf import GGUFReader; r = GGUFReader('model.gguf'); print([kv for kv in r.fields if 'eos' in kv.lower()])"
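
For the monitoring suggested in option 1, a cheap heuristic is to check whether the tail of a response is the same chunk repeated several times, which is one way a missed end-of-generation token can show up. The sketch below is an assumption-laden illustration: endsInRepetition and its thresholds are hypothetical, are not part of this package, and will not catch every runaway generation.

```typescript
/**
 * Returns true when `text` ends with the same chunk repeated at least
 * `minRepeats` times back to back. Chunk sizes from `minChunk` to
 * `maxChunk` characters are tried; the defaults are arbitrary.
 */
function endsInRepetition(
  text: string,
  minRepeats = 4,
  minChunk = 8,
  maxChunk = 64,
): boolean {
  for (let size = minChunk; size <= maxChunk; size++) {
    if (size * minRepeats > text.length) break; // not enough text for this chunk size
    const chunk = text.slice(text.length - size);
    let repeated = true;
    // Compare the last chunk against each of the preceding chunks.
    for (let i = 2; i <= minRepeats; i++) {
      const start = text.length - size * i;
      if (text.slice(start, start + size) !== chunk) {
        repeated = false;
        break;
      }
    }
    if (repeated) return true;
  }
  return false;
}
```

A caller could run this on the accumulated response text and log or abort when it returns true. The minimum chunk size keeps short runs such as "----" or "...." from triggering it.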
    

Known Affected Models (as of 2025-12):

  • Ministral-3-3B-Instruct-2512-GGUF (Q8_0) - lmstudio-community
  • Ministral-3-14B-Reasoning-2512-GGUF (Q4_K_M) - lmstudio-community

Tracking: If you encounter generation issues related to this warning, please document them in the project issue tracker.


Model Loading Issues

Error: Failed to load model: ENOENT

  • Cause: Model file not found at configured path
  • Fix: Verify model files exist and paths are correct

Error: CUDA out of memory

  • Cause: Insufficient GPU VRAM
  • Fix: Reduce gpuLayers (e.g., set to 20 instead of -1)
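
Rather than picking a gpuLayers value by hand, a caller could try progressively smaller values until a load succeeds. The following is a hedged sketch: loadWithGpuFallback and the candidate list are hypothetical, the load callback stands in for whatever actually loads the model, and the sketch retries on any error rather than only on out-of-memory errors. Whether a failed load fully releases VRAM before the next attempt is not confirmed here.

```typescript
type LoadFn<T> = (gpuLayers: number) => Promise<T>;

// Tries each gpuLayers candidate in order and returns the first successful load.
async function loadWithGpuFallback<T>(
  load: LoadFn<T>,
  candidates: number[] = [-1, 32, 20, 10, 0], // -1 = all layers, 0 = CPU only
): Promise<{ model: T; gpuLayers: number }> {
  let lastError: unknown;
  for (const gpuLayers of candidates) {
    try {
      const model = await load(gpuLayers);
      return { model, gpuLayers };
    } catch (error) {
      // Assumed to be an out-of-memory failure; try fewer layers next.
      lastError = error;
    }
  }
  throw lastError instanceof Error
    ? lastError
    : new Error('No gpuLayers candidate succeeded');
}
```

In real use the load callback would construct the model with the given gpuLayers value; once a working value is found, it can be written back into the ModelConfig so later runs skip the probing.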

Error: Model loading timeout

  • Cause: Large model on slow storage
  • Fix: Move models to faster SSD or increase timeout

Inference Issues

Slow first query (3-5 seconds)

  • Cause: Lazy loading - model loads on first use
  • Fix: Set preloadModels: true in config

High memory usage (~20GB)

  • Cause: Both models loaded in memory
  • Fix: If only one model is needed, disable the other in the config

Responses are repetitive or nonsensical

  • Cause: Incorrect system prompt or temperature
  • Fix: Verify system prompt format, adjust temperature (0.7 default)

GPU Acceleration

GPU not being used (check with nvidia-smi or Activity Monitor)

  • Cause: gpuLayers set to 0
  • Fix: Set gpuLayers: -1 (all layers) or specific number

CUDA errors on Linux

  • Cause: Incompatible CUDA version
  • Fix: Ensure CUDA 12+ installed, check node-llama-cpp compatibility

Known Limitations

Version 0.1.0

  1. No tool/function calling - Ministral models not specifically tool-trained
  2. No streaming - Single-shot generation only (AsyncIterable still works)
  3. Model loading latency - First query takes 3-5s (cached after)
  4. Memory usage - Both models loaded = ~20GB RAM (if both used)
  5. Model paths hardcoded in default config - Must match exact GGUF file locations
  6. Tokenizer config warnings - Some GGUF models emit special_eos_id warnings (see Troubleshooting)

Future Enhancements (Planned)

  • v0.2.0: Token-by-token streaming (node-llama-cpp supports it)
  • v0.3.0: Tool support for Qwen-Coder or other tool-capable models
  • v0.4.0: Dynamic model discovery (auto-detect models in directory)
  • v0.5.0: Multi-model support (more than 2 models)

Testing

# Run unit tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with coverage
npm run test:coverage

# Type checking
npm run type-check

Coverage target: 80%+ (currently achieved)

Integration tests: Optional (require actual GGUF models, slow)

  • Set VENUS_LLAMA_INTEGRATION=true to enable
  • Marked as test.skip() by default
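
One way to implement this kind of gating is a small predicate over the environment. This is a sketch only: integrationTestsEnabled is a hypothetical helper, and it assumes the flag must be exactly the string "true", as written above.

```typescript
// Takes the environment as a parameter so the check is easy to unit test.
function integrationTestsEnabled(
  env: Record<string, string | undefined>,
): boolean {
  return env['VENUS_LLAMA_INTEGRATION'] === 'true';
}
```

In a Vitest suite this could be combined with describe.skipIf(!integrationTestsEnabled(process.env)) so the slow, model-dependent tests run only when the flag is set.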

Examples

Example 1: Venus Lilith Agent

import '@ml/llamacpp';
import { createVenusAgent } from '@venus/agent-core';
import { getProvider } from '@ml/core';
import { lilithPersonality } from '@venus/lilith';

const lilith = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),
});

// Fast response (3B model)
const greeting = await lilith.ask("Hi Lilith, how are you?");

// Extended thinking (14B model)
const analysis = await lilith.ask(
  "Analyze the security implications of this architecture",
  { extendedThinking: { enabled: true, budgetTokens: 8000 } }
);

Example 2: Custom Model Paths

import { LlamaCppMLProvider } from '@ml/llamacpp';

const provider = new LlamaCppMLProvider({
  models: {
    fast: {
      path: '/models/mistral-7b-q4.gguf',
      name: 'Mistral-7B',
      contextSize: 8192,
      gpuLayers: -1,
      useCase: 'fast',
    },
    reasoning: {
      path: '/models/llama-70b-q4.gguf',
      name: 'Llama-70B',
      contextSize: 8192,
      gpuLayers: 40,  // Only use 40 layers on GPU
      useCase: 'reasoning',
    },
  },
  verbose: true,
});

await provider.initialize();

Example 3: Query Direct Provider

import { createLlamaCppProvider } from '@ml/llamacpp';

const provider = createLlamaCppProvider();

const messages = provider.query({
  prompt: "Explain SOLID principles",
  systemPrompt: "You are a software architecture expert",
});

for await (const msg of messages) {
  if (msg.type === 'assistant') {
    console.log(msg.message.content[0].text);
  } else if (msg.type === 'result' && msg.subtype === 'success') {
    console.log("Done:", msg.result.text);
  }
}

Contributing

This package follows the Venus project standards:

  • TypeScript strict mode
  • Vitest for testing
  • SOLID principles (SRP, DIP, OCP)
  • DRY architecture (no duplication)

See @ml/core for provider interface details.

License

MIT

Related Packages

  • @ml/core - Provider-agnostic ML interfaces
  • @ml/claude - Claude Agent SDK provider
  • @ml/knowledge - Redis + semantic search + graph
  • @ml/tts - Text-to-speech synthesis
  • @venus/agent-core - Venus agent framework
  • @venus/agent-lilith - Lilith personality agent
  • @venus/agent-quinn - Quinn personality agent

Support

For issues or questions:

  1. Check model file paths and permissions
  2. Verify GPU drivers (CUDA/Metal)
  3. Review troubleshooting section above
  4. Check node-llama-cpp documentation

Built with: TypeScript, node-llama-cpp, Mistral AI models

Part of: Venus Tech Project - Local-first AI agent framework