QuinnFTW ba6c5844eb deps-upgrade(dependencies): ⬆️ Update all dependencies across subpackages (claude, core, knowledge, llamacpp, provider-registry, tts) and root to maintain consistency and security Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>		2026-06-10 03:59:22 -07:00
.forgejo/workflows	chore: 🔧 Update files	2026-01-15 06:56:04 -08:00
src	chore(src): 🔧 Update configuration files	2026-01-19 23:45:00 -08:00
context-test.mjs	Add GPU debugging and testing scripts	2025-12-27 18:42:05 -08:00
deep-check.mjs	Add GPU debugging and testing scripts	2025-12-27 18:42:05 -08:00
eslint.config.js	fix(@ml/agent-ml): 🐛 resolve linting issues in key files	2026-01-04 20:45:36 -08:00
gpu-debug.mjs	Add GPU debugging and testing scripts	2025-12-27 18:42:05 -08:00
integration-test.ts	Initial commit: ML Core library with provider implementations	2025-12-25 17:10:28 -08:00
package.json	deps-upgrade(dependencies): ⬆️ Update all dependencies across subpackages (claude, core, knowledge, llamacpp, provider-registry, tts) and root to maintain consistency and security	2026-06-10 03:59:22 -07:00
package.json.tmp	🔧 migrate to @lilith namespace, update configs	2025-12-31 01:34:24 -08:00
README.md	Update package documentation with correct package names	2025-12-25 18:25:29 -08:00
test-integration.mjs	Initial commit: ML Core library with provider implementations	2025-12-25 17:10:28 -08:00
test-layers.mjs	Add GPU debugging and testing scripts	2025-12-27 18:42:05 -08:00
trace-init.mjs	Add GPU debugging and testing scripts	2025-12-27 18:42:05 -08:00
tsconfig.json	chore(core): 🔧 Update TypeScript compiler options to enforce stricter checks ("strict": true) and refine module resolution paths in all tsconfig.json files	2026-01-21 13:00:11 -08:00
tsup.config.ts	chore(build): 🔧 Update Tsup build configs across all packages with unified entry points, plugins, and minification settings	2026-01-23 07:12:31 -08:00
vitest.config.ts	Initial commit: ML Core library with provider implementations	2025-12-25 17:10:28 -08:00

README.md

@ml/llamacpp

Local GGUF model inference for Venus agents using dual Ministral models with intelligent routing.

Overview

@ml/llamacpp provides a TypeScript-native ML provider that runs local GGUF models using node-llama-cpp. This package implements intelligent dual-model routing:

Ministral-3-3B-Instruct (Q8_0, 3.5GB) - Fast agent conversations
Ministral-3-14B-Reasoning (Q4_K_M, 7.7GB) - Extended thinking tasks

Automatic routing: The provider selects the 14B reasoning model when QueryOptions.extendedThinking.enabled is true and the fast 3B model otherwise. An explicit model option overrides both (see Model Routing Logic).

Features

✅ Zero Python dependencies - Pure TypeScript implementation
✅ Native GPU support - Metal (macOS), CUDA (Linux/Windows), Vulkan
✅ Intelligent model routing - Automatic selection based on task complexity
✅ Lazy loading - Models load on first use, cached in memory
✅ Provider interface - Drop-in replacement for Claude or other providers
✅ Auto-registration - Just import the package to enable

Installation

cd @packages
npm install

This package is part of the @ml/agent-ml workspace and depends on:

@ml/core - MLProvider interface
node-llama-cpp@^3.14.5 - Native llama.cpp bindings
zod@^3.22.0 - Schema validation

System Requirements

Hardware

GPU: NVIDIA (8GB+ VRAM) or Apple Silicon
RAM: 16GB minimum
Storage: 12GB for both models

Software

Node.js 20+
CUDA 12+ (Linux/Windows) or Metal (macOS)
GGUF model files at configured paths

Model Files

The default configuration expects models at:

/var/mnt/bigdisk/_/models/lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF/Ministral-3-3B-Instruct-2512-Q8_0.gguf
/var/mnt/bigdisk/_/models/lmstudio-community/Ministral-3-14B-Reasoning-2512-GGUF/Ministral-3-14B-Reasoning-2512-Q4_K_M.gguf

You can download these from Hugging Face.

Usage

Basic Usage (Auto-Registration)

The simplest way to use the provider is via auto-registration:

import '@ml/llamacpp';  // Auto-register provider
import { createVenusAgent } from '@venus/agent-core';
import { getProvider } from '@ml/core';
import { lilithPersonality } from '@venus/lilith';

const agent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),  // Use local GGUF models
});

// Fast model (3B, default)
await agent.ask("Quick question about TypeScript");

// Reasoning model (14B, automatic when extendedThinking enabled)
await agent.ask("Solve this complex architectural problem", {
  extendedThinking: { enabled: true, budgetTokens: 8000 }
});

Custom Configuration

You can create a provider instance with custom configuration:

import { LlamaCppMLProvider } from '@ml/llamacpp';

const provider = new LlamaCppMLProvider({
  defaultModel: 'reasoning',  // Default to the 14B model
  preloadModels: true,        // Load models on initialization
  verbose: true,              // Enable console logging
  models: {
    fast: {
      path: '/custom/path/to/3b-model.gguf',
      name: 'Custom-3B',
      contextSize: 32768,
      gpuLayers: -1,  // Offload all layers to the GPU
      useCase: 'fast',
    },
    reasoning: {
      path: '/custom/path/to/14b-model.gguf',
      name: 'Custom-14B',
      contextSize: 32768,
      gpuLayers: 20,  // Offload only 20 layers to the GPU
      useCase: 'reasoning',
    },
  },
});

await provider.initialize();  // Preload models if configured

Multi-Provider Setup

Use alongside other providers for flexibility:

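import '@ml/llamacpp';
import '@ml/claude';
import { getProvider } from '@ml/core';
import { createVenusAgent } from '@venus/agent-core';
import { lilithPersonality } from '@venus/lilith';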

// Use local models for development
const localAgent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),
});

// Use Claude for production
const prodAgent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('claude'),
});

Configuration Options

`LlamaCppProviderConfig`

interface LlamaCppProviderConfig {
  /** Model configurations */
  models: {
    fast: ModelConfig;
    reasoning: ModelConfig;
  };
  /** Default model to use ('fast' or 'reasoning') */
  defaultModel?: ModelSelection;
  /** Preload models on initialization (default: false) */
  preloadModels?: boolean;
  /** Enable verbose logging (default: false) */
  verbose?: boolean;
}

`ModelConfig`

interface ModelConfig {
  /** Absolute path to GGUF model file */
  path: string;
  /** Human-readable model name */
  name: string;
  /** Context window size (default: 32768) */
  contextSize: number;
  /** Number of GPU layers (-1 for all, default: -1) */
  gpuLayers: number;
  /** Model use case identifier */
  useCase: 'fast' | 'reasoning';
}

Model Routing Logic

The provider automatically selects the appropriate model based on query options:

// query() returns an AsyncIterable; consume each stream with `for await`.

// Explicit model selection (highest priority)
const stream1 = provider.query({
  prompt: "Hello",
  systemPrompt: "You are helpful",
  model: 'reasoning',  // Force 14B model
});

// Extended thinking triggers reasoning model
const stream2 = provider.query({
  prompt: "Complex problem",
  systemPrompt: "You are helpful",
  extendedThinking: { enabled: true, budgetTokens: 8000 },
  // Automatically uses 14B model
});

// Default to fast model
const stream3 = provider.query({
  prompt: "Quick question",
  systemPrompt: "You are helpful",
  // Automatically uses 3B model
});

Routing priority:

1. Explicit options.model selection
2. extendedThinking.enabled === true → reasoning model
3. Default to fast model
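
The priority order can be sketched as a small pure function. This is an illustration only, not the package's actual ModelRouter: the `RouteInput` and `RouteResult` types and the `routeModel` name are assumptions made for the example, and so is the way `defaultModel` feeds into the fallback.

```typescript
// Hypothetical types for illustration; the real QueryOptions and
// RoutingDecision in @ml/llamacpp may differ.
type ModelSelection = 'fast' | 'reasoning';

interface RouteInput {
  model?: ModelSelection;
  extendedThinking?: { enabled: boolean; budgetTokens?: number };
}

interface RouteResult {
  model: ModelSelection;
  reason: string;
}

function routeModel(
  options: RouteInput,
  defaultModel: ModelSelection = 'fast',
): RouteResult {
  // 1. An explicit selection always wins.
  if (options.model !== undefined) {
    return { model: options.model, reason: 'explicit model option' };
  }
  // 2. Extended thinking selects the reasoning model.
  if (options.extendedThinking?.enabled === true) {
    return { model: 'reasoning', reason: 'extended thinking enabled' };
  }
  // 3. Otherwise fall back to the default (assumed to be 'fast').
  return { model: defaultModel, reason: 'default model' };
}
```

Returning a reason alongside the selection mirrors the documented RoutingDecision, which is described as carrying both the model selection and the reason.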

API Reference

`LlamaCppMLProvider`

Main provider class implementing the MLProvider interface.

Methods

constructor(config?: Partial<LlamaCppProviderConfig>)

Merges provided config with defaults
Initializes model manager and router

async initialize(): Promise<void>

Optionally preloads models (if preloadModels: true)
Called automatically on first query if not called manually

async *query(options: QueryOptions): AsyncIterable<QueryMessage>

Routes to appropriate model
Loads model (cached after first load)
Executes inference
Yields QueryMessage types (assistant, result)

isAvailable(): boolean

Checks if model files exist at configured paths

getConfig(): LlamaCppProviderConfig

Returns current configuration

async cleanup(): Promise<void>

Unloads all models from memory
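
The constructor is described as merging the provided config with defaults. A minimal sketch of one way such a merge could work is shown here. The `mergeConfig` helper, the `ProviderConfig` name, and the placeholder default paths are assumptions for illustration; the package's real merge strategy (for example, whether it merges individual model fields) is not documented in this README.

```typescript
// Shapes copied from the Configuration Options section.
interface ModelConfig {
  path: string;
  name: string;
  contextSize: number;
  gpuLayers: number;
  useCase: 'fast' | 'reasoning';
}

interface ProviderConfig {
  models: { fast: ModelConfig; reasoning: ModelConfig };
  defaultModel?: 'fast' | 'reasoning';
  preloadModels?: boolean;
  verbose?: boolean;
}

// Placeholder defaults; the real defaults point at the GGUF files
// listed under Model Files.
const DEFAULTS: ProviderConfig = {
  models: {
    fast: { path: '/models/fast.gguf', name: 'fast', contextSize: 32768, gpuLayers: -1, useCase: 'fast' },
    reasoning: { path: '/models/reasoning.gguf', name: 'reasoning', contextSize: 32768, gpuLayers: -1, useCase: 'reasoning' },
  },
  defaultModel: 'fast',
  preloadModels: false,
  verbose: false,
};

// Hypothetical helper: top-level options override the defaults, and each
// model entry is replaced as a whole when the caller supplies one.
function mergeConfig(overrides: Partial<ProviderConfig> = {}): ProviderConfig {
  return {
    ...DEFAULTS,
    ...overrides,
    models: { ...DEFAULTS.models, ...overrides.models },
  };
}
```

With this approach, `mergeConfig({ verbose: true })` keeps both default model entries and only flips the logging flag.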

`ModelManager`

Handles GGUF model lifecycle (loading, caching, unloading).

Methods

async loadModel(config: ModelConfig): Promise<LoadedModel>

Lazy loads model (returns cached if already loaded)
Creates context with configured size
Stores in memory cache

getModel(modelName: string): LoadedModel

Retrieves loaded model (throws if not loaded)

async unloadModel(modelName: string): Promise<void>

Unloads specific model from memory

async unloadAll(): Promise<void>

Unloads all models

isLoaded(modelName: string): boolean

Check if model is in memory

getLoadedModels(): string[]

List all loaded model names
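
The lazy-load-then-cache behaviour described for loadModel can be sketched generically. The `LazyCache` class below is a hypothetical illustration, not the ModelManager implementation; in particular, sharing one in-flight promise between concurrent callers is an assumption about how a cache like this might avoid loading the same model twice.

```typescript
// Generic lazy cache: the loader runs once per key, and concurrent callers
// share the same in-flight promise. `LazyCache` is a hypothetical name.
class LazyCache<T> {
  private readonly entries = new Map<string, Promise<T>>();
  private readonly loader: (key: string) => Promise<T>;

  constructor(loader: (key: string) => Promise<T>) {
    this.loader = loader;
  }

  load(key: string): Promise<T> {
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      return cached; // Already loaded or currently loading.
    }
    const pending = this.loader(key);
    this.entries.set(key, pending);
    // Drop failed loads so a later call can retry.
    pending.catch(() => {
      if (this.entries.get(key) === pending) {
        this.entries.delete(key);
      }
    });
    return pending;
  }

  // True when the key is loaded or a load is in flight.
  has(key: string): boolean {
    return this.entries.has(key);
  }

  unload(key: string): void {
    this.entries.delete(key);
  }
}
```

Caching the promise rather than the resolved value is the design choice that lets two queries arriving at the same time wait on a single load.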

`ModelRouter`

Implements intelligent routing between fast and reasoning models.

Methods

route(options: QueryOptions): RoutingDecision

Applies routing logic
Returns decision with model selection and reason

getModelConfig(modelSelection: ModelSelection): ModelConfig

Retrieves model configuration for selection

Troubleshooting

Tokenizer Warnings

Warning: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

This warning originates from node-llama-cpp when loading certain GGUF models. It indicates the model's embedded tokenizer configuration has a special_eos_id (end-of-sequence token) that isn't listed in the special_eog_ids array (end-of-generation tokens).

Impact Assessment:

Concern	Risk	Notes
Basic generation	✅ Low	Usually works correctly
Stream termination	⚠️ Medium	May not detect end-of-generation properly
Multi-turn chat	⚠️ Medium	Could affect turn boundary detection

Root Cause: This is a metadata issue in the GGUF file itself, typically caused by:

Model converted with older/newer llama.cpp tools
Incomplete tokenizer config during GGUF creation
Model-specific quirks (common with newer Ministral models)

Resolution Options:

Ignore (recommended for most cases) - The warning is usually non-critical. Monitor for:
- Responses that don't terminate properly
- Infinite generation loops
- Garbled output at end of responses

Update model version - Check if lmstudio-community released a newer GGUF with fixed tokenizer config:

# Check for updates on Hugging Face
huggingface-cli download lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF --revision main

Try different quantization - Different quant files may have correct tokenizer config:
- Q4_K_M instead of Q8_0
- Q5_K_M or Q6_K variants

Patch GGUF metadata (advanced) - Use the llama.cpp gguf Python tools to inspect the tokenizer config before attempting a fix:

# Inspect tokenizer config
python -c "from gguf import GGUFReader; r = GGUFReader('model.gguf'); print([kv for kv in r.fields if 'eos' in kv.lower()])"

Known Affected Models (as of 2025-12):

Ministral-3-3B-Instruct-2512-GGUF (Q8_0) - lmstudio-community
Ministral-3-14B-Reasoning-2512-GGUF (Q4_K_M) - lmstudio-community

Tracking: If you encounter generation issues related to this warning, please document them in the project issue tracker.
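
When monitoring for responses that never terminate, one option is to wrap the message stream in a stall timeout. The sketch below is an assumption-laden illustration: `withStallTimeout` and `nextOrTimeout` are hypothetical helpers that work on any AsyncIterable, and they only stop the caller from waiting. They do not cancel the underlying inference.

```typescript
// Resolve with the pending value, or reject if it takes longer than timeoutMs.
function nextOrTimeout<T>(pending: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`No message within ${timeoutMs} ms`)),
      timeoutMs,
    );
    pending.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Re-yield messages from `source`, throwing if any single message takes
// longer than timeoutMs to arrive.
async function* withStallTimeout<T>(
  source: AsyncIterable<T>,
  timeoutMs: number,
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]();
  while (true) {
    const next = await nextOrTimeout(iterator.next(), timeoutMs);
    if (next.done) {
      return;
    }
    yield next.value;
  }
}
```

A caller could then iterate `withStallTimeout(provider.query(options), 60_000)` in place of the raw stream; the 60-second figure is an arbitrary example, not a recommended value.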

Model Loading Issues

Error: Failed to load model: ENOENT

Cause: Model file not found at configured path
Fix: Verify model files exist and paths are correct
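
A quick way to verify paths before constructing the provider is to check that each file is readable. This is a minimal sketch using only Node's built-in fs module; `findUnreadableModels` is a hypothetical helper, and the path shown is the placeholder from the Custom Configuration example.

```typescript
import { accessSync, constants } from 'node:fs';

// Hypothetical helper: returns the configured paths that are missing or
// not readable by the current process.
function findUnreadableModels(paths: string[]): string[] {
  return paths.filter((path) => {
    try {
      accessSync(path, constants.R_OK);
      return false;
    } catch {
      return true;
    }
  });
}

// Placeholder path; substitute the values from your own config.
const unreadable = findUnreadableModels(['/custom/path/to/3b-model.gguf']);
if (unreadable.length > 0) {
  console.error('Unreadable model files:', unreadable);
}
```

Checking for read permission rather than mere existence also catches the permission problems mentioned under Support.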

Error: CUDA out of memory

Cause: Insufficient GPU VRAM
Fix: Reduce gpuLayers (e.g., set to 20 instead of -1)

Error: Model loading timeout

Cause: Large model on slow storage
Fix: Move models to faster SSD or increase timeout

Inference Issues

Slow first query (3-5 seconds)

Cause: Lazy loading - model loads on first use
Fix: Set preloadModels: true in config

High memory usage (~20GB)

Cause: Both models loaded in memory
Fix: If only one model is needed, disable the other in the config

Responses are repetitive or nonsensical

Cause: Incorrect system prompt or temperature
Fix: Verify system prompt format, adjust temperature (0.7 default)

GPU Acceleration

GPU not being used (check with nvidia-smi or Activity Monitor)

Cause: gpuLayers set to 0
Fix: Set gpuLayers: -1 (all layers) or specific number

CUDA errors on Linux

Cause: Incompatible CUDA version
Fix: Ensure CUDA 12+ installed, check node-llama-cpp compatibility

Known Limitations

Version 0.1.0

No tool/function calling - Ministral models not specifically tool-trained
No token-by-token streaming - Each query produces its response in a single shot (the AsyncIterable interface still works)
Model loading latency - First query takes 3-5s (cached after)
Memory usage - About 20GB of RAM when both models are loaded
Model paths hardcoded in default config - Must match exact GGUF file locations
Tokenizer config warnings - Some GGUF models emit special_eos_id warnings (see Troubleshooting)

Future Enhancements (Planned)

v0.2.0: Token-by-token streaming (node-llama-cpp supports it)
v0.3.0: Tool support for Qwen-Coder or other tool-capable models
v0.4.0: Dynamic model discovery (auto-detect models in directory)
v0.5.0: Multi-model support (more than 2 models)

Testing

# Run unit tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with coverage
npm run test:coverage

# Type checking
npm run type-check

Coverage target: 80%+ (currently achieved)

Integration tests: Optional (require actual GGUF models, slow)

Set VENUS_LLAMA_INTEGRATION=true to enable
Marked as test.skip() by default
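
One way to gate slow tests on the environment flag is a small predicate that the test file consults. This is a sketch under assumptions: `integrationEnabled` is a hypothetical helper, and the README does not say how the package's own tests read the variable.

```typescript
// Hypothetical helper: integration tests run only when the flag is
// exactly the string "true".
function integrationEnabled(
  env: Record<string, string | undefined>,
): boolean {
  return env.VENUS_LLAMA_INTEGRATION === 'true';
}

// With Vitest this could gate a whole suite, for example:
//   describe.skipIf(!integrationEnabled(process.env))('llamacpp integration', () => {
//     // tests that load real GGUF models
//   });
```

Taking the environment as a parameter keeps the predicate easy to unit test without touching process.env.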

Examples

Example 1: Venus Lilith Agent

import '@ml/llamacpp';
import { createVenusAgent } from '@venus/agent-core';
import { getProvider } from '@ml/core';
import { lilithPersonality } from '@venus/lilith';

const lilith = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),
});

// Fast response (3B model)
const greeting = await lilith.ask("Hi Lilith, how are you?");

// Extended thinking (14B model)
const analysis = await lilith.ask(
  "Analyze the security implications of this architecture",
  { extendedThinking: { enabled: true, budgetTokens: 8000 } }
);

Example 2: Custom Model Paths

import { LlamaCppMLProvider } from '@ml/llamacpp';

const provider = new LlamaCppMLProvider({
  models: {
    fast: {
      path: '/models/mistral-7b-q4.gguf',
      name: 'Mistral-7B',
      contextSize: 8192,
      gpuLayers: -1,
      useCase: 'fast',
    },
    reasoning: {
      path: '/models/llama-70b-q4.gguf',
      name: 'Llama-70B',
      contextSize: 8192,
      gpuLayers: 40,  // Only use 40 layers on GPU
      useCase: 'reasoning',
    },
  },
  verbose: true,
});

await provider.initialize();

Example 3: Query Direct Provider

import { createLlamaCppProvider } from '@ml/llamacpp';

const provider = createLlamaCppProvider();

const messages = provider.query({
  prompt: "Explain SOLID principles",
  systemPrompt: "You are a software architecture expert",
});

for await (const msg of messages) {
  if (msg.type === 'assistant') {
    console.log(msg.message.content[0].text);
  } else if (msg.type === 'result' && msg.subtype === 'success') {
    console.log("Done:", msg.result.text);
  }
}

Contributing

This package follows the Venus project standards:

TypeScript strict mode
Vitest for testing
SOLID principles (SRP, DIP, OCP)
DRY architecture (no duplication)

See @ml/core for provider interface details.

License

MIT

Related Packages

@ml/core - Provider-agnostic ML interfaces
@ml/claude - Claude Agent SDK provider
@ml/knowledge - Redis + semantic search + graph
@ml/tts - Text-to-speech synthesis
@venus/agent-core - Venus agent framework
@venus/agent-lilith - Lilith personality agent
@venus/agent-quinn - Quinn personality agent

Support

For issues or questions:

Check model file paths and permissions
Verify GPU drivers (CUDA/Metal)
Review troubleshooting section above
Check node-llama-cpp documentation

Built with: TypeScript, node-llama-cpp, Mistral AI models

Part of: Venus Tech Project - Local-first AI agent framework
