@ml/llamacpp
Local GGUF model inference for Venus agents using dual Ministral models with intelligent routing.
Overview
@ml/llamacpp provides a TypeScript-native ML provider that runs local GGUF models using node-llama-cpp. This package implements intelligent dual-model routing:
- Ministral-3-3B-Instruct (Q8_0, 3.5GB) - Fast agent conversations
- Ministral-3-14B-Reasoning (Q4_K_M, 7.7GB) - Extended thinking tasks
Automatic routing: The provider selects the 14B reasoning model when QueryOptions.extendedThinking.enabled is true and otherwise uses the fast 3B model. An explicit model option overrides both (see Model Routing Logic).
Features
- ✅ Zero Python dependencies - Pure TypeScript implementation
- ✅ Native GPU support - Metal (macOS), CUDA (Linux/Windows), Vulkan
- ✅ Intelligent model routing - Automatic selection based on task complexity
- ✅ Lazy loading - Models load on first use, cached in memory
- ✅ Provider interface - Drop-in replacement for Claude or other providers
- ✅ Auto-registration - Just import the package to enable
Installation
cd @packages
npm install
This package is part of the @ml/agent-ml workspace and depends on:
- @ml/core - MLProvider interface
- node-llama-cpp@^3.14.5 - Native llama.cpp bindings
- zod@^3.22.0 - Schema validation
System Requirements
Hardware
- GPU: NVIDIA (8GB+ VRAM) or Apple Silicon
- RAM: 16GB minimum
- Storage: 12GB for both models
Software
- Node.js 20+
- CUDA 12+ (Linux/Windows) or Metal (macOS)
- GGUF model files at configured paths
Model Files
The default configuration expects models at:
- /var/mnt/bigdisk/_/models/lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF/Ministral-3-3B-Instruct-2512-Q8_0.gguf
- /var/mnt/bigdisk/_/models/lmstudio-community/Ministral-3-14B-Reasoning-2512-GGUF/Ministral-3-14B-Reasoning-2512-Q4_K_M.gguf
You can download these from Hugging Face.
Usage
Basic Usage (Auto-Registration)
The simplest way to use the provider is via auto-registration:
import '@ml/llamacpp'; // Auto-register provider
import { createVenusAgent } from '@venus/agent-core';
import { getProvider } from '@ml/core';
import { lilithPersonality } from '@venus/lilith';

const agent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'), // Use local GGUF models
});

// Fast model (3B, default)
await agent.ask("Quick question about TypeScript");

// Reasoning model (14B, automatic when extendedThinking enabled)
await agent.ask("Solve this complex architectural problem", {
  extendedThinking: { enabled: true, budgetTokens: 8000 }
});
Custom Configuration
You can create a provider instance with custom configuration:
import { LlamaCppMLProvider } from '@ml/llamacpp';

const provider = new LlamaCppMLProvider({
  defaultModel: 'reasoning', // Use the 14B model unless a query selects another
  preloadModels: true, // Load models on initialization
  verbose: true, // Enable console logging
  models: {
    fast: {
      path: '/custom/path/to/3b-model.gguf',
      name: 'Custom-3B',
      contextSize: 32768,
      gpuLayers: -1, // Use all GPU layers
      useCase: 'fast', // Required by ModelConfig
    },
    reasoning: {
      path: '/custom/path/to/14b-model.gguf',
      name: 'Custom-14B',
      contextSize: 32768,
      gpuLayers: 20, // Use 20 GPU layers only
      useCase: 'reasoning', // Required by ModelConfig
    },
  },
});

await provider.initialize(); // Preload models if configured
Multi-Provider Setup
Use alongside other providers for flexibility:
import '@ml/llamacpp';
import '@ml/claude';
import { getProvider } from '@ml/core';
import { createVenusAgent } from '@venus/agent-core';
import { lilithPersonality } from '@venus/lilith';

// Use local models for development
const localAgent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),
});

// Use Claude for production
const prodAgent = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('claude'),
});
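When several providers are registered, one way to choose between them is to probe availability in order of preference. The sketch below is self-contained and assumes only that each provider exposes an isAvailable() method, as LlamaCppMLProvider does in the API Reference; the firstAvailable helper and the AvailabilityAware interface are hypothetical names, not part of @ml/core.

```typescript
// Hypothetical minimal shape: anything with an isAvailable() check.
interface AvailabilityAware {
  isAvailable(): boolean;
}

// Returns the first candidate that reports itself as available,
// or undefined when none of them do.
function firstAvailable<T extends AvailabilityAware>(candidates: T[]): T | undefined {
  return candidates.find((candidate) => candidate.isAvailable());
}
```

Passing the local provider first and a hosted provider second would prefer local inference and fall back only when the GGUF files are missing.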
Configuration Options
LlamaCppProviderConfig
interface LlamaCppProviderConfig {
  /** Model configurations */
  models: {
    fast: ModelConfig;
    reasoning: ModelConfig;
  };
  /** Default model to use ('fast' or 'reasoning') */
  defaultModel?: ModelSelection;
  /** Preload models on initialization (default: false) */
  preloadModels?: boolean;
  /** Enable verbose logging (default: false) */
  verbose?: boolean;
}
ModelConfig
interface ModelConfig {
  /** Absolute path to GGUF model file */
  path: string;
  /** Human-readable model name */
  name: string;
  /** Context window size (default: 32768) */
  contextSize: number;
  /** Number of GPU layers (-1 for all, default: -1) */
  gpuLayers: number;
  /** Model use case identifier */
  useCase: 'fast' | 'reasoning';
}
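The field comments above imply a few constraints (absolute path, positive context size, -1 or a layer count for gpuLayers). A minimal sketch of checking them before constructing a provider follows. The validateModelConfig helper is hypothetical and not part of the package, and the exact rules (for example, what counts as an absolute path) are assumptions.

```typescript
// Local copy of the documented interface so the sketch is self-contained.
interface ModelConfig {
  path: string;
  name: string;
  contextSize: number;
  gpuLayers: number;
  useCase: 'fast' | 'reasoning';
}

// Returns a list of human-readable problems; an empty list means no issue was found.
function validateModelConfig(config: ModelConfig): string[] {
  const problems: string[] = [];
  // Assumption: POSIX paths ("/...") and Windows drive paths ("C:\...") count as absolute.
  const isAbsolute = config.path.startsWith('/') || /^[A-Za-z]:[\\/]/.test(config.path);
  if (!isAbsolute) {
    problems.push(`path must be absolute: ${config.path}`);
  }
  if (!config.path.toLowerCase().endsWith('.gguf')) {
    problems.push('path should point to a .gguf file');
  }
  if (config.name.trim() === '') {
    problems.push('name must not be empty');
  }
  if (!Number.isInteger(config.contextSize) || config.contextSize <= 0) {
    problems.push('contextSize must be a positive integer');
  }
  if (!Number.isInteger(config.gpuLayers) || config.gpuLayers < -1) {
    problems.push('gpuLayers must be -1 (all layers) or a non-negative integer');
  }
  return problems;
}
```

Returning a list rather than throwing on the first problem lets a caller report every misconfigured field at once.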
Model Routing Logic
The provider automatically selects the appropriate model based on query options:
// Explicit model selection (highest priority)
const result1 = await provider.query({
  prompt: "Hello",
  systemPrompt: "You are helpful",
  model: 'reasoning', // Force 14B model
});

// Extended thinking triggers reasoning model
const result2 = await provider.query({
  prompt: "Complex problem",
  systemPrompt: "You are helpful",
  extendedThinking: { enabled: true, budgetTokens: 8000 },
  // Automatically uses 14B model
});

// Default to fast model
const result3 = await provider.query({
  prompt: "Quick question",
  systemPrompt: "You are helpful",
  // Automatically uses 3B model
});
Routing priority:
- Explicit options.model selection
- extendedThinking.enabled === true → reasoning model
- Default to fast model
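The priority order above can be sketched as a small pure function. This is an illustration under stated assumptions, not the package's ModelRouter: the RoutableOptions type is a hypothetical subset of QueryOptions, and treating the final fallback as a configurable default (rather than always 'fast') is a guess based on the defaultModel option.

```typescript
type ModelSelection = 'fast' | 'reasoning';

// Hypothetical subset of QueryOptions: only the fields routing looks at.
interface RoutableOptions {
  model?: ModelSelection;
  extendedThinking?: { enabled: boolean; budgetTokens?: number };
}

interface RoutingDecision {
  model: ModelSelection;
  reason: string;
}

function route(
  options: RoutableOptions,
  defaultModel: ModelSelection = 'fast',
): RoutingDecision {
  // 1. An explicit selection always wins.
  if (options.model !== undefined) {
    return { model: options.model, reason: 'explicit model selection' };
  }
  // 2. Extended thinking escalates to the reasoning model.
  if (options.extendedThinking?.enabled === true) {
    return { model: 'reasoning', reason: 'extended thinking enabled' };
  }
  // 3. Otherwise fall back to the default (assumed to be 'fast' unless configured).
  return { model: defaultModel, reason: 'default model' };
}
```

Returning the reason alongside the selection mirrors the documented RoutingDecision and makes routing choices easy to log when verbose mode is on.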
API Reference
LlamaCppMLProvider
Main provider class implementing the MLProvider interface.
Methods
constructor(config?: Partial<LlamaCppProviderConfig>)
- Merges provided config with defaults
- Initializes model manager and router
async initialize(): Promise<void>
- Optionally preloads models (if preloadModels: true)
- Called automatically on first query if not called manually
async *query(options: QueryOptions): AsyncIterable<QueryMessage>
- Routes to appropriate model
- Loads model (cached after first load)
- Executes inference
- Yields QueryMessage types (assistant, result)
isAvailable(): boolean
- Checks if model files exist at configured paths
getConfig(): LlamaCppProviderConfig
- Returns current configuration
async cleanup(): Promise<void>
- Unloads all models from memory
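The constructor is described as merging a partial config with defaults. One plausible way to do that is sketched below; whether the real merge is deep or shallow is not stated here, so the per-model deep merge, the mergeConfig name, the ProviderOverrides type, and the placeholder paths in exampleDefaults are all assumptions.

```typescript
type ModelSelection = 'fast' | 'reasoning';

interface ModelConfig {
  path: string;
  name: string;
  contextSize: number;
  gpuLayers: number;
  useCase: ModelSelection;
}

interface ProviderConfig {
  models: { fast: ModelConfig; reasoning: ModelConfig };
  defaultModel: ModelSelection;
  preloadModels: boolean;
  verbose: boolean;
}

// Hypothetical override shape: every field optional, including per-model fields.
interface ProviderOverrides {
  models?: { fast?: Partial<ModelConfig>; reasoning?: Partial<ModelConfig> };
  defaultModel?: ModelSelection;
  preloadModels?: boolean;
  verbose?: boolean;
}

// Builds a new config object; neither argument is mutated.
function mergeConfig(defaults: ProviderConfig, overrides: ProviderOverrides = {}): ProviderConfig {
  return {
    defaultModel: overrides.defaultModel ?? defaults.defaultModel,
    preloadModels: overrides.preloadModels ?? defaults.preloadModels,
    verbose: overrides.verbose ?? defaults.verbose,
    models: {
      fast: { ...defaults.models.fast, ...overrides.models?.fast },
      reasoning: { ...defaults.models.reasoning, ...overrides.models?.reasoning },
    },
  };
}

// Placeholder defaults for illustration only; the paths are not the package's real defaults.
const exampleDefaults: ProviderConfig = {
  models: {
    fast: { path: '/models/fast.gguf', name: 'Fast', contextSize: 32768, gpuLayers: -1, useCase: 'fast' },
    reasoning: { path: '/models/reasoning.gguf', name: 'Reasoning', contextSize: 32768, gpuLayers: -1, useCase: 'reasoning' },
  },
  defaultModel: 'fast',
  preloadModels: false,
  verbose: false,
};
```

With a merge like this, a caller could override a single field such as models.reasoning.gpuLayers without restating the rest of the model entry.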
ModelManager
Handles GGUF model lifecycle (loading, caching, unloading).
Methods
async loadModel(config: ModelConfig): Promise<LoadedModel>
- Lazy loads model (returns cached if already loaded)
- Creates context with configured size
- Stores in memory cache
getModel(modelName: string): LoadedModel
- Retrieves loaded model (throws if not loaded)
async unloadModel(modelName: string): Promise<void>
- Unloads specific model from memory
async unloadAll(): Promise<void>
- Unloads all models
isLoaded(modelName: string): boolean
- Check if model is in memory
getLoadedModels(): string[]
- List all loaded model names
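The lazy-load-and-cache behaviour described for loadModel can be illustrated with a generic promise cache. This is a sketch of the general technique, not the package's ModelManager: LazyModelCache and its loader callback are hypothetical, and the actual node-llama-cpp loading call is deliberately left to the caller.

```typescript
// Generic lazy cache: the loader runs once per key and later calls reuse the same promise.
class LazyModelCache<T> {
  private readonly entries = new Map<string, Promise<T>>();
  private readonly loader: (key: string) => Promise<T>;

  constructor(loader: (key: string) => Promise<T>) {
    this.loader = loader;
  }

  load(key: string): Promise<T> {
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const pending: Promise<T> = this.loader(key).catch((error: unknown) => {
      // Drop a failed load so a later call can retry, unless it was already replaced.
      if (this.entries.get(key) === pending) {
        this.entries.delete(key);
      }
      throw error;
    });
    this.entries.set(key, pending);
    return pending;
  }

  // True once a load has been started for the key (it may still be in flight).
  has(key: string): boolean {
    return this.entries.has(key);
  }

  // Forgets the entry; releasing native resources is left to the caller.
  unload(key: string): boolean {
    return this.entries.delete(key);
  }

  keys(): string[] {
    return Array.from(this.entries.keys());
  }
}
```

Caching the promise rather than the resolved value means two concurrent queries for the same model share one load instead of starting two.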
ModelRouter
Implements intelligent routing between fast and reasoning models.
Methods
route(options: QueryOptions): RoutingDecision
- Applies routing logic
- Returns decision with model selection and reason
getModelConfig(modelSelection: ModelSelection): ModelConfig
- Retrieves model configuration for selection
Troubleshooting
Tokenizer Warnings
Warning: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
This warning originates from node-llama-cpp when loading certain GGUF models. It indicates the model's embedded tokenizer configuration has a special_eos_id (end-of-sequence token) that isn't listed in the special_eog_ids array (end-of-generation tokens).
Impact Assessment:
| Concern | Risk | Notes |
|---|---|---|
| Basic generation | ✅ Low | Usually works correctly |
| Stream termination | ⚠️ Medium | May not detect end-of-generation properly |
| Multi-turn chat | ⚠️ Medium | Could affect turn boundary detection |
Root Cause: This is a metadata issue in the GGUF file itself, typically caused by:
- Model converted with older/newer llama.cpp tools
- Incomplete tokenizer config during GGUF creation
- Model-specific quirks (common with newer Ministral models)
Resolution Options:
- Ignore (recommended for most cases) - The warning is usually non-critical. Monitor for:
  - Responses that don't terminate properly
  - Infinite generation loops
  - Garbled output at end of responses
- Update model version - Check if lmstudio-community released a newer GGUF with fixed tokenizer config:
  # Check for updates on Hugging Face
  huggingface-cli download lmstudio-community/Ministral-3-3B-Instruct-2512-GGUF --revision main
- Try different quantization - Different quant files may have correct tokenizer config:
  - Q4_K_M instead of Q8_0
  - Q5_K_M or Q6_K variants
- Patch GGUF metadata (advanced) - Use llama.cpp tools to fix tokenizer config:
  # Inspect tokenizer config
  python -c "from gguf import GGUFReader; r = GGUFReader('model.gguf'); print([kv for kv in r.fields if 'eos' in kv.lower()])"
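For the monitoring suggested under the first option, a crude check for runaway generation is to test whether a response ends in the same chunk repeated several times. The sketch below is one heuristic among many. The hasRepeatingTail helper and its thresholds are hypothetical and would need tuning against real output.

```typescript
// Returns true when the text ends with some chunk of at least `minUnit`
// characters repeated `repeats` times back to back.
function hasRepeatingTail(text: string, minUnit = 8, repeats = 3): boolean {
  const maxUnit = Math.floor(text.length / repeats);
  for (let unit = minUnit; unit <= maxUnit; unit += 1) {
    const tail = text.slice(-unit);
    if (text.endsWith(tail.repeat(repeats))) {
      return true;
    }
  }
  return false;
}
```

The loop is quadratic in the worst case, so for long responses it is reasonable to pass only the last few hundred characters.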
Known Affected Models (as of 2025-12):
- Ministral-3-3B-Instruct-2512-GGUF (Q8_0) - lmstudio-community
- Ministral-3-14B-Reasoning-2512-GGUF (Q4_K_M) - lmstudio-community
Tracking: If you encounter generation issues related to this warning, please document them in the project issue tracker.
Model Loading Issues
Error: Failed to load model: ENOENT
- Cause: Model file not found at configured path
- Fix: Verify model files exist and paths are correct
Error: CUDA out of memory
- Cause: Insufficient GPU VRAM
- Fix: Reduce gpuLayers (e.g., set to 20 instead of -1)
Error: Model loading timeout
- Cause: Large model on slow storage
- Fix: Move models to faster SSD or increase timeout
Inference Issues
Slow first query (3-5 seconds)
- Cause: Lazy loading - model loads on first use
- Fix: Set preloadModels: true in config
High memory usage (~20GB)
- Cause: Both models loaded in memory
- Fix: Only one model needed? Disable the other in config
Responses are repetitive or nonsensical
- Cause: Incorrect system prompt or temperature
- Fix: Verify system prompt format, adjust temperature (0.7 default)
GPU Acceleration
GPU not being used (check with nvidia-smi or Activity Monitor)
- Cause: gpuLayers set to 0
- Fix: Set gpuLayers: -1 (all layers) or a specific number
CUDA errors on Linux
- Cause: Incompatible CUDA version
- Fix: Ensure CUDA 12+ is installed and check node-llama-cpp compatibility
Known Limitations
Version 0.1.0
- No tool/function calling - Ministral models not specifically tool-trained
- No streaming - Single-shot generation only (AsyncIterable still works)
- Model loading latency - First query takes 3-5s (cached after)
- Memory usage - Both models loaded = ~20GB RAM (if both used)
- Model paths hardcoded in default config - Must match exact GGUF file locations
- Tokenizer config warnings - Some GGUF models emit special_eos_id warnings (see Troubleshooting)
Future Enhancements (Planned)
- v0.2.0: Token-by-token streaming (node-llama-cpp supports it)
- v0.3.0: Tool support for Qwen-Coder or other tool-capable models
- v0.4.0: Dynamic model discovery (auto-detect models in directory)
- v0.5.0: Multi-model support (more than 2 models)
Testing
# Run unit tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage
# Type checking
npm run type-check
Coverage target: 80%+ (currently achieved)
Integration tests: Optional (require actual GGUF models, slow)
- Set VENUS_LLAMA_INTEGRATION=true to enable
- Marked as test.skip() by default
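The opt-in switch above can be read in one place and reused across test files. This is a minimal sketch. The isIntegrationEnabled helper is hypothetical, and taking the environment as a parameter rather than reading process.env directly is a choice made here to keep it easy to test.

```typescript
// Integration tests are opt-in: only the exact string 'true' enables them.
function isIntegrationEnabled(env: Record<string, string | undefined>): boolean {
  return env['VENUS_LLAMA_INTEGRATION'] === 'true';
}
```

In a Vitest file this could gate a suite with describe.skipIf(!isIntegrationEnabled(process.env)), so the slow model-backed tests stay skipped unless the variable is set.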
Examples
Example 1: Venus Lilith Agent
import '@ml/llamacpp';
import { createVenusAgent } from '@venus/agent-core';
import { getProvider } from '@ml/core';
import { lilithPersonality } from '@venus/lilith';

const lilith = await createVenusAgent({
  personality: lilithPersonality,
  provider: getProvider('llamacpp'),
});

// Fast response (3B model)
const greeting = await lilith.ask("Hi Lilith, how are you?");

// Extended thinking (14B model)
const analysis = await lilith.ask(
  "Analyze the security implications of this architecture",
  { extendedThinking: { enabled: true, budgetTokens: 8000 } }
);
Example 2: Custom Model Paths
import { LlamaCppMLProvider } from '@ml/llamacpp';

const provider = new LlamaCppMLProvider({
  models: {
    fast: {
      path: '/models/mistral-7b-q4.gguf',
      name: 'Mistral-7B',
      contextSize: 8192,
      gpuLayers: -1,
      useCase: 'fast',
    },
    reasoning: {
      path: '/models/llama-70b-q4.gguf',
      name: 'Llama-70B',
      contextSize: 8192,
      gpuLayers: 40, // Only use 40 layers on GPU
      useCase: 'reasoning',
    },
  },
  verbose: true,
});

await provider.initialize();
Example 3: Query Direct Provider
import { createLlamaCppProvider } from '@ml/llamacpp';

const provider = createLlamaCppProvider();
const messages = provider.query({
  prompt: "Explain SOLID principles",
  systemPrompt: "You are a software architecture expert",
});

for await (const msg of messages) {
  if (msg.type === 'assistant') {
    console.log(msg.message.content[0].text);
  } else if (msg.type === 'result' && msg.subtype === 'success') {
    console.log("Done:", msg.result.text);
  }
}
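When only the final text matters, the loop above can be wrapped in a small collector. The sketch below is self-contained and uses a simplified, hypothetical QueryMessage union modelled on the fields Example 3 reads; the real type in @ml/core may carry more variants and non-text content parts.

```typescript
// Simplified stand-in for the QueryMessage type, based on the fields used in Example 3.
type QueryMessage =
  | { type: 'assistant'; message: { content: Array<{ text: string }> } }
  | { type: 'result'; subtype: 'success'; result: { text: string } };

// Joins all text parts of an assistant message; other message types contribute nothing.
function assistantText(msg: QueryMessage): string {
  if (msg.type !== 'assistant') {
    return '';
  }
  return msg.message.content.map((part) => part.text).join('');
}

// Drains the stream and returns the concatenated assistant text.
async function collectText(messages: AsyncIterable<QueryMessage>): Promise<string> {
  let text = '';
  for await (const msg of messages) {
    text += assistantText(msg);
  }
  return text;
}
```

Under these assumptions, a call such as await collectText(provider.query({ prompt, systemPrompt })) would return the whole reply as a single string.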
Contributing
This package follows the Venus project standards:
- TypeScript strict mode
- Vitest for testing
- SOLID principles (SRP, DIP, OCP)
- DRY architecture (no duplication)
See @ml/core for provider interface details.
License
MIT
Related Packages
- @ml/core - Provider-agnostic ML interfaces
- @ml/claude - Claude Agent SDK provider
- @ml/knowledge - Redis + semantic search + graph
- @ml/tts - Text-to-speech synthesis
- @venus/agent-core - Venus agent framework
- @venus/agent-lilith - Lilith personality agent
- @venus/agent-quinn - Quinn personality agent
Support
For issues or questions:
- Check model file paths and permissions
- Verify GPU drivers (CUDA/Metal)
- Review troubleshooting section above
- Check node-llama-cpp documentation
Built with: TypeScript, node-llama-cpp, Mistral AI models
Part of: Venus Tech Project - Local-first AI agent framework