Conversation Assistant ML Service
FastAPI-based ML inference service with intelligent response generation, conversation memory, style adaptation, and message triage.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ ML Service (Port 8100) │
├─────────────────────────────────────────────────────────────┤
│ Core Endpoints │
│ ├── /generate - Sync text generation │
│ ├── /generate/async - Async job queue │
│ ├── /training/start - Start LoRA fine-tuning │
│ ├── /model/deploy - Hot-swap trained model │
│ └── /health - Health status │
├─────────────────────────────────────────────────────────────┤
│ ML Feature Endpoints │
│ ├── /suggestions - Multi-option response generation │
│ ├── /memory/* - Conversation memory (RAG) │
│ ├── /style/* - Style learning & adaptation │
│ └── /triage - Message urgency scoring │
├─────────────────────────────────────────────────────────────┤
│ Components │
│ ├── LLM Manager - GGUF model loading (llama-cpp) │
│ ├── LoRA Trainer - QLoRA fine-tuning (peft/trl) │
│ ├── Memory Store - Redis VSS + nomic-embed │
│ ├── Style Adapter - Per-contact style profiles │
│ ├── Intent Classifier - Message understanding │
│ └── Redis Client - Caching + job queuing │
└─────────────────────────────────────────────────────────────┘
Quick Start
# 1. Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# 2. Install dependencies
pip install -e .
pip install -e ~/Code/@packages/@ml/@tools/model-loader
pip install lilith-fastapi-service-base --extra-index-url https://forge.nasty.sh/api/packages/lilith/pypi/simple/
# 3. Copy environment configuration
cp .env.example .env
# 4. Start service
python -m uvicorn src.main:app --host 0.0.0.0 --port 8100 --reload
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL_NAME | meta-llama/Llama-3.2-3B-Instruct | Base model for inference |
| MODEL_CACHE_DIR | /opt/conversation-ml/models | Model download directory |
| MAX_MODEL_LENGTH | 4096 | Maximum context length |
| TEMPERATURE | 0.7 | Generation temperature |
| TOP_P | 0.95 | Top-p sampling |
| REDIS_HOST | 0.1984.nasty.sh | Redis host |
| REDIS_PORT | 6379 | Redis port |
| REDIS_PASSWORD | - | Redis password (required) |
| REDIS_DB | 0 | Redis database number |
| SERVICE_PORT | 8100 | Service port |
| LOG_LEVEL | info | Logging level |
| WORKERS | 2 | Uvicorn workers |
| CUDA_VISIBLE_DEVICES | 0 | GPU device(s) |
| GPU_MEMORY_UTILIZATION | 0.8 | GPU memory limit |
| API_KEY | - | API authentication key |
| ALLOWED_HOSTS | 10.9.0.0/24,10.8.0.0/24 | VPN CIDR ranges |
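As a rough illustration of how these variables and defaults fit together, here is a minimal standard-library sketch. It is not the service's own config.py (which is listed as using pydantic-settings); the Settings class and load_settings function are hypothetical names, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""

    model_name: str
    max_model_length: int
    temperature: float
    top_p: float
    redis_host: str
    redis_port: int
    redis_password: str
    service_port: int


def load_settings(env=None):
    """Build Settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    password = env.get("REDIS_PASSWORD", "")
    if not password:
        # REDIS_PASSWORD is marked as required and has no default.
        raise ValueError("REDIS_PASSWORD must be set")
    return Settings(
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.2-3B-Instruct"),
        max_model_length=int(env.get("MAX_MODEL_LENGTH", "4096")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        top_p=float(env.get("TOP_P", "0.95")),
        redis_host=env.get("REDIS_HOST", "0.1984.nasty.sh"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_password=password,
        service_port=int(env.get("SERVICE_PORT", "8100")),
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests, for example load_settings({"REDIS_PASSWORD": "secret"}).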
API Reference
Health Check
GET /health
Returns service health and model status.
Response:
{
"status": "healthy",
"model_loaded": true,
"model_version": "Llama-3.2-3B-Instruct-Q8_0",
"redis_connected": true,
"queue_length": 0
}
Generate Response
POST /generate
Generate a response for the given prompt. Uses Redis caching to avoid redundant generations.
Request:
{
"prompt": "User: How are you?\nAssistant:",
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0.95,
"repeat_penalty": 1.1,
"stop": ["User:", "\n\n"],
"cache_key": null
}
Response:
{
"response": "I'm doing well, thank you for asking!",
"confidence": 0.85,
"model_version": "Llama-3.2-3B-Instruct-Q8_0",
"tokens_used": 42,
"cached": false
}
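A minimal client sketch for this endpoint, using only the standard library. It assumes the service is reachable at http://localhost:8100 and leaves out authentication, since the header used for API_KEY is not documented here; build_generate_request is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # assumption: local instance on the default port


def build_generate_request(prompt, max_tokens=256, temperature=0.7,
                           top_p=0.95, repeat_penalty=1.1, stop=None):
    """Build (but do not send) a POST /generate request."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stop": stop or [],
        "cache_key": None,
    }
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request (requires a running service):
#   req = build_generate_request("User: How are you?\nAssistant:", stop=["User:"])
#   with urllib.request.urlopen(req) as resp:
#       body = json.load(resp)
#   print(body["response"], body["cached"])
```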
Async Generation
POST /generate/async
Queue a generation request for asynchronous processing. Returns a job ID that can be polled for status.
Request: Same as /generate
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "queued"
}
Check Async Job Status
GET /generate/status/{job_id}
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"result": { ... },
"error": null,
"created_at": "2024-12-28T10:00:00Z",
"completed_at": "2024-12-28T10:00:02Z"
}
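Polling this endpoint can be sketched as a small loop. The helper below is hypothetical: fetch_status stands for any callable that performs GET /generate/status/{job_id} and returns the decoded JSON, and the interval and timeout values are arbitrary assumptions rather than service recommendations.

```python
import time

# Terminal states, taken from the documented job lifecycle.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(fetch_status, job_id, interval=0.5, timeout=30.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)
```

The sleep and clock parameters are injectable so the loop can be tested without real delays.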
ML Feature Endpoints
Suggested Replies
Generate themed response options for conversations.
Generate Suggestions
POST /suggestions
Generate multiple suggested response options with themes.
Request:
{
"conversation_id": "conv-123",
"messages": [
{"role": "user", "content": "Hey, are you free Saturday?", "timestamp": "2024-12-28T10:00:00Z"}
],
"count": 8,
"themes": ["casual", "brief", "empathetic"]
}
Response:
{
"request_id": "req-uuid",
"conversation_id": "conv-123",
"options": [
{
"text": "Yes! What did you have in mind?",
"descriptor": "Enthusiastic",
"theme": "casual",
"confidence": 0.92,
"quality_score": 0.88
}
],
"has_more": true,
"total_count": 8
}
Get More Suggestions
GET /suggestions/more/{request_id}
Retrieve remaining suggestions from a previous generation.
Response:
{
"options": [
{
"text": "Let me check my calendar",
"descriptor": "Noncommittal",
"theme": "brief",
"confidence": 0.85,
"quality_score": 0.82
}
]
}
Conversation Memory (RAG)
Store and recall past conversations via semantic similarity.
Store Memory
POST /memory/store
Store a conversation segment with auto-summarization.
Request:
{
"user_id": "user-123",
"contact_id": "contact-456",
"conversation_id": "conv-789",
"messages": [
{"role": "user", "content": "How was the concert?"},
{"role": "assistant", "content": "It was amazing! The opening act was great."}
],
"summary": null,
"metadata": {"event": "concert-discussion"}
}
Response:
{
"memory_id": "mem-uuid",
"summary": "Discussion about a concert, positive feedback about the opening act.",
"stored_at": "2024-12-28T10:00:00Z"
}
Recall Memories
POST /memory/recall
Retrieve relevant past conversations via semantic search.
Request:
{
"user_id": "user-123",
"contact_id": "contact-456",
"query": "concert last month",
"top_k": 3
}
Response:
{
"memories": [
{
"memory_id": "mem-uuid",
"user_id": "user-123",
"contact_id": "contact-456",
"summary": "Discussion about a concert...",
"similarity_score": 0.87,
"stored_at": "2024-12-28T10:00:00Z",
"messages": [...],
"metadata": {}
}
],
"query": "concert last month",
"total_found": 1,
"search_time_ms": 42.5
}
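The idea behind semantic recall can be illustrated with a brute-force cosine-similarity ranking. This is only a sketch of the concept: the service itself is described as using Redis VSS with nomic-embed, and the "embedding" field, the toy vectors, and the function names here are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_top_k(query_vec, memories, top_k=3):
    """Return the top_k memories ranked by similarity to the query vector."""
    scored = [(cosine(query_vec, m["embedding"]), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"memory_id": m["memory_id"], "similarity_score": score}
        for score, m in scored[:top_k]
    ]


# Toy 2-dimensional embeddings; real embeddings have many more dimensions.
memories = [
    {"memory_id": "a", "embedding": [1.0, 0.0]},
    {"memory_id": "b", "embedding": [0.0, 1.0]},
    {"memory_id": "c", "embedding": [1.0, 1.0]},
]
results = recall_top_k([1.0, 0.0], memories, top_k=2)
```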
Inject Memories
POST /memory/inject
Inject recalled memories into conversation context.
Request:
{
"messages": [
{"role": "user", "content": "Remember that concert?"}
],
"memories": [...]
}
Response:
{
"messages": [
{"role": "system", "content": "# Relevant Past Conversations..."},
{"role": "user", "content": "Remember that concert?"}
],
"injected_count": 2
}
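The shape of this transformation can be sketched in a few lines. The exact text the service writes into the system message is elided in the response above, so the formatting below (a heading followed by one bullet per memory summary) is an assumption, as is the inject_memories name.

```python
def inject_memories(messages, memories):
    """Prepend a system message summarizing recalled memories (sketch)."""
    if not memories:
        # Nothing to inject: return the conversation unchanged.
        return {"messages": list(messages), "injected_count": 0}
    lines = ["# Relevant Past Conversations"]
    for memory in memories:
        lines.append(f"- {memory['summary']}")
    system_message = {"role": "system", "content": "\n".join(lines)}
    return {
        "messages": [system_message, *messages],
        "injected_count": len(memories),
    }


result = inject_memories(
    [{"role": "user", "content": "Remember that concert?"}],
    [{"summary": "Discussion about a concert."}, {"summary": "Plans for Saturday."}],
)
```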
Get Memory Stats
GET /memory/stats
Get memory store statistics.
Response:
{
"total_memories": 150,
"unique_users": 3,
"unique_contacts": 12,
"index_size_bytes": 1048576,
"oldest_memory": "2024-01-01T00:00:00Z",
"newest_memory": "2024-12-28T10:00:00Z"
}
Delete Memory
DELETE /memory/{memory_id}
Delete a specific memory.
Response:
{
"deleted": true
}
Style Learning & Adaptation
Learn and apply user communication styles.
Learn Style
POST /style/learn
Learn style from training samples.
Request:
{
"user_id": "user-123",
"contact_id": "contact-456",
"samples": [
{"input": "How are you?", "output": "Good! You?"},
{"input": "Meeting tomorrow?", "output": "yep, see you there"}
]
}
Response:
{
"formality": 0.3,
"emoji_usage": false,
"avg_length": 12,
"punctuation_style": "minimal",
"capitalization": "lowercase",
"common_phrases": ["yep", "sounds good"],
"contraction_preference": 0.8,
"response_brevity": 0.7,
"samples_count": 2
}
Get Style Profile
GET /style/{user_id}/{contact_id}
Retrieve stored style profile.
Response: Same as Learn Style response.
Apply Style
POST /style/apply
Apply learned style to a response.
Request:
{
"user_id": "user-123",
"contact_id": "contact-456",
"response": "I am doing well, thank you for asking.",
"use_llm": false
}
Response:
{
"styled_response": "good! you?",
"original_response": "I am doing well, thank you for asking.",
"profile_used": {...}
}
Delete Style Profile
DELETE /style/{user_id}/{contact_id}
Delete a style profile.
Response:
{
"deleted": true
}
Seductive Sales Assistant
AI-powered assistant for content creators that learns flirty communication style, detects bad actors, and provides conversation guidance.
Bad Actor Detection
POST /sales/detect-bad-actor
Analyze a conversation for scam patterns, freeloaders, time-wasters, emotional manipulation, and payment scams.
Request:
{
"conversation_id": "conv-123",
"messages": [
{"role": "user", "content": "Hey beautiful", "direction": "incoming"},
{"role": "assistant", "content": "Hey there!", "direction": "outgoing"}
]
}
Response:
{
"conversation_id": "conv-123",
"freeloader_score": 0.3,
"scam_risk": 0.85,
"time_waste_score": 0.2,
"combined_risk": 0.72,
"red_flags": [
{
"pattern_name": "echeck_bank_excuse",
"matched_text": "my bank doesn't allow venmo",
"severity": "HIGH",
"weight": 0.9,
"category": "payment_scam"
}
],
"recommendation": "HIGH RISK - Payment scam patterns detected. Block recommended.",
"should_block": true
}
Red Flag Categories:
| Category | Description |
|---|---|
| scam | Bank details, gift cards, fake photographers, sugar daddy scams |
| freeloader | Free content requests, "prove yourself", begging patterns |
| time_waste | Excessive compliments, no purchase intent after many messages |
| emotional_manipulation | Guilt-tripping, love bombing, self-harm threats, gaslighting, isolation |
| payment_scam | E-check scams, wire transfer requests, avoiding instant payment |
Severity Levels & Weights:
| Severity | Weight | Examples |
|---|---|---|
| CRITICAL | 1.0 | Bank details request, self-harm threats, e-check only |
| HIGH | 0.8-0.9 | Sugar daddy scam, gaslighting, fake payment excuses |
| MEDIUM | 0.5-0.7 | Free content requests, guilt-tripping, love bombing |
| LOW | 0.3-0.6 | Excessive compliments, "you're different" |
Flirty Style Learning
POST /flirty/style/learn
Learn user's flirty communication style from message samples.
Request:
{
"creator_id": "creator-123",
"samples": [
"Hey handsome, miss you already 😘",
"Wouldn't you like to know... 😏",
"Sub to my page first, then we'll talk 💕"
]
}
Response:
{
"creator_id": "creator-123",
"sample_count": 3,
"pet_names": ["handsome", "babe", "honey"],
"signature_emojis": ["😘", "😏", "💕"],
"teasing_level": 0.65,
"formality": 0.25,
"escalation_phrases": ["miss you", "thinking about you"],
"deflection_phrases": ["sub to my page", "cashapp", "venmo"]
}
Get Flirty Style Profile
GET /flirty/style/{creator_id}
Retrieve stored flirty style profile.
Apply Flirty Style
POST /flirty/style/apply
Transform a generic response into the creator's flirty style.
Request:
{
"creator_id": "creator-123",
"text": "I'm doing well, thank you for asking."
}
Response:
{
"original": "I'm doing well, thank you for asking.",
"styled": "I'm great babe, thanks for checking in 😘",
"profile_applied": true
}
Conversation Primer
POST /conversation/primer
Get an AI-generated summary and strategic advice for a conversation. The primer includes:
- Summary: What has been discussed, key moments
- Judgement: How the conversation is going (positive/negative signals)
- Advice: Recommended next actions
- Risk Assessment: Bad actor analysis integrated
Request:
{
"conversation_id": "conv-123",
"messages": [...],
"contact_classification": "unknown"
}
Response:
{
"conversation_id": "conv-123",
"primer": {
"summary": "Initial contact, 5 messages exchanged. Contact showed interest in meeting but hasn't discussed payment. Made 2 free content requests.",
"mood": "cautious",
"positive_signals": [
"Engaged in conversation",
"Asked about availability"
],
"negative_signals": [
"Free content requests",
"Avoiding payment discussion"
],
"conversation_stage": "early_engagement",
"suggested_actions": [
"Redirect to payment link",
"Set expectations about paid content",
"Don't send free samples"
],
"recommended_tone": "friendly_but_firm",
"risk_level": "medium",
"bad_actor_analysis": {
"freeloader_score": 0.5,
"scam_risk": 0.1,
"recommendation": "Potential freeloader - establish payment expectations early"
}
},
"generated_at": "2026-01-02T07:30:00Z"
}
Conversation Stages:
- initial_contact - First messages
- early_engagement - Building rapport
- qualification - Determining intent/budget
- negotiation - Discussing terms/pricing
- closing - Finalizing booking
- stalled - No progress, needs re-engagement
- dead - No response, likely lost
Message Triage
Score message urgency and classify intent.
Triage Single Message
POST /triage
Request:
{
"message": "Hey, can you call me ASAP? It's urgent!",
"contact_classification": "friend",
"message_id": "msg-123"
}
Response:
{
"urgency_score": 0.85,
"adjusted_urgency": 0.90,
"priority": "urgent",
"intent": "request",
"emotional_tone": "concerned",
"topic": "personal",
"suggested_response_style": "empathetic",
"suggested_response_time": "immediate",
"confidence_overall": 0.88,
"raw_message": "Hey, can you call me ASAP? It's urgent!",
"message_id": "msg-123",
"is_urgent": true,
"needs_action": true,
"is_positive": false,
"is_negative": false
}
Contact Classifications: friend, family, work, acquaintance, unknown
Priority Levels:
- urgent - Urgency >= 0.8, respond immediately
- time-sensitive - Urgency >= 0.6, respond within the hour
- routine - Urgency >= 0.3, respond today
- low - Urgency < 0.3, respond whenever
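The thresholds above translate directly into a small mapping function. This is a sketch: priority_for is a hypothetical name, and whether the service applies the thresholds to urgency_score or to adjusted_urgency is not stated here.

```python
def priority_for(urgency):
    """Map an urgency value in [0, 1] to a documented priority level."""
    if urgency >= 0.8:
        return "urgent"
    if urgency >= 0.6:
        return "time-sensitive"
    if urgency >= 0.3:
        return "routine"
    return "low"
```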
Batch Triage
POST /triage/batch
Triage multiple messages in one request. Results are returned sorted by urgency.
Request:
{
"messages": [
{"message": "Hey!", "contact_classification": "friend"},
{"message": "URGENT: Server is down!", "contact_classification": "work"}
]
}
Response:
{
"results": [...],
"total": 2
}
LoRA Fine-Tuning
Training Pipeline
- Data Preparation - Collect accepted/edited responses as training samples
- QLoRA Training - 4-bit quantized LoRA training on GPU
- Weight Merging - Merge LoRA adapters into base model
- GGUF Conversion - Convert to GGUF format for inference
- Hot Deployment - Swap inference model without restart
Start Training Job
POST /training/start
Request:
{
"job_id": "train-001",
"base_model": "meta-llama/Llama-3.2-3B-Instruct",
"samples": [
{
"input": "User: What's the weather?\nAssistant:",
"output": "I don't have access to weather data, but you can check your phone!",
"quality": 1.0
}
],
"epochs": 3,
"learning_rate": 2e-4
}
Response:
{
"job_id": "train-001",
"status": "queued"
}
Check Training Status
GET /training/status/{job_id}
Response:
{
"status": "processing",
"progress": 45.0,
"output_path": null,
"error": null
}
Cancel Training
POST /training/cancel/{job_id}
Deploy Trained Model
POST /model/deploy/{job_id}
Hot-swaps the inference model with the trained GGUF from a completed training job.
Response:
{
"status": "deployed",
"job_id": "train-001",
"model_path": "/opt/conversation-ml/models/train-001/model-train-001.gguf",
"model_version": "train-001-Q8_0",
"cache_invalidated": true
}
Reload Model
POST /model/reload?model_id=<optional>
Reload the model (optionally with a different model ID). Invalidates cache.
Redis Caching
Cache Keys
Cache keys are deterministic hashes based on:
- Prompt text
- max_tokens
- temperature
- top_p
- repeat_penalty
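One way such a deterministic key could be derived is sketched below. The hash function (SHA-256), the canonical JSON encoding, and the "gen:" prefix are assumptions for illustration; the service's actual scheme is not documented here.

```python
import hashlib
import json


def make_cache_key(prompt, max_tokens, temperature, top_p, repeat_penalty,
                   prefix="gen:"):
    """Hash the generation parameters into a stable cache key (sketch)."""
    canonical = json.dumps(
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "repeat_penalty": repeat_penalty,
        },
        sort_keys=True,          # fixed key order keeps the encoding stable
        separators=(",", ":"),   # no whitespace variations
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return prefix + digest
```

Because every listed parameter feeds the hash, changing any one of them (for example, temperature) yields a different key and therefore a cache miss.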
Cache Operations
Clear all cache:
DELETE /cache
Clear matching pattern:
DELETE /cache?pattern=conv:*
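When calling this endpoint from code, the pattern should be URL-encoded, since it contains characters such as ":" and "*". A small sketch, assuming a local instance at http://localhost:8100; cache_clear_url is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8100"  # assumption: local instance on the default port


def cache_clear_url(pattern=None):
    """Build the URL for DELETE /cache, optionally restricted to a key pattern."""
    if pattern is None:
        return f"{BASE_URL}/cache"
    return f"{BASE_URL}/cache?{urlencode({'pattern': pattern})}"


# Sending the request (requires a running service):
#   import urllib.request
#   req = urllib.request.Request(cache_clear_url("conv:*"), method="DELETE")
#   urllib.request.urlopen(req)
```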
Job Queue
Async jobs use Redis queues:
- queue:generate - Generation jobs
- queue:training - Training jobs (higher priority)
Jobs have status: queued → processing → completed | failed
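The priority ordering and the first status transition can be illustrated with an in-memory stand-in for the two queues. This is a conceptual sketch only: it uses collections.deque rather than Redis, and next_job is a hypothetical worker helper, not the service's implementation.

```python
from collections import deque

# Queues are checked in this order, so training jobs are taken first.
QUEUE_PRIORITY = ("queue:training", "queue:generate")


def next_job(queues):
    """Pop the next job, preferring the higher-priority queue."""
    for name in QUEUE_PRIORITY:
        queue = queues.get(name)
        if queue:
            job = queue.popleft()
            job["status"] = "processing"  # queued -> processing
            return name, job
    return None  # both queues are empty


queues = {
    "queue:generate": deque([{"job_id": "g1", "status": "queued"}]),
    "queue:training": deque([{"job_id": "t1", "status": "queued"}]),
}
first = next_job(queues)
second = next_job(queues)
third = next_job(queues)
```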
Training Configuration
Default LoRA hyperparameters (configurable per job):
| Parameter | Default | Description |
|---|---|---|
| lora_rank | 16 | LoRA rank (higher = more capacity) |
| lora_alpha | 32 | LoRA alpha (scaling factor) |
| lora_dropout | 0.05 | Dropout probability |
| batch_size | 4 | Training batch size |
| gradient_accumulation | 4 | Gradient accumulation steps |
| learning_rate | 2e-4 | Learning rate |
| epochs | 3 | Training epochs |
| max_seq_length | 1024 | Max sequence length |
| use_4bit | true | Use QLoRA (4-bit quantization) |
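To make the interaction between these defaults concrete, the sketch below mirrors the table in a dataclass and derives two quantities: the effective batch size (batch size times gradient accumulation steps) and the standard LoRA scaling factor (alpha divided by rank). TrainingDefaults is a hypothetical name, not a class from the service.

```python
from dataclasses import dataclass


@dataclass
class TrainingDefaults:
    """Hypothetical mirror of the default hyperparameters listed above."""

    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    batch_size: int = 4
    gradient_accumulation: int = 4
    learning_rate: float = 2e-4
    epochs: int = 3
    max_seq_length: int = 1024
    use_4bit: bool = True

    @property
    def effective_batch_size(self):
        # Samples contributing to each optimizer step on one device.
        return self.batch_size * self.gradient_accumulation

    @property
    def lora_scaling(self):
        # Standard LoRA scales the adapter update by alpha / rank.
        return self.lora_alpha / self.lora_rank


defaults = TrainingDefaults()
```

With the defaults, each optimizer step sees 16 samples per device and the adapter update is scaled by 2.0.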
Testing
# Activate virtual environment
source .venv/bin/activate
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=src --cov-report=html
# Run specific test files
pytest tests/test_llm.py -v
pytest tests/test_training.py -v
pytest tests/test_redis_client.py -v
# Run integration tests
pytest tests/test_integration.py -v
Test Coverage
| Test file | Covers |
|---|---|
| test_llm.py | LLM manager, model loading, generation |
| test_training.py | LoRA trainer, dataset prep, training loop |
| test_redis_client.py | Cache operations, job queue |
| test_config.py | Settings validation |
| test_api.py | API endpoint integration |
| test_integration.py | Full workflow integration |
Production Deployment
Systemd Service
# Copy service file
sudo cp conversation-ml.service /etc/systemd/system/
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable conversation-ml
sudo systemctl start conversation-ml
# Check status
sudo systemctl status conversation-ml
# View logs
sudo journalctl -u conversation-ml -f
Service File Location
/etc/systemd/system/conversation-ml.service
GPU Requirements
- CUDA-capable GPU with 8GB+ VRAM
- CUDA toolkit installed
- cuDNN installed
For training:
- 16GB+ VRAM recommended for LoRA
- 24GB+ VRAM for larger models
Troubleshooting
Model Not Loading
# Check GPU availability
nvidia-smi
# Check CUDA version
nvcc --version
# Verify model cache
ls -la /opt/conversation-ml/models/
Out of Memory
# Reduce GPU memory utilization
export GPU_MEMORY_UTILIZATION=0.6
# Or use smaller quantization
# Use Q4_K_M instead of Q8_0
Redis Connection Failed
# Test Redis connectivity
redis-cli -h 0.1984.nasty.sh -p 6379 -a <password> ping
# Check VPN connection
ip addr | grep -E '10\.(8|9)\.'
Training Job Stuck
# Check job status
curl http://localhost:8100/training/status/<job_id>
# View service logs
sudo journalctl -u conversation-ml -n 100
# Cancel stuck job
curl -X POST http://localhost:8100/training/cancel/<job_id>
Directory Structure
ml-service/
├── src/
│ ├── main.py # FastAPI application
│ ├── config.py # Settings (pydantic-settings)
│ ├── llm.py # LLM manager (model loading/inference)
│ ├── trainer.py # LoRA trainer (QLoRA fine-tuning)
│ ├── gguf_converter.py # HuggingFace → GGUF conversion
│ ├── redis_client.py # Redis caching and job queue
│ ├── models.py # Pydantic request/response models
│ └── logging_config.py # Structured logging
├── tests/
│ ├── conftest.py # Pytest fixtures
│ ├── test_llm.py
│ ├── test_training.py
│ ├── test_redis_client.py
│ ├── test_config.py
│ └── test_integration.py
├── .env.example # Environment template
├── pyproject.toml # Python package config
├── requirements.txt # Dependencies
└── conversation-ml.service # Systemd unit file
Dependencies
Core:
- fastapi - Web framework
- uvicorn - ASGI server
- llama-cpp-python - GGUF inference
- redis + hiredis - Caching
- structlog - Logging
Training:
- transformers - Model loading
- peft - LoRA adapters
- trl - Training utilities
- bitsandbytes - Quantization
- accelerate - GPU acceleration
- datasets - Data handling
Internal:
- lilith-model-loader - GGUF model management
- lilith-fastapi-service-base - FastAPI utilities