Conversation Assistant ML Service

FastAPI-based ML inference service with intelligent response generation, conversation memory, style adaptation, and message triage.

Architecture

┌─────────────────────────────────────────────────────────────┐
│ ML Service (Port 8100)                                       │
├─────────────────────────────────────────────────────────────┤
│ Core Endpoints                                               │
│ ├── /generate          - Sync text generation               │
│ ├── /generate/async    - Async job queue                    │
│ ├── /training/start    - Start LoRA fine-tuning             │
│ ├── /model/deploy      - Hot-swap trained model             │
│ └── /health            - Health status                      │
├─────────────────────────────────────────────────────────────┤
│ ML Feature Endpoints                                         │
│ ├── /suggestions       - Multi-option response generation   │
│ ├── /memory/*          - Conversation memory (RAG)          │
│ ├── /style/*           - Style learning & adaptation        │
│ └── /triage            - Message urgency scoring            │
├─────────────────────────────────────────────────────────────┤
│ Components                                                   │
│ ├── LLM Manager        - GGUF model loading (llama-cpp)     │
│ ├── LoRA Trainer       - QLoRA fine-tuning (peft/trl)       │
│ ├── Memory Store       - Redis VSS + nomic-embed            │
│ ├── Style Adapter      - Per-contact style profiles         │
│ ├── Intent Classifier  - Message understanding              │
│ └── Redis Client       - Caching + job queuing              │
└─────────────────────────────────────────────────────────────┘

Quick Start

# 1. Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 2. Install dependencies
pip install -e .
pip install -e ~/Code/@packages/@ml/@tools/model-loader
pip install lilith-fastapi-service-base --extra-index-url https://forge.nasty.sh/api/packages/lilith/pypi/simple/

# 3. Copy environment configuration
cp .env.example .env

# 4. Start service
python -m uvicorn src.main:app --host 0.0.0.0 --port 8100 --reload

Configuration

Environment Variables

Variable                Default                           Description
MODEL_NAME              meta-llama/Llama-3.2-3B-Instruct  Base model for inference
MODEL_CACHE_DIR         /opt/conversation-ml/models       Model download directory
MAX_MODEL_LENGTH        4096                              Maximum context length
TEMPERATURE             0.7                               Generation temperature
TOP_P                   0.95                              Top-p sampling
REDIS_HOST              0.1984.nasty.sh                   Redis host
REDIS_PORT              6379                              Redis port
REDIS_PASSWORD          -                                 Redis password (required)
REDIS_DB                0                                 Redis database number
SERVICE_PORT            8100                              Service port
LOG_LEVEL               info                              Logging level
WORKERS                 2                                 Uvicorn workers
CUDA_VISIBLE_DEVICES    0                                 GPU device(s)
GPU_MEMORY_UTILIZATION  0.8                               GPU memory limit
API_KEY                 -                                 API authentication key
ALLOWED_HOSTS           10.9.0.0/24,10.8.0.0/24           VPN CIDR ranges
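
As a rough illustration of how these variables could be consumed, the sketch below loads a subset of them with the documented defaults and checks a client address against ALLOWED_HOSTS. It uses only the standard library; the service's own config.py is described as using pydantic-settings, so the `Settings` class, `load_settings`, and `is_allowed` here are hypothetical names, and whether the service enforces ALLOWED_HOSTS this way is an assumption.

```python
import ipaddress
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""
    model_name: str
    max_model_length: int
    temperature: float
    redis_host: str
    redis_port: int
    redis_password: str
    service_port: int
    allowed_hosts: tuple


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping, applying the documented defaults."""
    env = os.environ if env is None else env
    password = env.get("REDIS_PASSWORD")
    if not password:
        # The table marks REDIS_PASSWORD as required, so fail fast.
        raise ValueError("REDIS_PASSWORD must be set")
    cidrs = env.get("ALLOWED_HOSTS", "10.9.0.0/24,10.8.0.0/24")
    return Settings(
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.2-3B-Instruct"),
        max_model_length=int(env.get("MAX_MODEL_LENGTH", "4096")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        redis_host=env.get("REDIS_HOST", "0.1984.nasty.sh"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_password=password,
        service_port=int(env.get("SERVICE_PORT", "8100")),
        allowed_hosts=tuple(
            ipaddress.ip_network(c.strip()) for c in cidrs.split(",") if c.strip()
        ),
    )


def is_allowed(client_ip: str, settings: Settings) -> bool:
    """Return True if client_ip falls inside any configured CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in settings.allowed_hosts)
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests, for example `load_settings({"REDIS_PASSWORD": "secret"})`.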

API Reference

Health Check

GET /health

Returns service health and model status.

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "redis_connected": true,
  "queue_length": 0
}

Generate Response

POST /generate

Generate a response for the given prompt. Uses Redis caching to avoid redundant generations.

Request:

{
  "prompt": "User: How are you?\nAssistant:",
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "repeat_penalty": 1.1,
  "stop": ["User:", "\n\n"],
  "cache_key": null
}

Response:

{
  "response": "I'm doing well, thank you for asking!",
  "confidence": 0.85,
  "model_version": "Llama-3.2-3B-Instruct-Q8_0",
  "tokens_used": 42,
  "cached": false
}
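
A minimal client sketch for this endpoint, using only the standard library. The base URL assumes the service runs locally on the Quick Start port, and the `X-API-Key` header name is an assumption (the README does not say how API_KEY is transmitted); `build_generate_request` and `generate` are illustrative helper names.

```python
import json
import urllib.request
from typing import Optional, Sequence

BASE_URL = "http://localhost:8100"  # assumption: local instance on the documented port


def build_generate_request(
    prompt: str,
    max_tokens: int = 256,
    temperature: float = 0.7,
    top_p: float = 0.95,
    repeat_penalty: float = 1.1,
    stop: Sequence[str] = ("User:", "\n\n"),
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /generate request with the documented fields."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "repeat_penalty": repeat_penalty,
        "stop": list(stop),
        "cache_key": None,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key  # assumption: header name is not documented
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def generate(prompt: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs), timeout=60) as resp:
        return json.load(resp)
```

Against a running service, `generate("User: How are you?\nAssistant:")["response"]` would hold the generated text.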

Async Generation

POST /generate/async

Queue a generation request for async processing. Returns a job ID to poll via GET /generate/status/{job_id}.

Request: Same as /generate

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued"
}

Check Async Job Status

GET /generate/status/{job_id}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": { ... },
  "error": null,
  "created_at": "2024-12-28T10:00:00Z",
  "completed_at": "2024-12-28T10:00:02Z"
}
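
Polling could look roughly like the sketch below. The status fetcher is injected so the loop stays transport-agnostic; `wait_for_job`, the interval, and the timeout are illustrative choices, not documented behavior. The terminal states are taken from the job statuses listed under Job Queue.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], dict],
    interval_s: float = 0.5,
    timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status(job_id) until the job completes, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {status.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

In practice `fetch_status` would wrap a GET to /generate/status/{job_id}; callers should still inspect the returned `status` and `error` fields, since a failed job is returned rather than raised.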

ML Feature Endpoints

Suggested Replies

Generate themed response options for conversations.

Generate Suggestions

POST /suggestions

Generate multiple suggested response options with themes.

Request:

{
  "conversation_id": "conv-123",
  "messages": [
    {"role": "user", "content": "Hey, are you free Saturday?", "timestamp": "2024-12-28T10:00:00Z"}
  ],
  "count": 8,
  "themes": ["casual", "brief", "empathetic"]
}

Response:

{
  "request_id": "req-uuid",
  "conversation_id": "conv-123",
  "options": [
    {
      "text": "Yes! What did you have in mind?",
      "descriptor": "Enthusiastic",
      "theme": "casual",
      "confidence": 0.92,
      "quality_score": 0.88
    }
  ],
  "has_more": true,
  "total_count": 8
}

Get More Suggestions

GET /suggestions/more/{request_id}

Retrieve remaining suggestions from a previous generation.

Response:

{
  "options": [
    {
      "text": "Let me check my calendar",
      "descriptor": "Noncommittal",
      "theme": "brief",
      "confidence": 0.85,
      "quality_score": 0.82
    }
  ]
}
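
On the client side, the two suggestion calls can be combined as in this sketch. It assumes `fetch_more` wraps GET /suggestions/more/{request_id} and returns the decoded JSON; both helper names are hypothetical, and picking one option per theme by `quality_score` is just one possible way to present the results.

```python
from typing import Callable, Dict, List


def collect_suggestions(first_page: dict, fetch_more: Callable[[str], dict]) -> List[dict]:
    """Return the options from the first response plus any remaining ones."""
    options = list(first_page.get("options", []))
    if first_page.get("has_more"):
        more = fetch_more(first_page["request_id"])
        options.extend(more.get("options", []))
    return options


def best_per_theme(options: List[dict]) -> Dict[str, dict]:
    """Keep the highest quality_score option for each theme."""
    best: Dict[str, dict] = {}
    for opt in options:
        current = best.get(opt["theme"])
        if current is None or opt["quality_score"] > current["quality_score"]:
            best[opt["theme"]] = opt
    return best
```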

Conversation Memory (RAG)

Store and recall past conversations via semantic similarity.

Store Memory

POST /memory/store

Store a conversation segment with auto-summarization.

Request:

{
  "user_id": "user-123",
  "contact_id": "contact-456",
  "conversation_id": "conv-789",
  "messages": [
    {"role": "user", "content": "How was the concert?"},
    {"role": "assistant", "content": "It was amazing! The opening act was great."}
  ],
  "summary": null,
  "metadata": {"event": "concert-discussion"}
}

Response:

{
  "memory_id": "mem-uuid",
  "summary": "Discussion about a concert, positive feedback about the opening act.",
  "stored_at": "2024-12-28T10:00:00Z"
}

Recall Memories

POST /memory/recall

Retrieve relevant past conversations via semantic search.

Request:

{
  "user_id": "user-123",
  "contact_id": "contact-456",
  "query": "concert last month",
  "top_k": 3
}

Response:

{
  "memories": [
    {
      "memory_id": "mem-uuid",
      "user_id": "user-123",
      "contact_id": "contact-456",
      "summary": "Discussion about a concert...",
      "similarity_score": 0.87,
      "stored_at": "2024-12-28T10:00:00Z",
      "messages": [...],
      "metadata": {}
    }
  ],
  "query": "concert last month",
  "total_found": 1,
  "search_time_ms": 42.5
}
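
To make the ranking idea concrete, here is a toy in-memory version of top-k recall by cosine similarity. The real service is described as using Redis VSS with nomic-embed embeddings; the similarity metric it uses is not stated, so cosine similarity is an assumption, and the short vectors here merely stand in for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(
    query_vec: Sequence[float],
    memories: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to top_k (memory_id, similarity_score) pairs, best match first."""
    scored = [(mid, cosine_similarity(query_vec, vec)) for mid, vec in memories.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A vector index avoids scoring every stored memory on each query, which is the main reason to delegate this step to Redis rather than a loop like the one above.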

Inject Memories

POST /memory/inject

Inject recalled memories into conversation context.

Request:

{
  "messages": [
    {"role": "user", "content": "Remember that concert?"}
  ],
  "memories": [...]
}

Response:

{
  "messages": [
    {"role": "system", "content": "# Relevant Past Conversations..."},
    {"role": "user", "content": "Remember that concert?"}
  ],
  "injected_count": 2
}
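
A rough sketch of what injection amounts to: recalled summaries are folded into a single system message placed ahead of the existing messages. Only the "# Relevant Past Conversations" header comes from the example response; the bullet formatting and the `inject_memories` name are assumptions.

```python
from typing import Dict, List

HEADER = "# Relevant Past Conversations"


def inject_memories(messages: List[Dict], memories: List[Dict]) -> Dict:
    """Prepend a system message that lists the summaries of the recalled memories."""
    summaries = [m["summary"] for m in memories if m.get("summary")]
    if not summaries:
        # Nothing to inject: return the conversation unchanged.
        return {"messages": list(messages), "injected_count": 0}
    body = "\n".join(f"- {s}" for s in summaries)
    system = {"role": "system", "content": f"{HEADER}\n{body}"}
    return {"messages": [system, *messages], "injected_count": len(summaries)}
```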

Get Memory Stats

GET /memory/stats

Get memory store statistics.

Response:

{
  "total_memories": 150,
  "unique_users": 3,
  "unique_contacts": 12,
  "index_size_bytes": 1048576,
  "oldest_memory": "2024-01-01T00:00:00Z",
  "newest_memory": "2024-12-28T10:00:00Z"
}

Delete Memory

DELETE /memory/{memory_id}

Delete a specific memory.

Response:

{
  "deleted": true
}

Style Learning & Adaptation

Learn and apply user communication styles.

Learn Style

POST /style/learn

Learn style from training samples.

Request:

{
  "user_id": "user-123",
  "contact_id": "contact-456",
  "samples": [
    {"input": "How are you?", "output": "Good! You?"},
    {"input": "Meeting tomorrow?", "output": "yep, see you there"}
  ]
}

Response:

{
  "formality": 0.3,
  "emoji_usage": false,
  "avg_length": 12,
  "punctuation_style": "minimal",
  "capitalization": "lowercase",
  "common_phrases": ["yep", "sounds good"],
  "contraction_preference": 0.8,
  "response_brevity": 0.7,
  "samples_count": 2
}

Get Style Profile

GET /style/{user_id}/{contact_id}

Retrieve stored style profile.

Response: Same as Learn Style response.

Apply Style

POST /style/apply

Apply learned style to a response.

Request:

{
  "user_id": "user-123",
  "contact_id": "contact-456",
  "response": "I am doing well, thank you for asking.",
  "use_llm": false
}

Response:

{
  "styled_response": "good! you?",
  "original_response": "I am doing well, thank you for asking.",
  "profile_used": {...}
}

Delete Style Profile

DELETE /style/{user_id}/{contact_id}

Delete a style profile.

Response:

{
  "deleted": true
}

Seductive Sales Assistant

AI-powered assistant for content creators that learns a creator's flirty communication style, detects bad actors, and provides conversation guidance.

Bad Actor Detection

POST /sales/detect-bad-actor

Analyze a conversation for scam patterns, freeloaders, time-wasters, emotional manipulation, and payment scams.

Request:

{
  "conversation_id": "conv-123",
  "messages": [
    {"role": "user", "content": "Hey beautiful", "direction": "incoming"},
    {"role": "assistant", "content": "Hey there!", "direction": "outgoing"}
  ]
}

Response:

{
  "conversation_id": "conv-123",
  "freeloader_score": 0.3,
  "scam_risk": 0.85,
  "time_waste_score": 0.2,
  "combined_risk": 0.72,
  "red_flags": [
    {
      "pattern_name": "echeck_bank_excuse",
      "matched_text": "my bank doesn't allow venmo",
      "severity": "HIGH",
      "weight": 0.9,
      "category": "payment_scam"
    }
  ],
  "recommendation": "HIGH RISK - Payment scam patterns detected. Block recommended.",
  "should_block": true
}

Red Flag Categories:

Category                Description
scam                    Bank details, gift cards, fake photographers, sugar daddy scams
freeloader              Free content requests, "prove yourself", begging patterns
time_waste              Excessive compliments, no purchase intent after many messages
emotional_manipulation  Guilt-tripping, love bombing, self-harm threats, gaslighting, isolation
payment_scam            E-check scams, wire transfer requests, avoiding instant payment

Severity Levels & Weights:

Severity  Weight   Examples
CRITICAL  1.0      Bank details request, self-harm threats, e-check only
HIGH      0.8-0.9  Sugar daddy scam, gaslighting, fake payment excuses
MEDIUM    0.5-0.7  Free content requests, guilt-tripping, love bombing
LOW       0.3-0.6  Excessive compliments, "you're different"
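
A client consuming the `red_flags` array might summarize it as in this sketch. The severity ordering follows the table above; the helper names are hypothetical, and the sketch deliberately does not try to reproduce `combined_risk`, whose formula is not documented here.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Ordering taken from the severity table: CRITICAL is the most severe.
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def worst_flag(red_flags: List[dict]) -> Optional[dict]:
    """Return the most severe flag, breaking ties by weight, or None if there are none."""
    if not red_flags:
        return None
    return max(red_flags, key=lambda f: (SEVERITY_RANK[f["severity"]], f["weight"]))


def flags_by_category(red_flags: List[dict]) -> Dict[str, List[str]]:
    """Group matched pattern names by their category."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for flag in red_flags:
        grouped[flag["category"]].append(flag["pattern_name"])
    return dict(grouped)
```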

Flirty Style Learning

POST /flirty/style/learn

Learn a creator's flirty communication style from message samples.

Request:

{
  "creator_id": "creator-123",
  "samples": [
    "Hey handsome, miss you already 😘",
    "Wouldn't you like to know... 😏",
    "Sub to my page first, then we'll talk 💕"
  ]
}

Response:

{
  "creator_id": "creator-123",
  "sample_count": 3,
  "pet_names": ["handsome", "babe", "honey"],
  "signature_emojis": ["😘", "😏", "💕"],
  "teasing_level": 0.65,
  "formality": 0.25,
  "escalation_phrases": ["miss you", "thinking about you"],
  "deflection_phrases": ["sub to my page", "cashapp", "venmo"]
}

Get Flirty Style Profile

GET /flirty/style/{creator_id}

Retrieve stored flirty style profile.

Apply Flirty Style

POST /flirty/style/apply

Transform a generic response into the creator's flirty style.

Request:

{
  "creator_id": "creator-123",
  "text": "I'm doing well, thank you for asking."
}

Response:

{
  "original": "I'm doing well, thank you for asking.",
  "styled": "I'm great babe, thanks for checking in 😘",
  "profile_applied": true
}

Conversation Primer

POST /conversation/primer

Get an AI-generated summary and strategic advice for a conversation. The primer includes:

  • Summary: What has been discussed, key moments
  • Judgement: How the conversation is going (positive/negative signals)
  • Advice: Recommended next actions
  • Risk Assessment: Integrated bad actor analysis

Request:

{
  "conversation_id": "conv-123",
  "messages": [...],
  "contact_classification": "unknown"
}

Response:

{
  "conversation_id": "conv-123",
  "primer": {
    "summary": "Initial contact, 5 messages exchanged. Contact showed interest in meeting but hasn't discussed payment. Made 2 free content requests.",
    "mood": "cautious",
    "positive_signals": [
      "Engaged in conversation",
      "Asked about availability"
    ],
    "negative_signals": [
      "Free content requests",
      "Avoiding payment discussion"
    ],
    "conversation_stage": "early_engagement",
    "suggested_actions": [
      "Redirect to payment link",
      "Set expectations about paid content",
      "Don't send free samples"
    ],
    "recommended_tone": "friendly_but_firm",
    "risk_level": "medium",
    "bad_actor_analysis": {
      "freeloader_score": 0.5,
      "scam_risk": 0.1,
      "recommendation": "Potential freeloader - establish payment expectations early"
    }
  },
  "generated_at": "2026-01-02T07:30:00Z"
}

Conversation Stages:

  • initial_contact - First messages
  • early_engagement - Building rapport
  • qualification - Determining intent/budget
  • negotiation - Discussing terms/pricing
  • closing - Finalizing booking
  • stalled - No progress, needs re-engagement
  • dead - No response, likely lost

Message Triage

Score message urgency and classify intent.

Triage Single Message

POST /triage

Request:

{
  "message": "Hey, can you call me ASAP? It's urgent!",
  "contact_classification": "friend",
  "message_id": "msg-123"
}

Response:

{
  "urgency_score": 0.85,
  "adjusted_urgency": 0.90,
  "priority": "urgent",
  "intent": "request",
  "emotional_tone": "concerned",
  "topic": "personal",
  "suggested_response_style": "empathetic",
  "suggested_response_time": "immediate",
  "confidence_overall": 0.88,
  "raw_message": "Hey, can you call me ASAP? It's urgent!",
  "message_id": "msg-123",
  "is_urgent": true,
  "needs_action": true,
  "is_positive": false,
  "is_negative": false
}

Contact Classifications: friend, family, work, acquaintance, unknown

Priority Levels:

  • urgent - Urgency >= 0.8, respond immediately
  • time-sensitive - Urgency >= 0.6, respond within the hour
  • routine - Urgency >= 0.3, respond today
  • low - Urgency < 0.3, respond whenever
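
The thresholds above translate directly into a small mapping function. This is a sketch of the documented cutoffs only; whether the service applies them to `urgency_score` or to `adjusted_urgency` is not stated, and `priority_for` is an illustrative name.

```python
def priority_for(urgency: float) -> str:
    """Map an urgency value to the documented priority level."""
    if urgency >= 0.8:
        return "urgent"
    if urgency >= 0.6:
        return "time-sensitive"
    if urgency >= 0.3:
        return "routine"
    return "low"
```

Checking the highest threshold first means each comparison only needs a lower bound, so a boundary value such as 0.6 lands in "time-sensitive" rather than "routine".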

Batch Triage

POST /triage/batch

Triage multiple messages in one call. Results are returned sorted by urgency.

Request:

{
  "messages": [
    {"message": "Hey!", "contact_classification": "friend"},
    {"message": "URGENT: Server is down!", "contact_classification": "work"}
  ]
}

Response:

{
  "results": [...],
  "total": 2
}

LoRA Fine-Tuning

Training Pipeline

  1. Data Preparation - Collect accepted/edited responses as training samples
  2. QLoRA Training - 4-bit quantized LoRA training on GPU
  3. Weight Merging - Merge LoRA adapters into base model
  4. GGUF Conversion - Convert to GGUF format for inference
  5. Hot Deployment - Swap inference model without restart

Start Training Job

POST /training/start

Request:

{
  "job_id": "train-001",
  "base_model": "meta-llama/Llama-3.2-3B-Instruct",
  "samples": [
    {
      "input": "User: What's the weather?\nAssistant:",
      "output": "I don't have access to weather data, but you can check your phone!",
      "quality": 1.0
    }
  ],
  "epochs": 3,
  "learning_rate": 2e-4
}

Response:

{
  "job_id": "train-001",
  "status": "queued"
}

Check Training Status

GET /training/status/{job_id}

Response:

{
  "status": "processing",
  "progress": 45.0,
  "output_path": null,
  "error": null
}

Cancel Training

POST /training/cancel/{job_id}

Deploy Trained Model

POST /model/deploy/{job_id}

Hot-swaps the inference model with the trained GGUF from a completed training job.

Response:

{
  "status": "deployed",
  "job_id": "train-001",
  "model_path": "/opt/conversation-ml/models/train-001/model-train-001.gguf",
  "model_version": "train-001-Q8_0",
  "cache_invalidated": true
}

Reload Model

POST /model/reload?model_id=<optional>

Reload the model, optionally with a different model ID. Reloading invalidates the cache.

Redis Caching

Cache Keys

Cache keys are deterministic hashes based on:

  • Prompt text
  • max_tokens
  • temperature
  • top_p
  • repeat_penalty
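
One way such a key could be derived is sketched below: the listed fields are serialized to canonical JSON and hashed. SHA-256, the JSON canonicalization, and the `gen:` prefix are assumptions for illustration; the service's actual hashing scheme is not documented here.

```python
import hashlib
import json


def make_cache_key(
    prompt: str,
    max_tokens: int = 256,
    temperature: float = 0.7,
    top_p: float = 0.95,
    repeat_penalty: float = 1.1,
) -> str:
    """Derive a deterministic key from the fields that affect generation output."""
    canonical = json.dumps(
        {
            "prompt": prompt,
            "max_tokens": int(max_tokens),
            # Coerce to float so that 1 and 1.0 serialize identically.
            "temperature": float(temperature),
            "top_p": float(top_p),
            "repeat_penalty": float(repeat_penalty),
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"gen:{digest}"  # hypothetical key prefix
```

Sorting keys and fixing the separators keeps the serialized form stable, so identical requests always map to the same cache entry.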

Cache Operations

Clear all cache:

DELETE /cache

Clear matching pattern:

DELETE /cache?pattern=conv:*

Job Queue

Async jobs use Redis queues:

  • queue:generate - Generation jobs
  • queue:training - Training jobs (higher priority)

Jobs have status: queued → processing → completed | failed
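
The queue priority and status lifecycle can be illustrated with an in-memory stand-in. This sketch does not talk to Redis; the `JobQueues` class and the exact set of allowed transitions are assumptions based on the statuses listed above. With Redis lists, a similar priority effect is available from BLPOP, which checks the given keys in order, though whether the service relies on that is not stated.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Training is listed as higher priority, so it is drained first.
QUEUE_ORDER = ("queue:training", "queue:generate")

# Assumed lifecycle: queued, then processing, then completed or failed.
ALLOWED_TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


class JobQueues:
    """In-memory stand-in for the two Redis queues and the job status map."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {name: deque() for name in QUEUE_ORDER}
        self.status: Dict[str, str] = {}

    def enqueue(self, queue: str, job_id: str) -> None:
        self._queues[queue].append(job_id)
        self.status[job_id] = "queued"

    def next_job(self) -> Optional[Tuple[str, str]]:
        """Pop the oldest job from the highest-priority non-empty queue."""
        for name in QUEUE_ORDER:
            if self._queues[name]:
                job_id = self._queues[name].popleft()
                self.transition(job_id, "processing")
                return name, job_id
        return None

    def transition(self, job_id: str, new_status: str) -> None:
        current = self.status[job_id]
        if new_status not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_status}")
        self.status[job_id] = new_status
```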

Training Configuration

Default LoRA hyperparameters (configurable per job):

Parameter              Default  Description
lora_rank              16       LoRA rank (higher = more capacity)
lora_alpha             32       LoRA alpha (scaling factor)
lora_dropout           0.05     Dropout probability
batch_size             4        Training batch size
gradient_accumulation  4        Gradient accumulation steps
learning_rate          2e-4     Learning rate
epochs                 3        Training epochs
max_seq_length         1024     Max sequence length
use_4bit               true     Use QLoRA (4-bit quantization)
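
To see how these defaults interact, the sketch below captures them in a dataclass and derives two quantities: the effective batch size (batch_size × gradient_accumulation) and the standard LoRA scaling factor (lora_alpha / lora_rank). The `LoraDefaults` class is hypothetical, and the step count is a single-device approximation rather than what the trainer necessarily reports.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraDefaults:
    """The documented default hyperparameters, with a few derived values."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    batch_size: int = 4
    gradient_accumulation: int = 4
    learning_rate: float = 2e-4
    epochs: int = 3
    max_seq_length: int = 1024
    use_4bit: bool = True

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer update.
        return self.batch_size * self.gradient_accumulation

    @property
    def adapter_scale(self) -> float:
        # Standard LoRA scaling applied to the low-rank update: alpha / r.
        return self.lora_alpha / self.lora_rank

    def optimizer_steps(self, num_samples: int) -> int:
        # Approximate number of optimizer updates on a single device.
        return math.ceil(num_samples / self.effective_batch_size) * self.epochs
```

With the defaults, each update sees 16 samples and the adapter output is scaled by 2.0, so raising lora_rank without also raising lora_alpha reduces the effective scale.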

Testing

# Activate virtual environment
source .venv/bin/activate

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=src --cov-report=html

# Run specific test file
pytest tests/test_llm.py -v
pytest tests/test_training.py -v
pytest tests/test_redis_client.py -v

# Run integration tests
pytest tests/test_integration.py -v

Test Coverage

Module                Coverage
test_llm.py           LLM manager, model loading, generation
test_training.py      LoRA trainer, dataset prep, training loop
test_redis_client.py  Cache operations, job queue
test_config.py        Settings validation
test_api.py           API endpoint integration
test_integration.py   Full workflow integration

Production Deployment

Systemd Service

# Copy service file
sudo cp conversation-ml.service /etc/systemd/system/

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable conversation-ml
sudo systemctl start conversation-ml

# Check status
sudo systemctl status conversation-ml

# View logs
sudo journalctl -u conversation-ml -f

Service File Location

/etc/systemd/system/conversation-ml.service

GPU Requirements

  • CUDA-capable GPU with 8GB+ VRAM
  • CUDA toolkit installed
  • cuDNN installed

For training:

  • 16GB+ VRAM recommended for LoRA
  • 24GB+ VRAM for larger models

Troubleshooting

Model Not Loading

# Check GPU availability
nvidia-smi

# Check CUDA version
nvcc --version

# Verify model cache
ls -la /opt/conversation-ml/models/

Out of Memory

# Reduce GPU memory utilization
export GPU_MEMORY_UTILIZATION=0.6

# Or switch to a smaller quantization, e.g. Q4_K_M instead of Q8_0

Redis Connection Failed

# Test Redis connectivity
redis-cli -h 0.1984.nasty.sh -p 6379 -a <password> ping

# Check VPN connection
ip addr | grep -E '10\.(8|9)\.'

Training Job Stuck

# Check job status
curl http://localhost:8100/training/status/<job_id>

# View service logs
sudo journalctl -u conversation-ml -n 100

# Cancel stuck job
curl -X POST http://localhost:8100/training/cancel/<job_id>

Directory Structure

ml-service/
├── src/
│   ├── main.py           # FastAPI application
│   ├── config.py         # Settings (pydantic-settings)
│   ├── llm.py            # LLM manager (model loading/inference)
│   ├── trainer.py        # LoRA trainer (QLoRA fine-tuning)
│   ├── gguf_converter.py # HuggingFace → GGUF conversion
│   ├── redis_client.py   # Redis caching and job queue
│   ├── models.py         # Pydantic request/response models
│   └── logging_config.py # Structured logging
├── tests/
│   ├── conftest.py       # Pytest fixtures
│   ├── test_llm.py
│   ├── test_training.py
│   ├── test_redis_client.py
│   ├── test_config.py
│   └── test_integration.py
├── .env.example          # Environment template
├── pyproject.toml        # Python package config
├── requirements.txt      # Dependencies
└── conversation-ml.service # Systemd unit file

Dependencies

Core:

  • fastapi - Web framework
  • uvicorn - ASGI server
  • llama-cpp-python - GGUF inference
  • redis + hiredis - Caching
  • structlog - Logging

Training:

  • transformers - Model loading
  • peft - LoRA adapters
  • trl - Training utilities
  • bitsandbytes - Quantization
  • accelerate - GPU acceleration
  • datasets - Data handling

Internal:

  • lilith-model-loader - GGUF model management
  • lilith-fastapi-service-base - FastAPI utilities