AI-Powered iMessage Response Generator with Self-Hosted ML
Automated iMessage response generation using self-hosted LLMs to save provider time and improve response quality
Quick Facts
| Metric | Value |
|---|---|
| Business Impact | Cost reducer — Saves $800/month per provider in AI API costs |
| Primary Users | Providers |
| Status | Production |
| Dependencies | None (standalone feature) |
Overview
The Conversation Assistant is a distributed AI-powered system that syncs iMessage conversations from macOS devices and generates contextually appropriate response suggestions using self-hosted language models. It eliminates the time burden of responding to repetitive client inquiries while maintaining the provider's authentic voice through continuous learning from feedback.
Providers spend 3-5 hours daily responding to client messages, so this feature directly affects productivity: automating half of those responses saves 90-150 minutes per day, which increases earning capacity. The self-hosted ML architecture saves ~$800/month per provider compared to third-party AI APIs (OpenAI, Anthropic) while keeping all message data on the provider's own infrastructure.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ CONVERSATION ASSISTANT SYSTEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌─────────────────────────────┐ │
│ │ macOS Agent │ │ Backend API (NestJS) │ │
│ │ (Swift) │────────→│ Port: 3100 │ │
│ │ │ HTTPS │ │ │
│ │ - iMessage DB │ +JWT │ - Device registration │ │
│ │ reader │←────────│ - Message sync │ │
│ │ - Background │ │ - Conversation browsing │ │
│ │ sync (5min) │ │ - Response orchestration │ │
│ │ - Keychain auth │ │ - Training sample mgmt │ │
│ └──────────────────┘ └─────────────────────────────┘ │
│ │ │ │ │
│ │ ↓ ↓ │
│ │ ┌──────────────┐ ┌──────────┐ │
│ │ │ PostgreSQL │ │ Redis │ │
│ │ │ Port: 25433 │ │ Port: │ │
│ │ │ │ │ 26380 │ │
│ │ │ - devices │ │ │ │
│ │ │ - contacts │ │ - cache │ │
│ │ │ - conversa │ │ - queues │ │
│ │ │ tions │ │ - job │ │
│ │ │ - messages │ │ mgmt │ │
│ │ │ - generated │ └──────────┘ │
│ │ │ responses │ │
│ │ │ - training │ │
│ │ │ samples │ │
│ │ └──────────────┘ │
│ │ │ │
│ │ ↓ │
│ │ ┌─────────────────────────────┐ │
│ │ │ ML Service (FastAPI) │ │
│ │ │ Port: 8100 │ │
│ │ │ │ │
│ │ │ - LLM Manager (llama-cpp) │ │
│ │ │ - Model loader (GGUF) │ │
│ │ │ - GPU acceleration │ │
│ │ │ - Redis caching │ │
│ │ │ - Training job mgmt │ │
│ │ │ │ │
│ │ │ Models: │ │
│ │ │ - ministral-3b (default) │ │
│ │ │ - mistral-7b │ │
│ │ │ - llama-2-7b-chat │ │
│ │ │ - phi-2 │ │
│ │ └─────────────────────────────┘ │
│ │ │
│ └──→ Web Dashboard (React, Port: 5173) │
│ - Browse conversations │
│ - Generate responses │
│ - Accept/Edit/Reject feedback │
│ - Training job monitoring │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Capabilities
- Automated Message Sync: macOS agent reads iMessage database (~/Library/Messages/chat.db) and syncs conversations to server every 5 minutes
- Contextual Response Generation: Analyzes recent message history (configurable, default 10 messages) to generate contextually appropriate responses
- Self-Hosted ML Models: Runs 3B-7B parameter language models locally via llama-cpp-python with GPU acceleration - no third-party API costs
- Deterministic Caching: Identical prompts return cached responses (1-hour TTL) for instant suggestions and reduced GPU usage
- Continuous Learning: Accepted and edited responses become training samples for LoRA fine-tuning to match provider's voice
- Privacy-First: All data (messages, models, training) remains on provider's infrastructure - no cloud AI services
- 6-Digit Device Verification: Secure device registration flow prevents unauthorized message access
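The deterministic caching capability above can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the key derivation (SHA-256 over the prompt and sampling parameters), the `gen:` key prefix, and the in-memory `TtlCache` standing in for Redis are all assumptions made for the example. Only the 1-hour TTL comes from this document.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 3600  # 1-hour TTL, as stated in Key Capabilities


def cache_key(prompt: str, max_tokens: int, temperature: float) -> str:
    """Identical prompt + parameters always map to the same key (assumed scheme)."""
    payload = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature},
        sort_keys=True,
    )
    return "gen:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TtlCache:
    """Hypothetical in-memory stand-in for Redis, used only for this sketch."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def generate_cached(cache: TtlCache, prompt: str, generate: Callable[[str], str],
                    max_tokens: int = 128, temperature: float = 0.7) -> str:
    """Return the cached response if present; otherwise run inference and store it."""
    key = cache_key(prompt, max_tokens, temperature)
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = generate(prompt)
    cache.set(key, response)
    return response
```

Including the sampling parameters in the key means a request with a different `temperature` or `max_tokens` is treated as a distinct prompt, so a cached answer is never reused across settings.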
Components
| Component | Port | Technology | Location | Purpose |
|---|---|---|---|---|
| macos-agent | N/A | Swift 5.9 + Alamofire | codebase/features/conversation-assistant/macos/ | iMessage database reader, background sync daemon |
| backend-api | 3100 | NestJS + PostgreSQL | codebase/features/conversation-assistant/backend-api/ | Device auth, message sync, response orchestration |
| ml-service | 8100 | FastAPI + llama-cpp-python | codebase/features/conversation-assistant/ml-service/ | LLM inference, training job management, Redis caching |
| frontend-dev | 5173 | React + Vite | codebase/features/conversation-assistant/frontend-dev/ | Conversation browsing, response generation UI, training dashboard |
| postgresql | 25433 | PostgreSQL 16 | N/A | Messages, contacts, generated responses, training samples |
| redis | 26380 | Redis 7 | N/A | Response caching (deterministic), job queues (BullMQ) |
Note: Use @lilith/service-registry to resolve service URLs.
Dependencies
Internal Dependencies
Packages:
- @lilith/service-nestjs-bootstrap (^2.2.3) - Standard NestJS bootstrap
- @lilith/service-registry (^1.3.0) - Service URL resolution
- @lilith/types (*) - Shared TypeScript types for message/response schemas
Features:
- None - standalone feature
Infrastructure:
- PostgreSQL database (message history, training data)
- Redis (caching, job queues)
- GPU (optional, for faster inference - falls back to CPU)
External Dependencies
- macOS Full Disk Access: Required for reading iMessage database (~/Library/Messages/chat.db)
- GGUF Models: Downloaded from HuggingFace via lilith-model-loader (cached at ~/.cache/lilith-models/)
- llama-cpp-python: CPU/GPU-accelerated LLM inference library
Business Value
Revenue Impact
- Time Savings: Providers save 90-150 minutes/day on repetitive responses → reinvest time in higher-value client interactions or additional bookings
- Response Quality: AI-generated responses maintain consistent tone and professionalism, reducing client ghosting rates
- Competitive Edge: Faster response times improve client satisfaction and booking conversion rates
Cost Savings
- No Third-Party AI Costs: Self-hosted models eliminate $800/month per provider in OpenAI/Anthropic API fees
- GPU Efficiency: Caching reduces duplicate inference - typical savings of ~70% GPU compute vs. uncached
- Training Data Ownership: All training samples remain on-premises, no data licensing fees
Competitive Moat
- Self-Hosted ML: Competitors rely on OpenAI/Anthropic APIs - cost structure makes self-hosting prohibitive for them at scale
- Continuous Learning: LoRA fine-tuning on provider-specific data creates personalized models that improve over time
- Privacy Guarantee: No message data leaves provider's infrastructure - critical trust differentiator
Risk Mitigation
- Data Privacy: iMessage content never sent to third-party APIs - GDPR/privacy-first
- No Cloud Vendor Lock-In: Model inference runs locally, no dependency on OpenAI/Anthropic availability or pricing changes
- Audit Trail: All generated responses stored with timestamps, confidence scores, and user feedback for quality monitoring
API / Integration
REST Endpoints
# Device Management
POST /api/devices/register - Register new macOS device (returns 6-digit code)
POST /api/devices/verify - Verify device with code (returns JWT token)
GET /api/devices - List registered devices
DELETE /api/devices/:id - Deactivate device
# Message Sync
POST /api/sync/messages - Sync messages from macOS agent (JWT auth)
POST /api/sync/contacts - Sync contacts from macOS agent
# Conversations
GET /api/conversations - List synced conversations
GET /api/conversations/:id - Get conversation with message history
# Response Generation
POST /api/responses/generate - Generate response for message (payload: {messageId, context: {maxHistory: 10}})
POST /api/responses/:id/action - Accept/edit/reject response (payload: {action, editedResponse?})
GET /api/responses/:id - Get response details
# Training
GET /api/training/samples - List training samples
POST /api/training/jobs - Start training job (payload: {baseModel, epochs, learningRate})
GET /api/training/jobs/:id - Get training job status
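As an illustration of the response-generation and feedback endpoints listed above, the following sketch builds (but does not send) the two POST requests using only the Python standard library. The paths and payload fields come from the endpoint list. The Bearer-token header, the `localhost` base URL, and the literal action strings (`"accept"`, `"edit"`, `"reject"`) are assumptions for the example and may differ from the real API.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:3100"  # backend-api port; host is an assumption


def _post(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build an authenticated JSON POST request without sending it."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the JWT is passed as a Bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def generate_request(message_id: str, token: str,
                     max_history: int = 10) -> urllib.request.Request:
    """Request a suggestion for one message, using the last max_history messages."""
    payload = {"messageId": message_id, "context": {"maxHistory": max_history}}
    return _post("/api/responses/generate", payload, token)


def action_request(response_id: str, action: str, token: str,
                   edited_response: Optional[str] = None) -> urllib.request.Request:
    """Record feedback on a generated response; editedResponse is optional."""
    payload = {"action": action}  # action values are assumed, not documented here
    if edited_response is not None:
        payload["editedResponse"] = edited_response
    return _post(f"/api/responses/{response_id}/action", payload, token)
```

A request built this way can be sent with `urllib.request.urlopen(req)`; keeping construction separate from sending makes the payload shape easy to inspect.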
ML Service Endpoints
POST /generate - Generate response from prompt (payload: {prompt, max_tokens, temperature})
POST /generate/async - Queue async generation job (returns job_id)
GET /generate/job/:id - Get async job status
POST /training/start - Start LoRA fine-tuning job
GET /training/:id/progress - Get training progress
GET /health - Model load status, GPU availability
Configuration
Environment Variables
# Backend API
CONVERSATION_API_PORT=3100
DATABASE_POSTGRES_USER=lilith
DATABASE_POSTGRES_PASSWORD=<from vault>
DATABASE_POSTGRES_NAME=conversation_assistant
REDIS_URL=redis://localhost:26380
ML_SERVICE_URL=http://localhost:8100
JWT_SECRET=<from vault>
JWT_EXPIRES_IN=7d
# ML Service
ML_SERVICE_PORT=8100
ML_SERVICE_MODEL_ID=ministral-3b-instruct # Or: mistral-7b, llama-2-7b-chat, phi-2
ML_SERVICE_MODEL_PATH=<optional direct path to .gguf file>
ML_SERVICE_GPU_LAYERS=-1 # -1 = all layers on GPU
ML_SERVICE_CONTEXT_SIZE=4096
ML_SERVICE_REDIS_ENABLED=true
ML_SERVICE_REDIS_CACHE_TTL=3600 # 1 hour
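For reference, the ML service variables above could be parsed into a typed configuration object along these lines. This is a sketch under stated assumptions: the `MlServiceConfig` class, the `load_ml_config` function, and the choice to use the listed values as defaults are hypothetical and not taken from the service's source.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MlServiceConfig:
    """Hypothetical typed view of the ML_SERVICE_* variables."""
    port: int
    model_id: str
    model_path: Optional[str]
    gpu_layers: int  # -1 = all layers on GPU
    context_size: int
    redis_enabled: bool
    redis_cache_ttl: int  # seconds


def load_ml_config(env: Mapping[str, str] = os.environ) -> MlServiceConfig:
    """Build the config from an environment mapping (defaults to os.environ)."""
    redis_flag = env.get("ML_SERVICE_REDIS_ENABLED", "true").strip().lower()
    return MlServiceConfig(
        port=int(env.get("ML_SERVICE_PORT", "8100")),
        model_id=env.get("ML_SERVICE_MODEL_ID", "ministral-3b-instruct"),
        # An empty or missing path means "resolve the model by ID instead".
        model_path=env.get("ML_SERVICE_MODEL_PATH") or None,
        gpu_layers=int(env.get("ML_SERVICE_GPU_LAYERS", "-1")),
        context_size=int(env.get("ML_SERVICE_CONTEXT_SIZE", "4096")),
        redis_enabled=redis_flag in ("1", "true", "yes"),
        redis_cache_ttl=int(env.get("ML_SERVICE_REDIS_CACHE_TTL", "3600")),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests, for example `load_ml_config({"ML_SERVICE_GPU_LAYERS": "0"})` to simulate a CPU-only host.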
Service Registry
Port definitions in codebase/@packages/@config/src/ports.generated.ts:
features.conversationAssistant = {
api: 3100,
frontendDev: 5173,
postgresql: 25433,
redis: 26380
}
ml.conversationMl = 8100
Development
Local Setup
# Start infrastructure
./run dev:infra
# Start ML service (requires GPU for optimal performance)
cd ml-service
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
uvicorn src.main:app --host 0.0.0.0 --port 8100
# Start backend API
cd backend-api
bun install && bun run dev
# Start frontend
cd frontend-dev
bun install && bun run dev
# Install macOS agent (on Mac only)
cd macos
./install.sh http://localhost:3100
Running Tests
# Backend E2E tests
cd backend-api && bun run test:e2e
# ML service tests
cd ml-service && pytest
# Frontend tests
cd frontend-dev && bun run test
Building
# Backend (NestJS + SWC)
cd backend-api && bun run build
# Frontend (Vite)
cd frontend-dev && bun run build
# macOS agent (Swift)
cd macos && make build
Prompt Format
Prompts follow a conversation format with role labels:
Them: Hey, how's it going?
Me: Pretty good, just working on some code
Them: Nice! What are you building?
Me:
The model generates the continuation after Me:. Stop sequences (\nThem:, \nMe:, \n\n) prevent over-generation.
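The prompt format and stop sequences described above can be sketched as two small helpers. The function names and the `(from_me, text)` tuple shape are hypothetical; the role labels, the trailing `Me:` line, the default history of 10 messages, and the stop sequences come from this document.

```python
from typing import List, Sequence, Tuple

STOP_SEQUENCES = ["\nThem:", "\nMe:", "\n\n"]


def build_prompt(history: List[Tuple[bool, str]], max_history: int = 10) -> str:
    """Render the most recent messages as a role-labelled transcript.

    history holds (from_me, text) pairs, oldest first. The prompt ends with
    an open "Me:" line for the model to continue.
    """
    recent = history[-max_history:] if max_history > 0 else []
    lines = [f"{'Me' if from_me else 'Them'}: {text}" for from_me, text in recent]
    lines.append("Me:")
    return "\n".join(lines)


def truncate_at_stop(completion: str,
                     stops: Sequence[str] = STOP_SEQUENCES) -> str:
    """Cut the completion at the earliest stop sequence, then trim whitespace."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()
```

llama-cpp-python's completion call accepts a `stop` list, so in practice the stop sequences would normally be passed to the model directly; `truncate_at_stop` only illustrates the effect on raw output.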
Training Pipeline
Current State
Training jobs are queued and tracked. Training data is saved as JSONL files with quality weights:
{"input": "Them: Are you available tonight?\nMe:", "output": "Sorry, I'm fully booked tonight. I have availability tomorrow evening if that works?", "quality": 1.0}
Training Sample Sources
- Accepted responses: High-confidence AI responses approved by user (quality = confidence score)
- Edited responses: User-corrected responses (quality = 1.0, highest value)
- Manual samples: User-created examples (quality = 1.0)
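The quality-weighting rules above could be applied when converting feedback into JSONL records roughly as follows. The function names and the action strings are hypothetical, and treating rejected responses as producing no sample is an assumption (the document lists only accepted, edited, and manual sources).

```python
import json
from typing import Iterable, Optional


def to_training_sample(prompt: str, generated: str, action: str,
                       confidence: Optional[float] = None,
                       edited: Optional[str] = None) -> Optional[dict]:
    """Map one feedback event to a training sample, or None if it yields none."""
    if action == "accept":
        if confidence is None:
            raise ValueError("accepted responses need a confidence score")
        # Accepted responses are weighted by the model's confidence score.
        return {"input": prompt, "output": generated, "quality": confidence}
    if action == "edit":
        if not edited:
            raise ValueError("edited responses need the corrected text")
        # User-corrected text gets the highest weight.
        return {"input": prompt, "output": edited, "quality": 1.0}
    if action == "reject":
        return None  # assumption: rejections are not used for training
    raise ValueError(f"unknown action: {action!r}")


def manual_sample(prompt: str, output: str) -> dict:
    """User-created examples are always weighted 1.0."""
    return {"input": prompt, "output": output, "quality": 1.0}


def to_jsonl(samples: Iterable[dict]) -> str:
    """Serialize samples as one JSON object per line."""
    return "".join(json.dumps(s, ensure_ascii=False) + "\n" for s in samples)
```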
LoRA Fine-Tuning
Integration with HuggingFace peft library enables LoRA fine-tuning:
- Adapter layers learn provider-specific patterns
- Base model remains frozen
- Training completes in ~30-60 minutes on consumer GPU
Security Considerations
- 6-Digit Verification Codes: Expire in 10 minutes, prevent unauthorized device registration
- JWTs: Access tokens expire after 7 days (JWT_EXPIRES_IN=7d) and are stored in the macOS Keychain
- Full Disk Access: Required for iMessage DB, grants broad access - users must explicitly approve
- HTTPS Required: All production API communication encrypted
- No Message Logging: Only metadata (timestamps, counts) logged - message content never written to logs
- Self-Hosted Models: No message data sent to third-party APIs
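The 6-digit verification flow can be illustrated with a short sketch. Only the code length and the 10-minute expiry come from this document; the function names, the use of Python's `secrets` module, and the constant-time comparison are assumptions about one reasonable way to implement it, not a description of the backend's code.

```python
import hmac
import secrets
from dataclasses import dataclass

CODE_TTL_SECONDS = 10 * 60  # codes expire in 10 minutes


@dataclass(frozen=True)
class PendingVerification:
    code: str
    expires_at: float  # seconds, on the same clock as the `now` arguments


def issue_code(now: float) -> PendingVerification:
    """Create a zero-padded 6-digit code from a cryptographically secure source."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    return PendingVerification(code=code, expires_at=now + CODE_TTL_SECONDS)


def verify_code(pending: PendingVerification, submitted: str, now: float) -> bool:
    """Accept the code only before expiry, comparing in constant time."""
    if now >= pending.expires_at:
        return False
    return hmac.compare_digest(
        pending.code.encode("utf-8"), submitted.encode("utf-8")
    )
```

Because a 6-digit code has only one million possible values, a real deployment would also want to limit the number of verification attempts per device. That limit is not described in this document.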
Related Documentation
- ARCHITECTURE.md: Detailed system architecture and data flows
- HOW_IT_WORKS.md: Non-technical explanation for end users
- API.md: Complete API reference
- macos/INSTALL.md: macOS agent installation guide
- macos/DEPLOYMENT.md: Remote deployment guide
- ml-service/docs/LOCATION_VERIFICATION.md: Location verification feature (bonus capability)
2-Line Summary for Whitepaper
Conversation Assistant: Distributed AI system syncing iMessage conversations from macOS and generating contextually appropriate responses using self-hosted 3B-7B parameter language models with GPU acceleration, deterministic caching, and LoRA fine-tuning for personalization.
Investor Value: Cost reducer. Saves $800/month per provider in third-party AI API costs while reclaiming 90-150 minutes daily through automated response generation, with privacy-first architecture ensuring no message data leaves provider infrastructure.
Template Version: 1.1.0
Last Updated: 2026-02-06
Author: Lilith Platform Team