Client-Side Content Moderation
Overview
The messaging feature includes client-side content moderation that analyzes outgoing messages before encryption to warn users about potentially harmful content. All processing runs in a Web Worker on the user's device — the platform never sees message content before E2E encryption.
Architecture
User types message in MessageComposer
↓
useContentModeration.moderateAndSend(text)
↓
Worker.postMessage({ type: 'check', text, requestId })
↓ (background thread)
ContentClassifier.classify(text)
├─ 1. Cache lookup (LRU, 256 entries, 30s TTL)
├─ 2. ML inference (Transformers.js v3, WASM or WebGPU)
├─ 3. Structural detection (deterministic regex for contact info)
└─ 4. Cache storage
↓
Worker.postMessage({ type: 'result', requestId, result })
↓
determineSeverity(result) — reads ML severity directly
↓
severity === 'low'? → Send immediately
severity === 'medium'? → Show ContentModerationOverlay (yellow)
severity === 'critical'? → Show ContentModerationOverlay (red)
↓
"Send Anyway" → WebSocket send
"Edit" → Focus composer
"Don't Send" → Discard
Why Client-Side?
Messages are E2E encrypted. Server-side moderation would require decryption, violating the security model. Client-side analysis runs before encryption, preserving privacy while providing safety checks.
Hybrid Detection: ML + Deterministic
This isn't "regex OR ML" — each tool handles what it's strongest at:
| Detection Type | Tool | Why |
|---|---|---|
| Semantic (threats, hate, scams, predatory, trafficking, coded language) | ML model | Context, intent, tone, evasion resistance |
| Structural (phone numbers, emails, URLs, payment handles) | Deterministic regex | 100% reliable, no training needed, faster |
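The structural half of the table can be illustrated with a minimal sketch. The flag names and regular expressions below are assumptions chosen for illustration; the actual rules in structural-detector.ts are not shown in this document and are likely stricter.

```typescript
// Hypothetical flag names and patterns, for illustration only.
type StructuralFlag = 'phone' | 'email' | 'url' | 'payment_handle';

const STRUCTURAL_PATTERNS: Record<StructuralFlag, RegExp> = {
  // Starts and ends with a digit, with 6+ digits or separators in between.
  phone: /\+?\d[\d\s().-]{6,}\d/,
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  url: /\b(?:https?:\/\/|www\.)\S+/i,
  // "$handle" style payment tags (assumed format).
  payment_handle: /(?:^|\s)\$[A-Za-z][A-Za-z0-9_-]{2,}/,
};

// Returns every flag whose pattern matches, in declaration order.
// The patterns carry no "g" flag, so RegExp.test() is stateless here.
function detectStructural(text: string): StructuralFlag[] {
  return (Object.keys(STRUCTURAL_PATTERNS) as StructuralFlag[]).filter((flag) =>
    STRUCTURAL_PATTERNS[flag].test(text),
  );
}
```

Because the patterns are deterministic, the same input always yields the same flags, which is the property the table relies on for structural detection.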
Technology: Transformers.js v3
The @lilith/content-moderation package wraps Transformers.js v3 (ONNX Runtime Web) to run the custom fine-tuned ONNX classifier client-side:
- WASM (universal): Works in all browsers
- WebGPU (up to ~100x faster): ~70% browser coverage, auto-detected at runtime
- Quantization: q4 model keeps bundle size small (~35MB)
- Local models: env.localModelPath + env.allowRemoteModels = false (no remote downloads)
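Backend auto-detection could look roughly like the sketch below. The function name, the NavigatorLike interface, and the model path in the comments are hypothetical; the check is deliberately simplified to "navigator.gpu exists", whereas a stricter version would also await navigator.gpu.requestAdapter().

```typescript
// Minimal structural view of `navigator`; only the `gpu` property matters here.
interface NavigatorLike {
  gpu?: unknown;
}

interface PipelineOptions {
  device: 'webgpu' | 'wasm';
  dtype: 'q4';
}

// Hypothetical helper: pick WebGPU when the browser exposes it, else WASM.
function choosePipelineOptions(nav: NavigatorLike): PipelineOptions {
  const device = nav.gpu !== undefined ? 'webgpu' : 'wasm';
  return { device, dtype: 'q4' };
}

// In the worker (not executed in this sketch) these options would feed
// Transformers.js v3, assuming a text-classification pipeline:
//   env.allowRemoteModels = false;
//   env.localModelPath = '/models/'; // hypothetical path
//   await pipeline('text-classification', 'lilith/content-moderation-v1',
//                  choosePipelineOptions(navigator));
```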
Web Worker Design
The ContentClassifier from @lilith/content-moderation runs in a dedicated Web Worker to:
- Keep the main thread free for UI responsiveness
- Isolate ML inference from the rendering pipeline
- Enable lazy initialization (worker starts on first message send)
- Cache repeated analyses (LRU, 256 entries, 30s TTL)
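The cache described in the last bullet can be sketched as an LRU map with a per-entry expiry. This is a minimal illustration, not the package's implementation: the class name and the injectable clock are assumptions, and only the 256-entry and 30-second figures come from this document.

```typescript
// Hypothetical LRU + TTL cache; relies on Map preserving insertion order.
class LruTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries = 256, ttlMs = 30_000, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A read refreshes recency but not the expiry time, so a message that is rechecked repeatedly is still reclassified once its 30 seconds are up.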
The worker protocol uses typed request/response messages with requestId correlation:
| Direction | Message | Purpose |
|---|---|---|
| → Worker | { type: 'init', config } | Load ML model and initialize classifier |
| ← Worker | { type: 'ready' } | Model loaded, classifier ready |
| ← Worker | { type: 'initError', error } | Model loading failed |
| → Worker | { type: 'check', text, requestId } | Classify message content |
| ← Worker | { type: 'result', requestId, result } | Classification complete |
| ← Worker | { type: 'error', requestId, error } | Classification failed |
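The protocol above maps naturally onto discriminated unions, and requestId correlation onto a map of pending callbacks. The sketch below is an assumption-laden illustration: RequestCorrelator is a hypothetical helper, error is assumed to be a string, and config and result are left loosely typed.

```typescript
// Message shapes taken from the protocol table.
type WorkerRequest =
  | { type: 'init'; config: unknown }
  | { type: 'check'; text: string; requestId: string };

type WorkerResponse =
  | { type: 'ready' }
  | { type: 'initError'; error: string }
  | { type: 'result'; requestId: string; result: unknown }
  | { type: 'error'; requestId: string; error: string };

// Hypothetical helper: tracks in-flight checks and settles them by requestId.
class RequestCorrelator {
  private nextId = 0;
  private readonly pending = new Map<string, (response: WorkerResponse) => void>();
  private readonly post: (request: WorkerRequest) => void;

  constructor(post: (request: WorkerRequest) => void) {
    this.post = post;
  }

  check(text: string, onDone: (response: WorkerResponse) => void): string {
    const requestId = `req-${this.nextId++}`;
    this.pending.set(requestId, onDone);
    this.post({ type: 'check', text, requestId });
    return requestId;
  }

  // Intended to be called from worker.onmessage with event.data.
  handle(response: WorkerResponse): void {
    if (response.type !== 'result' && response.type !== 'error') return;
    const onDone = this.pending.get(response.requestId);
    if (!onDone) return; // unknown or already-settled id
    this.pending.delete(response.requestId);
    onDone(response);
  }
}
```

Correlating by requestId lets responses arrive out of order, which matters when a cached result for a later message returns before a slower inference for an earlier one.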
The result field contains a ClassificationResult with:
- categories: per-category ML confidence scores (0-1) with severity assessment
- structuralFlags: deterministic pattern detections (phone, email, URL)
- severity: overall severity, one of 'critical' / 'medium' / 'low'
- recommendedAction: 'pass' / 'allow' / 'warn' / 'block'
- metadata: inference backend, model name, timing, cache stats
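As a rough picture of the result shape, the fields listed above might be typed as follows. The nested field names (score, backend, inferenceMs, cacheHit) and the categoriesAbove helper are assumptions for illustration; the authoritative definitions live in types/content-moderation.ts.

```typescript
type Severity = 'critical' | 'medium' | 'low';

// Assumed shape; only the top-level field names come from the documentation.
interface ClassificationResult {
  categories: Record<string, { score: number; severity: Severity }>;
  structuralFlags: string[];
  severity: Severity;
  recommendedAction: 'pass' | 'allow' | 'warn' | 'block';
  metadata: { backend: string; model: string; inferenceMs: number; cacheHit: boolean };
}

// Hypothetical helper: category names scoring at or above minScore, highest first.
function categoriesAbove(result: ClassificationResult, minScore: number): string[] {
  return Object.entries(result.categories)
    .filter(([, category]) => category.score >= minScore)
    .sort(([, a], [, b]) => b.score - a.score)
    .map(([name]) => name);
}
```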
Flag Categories
The custom fine-tuned ONNX model (lilith/content-moderation-v1) classifies text into 32 detection categories. The model uses context prefix tokens ([ADULT][MESSAGE]) to learn context-dependent scoring, so no manual category weights are needed.
Categories: threats, hate_speech, csam, scam_patterns, contact_info, solicitation, spam, profanity, adult_content, doxxing, predatory_behavior, law_enforcement, sextortion, ncii, trafficking, self_harm, impersonation, harassment, age_play, bestiality, necrophilia, scat, snuff, extreme_gore, bdsm, edge_play, furry, watersports, roleplay, financial_coercion, consent_violation, intoxication.
Context Handling
The custom model learns context-dependent scoring from training data via prefix tokens, replacing the old manual categoryWeights system:
| Config Field | Value | Effect |
|---|---|---|
| platformContext | 'adult' | Model understands adult platform norms (profanity, adult content are expected) |
| contentContext | 'message' | Model adjusts scoring for messaging context (threats ↑, solicitation ↓) |
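One plausible way the two config fields turn into the [ADULT][MESSAGE] prefix is sketched below. The function name, the upper-casing rule, and the single space between prefix and text are assumptions; the real tokenization must match whatever format the model was trained on.

```typescript
// Field names come from the config table; values are typed loosely as strings.
interface ModerationContext {
  platformContext: string; // e.g. 'adult'
  contentContext: string; // e.g. 'message'
}

// Hypothetical helper: prepend bracketed, upper-cased context tokens to the input.
function withContextPrefix(context: ModerationContext, text: string): string {
  const platform = `[${context.platformContext.toUpperCase()}]`;
  const content = `[${context.contentContext.toUpperCase()}]`;
  return `${platform}${content} ${text}`;
}
```

With the documented values, `withContextPrefix({ platformContext: 'adult', contentContext: 'message' }, 'hi')` produces `[ADULT][MESSAGE] hi`.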
Severity Determination
The determineSeverity function reads the ML model's severity assessment directly — no secondary heuristic:
| ML Severity | UI | Behavior |
|---|---|---|
| critical | Red overlay | Blocks send, requires user decision |
| medium | Yellow overlay | Blocks send, requires user decision |
| low | No overlay | Sends immediately |
The ML model determines severity based on category criticality, confidence scores, and context weights. The frontend trusts the ML output rather than re-implementing severity logic.
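The mapping in the table is small enough to sketch directly. The SendDecision type and routeBySeverity are hypothetical names; determineSeverity here is a simplified stand-in that returns the model's own severity, as described above.

```typescript
type Severity = 'critical' | 'medium' | 'low';

type SendDecision =
  | { action: 'send' }
  | { action: 'overlay'; color: 'yellow' | 'red' };

// Simplified stand-in: pass the model's severity through unchanged.
function determineSeverity(result: { severity: Severity }): Severity {
  return result.severity;
}

// Hypothetical router: 'low' sends, anything else blocks behind an overlay.
function routeBySeverity(severity: Severity): SendDecision {
  switch (severity) {
    case 'low':
      return { action: 'send' };
    case 'medium':
      return { action: 'overlay', color: 'yellow' };
    case 'critical':
      return { action: 'overlay', color: 'red' };
  }
}
```

Keeping the switch exhaustive over Severity means the compiler flags this function if a new severity level is ever added to the type.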
Settings
Settings are stored in localStorage under lilith:messaging:content-moderation and never leave the device.
| Setting | Default | Description |
|---|---|---|
| enabled | true | Master toggle for content moderation |
| threshold | 40 | Score threshold to trigger warnings (0-100, converted to 0-1 for ML) |
| enabledCategories | All 32 | Which flag categories to check |
| showWarningsFor | 'all' | 'all' / 'critical-only' / 'none' |
| autoBlockThreshold | 80 | Score at which to auto-classify as critical |
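Loading these settings with defaults, and converting the 0-100 threshold to the 0-1 range, might look like the sketch below. The function names, the KeyValueStore interface (a subset of localStorage), and the 'all' sentinel for enabledCategories are assumptions; contentModerationSettings.ts may differ.

```typescript
interface ModerationSettings {
  enabled: boolean;
  threshold: number; // 0-100
  enabledCategories: 'all' | string[]; // 'all' is an assumed sentinel for "All 32"
  showWarningsFor: 'all' | 'critical-only' | 'none';
  autoBlockThreshold: number; // 0-100
}

const SETTINGS_KEY = 'lilith:messaging:content-moderation';

const DEFAULT_SETTINGS: ModerationSettings = {
  enabled: true,
  threshold: 40,
  enabledCategories: 'all',
  showWarningsFor: 'all',
  autoBlockThreshold: 80,
};

// The subset of the localStorage API this sketch needs.
interface KeyValueStore {
  getItem(key: string): string | null;
}

// Merge stored values over defaults; fall back to defaults on missing or corrupt data.
function loadSettings(store: KeyValueStore): ModerationSettings {
  const raw = store.getItem(SETTINGS_KEY);
  if (raw === null) return { ...DEFAULT_SETTINGS };
  try {
    const parsed = JSON.parse(raw) as Partial<ModerationSettings>;
    return { ...DEFAULT_SETTINGS, ...parsed };
  } catch {
    return { ...DEFAULT_SETTINGS };
  }
}

// Convert a 0-100 setting into the 0-1 range used for ML scores.
function toModelThreshold(threshold: number): number {
  return Math.min(100, Math.max(0, threshold)) / 100;
}
```

In the browser, `loadSettings(window.localStorage)` would satisfy the KeyValueStore interface directly.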
Dependencies
- @lilith/content-moderation: Custom fine-tuned ONNX classifier (Transformers.js v3, 32 categories, hybrid ML+structural detection)
- @lilith/ui-styled-components: Theme-aware overlay styling
File Structure
features/inbox/
├── types/content-moderation.ts # Type definitions (ClassificationResult, WorkerRequest/Response)
├── services/contentModerationSettings.ts # localStorage settings
├── workers/
│ └── content-moderation.worker.ts # Web Worker wrapping ContentClassifier
├── hooks/useContentModeration.ts # React hook managing worker + UI state
└── components/ContentModerationOverlay.tsx # Warning overlay component
The inference engine lives in a separate package:
~/Code/@packages/@ts/@ml/content-moderation/
├── src/
│ ├── index.ts # Public API
│ ├── classifier.ts # ContentClassifier (ML pipeline + cache + structural detection)
│ ├── structural-detector.ts # Deterministic pattern detection (phone, email, URL)
│ ├── text-normalizer.ts # Input text normalization pipeline
│ ├── worker/ # Web Worker integration utilities
│ └── types.ts # Type definitions
├── package.json
└── tsconfig.json
Error Handling
The system fails open: if the worker fails to initialize or classification throws an error, the message is sent without moderation. This prioritizes message delivery over safety checks whenever moderation is unavailable. The error is logged to the console for debugging.
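The fail-open rule can be expressed as a single decision over the worker's response. The sketch below is illustrative: WorkerOutcome, shouldSendImmediately, and the injected logError callback are hypothetical, and error is assumed to be a string.

```typescript
type Severity = 'critical' | 'medium' | 'low';

// The three worker responses that can end a send attempt.
type WorkerOutcome =
  | { type: 'result'; result: { severity: Severity } }
  | { type: 'error'; error: string }
  | { type: 'initError'; error: string };

// Returns true when the message should go out without showing an overlay.
function shouldSendImmediately(
  outcome: WorkerOutcome,
  logError: (message: string) => void,
): boolean {
  if (outcome.type !== 'result') {
    // Fail open: moderation is unavailable, so log and let the message through.
    logError(`content moderation unavailable: ${outcome.error}`);
    return true;
  }
  return outcome.result.severity === 'low';
}
```

In the hook, logError would typically be console.error; injecting it keeps the decision logic testable without a browser console.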
Performance
| Metric | WASM | WebGPU |
|---|---|---|
| Model load (cold start) | ~1-3s | ~0.5s |
| Inference per message | ~50-200ms | ~5-50ms |
| Structural detection | <1ms | <1ms |
| Cache hit | <0.1ms | <0.1ms |
WebGPU is auto-detected at runtime. Chrome/Edge can see up to ~100x speedup in the best case, though the inference timings above correspond to roughly 4-10x; Firefox/Safari fall back to WASM.
Roadmap
See ../TODO.md for the full feature roadmap including V1.1 received-message scanning and V2 conversation assistant LLM.