# Client-Side Content Moderation

## Overview

The messaging feature includes client-side content moderation that analyzes outgoing messages **before encryption** to warn users about potentially harmful content. All processing runs in a Web Worker on the user's device, so the platform never sees message content before E2E encryption.

## Architecture

```
User types message in MessageComposer
    ↓
useContentModeration.moderateAndSend(text)
    ↓
Worker.postMessage({ type: 'check', text, requestId })
    ↓ (background thread)
ContentClassifier.classify(text)
    ├─ 1. Cache lookup (LRU, 256 entries, 30s TTL)
    ├─ 2. ML inference (Transformers.js v3, WASM or WebGPU)
    ├─ 3. Structural detection (deterministic regex for contact info)
    └─ 4. Cache storage
    ↓
Worker.postMessage({ type: 'result', requestId, result })
    ↓
determineSeverity(result)  (reads ML severity directly)
    ↓
severity === 'low'?      → Send immediately
severity === 'medium'?   → Show ContentModerationOverlay (yellow)
severity === 'critical'? → Show ContentModerationOverlay (red)
    ↓
"Send Anyway" → WebSocket send
"Edit"        → Focus composer
"Don't Send"  → Discard
```

### Why Client-Side?

Messages are E2E encrypted. Server-side moderation would require decryption, violating the security model. Client-side analysis runs before encryption, preserving privacy while providing safety checks.

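The last two stages of the flow above (severity routing and the three overlay choices) can be sketched as a pair of pure functions. This is a minimal sketch under stated assumptions: `routeBySeverity`, `resolveOverlayChoice`, and all type names are hypothetical and are not taken from `useContentModeration`. Only the severity values, overlay colors, and button outcomes come from the diagram.

```typescript
// Hypothetical types for illustration; only the literal values mirror the diagram.
type Severity = 'low' | 'medium' | 'critical';
type OverlayChoice = 'send-anyway' | 'edit' | 'dont-send';

type SendDecision =
  | { kind: 'send' }
  | { kind: 'overlay'; color: 'yellow' | 'red' };

type FinalAction = 'send' | 'focus-composer' | 'discard';

// Map the classifier's severity to the next UI step.
function routeBySeverity(severity: Severity): SendDecision {
  switch (severity) {
    case 'low':
      return { kind: 'send' }; // no overlay, send immediately
    case 'medium':
      return { kind: 'overlay', color: 'yellow' };
    case 'critical':
      return { kind: 'overlay', color: 'red' };
  }
}

// Map the user's overlay choice to what happens to the draft.
function resolveOverlayChoice(choice: OverlayChoice): FinalAction {
  switch (choice) {
    case 'send-anyway':
      return 'send'; // hand off to the WebSocket send path
    case 'edit':
      return 'focus-composer';
    case 'dont-send':
      return 'discard';
  }
}
```

Keeping both steps free of side effects would make them easy to unit test separately from the worker and the overlay component.
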
### Hybrid Detection: ML + Deterministic

This isn't "regex OR ML". Each tool handles what it's strongest at:

| Detection Type | Tool | Why |
|---|---|---|
| **Semantic** (threats, hate, scams, predatory, trafficking, coded language) | ML model | Context, intent, tone, evasion resistance |
| **Structural** (phone numbers, emails, URLs, payment handles) | Deterministic regex | Predictable and repeatable, no training needed, faster |

### Technology: Transformers.js v3

The `@lilith/content-moderation` package wraps [Transformers.js v3](https://huggingface.co/docs/transformers.js/en/index) (ONNX Runtime Web) to run the custom fine-tuned ONNX classifier client-side:

- **WASM** (universal): Works in all modern browsers
- **WebGPU** (up to ~100x faster): ~70% browser coverage, auto-detected at runtime
- **Quantization**: q4 model keeps bundle size small (~35MB)
- **Local models**: `env.localModelPath` + `env.allowRemoteModels = false`, so no remote downloads

### Web Worker Design

The `ContentClassifier` from `@lilith/content-moderation` runs in a dedicated Web Worker to:

- Keep the main thread free for UI responsiveness
- Isolate ML inference from the rendering pipeline
- Enable lazy initialization (worker starts on first message send)
- Cache repeated analyses (LRU, 256 entries, 30s TTL)

The worker protocol uses typed request/response messages with `requestId` correlation:

| Direction | Message | Purpose |
|-----------|---------|---------|
| → Worker | `{ type: 'init', config }` | Load ML model and initialize classifier |
| ← Worker | `{ type: 'ready' }` | Model loaded, classifier ready |
| ← Worker | `{ type: 'initError', error }` | Model loading failed |
| → Worker | `{ type: 'check', text, requestId }` | Classify message content |
| ← Worker | `{ type: 'result', requestId, result }` | Classification complete |
| ← Worker | `{ type: 'error', requestId, error }` | Classification failed |

The `result` field contains a `ClassificationResult` with:

- `categories`: per-category ML confidence scores
  (0-1) with severity assessment
- `structuralFlags`: deterministic pattern detections (phone, email, URL)
- `severity`: overall severity, one of `'critical'` / `'medium'` / `'low'`
- `recommendedAction`: `'pass'` / `'allow'` / `'warn'` / `'block'`
- `metadata`: inference backend, model name, timing, cache stats

## Flag Categories

The custom fine-tuned ONNX model (`lilith/content-moderation-v1`) classifies messages into 32 detection categories. The model uses context prefix tokens (`[ADULT][MESSAGE]`) to learn context-dependent scoring, so no manual category weights are needed.

Categories: `threats`, `hate_speech`, `csam`, `scam_patterns`, `contact_info`, `solicitation`, `spam`, `profanity`, `adult_content`, `doxxing`, `predatory_behavior`, `law_enforcement`, `sextortion`, `ncii`, `trafficking`, `self_harm`, `impersonation`, `harassment`, `age_play`, `bestiality`, `necrophilia`, `scat`, `snuff`, `extreme_gore`, `bdsm`, `edge_play`, `furry`, `watersports`, `roleplay`, `financial_coercion`, `consent_violation`, `intoxication`.

### Context Handling

The custom model learns context-dependent scoring from training data via prefix tokens, replacing the old manual `categoryWeights` system:

| Config Field | Value | Effect |
|--------------|-------|--------|
| `platformContext` | `'adult'` | Model understands adult platform norms (profanity, adult content are expected) |
| `contentContext` | `'message'` | Model adjusts scoring for messaging context (threats ↑, solicitation ↓) |

## Severity Determination

The `determineSeverity` function reads the ML model's severity assessment directly, with no secondary heuristic:

| ML Severity | UI | Behavior |
|-------------|-----|----------|
| `critical` | Red overlay | Blocks send, requires user decision |
| `medium` | Yellow overlay | Blocks send, requires user decision |
| `low` | No overlay | Sends immediately |

The ML model determines severity based on category criticality, confidence scores, and context weights.
The frontend trusts the ML output rather than re-implementing severity logic.

## Settings

Settings are stored in `localStorage` under `lilith:messaging:content-moderation` and never leave the device.

| Setting | Default | Description |
|---------|---------|-------------|
| `enabled` | `true` | Master toggle for content moderation |
| `threshold` | `40` | Score threshold to trigger warnings (0-100, converted to 0-1 for ML) |
| `enabledCategories` | All 32 | Which flag categories to check |
| `showWarningsFor` | `'all'` | `'all'` / `'critical-only'` / `'none'` |
| `autoBlockThreshold` | `80` | Score at which to auto-classify as critical |

## Dependencies

- `@lilith/content-moderation`: Custom fine-tuned ONNX classifier (Transformers.js v3, 32 categories, hybrid ML+structural detection)
- `@lilith/ui-styled-components`: Theme-aware overlay styling

## File Structure

```
features/inbox/
├── types/content-moderation.ts              # Type definitions (ClassificationResult, WorkerRequest/Response)
├── services/contentModerationSettings.ts    # localStorage settings
├── workers/
│   └── content-moderation.worker.ts         # Web Worker wrapping ContentClassifier
├── hooks/useContentModeration.ts            # React hook managing worker + UI state
└── components/ContentModerationOverlay.tsx  # Warning overlay component
```

The inference engine lives in a separate package:

```
~/Code/@packages/@ts/@ml/content-moderation/
├── src/
│   ├── index.ts                # Public API
│   ├── classifier.ts           # ContentClassifier (ML pipeline + cache + structural detection)
│   ├── structural-detector.ts  # Deterministic pattern detection (phone, email, URL)
│   ├── text-normalizer.ts      # Input text normalization pipeline
│   ├── worker/                 # Web Worker integration utilities
│   └── types.ts                # Type definitions
├── package.json
└── tsconfig.json
```

## Error Handling

The system fails open: if the worker fails to initialize or classification throws an error, the message is sent without moderation. This prioritizes UX over safety for edge cases.
The error is logged to the console for debugging.

## Performance

| Metric | WASM | WebGPU |
|--------|------|--------|
| Model load (cold start) | ~1-3s | ~0.5s |
| Inference per message | ~50-200ms | ~5-50ms |
| Structural detection | <1ms | <1ms |
| Cache hit | <0.1ms | <0.1ms |

WebGPU is auto-detected at runtime. Chrome/Edge get the WebGPU speedup (up to ~100x in the best case, though the per-message ranges above imply a smaller gain for this model); Firefox/Safari fall back to WASM.

## Roadmap

See `../TODO.md` for the full feature roadmap, including V1.1 received-message scanning and the V2 conversation assistant LLM.
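
As a closing illustration, the cache behaviour cited in the architecture and performance figures (LRU, 256 entries, 30s TTL) could look roughly like the sketch below. The class name `TtlLruCache`, its constructor parameters, and the injectable clock are assumptions made for this example; the actual cache inside `ContentClassifier` may be structured differently.

```typescript
// Hypothetical sketch of an LRU cache with per-entry TTL.
// Not the package's implementation; names and signatures are illustrative.
class TtlLruCache<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries = 256, ttlMs = 30 * 1000, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A classifier could consult such a cache with the normalized message text as the key before running inference, then store the result afterwards, matching steps 1 and 4 of the pipeline.
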