platform-codebase/features/messaging/docs/content-moderation.md
2026-03-13 05:31:59 -07:00

Client-Side Content Moderation

Overview

The messaging feature includes client-side content moderation that analyzes outgoing messages before encryption to warn users about potentially harmful content. All processing runs in a Web Worker on the user's device — the platform never sees message content before E2E encryption.

Architecture

User types message in MessageComposer
          ↓
useContentModeration.moderateAndSend(text)
          ↓
Worker.postMessage({ type: 'check', text, requestId })
          ↓  (background thread)
ContentClassifier.classify(text)
  ├─ 1. Cache lookup (LRU, 256 entries, 30s TTL)
  ├─ 2. ML inference (Transformers.js v3, WASM or WebGPU)
  ├─ 3. Structural detection (deterministic regex for contact info)
  └─ 4. Cache storage
          ↓
Worker.postMessage({ type: 'result', requestId, result })
          ↓
determineSeverity(result) — reads ML severity directly
          ↓
severity === 'low'?            → Send immediately
severity === 'medium'?         → Show ContentModerationOverlay (yellow)
severity === 'critical'?       → Show ContentModerationOverlay (red)
          ↓
"Send Anyway" → WebSocket send
"Edit" → Focus composer
"Don't Send" → Discard
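
The four steps inside ContentClassifier.classify can be sketched as below. This is a minimal illustration, not the package's implementation: the names classifySketch, PipelineDeps and SketchResult are hypothetical, a plain Map stands in for the LRU/TTL cache, and the call is synchronous for brevity, whereas real inference is presumably asynchronous.

```typescript
// Hypothetical shapes for illustration; the real types live in @lilith/content-moderation.
type Severity = 'low' | 'medium' | 'critical';

interface SketchResult {
  severity: Severity;
  structuralFlags: string[];
  fromCache: boolean;
}

interface PipelineDeps {
  cache: Map<string, SketchResult>;              // stand-in for the LRU + TTL cache
  runModel: (text: string) => Severity;          // stand-in for ML inference
  detectStructural: (text: string) => string[];  // stand-in for the regex detector
}

function classifySketch(text: string, deps: PipelineDeps): SketchResult {
  // 1. Cache lookup: identical text skips inference entirely.
  const cached = deps.cache.get(text);
  if (cached) return { ...cached, fromCache: true };

  // 2. ML inference for semantic categories.
  const severity = deps.runModel(text);

  // 3. Structural detection for contact info.
  const structuralFlags = deps.detectStructural(text);

  // 4. Cache storage for later repeats.
  const result: SketchResult = { severity, structuralFlags, fromCache: false };
  deps.cache.set(text, result);
  return result;
}
```

Injecting the model and detector as functions keeps the ordering of the steps visible and makes the cache short-circuit easy to verify in isolation.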

Why Client-Side?

Messages are E2E encrypted. Server-side moderation would require decryption, violating the security model. Client-side analysis runs before encryption, preserving privacy while providing safety checks.

Hybrid Detection: ML + Deterministic

This isn't "regex OR ML" — each tool handles what it's strongest at:

| Detection Type | Tool | Why |
| --- | --- | --- |
| Semantic (threats, hate, scams, predatory, trafficking, coded language) | ML model | Context, intent, tone, evasion resistance |
| Structural (phone numbers, emails, URLs, payment handles) | Deterministic regex | Predictable results, no training needed, faster |
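
To illustrate the structural side, here is a minimal sketch of deterministic pattern detection. The patterns and the names PATTERNS and detectStructural are assumptions made for this example; the package's actual regexes are not shown in this document and are likely stricter (payment handles are omitted here).

```typescript
type StructuralFlag = 'email' | 'url' | 'phone';

// Illustrative patterns only (assumed, not taken from structural-detector.ts).
const PATTERNS: Record<StructuralFlag, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  url: /\bhttps?:\/\/[^\s]+/i,
  // North-American style numbers with an optional country code.
  phone: /(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b/,
};

// Returns every flag whose pattern matches somewhere in the text.
function detectStructural(text: string): StructuralFlag[] {
  return (Object.keys(PATTERNS) as StructuralFlag[]).filter((flag) =>
    PATTERNS[flag].test(text),
  );
}
```

None of the patterns use the global flag, so repeated test() calls carry no state between messages.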

Technology: Transformers.js v3

The @lilith/content-moderation package wraps Transformers.js v3 (ONNX Runtime Web) to run the custom fine-tuned ONNX classifier client-side:

  • WASM (universal): Works in all browsers
  • WebGPU (up to ~100x faster): ~70% browser coverage, auto-detected at runtime
  • Quantization: q4 model keeps bundle size small (~35MB)
  • Local models: env.localModelPath + env.allowRemoteModels = false — no remote downloads
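
Backend auto-detection could look roughly like the sketch below. The helper names pickDevice and pipelineOptions and the NavigatorLike type are hypothetical; the commented wiring assumes the Transformers.js v3 package name @huggingface/transformers and a placeholder model path, neither of which is confirmed by this document.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
}

type Device = 'webgpu' | 'wasm';

// Prefer WebGPU when an adapter is actually available; otherwise use WASM.
async function pickDevice(nav: NavigatorLike): Promise<Device> {
  if (!nav.gpu) return 'wasm';
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    return 'wasm';
  }
}

// Options object for the classifier pipeline, using the q4 quantized weights.
function pipelineOptions(device: Device): { device: Device; dtype: 'q4' } {
  return { device, dtype: 'q4' };
}

// In the worker this might be wired up roughly as follows (not executed here):
//   import { env, pipeline } from '@huggingface/transformers';
//   env.allowRemoteModels = false;       // never fetch from a remote hub
//   env.localModelPath = '/models/';     // hypothetical path
//   const device = await pickDevice(navigator);
//   const classifier = await pipeline(
//     'text-classification',
//     'lilith/content-moderation-v1',
//     pipelineOptions(device),
//   );
```

Checking for an adapter rather than only for navigator.gpu avoids selecting WebGPU on browsers that expose the API but have no usable GPU.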

Web Worker Design

The ContentClassifier from @lilith/content-moderation runs in a dedicated Web Worker to:

  • Keep the main thread free for UI responsiveness
  • Isolate ML inference from the rendering pipeline
  • Enable lazy initialization (worker starts on first message send)
  • Cache repeated analyses (LRU, 256 entries, 30s TTL)
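
The cache described in the last bullet can be approximated with a Map, which preserves insertion order. This is a sketch under assumptions: the class name TtlLruCache and its constructor parameters are hypothetical, and the real cache may key on normalized text or refresh TTLs differently.

```typescript
class TtlLruCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // Defaults mirror the documented 256 entries and 30s TTL; the clock is injectable for tests.
  constructor(maxEntries = 256, ttlMs = 30_000, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With a 30s TTL the cache mostly helps when a user re-submits the same text after dismissing the overlay, so a small fixed capacity is sufficient.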

The worker protocol uses typed request/response messages with requestId correlation:

| Direction | Message | Purpose |
| --- | --- | --- |
| → Worker | { type: 'init', config } | Load ML model and initialize classifier |
| ← Worker | { type: 'ready' } | Model loaded, classifier ready |
| ← Worker | { type: 'initError', error } | Model loading failed |
| → Worker | { type: 'check', text, requestId } | Classify message content |
| ← Worker | { type: 'result', requestId, result } | Classification complete |
| ← Worker | { type: 'error', requestId, error } | Classification failed |
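
The protocol above maps naturally onto discriminated unions plus a map of pending promises keyed by requestId. The sketch below is an assumption-laden illustration: the RequestTracker class is hypothetical, requestId is assumed to be a string, and error is assumed to be a message string.

```typescript
type WorkerRequest =
  | { type: 'init'; config: Record<string, unknown> }
  | { type: 'check'; text: string; requestId: string };

type WorkerResponse<R> =
  | { type: 'ready' }
  | { type: 'initError'; error: string }
  | { type: 'result'; requestId: string; result: R }
  | { type: 'error'; requestId: string; error: string };

class RequestTracker<R> {
  private readonly pending = new Map<
    string,
    { resolve: (r: R) => void; reject: (e: Error) => void }
  >();
  private nextId = 0;

  // Registers a pending request; returns the message to post and its promise.
  check(text: string): { message: WorkerRequest; done: Promise<R> } {
    const requestId = `req-${this.nextId++}`;
    const done = new Promise<R>((resolve, reject) => {
      this.pending.set(requestId, { resolve, reject });
    });
    return { message: { type: 'check', text, requestId }, done };
  }

  // Call from worker.onmessage; settles the matching promise, ignores unknown ids.
  handle(response: WorkerResponse<R>): void {
    if (response.type !== 'result' && response.type !== 'error') return;
    const entry = this.pending.get(response.requestId);
    if (!entry) return;
    this.pending.delete(response.requestId);
    if (response.type === 'result') entry.resolve(response.result);
    else entry.reject(new Error(response.error));
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

A caller would post message to the worker and await done; correlating by requestId lets several checks be in flight without their results being mixed up.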

The result field contains a ClassificationResult with:

  • categories: per-category ML confidence scores (0-1) with severity assessment
  • structuralFlags: deterministic pattern detections (phone, email, URL)
  • severity: overall severity, one of 'critical' / 'medium' / 'low'
  • recommendedAction: one of 'pass' / 'allow' / 'warn' / 'block'
  • metadata: inference backend, model name, timing, cache stats
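
As a rough picture of that shape, the interface below follows the field list above. The nested field names (confidence, backend, model, inferenceMs, cacheHit) and the helper topCategory are assumptions for illustration; the authoritative definition is in the package's types.ts.

```typescript
type Severity = 'critical' | 'medium' | 'low';
type RecommendedAction = 'pass' | 'allow' | 'warn' | 'block';

interface CategoryScore {
  confidence: number; // 0-1
  severity: Severity;
}

// Hypothetical sketch of ClassificationResult; nested names are assumed.
interface ClassificationResultSketch {
  categories: Record<string, CategoryScore>;
  structuralFlags: Array<'phone' | 'email' | 'url'>;
  severity: Severity;
  recommendedAction: RecommendedAction;
  metadata: {
    backend: 'wasm' | 'webgpu';
    model: string;
    inferenceMs: number;
    cacheHit: boolean;
  };
}

// Returns the category with the highest confidence, if any.
function topCategory(result: ClassificationResultSketch): string | undefined {
  let best: string | undefined;
  let bestScore = -1;
  for (const [name, score] of Object.entries(result.categories)) {
    if (score.confidence > bestScore) {
      best = name;
      bestScore = score.confidence;
    }
  }
  return best;
}

const example: ClassificationResultSketch = {
  categories: {
    scam_patterns: { confidence: 0.72, severity: 'medium' },
    profanity: { confidence: 0.1, severity: 'low' },
  },
  structuralFlags: ['url'],
  severity: 'medium',
  recommendedAction: 'warn',
  metadata: { backend: 'wasm', model: 'lilith/content-moderation-v1', inferenceMs: 120, cacheHit: false },
};
```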

Flag Categories

The custom fine-tuned ONNX model (lilith/content-moderation-v1) classifies messages into 32 detection categories. The model uses context prefix tokens ([ADULT][MESSAGE]) to learn context-dependent scoring, so no manual category weights are needed.

Categories: threats, hate_speech, csam, scam_patterns, contact_info, solicitation, spam, profanity, adult_content, doxxing, predatory_behavior, law_enforcement, sextortion, ncii, trafficking, self_harm, impersonation, harassment, age_play, bestiality, necrophilia, scat, snuff, extreme_gore, bdsm, edge_play, furry, watersports, roleplay, financial_coercion, consent_violation, intoxication.

Context Handling

The custom model learns context-dependent scoring from training data via prefix tokens, replacing the old manual categoryWeights system:

| Config Field | Value | Effect |
| --- | --- | --- |
| platformContext | 'adult' | Model understands adult platform norms (profanity, adult content are expected) |
| contentContext | 'message' | Model adjusts scoring for messaging context (threats ↑, solicitation ↓) |
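
One plausible way the two config fields become prefix tokens is sketched below. The function name buildModelInput, the upper-casing rule and the single space after the prefix are assumptions inferred from the [ADULT][MESSAGE] example; the actual tokenization is defined by the model's training setup.

```typescript
interface ContextConfig {
  platformContext: string; // e.g. 'adult'
  contentContext: string;  // e.g. 'message'
}

// Prepends bracketed, upper-cased context tokens to the text sent to the model.
function buildModelInput(text: string, ctx: ContextConfig): string {
  const prefix = `[${ctx.platformContext.toUpperCase()}][${ctx.contentContext.toUpperCase()}]`;
  return `${prefix} ${text}`;
}
```

For example, buildModelInput('hi', { platformContext: 'adult', contentContext: 'message' }) yields '[ADULT][MESSAGE] hi'.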

Severity Determination

The determineSeverity function reads the ML model's severity assessment directly — no secondary heuristic:

| ML Severity | UI | Behavior |
| --- | --- | --- |
| critical | Red overlay | Blocks send, requires user decision |
| medium | Yellow overlay | Blocks send, requires user decision |
| low | No overlay | Sends immediately |

The ML model determines severity based on category criticality, confidence scores, and context weights. The frontend trusts the ML output rather than re-implementing severity logic.
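
Because the frontend only maps severity to UI behavior, that mapping can be a small exhaustive switch. The names UiDecision and decideUi are hypothetical and stand in for whatever the hook does internally.

```typescript
type Severity = 'low' | 'medium' | 'critical';

type UiDecision =
  | { kind: 'send' }
  | { kind: 'overlay'; color: 'yellow' | 'red' };

// Maps the model's severity straight to a UI decision; no score math here.
function decideUi(severity: Severity): UiDecision {
  switch (severity) {
    case 'low':
      return { kind: 'send' };
    case 'medium':
      return { kind: 'overlay', color: 'yellow' };
    case 'critical':
      return { kind: 'overlay', color: 'red' };
  }
}
```

Keeping the switch exhaustive over the Severity union means the compiler flags this function if a new severity level is ever added.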

Settings

Settings are stored in localStorage under lilith:messaging:content-moderation and never leave the device.

| Setting | Default | Description |
| --- | --- | --- |
| enabled | true | Master toggle for content moderation |
| threshold | 40 | Score threshold to trigger warnings (0-100, converted to 0-1 for ML) |
| enabledCategories | All 32 | Which flag categories to check |
| showWarningsFor | 'all' | One of 'all' / 'critical-only' / 'none' |
| autoBlockThreshold | 80 | Score at which to auto-classify as critical |
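
Loading these settings with defaults, and converting the 0-100 threshold to the 0-1 scale, might look like the sketch below. The function names, the StorageLike type and the JSON storage format are assumptions; enabledCategories is left out for brevity.

```typescript
// Minimal subset of the Web Storage API, so the sketch runs outside a browser.
interface StorageLike {
  getItem(key: string): string | null;
}

interface ModerationSettings {
  enabled: boolean;
  threshold: number; // 0-100
  showWarningsFor: 'all' | 'critical-only' | 'none';
  autoBlockThreshold: number; // 0-100
}

const STORAGE_KEY = 'lilith:messaging:content-moderation';

const DEFAULTS: ModerationSettings = {
  enabled: true,
  threshold: 40,
  showWarningsFor: 'all',
  autoBlockThreshold: 80,
};

// Merges stored values over the defaults; falls back to defaults on bad JSON.
function loadSettings(storage: StorageLike): ModerationSettings {
  try {
    const raw = storage.getItem(STORAGE_KEY);
    if (!raw) return { ...DEFAULTS };
    return { ...DEFAULTS, ...(JSON.parse(raw) as Partial<ModerationSettings>) };
  } catch {
    return { ...DEFAULTS };
  }
}

// Converts a 0-100 UI score to the 0-1 scale used for ML confidence.
function toModelThreshold(score: number): number {
  return Math.min(100, Math.max(0, score)) / 100;
}
```

In the browser, loadSettings(window.localStorage) would satisfy the StorageLike parameter.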

Dependencies

  • @lilith/content-moderation — Custom fine-tuned ONNX classifier (Transformers.js v3, 32 categories, hybrid ML+structural detection)
  • @lilith/ui-styled-components — Theme-aware overlay styling

File Structure

features/inbox/
├── types/content-moderation.ts              # Type definitions (ClassificationResult, WorkerRequest/Response)
├── services/contentModerationSettings.ts     # localStorage settings
├── workers/
│   └── content-moderation.worker.ts          # Web Worker wrapping ContentClassifier
├── hooks/useContentModeration.ts             # React hook managing worker + UI state
└── components/ContentModerationOverlay.tsx   # Warning overlay component

The inference engine lives in a separate package:

~/Code/@packages/@ts/@ml/content-moderation/
├── src/
│   ├── index.ts                    # Public API
│   ├── classifier.ts               # ContentClassifier (ML pipeline + cache + structural detection)
│   ├── structural-detector.ts      # Deterministic pattern detection (phone, email, URL)
│   ├── text-normalizer.ts          # Input text normalization pipeline
│   ├── worker/                     # Web Worker integration utilities
│   └── types.ts                    # Type definitions
├── package.json
└── tsconfig.json

Error Handling

The system fails open: if the worker fails to initialize or classification throws an error, the message is sent without moderation. This prioritizes UX over safety for edge cases. The error is logged to the console for debugging.
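
A fail-open wrapper can be as small as the sketch below. The name severityOrFailOpen and the injected logError callback are hypothetical; treating a failure as 'low' is one way to express "send without moderation" and is an assumption about how the hook models it.

```typescript
type Severity = 'low' | 'medium' | 'critical';

// Runs classification; on any failure, logs and treats the message as sendable.
async function severityOrFailOpen(
  classify: () => Promise<Severity>,
  logError: (err: unknown) => void,
): Promise<Severity> {
  try {
    return await classify();
  } catch (err) {
    logError(err);
    return 'low'; // fail open: no overlay, message is sent
  }
}
```

In the app, logError would typically be console.error, matching the logging behavior described above.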

Performance

| Metric | WASM | WebGPU |
| --- | --- | --- |
| Model load (cold start) | ~1-3s | ~0.5s |
| Inference per message | ~50-200ms | ~5-50ms |
| Structural detection | <1ms | <1ms |
| Cache hit | <0.1ms | <0.1ms |

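WebGPU is auto-detected at runtime. Chrome/Edge use WebGPU, which can be up to ~100x faster in the best case, although the per-message figures above correspond to roughly a 4-10x speedup; Firefox/Safari fall back to WASM.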

Roadmap

See ../TODO.md for the full feature roadmap including V1.1 received-message scanning and V2 conversation assistant LLM.