Lilith 8a6e4343d8 docs(messaging): 📝 Implement detailed content moderation rules, policies, and implementation guidance in messaging documentation

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-03-13 05:31:59 -07:00


Client-Side Content Moderation

Overview

The messaging feature includes client-side content moderation that analyzes outgoing messages before encryption and warns users about potentially harmful content. All processing runs in a Web Worker on the user's device, so the platform never sees plaintext message content.

Architecture

User types message in MessageComposer
          ↓
useContentModeration.moderateAndSend(text)
          ↓
Worker.postMessage({ type: 'check', text, requestId })
          ↓  (background thread)
ContentClassifier.classify(text)
  ├─ 1. Cache lookup (LRU, 256 entries, 30s TTL)
  ├─ 2. ML inference (Transformers.js v3, WASM or WebGPU)
  ├─ 3. Structural detection (deterministic regex for contact info)
  └─ 4. Cache storage
          ↓
Worker.postMessage({ type: 'result', requestId, result })
          ↓
determineSeverity(result) — reads ML severity directly
          ↓
severity === 'low'?            → Send immediately
severity === 'medium'?         → Show ContentModerationOverlay (yellow)
severity === 'critical'?       → Show ContentModerationOverlay (red)
          ↓
"Send Anyway" → WebSocket send
"Edit" → Focus composer
"Don't Send" → Discard

Why Client-Side?

Messages are E2E encrypted. Server-side moderation would require decryption, violating the security model. Client-side analysis runs before encryption, preserving privacy while providing safety checks.

Hybrid Detection: ML + Deterministic

This isn't "regex OR ML" — each tool handles what it's strongest at:

Detection Type	Tool	Why
Semantic (threats, hate, scams, predatory, trafficking, coded language)	ML model	Context, intent, tone, evasion resistance
Structural (phone numbers, emails, URLs, payment handles)	Deterministic regex	Deterministic results, no training needed, faster
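
As a rough illustration of the deterministic half, the sketch below flags contact details with plain regular expressions. The `detectStructural` name, the `StructuralFlag` shape, and the patterns themselves are simplified assumptions for illustration; they are not the rules shipped in @lilith/content-moderation, and production phone or URL matching would need more care.

```typescript
// Hypothetical sketch of deterministic structural detection.
// Patterns are deliberately loose and are NOT the package's real rules.
type StructuralKind = 'email' | 'url' | 'phone';

interface StructuralFlag {
  kind: StructuralKind;
  match: string;
}

const PATTERNS: ReadonlyArray<[StructuralKind, RegExp]> = [
  ['email', /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g],
  ['url', /\bhttps?:\/\/[^\s<>"']+/gi],
  // Optional "+", then at least 8 characters that start and end with a digit
  // and otherwise contain digits, spaces, dots, dashes, or parentheses.
  ['phone', /\+?\d[\d\s().-]{6,}\d/g],
];

function detectStructural(text: string): StructuralFlag[] {
  const flags: StructuralFlag[] = [];
  for (const [kind, pattern] of PATTERNS) {
    // With a global regex, String.prototype.match returns every match (or null).
    for (const match of text.match(pattern) || []) {
      flags.push({ kind, match });
    }
  }
  return flags;
}
```

Because the patterns are pure functions of the input, the same text always yields the same flags, which is the property the table above relies on.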

Technology: Transformers.js v3

The @lilith/content-moderation package wraps Transformers.js v3 (ONNX Runtime Web) to run the custom fine-tuned ONNX classifier client-side:

WASM (universal): Works in all browsers
WebGPU (up to ~100x faster): ~70% browser coverage, auto-detected at runtime
Quantization: q4 model keeps bundle size small (~35MB)
Local models: env.localModelPath + env.allowRemoteModels = false — no remote downloads
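
One way the runtime backend choice could look is sketched below. `detectDevice`, `buildPipelineOptions`, and the `NavigatorLike` stand-in are hypothetical names introduced for this example; only the `device` and `dtype` option names and the `navigator.gpu.requestAdapter()` call are taken from Transformers.js v3 and the WebGPU API respectively.

```typescript
// Minimal stand-ins for the parts of `navigator` used here, so the sketch
// does not depend on DOM or WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Device = 'webgpu' | 'wasm';

// Prefer WebGPU only when an adapter is actually available; otherwise use WASM.
async function detectDevice(nav: NavigatorLike): Promise<Device> {
  if (!nav.gpu) return 'wasm';
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    return 'wasm'; // Any WebGPU failure falls back to the universal backend.
  }
}

// Options in the shape accepted by Transformers.js v3 `pipeline()`.
async function buildPipelineOptions(nav: NavigatorLike) {
  return { device: await detectDevice(nav), dtype: 'q4' as const };
}
```

In a browser, the result would typically be passed as the third argument to `pipeline(...)` after setting `env.localModelPath` and `env.allowRemoteModels = false`, as listed above.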

Web Worker Design

The ContentClassifier from @lilith/content-moderation runs in a dedicated Web Worker to:

Keep the main thread free for UI responsiveness
Isolate ML inference from the rendering pipeline
Enable lazy initialization (worker starts on first message send)
Cache repeated analyses (LRU, 256 entries, 30s TTL)
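
The caching behaviour in the last item can be sketched with a `Map`, which preserves insertion order. This is a minimal illustration using the documented limits (256 entries, 30s TTL); the `LruTtlCache` class and its injectable clock are assumptions, not the package's actual cache.

```typescript
// Minimal LRU cache with a per-entry TTL. A sketch, not the package's cache.
class LruTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly maxEntries = 256,
    private readonly ttlMs = 30_000,
    private readonly now: () => number = Date.now, // injectable for tests
  ) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // Expired: drop it and report a miss.
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A short TTL like this mainly helps when the same draft is re-checked several times in quick succession, for example after the user picks "Edit" and resends unchanged text.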

The worker protocol uses typed request/response messages with requestId correlation:

Direction	Message	Purpose
→ Worker	`{ type: 'init', config }`	Load ML model and initialize classifier
← Worker	`{ type: 'ready' }`	Model loaded, classifier ready
← Worker	`{ type: 'initError', error }`	Model loading failed
→ Worker	`{ type: 'check', text, requestId }`	Classify message content
← Worker	`{ type: 'result', requestId, result }`	Classification complete
← Worker	`{ type: 'error', requestId, error }`	Classification failed
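
The protocol above maps naturally onto discriminated unions, with a pending-request map doing the `requestId` correlation. In this sketch the message `type` values come from the table, while `ModerationClient`, `WorkerPort`, the string-typed `requestId` and `error` fields, and the opaque `config`/`result` types are assumptions made for illustration.

```typescript
type WorkerRequest =
  | { type: 'init'; config: unknown }
  | { type: 'check'; text: string; requestId: string };

type WorkerResponse =
  | { type: 'ready' }
  | { type: 'initError'; error: string }
  | { type: 'result'; requestId: string; result: unknown }
  | { type: 'error'; requestId: string; error: string };

// Stand-in for the subset of the Worker API used here (no DOM typings needed).
interface WorkerPort {
  postMessage(message: WorkerRequest): void;
}

class ModerationClient {
  private nextId = 0;
  private readonly pending = new Map<
    string,
    { resolve: (result: unknown) => void; reject: (error: Error) => void }
  >();

  constructor(private readonly port: WorkerPort) {}

  // Send a check request and resolve when the matching response arrives.
  check(text: string): Promise<unknown> {
    const requestId = `req-${this.nextId++}`;
    return new Promise((resolve, reject) => {
      this.pending.set(requestId, { resolve, reject });
      this.port.postMessage({ type: 'check', text, requestId });
    });
  }

  // Call this from the worker's message handler.
  handleResponse(msg: WorkerResponse): void {
    if (msg.type !== 'result' && msg.type !== 'error') return;
    const entry = this.pending.get(msg.requestId);
    if (!entry) return; // Unknown or already-settled request.
    this.pending.delete(msg.requestId);
    if (msg.type === 'result') entry.resolve(msg.result);
    else entry.reject(new Error(msg.error));
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

Correlating by `requestId` means responses may arrive in any order without being attributed to the wrong message.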

The result field contains a ClassificationResult with:

categories — per-category ML confidence scores (0-1) with severity assessment
structuralFlags — deterministic pattern detections (phone, email, URL)
severity — overall severity: 'critical' / 'medium' / 'low'
recommendedAction — 'pass' / 'allow' / 'warn' / 'block'
metadata — inference backend, model name, timing, cache stats
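
For orientation, the field list above could correspond to a type along these lines. Only the top-level field names and the severity and action values come from this document; every nested field name (`score`, `backend`, `inferenceMs`, `cacheHit`, and so on) and the `categoriesAbove` helper are guesses for illustration.

```typescript
type Severity = 'critical' | 'medium' | 'low';
type RecommendedAction = 'pass' | 'allow' | 'warn' | 'block';

interface CategoryScore {
  score: number; // ML confidence in [0, 1]
  severity: Severity;
}

// Assumed shape; the real definition lives in types/content-moderation.ts.
interface ClassificationResult {
  categories: Record<string, CategoryScore>;
  structuralFlags: Array<{ kind: 'phone' | 'email' | 'url'; match: string }>;
  severity: Severity;
  recommendedAction: RecommendedAction;
  metadata: { backend: 'wasm' | 'webgpu'; model: string; inferenceMs: number; cacheHit: boolean };
}

// List category names whose confidence meets a 0-1 threshold, highest first.
function categoriesAbove(result: ClassificationResult, threshold: number): string[] {
  return Object.entries(result.categories)
    .filter(([, category]) => category.score >= threshold)
    .sort(([, a], [, b]) => b.score - a.score)
    .map(([name]) => name);
}
```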

Flag Categories

The custom fine-tuned ONNX model (lilith/content-moderation-v1) classifies messages into 32 detection categories. It uses context prefix tokens ([ADULT][MESSAGE]) to learn context-dependent scoring, so no manual category weights are needed.

Categories: threats, hate_speech, csam, scam_patterns, contact_info, solicitation, spam, profanity, adult_content, doxxing, predatory_behavior, law_enforcement, sextortion, ncii, trafficking, self_harm, impersonation, harassment, age_play, bestiality, necrophilia, scat, snuff, extreme_gore, bdsm, edge_play, furry, watersports, roleplay, financial_coercion, consent_violation, intoxication.

Context Handling

The custom model learns context-dependent scoring from training data via prefix tokens, replacing the old manual categoryWeights system:

Config Field	Value	Effect
`platformContext`	`'adult'`	Model understands adult platform norms (profanity, adult content are expected)
`contentContext`	`'message'`	Model adjusts scoring for messaging context (threats ↑, solicitation ↓)
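
A plausible reading of the prefix-token mechanism is that the two config values are turned into bracketed tokens and prepended to the text before tokenization. The sketch below shows that idea only; `withContextPrefix`, the upper-casing rule, and the single-space separator are assumptions, and the real classifier may build its input differently.

```typescript
interface ContextConfig {
  platformContext: string; // e.g. 'adult'
  contentContext: string; // e.g. 'message'
}

// Hypothetical: build the model input by prepending context prefix tokens.
function withContextPrefix(text: string, config: ContextConfig): string {
  const token = (value: string): string => `[${value.toUpperCase()}]`;
  return `${token(config.platformContext)}${token(config.contentContext)} ${text}`;
}
```

With the values from the table, `withContextPrefix('hello', { platformContext: 'adult', contentContext: 'message' })` yields `[ADULT][MESSAGE] hello`.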

Severity Determination

The determineSeverity function reads the ML model's severity assessment directly — no secondary heuristic:

ML Severity	UI	Behavior
`critical`	Red overlay	Blocks send, requires user decision
`medium`	Yellow overlay	Blocks send, requires user decision
`low`	No overlay	Sends immediately

The ML model determines severity from category criticality, confidence scores, and the context it learned from the prefix tokens. The frontend trusts the ML output rather than re-implementing severity logic.
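
The mapping in the table is small enough to express as an exhaustive switch. `decideUi` and the `UiDecision` type are hypothetical names for this sketch; the three severity values and overlay colours are taken from the table.

```typescript
type MlSeverity = 'critical' | 'medium' | 'low';

type UiDecision =
  | { kind: 'send' }
  | { kind: 'overlay'; tone: 'yellow' | 'red' };

// Map the ML severity straight to a UI decision, with no extra heuristics.
function decideUi(severity: MlSeverity): UiDecision {
  switch (severity) {
    case 'low':
      return { kind: 'send' };
    case 'medium':
      return { kind: 'overlay', tone: 'yellow' };
    case 'critical':
      return { kind: 'overlay', tone: 'red' };
  }
}
```

Keeping the switch exhaustive over the union means the compiler flags this function if a new severity level is ever added.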

Settings

Settings are stored in localStorage under lilith:messaging:content-moderation and never leave the device.

Setting	Default	Description
`enabled`	`true`	Master toggle for content moderation
`threshold`	`40`	Score threshold to trigger warnings (0-100, converted to 0-1 for ML)
`enabledCategories`	All 32	Which flag categories to check
`showWarningsFor`	`'all'`	`'all'` / `'critical-only'` / `'none'`
`autoBlockThreshold`	`80`	Score at which to auto-classify as critical
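
Loading these settings might look like the sketch below, which merges stored values over the documented defaults and converts the 0-100 threshold to the 0-1 scale. The storage key and defaults come from this section; `loadSettings`, `toMlThreshold`, the `StorageLike` stand-in, and the fallback-to-defaults behaviour on malformed JSON are assumptions, and `enabledCategories` is omitted for brevity.

```typescript
// Mirrors the one Web Storage method used here, so no DOM typings are needed.
interface StorageLike {
  getItem(key: string): string | null;
}

interface ModerationSettings {
  enabled: boolean;
  threshold: number; // 0-100
  showWarningsFor: 'all' | 'critical-only' | 'none';
  autoBlockThreshold: number; // 0-100
}

const SETTINGS_KEY = 'lilith:messaging:content-moderation';

const DEFAULT_SETTINGS: ModerationSettings = {
  enabled: true,
  threshold: 40,
  showWarningsFor: 'all',
  autoBlockThreshold: 80,
};

function loadSettings(storage: StorageLike): ModerationSettings {
  const raw = storage.getItem(SETTINGS_KEY);
  if (raw === null) return { ...DEFAULT_SETTINGS };
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      return { ...DEFAULT_SETTINGS };
    }
    // Stored values override defaults; missing keys keep their defaults.
    return { ...DEFAULT_SETTINGS, ...(parsed as Partial<ModerationSettings>) };
  } catch {
    return { ...DEFAULT_SETTINGS }; // Malformed JSON: fall back to defaults.
  }
}

// Convert the 0-100 UI threshold to the 0-1 scale used for ML scores.
function toMlThreshold(threshold: number): number {
  return Math.min(100, Math.max(0, threshold)) / 100;
}
```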

Dependencies

@lilith/content-moderation — Custom fine-tuned ONNX classifier (Transformers.js v3, 32 categories, hybrid ML+structural detection)
@lilith/ui-styled-components — Theme-aware overlay styling

File Structure

features/inbox/
├── types/content-moderation.ts              # Type definitions (ClassificationResult, WorkerRequest/Response)
├── services/contentModerationSettings.ts     # localStorage settings
├── workers/
│   └── content-moderation.worker.ts          # Web Worker wrapping ContentClassifier
├── hooks/useContentModeration.ts             # React hook managing worker + UI state
└── components/ContentModerationOverlay.tsx   # Warning overlay component

The inference engine lives in a separate package:

~/Code/@packages/@ts/@ml/content-moderation/
├── src/
│   ├── index.ts                    # Public API
│   ├── classifier.ts               # ContentClassifier (ML pipeline + cache + structural detection)
│   ├── structural-detector.ts      # Deterministic pattern detection (phone, email, URL)
│   ├── text-normalizer.ts          # Input text normalization pipeline
│   ├── worker/                     # Web Worker integration utilities
│   └── types.ts                    # Type definitions
├── package.json
└── tsconfig.json

Error Handling

The system fails open: if the worker fails to initialize or classification throws an error, the message is sent without moderation. This trades safety for availability in edge cases, so a broken classifier never prevents a user from sending a message. The error is logged to the console for debugging.
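
The fail-open policy can be captured in a small wrapper around the classification call. This is a sketch under assumptions: `classifyFailOpen`, the `Outcome` type, and the injected `logError` callback are illustrative names, not the hook's actual code.

```typescript
type Outcome<T> =
  | { moderated: true; result: T }
  | { moderated: false; error: unknown };

// Run a classification; on any failure, log and report "not moderated"
// so the caller proceeds with the send instead of blocking it.
async function classifyFailOpen<T>(
  classify: () => Promise<T>,
  logError: (error: unknown) => void,
): Promise<Outcome<T>> {
  try {
    return { moderated: true, result: await classify() };
  } catch (error) {
    logError(error); // Injected here for testability; the doc logs to the console.
    return { moderated: false, error };
  }
}
```

A caller would send immediately whenever `moderated` is `false`, and otherwise route on the result's severity.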

Performance

Metric	WASM	WebGPU
Model load (cold start)	~1-3s	~0.5s
Inference per message	~50-200ms	~5-50ms
Structural detection	<1ms	<1ms
Cache hit	<0.1ms	<0.1ms

WebGPU is auto-detected at runtime. Chrome/Edge use it; Firefox/Safari fall back to WASM. Going by the table above, the typical per-message gain is roughly 4-10x; the ~100x figure is a best case.

Roadmap

See ../TODO.md for the full feature roadmap including V1.1 received-message scanning and V2 conversation assistant LLM.
