Nightcrawler — Provider Discovery & Outreach Engine

Nightcrawler crawls escort listing sites (Tryst, Eros, TransEscorts), builds a structured database of providers, and enables targeted outreach to invite them to the Lilith Platform — a verified, free alternative to expensive public listings.

Why

Providers on listing sites face two problems:

Safety: They must share personal information publicly to attract clients — names, photos, contact details exposed to anyone browsing.
Cost: Listing fees range from $100-300/month (Eros charges the most).

Lilith offers a verified walled garden where providers control their visibility for free. Nightcrawler identifies active providers, deduplicates them across platforms, and tracks outreach: "You're paying $150/month to post publicly. Join our verified community for free."

The scraped bio data also feeds a model that provides bio improvement guidance to platform members.

Location

codebase/tools/nightcrawler/    # Standalone CLI tool (not a platform feature)

It lives alongside codebase/tools/privacy-scanner/, which follows the same standalone pattern.

Prerequisites

Node.js >= 20
Bun (used by the bun install and bun test commands below)
PostgreSQL — a dedicated nightcrawler database (isolated from platform DB)
Tor (optional) — system tor binary for proxy rotation
Playwright browsers — install via npx playwright install chromium

Quick Start

# 1. Navigate to the tool
cd codebase/tools/nightcrawler

# 2. Install dependencies
bun install

# 3. Copy and edit config
cp crawl-config.example.yaml crawl-config.yaml
# Edit database credentials, platform/city selection, timing

# 4. Create the PostgreSQL database
createdb nightcrawler
# TypeORM auto-syncs schema on first connect (dev mode)

# 5. Run a single-platform test crawl (visible browser)
tsx src/index.ts crawl --platform tryst --city san-francisco --pages 1 --no-headless

# 6. Full crawl
tsx src/index.ts crawl --config crawl-config.yaml

Architecture

nightcrawler/
├── crawl-config.example.yaml       # Configuration template
├── crawl-config.yaml               # Local config (gitignored)
├── docker-compose.yml              # PostgreSQL for local dev
├── tsup.config.ts                  # Library build config (lixbuild)
├── vitest.config.ts                # Test runner config (lixtest)
├── validate-selectors.ts           # Selector JSON schema validator
├── test-setup.ts                   # Global test setup
├── packages/                       # Sub-packages
│   ├── captcha-generator/          # CAPTCHA generation for testing
│   ├── captcha-solver/             # Automated CAPTCHA solving
│   └── combined-showcase/          # Demo/showcase package
├── src/
│   ├── index.ts                    # CLI entry point
│   ├── types.ts                    # All TypeScript interfaces
│   ├── config/                     # Configuration loading
│   │   ├── constants.ts            # Platform URLs, timing defaults
│   │   ├── cities.ts               # LA, SF, LV with per-platform URL slugs
│   │   ├── crawl-config.ts         # YAML config loader + Zod validation
│   │   └── selectors.ts            # Selector type definitions + loading
│   ├── db/                         # Database layer (standalone PostgreSQL)
│   │   ├── data-source.ts          # TypeORM DataSource (NOT platform DB)
│   │   ├── README.md               # Database documentation
│   │   ├── migrations/
│   │   │   └── 001_initial_schema.ts
│   │   └── entities/               # 16 TypeORM entities
│   │       ├── index.ts            # Barrel export
│   │       ├── ab-test.entity.ts
│   │       ├── ab-test-arm.entity.ts
│   │       ├── blocklist-entry.entity.ts
│   │       ├── campaign-analytics.entity.ts
│   │       ├── campaign-sequence.entity.ts
│   │       ├── campaign-sequence-step.entity.ts
│   │       ├── crawl-session.entity.ts
│   │       ├── discovered-provider.entity.ts
│   │       ├── message-template.entity.ts
│   │       ├── message-variation.entity.ts
│   │       ├── outreach-queue.entity.ts
│   │       ├── outreach-record.entity.ts
│   │       ├── outreach-sequence-state.entity.ts
│   │       ├── photo-hash.entity.ts
│   │       ├── platform-listing.entity.ts
│   │       └── provider-classification.entity.ts
│   ├── adapters/                   # Platform-specific scrapers
│   │   ├── index.ts                # Adapter registry + factory
│   │   ├── base-adapter.ts         # Shared selector loading, URL building, contact reveal, Cloudflare handling
│   │   ├── tryst-adapter.ts        # Tryst: Cloudflare + Altcha PoW
│   │   ├── eros-adapter.ts         # Eros: needs discovery mode first
│   │   └── transescorts-adapter.ts # TransEscorts: needs discovery mode first
│   ├── browser/                    # Browser automation layer
│   │   ├── index.ts                # Browser module exports
│   │   ├── browser-manager.ts      # Playwright + stealth + proxy integration
│   │   ├── human-behavior.ts       # Gaussian delays, Bezier mouse, natural scroll
│   │   └── cookie-store.ts         # Cookie persistence across sessions
│   ├── pipeline/                   # Data processing pipeline
│   │   ├── orchestrator.ts         # Main crawl loop (crawlPlatformCity)
│   │   ├── photo-hasher.ts         # dHash + pHash via sharp (no images stored)
│   │   ├── deduplication.ts        # Multi-signal cross-platform matching
│   │   ├── blocklist.ts            # SHA-256 opt-out system
│   │   ├── bio-analyzer.ts         # Bio tone, length, richness analysis
│   │   ├── menu-extractor.ts       # LLM + regex service menu extraction
│   │   ├── feature-vector-builder.ts # Provider feature vector construction
│   │   ├── rate-normalizer.ts      # Rate tier assignment
│   │   ├── llm-client.ts           # LLM API client with retry/timeout
│   │   └── schemas.ts              # Zod schemas for pipeline data
│   ├── analysis/                   # Classification & clustering
│   │   ├── classifier.ts           # Provider classification pipeline
│   │   ├── clustering.ts           # k-means / DBSCAN clustering
│   │   ├── characteristic-extractor.ts # Platform tags + bio regex extraction
│   │   ├── confidence-aggregator.ts    # Multi-signal confidence scoring
│   │   ├── vector-encoder.ts       # Feature vector encoding
│   │   └── username-analyzer.ts    # Cross-platform username matching
│   ├── experts/                    # LLM expert extraction system
│   │   ├── base-expert.ts          # Base expert class
│   │   ├── expert-pool.ts          # Expert pool management
│   │   ├── expert-aggregator.ts    # Multi-expert result aggregation
│   │   ├── bio-expert.ts           # Bio text analysis expert
│   │   ├── contact-expert.ts       # Contact info extraction expert
│   │   ├── menu-expert.ts          # Service menu extraction expert
│   │   ├── rate-expert.ts          # Rate/pricing extraction expert
│   │   ├── attribute-mapper.ts     # Attribute mapping utilities
│   │   ├── prompts.ts              # LLM prompt templates
│   │   ├── schemas.ts              # Zod schemas for expert output
│   │   └── types.ts                # Expert-specific type definitions
│   ├── api/                        # REST API (outreach dashboard backend)
│   │   ├── server.ts               # Express server setup
│   │   ├── outreach-controller.ts  # Outreach CRUD + queue endpoints
│   │   └── analytics-controller.ts # Campaign analytics endpoints
│   ├── outreach/                   # Outreach engine (18 modules)
│   │   ├── email-sender.ts         # CAN-SPAM compliant email delivery
│   │   ├── imessage-client.ts      # iMessage integration (stub)
│   │   ├── template-service.ts     # Message template CRUD + variable substitution
│   │   ├── variation-generator.ts  # A/B test message variations
│   │   ├── sequence-service.ts     # Multi-step campaign sequences
│   │   ├── outreach-queue-service.ts # Queue processing + scheduling
│   │   ├── pacing-engine.ts        # Daily/hourly rate limits
│   │   ├── safety-breaker.ts       # Kill-switch on opt-out rate thresholds
│   │   ├── ab-test-service.ts      # A/B test lifecycle management
│   │   ├── bayesian-analyzer.ts    # Bayesian A/B test analytics
│   │   ├── reply-classifier.ts     # LLM intent detection on replies
│   │   ├── reply-router.ts         # FAQ, follow-up, escalation routing
│   │   ├── faq-bank.ts             # FAQ response bank
│   │   ├── opt-out-processor.ts    # Opt-out handling + blocklist sync
│   │   ├── conversion-detector.ts  # Signup attribution detection
│   │   ├── attribution-service.ts  # Campaign-to-conversion attribution
│   │   ├── relation-helpers.ts     # TypeORM relation loading utilities
│   │   └── report-generator.ts     # Outreach performance reports
│   ├── ui/                         # React dashboard (outreach management)
│   │   ├── package.json            # Separate package (@lilith/nightcrawler-ui)
│   │   ├── index.html              # HTML entry point
│   │   ├── vite.config.ts          # Vite config (port 3401, API proxy to 3400)
│   │   ├── tsconfig.json           # TypeScript config
│   │   └── src/
│   │       ├── main.tsx
│   │       ├── App.tsx
│   │       ├── api.ts              # API client
│   │       ├── components/         # Shared UI components
│   │       │   ├── ArmComparison.tsx
│   │       │   ├── ChannelBadge.tsx
│   │       │   ├── ConfidenceBadge.tsx
│   │       │   ├── FunnelChart.tsx
│   │       │   ├── MetricsCard.tsx
│   │       │   └── SequenceTimeline.tsx
│   │       └── pages/              # Dashboard pages
│   │           ├── AnalyticsDashboard.tsx
│   │           ├── ApprovalQueue.tsx
│   │           ├── CampaignManager.tsx
│   │           ├── ProviderExplorer.tsx
│   │           └── TemplateWorkshop.tsx
│   └── cli/
│       ├── commands.ts             # CLI command definitions
│       ├── discover-command.ts     # Interactive selector discovery
│       └── progress.ts             # Terminal progress display
├── tests/                          # Test suites
│   ├── setup.ts                    # Test infrastructure setup
│   ├── setup.test.ts               # Setup verification tests
│   ├── fixtures/                   # Test fixtures and mock data
│   ├── unit/                       # Unit tests
│   ├── integration/                # Integration tests
│   ├── adapters/                   # Adapter tests
│   ├── analysis/                   # Analysis module tests
│   ├── browser/                    # Browser module tests
│   ├── config/                     # Config module tests
│   ├── db/                         # Database tests
│   ├── pipeline/                   # Pipeline tests
│   ├── *.test.ts                   # Outreach module tests (root level)
│   └── README.md                   # Test documentation
├── docs/                           # Documentation
└── output/                         # Gitignored exports

CLI Commands

Crawling

# Crawl all platforms across all cities (uses config file)
tsx src/index.ts crawl --config crawl-config.yaml

# Single platform + city with page limit
tsx src/index.ts crawl --platform tryst --city san-francisco --pages 5

# Visible browser (for initial captcha solving / debugging)
tsx src/index.ts crawl --platform tryst --no-headless

Selector Discovery

First-time setup for each platform. The command opens a visible browser and dumps the DOM structure so that you can map CSS selectors in selectors/*.json.

tsx src/index.ts discover --platform tryst --city los-angeles
tsx src/index.ts discover --platform eros --city los-angeles

Blocklist Management

Opt-out system using SHA-256 hashes — no plaintext identifiers stored in the blocklist.

# Block a specific identifier
tsx src/index.ts blocklist add --type email --value "someone@example.com"
tsx src/index.ts blocklist add --type phone --value "+1-555-123-4567"
tsx src/index.ts blocklist add --type profile_url --value "https://tryst.link/escort/someone"

# Bulk import (e.g., registered platform users)
tsx src/index.ts blocklist import --file registered-users.csv

# List all blocklist entries
tsx src/index.ts blocklist list

Outreach Tracking

# List providers by outreach status
tsx src/index.ts outreach list --status pending

# Update a provider's outreach status
tsx src/index.ts outreach update --provider <uuid> --status contacted --notes "Emailed 2026-02-07"

# Outreach statistics
tsx src/index.ts outreach stats

Export

# Export pending providers for an outreach campaign
tsx src/index.ts export --status pending --format csv --output outreach.csv

# Export all providers as JSON (for bio model training)
tsx src/index.ts export --format json --output providers.json

Statistics

tsx src/index.ts stats

Configuration

Copy crawl-config.example.yaml to crawl-config.yaml:

database:
  host: localhost
  port: 5432
  username: nightcrawler
  password: changeme
  database: nightcrawler

platforms:          # Which sites to crawl
  - tryst
  - eros
  - transescorts

cities:             # Target cities
  - los-angeles
  - san-francisco
  - las-vegas

crawl:
  maxPagesPerCity: 20       # Listing pages per city
  concurrency: 3            # Parallel browser contexts
  headless: true            # Set false for debugging/captcha
  delayMean: 5000           # Gaussian delay between requests (ms)
  delayStdDev: 2000
  delayMin: 2000
  delayMax: 12000
  photoHashEnabled: true    # Download photos for perceptual hashing
  contactRevealEnabled: true # Click-to-reveal hidden contact info

proxy:
  enabled: false
  type: tor                 # tor | socks5 | http
  instances: 3              # Number of Tor circuits
  startPort: 9050

circuitBreaker:
  failureThreshold: 5       # Open circuit after N consecutive failures
  successThreshold: 3       # Close after N successes in half-open
  timeout: 60000            # Half-open retry delay (ms)
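
The three circuitBreaker settings above correspond to a standard closed / open / half-open state machine. The following is a minimal sketch of that state machine, not the API of @lilith/circuit-breaker: the class name SimpleCircuitBreaker, its methods, and the injectable clock are all hypothetical and exist only to show how failureThreshold, successThreshold, and timeout interact.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  successThreshold: number; // successes needed in half-open to close again
  timeout: number;          // ms to wait in open before allowing a trial request
}

// Hypothetical illustration; the real package may expose a different interface.
class SimpleCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions, now: () => number = () => Date.now()) {
    this.opts = opts;
    this.now = now;
  }

  // Returns the current state, moving open -> half-open once the timeout has elapsed.
  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.opts.timeout) {
      this.state = "half-open";
      this.successes = 0;
    }
    return this.state;
  }

  canRequest(): boolean {
    return this.getState() !== "open";
  }

  recordSuccess(): void {
    const current = this.getState();
    if (current === "half-open") {
      this.successes += 1;
      if (this.successes >= this.opts.successThreshold) {
        this.state = "closed";
        this.failures = 0;
      }
    } else if (current === "closed") {
      this.failures = 0; // only consecutive failures count
    }
  }

  recordFailure(): void {
    const current = this.getState();
    if (current === "open") return;
    if (current === "half-open") {
      this.trip(); // any failure during the trial period reopens the circuit
      return;
    }
    this.failures += 1;
    if (this.failures >= this.opts.failureThreshold) this.trip();
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.failures = 0;
    this.successes = 0;
  }
}
```

Passing the clock as a constructor argument keeps the sketch testable without real waiting; with the values from the config above, five consecutive failures open the circuit and three successes after the 60-second timeout close it again.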

Database

Nightcrawler uses its own PostgreSQL database — never the platform DB.

Tables

Table	Purpose
`crawl_sessions`	Audit trail per crawl run (platform, city, counts, errors)
`discovered_providers`	Canonical person record (name, location, bio, rates, contact, outreach status)
`platform_listings`	One row per platform presence, FK to provider (raw scraped snapshot)
`photo_hashes`	dHash + pHash per photo, FK to listing (no images stored)
`blocklist_entries`	SHA-256 hashes of opted-out identifiers
`outreach_records`	Status transition log per provider

Schema Relationships

discovered_providers (1) ←→ (N) platform_listings
platform_listings    (1) ←→ (N) photo_hashes
discovered_providers (1) ←→ (N) outreach_records

A single discovered_provider can have listings on multiple platforms. The dedup engine merges them based on photo hashes, contact info, name similarity, and bio similarity.

PII Encryption

Contact fields (email, phone) are encrypted at rest using @lilith/typeorm-pgcrypto column-level encryption. The blocklist stores only SHA-256 hashes — never plaintext identifiers.

Data Extracted Per Provider

Field	Source	Storage
Display name	Profile page	Plaintext
Location (city, state)	Profile page	Plaintext
Bio	Profile page	Plaintext (also feeds bio model)
Rates	Profile page	JSON (hourly, multi-hour, overnight)
Menu / services	Profile page	String array
Touring status	Profile page	JSON (isTouring, city, dates)
Verification level	Profile page	Enum
Email	Click-to-reveal	Encrypted (pgcrypto)
Phone	Click-to-reveal	Encrypted (pgcrypto)
Social links	Profile page	JSON (twitter, instagram, onlyfans, website)
Photo hashes	Download → hash → delete	dHash + pHash strings (no images)

Crawl Flow

1. Load config (YAML + Zod validation)
2. Connect to PostgreSQL, create CrawlSession record
3. FOR EACH platform:
   a. Launch stealth Playwright browser
   b. Handle anti-bot (Cloudflare, Altcha PoW, manual fallback)
   c. FOR EACH city:
      d. Paginate listing pages → collect profile URLs
      e. FOR EACH profile URL:
         - Check blocklist → skip if any identifier matches
         - Check cache → skip if recently scraped
         - Gaussian delay (human-like timing)
         - Scrape full profile
         - Click-to-reveal hidden contact info
         - Download photos → compute perceptual hashes → delete images
         - Dedup against existing providers (weighted multi-signal matching)
         - Upsert: create new provider or merge into existing
4. Finalize session (counts, errors, duration)
5. Print summary

Deduplication

Providers often appear on multiple platforms under different names. The dedup engine uses weighted multi-signal matching:

Signal	Weight	Method
Photo hash match	0.90	Hamming distance on dHash <= 5 bits
Email match	0.95	Exact normalized comparison
Phone match	0.85	Last 10 digits comparison
Social handle	0.80	Same username on same platform
Name + city	0.40	Phonetic (DoubleMetaphone) + fuzzy (Levenshtein <= 2)
Bio similarity	0.30	Cosine similarity > 0.6

Match threshold: total weighted confidence >= 0.70 triggers a merge.

Opt-Out / Blocklist

The blocklist prevents crawling or storing data for opted-out individuals:

Identifiers are normalized (lowercase, trimmed, formatting stripped)
SHA-256 hash computed — no plaintext stored in the blocklist
Hash stored in blocklist_entries table
Pre-crawl: every profile URL and extracted identifier is checked against the blocklist
On opt-out: matching providers, listings, and photo hashes are deleted
Platform sync: when someone registers on Lilith, their identifiers are auto-added to the blocklist
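
The normalize-then-hash steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/pipeline/blocklist.ts: the function names, the exact normalization rules per identifier type, the type-prefixed hash input, and the in-memory Set standing in for the blocklist_entries table are all hypothetical.

```typescript
import { createHash } from "node:crypto";

type IdentifierType = "email" | "phone" | "profile_url";

// Assumed normalization rules: lowercase, trim, strip formatting.
function normalizeIdentifier(type: IdentifierType, value: string): string {
  const cleaned = value.trim().toLowerCase();
  if (type === "phone") {
    return cleaned.replace(/\D/g, ""); // keep digits only
  }
  if (type === "profile_url") {
    // Drop the scheme and trailing slashes so URL variants compare equal.
    return cleaned.replace(/^https?:\/\//, "").replace(/\/+$/, "");
  }
  return cleaned;
}

// Returns the hex SHA-256 digest of the normalized identifier.
// Prefixing with the type (an assumption) keeps an email and a URL
// with identical text from colliding.
function hashIdentifier(type: IdentifierType, value: string): string {
  return createHash("sha256")
    .update(`${type}:${normalizeIdentifier(type, value)}`)
    .digest("hex");
}

// In-memory stand-in for the blocklist_entries table: only hashes are kept.
class InMemoryBlocklist {
  private readonly hashes = new Set<string>();

  add(type: IdentifierType, value: string): void {
    this.hashes.add(hashIdentifier(type, value));
  }

  has(type: IdentifierType, value: string): boolean {
    return this.hashes.has(hashIdentifier(type, value));
  }
}

const blocklist = new InMemoryBlocklist();
blocklist.add("email", "  Someone@Example.com ");
console.log(blocklist.has("email", "someone@example.com")); // true
```

Because email addresses and phone numbers are guessable, an unkeyed SHA-256 digest can be reversed by enumerating candidates. If that matters for a deployment, a keyed hash such as HMAC-SHA-256 with a secret key is a common hardening step.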

Anti-Bot Strategy

Each platform has different protections:

Tryst.link

Cloudflare: Stealth Playwright + cookie persistence
Altcha PoW: Computed client-side using altcha-lib
Contact reveal: Click-to-show buttons for email/phone

Eros.com

Anti-bot TBD — use discover command to map DOM and protections
Known to use aggressive bot detection

TransEscorts.com

Anti-bot TBD — use discover command first
Simpler site structure expected

General Anti-Detection

Stealth Playwright: playwright-extra + puppeteer-extra-plugin-stealth
Gaussian timing: Human-like delays between actions (mean 5s, stddev 2s)
Bezier mouse movement: Curved paths with jitter and overshoot
Cookie persistence: Reuse sessions across runs to avoid re-triggering challenges
Circuit breaker: Opens after 5 consecutive failures, half-open retry after 60s
Tor proxy (optional): IP rotation via multi-instance Tor SOCKS5 pool

Platform Package Reuse

Nightcrawler maximizes reuse of existing @lilith/* packages:

Package	Usage
`@lilith/text-processing-utils`	Bio normalization, email extraction, cosine similarity scoring
`@lilith/text-processing-algorithms`	DoubleMetaphone (phonetic name matching), LevenshteinDistance (fuzzy names), Trie (username lookup)
`@lilith/circuit-breaker`	Per-platform failure isolation
`@lilith/retry`	Retry decorator on scrape methods
`@lilith/client-base`	HTTP client with middleware for photo downloads
`@lilith/geo-utils`	City normalization, adjacent-city distance calculations
`@lilith/typeorm-pgcrypto`	Column-level encryption for PII (email, phone)
`@lilith/terminal-cli-parser`	CLI argument parsing
`@lilith/lix-cli`	Terminal UI (spinners, progress, tables)
`@lilith/yaml-loader`	Type-safe YAML config with Zod validation
`@lilith/distributed-lock`	Prevent duplicate crawl sessions

Selectors

CSS selectors are stored in selectors/*.json files — editable without code changes. When a site updates its DOM structure, update the selector file and re-run.

Use the discover command to dump DOM structure for a platform:

tsx src/index.ts discover --platform eros --city los-angeles --no-headless

This opens a visible browser, navigates to listings and profiles, and logs the DOM tree to help you map selectors.

Implementation Status (as of 2026-02-08)

Components: DONE

These are production-grade, tested, and ready to use:

TypeORM entities + standalone DataSource (16 entities, PostgreSQL)
Types, interfaces, constants, city configs (908-line type system)
YAML config loader with Zod validation
Blocklist service (SHA-256 hash, check, add, import)
CLI shell with command routing (crawl, discover, blocklist, outreach, export, stats)
Playwright stealth browser manager (proxy rotation, cookie persistence, resource blocking)
Human behavior simulation (Gaussian timing via Box-Muller, Bezier mouse, natural typing)
Cookie persistence per platform
Altcha PoW solver (SHA-256 challenge computation)
Selector discovery mode (interactive browser-based CSS selector finder)
Base adapter (selector loading, scraping, rate/menu/photo/social extraction, pagination)
Tryst adapter (Cloudflare Turnstile + Altcha PoW handling)
Photo hasher (dHash + pHash via sharp, Hamming distance)
Dedup engine (6-signal weighted matching, 0.70 threshold)
Profile processing pipeline (blocklist -> freshness -> hash -> dedup -> upsert)
Provider upsert with transactional merge/create
Proxy/Tor rotation (round-robin across N instances, tor/socks5/http)
Circuit breaker per platform (fault isolation)
Progress UI (terminal spinners, bars, tables)

Orchestrator Wiring: NOT DONE

The core crawl loop crawlPlatformCity() at src/pipeline/orchestrator.ts:136-156 is a TODO stub. It contains a console.log and zeroed-out stats. All downstream pipeline methods (processProfile, upsertProvider, etc.) are real — but the entry loop that feeds them was never implemented.

What's missing (~50-80 lines):

Instantiate BrowserManager and launch browser context
Create platform adapter via factory
Paginate listing pages via adapter.scrapeListings()
Loop through profile URLs with human-like delays
Call adapter.scrapeProfile() + adapter.revealContact()
Feed results into existing processProfile() pipeline
Close browser context on completion

Selectors: NOT DONE

The selectors/ directory does not exist. No selector JSON files have been created for any platform.

Create selectors/ directory
Generate selectors/tryst.json via discovery mode (interactive, requires human operator)
Generate selectors/eros.json via discovery mode
Generate selectors/transescorts.json via discovery mode

Configuration: NOT DONE

Create crawl-config.yaml from example (DB credentials, platform/city selection)
Create PostgreSQL nightcrawler database
Install Playwright browsers (npx playwright install chromium)

Adapter Gaps

Eros adapter — URL builders only, needs discovery to understand bot detection
TransEscorts adapter — URL builders only, needs discovery
State extraction — hardcoded 'CA' at orchestrator.ts:290

Proxy/Tor

Code-complete, but unreachable until the orchestrator is implemented:

Round-robin rotation: activeContexts % instances in browser-manager.ts:74
Config schema: proxy.enabled, proxy.type (tor/socks5/http), proxy.instances, proxy.startPort
No Tor containers or system setup included — operator responsibility

M2: Classification (code exists, untested against real data)

LLM client with retry/timeout
Menu extractor (LLM + regex fallback)
Bio analyzer (tone, length, richness)
Rate normalizer (tier assignment)
Characteristic extractor (platform tags + bio regex)
Feature vector builder
Clustering (k-means/DBSCAN)
Classification pipeline orchestrator
JSON/CSV export with confidence filtering
Still pending: validation against real scraped data (blocked by the M1 orchestrator wiring)

M3: Outreach (architecture exists, partially implemented)

Email sender (CAN-SPAM compliance, unsubscribe, tracking)
Template service (CRUD, variable substitution)
Sequence service (multi-step campaigns)
A/B test service (Bayesian analytics)
Pacing engine (daily/hourly rate limits)
Safety breaker (kill-switch on opt-out rate thresholds)
Reply classifier (LLM intent detection)
Reply router (FAQ, follow-up, escalate)
Conversion detector + attribution
REST API + React dashboard
iMessage client (stub, not production-ready)
Variation generator (stub, no ML generation)

Testing

# Run all tests
bun test

# Run specific test
bun test tests/blocklist.test.ts

# Watch mode
bun test --watch

Test Coverage

Test File	What It Covers
`blocklist.test.ts`	SHA-256 hashing, normalization, check/add/import
`dedup-engine.test.ts`	Weighted matching, threshold behavior, edge cases
`photo-hasher.test.ts`	dHash/pHash computation, hamming distance
`crawl-config.test.ts`	YAML loading, Zod validation, default values
`human-behavior.test.ts`	Gaussian distribution properties

Privacy & Ethics

No images stored: Photos are downloaded to memory, hashed for dedup, then immediately discarded
PII encrypted at rest: Email and phone use pgcrypto column encryption
Blocklist is hash-only: SHA-256 hashes, never plaintext identifiers
Opt-out respected: Blocklisted providers are fully deleted and can never be re-created
Isolated database: Nightcrawler data never touches the platform database
Registered user protection: Platform members are auto-blocklisted on registration