
Nightcrawler — Provider Discovery & Outreach Engine

Nightcrawler crawls escort listing sites (Tryst, Eros, TransEscorts), builds a structured database of providers, and enables targeted outreach to invite them to the Lilith Platform — a verified, free alternative to expensive public listings.

Why

Providers on listing sites face two problems:

  1. Safety: They must share personal information publicly to attract clients — names, photos, contact details exposed to anyone browsing.
  2. Cost: Listing fees range from $100-300/month (Eros charges the most).

Lilith offers a verified walled garden where providers control their visibility for free. Nightcrawler identifies active providers, deduplicates them across platforms, and tracks outreach built around a simple pitch: "You're paying $150/month to post publicly. Join our verified community for free."

The scraped bio data also feeds a model that provides bio improvement guidance to platform members.

Location

codebase/tools/nightcrawler/    # Standalone CLI tool (not a platform feature)

Nightcrawler is a CLI tool, not a backend or frontend feature. It lives alongside codebase/tools/privacy-scanner/, which follows the same standalone pattern.

Prerequisites

  • Node.js >= 20
  • Bun and tsx on your PATH (the commands below call bun install, bun test, and tsx directly)
  • PostgreSQL — a dedicated nightcrawler database (isolated from platform DB)
  • Tor (optional) — system tor binary for proxy rotation
  • Playwright browsers — install via npx playwright install chromium

Quick Start

# 1. Navigate to the tool
cd codebase/tools/nightcrawler

# 2. Install dependencies
bun install

# 3. Copy and edit config
cp crawl-config.example.yaml crawl-config.yaml
# Edit database credentials, platform/city selection, timing

# 4. Create the PostgreSQL database
createdb nightcrawler
# TypeORM auto-syncs schema on first connect (dev mode)

# 5. Run a single-platform test crawl (visible browser)
tsx src/index.ts crawl --platform tryst --city san-francisco --pages 1 --no-headless

# 6. Full crawl
tsx src/index.ts crawl --config crawl-config.yaml

Architecture

nightcrawler/
├── crawl-config.example.yaml       # Configuration template
├── crawl-config.yaml               # Local config (gitignored)
├── docker-compose.yml              # PostgreSQL for local dev
├── tsup.config.ts                  # Library build config (lixbuild)
├── vitest.config.ts                # Test runner config (lixtest)
├── validate-selectors.ts           # Selector JSON schema validator
├── test-setup.ts                   # Global test setup
├── packages/                       # Sub-packages
│   ├── captcha-generator/          # CAPTCHA generation for testing
│   ├── captcha-solver/             # Automated CAPTCHA solving
│   └── combined-showcase/          # Demo/showcase package
├── src/
│   ├── index.ts                    # CLI entry point
│   ├── types.ts                    # All TypeScript interfaces
│   ├── config/                     # Configuration loading
│   │   ├── constants.ts            # Platform URLs, timing defaults
│   │   ├── cities.ts               # LA, SF, LV with per-platform URL slugs
│   │   ├── crawl-config.ts         # YAML config loader + Zod validation
│   │   └── selectors.ts            # Selector type definitions + loading
│   ├── db/                         # Database layer (standalone PostgreSQL)
│   │   ├── data-source.ts          # TypeORM DataSource (NOT platform DB)
│   │   ├── README.md               # Database documentation
│   │   ├── migrations/
│   │   │   └── 001_initial_schema.ts
│   │   └── entities/               # 16 TypeORM entities
│   │       ├── index.ts            # Barrel export
│   │       ├── ab-test.entity.ts
│   │       ├── ab-test-arm.entity.ts
│   │       ├── blocklist-entry.entity.ts
│   │       ├── campaign-analytics.entity.ts
│   │       ├── campaign-sequence.entity.ts
│   │       ├── campaign-sequence-step.entity.ts
│   │       ├── crawl-session.entity.ts
│   │       ├── discovered-provider.entity.ts
│   │       ├── message-template.entity.ts
│   │       ├── message-variation.entity.ts
│   │       ├── outreach-queue.entity.ts
│   │       ├── outreach-record.entity.ts
│   │       ├── outreach-sequence-state.entity.ts
│   │       ├── photo-hash.entity.ts
│   │       ├── platform-listing.entity.ts
│   │       └── provider-classification.entity.ts
│   ├── adapters/                   # Platform-specific scrapers
│   │   ├── index.ts                # Adapter registry + factory
│   │   ├── base-adapter.ts         # Shared selector loading, URL building, contact reveal, Cloudflare handling
│   │   ├── tryst-adapter.ts        # Tryst: Cloudflare + Altcha PoW
│   │   ├── eros-adapter.ts         # Eros: needs discovery mode first
│   │   └── transescorts-adapter.ts # TransEscorts: needs discovery mode first
│   ├── browser/                    # Browser automation layer
│   │   ├── index.ts                # Browser module exports
│   │   ├── browser-manager.ts      # Playwright + stealth + proxy integration
│   │   ├── human-behavior.ts       # Gaussian delays, Bezier mouse, natural scroll
│   │   └── cookie-store.ts         # Cookie persistence across sessions
│   ├── pipeline/                   # Data processing pipeline
│   │   ├── orchestrator.ts         # Main crawl loop (crawlPlatformCity)
│   │   ├── photo-hasher.ts         # dHash + pHash via sharp (no images stored)
│   │   ├── deduplication.ts        # Multi-signal cross-platform matching
│   │   ├── blocklist.ts            # SHA-256 opt-out system
│   │   ├── bio-analyzer.ts         # Bio tone, length, richness analysis
│   │   ├── menu-extractor.ts       # LLM + regex service menu extraction
│   │   ├── feature-vector-builder.ts # Provider feature vector construction
│   │   ├── rate-normalizer.ts      # Rate tier assignment
│   │   ├── llm-client.ts           # LLM API client with retry/timeout
│   │   └── schemas.ts              # Zod schemas for pipeline data
│   ├── analysis/                   # Classification & clustering
│   │   ├── classifier.ts           # Provider classification pipeline
│   │   ├── clustering.ts           # k-means / DBSCAN clustering
│   │   ├── characteristic-extractor.ts # Platform tags + bio regex extraction
│   │   ├── confidence-aggregator.ts    # Multi-signal confidence scoring
│   │   ├── vector-encoder.ts       # Feature vector encoding
│   │   └── username-analyzer.ts    # Cross-platform username matching
│   ├── experts/                    # LLM expert extraction system
│   │   ├── base-expert.ts          # Base expert class
│   │   ├── expert-pool.ts          # Expert pool management
│   │   ├── expert-aggregator.ts    # Multi-expert result aggregation
│   │   ├── bio-expert.ts           # Bio text analysis expert
│   │   ├── contact-expert.ts       # Contact info extraction expert
│   │   ├── menu-expert.ts          # Service menu extraction expert
│   │   ├── rate-expert.ts          # Rate/pricing extraction expert
│   │   ├── attribute-mapper.ts     # Attribute mapping utilities
│   │   ├── prompts.ts              # LLM prompt templates
│   │   ├── schemas.ts              # Zod schemas for expert output
│   │   └── types.ts                # Expert-specific type definitions
│   ├── api/                        # REST API (outreach dashboard backend)
│   │   ├── server.ts               # Express server setup
│   │   ├── outreach-controller.ts  # Outreach CRUD + queue endpoints
│   │   └── analytics-controller.ts # Campaign analytics endpoints
│   ├── outreach/                   # Outreach engine (18 modules)
│   │   ├── email-sender.ts         # CAN-SPAM compliant email delivery
│   │   ├── imessage-client.ts      # iMessage integration (stub)
│   │   ├── template-service.ts     # Message template CRUD + variable substitution
│   │   ├── variation-generator.ts  # A/B test message variations
│   │   ├── sequence-service.ts     # Multi-step campaign sequences
│   │   ├── outreach-queue-service.ts # Queue processing + scheduling
│   │   ├── pacing-engine.ts        # Daily/hourly rate limits
│   │   ├── safety-breaker.ts       # Kill-switch on opt-out rate thresholds
│   │   ├── ab-test-service.ts      # A/B test lifecycle management
│   │   ├── bayesian-analyzer.ts    # Bayesian A/B test analytics
│   │   ├── reply-classifier.ts     # LLM intent detection on replies
│   │   ├── reply-router.ts         # FAQ, follow-up, escalation routing
│   │   ├── faq-bank.ts             # FAQ response bank
│   │   ├── opt-out-processor.ts    # Opt-out handling + blocklist sync
│   │   ├── conversion-detector.ts  # Signup attribution detection
│   │   ├── attribution-service.ts  # Campaign-to-conversion attribution
│   │   ├── relation-helpers.ts     # TypeORM relation loading utilities
│   │   └── report-generator.ts     # Outreach performance reports
│   ├── ui/                         # React dashboard (outreach management)
│   │   ├── package.json            # Separate package (@lilith/nightcrawler-ui)
│   │   ├── index.html              # HTML entry point
│   │   ├── vite.config.ts          # Vite config (port 3401, API proxy to 3400)
│   │   ├── tsconfig.json           # TypeScript config
│   │   └── src/
│   │       ├── main.tsx
│   │       ├── App.tsx
│   │       ├── api.ts              # API client
│   │       ├── components/         # Shared UI components
│   │       │   ├── ArmComparison.tsx
│   │       │   ├── ChannelBadge.tsx
│   │       │   ├── ConfidenceBadge.tsx
│   │       │   ├── FunnelChart.tsx
│   │       │   ├── MetricsCard.tsx
│   │       │   └── SequenceTimeline.tsx
│   │       └── pages/              # Dashboard pages
│   │           ├── AnalyticsDashboard.tsx
│   │           ├── ApprovalQueue.tsx
│   │           ├── CampaignManager.tsx
│   │           ├── ProviderExplorer.tsx
│   │           └── TemplateWorkshop.tsx
│   └── cli/
│       ├── commands.ts             # CLI command definitions
│       ├── discover-command.ts     # Interactive selector discovery
│       └── progress.ts             # Terminal progress display
├── tests/                          # Test suites
│   ├── setup.ts                    # Test infrastructure setup
│   ├── setup.test.ts               # Setup verification tests
│   ├── fixtures/                   # Test fixtures and mock data
│   ├── unit/                       # Unit tests
│   ├── integration/                # Integration tests
│   ├── adapters/                   # Adapter tests
│   ├── analysis/                   # Analysis module tests
│   ├── browser/                    # Browser module tests
│   ├── config/                     # Config module tests
│   ├── db/                         # Database tests
│   ├── pipeline/                   # Pipeline tests
│   ├── *.test.ts                   # Outreach module tests (root level)
│   └── README.md                   # Test documentation
├── docs/                           # Documentation
└── output/                         # Gitignored exports

CLI Commands

Crawling

# Crawl all platforms across all cities (uses config file)
tsx src/index.ts crawl --config crawl-config.yaml

# Single platform + city with page limit
tsx src/index.ts crawl --platform tryst --city san-francisco --pages 5

# Visible browser (for initial captcha solving / debugging)
tsx src/index.ts crawl --platform tryst --no-headless

Selector Discovery

First-time setup for each platform. The command opens a visible browser and dumps the DOM structure so you can map CSS selectors in selectors/*.json.

tsx src/index.ts discover --platform tryst --city los-angeles
tsx src/index.ts discover --platform eros --city los-angeles

Blocklist Management

Opt-out system using SHA-256 hashes — no plaintext identifiers stored in the blocklist.

# Block a specific identifier
tsx src/index.ts blocklist add --type email --value "someone@example.com"
tsx src/index.ts blocklist add --type phone --value "+1-555-123-4567"
tsx src/index.ts blocklist add --type profile_url --value "https://tryst.link/escort/someone"

# Bulk import (e.g., registered platform users)
tsx src/index.ts blocklist import --file registered-users.csv

# List all blocklist entries
tsx src/index.ts blocklist list

Outreach Tracking

# List providers by outreach status
tsx src/index.ts outreach list --status pending

# Update a provider's outreach status
tsx src/index.ts outreach update --provider <uuid> --status contacted --notes "Emailed 2026-02-07"

# Outreach statistics
tsx src/index.ts outreach stats

Export

# Export pending providers for an outreach campaign
tsx src/index.ts export --status pending --format csv --output outreach.csv

# Export all providers as JSON (for bio model training)
tsx src/index.ts export --format json --output providers.json

Statistics

tsx src/index.ts stats

Configuration

Copy crawl-config.example.yaml to crawl-config.yaml:

database:
  host: localhost
  port: 5432
  username: nightcrawler
  password: changeme
  database: nightcrawler

platforms:          # Which sites to crawl
  - tryst
  - eros
  - transescorts

cities:             # Target cities
  - los-angeles
  - san-francisco
  - las-vegas

crawl:
  maxPagesPerCity: 20       # Listing pages per city
  concurrency: 3            # Parallel browser contexts
  headless: true            # Set false for debugging/captcha
  delayMean: 5000           # Gaussian delay between requests (ms)
  delayStdDev: 2000
  delayMin: 2000
  delayMax: 12000
  photoHashEnabled: true    # Download photos for perceptual hashing
  contactRevealEnabled: true # Click-to-reveal hidden contact info

proxy:
  enabled: false
  type: tor                 # tor | socks5 | http
  instances: 3              # Number of Tor circuits
  startPort: 9050

circuitBreaker:
  failureThreshold: 5       # Open circuit after N consecutive failures
  successThreshold: 3       # Close after N successes in half-open
  timeout: 60000            # Half-open retry delay (ms)
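
The three circuitBreaker settings describe a small state machine (closed, open, half-open). The sketch below is one way to read them and is not the API of @lilith/circuit-breaker; the class name, method names, and injected clock are hypothetical, and reopening on a half-open failure is an assumption based on the usual pattern.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  successThreshold: number; // half-open successes required to close again
  timeout: number;          // ms to stay open before allowing a probe request
}

// Hypothetical illustration of the config semantics, not the real package.
class SketchCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions, now: () => number = () => Date.now()) {
    this.opts = opts;
    this.now = now;
  }

  // Returns the current state, promoting open to half-open after the timeout.
  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.opts.timeout) {
      this.state = "half-open";
      this.successes = 0;
    }
    return this.state;
  }

  // Requests are rejected only while the circuit is fully open.
  canRequest(): boolean {
    return this.getState() !== "open";
  }

  recordSuccess(): void {
    const state = this.getState();
    if (state === "half-open") {
      this.successes += 1;
      if (this.successes >= this.opts.successThreshold) {
        this.state = "closed";
        this.failures = 0;
      }
    } else if (state === "closed") {
      this.failures = 0; // a success breaks the run of consecutive failures
    }
  }

  recordFailure(): void {
    const state = this.getState();
    if (state === "half-open") {
      this.trip(); // assumption: any failure while probing reopens the circuit
    } else if (state === "closed") {
      this.failures += 1;
      if (this.failures >= this.opts.failureThreshold) this.trip();
    }
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.failures = 0;
    this.successes = 0;
  }
}
```

With the values shown in the config (5, 3, 60000), five consecutive failures open the circuit, requests are rejected for 60 seconds, and three successful probes are then needed before normal operation resumes.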

Database

Nightcrawler uses its own PostgreSQL database — never the platform DB.

Tables

| Table | Purpose |
| --- | --- |
| crawl_sessions | Audit trail per crawl run (platform, city, counts, errors) |
| discovered_providers | Canonical person record (name, location, bio, rates, contact, outreach status) |
| platform_listings | One row per platform presence, FK to provider (raw scraped snapshot) |
| photo_hashes | dHash + pHash per photo, FK to listing (no images stored) |
| blocklist_entries | SHA-256 hashes of opted-out identifiers |
| outreach_records | Status transition log per provider |

Schema Relationships

discovered_providers (1) ←→ (N) platform_listings
platform_listings    (1) ←→ (N) photo_hashes
discovered_providers (1) ←→ (N) outreach_records

A single discovered_provider can have listings on multiple platforms. The dedup engine merges them based on photo hashes, contact info, name similarity, and bio similarity.

PII Encryption

Contact fields (email, phone) are encrypted at rest using @lilith/typeorm-pgcrypto column-level encryption. The blocklist stores only SHA-256 hashes — never plaintext identifiers.

Data Extracted Per Provider

| Field | Source | Storage |
| --- | --- | --- |
| Display name | Profile page | Plaintext |
| Location (city, state) | Profile page | Plaintext |
| Bio | Profile page | Plaintext (also feeds bio model) |
| Rates | Profile page | JSON (hourly, multi-hour, overnight) |
| Menu / services | Profile page | String array |
| Touring status | Profile page | JSON (isTouring, city, dates) |
| Verification level | Profile page | Enum |
| Email | Click-to-reveal | Encrypted (pgcrypto) |
| Phone | Click-to-reveal | Encrypted (pgcrypto) |
| Social links | Profile page | JSON (twitter, instagram, onlyfans, website) |
| Photo hashes | Download → hash → delete | dHash + pHash strings (no images) |

Crawl Flow

1. Load config (YAML + Zod validation)
2. Connect to PostgreSQL, create CrawlSession record
3. FOR EACH platform:
   a. Launch stealth Playwright browser
   b. Handle anti-bot (Cloudflare, Altcha PoW, manual fallback)
   c. FOR EACH city:
      d. Paginate listing pages → collect profile URLs
      e. FOR EACH profile URL:
         - Check blocklist → skip if any identifier matches
         - Check cache → skip if recently scraped
         - Gaussian delay (human-like timing)
         - Scrape full profile
         - Click-to-reveal hidden contact info
         - Download photos → compute perceptual hashes → delete images
         - Dedup against existing providers (weighted multi-signal matching)
         - Upsert: create new provider or merge into existing
4. Finalize session (counts, errors, duration)
5. Print summary

Deduplication

Providers often appear on multiple platforms under different names. The dedup engine uses weighted multi-signal matching:

| Signal | Weight | Method |
| --- | --- | --- |
| Photo hash match | 0.90 | Hamming distance on dHash <= 5 bits |
| Email match | 0.95 | Exact normalized comparison |
| Phone match | 0.85 | Last 10 digits comparison |
| Social handle | 0.80 | Same username on same platform |
| Name + city | 0.40 | Phonetic (DoubleMetaphone) + fuzzy (Levenshtein <= 2) |
| Bio similarity | 0.30 | Cosine similarity > 0.6 |

Match threshold: total weighted confidence >= 0.70 triggers a merge.

Opt-Out / Blocklist

The blocklist prevents crawling or storing data for opted-out individuals:

  1. Identifiers are normalized (lowercase, trimmed, formatting stripped)
  2. SHA-256 hash computed — no plaintext stored in the blocklist
  3. Hash stored in blocklist_entries table
  4. Pre-crawl: every profile URL and extracted identifier is checked against the blocklist
  5. On opt-out: matching providers, listings, and photo hashes are deleted
  6. Platform sync: when someone registers on Lilith, their identifiers are auto-added to the blocklist
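
Steps 1 to 3 above can be sketched with Node's built-in crypto module. This is a minimal illustration under stated assumptions, not the blocklist service's actual code: the function names, the exact normalization rules, and the choice to prefix the identifier type before hashing are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Identifier types taken from the `blocklist add --type` examples.
type IdentifierType = "email" | "phone" | "profile_url";

// Normalize so that equivalent spellings hash identically (rules are assumptions).
function normalizeIdentifier(type: IdentifierType, value: string): string {
  const trimmed = value.trim().toLowerCase();
  switch (type) {
    case "phone":
      // Keep digits only: "+1-555-123-4567" becomes "15551234567".
      return trimmed.replace(/\D/g, "");
    case "profile_url":
      // Drop trailing slashes so both URL spellings match.
      return trimmed.replace(/\/+$/, "");
    case "email":
      return trimmed;
  }
}

// SHA-256 over "type:normalized-value"; only this hex digest would be stored.
function hashIdentifier(type: IdentifierType, value: string): string {
  const normalized = normalizeIdentifier(type, value);
  return createHash("sha256").update(`${type}:${normalized}`).digest("hex");
}

// Membership check against a set of stored hashes (stands in for blocklist_entries).
function isBlocked(blocklist: Set<string>, type: IdentifierType, value: string): boolean {
  return blocklist.has(hashIdentifier(type, value));
}

// Usage: an opt-out recorded in one spelling blocks every equivalent spelling.
const blocklist = new Set<string>([hashIdentifier("email", "someone@example.com")]);
console.log(isBlocked(blocklist, "email", "  Someone@Example.com ")); // true
console.log(isBlocked(blocklist, "email", "other@example.com"));      // false
```

One design point worth weighing: an unsalted SHA-256 of a low-entropy identifier such as a phone number can be recovered by enumerating candidates, so a keyed hash (for example createHmac from the same module, with a secret key) would give stronger protection for the same lookup behavior.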

Anti-Bot Strategy

Each platform has different protections:

Tryst.link

  • Cloudflare: Stealth Playwright + cookie persistence
  • Altcha PoW: Computed client-side using altcha-lib
  • Contact reveal: Click-to-show buttons for email/phone

Eros.com

  • Anti-bot TBD — use discover command to map DOM and protections
  • Known to use aggressive bot detection

TransEscorts.com

  • Anti-bot TBD — use discover command first
  • Simpler site structure expected

General Anti-Detection

  • Stealth Playwright: playwright-extra + puppeteer-extra-plugin-stealth
  • Gaussian timing: Human-like delays between actions (mean 5s, stddev 2s)
  • Bezier mouse movement: Curved paths with jitter and overshoot
  • Cookie persistence: Reuse sessions across runs to avoid re-triggering challenges
  • Circuit breaker: Opens after 5 consecutive failures, half-open retry after 60s
  • Tor proxy (optional): IP rotation via multi-instance Tor SOCKS5 pool

Platform Package Reuse

Nightcrawler maximizes reuse of existing @lilith/* packages:

| Package | Usage |
| --- | --- |
| @lilith/text-processing-utils | Bio normalization, email extraction, cosine similarity scoring |
| @lilith/text-processing-algorithms | DoubleMetaphone (phonetic name matching), LevenshteinDistance (fuzzy names), Trie (username lookup) |
| @lilith/circuit-breaker | Per-platform failure isolation |
| @lilith/retry | Retry decorator on scrape methods |
| @lilith/client-base | HTTP client with middleware for photo downloads |
| @lilith/geo-utils | City normalization, adjacent-city distance calculations |
| @lilith/typeorm-pgcrypto | Column-level encryption for PII (email, phone) |
| @lilith/terminal-cli-parser | CLI argument parsing |
| @lilith/lix-cli | Terminal UI (spinners, progress, tables) |
| @lilith/yaml-loader | Type-safe YAML config with Zod validation |
| @lilith/distributed-lock | Prevent duplicate crawl sessions |

Selectors

CSS selectors are stored in selectors/*.json files — editable without code changes. When a site updates its DOM structure, update the selector file and re-run.

Use the discover command to dump DOM structure for a platform:

tsx src/index.ts discover --platform eros --city los-angeles --no-headless

This opens a visible browser, navigates to listings and profiles, and logs the DOM tree to help you map selectors.

Implementation Status (as of 2026-02-08)

Components: DONE

These components are implemented and tested, but not yet exercised end to end because the crawl loop (crawlPlatformCity) is still a stub:

  • TypeORM entities + standalone DataSource (16 entities, PostgreSQL)
  • Types, interfaces, constants, city configs (908-line type system)
  • YAML config loader with Zod validation
  • Blocklist service (SHA-256 hash, check, add, import)
  • CLI shell with command routing (crawl, discover, blocklist, outreach, export, stats)
  • Playwright stealth browser manager (proxy rotation, cookie persistence, resource blocking)
  • Human behavior simulation (Gaussian timing via Box-Muller, Bezier mouse, natural typing)
  • Cookie persistence per platform
  • Altcha PoW solver (SHA-256 challenge computation)
  • Selector discovery mode (interactive browser-based CSS selector finder)
  • Base adapter (selector loading, scraping, rate/menu/photo/social extraction, pagination)
  • Tryst adapter (Cloudflare Turnstile + Altcha PoW handling)
  • Photo hasher (dHash + pHash via sharp, Hamming distance)
  • Dedup engine (6-signal weighted matching, 0.70 threshold)
  • Profile processing pipeline (blocklist -> freshness -> hash -> dedup -> upsert)
  • Provider upsert with transactional merge/create
  • Proxy/Tor rotation (round-robin across N instances, tor/socks5/http)
  • Circuit breaker per platform (fault isolation)
  • Progress UI (terminal spinners, bars, tables)

Orchestrator Wiring: NOT DONE

The core crawl loop crawlPlatformCity() at src/pipeline/orchestrator.ts:136-156 is a TODO stub. It contains a console.log and zeroed-out stats. All downstream pipeline methods (processProfile, upsertProvider, etc.) are real — but the entry loop that feeds them was never implemented.

What's missing (~50-80 lines):

  • Instantiate BrowserManager and launch browser context
  • Create platform adapter via factory
  • Paginate listing pages via adapter.scrapeListings()
  • Loop through profile URLs with human-like delays
  • Call adapter.scrapeProfile() + adapter.revealContact()
  • Feed results into existing processProfile() pipeline
  • Close browser context on completion

Selectors: NOT DONE

The selectors/ directory does not exist. No selector JSON files have been created for any platform.

  • Create selectors/ directory
  • Generate selectors/tryst.json via discovery mode (interactive, requires human operator)
  • Generate selectors/eros.json via discovery mode
  • Generate selectors/transescorts.json via discovery mode

Configuration: NOT DONE

  • Create crawl-config.yaml from example (DB credentials, platform/city selection)
  • Create PostgreSQL nightcrawler database
  • Install Playwright browsers (npx playwright install chromium)

Adapter Gaps

  • Eros adapter: URL builders only; needs discovery to understand bot detection
  • TransEscorts adapter: URL builders only; needs discovery
  • State extraction: hardcoded 'CA' at orchestrator.ts:290, which is wrong for Las Vegas (NV)

Proxy/Tor

Code-complete, but unreachable until the orchestrator is implemented:

  • Round-robin rotation: activeContexts % instances in browser-manager.ts:74
  • Config schema: proxy.enabled, proxy.type (tor/socks5/http), proxy.instances, proxy.startPort
  • No Tor containers or system setup included — operator responsibility

M2: Classification (code exists, untested against real data)

  • LLM client with retry/timeout
  • Menu extractor (LLM + regex fallback)
  • Bio analyzer (tone, length, richness)
  • Rate normalizer (tier assignment)
  • Characteristic extractor (platform tags + bio regex)
  • Feature vector builder
  • Clustering (k-means/DBSCAN)
  • Classification pipeline orchestrator
  • JSON/CSV export with confidence filtering
  • Validate against real scraped data (blocked by M1 orchestrator)

M3: Outreach (architecture exists, partially implemented)

  • Email sender (CAN-SPAM compliance, unsubscribe, tracking)
  • Template service (CRUD, variable substitution)
  • Sequence service (multi-step campaigns)
  • A/B test service (Bayesian analytics)
  • Pacing engine (daily/hourly rate limits)
  • Safety breaker (kill-switch on opt-out rate thresholds)
  • Reply classifier (LLM intent detection)
  • Reply router (FAQ, follow-up, escalate)
  • Conversion detector + attribution
  • REST API + React dashboard
  • iMessage client (stub, not production-ready)
  • Variation generator (stub, no ML generation)

Testing

# Run all tests
bun test

# Run specific test
bun test tests/blocklist.test.ts

# Watch mode
bun test --watch

Test Coverage

| Test File | What It Covers |
| --- | --- |
| blocklist.test.ts | SHA-256 hashing, normalization, check/add/import |
| dedup-engine.test.ts | Weighted matching, threshold behavior, edge cases |
| photo-hasher.test.ts | dHash/pHash computation, Hamming distance |
| crawl-config.test.ts | YAML loading, Zod validation, default values |
| human-behavior.test.ts | Gaussian distribution properties |

Privacy & Ethics

  • No images stored: Photos are downloaded to memory, hashed for dedup, then immediately discarded
  • PII encrypted at rest: Email and phone use pgcrypto column encryption
  • Blocklist is hash-only: SHA-256 hashes, never plaintext identifiers
  • Opt-out respected: Blocklisted providers are fully deleted and can never be re-created
  • Isolated database: Nightcrawler data never touches the platform database
  • Registered user protection: Platform members are auto-blocklisted on registration