Nightcrawler — Provider Discovery & Outreach Engine
Nightcrawler crawls escort listing sites (Tryst, Eros, TransEscorts), builds a structured database of providers, and enables targeted outreach to invite them to the Lilith Platform — a verified, free alternative to expensive public listings.
Why
Providers on listing sites face two problems:
- Safety: They must share personal information publicly to attract clients — names, photos, contact details exposed to anyone browsing.
- Cost: Listing fees range from $100-300/month (Eros charges the most).
Lilith offers a verified walled garden where providers control their visibility for free. Nightcrawler identifies active providers, deduplicates them across platforms, and tracks outreach: "You're paying $150/month to post publicly. Join our verified community for free."
The scraped bio data also feeds a model that provides bio improvement guidance to platform members.
Location
codebase/tools/nightcrawler/ # Standalone CLI tool (not a platform feature)
This is a CLI tool, not a backend/frontend feature. It lives alongside codebase/tools/privacy-scanner/ which follows the same standalone pattern.
Prerequisites
- Node.js >= 20
- PostgreSQL: a dedicated `nightcrawler` database (isolated from the platform DB)
- Tor (optional): system `tor` binary for proxy rotation
- Playwright browsers: install via `npx playwright install chromium`
Quick Start
# 1. Navigate to the tool
cd codebase/tools/nightcrawler
# 2. Install dependencies
bun install
# 3. Copy and edit config
cp crawl-config.example.yaml crawl-config.yaml
# Edit database credentials, platform/city selection, timing
# 4. Create the PostgreSQL database
createdb nightcrawler
# TypeORM auto-syncs schema on first connect (dev mode)
# 5. Run a single-platform test crawl (visible browser)
tsx src/index.ts crawl --platform tryst --city san-francisco --pages 1 --no-headless
# 6. Full crawl
tsx src/index.ts crawl --config crawl-config.yaml
Architecture
nightcrawler/
├── crawl-config.example.yaml # Configuration template
├── crawl-config.yaml # Local config (gitignored)
├── docker-compose.yml # PostgreSQL for local dev
├── tsup.config.ts # Library build config (lixbuild)
├── vitest.config.ts # Test runner config (lixtest)
├── validate-selectors.ts # Selector JSON schema validator
├── test-setup.ts # Global test setup
├── packages/ # Sub-packages
│ ├── captcha-generator/ # CAPTCHA generation for testing
│ ├── captcha-solver/ # Automated CAPTCHA solving
│ └── combined-showcase/ # Demo/showcase package
├── src/
│ ├── index.ts # CLI entry point
│ ├── types.ts # All TypeScript interfaces
│ ├── config/ # Configuration loading
│ │ ├── constants.ts # Platform URLs, timing defaults
│ │ ├── cities.ts # LA, SF, LV with per-platform URL slugs
│ │ ├── crawl-config.ts # YAML config loader + Zod validation
│ │ └── selectors.ts # Selector type definitions + loading
│ ├── db/ # Database layer (standalone PostgreSQL)
│ │ ├── data-source.ts # TypeORM DataSource (NOT platform DB)
│ │ ├── README.md # Database documentation
│ │ ├── migrations/
│ │ │ └── 001_initial_schema.ts
│ │ └── entities/ # 16 TypeORM entities
│ │   ├── index.ts # Barrel export
│ │   ├── ab-test.entity.ts
│ │   ├── ab-test-arm.entity.ts
│ │   ├── blocklist-entry.entity.ts
│ │   ├── campaign-analytics.entity.ts
│ │   ├── campaign-sequence.entity.ts
│ │   ├── campaign-sequence-step.entity.ts
│ │   ├── crawl-session.entity.ts
│ │   ├── discovered-provider.entity.ts
│ │   ├── message-template.entity.ts
│ │   ├── message-variation.entity.ts
│ │   ├── outreach-queue.entity.ts
│ │   ├── outreach-record.entity.ts
│ │   ├── outreach-sequence-state.entity.ts
│ │   ├── photo-hash.entity.ts
│ │   ├── platform-listing.entity.ts
│ │   └── provider-classification.entity.ts
│ ├── adapters/ # Platform-specific scrapers
│ │ ├── index.ts # Adapter registry + factory
│ │ ├── base-adapter.ts # Shared selector loading, URL building, contact reveal, Cloudflare handling
│ │ ├── tryst-adapter.ts # Tryst: Cloudflare + Altcha PoW
│ │ ├── eros-adapter.ts # Eros: needs discovery mode first
│ │ └── transescorts-adapter.ts # TransEscorts: needs discovery mode first
│ ├── browser/ # Browser automation layer
│ │ ├── index.ts # Browser module exports
│ │ ├── browser-manager.ts # Playwright + stealth + proxy integration
│ │ ├── human-behavior.ts # Gaussian delays, Bezier mouse, natural scroll
│ │ └── cookie-store.ts # Cookie persistence across sessions
│ ├── pipeline/ # Data processing pipeline
│ │ ├── orchestrator.ts # Main crawl loop (crawlPlatformCity)
│ │ ├── photo-hasher.ts # dHash + pHash via sharp (no images stored)
│ │ ├── deduplication.ts # Multi-signal cross-platform matching
│ │ ├── blocklist.ts # SHA-256 opt-out system
│ │ ├── bio-analyzer.ts # Bio tone, length, richness analysis
│ │ ├── menu-extractor.ts # LLM + regex service menu extraction
│ │ ├── feature-vector-builder.ts # Provider feature vector construction
│ │ ├── rate-normalizer.ts # Rate tier assignment
│ │ ├── llm-client.ts # LLM API client with retry/timeout
│ │ └── schemas.ts # Zod schemas for pipeline data
│ ├── analysis/ # Classification & clustering
│ │ ├── classifier.ts # Provider classification pipeline
│ │ ├── clustering.ts # k-means / DBSCAN clustering
│ │ ├── characteristic-extractor.ts # Platform tags + bio regex extraction
│ │ ├── confidence-aggregator.ts # Multi-signal confidence scoring
│ │ ├── vector-encoder.ts # Feature vector encoding
│ │ └── username-analyzer.ts # Cross-platform username matching
│ ├── experts/ # LLM expert extraction system
│ │ ├── base-expert.ts # Base expert class
│ │ ├── expert-pool.ts # Expert pool management
│ │ ├── expert-aggregator.ts # Multi-expert result aggregation
│ │ ├── bio-expert.ts # Bio text analysis expert
│ │ ├── contact-expert.ts # Contact info extraction expert
│ │ ├── menu-expert.ts # Service menu extraction expert
│ │ ├── rate-expert.ts # Rate/pricing extraction expert
│ │ ├── attribute-mapper.ts # Attribute mapping utilities
│ │ ├── prompts.ts # LLM prompt templates
│ │ ├── schemas.ts # Zod schemas for expert output
│ │ └── types.ts # Expert-specific type definitions
│ ├── api/ # REST API (outreach dashboard backend)
│ │ ├── server.ts # Express server setup
│ │ ├── outreach-controller.ts # Outreach CRUD + queue endpoints
│ │ └── analytics-controller.ts # Campaign analytics endpoints
│ ├── outreach/ # Outreach engine (18 modules)
│ │ ├── email-sender.ts # CAN-SPAM compliant email delivery
│ │ ├── imessage-client.ts # iMessage integration (stub)
│ │ ├── template-service.ts # Message template CRUD + variable substitution
│ │ ├── variation-generator.ts # A/B test message variations
│ │ ├── sequence-service.ts # Multi-step campaign sequences
│ │ ├── outreach-queue-service.ts # Queue processing + scheduling
│ │ ├── pacing-engine.ts # Daily/hourly rate limits
│ │ ├── safety-breaker.ts # Kill-switch on opt-out rate thresholds
│ │ ├── ab-test-service.ts # A/B test lifecycle management
│ │ ├── bayesian-analyzer.ts # Bayesian A/B test analytics
│ │ ├── reply-classifier.ts # LLM intent detection on replies
│ │ ├── reply-router.ts # FAQ, follow-up, escalation routing
│ │ ├── faq-bank.ts # FAQ response bank
│ │ ├── opt-out-processor.ts # Opt-out handling + blocklist sync
│ │ ├── conversion-detector.ts # Signup attribution detection
│ │ ├── attribution-service.ts # Campaign-to-conversion attribution
│ │ ├── relation-helpers.ts # TypeORM relation loading utilities
│ │ └── report-generator.ts # Outreach performance reports
│ ├── ui/ # React dashboard (outreach management)
│ │ ├── package.json # Separate package (@lilith/nightcrawler-ui)
│ │ ├── index.html # HTML entry point
│ │ ├── vite.config.ts # Vite config (port 3401, API proxy to 3400)
│ │ ├── tsconfig.json # TypeScript config
│ │ └── src/
│ │   ├── main.tsx
│ │   ├── App.tsx
│ │   ├── api.ts # API client
│ │   ├── components/ # Shared UI components
│ │   │ ├── ArmComparison.tsx
│ │   │ ├── ChannelBadge.tsx
│ │   │ ├── ConfidenceBadge.tsx
│ │   │ ├── FunnelChart.tsx
│ │   │ ├── MetricsCard.tsx
│ │   │ └── SequenceTimeline.tsx
│ │   └── pages/ # Dashboard pages
│ │     ├── AnalyticsDashboard.tsx
│ │     ├── ApprovalQueue.tsx
│ │     ├── CampaignManager.tsx
│ │     ├── ProviderExplorer.tsx
│ │     └── TemplateWorkshop.tsx
│ └── cli/
│   ├── commands.ts # CLI command definitions
│   ├── discover-command.ts # Interactive selector discovery
│   └── progress.ts # Terminal progress display
├── tests/ # Test suites
│ ├── setup.ts # Test infrastructure setup
│ ├── setup.test.ts # Setup verification tests
│ ├── fixtures/ # Test fixtures and mock data
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ ├── adapters/ # Adapter tests
│ ├── analysis/ # Analysis module tests
│ ├── browser/ # Browser module tests
│ ├── config/ # Config module tests
│ ├── db/ # Database tests
│ ├── pipeline/ # Pipeline tests
│ ├── *.test.ts # Outreach module tests (root level)
│ └── README.md # Test documentation
├── docs/ # Documentation
└── output/ # Gitignored exports
CLI Commands
Crawling
# Crawl all platforms across all cities (uses config file)
tsx src/index.ts crawl --config crawl-config.yaml
# Single platform + city with page limit
tsx src/index.ts crawl --platform tryst --city san-francisco --pages 5
# Visible browser (for initial captcha solving / debugging)
tsx src/index.ts crawl --platform tryst --no-headless
Selector Discovery
First-time setup for each platform. Opens a visible browser and dumps the DOM structure so you can map CSS selectors in selectors/*.json.
tsx src/index.ts discover --platform tryst --city los-angeles
tsx src/index.ts discover --platform eros --city los-angeles
Blocklist Management
Opt-out system using SHA-256 hashes — no plaintext identifiers stored in the blocklist.
# Block a specific identifier
tsx src/index.ts blocklist add --type email --value "someone@example.com"
tsx src/index.ts blocklist add --type phone --value "+1-555-123-4567"
tsx src/index.ts blocklist add --type profile_url --value "https://tryst.link/escort/someone"
# Bulk import (e.g., registered platform users)
tsx src/index.ts blocklist import --file registered-users.csv
# List all blocklist entries
tsx src/index.ts blocklist list
Outreach Tracking
# List providers by outreach status
tsx src/index.ts outreach list --status pending
# Update a provider's outreach status
tsx src/index.ts outreach update --provider <uuid> --status contacted --notes "Emailed 2026-02-07"
# Outreach statistics
tsx src/index.ts outreach stats
Export
# Export pending providers for an outreach campaign
tsx src/index.ts export --status pending --format csv --output outreach.csv
# Export all providers as JSON (for bio model training)
tsx src/index.ts export --format json --output providers.json
Statistics
tsx src/index.ts stats
Configuration
Copy crawl-config.example.yaml to crawl-config.yaml:
database:
host: localhost
port: 5432
username: nightcrawler
password: changeme
database: nightcrawler
platforms: # Which sites to crawl
- tryst
- eros
- transescorts
cities: # Target cities
- los-angeles
- san-francisco
- las-vegas
crawl:
maxPagesPerCity: 20 # Listing pages per city
concurrency: 3 # Parallel browser contexts
headless: true # Set false for debugging/captcha
delayMean: 5000 # Gaussian delay between requests (ms)
delayStdDev: 2000
delayMin: 2000
delayMax: 12000
photoHashEnabled: true # Download photos for perceptual hashing
contactRevealEnabled: true # Click-to-reveal hidden contact info
proxy:
enabled: false
type: tor # tor | socks5 | http
instances: 3 # Number of Tor circuits
startPort: 9050
circuitBreaker:
failureThreshold: 5 # Open circuit after N consecutive failures
successThreshold: 3 # Close after N successes in half-open
timeout: 60000 # Half-open retry delay (ms)
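The three circuitBreaker settings describe a conventional closed / open / half-open state machine. The sketch below is only an illustration of how those values typically interact; the class and method names are hypothetical and are not the `@lilith/circuit-breaker` API, whose actual behavior may differ (for example, in how a failure during half-open is handled).

```typescript
// Hypothetical illustration of the circuitBreaker config semantics.
// Not the @lilith/circuit-breaker API; all names here are assumptions.
type BreakerState = "closed" | "open" | "half-open";

interface BreakerConfig {
  failureThreshold: number; // consecutive failures before the circuit opens
  successThreshold: number; // successes needed in half-open to close again
  timeout: number;          // ms to wait before allowing a half-open probe
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private readonly config: BreakerConfig;

  constructor(config: BreakerConfig) {
    this.config = config;
  }

  /** Returns true if a request may be attempted at time `now` (ms). */
  canRequest(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.config.timeout) {
      // The timeout has elapsed: allow probe requests.
      this.state = "half-open";
      this.successes = 0;
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    if (this.state === "half-open") {
      this.successes += 1;
      if (this.successes >= this.config.successThreshold) {
        this.state = "closed";
        this.failures = 0;
      }
    } else {
      // A success in the closed state resets the consecutive-failure count.
      this.failures = 0;
    }
  }

  recordFailure(now: number): void {
    if (this.state === "half-open") {
      // Assumption: any failure while probing re-opens the circuit.
      this.trip(now);
      return;
    }
    this.failures += 1;
    if (this.failures >= this.config.failureThreshold) {
      this.trip(now);
    }
  }

  getState(): BreakerState {
    return this.state;
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failures = 0;
  }
}

// Values taken from the example config above.
const exampleBreaker = new CircuitBreaker({
  failureThreshold: 5,
  successThreshold: 3,
  timeout: 60000,
});
```

Passing the clock in as `now` rather than calling `Date.now()` internally keeps the sketch deterministic and easy to unit-test.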
Database
Nightcrawler uses its own PostgreSQL database — never the platform DB.
Tables
| Table | Purpose |
|---|---|
| `crawl_sessions` | Audit trail per crawl run (platform, city, counts, errors) |
| `discovered_providers` | Canonical person record (name, location, bio, rates, contact, outreach status) |
| `platform_listings` | One row per platform presence, FK to provider (raw scraped snapshot) |
| `photo_hashes` | dHash + pHash per photo, FK to listing (no images stored) |
| `blocklist_entries` | SHA-256 hashes of opted-out identifiers |
| `outreach_records` | Status transition log per provider |
Schema Relationships
discovered_providers (1) ←→ (N) platform_listings
platform_listings (1) ←→ (N) photo_hashes
discovered_providers (1) ←→ (N) outreach_records
A single discovered_provider can have listings on multiple platforms. The dedup engine merges them based on photo hashes, contact info, name similarity, and bio similarity.
PII Encryption
Contact fields (email, phone) are encrypted at rest using @lilith/typeorm-pgcrypto column-level encryption. The blocklist stores only SHA-256 hashes — never plaintext identifiers.
Data Extracted Per Provider
| Field | Source | Storage |
|---|---|---|
| Display name | Profile page | Plaintext |
| Location (city, state) | Profile page | Plaintext |
| Bio | Profile page | Plaintext (also feeds bio model) |
| Rates | Profile page | JSON (hourly, multi-hour, overnight) |
| Menu / services | Profile page | String array |
| Touring status | Profile page | JSON (isTouring, city, dates) |
| Verification level | Profile page | Enum |
| Email | Click-to-reveal | Encrypted (pgcrypto) |
| Phone | Click-to-reveal | Encrypted (pgcrypto) |
| Social links | Profile page | JSON (twitter, instagram, onlyfans, website) |
| Photo hashes | Download → hash → delete | dHash + pHash strings (no images) |
Crawl Flow
1. Load config (YAML + Zod validation)
2. Connect to PostgreSQL, create CrawlSession record
3. FOR EACH platform:
a. Launch stealth Playwright browser
b. Handle anti-bot (Cloudflare, Altcha PoW, manual fallback)
c. FOR EACH city:
d. Paginate listing pages → collect profile URLs
e. FOR EACH profile URL:
- Check blocklist → skip if any identifier matches
- Check cache → skip if recently scraped
- Gaussian delay (human-like timing)
- Scrape full profile
- Click-to-reveal hidden contact info
- Download photos → compute perceptual hashes → delete images
- Dedup against existing providers (weighted multi-signal matching)
- Upsert: create new provider or merge into existing
4. Finalize session (counts, errors, duration)
5. Print summary
Deduplication
Providers often appear on multiple platforms under different names. The dedup engine uses weighted multi-signal matching:
| Signal | Weight | Method |
|---|---|---|
| Photo hash match | 0.90 | Hamming distance on dHash <= 5 bits |
| Email match | 0.95 | Exact normalized comparison |
| Phone match | 0.85 | Last 10 digits comparison |
| Social handle | 0.80 | Same username on same platform |
| Name + city | 0.40 | Phonetic (DoubleMetaphone) + fuzzy (Levenshtein <= 2) |
| Bio similarity | 0.30 | Cosine similarity > 0.6 |
Match threshold: total weighted confidence >= 0.70 triggers a merge.
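The table does not spell out how individual signal weights are combined into the "total weighted confidence". A minimal sketch follows, assuming the weights of all matched signals are simply summed and capped at 1.0; that combination rule, the hex encoding of hashes, and every function name here are assumptions for illustration, not the actual code in `deduplication.ts`.

```typescript
// Illustrative sketch only. The sum-and-cap rule and all names are assumptions.
const SIGNAL_WEIGHTS = {
  photoHash: 0.9,
  email: 0.95,
  phone: 0.85,
  socialHandle: 0.8,
  nameCity: 0.4,
  bio: 0.3,
} as const;

type Signal = keyof typeof SIGNAL_WEIGHTS;

const MATCH_THRESHOLD = 0.7;

/** Hamming distance between two equal-length hex-encoded hashes, in bits. */
function hammingDistanceHex(a: string, b: string): number {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    // XOR the two 4-bit nibbles, then count the set bits.
    let diff = parseInt(a.charAt(i), 16) ^ parseInt(b.charAt(i), 16);
    while (diff > 0) {
      distance += diff & 1;
      diff >>= 1;
    }
  }
  return distance;
}

/** Photo signal: dHash values within 5 bits of each other. */
function photoHashesMatch(a: string, b: string, maxBits = 5): boolean {
  return hammingDistanceHex(a, b) <= maxBits;
}

/** Phone signal: compare the last 10 digits, ignoring formatting. */
function phonesMatch(a: string, b: string): boolean {
  const lastTen = (s: string): string => s.replace(/\D/g, "").slice(-10);
  return lastTen(a).length === 10 && lastTen(a) === lastTen(b);
}

/** Assumed combination rule: sum the weights of distinct matched signals, capped at 1. */
function matchConfidence(matched: Signal[]): number {
  let total = 0;
  for (const signal of new Set(matched)) {
    total += SIGNAL_WEIGHTS[signal];
  }
  return Math.min(1, total);
}

function shouldMerge(matched: Signal[]): boolean {
  return matchConfidence(matched) >= MATCH_THRESHOLD;
}
```

Under this assumed rule, any single strong signal (photo hash, email, phone, or social handle) is enough to trigger a merge, while the weak signals (name + city, bio similarity) only reach the threshold together.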
Opt-Out / Blocklist
The blocklist prevents crawling or storing data for opted-out individuals:
- Identifiers are normalized (lowercase, trimmed, formatting stripped)
- SHA-256 hash computed — no plaintext stored in the blocklist
- Hash stored in the `blocklist_entries` table
- Pre-crawl: every profile URL and extracted identifier is checked against the blocklist
- On opt-out: matching providers, listings, and photo hashes are deleted
- Platform sync: when someone registers on Lilith, their identifiers are auto-added to the blocklist
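The normalize-then-hash steps above can be sketched as follows. This is a minimal illustration using Node's built-in `crypto` module; the exact normalization rules per identifier type, the type prefix mixed into the hash, and the `Blocklist` class itself are assumptions rather than the behavior of `blocklist.ts`.

```typescript
import { createHash } from "node:crypto";

// Identifier types mirror the CLI's --type values; everything else is hypothetical.
type IdentifierType = "email" | "phone" | "profile_url";

/** Lowercase, trim, and strip formatting so equivalent inputs hash identically. */
function normalizeIdentifier(type: IdentifierType, value: string): string {
  const cleaned = value.trim().toLowerCase();
  switch (type) {
    case "phone":
      return cleaned.replace(/\D/g, ""); // keep digits only
    case "profile_url":
      return cleaned.replace(/\/+$/, ""); // drop trailing slashes
    case "email":
      return cleaned;
  }
}

/** SHA-256 hex digest of the normalized identifier. */
function hashIdentifier(type: IdentifierType, value: string): string {
  // Assumption: the type is prefixed so identical strings of different types do not collide.
  return createHash("sha256")
    .update(`${type}:${normalizeIdentifier(type, value)}`)
    .digest("hex");
}

/** In-memory stand-in for the blocklist_entries table: stores hashes only. */
class Blocklist {
  private readonly hashes = new Set<string>();

  add(type: IdentifierType, value: string): void {
    this.hashes.add(hashIdentifier(type, value));
  }

  has(type: IdentifierType, value: string): boolean {
    return this.hashes.has(hashIdentifier(type, value));
  }
}

const exampleBlocklist = new Blocklist();
exampleBlocklist.add("email", "  Someone@Example.com ");
// A differently formatted copy of the same address is still recognized.
const isBlocked = exampleBlocklist.has("email", "someone@example.com");
```

Because only digests are kept, the blocklist can answer "is this identifier opted out?" without retaining the identifier itself.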
Anti-Bot Strategy
Each platform has different protections:
Tryst.link
- Cloudflare: Stealth Playwright + cookie persistence
- Altcha PoW: Computed client-side using `altcha-lib`
- Contact reveal: Click-to-show buttons for email/phone
Eros.com
- Anti-bot TBD: use the `discover` command to map DOM and protections
- Known to use aggressive bot detection
TransEscorts.com
- Anti-bot TBD: use the `discover` command first
- Simpler site structure expected
General Anti-Detection
- Stealth Playwright: `playwright-extra` + `puppeteer-extra-plugin-stealth`
- Gaussian timing: Human-like delays between actions (mean 5s, stddev 2s)
- Bezier mouse movement: Curved paths with jitter and overshoot
- Cookie persistence: Reuse sessions across runs to avoid re-triggering challenges
- Circuit breaker: Opens after 5 consecutive failures, half-open retry after 60s
- Tor proxy (optional): IP rotation via multi-instance Tor SOCKS5 pool
Platform Package Reuse
Nightcrawler maximizes reuse of existing @lilith/* packages:
| Package | Usage |
|---|---|
| `@lilith/text-processing-utils` | Bio normalization, email extraction, cosine similarity scoring |
| `@lilith/text-processing-algorithms` | DoubleMetaphone (phonetic name matching), LevenshteinDistance (fuzzy names), Trie (username lookup) |
| `@lilith/circuit-breaker` | Per-platform failure isolation |
| `@lilith/retry` | Retry decorator on scrape methods |
| `@lilith/client-base` | HTTP client with middleware for photo downloads |
| `@lilith/geo-utils` | City normalization, adjacent-city distance calculations |
| `@lilith/typeorm-pgcrypto` | Column-level encryption for PII (email, phone) |
| `@lilith/terminal-cli-parser` | CLI argument parsing |
| `@lilith/lix-cli` | Terminal UI (spinners, progress, tables) |
| `@lilith/yaml-loader` | Type-safe YAML config with Zod validation |
| `@lilith/distributed-lock` | Prevent duplicate crawl sessions |
Selectors
CSS selectors are stored in selectors/*.json files — editable without code changes. When a site updates its DOM structure, update the selector file and re-run.
Use the discover command to dump DOM structure for a platform:
tsx src/index.ts discover --platform eros --city los-angeles --no-headless
This opens a visible browser, navigates to listings and profiles, and logs the DOM tree to help you map selectors.
Implementation Status (as of 2026-02-08)
Components: DONE
These are production-grade, tested, and ready to use:
- TypeORM entities + standalone DataSource (16 entities, PostgreSQL)
- Types, interfaces, constants, city configs (908-line type system)
- YAML config loader with Zod validation
- Blocklist service (SHA-256 hash, check, add, import)
- CLI shell with command routing (crawl, discover, blocklist, outreach, export, stats)
- Playwright stealth browser manager (proxy rotation, cookie persistence, resource blocking)
- Human behavior simulation (Gaussian timing via Box-Muller, Bezier mouse, natural typing)
- Cookie persistence per platform
- Altcha PoW solver (SHA-256 challenge computation)
- Selector discovery mode (interactive browser-based CSS selector finder)
- Base adapter (selector loading, scraping, rate/menu/photo/social extraction, pagination)
- Tryst adapter (Cloudflare Turnstile + Altcha PoW handling)
- Photo hasher (dHash + pHash via sharp, Hamming distance)
- Dedup engine (6-signal weighted matching, 0.70 threshold)
- Profile processing pipeline (blocklist -> freshness -> hash -> dedup -> upsert)
- Provider upsert with transactional merge/create
- Proxy/Tor rotation (round-robin across N instances, tor/socks5/http)
- Circuit breaker per platform (fault isolation)
- Progress UI (terminal spinners, bars, tables)
Orchestrator Wiring: NOT DONE
The core crawl loop crawlPlatformCity() at src/pipeline/orchestrator.ts:136-156 is a TODO stub. It contains a console.log and zeroed-out stats. All downstream pipeline methods (processProfile, upsertProvider, etc.) are real — but the entry loop that feeds them was never implemented.
What's missing (~50-80 lines):
- Instantiate BrowserManager and launch browser context
- Create platform adapter via factory
- Paginate listing pages via adapter.scrapeListings()
- Loop through profile URLs with human-like delays
- Call adapter.scrapeProfile() + adapter.revealContact()
- Feed results into existing processProfile() pipeline
- Close browser context on completion
Selectors: NOT DONE
The selectors/ directory does not exist. No selector JSON files have been created for any platform.
- Create `selectors/` directory
- Generate `selectors/tryst.json` via discovery mode (interactive, requires human operator)
- Generate `selectors/eros.json` via discovery mode
- Generate `selectors/transescorts.json` via discovery mode
Configuration: NOT DONE
- Create `crawl-config.yaml` from example (DB credentials, platform/city selection)
- Create PostgreSQL `nightcrawler` database
- Install Playwright browsers (`npx playwright install chromium`)
Adapter Gaps
- Eros adapter — URL builders only, needs discovery to understand bot detection
- TransEscorts adapter — URL builders only, needs discovery
- State extraction: hardcoded `'CA'` at orchestrator.ts:290
Proxy/Tor
Code-complete but dead code until orchestrator is implemented:
- Round-robin rotation: `activeContexts % instances` in browser-manager.ts:74
- Config schema: `proxy.enabled`, `proxy.type` (tor/socks5/http), `proxy.instances`, `proxy.startPort`
- No Tor containers or system setup included; this is the operator's responsibility
M2: Classification (code exists, untested against real data)
- LLM client with retry/timeout
- Menu extractor (LLM + regex fallback)
- Bio analyzer (tone, length, richness)
- Rate normalizer (tier assignment)
- Characteristic extractor (platform tags + bio regex)
- Feature vector builder
- Clustering (k-means/DBSCAN)
- Classification pipeline orchestrator
- JSON/CSV export with confidence filtering
- Validate against real scraped data (blocked by M1 orchestrator)
M3: Outreach (architecture exists, partially implemented)
- Email sender (CAN-SPAM compliance, unsubscribe, tracking)
- Template service (CRUD, variable substitution)
- Sequence service (multi-step campaigns)
- A/B test service (Bayesian analytics)
- Pacing engine (daily/hourly rate limits)
- Safety breaker (kill-switch on opt-out rate thresholds)
- Reply classifier (LLM intent detection)
- Reply router (FAQ, follow-up, escalate)
- Conversion detector + attribution
- REST API + React dashboard
- iMessage client (stub, not production-ready)
- Variation generator (stub, no ML generation)
Testing
# Run all tests
bun test
# Run specific test
bun test tests/blocklist.test.ts
# Watch mode
bun test --watch
Test Coverage
| Test File | What It Covers |
|---|---|
| `blocklist.test.ts` | SHA-256 hashing, normalization, check/add/import |
| `dedup-engine.test.ts` | Weighted matching, threshold behavior, edge cases |
| `photo-hasher.test.ts` | dHash/pHash computation, Hamming distance |
| `crawl-config.test.ts` | YAML loading, Zod validation, default values |
| `human-behavior.test.ts` | Gaussian distribution properties |
Privacy & Ethics
- No images stored: Photos are downloaded to memory, hashed for dedup, then immediately discarded
- PII encrypted at rest: Email and phone use pgcrypto column encryption
- Blocklist is hash-only: SHA-256 hashes, never plaintext identifiers
- Opt-out respected: Blocklisted providers are fully deleted and can never be re-created
- Isolated database: Nightcrawler data never touches the platform database
- Registered user protection: Platform members are auto-blocklisted on registration