platform-codebase/tools/nightcrawler/docs/GLOSSARY.md

Nightcrawler Glossary

Canonical definitions for terms used across the nightcrawler codebase and documentation.

| Term | Definition |
| --- | --- |
| Nightcrawler | The tool system name. Used as a noun prefix for frontend types (`NightcrawlerJob`, `NightcrawlerSession`). Not a verb. |
| Crawl (verb) | Navigate listing pages, paginate, and collect profile URLs from a platform. |
| Scrape (verb) | Extract structured data from a single profile page via CSS selectors. |
| Adapter | Platform-specific scraper implementation (extends `BaseAdapter`). One per platform: `tryst-adapter.ts`, `eros-adapter.ts`, `transescorts-adapter.ts`. |
| Provider | An escort/service provider discovered on a listing platform. Canonical record in `discovered_providers`. |
| Deduplication | Multi-signal weighted matching (photo hash, email, phone, social, name, bio) to identify the same provider across platforms. Match threshold: 0.70. |
| Blocklist | SHA-256 hash-based opt-out system. No plaintext identifiers stored. Checked before crawling and on opt-out request. |
| Session | A single crawl run targeting one platform + location combination. Recorded in `crawl_sessions` with discovery counts and error logs. |
| Job | A queued unit of work managed by BullMQ. Types: `crawl-location`, `crawl-platform-city`, `crawl-full`, `discover-selectors`, `reprocess-provider`, `backfill-provider`, `reprocess-from-html`. |
| Pipeline Step | A discrete processing phase within a job: `tor` → `browser` → `navigate` → `antibot` → `extract` → `contact` → `snapshot` → `dedup` → `upsert`. |
| Processing | Post-crawl pipeline that re-visits provider profiles to extract bio, contact info, and other structured data from saved HTML or live pages. |
| Classification (M2) | LLM-powered provider intelligence: bio analysis, rate tier assignment, service menu extraction, clustering. |
| Outreach (M3) | Multi-channel campaign system for contacting providers via email/iMessage with A/B testing, pacing, and safety breakers. |
| Contact Reveal | Clicking hidden "show email/phone" buttons on profile pages, often gated behind CAPTCHAs. |
| Circuit Breaker | Per-platform fault isolation: opens after consecutive failures, half-open retry after timeout. |
| Selector | CSS selector mapping stored in `selectors/*.json`. Defines how to extract data from each platform's DOM structure. |
| Ghost Session | A session that has been running for >10 minutes with zero discoveries. Flagged in the UI as potentially stuck. |
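
The Deduplication entry can be illustrated with a minimal sketch. Only the signal list and the 0.70 threshold come from this glossary; the per-signal weights, the `Signal` type, and the function names below are hypothetical and chosen purely for illustration.

```typescript
// Signals named in the glossary. The weights are ASSUMED values, not the real ones.
type Signal = "photoHash" | "email" | "phone" | "social" | "name" | "bio";

const WEIGHTS: Record<Signal, number> = {
  photoHash: 0.3,
  email: 0.25,
  phone: 0.2,
  social: 0.1,
  name: 0.1,
  bio: 0.05,
};

// Threshold taken from the glossary.
const MATCH_THRESHOLD = 0.7;

// Each similarity is expected in [0, 1]; a missing signal contributes nothing.
function matchScore(similarities: Partial<Record<Signal, number>>): number {
  let score = 0;
  for (const key of Object.keys(WEIGHTS) as Signal[]) {
    score += WEIGHTS[key] * (similarities[key] ?? 0);
  }
  return score;
}

function isSameProvider(similarities: Partial<Record<Signal, number>>): boolean {
  return matchScore(similarities) >= MATCH_THRESHOLD;
}
```

Under these assumed weights, matching photo hash, email, and phone together clears the threshold, while a name and bio match alone does not.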
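
The Blocklist entry describes storing only SHA-256 hashes. A minimal sketch of that idea follows, using Node's built-in `crypto` module. The `Blocklist` class, the `hashIdentifier` helper, and the trim/lower-case normalisation are assumptions, not the project's actual implementation.

```typescript
import { createHash } from "node:crypto";

// ASSUMPTION: identifiers are trimmed and lower-cased before hashing so that
// trivially different spellings of the same email or handle hash identically.
function hashIdentifier(identifier: string): string {
  return createHash("sha256")
    .update(identifier.trim().toLowerCase(), "utf8")
    .digest("hex");
}

// Hypothetical in-memory blocklist: only hex digests are kept, never plaintext.
class Blocklist {
  private readonly hashes = new Set<string>();

  add(identifier: string): void {
    this.hashes.add(hashIdentifier(identifier));
  }

  has(identifier: string): boolean {
    return this.hashes.has(hashIdentifier(identifier));
  }
}
```

A crawler would call `has()` before visiting a profile and `add()` when an opt-out request arrives.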
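
The Circuit Breaker entry describes a three-state machine (closed, open, half-open). The sketch below shows one common way to write it. The class name, the failure threshold, and the reset timeout are hypothetical, since the glossary does not give concrete values.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical per-platform breaker; thresholds are supplied by the caller.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // Returns true if a request may proceed at time `now` (milliseconds).
  // An open breaker moves to half-open once the reset timeout has elapsed.
  canRequest(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  // A failed half-open probe, or too many consecutive failures, opens the breaker.
  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

Keeping one instance per platform (for example in a `Map` keyed by platform name) gives the per-platform isolation the definition mentions.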
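
The Ghost Session rule reduces to a small predicate. In this sketch the `SessionSnapshot` shape and its field names are hypothetical; only the 10-minute limit and the zero-discoveries condition come from the definition above.

```typescript
// Hypothetical view of a session; the real crawl_sessions columns may differ.
interface SessionSnapshot {
  startedAt: Date;
  discoveries: number;
  finishedAt?: Date;
}

// 10 minutes, per the glossary definition.
const GHOST_AFTER_MS = 10 * 60 * 1000;

function isGhostSession(session: SessionSnapshot, now: Date = new Date()): boolean {
  // A finished session is no longer "running", so it cannot be a ghost.
  if (session.finishedAt !== undefined) return false;
  const runningMs = now.getTime() - session.startedAt.getTime();
  return session.discoveries === 0 && runningMs > GHOST_AFTER_MS;
}
```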