Nightcrawler Glossary
Canonical definitions for terms used across the nightcrawler codebase and documentation.
| Term | Definition |
|---|---|
| Nightcrawler | The name of the system. Used as a noun prefix for frontend types (NightcrawlerJob, NightcrawlerSession). Never used as a verb. |
| Crawl (verb) | Navigate listing pages, paginate, and collect profile URLs from a platform. |
| Scrape (verb) | Extract structured data from a single profile page via CSS selectors. |
| Adapter | Platform-specific scraper implementation (extends BaseAdapter). One per platform: tryst-adapter.ts, eros-adapter.ts, transescorts-adapter.ts. |
| Provider | An escort/service provider discovered on a listing platform. Canonical record in discovered_providers. |
| Deduplication | Multi-signal weighted matching (photo hash, email, phone, social, name, bio) to identify the same provider across platforms. Match threshold: 0.70. |
| Blocklist | SHA-256 hash-based opt-out system. No plaintext identifiers stored. Checked before crawling and on opt-out request. |
| Session | A single crawl run targeting one platform + location combination. Recorded in crawl_sessions with discovery counts and error logs. |
| Job | A queued unit of work managed by BullMQ. Types: crawl-location, crawl-platform-city, crawl-full, discover-selectors, reprocess-provider, backfill-provider, reprocess-from-html. |
| Pipeline Step | A discrete processing phase within a job: tor → browser → navigate → antibot → extract → contact → snapshot → dedup → upsert. |
| Processing | Post-crawl pipeline that re-visits provider profiles to extract bio, contact info, and other structured data from saved HTML or live pages. |
| Classification | (M2) LLM-powered provider intelligence: bio analysis, rate tier assignment, service menu extraction, clustering. |
| Outreach | (M3) Multi-channel campaign system for contacting providers via email/iMessage with A/B testing, pacing, and safety breakers. |
| Contact Reveal | Clicking hidden "show email/phone" buttons on profile pages, often gated behind CAPTCHAs. |
| Circuit Breaker | Per-platform fault isolation: the breaker opens after consecutive failures, then permits a half-open retry once a timeout elapses. |
| Selector | CSS selector mapping stored in selectors/*.json. Defines how to extract data from each platform's DOM structure. |
| Ghost Session | A session that has been running for >10 minutes with zero discoveries. Flagged in the UI as potentially stuck. |
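
The Circuit Breaker entry describes a small state machine (closed, open, half-open). The sketch below is one possible reading of that definition, not the codebase's actual implementation. The class name `CircuitBreaker`, the `failureThreshold` and `resetTimeoutMs` parameters, and the `breakerFor` helper are all hypothetical, since the glossary gives no concrete threshold or timeout.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical per-platform breaker; threshold and timeout values are assumptions.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    // Injectable clock so the behaviour can be tested without waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns true if a request may be attempted right now.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      // Timeout elapsed: allow a trial request.
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  // Any success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  // A failed trial request, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// One breaker per platform, created lazily (hypothetical helper).
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(platform: string): CircuitBreaker {
  let breaker = breakers.get(platform);
  if (!breaker) {
    breaker = new CircuitBreaker(3, 60_000); // assumed defaults
    breakers.set(platform, breaker);
  }
  return breaker;
}
```

Keying the map by platform name keeps a failing platform from blocking jobs for the others, which is the "fault isolation" the definition refers to.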
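
The Blocklist entry says that only SHA-256 hashes are stored, never plaintext identifiers. A minimal sketch of that idea follows, using Node's built-in `node:crypto` module. The `hashIdentifier` function, the `Blocklist` class, and the trim-and-lowercase normalization step are assumptions made for illustration; the glossary does not say how identifiers are normalized or where the hashes are persisted.

```typescript
import { createHash } from "node:crypto";

// Hash an identifier (email, phone, etc.) so the plaintext never needs storing.
// Normalization (trim + lowercase) is an assumption, not documented behaviour.
function hashIdentifier(identifier: string): string {
  const normalized = identifier.trim().toLowerCase();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Hypothetical in-memory blocklist that holds only hex digests.
class Blocklist {
  private readonly hashes = new Set<string>();

  // Called when an opt-out request arrives.
  add(identifier: string): void {
    this.hashes.add(hashIdentifier(identifier));
  }

  // Called before crawling to skip anyone who has opted out.
  has(identifier: string): boolean {
    return this.hashes.has(hashIdentifier(identifier));
  }
}

const blocklist = new Blocklist();
blocklist.add("  Person@Example.com ");
console.log(blocklist.has("person@example.com")); // true
```

Because both `add` and `has` go through the same hashing function, a lookup succeeds whenever the normalized forms match, and the set itself never contains a readable identifier.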