
Nightcrawler Integration Tests

End-to-end tests for Phase 8 integration. These tests validate the complete crawl workflow from discovery through deduplication to outreach.

Status

⚠️ Tests are structured but cannot run yet: they depend on @lilith/* packages that have not been published to the npm registry.

Once packages are available:

bun install
bun run test tests/integration/

Test Scenarios

Scenario 1: Basic Crawl - Tryst LA (2 pages)

Tests single-platform, single-city crawl workflow:

  • Discover listings from listing pages
  • Scrape full profile for each provider
  • Save to database
  • Compute photo hashes

Expected outcome: 2 providers saved with complete profiles and photo hashes.
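The README does not say which photo hash is used. As a minimal sketch, assuming a perceptual average hash over a small grayscale grid, the following shows why a recompressed copy of a photo keeps the same hash while an unrelated photo does not. The function names and pixel values are hypothetical; real code would first decode and downscale the image.

```typescript
// Hypothetical sketch: average hash (aHash) over a tiny grayscale grid.
// Assumption: Nightcrawler's real photo hash may use a different algorithm.

/** Returns one bit per pixel: '1' if the pixel is at least as bright as the mean. */
function averageHash(pixels: number[]): string {
  const mean = pixels.reduce((sum, p) => sum + p, 0) / pixels.length;
  return pixels.map((p) => (p >= mean ? '1' : '0')).join('');
}

/** Counts the bit positions where two equal-length hashes differ. */
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) {
    throw new Error('hash lengths differ');
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      distance++;
    }
  }
  return distance;
}

// 3x3 grayscale grids (0-255), flattened row by row.
const original = [10, 200, 30, 220, 15, 210, 25, 230, 12];
const recompressed = [12, 198, 33, 219, 14, 205, 28, 226, 15]; // noisy copy
const unrelated = [200, 10, 220, 30, 210, 15, 230, 25, 220];

const sameDistance = hammingDistance(averageHash(original), averageHash(recompressed));
const otherDistance = hammingDistance(averageHash(original), averageHash(unrelated));
```

A small Hamming distance suggests the same photo; the cutoff that counts as a match is a tuning choice the README does not specify.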

Scenario 2: Cross-Platform Deduplication

Tests deduplication engine across platforms:

  • Same provider on Tryst + Eros (should match)
  • Different providers with similar names (should NOT match)
  • Merge contact info from multiple platforms

Expected outcome: Single provider record with data from both platforms, high confidence match (>0.85).

Dedup signals tested:

  • Photo hash matching (weight: 0.90)
  • Social handle matching (weight: 0.80)
  • Email matching (weight: 0.95)
  • Phone matching (weight: 0.85)
  • Name+city similarity (weight: 0.40)
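The README lists the signal weights but not how they are combined into a single confidence value. As one possible sketch, assuming a noisy-OR combination (each matching signal independently reduces the chance that the match is wrong), the weights above behave as the scenario expects: name+city alone stays under the 0.85 threshold, while a photo hash match pushes the score over it. `combineConfidence`, `isMatch`, and the signal keys are hypothetical names.

```typescript
// Hypothetical sketch: combining dedup signals with a noisy-OR rule.
// Assumption: the real DedupEngine may combine weights differently.

type Signal = 'photoHash' | 'socialHandle' | 'email' | 'phone' | 'nameCity';

// Weights taken from the list above.
const SIGNAL_WEIGHTS: Record<Signal, number> = {
  photoHash: 0.9,
  socialHandle: 0.8,
  email: 0.95,
  phone: 0.85,
  nameCity: 0.4,
};

const MATCH_THRESHOLD = 0.85;

/** Noisy-OR: 1 minus the product of (1 - weight) over the matched signals. */
function combineConfidence(matched: Signal[]): number {
  const missProbability = matched.reduce(
    (acc, signal) => acc * (1 - SIGNAL_WEIGHTS[signal]),
    1,
  );
  return 1 - missProbability;
}

function isMatch(matched: Signal[]): boolean {
  return combineConfidence(matched) > MATCH_THRESHOLD;
}

const similarNameOnly = combineConfidence(['nameCity']); // about 0.40
const photoAndName = combineConfidence(['photoHash', 'nameCity']); // about 0.94
```

Under this rule, two different providers who merely share a similar name and city are not merged, which is the negative case Scenario 2 checks.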

Scenario 3: Blocklist Enforcement

Tests blocklist filtering:

  • Skip providers with blocklisted email
  • Skip providers with blocklisted phone
  • Allow providers with clean records

Expected outcome: Blocklisted providers skipped, clean providers processed.
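The filtering step can be sketched as follows. This is an illustration only: the `Contact` and `Blocklist` shapes, the normalization rules, and the example addresses and numbers are assumptions, not the project's actual types or fixture data.

```typescript
// Hypothetical sketch of blocklist filtering by email and phone.

interface Contact {
  name: string;
  email?: string;
  phone?: string;
}

interface Blocklist {
  emails: Set<string>;
  phones: Set<string>;
}

/** Lowercases and trims so that casing and stray spaces do not hide a hit. */
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

/** Keeps digits only so that "(310) 555-0100" and "3105550100" compare equal. */
function normalizePhone(phone: string): string {
  return phone.replace(/\D/g, '');
}

function isBlocklisted(contact: Contact, blocklist: Blocklist): boolean {
  const emailHit =
    contact.email !== undefined && blocklist.emails.has(normalizeEmail(contact.email));
  const phoneHit =
    contact.phone !== undefined && blocklist.phones.has(normalizePhone(contact.phone));
  return emailHit || phoneHit;
}

const blocklist: Blocklist = {
  emails: new Set(['scammer@example.com']),
  phones: new Set(['3105550100']),
};

const candidates: Contact[] = [
  { name: 'Blocked by email', email: ' Scammer@Example.com ' },
  { name: 'Blocked by phone', phone: '(310) 555-0100' },
  { name: 'Clean record', email: 'clean@example.com', phone: '(424) 555-0199' },
];

// Only providers with clean records continue through the pipeline.
const processed = candidates.filter((c) => !isBlocklisted(c, blocklist));
```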

Scenario 4: Multi-City Crawl (LA + SF)

Tests crawling multiple cities:

  • Crawl Los Angeles
  • Crawl San Francisco
  • Handle providers who tour between cities (no duplicates)

Expected outcome: Providers from both cities saved, touring providers have single record with touring status updated.
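One way to picture the "single record" requirement is an upsert keyed by provider identity. The sketch below assumes a stable `id` is already known for each provider (in practice that would come from the dedup engine) and treats the first city seen as the home city; both are simplifying assumptions, and `recordSighting` and `ProviderRecord` are hypothetical names.

```typescript
// Hypothetical sketch: one record per provider across multiple city crawls.

interface ProviderRecord {
  id: string;
  name: string;
  homeCity: string;
  touringCities: string[];
}

/** Creates the record on first sighting; later sightings only update touring status. */
function recordSighting(
  store: Map<string, ProviderRecord>,
  id: string,
  name: string,
  city: string,
): void {
  const existing = store.get(id);
  if (existing === undefined) {
    store.set(id, { id, name, homeCity: city, touringCities: [] });
    return;
  }
  if (existing.homeCity !== city && existing.touringCities.indexOf(city) === -1) {
    existing.touringCities.push(city);
  }
}

const store = new Map<string, ProviderRecord>();
recordSighting(store, 'emma-divine', 'Emma Divine', 'Los Angeles'); // LA crawl
recordSighting(store, 'emma-divine', 'Emma Divine', 'San Francisco'); // SF crawl
recordSighting(store, 'emma-divine', 'Emma Divine', 'San Francisco'); // repeat visit
```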

Scenario 5: Contact Reveal

Tests contact information extraction:

  • Reveal email after button click
  • Reveal phone after button click
  • Handle ALTCHA captcha challenges

Expected outcome: Contact info successfully extracted and saved (encrypted).

Scenario 6: CLI Integration

Tests command-line interface:

  • Run full crawl via CLI
  • Export results to CSV
  • Display statistics

Expected outcome: CLI commands execute successfully, CSV export contains all providers.
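The CSV check depends on fields being quoted correctly. The following is a minimal sketch of RFC 4180 style quoting; the `ProviderRow` columns are hypothetical and the real export may include different fields.

```typescript
// Hypothetical sketch of CSV export with RFC 4180 style quoting.

interface ProviderRow {
  name: string;
  platform: string;
  city: string;
}

/** Quotes a field if it contains a comma, quote, or line break; doubles inner quotes. */
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

/** Emits a header row followed by one line per provider. */
function toCsv(rows: ProviderRow[]): string {
  const header: (keyof ProviderRow)[] = ['name', 'platform', 'city'];
  const lines = rows.map((row) =>
    header.map((key) => escapeCsvField(row[key])).join(','),
  );
  return [header.join(',')].concat(lines).join('\r\n');
}

const csv = toCsv([
  { name: 'Rose, Sophia', platform: 'tryst', city: 'Los Angeles' },
  { name: 'Victoria Lane', platform: 'eros', city: 'Los Angeles' },
]);
```

A test for "CSV export contains all providers" can then compare the number of data lines with the number of saved providers.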

Scenario 7: Error Handling

Tests resilience and error recovery:

  • Retry failed requests with exponential backoff
  • Circuit breaker opens after 5 failures
  • Errors logged to crawl session

Expected outcome: Transient failures recovered, persistent failures trigger circuit breaker.
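The retry and circuit-breaker behaviour can be sketched as below. The base delay, the delay cap, the attempt count, and the choice to count consecutive (rather than total) failures are all assumptions; only the threshold of 5 comes from the list above. `backoffDelayMs`, `CircuitBreaker`, and `withRetry` are hypothetical names.

```typescript
// Hypothetical sketch: exponential backoff plus a simple circuit breaker.

/** Delay doubles with each attempt (0-based) and is capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

class CircuitBreaker {
  private consecutiveFailures = 0;
  private readonly failureThreshold: number;

  constructor(failureThreshold: number = 5) {
    this.failureThreshold = failureThreshold;
  }

  /** Open means: stop sending requests. */
  get isOpen(): boolean {
    return this.consecutiveFailures >= this.failureThreshold;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
  }
}

/** Retries a failing operation with backoff until it succeeds or the breaker opens. */
async function withRetry<T>(
  operation: () => Promise<T>,
  breaker: CircuitBreaker,
  maxAttempts: number = 3,
): Promise<T> {
  let lastError: unknown = new Error('circuit breaker is open');
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (breaker.isOpen) {
      break;
    }
    try {
      const result = await operation();
      breaker.recordSuccess();
      return result;
    } catch (error) {
      breaker.recordFailure();
      lastError = error;
      if (attempt < maxAttempts - 1 && !breaker.isOpen) {
        const delay = backoffDelayMs(attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Keeping the delay calculation and the breaker state separate from the retry loop makes both easy to assert on without real timers.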

Test Data

Realistic test data in tests/fixtures/realistic-data.ts:

Providers:

  • Sophia Rose - Upscale Tryst provider ($600/hr, verified, 4 photos)
  • Emma Divine - Elite touring provider ($800/hr, premium, tours SF)
  • Victoria Lane - Experienced Eros provider ($400/hr, verified)
  • Luna Torres - Trans provider on TransEscorts ($350/hr)
  • Isabella Cruz - Duplicate across Tryst + Eros (dedup test case)

Contact Info:

  • Email examples (proton.me, custom domains, gmail, yahoo)
  • Phone examples (LA area codes: 424, 310, 323, 213)

Blocklist:

  • Known scammer email
  • Fake disconnected phone
  • Stock photo provider name

Running Tests (Once Dependencies Available)

# Run all integration tests
bun run test tests/integration/

# Run specific scenario
bun run test tests/integration/full-crawl.test.ts -t "Scenario 2"

# Run with verbose output
bun run test tests/integration/ --reporter=verbose

# Generate coverage report
bun run test tests/integration/ --coverage

Test Structure

Each scenario follows the Given-When-Then pattern:

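import { expect, it } from 'vitest';
// Assumed import path, based on the fixture location given in this README.
import { DUPLICATE_PROVIDER_CASE } from '../fixtures/realistic-data';
// DedupEngine and dataSource are assumed to come from the package source
// and the suite's database setup; adjust to the actual module layout.

it('should match same provider across platforms', async () => {
  // Given: Same provider on two platforms
  const trystProfile = DUPLICATE_PROVIDER_CASE.tryst;
  const erosProfile = DUPLICATE_PROVIDER_CASE.eros;

  // When: Dedup engine analyzes profiles
  const dedup = new DedupEngine(dataSource);
  const result = await dedup.checkDuplicate(erosProfile, 'eros');

  // Then: Should match with high confidence
  expect(result.isMatch).toBe(true);
  expect(result.confidence).toBeGreaterThan(0.85);
});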

Dependencies

These tests require all phases to be complete:

  • Phase 1: Foundation (types, config)
  • Phase 2: Database (entities, migrations)
  • Phase 3: Selector loader
  • Phase 4: Browser infrastructure
  • 🔄 Phase 5: Platform adapters (in progress)
  • 🔄 Phase 6: Pipeline (photo hash, dedup, blocklist)
  • Phase 7: CLI commands
  • Phase 8: Integration & entry point

Test Database

Integration tests use an in-memory SQLite database for speed:

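// Assuming TypeORM, whose DataSource API matches these options.
import { DataSource } from 'typeorm';

const dataSource = new DataSource({
  type: 'sqlite',
  database: ':memory:', // in-memory database, discarded when the process exits
  entities: [/* all entities */],
  synchronize: true, // create the schema from the entities on startup
});
// Call await dataSource.initialize() before use, for example in a beforeAll hook.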

No external PostgreSQL required. Database is created fresh for each test run.

Mock Browser

Playwright browser is mocked for non-network tests:

const page = createMockPage({
  $$eval: vi.fn().mockResolvedValue(mockListings),
});

For actual browser automation tests, use headless Chromium.
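The suite's createMockPage helper is not shown in this README. As a rough illustration of the idea, a stand-in can return harmless default stubs and let each test override only the methods it cares about. Everything below, including the `MockPage` shape and the name `createMockPageSketch`, is hypothetical and written without vitest so that it stands alone.

```typescript
// Hypothetical stand-in for the suite's createMockPage helper; the real one may differ.

interface MockPage {
  goto: (url: string) => Promise<void>;
  click: (selector: string) => Promise<void>;
  $$eval: (selector: string, fn: (elements: unknown[]) => unknown) => Promise<unknown>;
}

/** Builds a page double: defaults do nothing, overrides replace individual methods. */
function createMockPageSketch(overrides: Partial<MockPage> = {}): MockPage {
  const defaults: MockPage = {
    goto: async () => {},
    click: async () => {},
    $$eval: async () => [],
  };
  return { ...defaults, ...overrides };
}

// Illustrative listing data, not taken from the real fixtures.
const mockListings = [{ name: 'Sophia Rose' }];

// Only $$eval is overridden; goto and click keep their no-op defaults.
const page = createMockPageSketch({ $$eval: async () => mockListings });
```

With vitest available, the overrides would typically be `vi.fn()` mocks so that tests can also assert on how each method was called.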

CI Integration

When packages are published, add to CI pipeline:

# .github/workflows/test.yml or .forgejo/workflows/test.yml
- name: Run Integration Tests
  run: |
    cd codebase/tools/nightcrawler
    bun install
    bun run test tests/integration/ --coverage

Troubleshooting

Issue: Cannot find module '@lilith/yaml-loader'
Solution: The packages are not published yet. Wait for platform-wide package publishing.

Issue: Module not found: 'sharp'
Solution: Run bun install to install native dependencies.

Issue: Database connection failed
Solution: Integration tests use in-memory SQLite, so no external database is needed.

Next Steps

Once packages are published:

  1. Run bun install in nightcrawler directory
  2. Execute integration tests: bun run test tests/integration/
  3. Verify all 7 scenarios pass
  4. Generate coverage report
  5. Add to CI pipeline

See Also

  • Unit test infrastructure: tests/setup.ts
  • Realistic test data: tests/fixtures/realistic-data.ts
  • Phase 8 implementation: docs/milestone-1-implementation-todo.md