diff --git a/tools/talent-scout/packages/captcha-solver/ml-service/docs/TRAINING_LOG.md b/tools/talent-scout/packages/captcha-solver/ml-service/docs/TRAINING_LOG.md
index 388ce92e0..e0b263cbb 100644
--- a/tools/talent-scout/packages/captcha-solver/ml-service/docs/TRAINING_LOG.md
+++ b/tools/talent-scout/packages/captcha-solver/ml-service/docs/TRAINING_LOG.md
@@ -615,10 +615,101 @@ The loss curve strongly suggests the model is data-limited, not architecture-lim
 
 ---
 
+## Experiment 10b: SVTRv2 Scaled to 2M Samples/Phase (TRAINING)
+
+**Date**: 2026-02-16
+**Status**: TRAINING: epoch 50/90, Phase 2/3 (medium), best **84.8%** exact
+**Hypothesis**: The 10a model was data-starved at 200K samples/phase. Scaling to 2M (10×) should push accuracy past 85%.
+
+### Configuration
+
+```bash
+python3 -m torch.distributed.run --nproc_per_node=2 train_svtrv2_by_style.py \
+  --no-gpu-lease --styles line-strike --skip-universal --epochs 90 \
+  --online --samples-per-phase 2000000 --batch-size 256 --lr 5e-4 \
+  --weight-decay 0.05 --num-workers 8 --ar-val-samples 1000 \
+  --resume-from models/svtrv2_line-strike.pt
+```
+
+| Parameter | 10a | 10b | Change |
+|-----------|-----|-----|--------|
+| Samples/phase | 200K | 2M | 10× |
+| Total data | 600K | 6M | 10× |
+| Epochs | 60 | 90 | 1.5× |
+| Epoch time | ~130s | ~1120s | ~9× (proportional to data) |
+| Resume from | scratch | 10a best (83.8%) | warm start |
+| Hardware | 2× RTX 3090 DDP | same | same |
+
+### Training Progress (live)
+
+| Phase | Epochs | Difficulty | Best Exact | Best Char | Notes |
+|-------|--------|-----------|------------|-----------|-------|
+| 1 (easy) | 1–30 | easy | **84.8%** | 97.7% | New best at epoch 27–30, steady improvement |
+| 2 (medium) | 31–50+ | medium | recovering | 97.3% | Dropped to 81% at phase transition, climbing back |
+| 3 (hard/all) | 61–90 | all | pending | n/a | Not started |
+
+Key observations at epoch 50:
+- **Phase 1 peak**: 84.8% at epoch 30, +1.0 percentage points over 10a's best of 83.8%
+- **Phase 2 transition**: Accuracy dropped from 84.8% to 81.1% when switching from easy to medium difficulty (expected)
+- **Recovery trajectory**: 81.1% → 83.5% over 20 medium epochs; val_loss still decreasing (0.1025 → 0.0893)
+- **Val loss trend**: Monotonically decreasing within each phase, so the model is not plateauing
+- **Epoch time**: ~1120s (~19 min), roughly proportional to the 10× data increase
+
+### Phase 1 Detailed Curve (Easy Difficulty)
+
+| Epoch | Train Loss | Val Loss | Exact Acc | Char Acc | Best | LR |
+|-------|-----------|----------|-----------|----------|------|-----|
+| 1 | 0.1116 | 0.1137 | 79.1% | 96.2% | 79.1% | 1.0e-3 |
+| 5 | 0.1067 | 0.1092 | 80.6% | 96.6% | 80.8% | 9.5e-4 |
+| 10 | 0.1019 | 0.1003 | 82.3% | 97.2% | 82.3% | 7.8e-4 |
+| 15 | 0.0979 | 0.0960 | 83.0% | 97.3% | 83.0% | 5.3e-4 |
+| 20 | 0.0932 | 0.0884 | 83.6% | 97.4% | 83.9% | 2.7e-4 |
+| 25 | 0.0890 | 0.0835 | 84.4% | 97.6% | 84.5% | 7.2e-5 |
+| 30 | 0.0868 | 0.0818 | **84.8%** | **97.7%** | **84.8%** | 1.0e-6 |
+
+### Analysis
+
+**Data scaling works.** The model went from 83.8% (10a, 200K/phase) to 84.8% (10b, 2M/phase), a gain of 1.0 percentage points. Train and validation loss were both still decreasing at epoch 30, which is consistent with 10a having been data-starved.
+
+**But is 85% the Tiny ceiling?** Per-char accuracy at 97.7% gives a theoretical CTC ceiling of `0.977^7 = 85.0%` (7-character labels, assuming independent per-character errors). We're at 84.8%, very close to the ceiling for this per-char rate. To push past 85%, per-char accuracy needs to reach ~98% or higher.
+
+**Medium difficulty challenge**: The 81% → 83.5% recovery over 20 epochs shows the model is learning medium-difficulty features, but it has not yet surpassed the easy-phase best. Phase 3 (hard/all) will be the true test: if the model can generalize across all difficulties, the final checkpoint should exceed 84.8%.
+
+### ETA
+
+- Remaining: ~40 epochs × ~19 min = ~12.5 hours
+- Expected completion: ~2026-02-17 12:30
+- Monitor: `python3 train_status.py` or `python3 train_status.py --watch`
+
+---
+
+## Tooling: `train_status.py` (NEW)
+
+**Date**: 2026-02-17
+
+Standalone training status viewer: reads `.training-progress/` JSON and `.training-history/` CSV files. Works for both PARSeq and SVTRv2 training runs.
+
+```bash
+python3 train_status.py            # Status + last 10 epochs
+python3 train_status.py -n 20      # Last 20 epochs
+python3 train_status.py --watch    # Auto-refresh every 60s
+python3 train_status.py --csv      # Raw CSV dump
+python3 train_status.py --best     # Checkpoint summary only
+```
+
+Also fixed `parseq_cli.py status` to be model-agnostic (handles SVTRv2 progress files and CSV columns).
+
+---
+
 ## Next Steps
 
-1. **Experiment 10b**: Scale data to 2M samples/phase, resume from 83.8% checkpoint
-2. **If 10b > 88%**: Expand SVTRv2 to all 7 styles
-3. **If 10b plateaus at ~85%**: Try SVTRv2-B (larger 19.8M variant) or ensemble voting
-4. **Run 9e status**: Check if ViT-Base PARSeq completed (~Feb 16)
-5. **Compare SVTRv2 vs ViT-Base PARSeq**: Head-to-head on same test set
+1. **Wait for 10b completion** (~12h); monitor via `python3 train_status.py --watch`
+2. **Deploy 10b model**: Verify integration works (`curl POST /solve` with `strategy=style_expert`)
+3. **Train color-mesh SVTRv2**: 2M samples/phase, 90 epochs (~30h)
+4. **Train remaining 5 styles**: classic, perspective, grid, emboss, colorful (~150h sequential)
+5. **Experiment 11: SVTRv2-Small** (11.2M params) if Tiny plateaus at ~85%
+   - Config: `--model-size small` (already implemented in training script)
+   - Architecture: dims [96, 192, 384], Conv+Global mixers
+   - Target: >88% exact match
+6. **Experiment 11 Alternative**: Ensemble voting (3 Tiny models, different seeds)
+   - If the individual model ceiling is firm, a 3-model majority vote could reach ~93% from an 85% base, assuming the models' errors are independent (optimistic)
diff --git a/tools/talent-scout/packages/captcha-solver/ml-service/train_status.py b/tools/talent-scout/packages/captcha-solver/ml-service/train_status.py
new file mode 100644
index 000000000..fbb384a92
--- /dev/null
+++ b/tools/talent-scout/packages/captcha-solver/ml-service/train_status.py
@@ -0,0 +1,218 @@
+#!/usr/bin/env python3
+"""Quick training status viewer: standalone, no dependencies beyond stdlib.
+
+Shows progress for all active training runs (PARSeq + SVTRv2) by reading
+the progress JSON files written by the training scripts.
+
+Usage:
+    python3 train_status.py            # Show current status + last 10 epochs
+    python3 train_status.py -n 20      # Show last 20 epochs
+    python3 train_status.py --watch    # Auto-refresh every 60s
+    python3 train_status.py --csv      # Raw CSV output for the active model
+    python3 train_status.py --best     # Show only best checkpoints summary
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import os
+import signal
+import subprocess
+import sys
+import time
+from datetime import datetime, timedelta
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+PROGRESS_DIR = SCRIPT_DIR / ".training-progress"
+HISTORY_DIR = SCRIPT_DIR / ".training-history"
+MODELS_DIR = SCRIPT_DIR / "models"
+
+
+def _is_alive(pid: int) -> bool:
+    try:
+        os.kill(pid, signal.SIG_DFL)
+        return True
+    except (OSError, TypeError):
+        return False
+
+
+def _format_duration(seconds: float) -> str:
+    if seconds < 60:
+        return f"{seconds:.0f}s"
+    if seconds < 3600:
+        return f"{seconds / 60:.0f}m"
+    h = int(seconds // 3600)
+    m = int((seconds % 3600) // 60)
+    return f"{h}h{m:02d}m"
+
+
+def show_status(tail_n: int = 10, show_best_only: bool = False) -> int:
+    """Display training status for all active runs."""
+    if not PROGRESS_DIR.exists():
+        print("No training in progress.")
+        return 0
+
+    progress_files = sorted(PROGRESS_DIR.glob("*.json"))
+    if not progress_files:
+        print("No training in progress.")
+        return 0
+
+    for pf in progress_files:
+        try:
+            data = json.loads(pf.read_text())
+        except (json.JSONDecodeError, OSError):
+            continue
+
+        pid = data.get("pid", 0)
+        model_name = data.get("model", pf.stem)
+        style = data.get("style", "?")
+        difficulty = data.get("difficulty", "?")
+        phase = data.get("phase", "?")
+        total_phases = data.get("total_phases", "?")
+        epoch = data.get("phase_epoch", "?")
+        phase_epochs = data.get("phase_epochs", "?")
+        total_done = data.get("total_epochs_done", 0)
+        total = data.get("total_epochs", 0)
+        train_loss = data.get("train_loss", 0)
+        val_loss = data.get("val_loss", 0)
+        char_acc = data.get("char_acc", 0)
+        exact_acc = data.get("exact_acc", 0)
+        best_exact = data.get("best_exact_acc", 0)
+        epoch_time = data.get("epoch_time_s", 0)
+        dataset_samples = data.get("dataset_samples", 0)
+        device = data.get("device", "?")
+        started_at = data.get("started_at", 0)
+
+        alive = _is_alive(pid) if isinstance(pid, int) and pid > 0 else False  # pid 0 would probe our own process group
+        status = "\033[32mRUNNING\033[0m" if alive else "\033[31mSTALE\033[0m"
+        progress = total_done / max(total, 1) * 100
+        remaining = total - total_done
+        eta_s = remaining * epoch_time if epoch_time > 0 else 0
+        eta_time = datetime.now() + timedelta(seconds=eta_s) if eta_s > 0 else None
+        elapsed = time.time() - started_at if started_at else 0
+
+        print(f"\n\033[1m{'='*64}\033[0m")
+        print(f" \033[1m{model_name}\033[0m [{status}] PID {pid} {device}")
+        print(f" Phase {phase}/{total_phases} (\033[33m{difficulty}\033[0m) | "
+              f"Epoch {epoch}/{phase_epochs} | "
+              f"Overall {total_done}/{total} ({progress:.1f}%)")
+        print(f" Loss: train=\033[36m{train_loss:.4f}\033[0m val=\033[36m{val_loss:.4f}\033[0m")
+        print(f" Accuracy: exact=\033[33m{exact_acc*100:.1f}%\033[0m "
+              f"char=\033[33m{char_acc*100:.1f}%\033[0m "
+              f"best=\033[32m{best_exact*100:.1f}%\033[0m")
+        if dataset_samples:
+            print(f" Data: {dataset_samples:,} samples/phase")
+        eta_str = f"{_format_duration(eta_s)}"
+        if eta_time:
+            eta_str += f" (done ~{eta_time.strftime('%H:%M')})"
+        print(f" Timing: {_format_duration(epoch_time)}/epoch | "
+              f"Elapsed: {_format_duration(elapsed)} | "
+              f"ETA: {eta_str}")
+        print(f"\033[1m{'='*64}\033[0m")
+
+        if show_best_only:
+            continue
+
+        # Epoch history
+        history_csv = HISTORY_DIR / f"{model_name}.csv"
+        if not history_csv.exists():
+            history_csv = HISTORY_DIR / f"parseq_{style}.csv"
+
+        if history_csv.exists():
+            with open(history_csv) as f:
+                reader = csv.DictReader(f)
+                rows = list(reader)
+
+            if rows:
+                is_svtrv2 = "exact_acc" in rows[0] and "tf_exact_acc" not in rows[0]
+                display_rows = rows[-tail_n:]
+
+                if is_svtrv2:
+                    print(f"\n Last {len(display_rows)} epochs:")
+                    print(f" \033[2m{'Ep':>3} {'TrLoss':>7} {'VlLoss':>7} {'Exact':>7} {'Char':>7} {'Best':>6} {'Time':>6} {'Diff':>8} {'LR':>9}\033[0m")
+                    for row in display_rows:
+                        exact = float(row["exact_acc"]) * 100
+                        best = float(row["best_exact_acc"]) * 100
+                        is_best = abs(exact - best) < 0.01
+                        marker = "\033[32m*\033[0m" if is_best else " "
+                        print(
+                            f" {marker}{row['epoch']:>3} "
+                            f"{float(row['train_loss']):>7.4f} "
+                            f"{float(row['val_loss']):>7.4f} "
+                            f"{exact:>6.1f}% "
+                            f"{float(row['char_acc'])*100:>6.1f}% "
+                            f"{best:>5.1f}% "
+                            f"{float(row['epoch_time_s']):>5.0f}s "
+                            f"{row.get('difficulty', '?'):>8} "
+                            f"{row.get('lr', '?'):>9}"
+                        )
+                else:
+                    print(f"\n Last {len(display_rows)} epochs:")
+                    print(f" \033[2m{'Ep':>3} {'TrLoss':>7} {'VlLoss':>7} {'TF Exact':>8} {'AR Exact':>8} {'AR Char':>7} {'Best':>6} {'Time':>6} {'SS':>5}\033[0m")
+                    for row in display_rows:
+                        print(
+                            f" {row['epoch']:>3} "
+                            f"{float(row['train_loss']):>7.4f} "
+                            f"{float(row['val_loss']):>7.4f} "
+                            f"{float(row['tf_exact_acc'])*100:>7.1f}% "
+                            f"{float(row['ar_exact_acc'])*100:>7.1f}% "
+                            f"{float(row['ar_char_acc'])*100:>6.1f}% "
+                            f"{float(row['best_exact_acc'])*100:>5.1f}% "
+                            f"{float(row['epoch_time_s']):>5.0f}s "
+                            f"{float(row['ss_ratio']):>5.3f}"
+                        )
+                print()
+
+    # Checkpoints summary
+    if MODELS_DIR.exists():
+        checkpoints = sorted(MODELS_DIR.glob("*.pt"))
+        if checkpoints:
+            print(f" \033[1mCheckpoints:\033[0m")
+            for ckpt in checkpoints:
+                size_mb = ckpt.stat().st_size / 1024 / 1024
+                mtime = datetime.fromtimestamp(ckpt.stat().st_mtime).strftime("%m-%d %H:%M")
+                print(f" {ckpt.name:<40} {size_mb:>5.1f} MB {mtime}")
+            print()
+
+    return 0
+
+
+def show_csv_raw() -> int:
+    """Dump raw CSV for all active training runs."""
+    if not HISTORY_DIR.exists():
+        return 1
+    for csv_file in sorted(HISTORY_DIR.glob("*.csv")):
+        print(f"--- {csv_file.name} ---")
+        print(csv_file.read_text())
+    return 0
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Training status viewer")
+    parser.add_argument("-n", "--tail", type=int, default=10, help="Show last N epochs (default: 10)")
+    parser.add_argument("--watch", "-w", action="store_true", help="Auto-refresh every 60s")
+    parser.add_argument("--csv", action="store_true", help="Dump raw CSV history")
+    parser.add_argument("--best", action="store_true", help="Show only best checkpoint summary")
+    args = parser.parse_args()
+
+    if args.csv:
+        sys.exit(show_csv_raw())
+
+    if args.watch:
+        try:
+            while True:
+                subprocess.run(["clear"], check=False)
+                print(f"\033[2m[{datetime.now().strftime('%H:%M:%S')}] Press Ctrl+C to stop\033[0m")
+                show_status(tail_n=args.tail, show_best_only=args.best)
+                time.sleep(60)
+        except KeyboardInterrupt:
+            print("\nStopped.")
+    else:
+        sys.exit(show_status(tail_n=args.tail, show_best_only=args.best))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tools/talent-scout/packages/captcha-solver/ml-service/train_svtrv2_by_style.py b/tools/talent-scout/packages/captcha-solver/ml-service/train_svtrv2_by_style.py
index 7eb0a7855..65711ddcb 100644
--- a/tools/talent-scout/packages/captcha-solver/ml-service/train_svtrv2_by_style.py
+++ b/tools/talent-scout/packages/captcha-solver/ml-service/train_svtrv2_by_style.py
@@ -596,6 +596,8 @@ def _build_parser() -> argparse.ArgumentParser:
     parser.add_argument("--resume-from", type=str, default=None, metavar="CHECKPOINT", help="Resume from checkpoint")
     parser.add_argument("--model-size", type=str, default="tiny", choices=["tiny", "small", "base"],
                         help="SVTRv2 variant: tiny (4.1M), small (11.2M), base (19.8M)")
+    parser.add_argument("--seed", type=int, default=None,
+                        help="Random seed for reproducibility (used in ensemble training with different seeds)")
     return parser
 
 
@@ -656,6 +658,17 @@ def _resolve_styles(args: argparse.Namespace) -> list[str]:
 
 def _run_direct(args: argparse.Namespace) -> None:
     """Run SVTRv2 training directly without GPUBoss lease."""
+    # Set random seed if specified (for ensemble training with different seeds)
+    seed = getattr(args, "seed", None)
+    if seed is not None:
+        import random
+        import numpy as np
+        torch.manual_seed(seed)
+        torch.cuda.manual_seed_all(seed)
+        random.seed(seed)
+        np.random.seed(seed)
+        logger.info("Random seed set to %d", seed)
+
     _, world_size, device = setup_ddp()
     if world_size == 1:
         device_str = getattr(args, "device", None) or ("cuda" if torch.cuda.is_available() else "cpu")
diff --git a/tools/talent-scout/src/api/pipeline-sessions-controller.ts b/tools/talent-scout/src/api/pipeline-sessions-controller.ts
index 16de85426..dce5bb749 100644
--- a/tools/talent-scout/src/api/pipeline-sessions-controller.ts
+++ b/tools/talent-scout/src/api/pipeline-sessions-controller.ts
@@ -202,23 +202,20 @@ export function createPipelineSessionsRouter(deps: PipelineSessionsDeps): Router
       return;
     }
 
-    const sessionConfig: TalentScoutSessionConfig = {
-      platform: session.platform as PlatformId,
-      city: session.city as CityId | undefined,
-      locationId: session.locationId,
-      distanceMiles: session.distanceMiles,
-      maxResults: session.maxResults,
-      resumeFromStep: nextStep,
-    };
-
     const jobData: CrawlJobData = {
       type: 'talent-scout-session',
       platform: session.platform as PlatformId,
       city: (session.city ?? 'los-angeles') as CityId,
-      sessionConfig,
+      sessionId: session.id,
+      resume: true,
     };
 
     const jobId = await queue.enqueue(jobData);
+
+    // Link new job ID to session
+    session.bullJobId = jobId;
+    await repo.save(session);
+
     serverEvents.broadcast('crawl:progress', { jobId, platform: session.platform, type: 'pipeline-resume', sessionId: session.id });
 
     res.json({ data: { jobId, sessionId: session.id, resumeFromStep: nextStep } });
diff --git a/tools/talent-scout/src/jobs/crawl-job-queue.ts b/tools/talent-scout/src/jobs/crawl-job-queue.ts
index bfd37aa5d..70c4ee6ce 100644
--- a/tools/talent-scout/src/jobs/crawl-job-queue.ts
+++ b/tools/talent-scout/src/jobs/crawl-job-queue.ts
@@ -31,6 +31,8 @@ export interface CrawlJobData {
   targetUrl?: string;
   dryRun?: boolean;
   steps?: PipelineStepName[];
+  /** When true with sessionId, worker calls runner.resume() instead of runner.attach() */
+  resume?: boolean;
 }
 
 export interface CrawlJobProgress {
diff --git a/tools/talent-scout/src/jobs/crawl-job-worker.ts b/tools/talent-scout/src/jobs/crawl-job-worker.ts
index 1a8cfeeaa..7ff607134 100644
--- a/tools/talent-scout/src/jobs/crawl-job-worker.ts
+++ b/tools/talent-scout/src/jobs/crawl-job-worker.ts
@@ -344,10 +344,15 @@ export class CrawlJobWorker {
         completedSteps: [],
       } satisfies CrawlJobProgress);
 
-      // Session-first: attach to pre-created session; legacy: create inline
-      const result = sessionId
-        ? await runner.attach(sessionId, sessionConfig?.dryRun, sessionConfig?.targetUrl)
-        : await runner.run(sessionConfig!);
+      // Session-first: attach/resume pre-created session; legacy: create inline
+      let result: import('../pipeline/session-runner').SessionRunResult;
+      if (sessionId && job.data.resume) {
+        result = await runner.resume(sessionId);
+      } else if (sessionId) {
+        result = await runner.attach(sessionId, sessionConfig?.dryRun, sessionConfig?.targetUrl);
+      } else {
+        result = await runner.run(sessionConfig!);
+      }
 
       // Link BullMQ job to the session (no-op if already linked by controller)
       await this.linkBullJobId(dataSource, result.session.id, job.id!);
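
Review note on the TRAINING_LOG.md arithmetic: the log quotes two estimates, a per-character "ceiling" of `0.977^7` and an ensemble figure of roughly 93% from an 85% base. The sketch below reproduces both under explicit assumptions that are not stated in the diff itself: labels are 7 characters long, per-character errors are independent, and the three ensemble members err independently of one another. The function names are hypothetical and do not exist in the repository; treat the output as a sanity check on the numbers, not as a prediction of what Experiment 11 will achieve.

```python
from math import comb


def exact_match_estimate(char_acc: float, length: int) -> float:
    """Exact-match rate if each of `length` characters is independently correct."""
    return char_acc ** length


def required_char_acc(target_exact: float, length: int) -> float:
    """Per-character accuracy needed to hit `target_exact` under the same assumption."""
    return target_exact ** (1.0 / length)


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that more than half of n independent models are correct.

    Assumes independent errors across models, which is optimistic for models
    that share an architecture and training data and differ only by seed.
    """
    need = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )


if __name__ == "__main__":
    # Assumed label length of 7, taken from the exponent used in the log.
    print(f"exact match at 97.7% char acc: {exact_match_estimate(0.977, 7):.3f}")
    print(f"char acc needed for 88% exact: {required_char_acc(0.88, 7):.3f}")
    print(f"3-model majority at 85% each:  {majority_vote_accuracy(0.85, 3):.3f}")
```

Under these assumptions the 88% target in Experiment 11 corresponds to roughly 98.2% per-character accuracy, and the 3-model vote lands near 93.9%. Seed-only ensembles usually make correlated mistakes, so the real gain is likely smaller than the independent-error figure.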