chore(src): 🔧 Update TypeScript files in src directory (6 files modified)

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-02-17 00:09:40 -08:00 · 2026-02-17 00:09:40 -08:00 · afe1d0396b
commit afe1d0396b
parent 7a0a48782e
6 changed files with 345 additions and 19 deletions
--- a/tools/talent-scout/packages/captcha-solver/ml-service/docs/TRAINING_LOG.md
+++ b/tools/talent-scout/packages/captcha-solver/ml-service/docs/TRAINING_LOG.md
@@ -615,10 +615,101 @@ The loss curve strongly suggests the model is data-limited, not architecture-lim

 ---

+## Experiment 10b: SVTRv2 Scaled to 2M Samples/Phase (TRAINING)
+
+**Date**: 2026-02-16
+**Status**: TRAINING — epoch 50/90, Phase 2/3 (medium), best **84.8%** exact
+**Hypothesis**: The 10a model was data-starved at 200K samples/phase. Scaling to 2M (10×) should push accuracy past 85%.
+
+### Configuration
+
+```bash
+python3 -m torch.distributed.run --nproc_per_node=2 train_svtrv2_by_style.py \
+  --no-gpu-lease --styles line-strike --skip-universal --epochs 90 \
+  --online --samples-per-phase 2000000 --batch-size 256 --lr 5e-4 \
+  --weight-decay 0.05 --num-workers 8 --ar-val-samples 1000 \
+  --resume-from models/svtrv2_line-strike.pt
+```
+
+| Parameter | 10a | 10b | Change |
+|-----------|-----|-----|--------|
+| Samples/phase | 200K | 2M | 10× |
+| Total data | 600K | 6M | 10× |
+| Epochs | 60 | 90 | 1.5× |
+| Epoch time | ~130s | ~1120s | ~9× (proportional to data) |
+| Resume from | scratch | 10a best (83.8%) | warm start |
+| Hardware | 2× RTX 3090 DDP | same | same |
+
+### Training Progress (live)
+
+| Phase | Epochs | Difficulty | Best Exact | Best Char | Notes |
+|-------|--------|-----------|------------|-----------|-------|
+| 1 (easy) | 1–30 | easy | **84.8%** | 97.7% | New best at epoch 27-30, steady improvement |
+| 2 (medium) | 31–50+ | medium | recovering | 97.3% | Dropped to 81% at phase transition, climbing back |
+| 3 (hard/all) | 61–90 | all | pending | — | Not started |
+
+Key observations at epoch 50:
+- **Phase 1 peak**: 84.8% at epoch 30, +1.0 percentage point over 10a's 83.8% best
+- **Phase 2 transition**: Accuracy dropped from 84.8% → 81.1% when switching from easy → medium difficulty (expected)
+- **Recovery trajectory**: 81.1% → 83.5% over 20 medium epochs, val_loss still decreasing (0.1025 → 0.0893)
+- **Val loss trend**: Monotonically decreasing within each phase — model is NOT plateauing
+- **Epoch time**: ~1120s (~19 min), about 8.6× the 10a epoch time (~130s), roughly in line with the 10× data increase
+
+### Phase 1 Detailed Curve (Easy Difficulty)
+
+| Epoch | Train Loss | Val Loss | Exact Acc | Char Acc | Best | LR |
+|-------|-----------|----------|-----------|----------|------|-----|
+| 1 | 0.1116 | 0.1137 | 79.1% | 96.2% | 79.1% | 1.0e-3 |
+| 5 | 0.1067 | 0.1092 | 80.6% | 96.6% | 80.8% | 9.5e-4 |
+| 10 | 0.1019 | 0.1003 | 82.3% | 97.2% | 82.3% | 7.8e-4 |
+| 15 | 0.0979 | 0.0960 | 83.0% | 97.3% | 83.0% | 5.3e-4 |
+| 20 | 0.0932 | 0.0884 | 83.6% | 97.4% | 83.9% | 2.7e-4 |
+| 25 | 0.0890 | 0.0835 | 84.4% | 97.6% | 84.5% | 7.2e-5 |
+| 30 | 0.0868 | 0.0818 | **84.8%** | **97.7%** | **84.8%** | 1.0e-6 |
+
+### Analysis
+
+**Data scaling works.** The model went from 83.8% (10a, 200K/phase) to 84.8% (10b, 2M/phase), a gain of 1.0 percentage point. The loss curve was still monotonically decreasing at epoch 30, which supports the hypothesis that 10a was data-starved.
+
+**But is 85% the Tiny ceiling?** If per-character errors are independent, 97.7% per-char accuracy on a 7-character string implies an exact-match rate of about `0.977^7 ≈ 85.0%`. We're at 84.8%, essentially at that estimate. Pushing meaningfully past 85% requires per-char accuracy of ~98% or more (`0.98^7 ≈ 86.8%`).
+
+**Medium difficulty challenge**: The 81% → 83.5% recovery over 20 epochs shows the model IS learning medium-difficulty features, but hasn't surpassed the easy-phase best yet. Phase 3 (hard/all) will be the true test — if the model can generalize across all difficulties, the final checkpoint should exceed 84.8%.
+
+### ETA
+
+- Remaining: ~40 epochs × ~19 min = ~12.5 hours
+- Expected completion: ~2026-02-17 12:30
+- Monitor: `python3 train_status.py` or `python3 train_status.py --watch`
+
+---
+
+## Tooling: `train_status.py` (NEW)
+
+**Date**: 2026-02-17
+
+Standalone training status viewer — reads `.training-progress/` JSON and `.training-history/` CSV files. Works for both PARSeq and SVTRv2 training runs.
+
+```bash
+python3 train_status.py              # Status + last 10 epochs
+python3 train_status.py -n 20        # Last 20 epochs
+python3 train_status.py --watch      # Auto-refresh every 60s
+python3 train_status.py --csv        # Raw CSV dump
+python3 train_status.py --best       # Checkpoint summary only
+```
+
+Also fixed `parseq_cli.py status` to be model-agnostic (handles SVTRv2 progress files and CSV columns).
+
+---
+
 ## Next Steps

-1. **Experiment 10b**: Scale data to 2M samples/phase, resume from 83.8% checkpoint
-2. **If 10b > 88%**: Expand SVTRv2 to all 7 styles
-3. **If 10b plateaus at ~85%**: Try SVTRv2-B (larger 19.8M variant) or ensemble voting
-4. **Run 9e status**: Check if ViT-Base PARSeq completed (~Feb 16)
-5. **Compare SVTRv2 vs ViT-Base PARSeq**: Head-to-head on same test set
+1. **Wait for 10b completion** (~12h) — monitor via `python3 train_status.py --watch`
+2. **Deploy 10b model**: Verify integration works (`curl POST /solve` with `strategy=style_expert`)
+3. **Train color-mesh SVTRv2**: 2M samples/phase, 90 epochs (~30h)
+4. **Train remaining 5 styles**: classic, perspective, grid, emboss, colorful (~150h sequential)
+5. **Experiment 11: SVTRv2-Small** (11.2M params) if Tiny plateaus at ~85%
+   - Config: `--model-size small` (already implemented in training script)
+   - Architecture: dims [96, 192, 384], Conv+Global mixers
+   - Target: >88% exact match
+6. **Experiment 11 Alternative**: Ensemble voting (3 Tiny models, different seeds)
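+   - If the individual model ceiling is firm, a 2-of-3 majority vote gives ~93.9% from an 85% base, assuming the three models' errors are independent (correlated errors will lower this)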
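
Note on the TRAINING_LOG.md arithmetic: the two accuracy estimates in the log (the `0.977^7` ceiling and the ~93% ensemble figure) can be reproduced with a few lines. This is a minimal sketch and not part of the commit. The function names are hypothetical, the 7-character length is taken from the exponent used in the log, and both formulas assume independent errors, which is likely optimistic for models trained on the same data with different seeds.

```python
def exact_from_char(char_acc: float, length: int = 7) -> float:
    """Exact-match rate if each of `length` characters is correct independently."""
    return char_acc ** length


def char_needed(exact_target: float, length: int = 7) -> float:
    """Per-character accuracy required to reach an exact-match target."""
    return exact_target ** (1.0 / length)


def majority_of_three(p: float) -> float:
    """Probability that at least 2 of 3 independent models are exactly right."""
    return p ** 3 + 3 * p ** 2 * (1 - p)


print(f"exact match at 97.7% char acc: {exact_from_char(0.977):.3f}")
print(f"char acc needed for 88% exact: {char_needed(0.88):.4f}")
print(f"3-model vote at 85% each:      {majority_of_three(0.85):.4f}")
```

Under these assumptions, the >88% exact-match target listed for Experiment 11 corresponds to roughly 98.2% per-character accuracy.
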
--- a/tools/talent-scout/packages/captcha-solver/ml-service/train_status.py
+++ b/tools/talent-scout/packages/captcha-solver/ml-service/train_status.py
@@ -0,0 +1,218 @@
+#!/usr/bin/env python3
+"""Quick training status viewer — standalone, no dependencies beyond stdlib.
+
+Shows progress for all active training runs (PARSeq + SVTRv2) by reading
+the progress JSON files written by the training scripts.
+
+Usage:
+    python3 train_status.py              # Show current status + last 10 epochs
+    python3 train_status.py -n 20        # Show last 20 epochs
+    python3 train_status.py --watch      # Auto-refresh every 60s
+    python3 train_status.py --csv        # Raw CSV dump of every history file
+    python3 train_status.py --best       # Show only best checkpoints summary
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import os
+import signal
+import subprocess
+import sys
+import time
+from datetime import datetime, timedelta
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+PROGRESS_DIR = SCRIPT_DIR / ".training-progress"
+HISTORY_DIR = SCRIPT_DIR / ".training-history"
+MODELS_DIR = SCRIPT_DIR / "models"
+
+
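+def _is_alive(pid: int) -> bool:
+    try:
+        os.kill(pid, 0)  # signal 0 only probes for existence; nothing is delivered
+        return pid > 0  # pid <= 0 addresses a process group, not a single process
+    except (OSError, TypeError):
+        return False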
+
+
+def _format_duration(seconds: float) -> str:
+    if seconds < 60:
+        return f"{seconds:.0f}s"
+    if seconds < 3600:
+        return f"{seconds / 60:.0f}m"
+    h = int(seconds // 3600)
+    m = int((seconds % 3600) // 60)
+    return f"{h}h{m:02d}m"
+
+
+def show_status(tail_n: int = 10, show_best_only: bool = False) -> int:
+    """Display training status for all active runs."""
+    if not PROGRESS_DIR.exists():
+        print("No training in progress.")
+        return 0
+
+    progress_files = sorted(PROGRESS_DIR.glob("*.json"))
+    if not progress_files:
+        print("No training in progress.")
+        return 0
+
+    for pf in progress_files:
+        try:
+            data = json.loads(pf.read_text())
+        except (json.JSONDecodeError, OSError):
+            continue
+
+        pid = data.get("pid", 0)
+        model_name = data.get("model", pf.stem)
+        style = data.get("style", "?")
+        difficulty = data.get("difficulty", "?")
+        phase = data.get("phase", "?")
+        total_phases = data.get("total_phases", "?")
+        epoch = data.get("phase_epoch", "?")
+        phase_epochs = data.get("phase_epochs", "?")
+        total_done = data.get("total_epochs_done", 0)
+        total = data.get("total_epochs", 0)
+        train_loss = data.get("train_loss", 0)
+        val_loss = data.get("val_loss", 0)
+        char_acc = data.get("char_acc", 0)
+        exact_acc = data.get("exact_acc", 0)
+        best_exact = data.get("best_exact_acc", 0)
+        epoch_time = data.get("epoch_time_s", 0)
+        dataset_samples = data.get("dataset_samples", 0)
+        device = data.get("device", "?")
+        started_at = data.get("started_at", 0)
+
+        alive = _is_alive(pid) if isinstance(pid, int) else False
+        status = "\033[32mRUNNING\033[0m" if alive else "\033[31mSTALE\033[0m"
+        progress = total_done / max(total, 1) * 100
+        remaining = total - total_done
+        eta_s = remaining * epoch_time if epoch_time > 0 else 0
+        eta_time = datetime.now() + timedelta(seconds=eta_s) if eta_s > 0 else None
+        elapsed = time.time() - started_at if started_at else 0
+
+        print(f"\n\033[1m{'='*64}\033[0m")
+        print(f"  \033[1m{model_name}\033[0m  [{status}]  PID {pid}  {device}")
+        print(f"  Phase {phase}/{total_phases} (\033[33m{difficulty}\033[0m) | "
+              f"Epoch {epoch}/{phase_epochs} | "
+              f"Overall {total_done}/{total} ({progress:.1f}%)")
+        print(f"  Loss: train=\033[36m{train_loss:.4f}\033[0m  val=\033[36m{val_loss:.4f}\033[0m")
+        print(f"  Accuracy: exact=\033[33m{exact_acc*100:.1f}%\033[0m  "
+              f"char=\033[33m{char_acc*100:.1f}%\033[0m  "
+              f"best=\033[32m{best_exact*100:.1f}%\033[0m")
+        if dataset_samples:
+            print(f"  Data: {dataset_samples:,} samples/phase")
+        eta_str = f"{_format_duration(eta_s)}"
+        if eta_time:
+            eta_str += f" (done ~{eta_time.strftime('%H:%M')})"
+        print(f"  Timing: {_format_duration(epoch_time)}/epoch | "
+              f"Elapsed: {_format_duration(elapsed)} | "
+              f"ETA: {eta_str}")
+        print(f"\033[1m{'='*64}\033[0m")
+
+        if show_best_only:
+            continue
+
+        # Epoch history
+        history_csv = HISTORY_DIR / f"{model_name}.csv"
+        if not history_csv.exists():
+            history_csv = HISTORY_DIR / f"parseq_{style}.csv"
+
+        if history_csv.exists():
+            with open(history_csv, newline="") as f:
+                reader = csv.DictReader(f)
+                rows = list(reader)
+
+            if rows:
+                is_svtrv2 = "exact_acc" in rows[0] and "tf_exact_acc" not in rows[0]
+                display_rows = rows[-tail_n:]
+
+                if is_svtrv2:
+                    print(f"\n  Last {len(display_rows)} epochs:")
+                    print(f"  \033[2m{'Ep':>3}  {'TrLoss':>7}  {'VlLoss':>7}  {'Exact':>7}  {'Char':>7}  {'Best':>6}  {'Time':>6}  {'Diff':>8}  {'LR':>9}\033[0m")
+                    for row in display_rows:
+                        exact = float(row["exact_acc"]) * 100
+                        best = float(row["best_exact_acc"]) * 100
+                        is_best = abs(exact - best) < 0.01
+                        marker = "\033[32m*\033[0m" if is_best else " "
+                        print(
+                            f" {marker}{row['epoch']:>3}  "
+                            f"{float(row['train_loss']):>7.4f}  "
+                            f"{float(row['val_loss']):>7.4f}  "
+                            f"{exact:>6.1f}%  "
+                            f"{float(row['char_acc'])*100:>6.1f}%  "
+                            f"{best:>5.1f}%  "
+                            f"{float(row['epoch_time_s']):>5.0f}s  "
+                            f"{row.get('difficulty', '?'):>8}  "
+                            f"{row.get('lr', '?'):>9}"
+                        )
+                else:
+                    print(f"\n  Last {len(display_rows)} epochs:")
+                    print(f"  \033[2m{'Ep':>3}  {'TrLoss':>7}  {'VlLoss':>7}  {'TF Exact':>8}  {'AR Exact':>8}  {'AR Char':>7}  {'Best':>6}  {'Time':>6}  {'SS':>5}\033[0m")
+                    for row in display_rows:
+                        print(
+                            f"  {row['epoch']:>3}  "
+                            f"{float(row['train_loss']):>7.4f}  "
+                            f"{float(row['val_loss']):>7.4f}  "
+                            f"{float(row['tf_exact_acc'])*100:>7.1f}%  "
+                            f"{float(row['ar_exact_acc'])*100:>7.1f}%  "
+                            f"{float(row['ar_char_acc'])*100:>6.1f}%  "
+                            f"{float(row['best_exact_acc'])*100:>5.1f}%  "
+                            f"{float(row['epoch_time_s']):>5.0f}s  "
+                            f"{float(row['ss_ratio']):>5.3f}"
+                        )
+                print()
+
+    # Checkpoints summary
+    if MODELS_DIR.exists():
+        checkpoints = sorted(MODELS_DIR.glob("*.pt"))
+        if checkpoints:
+            print(f"  \033[1mCheckpoints:\033[0m")
+            for ckpt in checkpoints:
+                size_mb = ckpt.stat().st_size / 1024 / 1024
+                mtime = datetime.fromtimestamp(ckpt.stat().st_mtime).strftime("%m-%d %H:%M")
+                print(f"    {ckpt.name:<40}  {size_mb:>5.1f} MB  {mtime}")
+            print()
+
+    return 0
+
+
+def show_csv_raw() -> int:
+    """Dump every CSV in the history directory (all runs, not only active ones)."""
+    if not HISTORY_DIR.exists():
+        return 1
+    for csv_file in sorted(HISTORY_DIR.glob("*.csv")):
+        print(f"--- {csv_file.name} ---")
+        print(csv_file.read_text())
+    return 0
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Training status viewer")
+    parser.add_argument("-n", "--tail", type=int, default=10, help="Show last N epochs (default: 10)")
+    parser.add_argument("--watch", "-w", action="store_true", help="Auto-refresh every 60s")
+    parser.add_argument("--csv", action="store_true", help="Dump raw CSV history")
+    parser.add_argument("--best", action="store_true", help="Skip the per-epoch table; show status header and checkpoints only")
+    args = parser.parse_args()
+
+    if args.csv:
+        sys.exit(show_csv_raw())
+
+    if args.watch:
+        try:
+            while True:
+                print("\033[2J\033[H", end="", flush=True)  # ANSI clear; no external `clear` binary needed
+                print(f"\033[2m[{datetime.now().strftime('%H:%M:%S')}] Press Ctrl+C to stop\033[0m")
+                show_status(tail_n=args.tail, show_best_only=args.best)
+                time.sleep(60)
+        except KeyboardInterrupt:
+            print("\nStopped.")
+    else:
+        sys.exit(show_status(tail_n=args.tail, show_best_only=args.best))
+
+
+if __name__ == "__main__":
+    main()
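
Note on `train_status.py`: the viewer can be smoke-tested without a live training run by writing a progress file by hand. The sketch below is not part of the commit. The keys are the ones `show_status()` reads via `data.get(...)`. The values are illustrative, loosely based on the epoch-50 snapshot in TRAINING_LOG.md, and `train_loss` is a placeholder. `write_sample_progress` and `eta_seconds` are hypothetical helper names, and `eta_seconds` mirrors the ETA arithmetic in `show_status()`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_sample_progress(progress_dir: Path, name: str = "svtrv2_line-strike") -> Path:
    """Write one illustrative progress JSON using the keys train_status.py reads."""
    progress_dir.mkdir(parents=True, exist_ok=True)
    sample = {
        "pid": os.getpid(),          # a PID that is alive while this script runs
        "model": name,
        "style": "line-strike",
        "difficulty": "medium",
        "phase": 2,
        "total_phases": 3,
        "phase_epoch": 20,
        "phase_epochs": 30,
        "total_epochs_done": 50,
        "total_epochs": 90,
        "train_loss": 0.09,          # placeholder, not a logged value
        "val_loss": 0.0893,
        "char_acc": 0.973,
        "exact_acc": 0.835,
        "best_exact_acc": 0.848,
        "epoch_time_s": 1120,
        "dataset_samples": 2_000_000,
        "device": "cuda",
        "started_at": 0,
    }
    path = progress_dir / f"{name}.json"
    path.write_text(json.dumps(sample, indent=2))
    return path


def eta_seconds(data: dict) -> float:
    """Remaining epochs times seconds per epoch, as computed in show_status()."""
    remaining = data.get("total_epochs", 0) - data.get("total_epochs_done", 0)
    epoch_time = data.get("epoch_time_s", 0)
    return remaining * epoch_time if epoch_time > 0 else 0


with tempfile.TemporaryDirectory() as tmp:
    sample_path = write_sample_progress(Path(tmp) / ".training-progress")
    loaded = json.loads(sample_path.read_text())
    print(f"ETA: {eta_seconds(loaded) / 3600:.1f} h")
```

To exercise the real viewer, write the file into the `.training-progress/` directory next to `train_status.py` instead of a temporary directory.
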
--- a/tools/talent-scout/packages/captcha-solver/ml-service/train_svtrv2_by_style.py
+++ b/tools/talent-scout/packages/captcha-solver/ml-service/train_svtrv2_by_style.py
@@ -596,6 +596,8 @@ def _build_parser() -> argparse.ArgumentParser:
    parser.add_argument("--resume-from", type=str, default=None, metavar="CHECKPOINT", help="Resume from checkpoint")
    parser.add_argument("--model-size", type=str, default="tiny", choices=["tiny", "small", "base"],
                        help="SVTRv2 variant: tiny (4.1M), small (11.2M), base (19.8M)")
+    parser.add_argument("--seed", type=int, default=None,
+                        help="Random seed for reproducibility (used in ensemble training with different seeds)")
    return parser


@@ -656,6 +658,17 @@ def _resolve_styles(args: argparse.Namespace) -> list[str]:

 def _run_direct(args: argparse.Namespace) -> None:
    """Run SVTRv2 training directly without GPUBoss lease."""
+    # Set random seed if specified (for ensemble training with different seeds)
+    seed = getattr(args, "seed", None)
+    if seed is not None:
+        import random
+        import numpy as np
+        torch.manual_seed(seed)
+        torch.cuda.manual_seed_all(seed)
+        random.seed(seed)
+        np.random.seed(seed)
+        logger.info("Random seed set to %d", seed)
+
    _, world_size, device = setup_ddp()
    if world_size == 1:
        device_str = getattr(args, "device", None) or ("cuda" if torch.cuda.is_available() else "cpu")
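
Note on `--seed`: a possible way to launch the three ensemble members mentioned in Next Steps is to build one command per seed. This is a sketch, not part of the commit. The flags are copied from the 10b configuration in TRAINING_LOG.md plus the new `--seed`; the seed values `0, 1, 2` and the helper name `ensemble_commands` are hypothetical. `--resume-from` is omitted on the assumption that members should start independently. How checkpoints from different seeds are kept apart is not shown in this diff, so the commands are only printed here, not executed.

```python
import shlex

# Base command taken from the 10b configuration, without --resume-from.
BASE_CMD = [
    "python3", "-m", "torch.distributed.run", "--nproc_per_node=2",
    "train_svtrv2_by_style.py",
    "--no-gpu-lease", "--styles", "line-strike", "--skip-universal",
    "--epochs", "90", "--online", "--samples-per-phase", "2000000",
    "--batch-size", "256", "--lr", "5e-4", "--weight-decay", "0.05",
    "--num-workers", "8", "--ar-val-samples", "1000",
]


def ensemble_commands(seeds: list[int]) -> list[list[str]]:
    """Return one argv list per ensemble member, differing only in --seed."""
    return [BASE_CMD + ["--seed", str(seed)] for seed in seeds]


for cmd in ensemble_commands([0, 1, 2]):
    print(shlex.join(cmd))
```

Because `_run_direct` applies the same seed in every DDP process, each member is reproducible as a whole, but all ranks within a member start from identical random state.
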
--- a/tools/talent-scout/src/api/pipeline-sessions-controller.ts
+++ b/tools/talent-scout/src/api/pipeline-sessions-controller.ts
@@ -202,23 +202,20 @@ export function createPipelineSessionsRouter(deps: PipelineSessionsDeps): Router
      return;
    }

-    const sessionConfig: TalentScoutSessionConfig = {
-      platform: session.platform as PlatformId,
-      city: session.city as CityId | undefined,
-      locationId: session.locationId,
-      distanceMiles: session.distanceMiles,
-      maxResults: session.maxResults,
-      resumeFromStep: nextStep,
-    };
-
    const jobData: CrawlJobData = {
      type: 'talent-scout-session',
      platform: session.platform as PlatformId,
      city: (session.city ?? 'los-angeles') as CityId,
-      sessionConfig,
+      sessionId: session.id,
+      resume: true,
    };

    const jobId = await queue.enqueue(jobData);
+
+    // Link new job ID to session
+    session.bullJobId = jobId;
+    await repo.save(session);
+
    serverEvents.broadcast('crawl:progress', { jobId, platform: session.platform, type: 'pipeline-resume', sessionId: session.id });

    res.json({ data: { jobId, sessionId: session.id, resumeFromStep: nextStep } });
--- a/tools/talent-scout/src/jobs/crawl-job-queue.ts
+++ b/tools/talent-scout/src/jobs/crawl-job-queue.ts
@@ -31,6 +31,8 @@ export interface CrawlJobData {
  targetUrl?: string;
  dryRun?: boolean;
  steps?: PipelineStepName[];
+  /** When true with sessionId, worker calls runner.resume() instead of runner.attach() */
+  resume?: boolean;
 }

 export interface CrawlJobProgress {
--- a/tools/talent-scout/src/jobs/crawl-job-worker.ts
+++ b/tools/talent-scout/src/jobs/crawl-job-worker.ts
@@ -344,10 +344,15 @@ export class CrawlJobWorker {
      completedSteps: [],
    } satisfies CrawlJobProgress);

-    // Session-first: attach to pre-created session; legacy: create inline
-    const result = sessionId
-      ? await runner.attach(sessionId, sessionConfig?.dryRun, sessionConfig?.targetUrl)
-      : await runner.run(sessionConfig!);
+    // Session-first: attach/resume pre-created session; legacy: create inline
+    let result: import('../pipeline/session-runner').SessionRunResult;
+    if (sessionId && job.data.resume) {
+      result = await runner.resume(sessionId);
+    } else if (sessionId) {
+      result = await runner.attach(sessionId, sessionConfig?.dryRun, sessionConfig?.targetUrl);
+    } else {
+      result = await runner.run(sessionConfig!);
+    }

    // Link BullMQ job to the session (no-op if already linked by controller)
    await this.linkBullJobId(dataSource, result.session.id, job.id!);