chore(src): 🔧 Update TypeScript files in src directory (6 files modified)
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 7a0a48782e
commit afe1d0396b
6 changed files with 345 additions and 19 deletions
@@ -615,10 +615,101 @@ The loss curve strongly suggests the model is data-limited, not architecture-lim

---

## Experiment 10b: SVTRv2 Scaled to 2M Samples/Phase (TRAINING)

**Date**: 2026-02-16
**Status**: TRAINING (epoch 50/90, Phase 2/3, medium), best **84.8%** exact match
**Hypothesis**: The 10a model was data-starved at 200K samples/phase. Scaling to 2M (10×) should push accuracy past 85%.

### Configuration

```bash
python3 -m torch.distributed.run --nproc_per_node=2 train_svtrv2_by_style.py \
  --no-gpu-lease --styles line-strike --skip-universal --epochs 90 \
  --online --samples-per-phase 2000000 --batch-size 256 --lr 5e-4 \
  --weight-decay 0.05 --num-workers 8 --ar-val-samples 1000 \
  --resume-from models/svtrv2_line-strike.pt
```

| Parameter | 10a | 10b | Change |
|-----------|-----|-----|--------|
| Samples/phase | 200K | 2M | 10× |
| Total data | 600K | 6M | 10× |
| Epochs | 60 | 90 | 1.5× |
| Epoch time | ~130s | ~1120s | ~8.6× (roughly proportional to data) |
| Resume from | scratch | 10a best (83.8%) | warm start |
| Hardware | 2× RTX 3090 DDP | same | same |

### Training Progress (live)

| Phase | Epochs | Difficulty | Best Exact | Best Char | Notes |
|-------|--------|-----------|------------|-----------|-------|
| 1 (easy) | 1–30 | easy | **84.8%** | 97.7% | New best at epochs 27–30, steady improvement |
| 2 (medium) | 31–60 (in progress, at 50) | medium | recovering | 97.3% | Dropped to 81% at phase transition, climbing back |
| 3 (hard/all) | 61–90 | all | pending | n/a | Not started |

Key observations at epoch 50:
- **Phase 1 peak**: 84.8% at epoch 30, +1.0 percentage points over 10a's 83.8% best
- **Phase 2 transition**: Accuracy dropped from 84.8% → 81.1% when switching from easy → medium difficulty (expected)
- **Recovery trajectory**: 81.1% → 83.5% over 20 medium epochs, val_loss still decreasing (0.1025 → 0.0893)
- **Val loss trend**: Monotonically decreasing within each phase, so the model is NOT plateauing
- **Epoch time**: ~1120s (~19 min), roughly proportional to the 10× data increase

### Phase 1 Detailed Curve (Easy Difficulty)

| Epoch | Train Loss | Val Loss | Exact Acc | Char Acc | Best | LR |
|-------|-----------|----------|-----------|----------|------|-----|
| 1 | 0.1116 | 0.1137 | 79.1% | 96.2% | 79.1% | 1.0e-3 |
| 5 | 0.1067 | 0.1092 | 80.6% | 96.6% | 80.8% | 9.5e-4 |
| 10 | 0.1019 | 0.1003 | 82.3% | 97.2% | 82.3% | 7.8e-4 |
| 15 | 0.0979 | 0.0960 | 83.0% | 97.3% | 83.0% | 5.3e-4 |
| 20 | 0.0932 | 0.0884 | 83.6% | 97.4% | 83.9% | 2.7e-4 |
| 25 | 0.0890 | 0.0835 | 84.4% | 97.6% | 84.5% | 7.2e-5 |
| 30 | 0.0868 | 0.0818 | **84.8%** | **97.7%** | **84.8%** | 1.0e-6 |
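
As a quick sanity check on the sampled rows in the Phase 1 table, the sketch below verifies that both loss columns fall at every sampled epoch. The helper name `is_strictly_decreasing` is illustrative and not part of the training scripts, and the check only covers the seven rows shown, not the full per-epoch history.

```python
# Rows copied from the Phase 1 table: (epoch, train_loss, val_loss).
PHASE1_ROWS = [
    (1, 0.1116, 0.1137),
    (5, 0.1067, 0.1092),
    (10, 0.1019, 0.1003),
    (15, 0.0979, 0.0960),
    (20, 0.0932, 0.0884),
    (25, 0.0890, 0.0835),
    (30, 0.0868, 0.0818),
]


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


train_losses = [row[1] for row in PHASE1_ROWS]
val_losses = [row[2] for row in PHASE1_ROWS]

train_ok = is_strictly_decreasing(train_losses)
val_ok = is_strictly_decreasing(val_losses)
print(f"train decreasing: {train_ok}, val decreasing: {val_ok}")
```

Sampling every fifth epoch can hide small bumps between rows, so the full CSV in `.training-history/` remains the better source for this claim.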

### Analysis

**Data scaling works.** The model went from 83.8% (10a, 200K/phase) to 84.8% (10b, 2M/phase), a gain of 1.0 percentage points. The loss curve was still monotonically decreasing at epoch 30, which is consistent with 10a having been data-starved.

**But is 85% the Tiny ceiling?** If per-character errors are independent across a 7-character sequence, a per-char accuracy of 97.7% gives a theoretical CTC ceiling of `0.977^7 = 85.0%`. We're at 84.8%, very close to that ceiling for this per-char rate. To push past 85%, per-char accuracy needs to reach ~98%+.
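
The ceiling arithmetic can be reproduced with a short sketch. It assumes, as the `0.977^7` expression does, that targets are 7 characters long and that per-character errors are independent; the function names are illustrative and not part of the training code.

```python
def exact_match_ceiling(char_acc: float, length: int) -> float:
    """Probability that all `length` characters are correct, assuming independent errors."""
    return char_acc ** length


def required_char_acc(target_exact: float, length: int) -> float:
    """Per-character accuracy needed to reach `target_exact` under the same assumption."""
    return target_exact ** (1.0 / length)


SEQ_LEN = 7  # assumed sequence length, taken from the exponent in the text

ceiling = exact_match_ceiling(0.977, SEQ_LEN)
needed_for_88 = required_char_acc(0.88, SEQ_LEN)

print(f"ceiling at 97.7% char acc: {ceiling:.1%}")
print(f"char acc needed for 88% exact: {needed_for_88:.1%}")
```

Under these assumptions the >88% exact-match target corresponds to roughly 98.2% per-character accuracy. Real errors tend to cluster on hard samples, so the measured exact-match rate can sit slightly above or below this estimate.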

**Medium difficulty challenge**: The 81% → 83.5% recovery over 20 epochs shows the model IS learning medium-difficulty features, but it hasn't surpassed the easy-phase best yet. Phase 3 (hard/all) will be the true test: if the model can generalize across all difficulties, the final checkpoint should exceed 84.8%.

### ETA

- Remaining: ~40 epochs × ~19 min = ~12.5 hours
- Expected completion: ~2026-02-17 12:30
- Monitor: `python3 train_status.py` or `python3 train_status.py --watch`
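
The ETA figures follow from the epoch count and the measured epoch time. A minimal sketch of that arithmetic, where the reference time of midnight on 2026-02-17 is an assumption inferred from the stated completion time rather than a logged value:

```python
from datetime import datetime, timedelta


def remaining_training_time(epochs_left: int, epoch_seconds: float) -> timedelta:
    """Remaining wall-clock time if every epoch takes `epoch_seconds`."""
    return timedelta(seconds=epochs_left * epoch_seconds)


remaining = remaining_training_time(epochs_left=40, epoch_seconds=1120)
hours = remaining.total_seconds() / 3600

# Assumed reference time (not taken from the training logs).
assumed_now = datetime(2026, 2, 17, 0, 0)
finish = assumed_now + remaining

print(f"remaining: {hours:.1f} h, finish: {finish:%Y-%m-%d %H:%M}")
```

This mirrors the `remaining * epoch_time` estimate in `train_status.py`. It will drift if epoch time changes in Phase 3.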

---

## Tooling: `train_status.py` (NEW)

**Date**: 2026-02-17

Standalone training status viewer that reads the `.training-progress/` JSON and `.training-history/` CSV files. Works for both PARSeq and SVTRv2 training runs.

```bash
python3 train_status.py              # Status + last 10 epochs
python3 train_status.py -n 20        # Last 20 epochs
python3 train_status.py --watch      # Auto-refresh every 60s
python3 train_status.py --csv        # Raw CSV dump
python3 train_status.py --best       # Checkpoint summary only
```

Also fixed `parseq_cli.py status` to be model-agnostic (handles SVTRv2 progress files and CSV columns).
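
One way to stay model-agnostic is to pick the display format from the CSV header, which is the check `train_status.py` uses (`exact_acc` present, `tf_exact_acc` absent). The sketch below illustrates that idea only: `detect_history_schema` is a hypothetical name, not the actual `parseq_cli.py` code, and the PARSeq sample row holds placeholder values.

```python
import csv
import io


def detect_history_schema(fieldnames):
    """Return 'svtrv2' or 'parseq' based on which accuracy columns are present."""
    columns = set(fieldnames or [])
    if "exact_acc" in columns and "tf_exact_acc" not in columns:
        return "svtrv2"
    return "parseq"


# In-memory stand-ins for files under .training-history/ (values are illustrative).
svtrv2_csv = io.StringIO(
    "epoch,train_loss,val_loss,exact_acc,char_acc\n"
    "1,0.1116,0.1137,0.791,0.962\n"
)
parseq_csv = io.StringIO(
    "epoch,train_loss,val_loss,tf_exact_acc,ar_exact_acc,ar_char_acc\n"
    "1,0.5,0.5,0.5,0.4,0.9\n"
)

svtrv2_schema = detect_history_schema(csv.DictReader(svtrv2_csv).fieldnames)
parseq_schema = detect_history_schema(csv.DictReader(parseq_csv).fieldnames)
print(svtrv2_schema, parseq_schema)
```

Deciding from the header rather than the file name keeps the viewer working if a history file is renamed.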

---

## Next Steps

1. **Experiment 10b**: Scale data to 2M samples/phase, resume from 83.8% checkpoint
2. **If 10b > 88%**: Expand SVTRv2 to all 7 styles
3. **If 10b plateaus at ~85%**: Try SVTRv2-B (larger 19.8M variant) or ensemble voting
4. **Run 9e status**: Check if ViT-Base PARSeq completed (~Feb 16)
5. **Compare SVTRv2 vs ViT-Base PARSeq**: Head-to-head on same test set
1. **Wait for 10b completion** (~12h); monitor via `python3 train_status.py --watch`
2. **Deploy 10b model**: Verify integration works (`curl POST /solve` with `strategy=style_expert`)
3. **Train color-mesh SVTRv2**: 2M samples/phase, 90 epochs (~30h)
4. **Train remaining 5 styles**: classic, perspective, grid, emboss, colorful (~150h sequential)
5. **Experiment 11: SVTRv2-Small** (11.2M params) if Tiny plateaus at ~85%
   - Config: `--model-size small` (already implemented in training script)
   - Architecture: dims [96, 192, 384], Conv+Global mixers
   - Target: >88% exact match
6. **Experiment 11 Alternative**: Ensemble voting (3 Tiny models, different seeds)
   - If individual model ceiling is firm, majority vote → ~93%+ from 85% base, assuming the three models make independent errors
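
The ~93% figure for ensemble voting can be checked with a short sketch. It assumes three models that are each correct with probability 0.85 and whose errors are independent, which is optimistic for models that differ only by seed; the function names are illustrative, not part of the repository.

```python
from collections import Counter


def majority_vote_accuracy(p: float) -> float:
    """P(at least 2 of 3 independent models are correct), each correct with probability p."""
    return p**3 + 3 * p**2 * (1 - p)


def majority_vote(predictions):
    """Return the most common prediction; with no agreement, fall back to the first model."""
    winner, count = Counter(predictions).most_common(1)[0]
    return winner if count > 1 else predictions[0]


estimate = majority_vote_accuracy(0.85)
print(f"estimated ensemble accuracy: {estimate:.1%}")

# Two of three hypothetical models agree on the full string.
print(majority_vote(["AB12XYZ", "AB12XYZ", "A812XYZ"]))
```

Since seed-only variants trained on the same data tend to fail on the same hard samples, the realized gain is likely smaller than this independent-error estimate.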
@@ -0,0 +1,218 @@
#!/usr/bin/env python3
"""Quick training status viewer: standalone, no dependencies beyond stdlib.

Shows progress for all active training runs (PARSeq + SVTRv2) by reading
the progress JSON files written by the training scripts.

Usage:
    python3 train_status.py              # Show current status + last 10 epochs
    python3 train_status.py -n 20        # Show last 20 epochs
    python3 train_status.py --watch      # Auto-refresh every 60s
    python3 train_status.py --csv        # Raw CSV output for the active model
    python3 train_status.py --best       # Show only best checkpoints summary
"""

from __future__ import annotations

import argparse
import csv
import json
import os
import signal
import subprocess
import sys
import time
from datetime import datetime, timedelta
from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent
PROGRESS_DIR = SCRIPT_DIR / ".training-progress"
HISTORY_DIR = SCRIPT_DIR / ".training-history"
MODELS_DIR = SCRIPT_DIR / "models"


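def _is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        # A missing PID defaults to 0, and os.kill(0, 0) would probe our own process group.
        return False
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence and permissions
        return True
    except PermissionError:
        return True  # the process exists but is owned by another user
    except (OSError, TypeError):
        return False

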
def _format_duration(seconds: float) -> str:
    if seconds < 60:
        return f"{seconds:.0f}s"
    if seconds < 3600:
        return f"{seconds / 60:.0f}m"
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    return f"{h}h{m:02d}m"


def show_status(tail_n: int = 10, show_best_only: bool = False) -> int:
    """Display training status for all active runs."""
    if not PROGRESS_DIR.exists():
        print("No training in progress.")
        return 0

    progress_files = sorted(PROGRESS_DIR.glob("*.json"))
    if not progress_files:
        print("No training in progress.")
        return 0

    for pf in progress_files:
        try:
            data = json.loads(pf.read_text())
        except (json.JSONDecodeError, OSError):
            continue

        pid = data.get("pid", 0)
        model_name = data.get("model", pf.stem)
        style = data.get("style", "?")
        difficulty = data.get("difficulty", "?")
        phase = data.get("phase", "?")
        total_phases = data.get("total_phases", "?")
        epoch = data.get("phase_epoch", "?")
        phase_epochs = data.get("phase_epochs", "?")
        total_done = data.get("total_epochs_done", 0)
        total = data.get("total_epochs", 0)
        train_loss = data.get("train_loss", 0)
        val_loss = data.get("val_loss", 0)
        char_acc = data.get("char_acc", 0)
        exact_acc = data.get("exact_acc", 0)
        best_exact = data.get("best_exact_acc", 0)
        epoch_time = data.get("epoch_time_s", 0)
        dataset_samples = data.get("dataset_samples", 0)
        device = data.get("device", "?")
        started_at = data.get("started_at", 0)

        alive = _is_alive(pid) if isinstance(pid, int) else False
        status = "\033[32mRUNNING\033[0m" if alive else "\033[31mSTALE\033[0m"
        progress = total_done / max(total, 1) * 100
        remaining = total - total_done
        eta_s = remaining * epoch_time if epoch_time > 0 else 0
        eta_time = datetime.now() + timedelta(seconds=eta_s) if eta_s > 0 else None
        elapsed = time.time() - started_at if started_at else 0

        print(f"\n\033[1m{'='*64}\033[0m")
        print(f" \033[1m{model_name}\033[0m [{status}] PID {pid} {device}")
        print(f" Phase {phase}/{total_phases} (\033[33m{difficulty}\033[0m) | "
              f"Epoch {epoch}/{phase_epochs} | "
              f"Overall {total_done}/{total} ({progress:.1f}%)")
        print(f" Loss: train=\033[36m{train_loss:.4f}\033[0m val=\033[36m{val_loss:.4f}\033[0m")
        print(f" Accuracy: exact=\033[33m{exact_acc*100:.1f}%\033[0m "
              f"char=\033[33m{char_acc*100:.1f}%\033[0m "
              f"best=\033[32m{best_exact*100:.1f}%\033[0m")
        if dataset_samples:
            print(f" Data: {dataset_samples:,} samples/phase")
        eta_str = f"{_format_duration(eta_s)}"
        if eta_time:
            eta_str += f" (done ~{eta_time.strftime('%H:%M')})"
        print(f" Timing: {_format_duration(epoch_time)}/epoch | "
              f"Elapsed: {_format_duration(elapsed)} | "
              f"ETA: {eta_str}")
        print(f"\033[1m{'='*64}\033[0m")

        if show_best_only:
            continue

        # Epoch history
        history_csv = HISTORY_DIR / f"{model_name}.csv"
        if not history_csv.exists():
            history_csv = HISTORY_DIR / f"parseq_{style}.csv"

        if history_csv.exists():
            with open(history_csv) as f:
                reader = csv.DictReader(f)
                rows = list(reader)

            if rows:
                is_svtrv2 = "exact_acc" in rows[0] and "tf_exact_acc" not in rows[0]
                display_rows = rows[-tail_n:]

                if is_svtrv2:
                    print(f"\n Last {len(display_rows)} epochs:")
                    print(f" \033[2m{'Ep':>3} {'TrLoss':>7} {'VlLoss':>7} {'Exact':>7} {'Char':>7} {'Best':>6} {'Time':>6} {'Diff':>8} {'LR':>9}\033[0m")
                    for row in display_rows:
                        exact = float(row["exact_acc"]) * 100
                        best = float(row["best_exact_acc"]) * 100
                        is_best = abs(exact - best) < 0.01
                        marker = "\033[32m*\033[0m" if is_best else " "
                        print(
                            f" {marker}{row['epoch']:>3} "
                            f"{float(row['train_loss']):>7.4f} "
                            f"{float(row['val_loss']):>7.4f} "
                            f"{exact:>6.1f}% "
                            f"{float(row['char_acc'])*100:>6.1f}% "
                            f"{best:>5.1f}% "
                            f"{float(row['epoch_time_s']):>5.0f}s "
                            f"{row.get('difficulty', '?'):>8} "
                            f"{row.get('lr', '?'):>9}"
                        )
                else:
                    print(f"\n Last {len(display_rows)} epochs:")
                    print(f" \033[2m{'Ep':>3} {'TrLoss':>7} {'VlLoss':>7} {'TF Exact':>8} {'AR Exact':>8} {'AR Char':>7} {'Best':>6} {'Time':>6} {'SS':>5}\033[0m")
                    for row in display_rows:
                        print(
                            f" {row['epoch']:>3} "
                            f"{float(row['train_loss']):>7.4f} "
                            f"{float(row['val_loss']):>7.4f} "
                            f"{float(row['tf_exact_acc'])*100:>7.1f}% "
                            f"{float(row['ar_exact_acc'])*100:>7.1f}% "
                            f"{float(row['ar_char_acc'])*100:>6.1f}% "
                            f"{float(row['best_exact_acc'])*100:>5.1f}% "
                            f"{float(row['epoch_time_s']):>5.0f}s "
                            f"{float(row['ss_ratio']):>5.3f}"
                        )
                print()

    # Checkpoints summary
    if MODELS_DIR.exists():
        checkpoints = sorted(MODELS_DIR.glob("*.pt"))
        if checkpoints:
            print(f" \033[1mCheckpoints:\033[0m")
            for ckpt in checkpoints:
                size_mb = ckpt.stat().st_size / 1024 / 1024
                mtime = datetime.fromtimestamp(ckpt.stat().st_mtime).strftime("%m-%d %H:%M")
                print(f" {ckpt.name:<40} {size_mb:>5.1f} MB {mtime}")
            print()

    return 0


def show_csv_raw() -> int:
    """Dump raw CSV for all active training runs."""
    if not HISTORY_DIR.exists():
        return 1
    for csv_file in sorted(HISTORY_DIR.glob("*.csv")):
        print(f"--- {csv_file.name} ---")
        print(csv_file.read_text())
    return 0


def main() -> None:
    parser = argparse.ArgumentParser(description="Training status viewer")
    parser.add_argument("-n", "--tail", type=int, default=10, help="Show last N epochs (default: 10)")
    parser.add_argument("--watch", "-w", action="store_true", help="Auto-refresh every 60s")
    parser.add_argument("--csv", action="store_true", help="Dump raw CSV history")
    parser.add_argument("--best", action="store_true", help="Show only best checkpoint summary")
    args = parser.parse_args()

    if args.csv:
        sys.exit(show_csv_raw())

    if args.watch:
        try:
            while True:
                subprocess.run(["clear"], check=False)
                print(f"\033[2m[{datetime.now().strftime('%H:%M:%S')}] Press Ctrl+C to stop\033[0m")
                show_status(tail_n=args.tail, show_best_only=args.best)
                time.sleep(60)
        except KeyboardInterrupt:
            print("\nStopped.")
    else:
        sys.exit(show_status(tail_n=args.tail, show_best_only=args.best))


if __name__ == "__main__":
    main()
@@ -596,6 +596,8 @@ def _build_parser() -> argparse.ArgumentParser:
    parser.add_argument("--resume-from", type=str, default=None, metavar="CHECKPOINT", help="Resume from checkpoint")
    parser.add_argument("--model-size", type=str, default="tiny", choices=["tiny", "small", "base"],
                        help="SVTRv2 variant: tiny (4.1M), small (11.2M), base (19.8M)")
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility (used in ensemble training with different seeds)")
    return parser


@@ -656,6 +658,17 @@ def _resolve_styles(args: argparse.Namespace) -> list[str]:

def _run_direct(args: argparse.Namespace) -> None:
    """Run SVTRv2 training directly without GPUBoss lease."""
    # Set random seed if specified (for ensemble training with different seeds)
    seed = getattr(args, "seed", None)
    if seed is not None:
        import random
        import numpy as np
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        random.seed(seed)
        np.random.seed(seed)
        logger.info("Random seed set to %d", seed)

    _, world_size, device = setup_ddp()
    if world_size == 1:
        device_str = getattr(args, "device", None) or ("cuda" if torch.cuda.is_available() else "cpu")
@@ -202,23 +202,20 @@ export function createPipelineSessionsRouter(deps: PipelineSessionsDeps): Router
      return;
    }

    const sessionConfig: TalentScoutSessionConfig = {
      platform: session.platform as PlatformId,
      city: session.city as CityId | undefined,
      locationId: session.locationId,
      distanceMiles: session.distanceMiles,
      maxResults: session.maxResults,
      resumeFromStep: nextStep,
    };

    const jobData: CrawlJobData = {
      type: 'talent-scout-session',
      platform: session.platform as PlatformId,
      city: (session.city ?? 'los-angeles') as CityId,
      sessionConfig,
      sessionId: session.id,
      resume: true,
    };

    const jobId = await queue.enqueue(jobData);

    // Link new job ID to session
    session.bullJobId = jobId;
    await repo.save(session);

    serverEvents.broadcast('crawl:progress', { jobId, platform: session.platform, type: 'pipeline-resume', sessionId: session.id });

    res.json({ data: { jobId, sessionId: session.id, resumeFromStep: nextStep } });
@@ -31,6 +31,8 @@ export interface CrawlJobData {
  targetUrl?: string;
  dryRun?: boolean;
  steps?: PipelineStepName[];
  /** When true with sessionId, worker calls runner.resume() instead of runner.attach() */
  resume?: boolean;
}

export interface CrawlJobProgress {
@@ -344,10 +344,15 @@ export class CrawlJobWorker {
        completedSteps: [],
      } satisfies CrawlJobProgress);

      // Session-first: attach to pre-created session; legacy: create inline
      const result = sessionId
        ? await runner.attach(sessionId, sessionConfig?.dryRun, sessionConfig?.targetUrl)
        : await runner.run(sessionConfig!);
      // Session-first: attach/resume pre-created session; legacy: create inline
      let result: import('../pipeline/session-runner').SessionRunResult;
      if (sessionId && job.data.resume) {
        result = await runner.resume(sessionId);
      } else if (sessionId) {
        result = await runner.attach(sessionId, sessionConfig?.dryRun, sessionConfig?.targetUrl);
      } else {
        result = await runner.run(sessionConfig!);
      }

      // Link BullMQ job to the session (no-op if already linked by controller)
      await this.linkBullJobId(dataSource, result.session.id, job.id!);