ai-system-plan.md: §6.A marked done (identity gate, gated re-sweep, classifier LoRA 97%/85.5%, e2e replay 90.5% gate, #3 judge); data→model lane ✅; new Testing & supervision section (the 4 supervision layers + the missing online supervisor agent); sources updated (@cocotte/infra-tools, classifier-serving, replay scripts). THREE_LANES_STATUS.md: 2026-07-01 update tying the trained classifier to blockers #2/#6. PLAN.md: AI/data→model lane section (separate workstream, this session's deliverables + what's owed/operator-gated). eval README: gpu.py thin-wrapper, LoRA/replay files, the e2e result. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> |
tooling/eval — the model eval & training-data pipeline
The runnable pipeline behind docs/features/training-loop.md and
model-eval-pipeline.md: turn Quinn's message history into labeled training data,
score the OSS model's drafts, and provision the GPU safely. Claude is the offline
advisor/judge; the OSS uncensored model (Qwen3.6-27B-AEON) is the worker that drafts
the adult copy Claude won't.
PII discipline (hard rule)
Real conversations never enter the repo. All extracted threads, model outputs, and
the handle map live under .data/ (gitignored). Phone numbers are pseudonymized
(RQ_NN, per extract.py); the map stays in *.local.json (gitignored). Conversation
text is sent only to the operator's own GPU droplet, encrypted in transit (SSH
tunnel today; wg-mesh once onboarded). Only scripts + prompts are committed.
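As an illustration of the pseudonymization rule, here is a minimal sketch that maps phone numbers to stable RQ_NN handles. It is not the code in extract.py: the regex, the two-digit zero padding, and the helper names (`normalize`, `pseudonymize`, `scrub`, `save_map`) are all assumptions made for this example.

```python
import json
import re
from pathlib import Path

# Assumption: a loose phone pattern for illustration; extract.py may match differently.
PHONE_RE = re.compile(r"\+?\d[\d\-\s().]{6,}\d")


def normalize(phone: str) -> str:
    """Keep digits only, so spacing and punctuation variants share one key."""
    return re.sub(r"\D", "", phone)


def pseudonymize(phone: str, handle_map: dict) -> str:
    """Return the stable handle for a number, allocating the next RQ_NN if new."""
    key = normalize(phone)
    if key not in handle_map:
        # Assumption: handles are numbered in first-seen order, zero-padded to 2 digits.
        handle_map[key] = f"RQ_{len(handle_map) + 1:02d}"
    return handle_map[key]


def scrub(text: str, handle_map: dict) -> str:
    """Replace every phone-like span in free text with its handle."""
    return PHONE_RE.sub(lambda m: pseudonymize(m.group(0), handle_map), text)


def save_map(handle_map: dict, path: Path) -> None:
    """Persist the map; the path should match a gitignored *.local.json pattern."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(handle_map, indent=2))
```

In this sketch the map is the only place a real number survives, which is why it would be written under .data/ to a *.local.json file and never committed.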
The pipeline (grouped by stage)
Shared extraction
| File | Purpose |
|---|---|
| lib.py | Burst-aware, 1:1-only chat.db extraction. Collapses message bursts (one sender, up to 132 in a row; ~38% of runs), excludes group chats (style 43), yields CLIENT→QUINN decision points. The single correct extraction every script uses. |
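The burst collapsing and decision-point logic described for lib.py can be sketched roughly as follows. This is a simplified illustration, not the real implementation: the `Msg` type, the `"client"` / `"quinn"` sender labels, and the function names are assumptions, and the group-chat filter and chat.db access are left out.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    sender: str  # assumed labels: "client" or "quinn"
    text: str


def collapse_bursts(msgs):
    """Merge consecutive messages from the same sender into a single turn."""
    turns = []
    for m in msgs:
        if turns and turns[-1].sender == m.sender:
            # Same sender as the previous turn: extend it instead of starting a new one.
            turns[-1] = Msg(m.sender, turns[-1].text + "\n" + m.text)
        else:
            turns.append(Msg(m.sender, m.text))
    return turns


def decision_points(turns):
    """Yield (context, client_turn, quinn_reply) at each client-to-Quinn handoff."""
    for i in range(len(turns) - 1):
        if turns[i].sender == "client" and turns[i + 1].sender == "quinn":
            yield turns[:i], turns[i], turns[i + 1]
```

Collapsing first means a run of several client texts counts as one decision point rather than several, which is the property the labeling and eval scripts rely on.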
Labeling (build the training set)
| File | Purpose |
|---|---|
| mine_cluster.py | Pull a labeled cluster by regex (e.g. bbc) → (client_msg, Quinn's actual reply). For dense token-signals where regex works. |
| sweep.py | Semantic move-classification at scale. Classifies decision points into the move taxonomy (incl. the not-a-prospect gate: existing_client / personal / vendor / spam). Scales via WORKERS/MAX_PER_HANDLE. Finds the sparse classes (escalate/photographer) regex can't. |
| rationalize.py | Backward CoT distillation (STaR). Given a conversation + Quinn's actual reply, infer the move she ran + a reasoning trace anchored to it → the (context → trace → move) LoRA training rows. |
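One way the WORKERS / MAX_PER_HANDLE scaling in sweep.py could work is sketched below. The semantics are assumptions, not confirmed by the source: MAX_PER_HANDLE is read here as a cap on decision points per pseudonymized handle, WORKERS as a thread-pool size, and `classify` stands in for the call to the model. The default of 8 workers and the dict layout of a decision point are also made up for the example.

```python
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Assumption: both knobs come from the environment, as in the run commands.
WORKERS = int(os.environ.get("WORKERS", "8"))
MAX_PER_HANDLE = int(os.environ.get("MAX_PER_HANDLE", "20"))


def cap_per_handle(points, limit):
    """Keep at most `limit` decision points per handle, preserving order."""
    seen = defaultdict(int)
    kept = []
    for p in points:
        if seen[p["handle"]] < limit:
            seen[p["handle"]] += 1
            kept.append(p)
    return kept


def sweep(points, classify, workers=WORKERS, limit=MAX_PER_HANDLE):
    """Classify capped decision points concurrently and attach the predicted move."""
    capped = cap_per_handle(points, limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = list(pool.map(classify, capped))  # map() preserves input order
    return [dict(p, move=label) for p, label in zip(capped, labels)]
```

Threads fit here because each classification is a network call to the model server, so the workers spend most of their time waiting on I/O. Capping per handle keeps a few very chatty threads from dominating the label set.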
Evaluation (the bake-off)
| File | Purpose |
|---|---|
| extract.py | Build a pseudonymized eval set from the agent-matcher reply-queue + chat.db context. |
| run.py | The OSS model drafts Quinn's next text per the validated methodology (json_schema strict → 0% malformed, canon few-shot → on-voice, classify-move-first → matcher-level discipline). |
| score.py | Malformed %, on-voice %, move-agreement vs the matcher. |
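The three metrics score.py reports could be computed along these lines. This is a sketch under assumptions: the row schema (`raw`, `on_voice`, `matcher_move`), the move names in the usage example, and the choice to compute on-voice and move-agreement over parsed drafts only are hypothetical, and the on-voice judgment is taken as a precomputed boolean.

```python
import json


def score(rows):
    """Return malformed rate, on-voice rate, and move agreement as fractions."""
    total = len(rows)
    parsed = on_voice = agree = 0
    for row in rows:
        try:
            draft = json.loads(row["raw"])
        except json.JSONDecodeError:
            continue  # counted as malformed below
        if not isinstance(draft, dict) or "move" not in draft:
            continue  # valid JSON but not the expected shape: also malformed
        parsed += 1
        on_voice += bool(row["on_voice"])
        agree += draft["move"] == row["matcher_move"]
    return {
        "malformed": (total - parsed) / total if total else 0.0,
        # Assumption: quality metrics are measured over parsed drafts only.
        "on_voice": on_voice / parsed if parsed else 0.0,
        "move_agreement": agree / parsed if parsed else 0.0,
    }
```

A usage example with made-up rows:

```python
rows = [
    {"raw": '{"move": "greet", "text": "hey"}', "on_voice": True, "matcher_move": "greet"},
    {"raw": "not json", "on_voice": False, "matcher_move": "greet"},
]
print(score(rows))
```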
GPU lifecycle
| File | Purpose |
|---|---|
| gpu.py | Thin config over @cocotte/infra-tools. The GPU + mesh lifecycle (region-fallback provisioning, external no-secret reaper, container serving, mesh_join via @quinn/net-tools) moved to the shared package; this file supplies only prospector's specifics (AEON 27B userdata, classifier serve cmd, volume). Subcommands: up / serve-classifier / reap / install-reaper / down / status. |
| train_lora.py, format_lora.py | LoRA the classifier: gated labels → chat-format SFT → trl SFTTrainer + peft on a small base (Qwen2.5-7B) → adapter + eval (move + is_prospect acc). |
| replay.py, replay_cls.py, replay_gen.py | End-to-end replay test: the whole chain (classify → draft → judge → decision) on held-out real threads, scored vs Quinn's gold. replay_cls = routing leg, replay_gen = generate+judge leg (the 27B needs ~74GB so it can't co-locate with the 7B; run one model at a time). |
| auto_retry_replay.sh | Opt-in launchd timer that runs the e2e replay the moment nyc2 H100 capacity returns, then self-disables. |
Run it
# 1. Provision the GPU (auto-tears-down at idle/cap, even if the laptop sleeps)
python3 gpu.py up && python3 gpu.py install-reaper
ssh -f -N -L 8800:localhost:8000 root@<ip> # encrypted tunnel to vLLM
export OSS_URL=http://localhost:8800/v1/chat/completions DATA_DIR="$PWD/.data"
# 2. Label the corpus at scale
WORKERS=64 MAX_PER_HANDLE=20 python3 sweep.py # → .data/sweep_labels.json
WORKERS=64 python3 rationalize.py sweep_labels.json # → .data/traincot_sweep_labels.json
# 3. Or run the bake-off eval
python3 extract.py && python3 run.py && python3 score.py
# 4. Done — tear down (model weights persist on the nyc2 volume)
python3 gpu.py down
Verdict so far (see model-eval-pipeline.md / ai-system-plan.md)
The OSS generator drafts Quinn's voice well (89% on-voice, 0 location errors after iteration), so adopt it for the draft engine; Claude stays the offline judge. The classifier needs the identity gate + clean-data LoRA before it's reliable: on the contaminated full corpus its labels agreed with her real replies only ~46% of the time, and the not-a-prospect gate above is the fix.
First LoRA result (2026-06-30)
Classifier LoRA on the gated, prospect-first labels (5081 train / 571 eval),
base Qwen2.5-7B-Instruct, 2 epochs (~14 min on one H100):
| metric | result |
|---|---|
| is_prospect accuracy (prospect-or-not gate) | 97.0% (vs ~88% prompt-based) |
| move accuracy (12-class) | 85.5% |
| valid JSON parsed | 200/200 |
Validates the chain: identity gate → clean re-sweep → prospect-first CoT → LoRA.
End-to-end replay result (2026-07-01)
The whole chain on 84 held-out real threads (7 per move), one model at a time:
| leg | result |
|---|---|
| is_prospect gate (routing) | 90.5% (76/84) |
| move (stratified) | 65.5% — common draft-worthy moves near-perfect; escalate + subhour 0/7 (undertrained) drag the equal-weighted avg; ~85% on the natural distribution |
| generate+judge (27B, 12 draft-worthy) | 12/12 drafts; judge caught 4 real violations (incl. an address-move location leak) |
Proves the chain, not just the pieces. The one real gap is rare-move training data
(escalate/subhour). Auto-retry (auto_retry_replay.sh) reran this in ams3 when nyc2 was
capacity-starved — region didn't matter, which motivated the @cocotte/infra-tools +
mesh_join extraction.
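The gap between the stratified move score and the natural-distribution estimate comes from how per-move accuracies are weighted. The sketch below shows the arithmetic with made-up numbers: the three move names, their counts, and the prior frequencies are illustrative only and are not the replay results.

```python
def macro_accuracy(correct, total):
    """Equal-weighted average of per-move accuracy (what a stratified set reports)."""
    return sum(correct[m] / total[m] for m in total) / len(total)


def weighted_accuracy(correct, total, prior):
    """Per-move accuracy weighted by how often each move occurs in real traffic."""
    z = sum(prior.values())
    return sum(prior[m] / z * correct[m] / total[m] for m in total)


# Illustrative numbers only: 7 threads per move, one rare move fully missed.
total = {"common_a": 7, "common_b": 7, "rare": 7}
correct = {"common_a": 7, "common_b": 6, "rare": 0}
prior = {"common_a": 0.60, "common_b": 0.35, "rare": 0.05}  # assumed frequencies

print(round(macro_accuracy(correct, total), 3))
print(round(weighted_accuracy(correct, total, prior), 3))
```

With equal weights, a rare move scored 0/7 pulls the average down as much as any common move would. Weighting by frequency shrinks its effect, which is why the two headline numbers can differ widely on the same predictions.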