prospector/tooling/eval
Natalie 40e5ea9d39
Some checks failed
CI/CD / verify (push) Failing after 3m11s
CI/CD / deploy (push) Has been skipped
docs: update PLAN + AI docs for the data→model lane (LoRA, e2e, infra-tools)
ai-system-plan.md: §6.A marked done (identity gate, gated re-sweep, classifier LoRA
97%/85.5%, e2e replay 90.5% gate, #3 judge); data→model lane ; new Testing &
supervision section (the 4 supervision layers + the missing online supervisor agent);
sources updated (@cocotte/infra-tools, classifier-serving, replay scripts).
THREE_LANES_STATUS.md: 2026-07-01 update tying the trained classifier to blockers #2/#6.
PLAN.md: AI/data→model lane section (separate workstream, this session's deliverables
+ what's owed/operator-gated). eval README: gpu.py thin-wrapper, LoRA/replay files, the
e2e result.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 07:37:06 -04:00
..
.gitignore docs(prospector): grouped feature documentation + eval pipeline docs 2026-06-30 10:41:16 -04:00
auto_retry_replay.sh feat(eval): auto-retry script for the e2e replay test (launchd, opt-in install) 2026-06-30 20:14:05 -04:00
classify_test.py test(eval): live classify test — trained LoRA classifier proven serving 2026-06-30 17:16:35 -04:00
extract.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
format_lora.py feat(eval): LoRA classifier pipeline — format + train on gated prospect-first labels 2026-06-30 14:37:42 -04:00
gpu.py feat(eval): wire gpu.py mesh_join via @cocotte/infra-tools + net-tools 2026-07-01 07:18:49 -04:00
lib.py feat(eval): identity gate layer 2 — AddressBook known-contact exclusion 2026-06-30 12:12:23 -04:00
mine_cluster.py docs(prospector): grouped feature documentation + eval pipeline docs 2026-06-30 10:41:16 -04:00
rationalize.py feat(prospector): classifier detects NOT-A-PROSPECT as first-class output 2026-06-30 09:30:14 -04:00
README.md docs: update PLAN + AI docs for the data→model lane (LoRA, e2e, infra-tools) 2026-07-01 07:37:06 -04:00
replay.py feat(eval): end-to-end replay test — proves the whole chain on real threads 2026-06-30 20:00:27 -04:00
replay_cls.py feat(eval): two-leg e2e replay (single-model-at-a-time) + first e2e result 2026-07-01 06:49:14 -04:00
replay_gen.py feat(eval): two-leg e2e replay (single-model-at-a-time) + first e2e result 2026-07-01 06:49:14 -04:00
run.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
score.py feat(prospector): add tooling/eval draft-engine bake-off harness 2026-06-30 01:47:56 -04:00
sweep.py feat(eval): explicit prospect-first CoT step (is_prospect) 2026-06-30 12:52:02 -04:00
train_lora.py feat(eval): LoRA classifier pipeline — format + train on gated prospect-first labels 2026-06-30 14:37:42 -04:00

tooling/eval — the model eval & training-data pipeline

The runnable pipeline behind docs/features/training-loop.md and model-eval-pipeline.md: turn Quinn's message history into labeled training data, score the OSS model's drafts, and provision the GPU safely. Claude is the offline advisor/judge; the OSS uncensored model (Qwen3.6-27B-AEON) is the worker — it drafts the adult copy Claude won't.

PII discipline (hard rule)

Real conversations never enter the repo. All extracted threads, model outputs, and the handle map live under .data/ (gitignored). Phone numbers are pseudonymized (RQ_NN, per extract.py); the map stays in *.local.json (gitignored). Conversation text is sent only to the operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once onboarded). Only scripts + prompts are committed.

The pipeline (grouped by stage)

Shared extraction

File Purpose
lib.py Burst-aware, 1:1-only chat.db extraction. Collapses message bursts (one sender, up to 132 in a row — ~38% of runs), excludes group chats (style 43), yields CLIENT→QUINN decision points. The single correct extraction every script uses.

Labeling (build the training set)

File Purpose
mine_cluster.py Pull a labeled cluster by regex (e.g. bbc) → (client_msg, Quinn's actual reply). For dense token-signals where regex works.
sweep.py Semantic move-classification at scale. Classifies decision points into the move taxonomy (incl. the not-a-prospect gate: existing_client / personal / vendor / spam). Scales via WORKERS/MAX_PER_HANDLE. Finds the sparse classes (escalate/photographer) regex can't.
rationalize.py Backward CoT distillation (STaR). Given a conversation + Quinn's actual reply, infer the move she ran + a reasoning trace anchored to it → the (context → trace → move) LoRA training rows.

Evaluation (the bake-off)

File Purpose
extract.py Build a pseudonymized eval set from the agent-matcher reply-queue + chat.db context.
run.py The OSS model drafts Quinn's next text per the validated methodology (json_schema strict → 0% malformed, canon few-shot → on-voice, classify-move-first → matcher-level discipline).
score.py Malformed %, on-voice %, move-agreement vs the matcher.

GPU lifecycle

File Purpose
gpu.py Thin config over @cocotte/infra-tools — the GPU + mesh lifecycle (region-fallback provisioning, external no-secret reaper, container serving, mesh_join via @quinn/net-tools) moved to the shared package; this file supplies only prospector's specifics (AEON 27B userdata, classifier serve cmd, volume). up / serve-classifier / reap / install-reaper / down / status.
train_lora.py, format_lora.py LoRA the classifier: gated labels → chat-format SFT → trl SFTTrainer + peft on a small base (Qwen2.5-7B) → adapter + eval (move + is_prospect acc).
replay.py, replay_cls.py, replay_gen.py End-to-end replay test — the whole chain (classify → draft → judge → decision) on held-out real threads, scored vs Quinn's gold. replay_cls = routing leg, replay_gen = generate+judge leg (the 27B needs ~74GB so it can't co-locate with the 7B — run one model at a time).
auto_retry_replay.sh Opt-in launchd timer that runs the e2e replay the moment nyc2 H100 capacity returns, then self-disables.

Run it

# 1. Provision the GPU (auto-tears-down at idle/cap, even if the laptop sleeps)
python3 gpu.py up && python3 gpu.py install-reaper
ssh -f -N -L 8800:localhost:8000 root@<ip>          # encrypted tunnel to vLLM
export OSS_URL=http://localhost:8800/v1/chat/completions DATA_DIR="$PWD/.data"

# 2. Label the corpus at scale
WORKERS=64 MAX_PER_HANDLE=20 python3 sweep.py        # → .data/sweep_labels.json
WORKERS=64 python3 rationalize.py sweep_labels.json  # → .data/traincot_sweep_labels.json

# 3. Or run the bake-off eval
python3 extract.py && python3 run.py && python3 score.py

# 4. Done — tear down (model weights persist on the nyc2 volume)
python3 gpu.py down

Verdict so far (see model-eval-pipeline.md / ai-system-plan.md)

The OSS generator drafts Quinn's voice well (89% on-voice, 0 location errors after iteration) — adopt it for the draft engine; Claude stays the offline judge. The classifier needs the identity gate + clean-data LoRA before it's reliable (it aligned with her real replies only ~46% on the contaminated full corpus — the not-a-prospect gate above is the fix).

First LoRA result (2026-06-30)

Classifier LoRA on the gated, prospect-first labels (5081 train / 571 eval), base Qwen2.5-7B-Instruct, 2 epochs (~14 min on one H100):

metric result
is_prospect accuracy (prospect-or-not gate) 97.0% (vs ~88% prompt-based)
move accuracy (12-class) 85.5%
valid JSON parsed 200/200

Validates the chain: identity gate → clean re-sweep → prospect-first CoT → LoRA.

End-to-end replay result (2026-07-01)

The whole chain on 84 held-out real threads (7 per move), one model at a time:

leg result
is_prospect gate (routing) 90.5% (76/84)
move (stratified) 65.5% — common draft-worthy moves near-perfect; escalate + subhour 0/7 (undertrained) drag the equal-weighted avg; ~85% on the natural distribution
generate+judge (27B, 12 draft-worthy) 12/12 drafts; judge caught 4 real violations (incl. an address-move location leak)

Proves the chain, not just the pieces. The one real gap is rare-move training data (escalate/subhour). Auto-retry (auto_retry_replay.sh) reran this in ams3 when nyc2 was capacity-starved — region didn't matter, which motivated the @cocotte/infra-tools + mesh_join extraction.