History

Natalie 40e5ea9d39 Some checks failed CI/CD / verify (push) Failing after 3m11s Details CI/CD / deploy (push) Has been skipped Details docs: update PLAN + AI docs for the data→model lane (LoRA, e2e, infra-tools) ai-system-plan.md: §6.A marked done (identity gate, gated re-sweep, classifier LoRA 97%/85.5%, e2e replay 90.5% gate, #3 judge); data→model lane ✅; new Testing & supervision section (the 4 supervision layers + the missing online supervisor agent); sources updated (@cocotte/infra-tools, classifier-serving, replay scripts). THREE_LANES_STATUS.md: 2026-07-01 update tying the trained classifier to blockers #2/#6. PLAN.md: AI/data→model lane section (separate workstream, this session's deliverables + what's owed/operator-gated). eval README: gpu.py thin-wrapper, LoRA/replay files, the e2e result. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>		2026-07-01 07:37:06 -04:00
..
.gitignore	docs(prospector): grouped feature documentation + eval pipeline docs	2026-06-30 10:41:16 -04:00
auto_retry_replay.sh	feat(eval): auto-retry script for the e2e replay test (launchd, opt-in install)	2026-06-30 20:14:05 -04:00
classify_test.py	test(eval): live classify test — trained LoRA classifier proven serving	2026-06-30 17:16:35 -04:00
extract.py	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
format_lora.py	feat(eval): LoRA classifier pipeline — format + train on gated prospect-first labels	2026-06-30 14:37:42 -04:00
gpu.py	feat(eval): wire gpu.py mesh_join via @cocotte/infra-tools + net-tools	2026-07-01 07:18:49 -04:00
lib.py	feat(eval): identity gate layer 2 — AddressBook known-contact exclusion	2026-06-30 12:12:23 -04:00
mine_cluster.py	docs(prospector): grouped feature documentation + eval pipeline docs	2026-06-30 10:41:16 -04:00
rationalize.py	feat(prospector): classifier detects NOT-A-PROSPECT as first-class output	2026-06-30 09:30:14 -04:00
README.md	docs: update PLAN + AI docs for the data→model lane (LoRA, e2e, infra-tools)	2026-07-01 07:37:06 -04:00
replay.py	feat(eval): end-to-end replay test — proves the whole chain on real threads	2026-06-30 20:00:27 -04:00
replay_cls.py	feat(eval): two-leg e2e replay (single-model-at-a-time) + first e2e result	2026-07-01 06:49:14 -04:00
replay_gen.py	feat(eval): two-leg e2e replay (single-model-at-a-time) + first e2e result	2026-07-01 06:49:14 -04:00
run.py	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
score.py	feat(prospector): add tooling/eval draft-engine bake-off harness	2026-06-30 01:47:56 -04:00
sweep.py	feat(eval): explicit prospect-first CoT step (is_prospect)	2026-06-30 12:52:02 -04:00
train_lora.py	feat(eval): LoRA classifier pipeline — format + train on gated prospect-first labels	2026-06-30 14:37:42 -04:00

README.md

tooling/eval — the model eval & training-data pipeline

The runnable pipeline behind docs/features/training-loop.md and model-eval-pipeline.md: turn Quinn's message history into labeled training data, score the OSS model's drafts, and provision the GPU safely. Claude is the offline advisor/judge; the OSS uncensored model (Qwen3.6-27B-AEON) is the worker — it drafts the adult copy Claude won't.

PII discipline (hard rule)

Real conversations never enter the repo. All extracted threads, model outputs, and the handle map live under .data/ (gitignored). Phone numbers are pseudonymized (RQ_NN, per extract.py); the map stays in *.local.json (gitignored). Conversation text is sent only to the operator's own GPU droplet, encrypted in transit (SSH tunnel today; wg-mesh once onboarded). Only scripts + prompts are committed.

The pipeline (grouped by stage)

Shared extraction

File	Purpose
`lib.py`	Burst-aware, 1:1-only chat.db extraction. Collapses message bursts (one sender, up to 132 in a row — ~38% of runs), excludes group chats (style 43), yields CLIENT→QUINN decision points. The single correct extraction every script uses.

Labeling (build the training set)

File	Purpose
`mine_cluster.py`	Pull a labeled cluster by regex (e.g. `bbc`) → `(client_msg, Quinn's actual reply)`. For dense token-signals where regex works.
`sweep.py`	Semantic move-classification at scale. Classifies decision points into the move taxonomy (incl. the not-a-prospect gate: `existing_client / personal / vendor / spam`). Scales via `WORKERS`/`MAX_PER_HANDLE`. Finds the sparse classes (escalate/photographer) regex can't.
`rationalize.py`	Backward CoT distillation (STaR). Given a conversation + Quinn's actual reply, infer the move she ran + a reasoning trace anchored to it → the `(context → trace → move)` LoRA training rows.

Evaluation (the bake-off)

File	Purpose
`extract.py`	Build a pseudonymized eval set from the agent-matcher reply-queue + chat.db context.
`run.py`	The OSS model drafts Quinn's next text per the validated methodology (json_schema strict → 0% malformed, canon few-shot → on-voice, classify-move-first → matcher-level discipline).
`score.py`	Malformed %, on-voice %, move-agreement vs the matcher.

GPU lifecycle

File	Purpose
`gpu.py`	Thin config over `@cocotte/infra-tools` — the GPU + mesh lifecycle (region-fallback provisioning, external no-secret reaper, container serving, `mesh_join` via `@quinn/net-tools`) moved to the shared package; this file supplies only prospector's specifics (AEON 27B userdata, classifier serve cmd, volume). `up / serve-classifier / reap / install-reaper / down / status`.
`train_lora.py`, `format_lora.py`	LoRA the classifier: gated labels → chat-format SFT → `trl` SFTTrainer + `peft` on a small base (Qwen2.5-7B) → adapter + eval (move + is_prospect acc).
`replay.py`, `replay_cls.py`, `replay_gen.py`	End-to-end replay test — the whole chain (classify → draft → judge → decision) on held-out real threads, scored vs Quinn's gold. `replay_cls` = routing leg, `replay_gen` = generate+judge leg (the 27B needs ~74GB so it can't co-locate with the 7B — run one model at a time).
`auto_retry_replay.sh`	Opt-in launchd timer that runs the e2e replay the moment nyc2 H100 capacity returns, then self-disables.

Run it

# 1. Provision the GPU (auto-tears-down at idle/cap, even if the laptop sleeps)
python3 gpu.py up && python3 gpu.py install-reaper
ssh -f -N -L 8800:localhost:8000 root@<ip>          # encrypted tunnel to vLLM
export OSS_URL=http://localhost:8800/v1/chat/completions DATA_DIR="$PWD/.data"

# 2. Label the corpus at scale
WORKERS=64 MAX_PER_HANDLE=20 python3 sweep.py        # → .data/sweep_labels.json
WORKERS=64 python3 rationalize.py sweep_labels.json  # → .data/traincot_sweep_labels.json

# 3. Or run the bake-off eval
python3 extract.py && python3 run.py && python3 score.py

# 4. Done — tear down (model weights persist on the nyc2 volume)
python3 gpu.py down

Verdict so far (see model-eval-pipeline.md / ai-system-plan.md)

The OSS generator drafts Quinn's voice well (89% on-voice, 0 location errors after iteration) — adopt it for the draft engine; Claude stays the offline judge. The classifier needs the identity gate + clean-data LoRA before it's reliable (it aligned with her real replies only ~46% on the contaminated full corpus — the not-a-prospect gate above is the fix).

First LoRA result (2026-06-30)

Classifier LoRA on the gated, prospect-first labels (5081 train / 571 eval), base Qwen2.5-7B-Instruct, 2 epochs (~14 min on one H100):

metric	result
is_prospect accuracy (prospect-or-not gate)	97.0% (vs ~88% prompt-based)
move accuracy (12-class)	85.5%
valid JSON parsed	200/200

Validates the chain: identity gate → clean re-sweep → prospect-first CoT → LoRA.

End-to-end replay result (2026-07-01)

The whole chain on 84 held-out real threads (7 per move), one model at a time:

leg	result
is_prospect gate (routing)	90.5% (76/84)
move (stratified)	65.5% — common draft-worthy moves near-perfect; escalate + subhour 0/7 (undertrained) drag the equal-weighted avg; ~85% on the natural distribution
generate+judge (27B, 12 draft-worthy)	12/12 drafts; judge caught 4 real violations (incl. an address-move location leak)

Proves the chain, not just the pieces. The one real gap is rare-move training data (escalate/subhour). Auto-retry (auto_retry_replay.sh) reran this in ams3 when nyc2 was capacity-starved — region didn't matter, which motivated the @cocotte/infra-tools + mesh_join extraction.