prospector/docs/features/classifier-serving.md
Natalie 2e035355f5
feat(eval): gpu.py serve-classifier + classifier-serving integration contract
serve-classifier: launch the trained LoRA classifier (Qwen2.5-7B +
/mnt/models/lora-classifier adapter) via vLLM --enable-lora on :8001,
coexisting with the 27B generator on :8000 (gpu-mem 0.25, /mnt/models ro).
classifier-serving.md: the integration contract — how prospect.classify routes
to the served quinn-classifier model (the one ai-harness/backend change), env,
the 12-move + is_prospect-first schema. Docs/my-lane only; backend wiring
punch-listed for the parallel agent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:12:22 -04:00


Classifier serving — integration contract

Status: contract / not-yet-wired. The LoRA classifier exists and trains reproducibly; this doc is the contract for serving it on the GPU droplet and routing @prospector/ai-harness's prospect.classify task at it, distinct from the 27B generator (quinn-oss).

This is the integration boundary only. It does not restate the training pipeline — see training-loop.md, model-eval-pipeline.md, and tooling/eval/README.md.


1. What exists (the artifact)

A LoRA adapter trained by tooling/eval/train_lora.py:

| property | value | source |
|---|---|---|
| base model | Qwen/Qwen2.5-7B-Instruct | train_lora.py BASE |
| adapter path | /mnt/models/lora-classifier (on the prospector-models nyc2 volume) | train_lora.py OUT |
| LoRA config | r=16, alpha=32, dropout=0.05, all attn+MLP proj modules | train_lora.py peft_cfg |
| training data | 5081 train / 571 eval, gated prospect-first labels | tooling/eval/README.md |
| is_prospect accuracy | 97.0% (vs ~88% prompt-based) | eval, 2026-06-30 |
| move accuracy (12-class) | 85.5% | eval, 2026-06-30 |
| valid JSON | 200/200 | eval |

The prospector-models volume is nyc2-only (gpu.py VOL_REGION), so the adapter is reachable only when the droplet lands in nyc2; off-region runs would have to re-materialize it. The adapter from the first run was lost to an ephemeral teardown but is fully reproducible via gpu.py up + format_lora.py + train_lora.py.

Output schema (the SFT contract)

The adapter is trained to emit exactly this object and nothing else (format_lora.py SYSTEM + example, sweep.py SCHEMA):

{ "is_prospect": <bool>, "move": "<one of 12 classes>", "trace": "<one sentence>" }
// JSON Schema the served request must pin (strict)
{
  "type": "object",
  "properties": {
    "is_prospect": { "type": "boolean" },
    "move":        { "type": "string", "enum": [ /* the 12 moves, §3 */ ] },
    "trace":       { "type": "string" }
  },
  "required": ["is_prospect", "move", "trace"],
  "additionalProperties": false
}

This is not the 22-atom prospect_atoms / ATOMS_JSON_SCHEMA object the existing generator-based classifyRich produces. The LoRA classifier is a lighter, single-purpose head that commits the move taxonomy directly.
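As a rough illustration of the schema above, the sketch below writes it out as a TypeScript constant with a matching type and a small structural guard. The move names are copied from the §3 tables; `isProspectMove` is a hypothetical helper, and the real `PROSPECT_MOVE_SCHEMA` in the backend may be organized differently.

```typescript
// The twelve move classes, copied from the section 3 tables of this document.
const MOVES = [
  "opener", "pursue", "subhour", "address",
  "out_of_area", "of", "disengage", "escalate",      // is_prospect: true
  "existing_client", "personal", "vendor", "spam",   // is_prospect: false
] as const;

type Move = (typeof MOVES)[number];

// Shape the adapter is trained to emit.
interface ProspectMove {
  is_prospect: boolean;
  move: Move;
  trace: string;
}

// Mirrors the strict JSON Schema shown above.
const PROSPECT_MOVE_SCHEMA = {
  type: "object",
  properties: {
    is_prospect: { type: "boolean" },
    move: { type: "string", enum: [...MOVES] },
    trace: { type: "string" },
  },
  required: ["is_prospect", "move", "trace"],
  additionalProperties: false,
} as const;

// Hypothetical guard: exactly the three required keys, correct primitive
// types, and a move drawn from the enum. Extra keys are rejected, matching
// additionalProperties: false.
function isProspectMove(value: unknown): value is ProspectMove {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  const required = [...PROSPECT_MOVE_SCHEMA.required].sort();
  if (keys.length !== required.length) return false;
  if (keys.some((k, i) => k !== required[i])) return false;

  const move = obj["move"];
  return (
    typeof obj["is_prospect"] === "boolean" &&
    typeof obj["trace"] === "string" &&
    typeof move === "string" &&
    MOVES.some((m) => m === move)
  );
}
```

This guard checks shape only. The is_prospect/move consistency rule is a separate check on top of it.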


2. Serving it ("quinn-classifier" on vLLM)

tooling/eval/gpu.py today provisions vLLM serving one model — the 27B generator — via USERDATA:

--model AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 --served-model-name quinn-oss

vLLM serves the classifier adapter as a second served-model-name on the same server using its multi-LoRA support, so the backend reaches both the generator (quinn-oss) and the classifier (quinn-classifier) at one GPU_INFERENCE_URL:

# added to the vLLM launch in gpu.py USERDATA
--enable-lora \
--lora-modules quinn-classifier=/mnt/models/lora-classifier \
--max-lora-rank 16

The OpenAI-compatible request then selects the head by its model field: "quinn-oss" hits the 27B base, "quinn-classifier" hits the base+adapter. Both answer at the same /v1/chat/completions and share /health.
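To make the "select the head by its model field" point concrete, here is a minimal sketch of how the two requests would differ. `buildChatRequest` is a hypothetical helper, not part of VllmClient, and it assumes the base URL is the server root without a trailing `/v1`.

```typescript
type ServedModel = "quinn-oss" | "quinn-classifier";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  body: { model: ServedModel; messages: ChatMessage[] };
}

// Builds the request for either head. The endpoint is identical; only the
// `model` string in the body changes.
function buildChatRequest(
  baseUrl: string,
  model: ServedModel,
  system: string,
  user: string,
): ChatRequest {
  const root = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${root}/v1/chat/completions`,
    body: {
      model,
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    },
  };
}
```

Calling it twice with the same base URL and different `model` values yields the same `url` and bodies that differ only in `model`.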

A LoRA adapter loads only onto the base model it was trained against, so quinn-classifier needs a vLLM process whose base is Qwen2.5-7B-Instruct. That 7B base must either be co-resident with the 27B on the H100 or run on a separate vLLM. If a separate server is chosen, expose it as CLASSIFIER_INFERENCE_URL and instantiate a second VllmClient (see §4, option B). The single-server multi-LoRA path (option A) is preferred because it reuses the existing droplet, breaker, and idle-teardown timer.


3. Move taxonomy + is_prospect-first contract

Twelve classes, the exact MOVES enum in tooling/eval/sweep.py. The model is trained to decide is_prospect FIRST, then the move, then a one-sentence trace — the prospect-first CoT (format_lora.py SYSTEM).

Hard invariant: is_prospect is false iff the move is one of the four not-a-prospect classes. A consumer can derive one from the other; if a served response ever violates this, treat it as malformed.

Prospect moves (is_prospect: true)

| move | meaning |
|---|---|
| opener | new hello / general interest, no specifics yet |
| pursue | real booking interest / rate / scheduling → engage toward booking |
| subhour | asks for a <1hr / half-hour rate |
| address | asks address before a time is locked → withhold |
| out_of_area | asks if she's in another city → pursue with outcall/FMTY if viable |
| of | harvester / out of budget → OnlyFans (selling sexting herself = pursue, not of) |
| disengage | lowballer / hostile / offering his body free → brief brush-off |
| escalate | photographer / collab / business opportunity → hold and surface to her |

Not-a-prospect moves (is_prospect: false)

| move | meaning |
|---|---|
| existing_client | already a client: mid-booking logistics, ongoing relationship |
| personal | friend / family / non-work |
| vendor | someone selling her a service (ads, hotel, rideshare, salon) |
| spam | bot / marketing / scam / wrong number |

The identity gate behind these four classes is what lifted real-reply alignment from ~46% (contaminated full corpus) to the 97% is_prospect gate — it is the load-bearing part of the contract, not cosmetic.
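The hard invariant can be checked in a few lines. The sketch below is one possible shape for that check, assuming the move string has already been validated against the 12-class enum; `checkInvariant` and `expectedIsProspect` are hypothetical names, not the backend's `parseProspectMove`.

```typescript
// The four not-a-prospect classes from the table above.
const NOT_A_PROSPECT = new Set<string>([
  "existing_client",
  "personal",
  "vendor",
  "spam",
]);

interface MoveDecision {
  is_prospect: boolean;
  move: string;
}

// is_prospect is fully determined by the move: false for the four
// not-a-prospect classes, true for the other eight.
function expectedIsProspect(move: string): boolean {
  return !NOT_A_PROSPECT.has(move);
}

// Returns the response unchanged when the invariant holds, or null when the
// two fields disagree, so callers can treat a violation as malformed output.
function checkInvariant<T extends MoveDecision>(out: T): T | null {
  return out.is_prospect === expectedIsProspect(out.move) ? out : null;
}
```

Returning null rather than silently correcting one field keeps a violation visible to the caller, in line with the "treat it as malformed" rule.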


4. Backend routing (the one change)

The serving layer already exists. @prospector/ai-harness's VllmClient (@packages/ai-harness/src/vllm-client.ts) is a direct OpenAI-compatible client whose chatJson sends model: opts.model || this.model || '' — so a request selects its served model purely by passing an explicit model string. No new client code is required for the single-server path; the classifier is reached by naming it.

What's needed is to (a) register a classifier task identity and (b) target it.

a. Task registry — @packages/ai-harness/src/task-registry.ts

prospect.classify is already a registered TaskKey (priority normal, timeoutMs = ENRICH_TIMEOUT_MS = 240s, schemaName prospect_atoms). That entry drives the generator-based 22-atom classify. The LoRA classifier is a distinct task with its own strict-JSON schema name, so it gets its own entry:

export type TaskKey =
  | 'prospect.classify'   // generator 22-atom rich classify (quinn-oss, prospect_atoms)
  | 'prospect.move'       // NEW: LoRA move classifier (quinn-classifier, prospect_move)
  | 'prospect.draft'
  | 'prospect.judge';

'prospect.move': {
  key: 'prospect.move',
  priority: 'high',         // cheap 7B head, gates everything downstream
  timeoutMs: ENRICH_TIMEOUT_MS,
  schemaName: 'prospect_move',
},

(If the move classifier is meant to replace the 22-atom classify rather than run alongside it, repoint prospect.classify's schemaName to prospect_move instead of adding a key — but keep the two schemas distinct; they are not interchangeable.)
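A self-contained sketch of how the two classify entries could sit side by side follows. It covers only the two keys whose values this document states (prospect.draft and prospect.judge are omitted because their settings are not given here), and the `TaskSpec` shape and `specFor` helper are assumptions about the registry, not its actual definition.

```typescript
type Priority = "high" | "normal";
type TaskKey = "prospect.classify" | "prospect.move";

interface TaskSpec {
  key: TaskKey;
  priority: Priority;
  timeoutMs: number;
  schemaName: string;
}

// 240 s, as stated for the existing enrich path.
const ENRICH_TIMEOUT_MS = 240 * 1000;

// The mapped type forces one entry per TaskKey and requires each entry's
// `key` field to match the property it is stored under.
const TASK_REGISTRY: { [K in TaskKey]: TaskSpec & { key: K } } = {
  "prospect.classify": {
    key: "prospect.classify",
    priority: "normal",
    timeoutMs: ENRICH_TIMEOUT_MS,
    schemaName: "prospect_atoms", // generator, 22-atom object
  },
  "prospect.move": {
    key: "prospect.move",
    priority: "high",
    timeoutMs: ENRICH_TIMEOUT_MS,
    schemaName: "prospect_move", // LoRA classifier, 3-field object
  },
};

// Hypothetical lookup helper.
function specFor(key: TaskKey): TaskSpec {
  return TASK_REGISTRY[key];
}
```

With this typing, adding a new member to `TaskKey` without a registry entry fails to compile, which helps keep the two schema names from drifting apart.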

b. Call site — src/gpu/gpu-enriched-classify.service.ts

classifyRich currently calls vllm.chatJson with model: '', which falls through to the configured GPU_LLM_MODEL (quinn-oss) and the ATOMS_JSON_SCHEMA. The classifier path passes the explicit served-model name and the move schema instead. The minimal routing change is a sibling method (or a branch) that supplies model: <classifier name>:

// reads the classifier served-model name from config (see §5). An empty string does
// not no-op the classifier: chatJson falls through to GPU_LLM_MODEL (quinn-oss).
const out = await this.vllm.chatJson<ProspectMove>({
  systemPrompt: system,
  messages: [{ role: 'user', content: user }],
  model: this.classifierModel,        // <-- 'quinn-classifier', the ONE routing bit
  task: 'prospect.move',
  priority: 'high',
  schema: PROSPECT_MOVE_SCHEMA,        // the §1 strict schema
  schemaName: 'prospect_move',
  parse: parseProspectMove,            // validates the is_prospect⇔move invariant
  timeoutMs: TASK_REGISTRY['prospect.move'].timeoutMs,
});

Everything else — the instance circuit breaker, recordActivity() feeding the idle-teardown timer, the null-on-any-failure fallback contract — is inherited unchanged from the existing enrich path.

Option B (separate server)

If the classifier runs on its own vLLM, add a second provider in src/gpu/gpu.module.ts mirroring vllmClientProvider, reading CLASSIFIER_INFERENCE_URL + a fixed quinn-classifier model, and inject that instance into the classifier method. The single-server multi-LoRA path (option A) avoids this and is preferred.


5. Environment

The backend builds its VllmClient from config in src/gpu/gpu.module.ts (config.get(...), never process.env directly).

| var | meaning | option A (one server) | option B (split) |
|---|---|---|---|
| GPU_INFERENCE_URL | OpenAI-compatible base URL; null disables enrich | shared by both heads | generator only |
| GPU_LLM_MODEL | generator served-model name | quinn-oss | quinn-oss |
| GPU_CLASSIFIER_MODEL | new; classifier served-model name | quinn-classifier | quinn-classifier |
| CLASSIFIER_INFERENCE_URL | new, option B only; classifier base URL | unset | the classifier vLLM |

With GPU_INFERENCE_URL absent the whole enrich path stays disabled and classifyRich (and the classifier method) return null, so the module boots clean and callers fall back to the fast/pastebin paths — the classifier is additive, never load-bearing for boot.
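The option A / option B split and the disabled case can be expressed as one small resolution function. This is a sketch under assumptions: `resolveClassifierTarget` and `GpuConfig` are hypothetical names, the config is modeled as a plain object rather than the backend's `config.get(...)`, and returning null when GPU_CLASSIFIER_MODEL is unset (instead of falling through to the generator) is a design choice made here, not something the contract specifies.

```typescript
interface GpuConfig {
  GPU_INFERENCE_URL?: string | null;
  GPU_LLM_MODEL?: string | null;
  GPU_CLASSIFIER_MODEL?: string | null;
  CLASSIFIER_INFERENCE_URL?: string | null;
}

interface ClassifierTarget {
  baseUrl: string;
  model: string;
}

// Returns where classifier requests should go, or null when the classifier
// path is disabled.
function resolveClassifierTarget(cfg: GpuConfig): ClassifierTarget | null {
  const generatorUrl = cfg.GPU_INFERENCE_URL;
  if (!generatorUrl) return null; // whole enrich path is off

  const model = cfg.GPU_CLASSIFIER_MODEL;
  if (!model) return null; // assumption: never fall through to quinn-oss

  // Option B when a dedicated URL is set, option A otherwise.
  const baseUrl = cfg.CLASSIFIER_INFERENCE_URL || generatorUrl;
  return { baseUrl, model };
}
```

A caller would treat a null result the same way as any other enrich failure and fall back to the existing paths.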


6. Verification before declaring done

  1. gpu.py up lands in nyc2 (volume attached), vLLM /health 200.
  2. curl .../v1/models lists both quinn-oss and quinn-classifier.
  3. A quinn-classifier completion returns strict {is_prospect, move, trace}, move ∈ the 12-class enum, and the is_prospect ⇔ move invariant holds.
  4. Backend npm test + typecheck green; the classifier method returns null (not throws) when GPU_INFERENCE_URL is unset.
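Step 2 of the checklist can be automated once the `/v1/models` response has been fetched and parsed. The sketch below assumes the standard OpenAI-style list shape (`data` as an array of objects with an `id`); `missingServedModels` is a hypothetical helper and does no network I/O itself.

```typescript
// Minimal view of an OpenAI-compatible GET /v1/models response.
interface ModelList {
  data: { id: string }[];
}

// Returns the required served-model names that the server does not list.
// An empty result means every required head is being served.
function missingServedModels(list: ModelList, required: string[]): string[] {
  const served = new Set(list.data.map((m) => m.id));
  return required.filter((name) => !served.has(name));
}
```

For this contract the required list would be `["quinn-oss", "quinn-classifier"]`, and the check passes only when the returned array is empty.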

Last updated: 2026-06-30. Contract for plugging the 97%-gate LoRA classifier into the backend; serving + routing not yet wired in code.