Classifier serving — integration contract
Status: contract / not-yet-wired. The LoRA classifier exists and trains
reproducibly; this doc is the contract for serving it on the GPU droplet and
routing @prospector/ai-harness's prospect.classify task at it, distinct
from the 27B generator (quinn-oss).
This is the integration boundary only. It does not restate the training
pipeline — see training-loop.md,
model-eval-pipeline.md, and
tooling/eval/README.md.
1. What exists (the artifact)
A LoRA adapter trained by tooling/eval/train_lora.py:
| property | value | source |
|---|---|---|
| base model | Qwen/Qwen2.5-7B-Instruct | train_lora.py BASE |
| adapter path | /mnt/models/lora-classifier (on the prospector-models nyc2 volume) | train_lora.py OUT |
| LoRA config | r=16, alpha=32, dropout=0.05, all attn+MLP proj modules | train_lora.py peft_cfg |
| training data | 5081 train / 571 eval, gated prospect-first labels | tooling/eval/README.md |
| is_prospect accuracy | 97.0% (vs ~88% prompt-based) | eval, 2026-06-30 |
| move accuracy (12-class) | 85.5% | eval, 2026-06-30 |
| valid JSON | 200/200 | eval |
The prospector-models volume is nyc2-only (gpu.py VOL_REGION), so the
adapter is reachable only when the droplet lands in nyc2; off-region runs would
have to re-materialize it. The adapter from the first run was lost to an
ephemeral teardown but is fully reproducible via
gpu.py up + format_lora.py + train_lora.py.
Output schema (the SFT contract)
The adapter is trained to emit exactly this object and nothing else
(format_lora.py SYSTEM + example, sweep.py SCHEMA):
{ "is_prospect": <bool>, "move": "<one of 12 classes>", "trace": "<one sentence>" }
// JSON Schema the served request must pin (strict)
{
  "type": "object",
  "properties": {
    "is_prospect": { "type": "boolean" },
    "move": { "type": "string", "enum": [ /* the 12 moves, §3 */ ] },
    "trace": { "type": "string" }
  },
  "required": ["is_prospect", "move", "trace"],
  "additionalProperties": false
}
This is not the 22-atom prospect_atoms / ATOMS_JSON_SCHEMA object the
existing generator-based classifyRich produces. The LoRA classifier is a
lighter, single-purpose head that commits the move taxonomy directly.
2. Serving it ("quinn-classifier" on vLLM)
tooling/eval/gpu.py today provisions vLLM serving one model — the 27B
generator — via USERDATA:
--model AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 --served-model-name quinn-oss
vLLM serves the classifier adapter as a second served-model-name on the same
server using its multi-LoRA support, so the backend reaches both the
generator (quinn-oss) and the classifier (quinn-classifier) at one
GPU_INFERENCE_URL:
# added to the vLLM launch in gpu.py USERDATA
--enable-lora \
--lora-modules quinn-classifier=/mnt/models/lora-classifier \
--max-lora-rank 16
The OpenAI-compatible request then selects the head by its model field:
"quinn-oss" hits the 27B base, "quinn-classifier" hits the base+adapter.
Both answer at the same /v1/chat/completions and share /health.
Because the adapter was trained on Qwen/Qwen2.5-7B-Instruct, that 7B base must be co-resident with the 27B on the H100, or the classifier must run on a separate vLLM. If a separate server is chosen, expose it as
CLASSIFIER_INFERENCE_URL and instantiate a second VllmClient (see §4, option B). The single-server multi-LoRA path (option A) is preferred because it reuses the existing droplet, breaker, and idle-teardown timer.
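To make the "select the head by its model field" point concrete, here is a minimal sketch of the request body for each head. The helper name `buildChatRequest` and the placeholder prompts are hypothetical. The `response_format` shape follows the OpenAI-compatible structured-output convention, and whether the deployed vLLM version honors it this way is an assumption to verify against the running server.

```typescript
// Sketch only: field names follow the OpenAI-compatible chat API. Support for
// `response_format.json_schema` on the deployed vLLM build is an assumption.
type ChatMessage = { role: "system" | "user"; content: string };

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  response_format?: {
    type: "json_schema";
    json_schema: { name: string; schema: object; strict: boolean };
  };
}

// Served-model names taken from this document.
const GENERATOR_MODEL = "quinn-oss";
const CLASSIFIER_MODEL = "quinn-classifier";

// Hypothetical helper: builds the body POSTed to /v1/chat/completions.
// The only thing that differs between heads is the `model` string
// (plus the schema the caller chooses to pin).
function buildChatRequest(
  model: string,
  systemPrompt: string,
  userText: string,
  pinned?: { name: string; schema: object },
): ChatRequest {
  const body: ChatRequest = {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
  };
  if (pinned) {
    body.response_format = {
      type: "json_schema",
      json_schema: { name: pinned.name, schema: pinned.schema, strict: true },
    };
  }
  return body;
}

// Same endpoint, different head: only `model` changes.
const generatorBody = buildChatRequest(GENERATOR_MODEL, "<system>", "<message>");
const classifierBody = buildChatRequest(CLASSIFIER_MODEL, "<system>", "<message>", {
  name: "prospect_move",
  schema: { type: "object" }, // placeholder; the real schema is the strict one in §1
});
```

Under option B the bodies stay identical and only the base URL they are sent to differs.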
3. Move taxonomy + is_prospect-first contract
Twelve classes, the exact MOVES enum in tooling/eval/sweep.py. The model is
trained to decide is_prospect FIRST, then the move, then a one-sentence
trace — the prospect-first CoT (format_lora.py SYSTEM).
Hard invariant: is_prospect is false iff the move is one of the four
not-a-prospect classes. A consumer can derive one from the other; if a served
response ever violates this, treat it as malformed.
Prospect moves (is_prospect: true)
| move | meaning |
|---|---|
| opener | new hello / general interest, no specifics yet |
| pursue | real booking interest / rate / scheduling → engage toward booking |
| subhour | asks for a <1hr / half-hour rate |
| address | asks address before a time is locked → withhold |
| out_of_area | asks if she's in another city → pursue with outcall/FMTY if viable |
| of | harvester / out of budget → OnlyFans (selling sexting herself = pursue, not of) |
| disengage | lowballer / hostile / offering his body free → brief brush-off |
| escalate | photographer / collab / business opportunity → hold and surface to her |
Not-a-prospect moves (is_prospect: false)
| move | meaning |
|---|---|
| existing_client | already a client — mid-booking logistics, ongoing relationship |
| personal | friend / family / non-work |
| vendor | someone selling her a service (ads, hotel, rideshare, salon) |
| spam | bot / marketing / scam / wrong number |
The identity gate behind these four classes is what lifted real-reply alignment from ~46% (contaminated full corpus) to the 97% is_prospect gate — it is the load-bearing part of the contract, not cosmetic.
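The hard invariant and the two move tables above can be checked mechanically on the consumer side. The following is a minimal sketch of what a validator in the spirit of `parseProspectMove` might look like. The name `parseProspectMoveSketch` and its exact behavior (returning `null` for anything malformed) are assumptions for illustration, not the backend's actual implementation.

```typescript
// The 12 moves from the tables above, split by the is_prospect value they imply.
const PROSPECT_MOVES = [
  "opener", "pursue", "subhour", "address",
  "out_of_area", "of", "disengage", "escalate",
] as const;
const NON_PROSPECT_MOVES = ["existing_client", "personal", "vendor", "spam"] as const;

type Move = (typeof PROSPECT_MOVES)[number] | (typeof NON_PROSPECT_MOVES)[number];

interface ProspectMove {
  is_prospect: boolean;
  move: Move;
  trace: string;
}

function isNonProspectMove(value: string): boolean {
  return (NON_PROSPECT_MOVES as readonly string[]).indexOf(value) !== -1;
}

function isMove(value: unknown): value is Move {
  return (
    typeof value === "string" &&
    ((PROSPECT_MOVES as readonly string[]).indexOf(value) !== -1 || isNonProspectMove(value))
  );
}

// Hypothetical validator: returns null for any malformed response, including
// one that breaks the is_prospect <=> move invariant.
function parseProspectMoveSketch(raw: string): ProspectMove | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) return null;

  const obj = parsed as Record<string, unknown>;
  // Exactly three keys, mirroring "additionalProperties": false.
  if (Object.keys(obj).length !== 3) return null;

  const { is_prospect, move, trace } = obj;
  if (typeof is_prospect !== "boolean" || !isMove(move) || typeof trace !== "string") {
    return null;
  }
  // Invariant: is_prospect is false iff the move is a not-a-prospect class.
  if (is_prospect === isNonProspectMove(move)) return null;

  return { is_prospect, move, trace };
}
```

Returning `null` rather than throwing keeps the sketch in line with the null-on-failure contract the enrich path already follows.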
4. Backend routing (the one change)
The serving layer already exists. @prospector/ai-harness's VllmClient
(@packages/ai-harness/src/vllm-client.ts) is a direct OpenAI-compatible client
whose chatJson sends model: opts.model || this.model || '' — so a request
selects its served model purely by passing an explicit model string. No new
client code is required for the single-server path; the classifier is reached by
naming it.
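The fallback chain quoted above (`opts.model || this.model || ''`) is what makes routing a one-string change. A tiny stand-alone mirror of it, with the hypothetical name `resolveServedModel`, shows which served model each call ends up on:

```typescript
// Hypothetical mirror of the fallback described above; not the actual
// VllmClient code, just the same `a || b || ''` resolution in isolation.
function resolveServedModel(
  requested: string | undefined,
  clientDefault: string | undefined,
): string {
  // An empty or missing per-request model falls through to the client default.
  return requested || clientDefault || "";
}

// Today's classifyRich passes model: '' and lands on the generator.
const generatorTarget = resolveServedModel("", "quinn-oss"); // "quinn-oss"

// The classifier path names its head explicitly and overrides the default.
const classifierTarget = resolveServedModel("quinn-classifier", "quinn-oss"); // "quinn-classifier"
```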
What's needed is to (a) register a classifier task identity and (b) target it.
a. Task registry — @packages/ai-harness/src/task-registry.ts
prospect.classify is already a registered TaskKey (priority normal,
timeoutMs = ENRICH_TIMEOUT_MS = 240s, schemaName prospect_atoms). That
entry drives the generator-based 22-atom classify. The LoRA classifier is a
distinct task with its own strict-JSON schema name, so it gets its own entry:
export type TaskKey =
  | 'prospect.classify' // generator 22-atom rich classify (quinn-oss, prospect_atoms)
  | 'prospect.move'     // NEW: LoRA move classifier (quinn-classifier, prospect_move)
  | 'prospect.draft'
  | 'prospect.judge';

'prospect.move': {
  key: 'prospect.move',
  priority: 'high', // cheap 7B head, gates everything downstream
  timeoutMs: ENRICH_TIMEOUT_MS,
  schemaName: 'prospect_move',
},
(If the move classifier is meant to replace the 22-atom classify rather than
run alongside it, repoint prospect.classify's schemaName to prospect_move
instead of adding a key — but keep the two schemas distinct; they are not
interchangeable.)
b. Call site — src/gpu/gpu-enriched-classify.service.ts
classifyRich currently calls vllm.chatJson with model: '', which falls
through to the configured GPU_LLM_MODEL (quinn-oss) and the
ATOMS_JSON_SCHEMA. The classifier path passes the explicit served-model
name and the move schema instead. The minimal routing change is a sibling
method (or a branch) that supplies model: <classifier name>:
// reads the classifier served-model name from config (see §5); falls back to
// '' only if you intend single-model deployments to no-op the classifier.
const out = await this.vllm.chatJson<ProspectMove>({
  systemPrompt: system,
  messages: [{ role: 'user', content: user }],
  model: this.classifierModel, // <-- 'quinn-classifier', the ONE routing bit
  task: 'prospect.move',
  priority: 'high',
  schema: PROSPECT_MOVE_SCHEMA, // the §1 strict schema
  schemaName: 'prospect_move',
  parse: parseProspectMove, // validates the is_prospect⇔move invariant
  timeoutMs: TASK_REGISTRY['prospect.move'].timeoutMs,
});
Everything else — the instance circuit breaker, recordActivity() feeding the
idle-teardown timer, the null-on-any-failure fallback contract — is inherited
unchanged from the existing enrich path.
Option B (separate server)
If the classifier runs on its own vLLM, add a second provider in
src/gpu/gpu.module.ts mirroring vllmClientProvider, reading
CLASSIFIER_INFERENCE_URL + a fixed quinn-classifier model, and inject that
instance into the classifier method. The single-server multi-LoRA path (option
A) avoids this and is preferred.
5. Environment
The backend builds its VllmClient from config in src/gpu/gpu.module.ts
(config.get(...), never process.env directly).
| var | meaning | option A (one server) | option B (split) |
|---|---|---|---|
| GPU_INFERENCE_URL | OpenAI-compatible base URL; null disables enrich | shared by both heads | generator only |
| GPU_LLM_MODEL | generator served-model name | quinn-oss | quinn-oss |
| GPU_CLASSIFIER_MODEL | new: classifier served-model name | quinn-classifier | quinn-classifier |
| CLASSIFIER_INFERENCE_URL | new, option B only: classifier base URL | unset | the classifier vLLM |
With GPU_INFERENCE_URL absent the whole enrich path stays disabled and
classifyRich (and the classifier method) return null, so the module boots
clean and callers fall back to the fast/pastebin paths — the classifier is
additive, never load-bearing for boot.
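The table and the disabled-by-default rule above can be read as a small resolution function. This is a sketch under stated assumptions: `resolveClassifierTarget` and the `GpuConfig` shape are hypothetical, the example URLs are placeholders, and treating a missing `GPU_CLASSIFIER_MODEL` as "classifier off" is one possible reading rather than confirmed backend behavior.

```typescript
// Hypothetical view of the four variables from the table above.
interface GpuConfig {
  GPU_INFERENCE_URL?: string;
  GPU_LLM_MODEL?: string;
  GPU_CLASSIFIER_MODEL?: string;
  CLASSIFIER_INFERENCE_URL?: string;
}

interface ClassifierTarget {
  baseUrl: string;
  model: string;
}

// Returns where classifier requests should go, or null when the classifier
// (and the whole enrich path) should stay disabled.
function resolveClassifierTarget(cfg: GpuConfig): ClassifierTarget | null {
  // No GPU URL: the whole enrich path is off, so the classifier is too.
  if (!cfg.GPU_INFERENCE_URL) return null;

  // Assumption: without a classifier served-model name the classifier no-ops.
  if (!cfg.GPU_CLASSIFIER_MODEL) return null;

  // Option B when a dedicated URL is set, otherwise option A (shared server).
  const baseUrl = cfg.CLASSIFIER_INFERENCE_URL || cfg.GPU_INFERENCE_URL;
  return { baseUrl, model: cfg.GPU_CLASSIFIER_MODEL };
}

// Option A: both heads behind one URL (placeholder hostnames).
const optionA = resolveClassifierTarget({
  GPU_INFERENCE_URL: "http://gpu.example:8000/v1",
  GPU_LLM_MODEL: "quinn-oss",
  GPU_CLASSIFIER_MODEL: "quinn-classifier",
});

// Option B: classifier on its own server.
const optionB = resolveClassifierTarget({
  GPU_INFERENCE_URL: "http://gpu.example:8000/v1",
  GPU_LLM_MODEL: "quinn-oss",
  GPU_CLASSIFIER_MODEL: "quinn-classifier",
  CLASSIFIER_INFERENCE_URL: "http://gpu.example:8001/v1",
});
```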
6. Verification before declaring done
- gpu.py up lands in nyc2 (volume attached), vLLM /health 200.
- curl .../v1/models lists both quinn-oss and quinn-classifier.
- A quinn-classifier completion returns strict {is_prospect, move, trace}, move ∈ the 12-class enum, and the is_prospect ⇔ move invariant holds.
- Backend npm test + typecheck green; the classifier method returns null (not throws) when GPU_INFERENCE_URL is unset.
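The served-model check in this list can be automated once the `/v1/models` response has been fetched and parsed. The sketch below assumes the standard OpenAI-compatible list shape (`{ data: [{ id }] }`). The helper name `missingServedModels` is hypothetical, and the fetch itself is left out so the logic stays testable offline.

```typescript
// Minimal slice of an OpenAI-compatible /v1/models response.
interface ModelsResponse {
  data: { id: string }[];
}

// Hypothetical helper: returns the required served-model names that the
// server does NOT list. An empty array means the check passes.
function missingServedModels(resp: ModelsResponse, required: string[]): string[] {
  const served = resp.data.map((m) => m.id);
  return required.filter((name) => served.indexOf(name) === -1);
}

const REQUIRED_MODELS = ["quinn-oss", "quinn-classifier"];

// A server that only loaded the generator fails the check.
const generatorOnly: ModelsResponse = { data: [{ id: "quinn-oss" }] };
const stillMissing = missingServedModels(generatorOnly, REQUIRED_MODELS);

// A server exposing both heads passes.
const bothHeads: ModelsResponse = { data: [{ id: "quinn-oss" }, { id: "quinn-classifier" }] };
const noneMissing = missingServedModels(bothHeads, REQUIRED_MODELS);
```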
Last updated: 2026-06-30. Contract for plugging the 97%-gate LoRA classifier into the backend; serving + routing not yet wired in code.