prospector/docs/features/draft-engine.md
Natalie 2df18b5358
Some checks failed
CI / verify (push) Failing after 40s
docs(prospector): retire model-boss for @prospector/ai-harness; add DRAFT mode + alignment gate
- Replace every model-boss/coordinator reference with the @prospector/ai-harness
  story (direct vLLM client + classify/draft/judge/orchestrate task registry +
  on-demand GPU lifecycle + CoT-workflow runner + cost meter) across ai-first-v4,
  draft-engine, model-eval-pipeline, and PROSPECTOR.md; GPU_INFERENCE_URL is the
  canonical inference contract. Note ai-harness is promotable to a shared @ct
  package (onlyfans carries a parallel src/engine/classifier.ts).
- Fix migration collision: ai-first-v4 actor-attribution renumbered 0007 -> 0016
  (0007_tasks.sql exists; tree at 0013).
- Add the three missing pieces from the plan: a formal DRAFT runner mode distinct
  from PAUSE + DRAFT->GO graduation (new control-modes.md); a runtime per-draft
  alignment gate (deterministic facts/policy + GPU judge; spec_conflict/
  policy_conflict holds) in draft-engine's pipeline; and the facts/mission config
  schema (src/specs/, 0014_specs.sql) in ai-system-plan §5 + draft-engine.
- Index control-modes.md and the ai-harness rename in features/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 11:06:30 -04:00

8.1 KiB
Raw Permalink Blame History

Draft engine — OSS-on-GPU + CoT workflow builder

Why

Outbound (and rich classification) must run on OSS uncensored LLMs — never hosted Claude/OpenAI, which refuse adult-services copy. The models are tuned to Quinn's voice and the inbound/outbound task. The GPU is ON-DEMAND, not a standing droplet (no always-on GPU). Lifecycle is queue-driven, not strictly per-tick: provision on demand → warm the model → drain the work; keep it warm while the queue stays deep (a big classify/draft backlog amortizes the provision cost — don't thrash provisioning) → release when the queue drains / goes idle past a short grace window. So: hold across ticks when busy, never hold when idle. Reuse LPv2's existing on-demand DO GPU work (lilith-platform.live/scripts/provision-raw-gpu-droplet.sh + its vLLM/OpenAI-compatible inference) behind the @prospector/ai-harness package — prospector's self-hosted inference layer: a direct vLLM client + a task registry for classify / draft / judge / orchestrate, the on-demand GPU lifecycle, the prompt / CoT-workflow runner, and the cost meter. ai-harness talks to the droplet's inference endpoint directly over GPU_INFERENCE_URL (the canonical contract — see step 2 below); it does not depend on the retired model-boss coordinator.

ai-harness is prospector-local today but promotable to a shared @ct package@applications/onlyfans already carries a parallel src/engine/classifier.ts of the same inference-layer shape, so the direct-vLLM client + task registry + GPU lifecycle + cost meter want to live once and be consumed by both.

Engine identifier convention

prospector_settings.draft_engine names the active engine:

  • do-gpu-<modelname>_<version_build> — a specific OSS model + build on a DO GPU droplet, reached over its inference HTTP endpoint. Example: do-gpu-mistral-nemo-12b-uncensored_2026-06-29.1. The <version_build> pins the exact tuned weights so a decision is always traceable to the model that made it.
  • template — the MVP/fallback "static" engine: copy rendered from the 🌹 pastebin note, no LLM (zero-dependency, always-safe baseline; what ships first).

The runner records the engine id on every draft/decision (audit trail) so the trial can attribute behavior to a model build, and corrections can be bucketed per build for tuning.

From pastebin → CoT workflow builder

Today (MVP) the pastebin is a flat lookup: templateKey (①…⓱) → fixed string. That's the template engine. It is being converted into a CoT (chain-of-thought) workflow builder:

  • A workflow is a versioned, ordered chain the OSS model runs to generate the reply for a situation (archetype × state × templateKey), instead of emitting canned text:
    workflow {
      id, name, version,
      appliesTo: { archetypes[], states[], templateKeys[] },
      context: [ pastebin canon snippets, Quinn voice rules, safety rules ],   // the 🌹 note becomes injected context, not the output
      steps: [ { think: "read the thread + atoms; what does he actually want?" },
               { think: "pick the move per Quinn's rules; check Gate-2 constraints" },
               { draft: "write Quinn's reply in her voice, ≤N chars, no service/price in writing" } ],
      engine: "do-gpu-<model>_<build>"
    }
    
  • The builder lets the operator/coworker author + version these workflows (and A/B them per build). Corrections (prospect_corrections) feed back as few-shot/tuning data keyed to the workflow + engine build.
  • Pastebin canon (voice, lines, rules) moves from "the output" to "context the workflow injects" — single source of truth for voice, now consumed by the model rather than sent verbatim.

Pipeline placement (unchanged contract)

The runner stays the same; only the body-generation step swaps:

inbound → classify → decideNextAction(templateKey) → Gate-2 → mode
        → [DRAFT ENGINE] render body:
              template engine   → pastebin[templateKey]                (MVP)
              do-gpu engine     → run CoT workflow(appliesTo) on the model   (target)
        → [ALIGNMENT GATE] validate body vs current facts/policy            (new)
        → dispatch to macsync (GO) · stage pending_review (DRAFT) · hold

Fail-safe is preserved: if the engine yields no usable body (model down, no matching workflow), the candidate holds — never a placeholder send. Mode selects the terminal step — auto-dispatch (GO) vs stage for operator review (DRAFT) vs hold (PAUSE); see control-modes.md.

Runtime alignment gate (per-draft, new)

Eval (model-eval-pipeline.md) and the eval-gated build flip prove a build is safe on average on a held-out set. They do not prove that this freshly generated body didn't drift — a warm, passing model can still quote a stale rate, put service-for-money in writing, propose a call on a text_only thread, or name a city not on the current trip. The alignment gate is the per-draft check that closes that gap, inserted between generate and dispatch in advanceDraft:

  • Deterministic passsrc/engine/draft-align.ts (pure + co-located test): validateDraft(body, facts) → { ok, violations }. Checks price contradicting rate, service/price-in-writing (the existing service_in_writing / pay_after posture promoted to an outbound check), channel proposals vs text_only, a location not in facts, never_offer phrases, plus the voice hard-rules already in gpu/prompts.ts.
  • GPU judge pass (additive) — judgeDraftAlignment runs ai-harness's judge task on the same droplet; null when the GPU is down (same fail-soft as draftReply), so it only ever adds holds, never blocks on model availability.
  • Outcome — a clean draft proceeds (dispatch under GO, stage pending_review under DRAFT). A violation holds with a structured HoldReason: spec_conflict (facts) or policy_conflict (policy), extending the HoldReason union from ai-first-v4.md Phase 2.2. Bounded revise: the do-gpu engine may re-draft ≤2× with the violations injected; on continued failure it downgrades to the safe template body or HOLD. A contradicting draft is never presented or sent.

Facts / mission config — the alignment oracle

The alignment gate validates against a structured facts config, and the same config is injected into the generator prompt — one source, two consumers (generator input + gate oracle). Split by rate-of-change (see ai-system-plan.md §1): facts (rate, min duration, incall/outcall + location/window, channel flags e.g. text_only, OnlyFans handle, the never_offer list) change rarely → a stable config table in a new src/specs/ module (mirroring src/markets/), migration 0014_specs.sql; mission (today's goal/slots/market) is the daily runtime dial (Mission Control, ai-system-plan.md gap #1) and can start as a settings field. The generator reads it to steer; the alignment gate reads it to verify the body didn't drift from it.

Build path

  1. MVP (done): template engine = pastebin static render; fail-safe holds.
  2. GPU client: a GpuDraftEngine that POSTs rendered CoT prompts (the whole batch) to do-gpu-<model>_<build>'s inference endpoint (env GPU_INFERENCE_URL), behind an on-demand GPU lifecycle manager: provision when work arrives, keep warm while the queue stays deep, release after a short idle grace. Selected when draft_engine matches do-gpu-*.
  3. Workflow store + builder: persist workflows (own DB, versioned) + author/edit surface (panel or config) + bind to engine builds; wire corrections as per-build tuning data.
  4. Tuning loop: export prospect_corrections per engine build → fine-tune/eval → publish next do-gpu-<model>_<build> → flip draft_engine.

Open scope (confirm before building step 3)

  • Workflow authoring: DB-backed builder UI in the panel, vs config files in-repo, vs both?
  • Granularity: one workflow per templateKey, or per (archetype × state)?
  • Does classification also move to a CoT workflow on the GPU, or stay fast-rules + atom-extraction?