@prospector/ai-harness — self-hosted inference layer
Prospector's inference layer: a direct vLLM client + a typed task registry
+ the Quinn-voice prompt builders. It talks directly to the vLLM server
(OpenAI-compatible /v1/chat/completions) on the on-demand DO GPU droplet over
GPU_INFERENCE_URL. There is no model-boss coordinator; it has been retired.
Inference is self-hosted only, never managed/Claude in the runtime loop.
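The protocol is plain OpenAI-compatible chat completions, so the request a direct client sends can be sketched as a pure function. This is a minimal sketch, not VllmClient's actual wire format. It assumes strict JSON is requested through the OpenAI-style `response_format.json_schema` field, which vLLM's OpenAI-compatible server accepts. The real client might use vLLM's own guided-decoding parameters instead. The names `buildChatBody` and `chatCompletionsUrl` are hypothetical.

```typescript
// Hypothetical sketch of the body a direct vLLM client might POST to
// `${GPU_INFERENCE_URL}/v1/chat/completions`. The response_format.json_schema
// field is an assumption about how strict JSON is requested.

type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatCompletionBody {
  model: string;
  messages: ChatMessage[];
  response_format: {
    type: "json_schema";
    json_schema: { name: string; schema: Record<string, unknown>; strict: boolean };
  };
}

// Joins the base URL and the OpenAI-compatible path, tolerating trailing slashes.
function chatCompletionsUrl(baseUrl: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`;
}

// Prepends the system prompt and asks the server for schema-constrained JSON.
function buildChatBody(
  model: string,
  systemPrompt: string,
  messages: ChatMessage[],
  schemaName: string,
  schema: Record<string, unknown>,
): ChatCompletionBody {
  return {
    model,
    messages: [{ role: "system", content: systemPrompt }, ...messages],
    response_format: {
      type: "json_schema",
      json_schema: { name: schemaName, schema, strict: true },
    },
  };
}
```

Keeping body construction pure (no fetch inside) matches the package's "no I/O beyond fetch" rule and makes the request shape unit-testable.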
Framework-agnostic: no NestJS, no process.env, no I/O beyond fetch. The backend
constructs VllmClient from ConfigService and wires it as a Nest provider
(src/gpu/gpu.module.ts); the GPU droplet lifecycle (provision / teardown /
idle-shutdown) stays in the backend's src/gpu module.
Files (one concern each — SRP)
| File | Concern |
|---|---|
| vllm-client.ts | VllmClient.chatJson (strict-JSON completion) + health, with an instance circuit breaker. Co-located test. |
| task-registry.ts | TASK_REGISTRY: typed prospect.classify / prospect.draft / prospect.judge (placeholder for the alignment gate) + per-task priority / timeout / schema name. |
| prompts.ts | Pure prompt builders: buildClassifyPrompt (22-atom), buildDraftPrompt (Quinn voice verbatim) + DRAFT_SCHEMA / parseDraft. |
| types.ts | ChatJsonOpts, ChatPriority, InferenceHealth. |
| index.ts | Public API manifest. |
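A typed task registry like the one in task-registry.ts can use a mapped type to bind each entry's `key` to the property it is stored under, so a mismatched key fails to compile. This is a minimal sketch under assumptions. The `ChatPriority` values and every priority, timeout and schema name below are placeholders, not the real registry's values, and `isTaskKey` is a hypothetical helper.

```typescript
// Hypothetical sketch of a typed task registry; all values are placeholders.

type ChatPriority = "high" | "normal" | "low"; // assumed; the real ChatPriority may differ

interface TaskSpec<K extends string> {
  key: K;
  priority: ChatPriority;
  timeoutMs: number;
  schemaName: string;
}

const TASK_KEYS = ["prospect.classify", "prospect.draft", "prospect.judge"] as const;
type TaskKey = (typeof TASK_KEYS)[number];

// Mapped type: each entry's `key` must equal the property it is stored under.
type Registry = { [K in TaskKey]: TaskSpec<K> };

const TASK_REGISTRY: Registry = {
  "prospect.classify": { key: "prospect.classify", priority: "normal", timeoutMs: 15_000, schemaName: "classify" },
  "prospect.draft": { key: "prospect.draft", priority: "high", timeoutMs: 30_000, schemaName: "draft" },
  // Placeholder entry, mirroring the "placeholder for the alignment gate".
  "prospect.judge": { key: "prospect.judge", priority: "low", timeoutMs: 20_000, schemaName: "judge" },
};

// Narrows an arbitrary string (e.g. from a queue payload) to a known task key.
function isTaskKey(s: string): s is TaskKey {
  return (TASK_KEYS as readonly string[]).indexOf(s) !== -1;
}
```

Because `Registry` is keyed by `TaskKey`, adding a new task to `TASK_KEYS` without a matching registry entry is also a compile error.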
Contract
VllmClient is fail-soft by construction. When url is absent, isEnabled() returns
false and callers skip straight to their fast/pastebin fallbacks. Any transport
or parse error throws VllmError, which the enrich path catches and turns into null
(the classify/draft contract is unchanged from the model-boss era). An instance
circuit breaker trips after 3 consecutive connection failures and then fails fast
for 60 s.
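The circuit-breaker rule above (3 consecutive connection failures, then fail fast for 60 s) can be sketched as a small class with an injectable clock. This is a minimal illustration, not VllmClient's internals. The class and method names are hypothetical. After the cooldown, this sketch simply lets calls through again and needs another full run of failures to re-open; it has no half-open single-probe state.

```typescript
// Hypothetical consecutive-failure circuit breaker (3 failures -> open for 60 s).
// The clock is injectable so the timing can be tested deterministically.

class CircuitOpenError extends Error {}

class ConsecutiveFailureBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private readonly threshold = 3,
    private readonly cooldownMs = 60_000,
    private readonly now: () => number = () => Date.now(),
  ) {}

  // True while the breaker is tripped and calls should fail fast.
  isOpen(): boolean {
    return this.now() < this.openUntil;
  }

  // Call before each request: throws immediately while open, otherwise a no-op.
  check(): void {
    if (this.isOpen()) {
      throw new CircuitOpenError("vLLM circuit open; failing fast");
    }
  }

  // Any success resets the consecutive-failure count.
  recordSuccess(): void {
    this.failures = 0;
  }

  // The threshold-th consecutive connection failure opens the breaker for cooldownMs.
  recordConnectionFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openUntil = this.now() + this.cooldownMs;
      this.failures = 0;
    }
  }
}
```

Only connection failures feed the counter here. A parse error from a reachable server says nothing about whether the droplet is up, which is presumably why the README scopes the breaker to connection failures.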
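```typescript
import { VllmClient, TASK_REGISTRY, buildDraftPrompt, DRAFT_SCHEMA, parseDraft } from '@prospector/ai-harness';

// The caller (not this package) reads config; a null url means isEnabled() is false.
const vllm = new VllmClient({ url: process.env.GPU_INFERENCE_URL ?? null, model: process.env.GPU_LLM_MODEL ?? null });

// Per-task priority / timeout / schema name come from the registry.
const t = TASK_REGISTRY['prospect.draft'];

// handle, archetype and inbound are supplied by the calling enrich code.
const { system, user } = buildDraftPrompt(handle, archetype, inbound);

// Throws VllmError on transport/parse failure (see Contract).
const out = await vllm.chatJson({
  systemPrompt: system,
  messages: [{ role: 'user', content: user }],
  model: '', // empty here; GPU_LLM_MODEL is the served-model-name fallback (see Config)
  task: t.key,
  priority: t.priority,
  schema: DRAFT_SCHEMA,
  schemaName: t.schemaName,
  parse: parseDraft,
  timeoutMs: t.timeoutMs,
});
```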
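The fail-soft boundary described in the contract (disabled client or VllmError, both ending in null) can be sketched as a small wrapper around any chatJson call. This is a minimal sketch, not the backend's actual enrich code. `VllmError` below is a local stand-in for the package's class, and `softly` is a hypothetical helper name. Errors other than VllmError are deliberately rethrown so that programming bugs are not silently swallowed.

```typescript
// Hypothetical fail-soft wrapper. VllmError is a local stand-in for the package's class.

class VllmError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "VllmError";
  }
}

async function softly<T>(isEnabled: boolean, call: () => Promise<T>): Promise<T | null> {
  // Mirrors "url absent => isEnabled() is false => caller falls back".
  if (!isEnabled) return null;
  try {
    return await call();
  } catch (err) {
    if (err instanceof VllmError) return null; // transport/parse failure: use the fallback
    throw err; // anything else is a bug and should surface
  }
}
```

A caller would then write something like `const draft = await softly(vllm.isEnabled(), () => vllm.chatJson({ ... }))` and take its fast/pastebin fallback whenever `draft` is null.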
Build
npm run build --workspace @prospector/ai-harness emits dist/ plus .d.ts files
(the backend resolves the package from there). Run npm run typecheck and npm test
for the co-located checks.
Config (read at the backend wiring layer, not here)
GPU_INFERENCE_URL (the vLLM base URL; replaces the retired MODEL_BOSS_URL) and
GPU_LLM_MODEL (the fallback for the served model name).
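Because config is read at the backend wiring layer, the backend needs a small step that turns raw values into the `{ url, model }` options passed to VllmClient. This is a minimal sketch, assuming a getter abstraction (`ConfigGetter`) that could wrap NestJS `ConfigService#get` or a `process.env` lookup. `readVllmOptions` is a hypothetical name. Treating blank values as unset is an assumption, chosen so that an empty GPU_INFERENCE_URL keeps the client disabled rather than pointing it at an empty URL.

```typescript
// Hypothetical backend-side config read producing VllmClient-style options.

interface VllmOptions {
  url: string | null;
  model: string | null;
}

// Stand-in for ConfigService#get or a process.env lookup.
type ConfigGetter = (key: string) => string | undefined;

// Blank or whitespace-only values are treated as unset (an assumption).
function readVllmOptions(get: ConfigGetter): VllmOptions {
  const clean = (v: string | undefined): string | null => {
    const trimmed = v?.trim();
    return trimmed ? trimmed : null;
  };
  return {
    url: clean(get("GPU_INFERENCE_URL")),
    model: clean(get("GPU_LLM_MODEL")),
  };
}
```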
Promotable
Today this package is prospector-local, but it is promotable to a shared @ct
package: @applications/onlyfans carries a parallel inference layer of the same
shape, so the direct-vLLM client, task registry and prompts should live in one
place and be consumed by both. Keeping it framework-agnostic keeps that move cheap.