Natalie 20d52431bc

docs(deploy): edge basic_auth + token injection resolved; open = https registry, ssh, wg1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-07-01 07:51:32 -04:00

Deploy — Prospector prod on `ct.prod` (the hardened public DMZ host)

Topology (authoritative, 2026-06-30): ct.prod is the public prod host

The public sales edge does NOT live on lime. lime is the internal store/backend box and keeps zero public app ports. Prospector's prod target is ct.prod (com.uvlava.ct.prod) — a new, dedicated, hardened DO droplet (nyc3, store VPC, joins wg1) whose only job is to face the internet:

internet --(80/443)--> Caddy on ct.prod --(127.0.0.1:3210)--> NestJS app
ct.prod  --(store VPC 10.20.0.0/24)-----> DO Managed PG (lilith-store-pg, private)
ct.prod  --(wg1 mesh 10.9.0.0/24)-------> people / mac-sync / mr-number

Public name: apps.ftw.pw (Caddy + Let's Encrypt). ftw.pw is a SEPARATE zone, not DO-managed — see the DNS step below.
The app binds 127.0.0.1:3210 only. Caddy is the sole public listener and 403s /internal/* (the mac-sync inbound webhook + peers); macsync hits /internal/inbound over the mesh (http://10.9.0.10:3210/internal/inbound), never the public leg.
DB + mesh deps over private paths only. DO Managed PG over the VPC; people/mac-sync/mr-number over wg1. mac-sync runs on the operator's Mac (not lime, not ct.prod) — MACSYNC_BASE_URL/MACSYNC_DEVICE_ID are operator-set.
lime stays internal (mesh-only; no app/edge ports).
IaC: uvlava/terraform/do/ct_prod.tf (count-gated ct_prod_enabled; droplet + reserved IP + cloud firewall 80/443 public, 22+wg mesh-only). Hardened cloud-init cloud-init/ct-prod.yaml: ufw, fail2ban, unattended-upgrades, non-root deploy user, node20. Mesh entry: mesh-hosts.json host ct.prod, wg 10.9.0.10.
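The loopback-only bind described above can be checked from the listening sockets on ct.prod. A minimal sketch, assuming `ss` from iproute2 is available on the host; `assert_loopback_only` is a hypothetical helper name, not something from the repo. It reads `ss -ltnH` output on stdin and succeeds only if the port is bound and every listener for it sits on 127.0.0.1.

```shell
# Hypothetical helper: reads `ss -ltnH` output on stdin. Succeeds only if the
# given port has at least one listener and all of them are on 127.0.0.1.
assert_loopback_only() {
  local port="$1"
  awk -v port="$port" '
    {
      local_addr = $4                      # e.g. 127.0.0.1:3210 or *:3210
      n = split(local_addr, parts, ":")
      if (parts[n] != port) next           # a different port; ignore it
      seen = 1
      if (local_addr != "127.0.0.1:" port) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}
```

Usage could be `ssh <ct.prod> 'ss -ltnH' | assert_loopback_only 3210`; a wildcard bind such as `0.0.0.0:3210` or `*:3210` makes it fail.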

As-built (2026-07-01) — first live bring-up + gotchas

The first real deploy landed. What actually happened, and the traps to avoid:

Live host: droplet 581442557 (com.uvlava.ct.prod, 2 GB — the terraform s-1vcpu-2gb), reserved IP 144.126.248.192, default IP 159.203.90.3. App up (systemd prospector), DB up, migrations applied, Caddy serving.
DNS is a CNAME into the DO zone (not a raw A at the registrar): apps.ftw.pw → CNAME → apps.ct.uvlava.com (A, digitalocean_record.ct_apps in dns.tf) → the ct.prod reserved IP. Set the joker.com CNAME once; the IP lives in IaC. Use CNAME, never url-forwarding (browser must stay on apps.ftw.pw so Caddy issues its LE cert).
The backend depends on @cocotte/ai-harness — a workspace package published to the ct-forge verdaccio (http://134.199.243.61:4873/). npm ci on ct.prod can't resolve a local workspace link, so deploy-server.sh ships an .npmrc (scope routing + read token) and explicitly installs the published tarball after npm ci. (Publishing it is CI/CD's job on main.)
DB: the prospector DB + role already exist on lilith-store-pg; the deploy only fills PROSPECTOR_DB_* in /opt/prospector/.env (private host private-lilith-store-pg-…ondigitalocean.com:25060, sslmode=require, DO CA cert). ct.prod must be a DB trusted source — doctl databases firewalls append <cluster> --rule droplet:<ct.prod-id> — or migrations/connect time out.
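For the `.npmrc` that deploy-server.sh ships, the scope routing plus read token usually comes down to two lines. The script's actual contents are not shown in this document, so the following is only a sketch of the standard npm format; `render_npmrc` is a hypothetical helper.

```shell
# Hypothetical helper: prints an .npmrc that routes only the @cocotte scope to
# a private registry and attaches a read token to that registry host.
render_npmrc() {
  local registry="${1%/}/"            # normalize to exactly one trailing slash
  local token="$2"
  local hostpath="${registry#*://}"   # npm keys auth entries by //host:port/path/
  printf '@cocotte:registry=%s\n' "$registry"
  printf '//%s:_authToken=%s\n' "$hostpath" "$token"
}
```

Usage might look like `render_npmrc http://134.199.243.61:4873/ "$NPM_READ_TOKEN" > .npmrc`, where `NPM_READ_TOKEN` is an assumed variable name. Packages outside the `@cocotte` scope still resolve from the default registry.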

✅ Resolved (2026-07-01)

Duplicate droplet (forge-duplication landmine) — a second hand-created com.uvlava.ct.prod (581541024, 4 GB, reserved 134.199.244.34) was billing in parallel and apps.ct wrongly pointed at its IP (causing /prospector/* 404s + LE failures on the real host). Destroyed the droplet + released its reserved IP; only the terraform-tracked 581442557 remains.
Terraform drift on apps.ct — state tracked a stale record id; dropped it and terraform imported the live record (1824103028 → 144.126.248.192). plan now reports No changes.
LE cert — real Let's Encrypt cert (CN=YE1, non-staging) issued for apps.ftw.pw once DNS propagated to 144.126.248.192. Verified https://apps.ftw.pw/health → 200.
Edge auth + token injection (RESOLVED 2026-07-01) — the console PWA carries no bearer token, so guarded /prospector/* calls 401'd through the edge. The Caddyfile (deploy/edge/apps.ftw.pw.Caddyfile) now gates the whole site with basic_auth (operator login) and injects Authorization: Bearer {$PROSPECTOR_SERVICE_TOKEN} to the loopback app — so an authenticated operator's browser is authorized, but the token is never handed to anonymous visitors. Creds + token live in Caddy's systemd env /etc/caddy/caddy.env on ct.prod (OPERATOR_USER, OPERATOR_BCRYPT, PROSPECTOR_SERVICE_TOKEN), wired via a caddy.service.d/env.conf drop-in — not committed. Rotate the operator password with caddy hash-password --plaintext '<new>' → update OPERATOR_BCRYPT in caddy.env → systemctl restart caddy. Verified: anon→401, operator→console + /prospector/* 200, /internal→403.
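The verification at the end of the edge-auth item (anon→401, operator→200, /internal→403) can be repeated as a small smoke check. This is a sketch, not the check that was actually run: `check_edge` and `http_status` are hypothetical names, and sending operator credentials on the `/internal` probe is an assumption, since the document does not say whether that 403 was observed with or without a login.

```shell
# Hypothetical smoke check for the edge. The fetch function is injectable so
# the expectations can be exercised without touching the live host.
http_status() {   # prints only the HTTP status code; arguments go to curl
  curl -s -o /dev/null -w '%{http_code}' "$@"
}

check_edge() {
  local base="$1" creds="$2" fetch="${3:-http_status}" rc=0 got
  got=$("$fetch" "$base/prospector/")
  [ "$got" = 401 ] || { echo "anon /prospector/: want 401, got $got"; rc=1; }
  got=$("$fetch" -u "$creds" "$base/prospector/")
  [ "$got" = 200 ] || { echo "operator /prospector/: want 200, got $got"; rc=1; }
  # Assumption: probe /internal with operator credentials.
  got=$("$fetch" -u "$creds" "$base/internal/inbound")
  [ "$got" = 403 ] || { echo "/internal/inbound: want 403, got $got"; rc=1; }
  return "$rc"
}
```

Run it as `check_edge https://apps.ftw.pw 'operator:<password>'`; it prints one line per mismatch and returns non-zero if any expectation fails.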

⚠ Still open

Private registry over HTTPS — the deploy pulls @cocotte/ai-harness from the ct-forge verdaccio at plaintext http://134.199.243.61:4873. Once npm.ct.uvlava.com is routed to Verdaccio (TLS via the artifacts-host Caddy), point .npmrc / deploy-server.sh / the app .npmrc at https://npm.ct.uvlava.com/ — no bearer token over HTTP.
Public SSH exposure — the DO firewall's admin_ips isn't mesh-only as designed: :22 answers on the public IP. Tighten var.admin_ips to the mesh (wg-only SSH).
ct.prod not on wg1 — phase-b-mesh-join.sh wasn't run, so 10.9.0.10 is unreachable and people/mac-sync/mr-number (mesh deps) aren't reachable yet. Deploy currently runs over the public IP (SERVER_HOST=<reserved IP>); join wg1 for the mesh deps + to move SSH off the public leg.

⚠️ ct.prod must be added as a TRUSTED SOURCE on the lilith-store-pg managed cluster (DO console → Databases → firewall) or migrations + the app's DB connect will time out.

Operator runbook — bring ct.prod live (in order)

All terraform here is plan/apply with -target so the rest of the shared store tier is never dragged in. ct_prod_enabled defaults false; the -var flips it on for this targeted apply only.

cd ~/Code/@ct/infra/uvlava/terraform/do
export TF_VAR_do_token="$(cat ~/.vault/do-pat-ct.token)"

# 1. Stand up ct.prod (droplet + reserved IP + cloud firewall) — ONLY these.
terraform plan  -var=ct_prod_enabled=true \
  -target=digitalocean_droplet.ct_prod \
  -target=digitalocean_reserved_ip.ct_prod \
  -target=digitalocean_firewall.ct_prod        # review: 3 to add, 0 change, 0 destroy
terraform apply -var=ct_prod_enabled=true \
  -target=digitalocean_droplet.ct_prod \
  -target=digitalocean_reserved_ip.ct_prod \
  -target=digitalocean_firewall.ct_prod
terraform output -raw ct_prod_public_ip        # = the reserved IP (the output exists only after the apply)

# 2. Join ct.prod to wg1: copy /root/wg1.pub off the box, add it as a [Peer] on
#    the nyc3 hub (citron); append the citron [Peer] block to ct.prod's
#    /etc/wireguard/wg1.conf, then `systemctl start wg-quick@wg1`
#    (phase-b-mesh-join.sh automates this). Then set mesh-hosts.json ct.prod
#    wg_pubkey + public (= the reserved IP) and re-render (net sync).

# 3. Make ct.prod a trusted source on the managed PG cluster (DO console), then
#    create the prospector DB + role ONCE (secret-bearing; not in terraform):
doctl databases db   create lilith-store-pg prospector
doctl databases user create lilith-store-pg prospector     # prints the password
#    as doadmin on the prospector DB:
#      ALTER DATABASE prospector OWNER TO prospector;
#      GRANT ALL ON SCHEMA public TO prospector; ALTER SCHEMA public OWNER TO prospector;

# 4. DNS for apps.ftw.pw — CNAME into the DO-managed uvlava zone (NOT a manual A,
#    NOT url-forwarding). The IP is IaC-owned so it is never hand-copied:
#      apps.ct.uvlava.com  A  <ct.prod reserved IP>   <- already in dns.tf
#                                                         (digitalocean_record.ct_apps,
#                                                          gated by ct_prod_enabled)
#      apps.ftw.pw         CNAME  apps.ct.uvlava.com   <- add ONCE at joker.com
#    CNAME (not url-forward) keeps the browser on apps.ftw.pw and lets Caddy issue
#    the LE cert for apps.ftw.pw. Leave the ftw.pw apex (-> vps-0 short-links) alone.
#    If ct.prod's IP ever moves, only terraform changes; the joker CNAME stays put.

# 5. Ship the app (over the mesh; fills /opt/prospector/.env, runs migrations).
cd ~/Code/@ct/@applications/prospector
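./deploy/deploy-server.sh                       # SERVER_HOST defaults to 10.9.0.10 (mesh)
#    Until ct.prod has joined wg1 (step 2), 10.9.0.10 is unreachable; deploy over the
#    public leg instead: SERVER_HOST=<reserved IP> ./deploy/deploy-server.sh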
#    First run halts at the DB __SET_ME__ guard: fill PROSPECTOR_DB_* in
#    /opt/prospector/.env on ct.prod from step 3, then re-run deploy-server.sh.

# 6. Install the Caddy edge on ct.prod (public TLS for apps.ftw.pw).
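ssh root@10.9.0.10 'apt-get install -y caddy'   # install first so /etc/caddy exists
scp deploy/edge/apps.ftw.pw.Caddyfile root@10.9.0.10:/etc/caddy/Caddyfile
#    The Caddyfile expects OPERATOR_USER, OPERATOR_BCRYPT and PROSPECTOR_SERVICE_TOKEN
#    from /etc/caddy/caddy.env (caddy.service.d/env.conf drop-in); create it before restarting.
ssh root@10.9.0.10 'systemctl restart caddy'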
#    Verify: https://apps.ftw.pw/prospector/ loads; https://apps.ftw.pw/internal/inbound -> 403.
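Step 2 of the runbook adds a `[Peer]` block on each side of the wg1 link. phase-b-mesh-join.sh is not reproduced here, so the sketch below only shows the shape of such a stanza; `render_wg_peer` is a hypothetical helper, and the AllowedIPs values, the endpoint port, and the keepalive of 25 seconds are assumptions rather than values taken from the mesh config.

```shell
# Hypothetical helper: prints a wg-quick [Peer] stanza. The endpoint is
# optional, because the hub does not need one for a peer that dials in.
render_wg_peer() {
  local pubkey="$1" allowed_ips="$2" endpoint="${3:-}"
  printf '[Peer]\nPublicKey = %s\nAllowedIPs = %s\n' "$pubkey" "$allowed_ips"
  if [ -n "$endpoint" ]; then
    # Assumed keepalive; common for peers that must hold a path open to a hub.
    printf 'Endpoint = %s\nPersistentKeepalive = 25\n' "$endpoint"
  fi
}
```

On the hub this might be `render_wg_peer "$(cat wg1.pub)" 10.9.0.10/32`, and on ct.prod `render_wg_peer <citron pubkey> 10.9.0.0/24 <citron public host>:<port>`, appended to `/etc/wireguard/wg1.conf` before starting `wg-quick@wg1`.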

Legacy reference — the lime bootstrap (internal-only now)

The steps below were written for lime and remain accurate for the DB + env + systemd mechanics, which are identical on ct.prod (the deploy script does them). lime itself is now internal-only; the app + edge moved to ct.prod.

Probed 2026-06-29: lime = lilith-store-backend, Ubuntu 24.04, public 209.38.51.98 · wg 10.9.0.5 · VPC 10.20.0.2. Postgres 16 + pgbouncer fronts the DO Managed cluster. NestJS 11 needs Node 20+. SSH alias lime (root, ~/.ssh/id_ed25519_1984).

⚠️ These steps sudo-write a SHARED prod host. They were blocked under auto mode (correctly). Run them in a non-auto session, or grant a Bash(ssh ct.prod *) permission rule, or run them yourself.

1. Node 20 on the droplet

ssh lime 'curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - && sudo apt-get install -y nodejs && node -v'

(mac-sync uses Bun, so a system Node bump is safe for it.)

2. Create the two DBs — on the DO Managed Postgres cluster

There is no local Postgres. The droplet's pgbouncer (:6432) fronts a DO Managed Postgres cluster: private-lilith-store-pg-do-user-28217120-0.l.db.ondigitalocean.com:25060 (holds the live quinn DB). So people + prospector are new databases on that managed cluster (additive — does NOT touch quinn):

Via Terraform IaC (the DO infra is Terraform-managed in uvlava/terraform/do). The DBs + dedicated users are already declared (pg_databases += people/prospector; digitalocean_database_user.{people,prospector}). Just apply:

cd ~/Code/@projects/uvlava/terraform/do
TF_VAR_do_token=<your DO token> terraform apply   # additive: +2 dbs, +2 users, 0 destroy
terraform output -raw people_db_password
terraform output -raw prospector_db_password
terraform output -raw pg_host        # private cluster host for the .env

Services connect directly to the managed endpoint over SSL (skip the shared pgbouncer to avoid touching live pooling): *_DB_HOST=private-lilith-store-pg-..., *_DB_PORT=25060, *_DB_SSL=true. (Optionally add [databases] entries to /etc/pgbouncer/pgbouncer.ini + reload to pool them, but that touches shared infra.)
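A quick way to confirm the direct path works before starting the services is to connect with `psql` using the same host, port and SSL settings. A sketch, assuming `psql` is installed where it runs and that `sslmode=require` is the libpq equivalent of `*_DB_SSL=true`; `pg_conninfo` is a hypothetical helper.

```shell
# Hypothetical helper: builds a libpq keyword/value connection string for the
# managed endpoint. The password is left to PGPASSWORD or ~/.pgpass so it
# never appears in the argument list.
pg_conninfo() {
  local host="$1" db="$2" user="$3" port="${4:-25060}"
  printf 'host=%s port=%s dbname=%s user=%s sslmode=require' \
    "$host" "$port" "$db" "$user"
}
```

For example: `PGPASSWORD=<password> psql "$(pg_conninfo "$PROSPECTOR_DB_HOST" prospector prospector)" -c 'select 1'`. A hang here rather than an auth error usually points at the trusted-sources firewall, not the credentials.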

3. Apply migrations

# prospector
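for f in 0001_prospector 0002_drafts 0003_corrections; do
  # ON_ERROR_STOP makes psql exit non-zero on a failed statement, so the loop stops there
  ssh lime "sudo -u postgres psql -v ON_ERROR_STOP=1 -d prospector" < "migrations/$f.sql" || break
done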
# people (from the cocottetech repo)
ssh lime "sudo -u postgres psql -d people" < <people-service>/migrations/0001_people.sql
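Since section 2 says there is no local Postgres on the droplet, the same files can also be applied straight to the managed endpoint. A sketch under these assumptions: `psql` is installed where this runs, the caller supplies a libpq connection string, and each file is safe to apply exactly once (nothing here records what has already been applied). `apply_migrations` is a hypothetical name.

```shell
# Hypothetical helper: applies every NNNN_*.sql file in a directory in lexical
# (numeric-prefix) order and stops at the first failure.
# PSQL can be overridden, e.g. PSQL=true for a dry run that only lists files.
apply_migrations() {
  local conninfo="$1" dir="$2" f
  for f in "$dir"/[0-9][0-9][0-9][0-9]_*.sql; do
    [ -e "$f" ] || { echo "no migrations in $dir" >&2; return 1; }
    echo "applying $(basename "$f")"
    ${PSQL:-psql} "$conninfo" -v ON_ERROR_STOP=1 -f "$f" || return 1
  done
}
```

`PSQL=true apply_migrations x migrations` prints the order without connecting, which is a cheap check before running it against the real cluster.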

4. Ship the built code

Build locally, rsync dist + manifests, install prod deps on the droplet:

npm run build && npm run build -w @prospector/mcp-prospector
rsync -az --delete dist package.json package-lock.json migrations lime:/opt/prospector/
ssh lime 'cd /opt/prospector && npm ci --omit=dev'
# people-service likewise to /opt/people-service

5. Env on the droplet (`/opt/prospector/.env`)

NODE_ENV=production
PROSPECTOR_API_PORT=3210
PROSPECTOR_DB_HOST=private-lilith-store-pg-do-user-28217120-0.l.db.ondigitalocean.com
# DO managed cluster (direct, SSL). Comments stay on their own lines: systemd's
# EnvironmentFile= only ignores lines that start with "#", not trailing text.
PROSPECTOR_DB_PORT=25060
PROSPECTOR_DB_SSL=true
PROSPECTOR_DB_NAME=prospector
PROSPECTOR_DB_USER=prospector
PROSPECTOR_DB_PASSWORD=<from doctl databases user create>
PROSPECTOR_SERVICE_TOKEN=<strong-token>
PEOPLE_BASE_URL=http://127.0.0.1:3061
PEOPLE_SERVICE_TOKEN=<people-token>
# Legacy lime layout: mac-sync ran on the same droplet. For ct.prod this value is operator-set.
MACSYNC_BASE_URL=http://127.0.0.1:3201
MACSYNC_SERVICE_TOKEN=<macsync-token>
MACSYNC_DEVICE_ID=<device>
MRNUMBER_BASE_URL=https://my.transquinnftw.com
MRNUMBER_SERVICE_TOKEN=<mr-token>

(people-service gets its own /opt/people-service/.env with PEOPLE_DB_* + PEOPLE_SERVICE_TOKEN.)
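The runbook above notes that the first deploy halts at a `__SET_ME__` guard until `PROSPECTOR_DB_*` is filled in. The deploy script's own check is not reproduced in this document; the sketch below shows one way such a guard could work, with `env_unset_keys` as a hypothetical name.

```shell
# Hypothetical placeholder guard: lists the keys in an env file whose value
# still contains __SET_ME__ and fails if there are any.
env_unset_keys() {
  local env_file="$1" keys
  keys=$(grep -E '^[A-Za-z_][A-Za-z0-9_]*=.*__SET_ME__' "$env_file" | cut -d= -f1)
  [ -z "$keys" ] && return 0
  printf '%s\n' "$keys"
  return 1
}
```

Something like `env_unset_keys /opt/prospector/.env || exit 1` before starting the service keeps a half-filled file from reaching systemd.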

6. systemd units (`/etc/systemd/system/{prospector,people-service}.service`)

[Service]
WorkingDirectory=/opt/prospector
EnvironmentFile=/opt/prospector/.env
ExecStart=/usr/bin/node dist/main.js
Restart=always
User=root
[Install]
WantedBy=multi-user.target

sudo systemctl enable --now people-service prospector → curl localhost:3061/health, curl localhost:3210/health.
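Right after `enable --now` the services may need a few seconds before `/health` answers, so a single curl can fail spuriously. A small retry wrapper avoids that; `wait_until` and the `RETRY_DELAY` variable are hypothetical names for this sketch.

```shell
# Hypothetical helper: retries a probe command until it succeeds or the
# attempt budget runs out. RETRY_DELAY (seconds) defaults to 2.
wait_until() {
  local tries="$1" i=1
  shift
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0
    [ "$i" -lt "$tries" ] && sleep "${RETRY_DELAY:-2}"
    i=$((i + 1))
  done
  return 1
}
```

For example: `wait_until 10 curl -fsS http://localhost:3061/health && wait_until 10 curl -fsS http://localhost:3210/health`. The `-f` flag makes curl return non-zero on HTTP errors, so a 5xx counts as not yet healthy.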

7. Wire mac-sync → prospector webhook

In the @mac-sync server (same droplet): on a new inbound, fire-and-forget POST http://127.0.0.1:3210/internal/inbound with Authorization: Bearer $PROSPECTOR_SERVICE_TOKEN, body {handle, channel:'imessage', text, occurredAt, hasCallSignal?}. Env-gated (PROSPECTOR_WEBHOOK_URL/token) so macsync runs standalone if unset. (Redo cleanly — the earlier agent left partial edits in @mac-sync.)
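The request shape can be exercised from a shell before the mac-sync side is wired. This illustrates the HTTP call only, not the mac-sync implementation. The function names are hypothetical, the ISO 8601 timestamp format for `occurredAt` is an assumption, and the optional `hasCallSignal` field is omitted.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the webhook call. json_escape covers the characters
# most likely to appear in a message body; it is not a full JSON encoder.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}        # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

inbound_payload() {      # arguments: handle, text, occurredAt
  printf '{"handle":"%s","channel":"imessage","text":"%s","occurredAt":"%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

post_inbound() {         # fire-and-forget; a no-op when the URL is unset
  [ -n "${PROSPECTOR_WEBHOOK_URL:-}" ] || return 0
  curl -fsS -m 5 -X POST "$PROSPECTOR_WEBHOOK_URL" \
    -H "Authorization: Bearer $PROSPECTOR_SERVICE_TOKEN" \
    -H 'Content-Type: application/json' \
    --data "$(inbound_payload "$@")" >/dev/null 2>&1 &
}
```

With `PROSPECTOR_WEBHOOK_URL=http://127.0.0.1:3210/internal/inbound` exported, `post_inbound '<handle>' 'hello' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"` sends one event in the background and never blocks the caller; with the variable unset it does nothing, matching the env-gated behavior described above.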

8. Point the dev UI at prod (over the mesh)

web/.env.local:

PROSPECTOR_API_URL=http://10.9.0.5:3210
PROSPECTOR_SERVICE_TOKEN=<the prod PROSPECTOR_SERVICE_TOKEN>

Restart npm run dev -w @prospector/web. The vite proxy injects the token; the panel now shows real prod decisions.

Verify (go-live)

/health both services → real inbound (or prospector_submit_inbound) → appears in prospector/activity → kill-switch flip persists → dev UI shows it over the mesh.

Post-migration notes (2026-06-29 unification)

Run new migrations: for f in migrations/0006_bilingual.sql; do ssh lime "sudo -u postgres psql -d prospector" < "$f"; done
Bilingual content now lives in prospect_drafts (original/translated/detected_lang); Triage/Detail/Reports show both versions when present (data comes from the macsync inbound path plus future classifier translations).
MCP (@packages/mcp-prospector) now exposes full tools (prospector_* + legacy mappings for cockpit parity): list, thread, draft, send, mr, pastebin, reports, markets, classify, submit, held, activity, etc. Use with PROSPECTOR_BASE_URL + TOKEN. Replaces LP mcp-prospector.
UI fused: Triage = designs/main-view + inbox-ops + LP Stream; Reports = 4 reports + engine subs (Experiments/Patterns/Actions); Queue = queued-tasks + owed/backfill; etc. PWA install in Control.
LP can now drop prospector (see MIGRATION-PLAN in session plan file for removal list + proxies during cutover).
Rebuild/redeploy mcp + app after changes.
