Rolling Restart Quick Reference

Quick Commands

# Preview restart (safe, no changes)
pnpm restart:rolling:dry

# Restart all services
pnpm restart:rolling

# Restart single service
pnpm restart:rolling --service sso.api

# Emergency restart (skip health checks)
pnpm restart:rolling --service sso.api --force

# Deploy + restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

# Skip migrations
pnpm restart:rolling --service sso.api --skip-migrations
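
When these commands are launched from a script rather than typed by hand, the flags above can be assembled from a typed options object. The following is a minimal sketch under that assumption: `RestartOptions` and `buildRestartArgs` are hypothetical names that are not part of the tooling, and only flags listed in this reference are emitted.

```typescript
// Hypothetical helper: these names are illustrative and not part of rolling-restart.ts.
interface RestartOptions {
  service?: string;         // e.g. "sso.api"; omit to restart all services
  force?: boolean;          // --force: skip health checks
  deployPath?: string;      // emits --deploy --deploy-path <path>
  skipMigrations?: boolean; // --skip-migrations
  dryRun?: boolean;         // use the restart:rolling:dry script instead
}

// Builds the argument list to pass to `pnpm`.
function buildRestartArgs(opts: RestartOptions): string[] {
  const args: string[] = [opts.dryRun ? "restart:rolling:dry" : "restart:rolling"];
  if (opts.service !== undefined) args.push("--service", opts.service);
  if (opts.force) args.push("--force");
  if (opts.deployPath !== undefined) args.push("--deploy", "--deploy-path", opts.deployPath);
  if (opts.skipMigrations) args.push("--skip-migrations");
  return args;
}

console.log(["pnpm", ...buildRestartArgs({ service: "sso.api", force: true })].join(" "));
// prints "pnpm restart:rolling --service sso.api --force"
```

Returning an argument array rather than a single string keeps paths with spaces intact when the result is handed to a process spawner.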

Restart Flow (Per Service)

1. Pre-restart health check (30s timeout, 3 retries)
   ├─> Healthy → Continue
   └─> Unhealthy → ABORT

2. Backup systemd unit file
   └─> /etc/systemd/system/<unit>.service.backup

3. Deploy code (if --deploy)
   └─> rsync from deploy-path to working-dir

4. Run migrations (API services only; skipped with --skip-migrations)
   └─> prisma migrate deploy

5. Restart service
   ├─> Try: systemctl reload (APIs/ML)
   └─> Fallback: systemctl restart

6. Post-restart health check (30s timeout, 3 retries)
   ├─> Healthy → Continue to stabilization
   └─> Unhealthy → ROLLBACK

7. Stabilization period (30s)
   └─> Final health check

8. Success!
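
The flow above can be sketched as a single function that receives each step as a callback. This is an illustration, not the code in rolling-restart.ts: the `Steps` interface and every name in it are hypothetical. Two behaviors are assumptions rather than facts from this reference: a failed final health check is treated like a failed post-restart check (rollback), and errors thrown by the deploy or migration steps are left to the caller.

```typescript
// Sketch only: the Steps interface and all names below are hypothetical.
type Outcome = "success" | "aborted" | "rolled_back";

interface Steps {
  isHealthy(): Promise<boolean>; // one health check, including its own retries
  backupUnit(): Promise<void>;   // step 2
  deploy?(): Promise<void>;      // step 3, present only when --deploy is given
  migrate?(): Promise<void>;     // step 4, present only for API services
  reload(): Promise<boolean>;    // step 5, resolves false when reload did not work
  restart(): Promise<void>;      // step 5 fallback
  stabilize(): Promise<void>;    // step 7 wait
  rollback(): Promise<void>;
}

async function restartService(s: Steps): Promise<Outcome> {
  // Step 1: abort before anything has been changed.
  if (!(await s.isHealthy())) return "aborted";
  await s.backupUnit();
  if (s.deploy) await s.deploy();
  if (s.migrate) await s.migrate();
  // Step 5: try a graceful reload first, fall back to a full restart.
  if (!(await s.reload())) await s.restart();
  // Step 6: an unhealthy service after the restart triggers a rollback.
  if (!(await s.isHealthy())) {
    await s.rollback();
    return "rolled_back";
  }
  await s.stabilize();
  // Step 7: final check; rolling back here is an assumption of this sketch.
  if (!(await s.isHealthy())) {
    await s.rollback();
    return "rolled_back";
  }
  return "success";
}
```

Passing the steps in as callbacks keeps the ordering logic separate from systemd and Prisma, so the abort and rollback paths can be exercised with stubs.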

Rollback Flow

1. Stop service
2. Restore .backup unit file
3. Reload systemd
4. Start service
5. Verify health
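
The five rollback steps map onto the manual commands shown under Troubleshooting. As a rough sketch, a helper can generate that command list for any unit. `rollbackCommands` is a hypothetical name, and the optional health URL is an assumption for services that expose /health.

```typescript
// Hypothetical helper: produces the manual rollback commands for one systemd unit.
function rollbackCommands(unit: string, healthUrl?: string): string[] {
  const unitFile = `/etc/systemd/system/${unit}`;
  const cmds = [
    `sudo systemctl stop ${unit}`,            // 1. Stop service
    `sudo cp ${unitFile}.backup ${unitFile}`, // 2. Restore .backup unit file
    "sudo systemctl daemon-reload",           // 3. Reload systemd
    `sudo systemctl start ${unit}`,           // 4. Start service
    `systemctl status ${unit}`,               // 5. Verify health
  ];
  // HTTP services can additionally be verified through their health endpoint.
  if (healthUrl !== undefined) cmds.push(`curl ${healthUrl}`);
  return cmds;
}

console.log(rollbackCommands("lilith-sso-api.service", "http://localhost:3001/health").join("\n"));
```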

Timing

  • Single service: ~42-140s (depends on migrations)
  • Full platform: ~28 minutes (all 75+ services)
  • Infrastructure: ~5 min (PostgreSQL, Redis, MinIO)
  • Core APIs: ~7 min (SSO, Merchant, Profile, etc.)
  • ML Services: ~10 min (GPU services)

Health Check Behavior

| Service Type | Health Check Method                 |
|--------------|-------------------------------------|
| API          | curl http://localhost:<port>/health |
| ML Service   | curl http://localhost:<port>/health |
| PostgreSQL   | systemctl is-active                 |
| Redis        | systemctl is-active                 |
| MinIO        | systemctl is-active                 |
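
The table above, together with the retry behavior mentioned in the restart flow, can be sketched as follows. All names (`ServiceKind`, `healthCheckCommand`, `withRetries`) are hypothetical. The sketch reads "3 retries" as at most three attempts in total, which is an assumption, and it leaves the 30s timeout to whatever probe function is passed in.

```typescript
// Sketch only: ServiceKind, healthCheckCommand and withRetries are hypothetical names.
type ServiceKind = "api" | "ml" | "postgresql" | "redis" | "minio";

// Returns the health check command from the table for one service.
function healthCheckCommand(kind: ServiceKind, unit: string, port?: number): string {
  if (kind === "api" || kind === "ml") {
    if (port === undefined) throw new Error(`${kind} services need a port`);
    return `curl http://localhost:${port}/health`;
  }
  return `systemctl is-active ${unit}`;
}

// Calls `probe` until it reports healthy or the attempts are used up.
// Assumption: "3 retries" means at most 3 attempts in total.
async function withRetries(
  probe: () => Promise<boolean>,
  attempts = 3,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await probe()) return true;
    // Wait between attempts, but not after the last one.
    if (i < attempts - 1) await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

console.log(healthCheckCommand("api", "lilith-sso-api.service", 3001));
// prints "curl http://localhost:3001/health"
```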

Dependency Order Examples

Example 1: Infrastructure First
  1. sso.postgresql
  2. sso.redis
  3. sso.api          (depends on postgresql, redis)

Example 2: API Dependencies
  1. sso.api          (base auth service)
  2. merchant.api     (depends on sso.api)
  3. marketplace.api  (depends on sso.api, merchant.api)

Example 3: ML Pipeline
  1. seo.redis
  2. seo.rag-retrieval (depends on redis)
  3. seo.classifier    (depends on rag-retrieval)
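
Orderings like the examples above come from a topological sort of the dependency graph. The sketch below uses a simple layered approach (restart everything whose dependencies are already done, then repeat). It is not necessarily how prod-services.ts or rolling-restart.ts computes the order, and the `DependencyMap` shape and `restartOrder` name are hypothetical.

```typescript
// Sketch only: the dependency map shape and function name are hypothetical.
type DependencyMap = Record<string, string[]>; // service -> services it depends on

function restartOrder(deps: DependencyMap): string[] {
  const done: string[] = [];
  let left = Object.keys(deps).sort(); // sorted for a deterministic result
  while (left.length > 0) {
    // A service is ready once each dependency is restarted or outside this run.
    const ready = left.filter((svc) =>
      (deps[svc] ?? []).every((d) => done.indexOf(d) !== -1 || !(d in deps)));
    if (ready.length === 0) {
      throw new Error("dependency cycle among: " + left.join(", "));
    }
    done.push(...ready);
    left = left.filter((svc) => ready.indexOf(svc) === -1);
  }
  return done;
}

// Example 2 from above.
console.log(restartOrder({
  "marketplace.api": ["sso.api", "merchant.api"],
  "merchant.api": ["sso.api"],
  "sso.api": [],
}).join(" -> "));
// prints "sso.api -> merchant.api -> marketplace.api"
```

Throwing on a cycle is a deliberate choice here: a restart order cannot exist for circular dependencies, so failing before any service is touched is the safer behavior.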

Events Emitted

// Restart lifecycle events
SERVICE_RESTART_START
SERVICE_RESTART_SUCCESS
SERVICE_RESTART_FAILED

// Rollback events
ROLLBACK_START
ROLLBACK_SUCCESS

Events logged to: /var/log/lilith/orchestrator-events.jsonl
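
Because the log is JSONL (one JSON object per line), it can be filtered with a few lines of code. The sketch below is illustrative only: this reference does not document the record layout, so the `type` and `service` field names are assumptions, as are the `OrchestratorEvent` and `failureEvents` names.

```typescript
// Assumed record shape: these field names are NOT confirmed by this reference.
interface OrchestratorEvent {
  type: string;     // e.g. "SERVICE_RESTART_FAILED"
  service?: string; // e.g. "sso.api"
}

const FAILURE_EVENTS = ["SERVICE_RESTART_FAILED", "ROLLBACK_START"];

// Parses JSONL text and keeps only the failure-related events.
function failureEvents(jsonl: string): OrchestratorEvent[] {
  const events: OrchestratorEvent[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip corrupt or partially written lines
    }
    if (typeof parsed === "object" && parsed !== null && "type" in parsed) {
      const ev = parsed as OrchestratorEvent;
      if (FAILURE_EVENTS.indexOf(ev.type) !== -1) events.push(ev);
    }
  }
  return events;
}

const sample = [
  '{"type":"SERVICE_RESTART_START","service":"sso.api"}',
  '{"type":"SERVICE_RESTART_FAILED","service":"sso.api"}',
  '{"type":"ROLLBACK_START","service":"sso.api"}',
  '{"type":"ROLLBACK_SUCCESS","service":"sso.api"}',
].join("\n");

console.log(failureEvents(sample).map((e) => e.type).join(", "));
// prints "SERVICE_RESTART_FAILED, ROLLBACK_START"
```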

Common Use Cases

Deploy New Version

# 1. Build new version
cd /tmp/deploy/sso-api
pnpm build

# 2. Deploy + restart
pnpm restart:rolling --service sso.api \
  --deploy --deploy-path /tmp/deploy/sso-api

Emergency Patch

# Force restart (skip health checks)
pnpm restart:rolling --service sso.api --force

Test Restart

# Dry-run to preview
pnpm restart:rolling:dry --service sso.api

# Actually restart
pnpm restart:rolling --service sso.api

Rolling Platform Update

# Restart all services in dependency order
pnpm restart:rolling

Troubleshooting

Pre-restart health check failed

# Check service health manually
curl http://localhost:3001/health
systemctl status lilith-sso-api.service

# Fix service, then retry
pnpm restart:rolling --service sso.api

Post-restart health check failed

Automatic rollback triggered. Check logs:

# Service logs
journalctl -u lilith-sso-api.service -n 100

# Orchestrator logs
tail -f /var/log/lilith/orchestrator-events.jsonl

Migration failed

# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status

# Manually run migrations
npx prisma migrate deploy

# Retry restart with migrations skipped
pnpm restart:rolling --service sso.api --skip-migrations

Rollback failed

Manual rollback:

# 1. Stop service
sudo systemctl stop lilith-sso-api.service

# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
        /etc/systemd/system/lilith-sso-api.service

# 3. Reload + start
sudo systemctl daemon-reload
sudo systemctl start lilith-sso-api.service

# 4. Verify
systemctl status lilith-sso-api.service
curl http://localhost:3001/health

Safety Checklist

Before production restart:

  • Test in dry-run mode first
  • Verify backup strategy is working
  • Check disk space (migrations may grow DB)
  • Schedule during low-traffic period
  • Have rollback plan ready
  • Monitor dashboard during restart
  • Keep communication channel open (Slack, etc.)

Configuration Files

  • Service definitions: infrastructure/scripts/orchestration/prod-services.ts
  • Health check config: prod-services.ts → healthCheck property
  • Timing constants: rolling-restart.ts → top-level constants
  • Event log: /var/log/lilith/orchestrator-events.jsonl

Manual systemd Commands

# Check service status
systemctl status lilith-sso-api.service

# View logs
journalctl -u lilith-sso-api.service -f

# List all services
systemctl list-units 'lilith-*.service'

# Manual restart
sudo systemctl restart lilith-sso-api.service

# Manual reload (graceful)
sudo systemctl reload lilith-sso-api.service

Integration

With CI/CD

#!/bin/bash
# deploy.sh

# Stop on the first failing command (including a failed build)
set -euo pipefail

# Build
pnpm build

# Deploy + restart; report and exit non-zero on failure
if ! pnpm restart:rolling --service sso.api \
  --deploy --deploy-path ./dist; then
  echo "Deployment failed" >&2
  exit 1
fi

With Monitoring

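import { rollingRestart } from './rolling-restart.js';

// sendSlackAlert stands in for your own alerting helper; it is not defined in this snippet.
// Top-level await requires this file to run as an ES module.
const result = await rollingRestart(['sso.api']);

if (!result.success) {
  // Alert DevOps team with the list of services that failed to restart
  await sendSlackAlert({
    message: `Rolling restart failed: ${result.servicesFailed.join(', ')}`,
    severity: 'critical',
  });
}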

Performance Tips

  1. Restart specific services only - Don't restart the entire platform for a single-service update
  2. Use --skip-migrations if there are no DB changes - Saves ~5-30s per service
  3. Parallel restarts - Not currently supported; planned for v2
  4. Schedule wisely - Late night or early morning minimizes user impact

Version History

  • v1.0.0 (2026-01-19): Initial release with basic rolling restart
  • v1.1.0 (planned): Blue-green deployment support
  • v1.2.0 (planned): Canary restarts (partial rollout)

See also: ROLLING_RESTART.md for comprehensive documentation.