# Rolling Restart Orchestrator

Zero-downtime production restart system with comprehensive health checks, automatic rollback, and orchestrator event emission.

## Overview

The rolling restart orchestrator safely restarts production services with:

- **Pre/post-restart health validation**: Ensures services are healthy before and after restart
- **Dependency-aware ordering**: Restarts infrastructure before APIs and respects service dependencies
- **Automatic rollback**: Restores the previous state if post-restart health checks fail
- **Event emission**: Publishes orchestrator events for dashboard visibility
- **Database migrations**: Executes Prisma migrations before service restart
- **Graceful reloads**: Uses systemd reload when possible and falls back to restart
- **Stabilization period**: Waits 30s after restart to ensure service stability

## Architecture

### Restart Flow

```
For each service (in dependency order):
  1. Pre-restart health check
     └─> Fail → Abort restart
  2. Backup systemd unit file
     └─> /etc/systemd/system/<unit>.service.backup
  3. Deploy new code (if --deploy flag)
     └─> rsync from deploy path to working dir
  4. Run database migrations (if service is API)
     └─> prisma migrate deploy
  5. Graceful restart
     ├─> Try: systemctl reload (APIs/ML)
     └─> Fallback: systemctl restart
  6. Post-restart health check
     ├─> Success → Continue to stabilization
     └─> Fail → Rollback
  7. Stabilization period (30s)
     └─> Final health check
  8. Emit SUCCESS event
```

### Rollback Flow

```
On post-restart health check failure:
  1. Emit ROLLBACK_START event
  2. Stop service
     └─> systemctl stop <unit>
  3. Restore backup unit file
     └─> cp <unit>.service.backup <unit>.service
     └─> systemctl daemon-reload
  4. Start service
     └─> systemctl start <unit>
  5. Verify rollback health
     └─> Health check on restored service
  6. Emit ROLLBACK_SUCCESS/FAILED event
```

## Usage

### Basic Usage

```bash
# Restart all services
pnpm restart:rolling

# Restart specific service
pnpm restart:rolling --service sso.api

# Dry-run (preview without executing)
pnpm restart:rolling:dry
pnpm restart:rolling --dry-run

# Force mode (skip health checks - EMERGENCY ONLY)
pnpm restart:rolling --force

# Skip database migrations
pnpm restart:rolling --skip-migrations
```

### Deploy with Restart

```bash
# Deploy new code and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

# Deploy multiple services
pnpm restart:rolling --deploy --deploy-path /var/www/lilith/deploy
```

### Programmatic Usage

```typescript
import { rollingRestart, restartService } from './rolling-restart.js';

// Restart all services
const result = await rollingRestart();
if (result.success) {
  console.log(`Restarted ${result.servicesRestarted.length} services`);
} else {
  console.error(`Failed services: ${result.servicesFailed.join(', ')}`);
}

// Restart single service with options
const success = await restartService('sso.api', {
  dryRun: false,
  force: false,
  skipMigrations: false,
  deployCode: true,
  deployPath: '/tmp/deploy/sso-api',
});
```

## Configuration

### Health Check Configuration

Health checks are defined per service in `prod-services.ts`:

```typescript
{
  serviceId: 'sso.api',
  healthCheck: {
    url: 'http://localhost:3001/health', // HTTP endpoint
    interval: 30,                        // Seconds between checks
  },
}

// OR

{
  serviceId: 'sso.postgresql',
  healthCheck: {
    command: 'pg_isready -h localhost',  // Command-based check
    interval: 30,
  },
}
```

### Timing Configuration

Edit constants in `rolling-restart.ts`:

```typescript
const HEALTH_CHECK_TIMEOUT = 30000;  // 30s - Max time for health check
const HEALTH_CHECK_INTERVAL = 2000;  // 2s - Time between retry attempts
const STABILIZATION_PERIOD = 30000;  // 30s - Wait after restart
const SYSTEMD_GRACE_PERIOD = 10000;  // 10s - Systemd command timeout
const MAX_RETRY_ATTEMPTS = 3;        // 3 - Health check retries
const RETRY_DELAY = 5000;            // 5s - Delay between retries
```

## Dependency Ordering

Services are automatically sorted by dependencies before restart:

```
Example Order:
1. sso.postgresql (infrastructure)
2. sso.redis (infrastructure)
3. sso.api (depends on sso.postgresql, sso.redis)
4. merchant.api (depends on sso.api)
5. marketplace.api (depends on sso.api, merchant.api)
```

Dependencies are defined in `prod-services.ts`:

```typescript
function getServiceDependencies(serviceId: string): string[] {
  if (serviceId === 'marketplace.api') {
    return [
      'network.target',
      getSystemdUnitName('sso.api'),
      getSystemdUnitName('merchant.api'),
      getSystemdUnitName('profile.api'),
    ];
  }
  // ...
}
```

## Event Emission

Events are emitted for orchestrator dashboard visibility:

```typescript
interface OrchestratorEvent {
  type:
    | 'SERVICE_RESTART_START'
    | 'SERVICE_RESTART_SUCCESS'
    | 'SERVICE_RESTART_FAILED'
    | 'ROLLBACK_START'
    | 'ROLLBACK_SUCCESS';
  serviceId: string;
  timestamp: number;
  metadata?: Record<string, unknown>; // type arguments assumed; they were lost in the source
}
```

Events are logged to `/var/log/lilith/orchestrator-events.jsonl`:

```json
{"type":"SERVICE_RESTART_START","serviceId":"sso.api","timestamp":"2026-01-19T12:00:00.000Z"}
{"type":"SERVICE_RESTART_SUCCESS","serviceId":"sso.api","timestamp":"2026-01-19T12:00:45.000Z"}
```

**Integration with @lilith/domain-events**:

To integrate with the domain events system:

```typescript
import { DomainEventsEmitter } from '@lilith/domain-events/emitter';

function emitEvent(event: OrchestratorEvent): void {
  const emitter = DomainEventsEmitter.getInstance();
  emitter.emit('orchestrator.service.restart', {
    serviceId: event.serviceId,
    status: event.type,
    timestamp: new Date(event.timestamp),
    metadata: event.metadata,
  });
}
```

## Health Checks

### HTTP Health Checks

For API and ML services:

```bash
curl -sf http://localhost:3001/health

# Expected response: HTTP 200
# Body: { "status": "healthy" }
```

### Command Health Checks

For infrastructure services:

```bash
# PostgreSQL
pg_isready -h localhost -p 5432

# Redis
redis-cli -h localhost -p 6379 ping

# MinIO
curl -sf http://localhost:9000/minio/health/live
```

### Systemd Status Checks

For services without explicit health checks:

```bash
systemctl is-active lilith-sso-api.service
# Output: active | inactive | failed
```

## Rollback Mechanism

### When Rollback Triggers

- Post-restart health check fails after `MAX_RETRY_ATTEMPTS`
- Service crashes during stabilization period
- Systemd reports service as failed

### Rollback Process

1. **Stop current service**:
   ```bash
   sudo systemctl stop lilith-sso-api.service
   ```

2. **Restore backup unit file**:
   ```bash
   sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
     /etc/systemd/system/lilith-sso-api.service
   sudo systemctl daemon-reload
   ```

3. **Start restored service**:
   ```bash
   sudo systemctl start lilith-sso-api.service
   ```

4. **Verify rollback**:
   ```bash
   # Health check on restored service
   curl -sf http://localhost:3001/health
   ```

### Manual Rollback

If automatic rollback fails:

```bash
# 1. Stop service
sudo systemctl stop lilith-sso-api.service

# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
  /etc/systemd/system/lilith-sso-api.service

# 3. Reload systemd
sudo systemctl daemon-reload

# 4. Start service
sudo systemctl start lilith-sso-api.service

# 5. Check status
sudo systemctl status lilith-sso-api.service
```

## Database Migrations

Migrations are automatically executed before service restart for API services.
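
The gating described here might look like the following minimal sketch. The `isApiService` and `planMigration` helpers, the `.api` suffix convention, and the mapping from service ID to feature directory are assumptions made for illustration and are not taken from `rolling-restart.ts`; only the Prisma command and the `sso` path appear in this document.

```typescript
// Hypothetical sketch: helper names and conventions are assumptions,
// not the orchestrator's actual API.

interface MigrationOptions {
  skipMigrations?: boolean;
}

interface MigrationPlan {
  cwd: string;      // directory the migration command would run in
  command: string;  // executable, relative to cwd
  args: string[];   // arguments passed to the executable
}

// Assumption: API services are identified by a ".api" suffix (e.g. "sso.api").
function isApiService(serviceId: string): boolean {
  return serviceId.endsWith('.api');
}

// Returns the migration to run before restarting a service, or null when
// no migration applies (non-API service, or --skip-migrations was given).
// Assumption: "<feature>.api" maps to features/<feature>/backend-api.
function planMigration(
  serviceId: string,
  options: MigrationOptions = {},
): MigrationPlan | null {
  if (!isApiService(serviceId) || options.skipMigrations) {
    return null;
  }
  const feature = serviceId.slice(0, -'.api'.length);
  return {
    cwd: `/var/www/lilith/codebase/features/${feature}/backend-api`,
    command: './node_modules/.bin/prisma',
    args: ['migrate', 'deploy'],
  };
}

console.log(planMigration('sso.api'));
console.log(planMigration('sso.postgresql'));                    // null
console.log(planMigration('sso.api', { skipMigrations: true })); // null
```

Returning a plan instead of executing the command directly keeps the decision testable and lets a dry run print the plan without touching the database.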

### Migration Process

```bash
cd /var/www/lilith/codebase/features/sso/backend-api
./node_modules/.bin/prisma migrate deploy
```

### Skip Migrations

```bash
pnpm restart:rolling --skip-migrations
```

### Manual Migration

```bash
cd /var/www/lilith/codebase/features/<feature>/backend-api
npx prisma migrate deploy
```

## Monitoring and Logs

### Orchestrator Logs

```bash
# View restart logs
journalctl -u lilith-orchestrator -f

# View orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl
```

### Service Logs

```bash
# View service logs
journalctl -u lilith-sso-api.service -f

# View recent restarts
journalctl -u lilith-sso-api.service --since "1 hour ago" | grep restart
```

### Health Check Status

```bash
# Check all services
for service in $(systemctl list-units 'lilith-*.service' --plain --no-legend | awk '{print $1}'); do
  echo -n "$service: "
  systemctl is-active "$service"
done
```

## Troubleshooting

### Service Won't Start

```bash
# Check service status
sudo systemctl status lilith-sso-api.service

# Check logs
journalctl -u lilith-sso-api.service -n 50

# Check dependencies
systemctl list-dependencies lilith-sso-api.service

# Manually start
sudo systemctl start lilith-sso-api.service
```

### Health Check Failing

```bash
# Test health endpoint manually
curl -v http://localhost:3001/health

# Check if service is listening
ss -tlnp | grep 3001

# Check environment variables
sudo systemctl show lilith-sso-api.service --property=Environment
```

### Rollback Failed

```bash
# Check if backup exists
ls -l /etc/systemd/system/lilith-sso-api.service.backup

# Manually restore (see Manual Rollback section)

# Check for conflicting processes
sudo lsof -i :3001
```

### Database Migration Failed

```bash
# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status

# Manually run migrations
npx prisma migrate deploy

# Mark a failed migration as rolled back (if needed).
# This only updates the migration history; it does not revert schema changes.
npx prisma migrate resolve --rolled-back "<migration_name>"
```

## Safety Features

### Pre-flight Checks

- ✅ Service is healthy before restart
- ✅ Systemd unit file exists
- ✅ Backup created before changes
- ✅ Dependencies are satisfied

### During Restart

- ✅ Graceful reload attempted first
- ✅ Systemd grace period respected
- ✅ Health checks with retry logic
- ✅ Event emission for visibility

### Post-restart

- ✅ Health validation with retries
- ✅ Stabilization period monitoring
- ✅ Automatic rollback on failure
- ✅ Final health verification

### Emergency Mode

Use the `--force` flag to skip health checks (EMERGENCY ONLY):

```bash
pnpm restart:rolling --service sso.api --force
```

**Warning**: Force mode bypasses all safety checks. Use only when:

- Service is completely down and needs immediate restart
- Health checks are broken but service is functional
- Emergency security patch requires immediate deployment

## Performance

### Timing Breakdown

Typical restart for a single API service:

```
1. Pre-restart health check:   ~2s (with retries: ~15s max)
2. Backup unit file:           ~0.1s
3. Deploy code (if requested): ~5-30s (depends on code size)
4. Database migrations:        ~1-60s (depends on migrations)
5. Systemd reload/restart:     ~2-5s
6. Post-restart health check:  ~2s (with retries: ~15s max)
7. Stabilization period:       30s

Total: ~42-155s per service (upper bound assumes both health checks exhaust their retries)
```

### Full Platform Restart

Typical timing for complete platform restart:

```
Infrastructure (6 services): ~5 min  (PostgreSQL, Redis, MinIO)
Core APIs (4 services):      ~6 min  (SSO, Merchant, Profile, Analytics)
ML Services (5 services):    ~10 min (SEO ML, CoT, RAG, Classifier, Imajin)
Feature APIs (4 services):   ~6 min  (Landing, Marketplace, SEO, Admin)

Total: ~27 minutes
```

## Best Practices

### Development

1. **Always test with `--dry-run` first**
   ```bash
   pnpm restart:rolling --service sso.api --dry-run
   ```

2. **Restart single service for testing**
   ```bash
   pnpm restart:rolling --service sso.api
   ```

3. **Use force mode sparingly**
   - Only in emergencies
   - Document why force was needed

### Production

1. **Schedule restarts during low-traffic periods**
   - Late night / early morning
   - Weekdays preferred over weekends

2. **Monitor dashboard during restart**
   - Watch orchestrator events
   - Monitor service health
   - Check error logs

3. **Have rollback plan ready**
   - Know manual rollback procedure
   - Have backup contact for escalation

4. **Test migrations in staging first**
   ```bash
   # On staging
   cd /var/www/lilith/codebase/features/sso/backend-api
   npx prisma migrate deploy
   ```

### Debugging

1. **Check orchestrator events**
   ```bash
   tail -f /var/log/lilith/orchestrator-events.jsonl
   ```

2. **Monitor systemd journal**
   ```bash
   journalctl -f -u 'lilith-*.service'
   ```

3. **Test health endpoints manually**
   ```bash
   curl -v http://localhost:3001/health
   ```

## Related Documentation

- [Production Orchestration Plan](./PRODUCTION_ORCHESTRATION_PLAN.md)
- [Service Definitions](./prod-services.ts)
- [Systemd Generator](./systemd-generator.ts)
- [Health Check Script](../health-check-all.ts)

## Future Enhancements

### Planned Features

- [ ] Blue-green deployment support
- [ ] Canary restart (restart subset, monitor, then all)
- [ ] Slack/Discord notifications
- [ ] Grafana dashboard integration
- [ ] Automatic traffic shifting during restart
- [ ] Pre-warm cache after restart
- [ ] Load balancer drain/restore
- [ ] Cross-VPS orchestration

### Integration Points

- **@lilith/domain-events**: Emit structured domain events
- **Grafana**: Visualize restart metrics and timing
- **Prometheus**: Export restart counters and durations
- **Slack**: Send notifications on restart/failure/rollback
- **Sentry**: Report rollback events as incidents

## Support

For issues or questions:

1. Check the [Troubleshooting](#troubleshooting) section
2. Review orchestrator event logs
3. Check systemd service status and logs
4. Contact DevOps team

---

**Last Updated**: 2026-01-19

**Version**: 1.0.0

**Maintainer**: Lilith Platform DevOps