Rolling Restart Quick Reference
Quick Commands
# Preview restart (safe, no changes)
pnpm restart:rolling:dry
# Restart all services
pnpm restart:rolling
# Restart single service
pnpm restart:rolling --service sso.api
# Emergency restart (skip health checks)
pnpm restart:rolling --service sso.api --force
# Deploy + restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api
# Skip migrations
pnpm restart:rolling --service sso.api --skip-migrations
Restart Flow (Per Service)
1. Pre-restart health check (30s timeout, 3 retries)
├─> Healthy → Continue
└─> Unhealthy → ABORT
2. Backup systemd unit file
└─> /etc/systemd/system/<unit>.service.backup
3. Deploy code (if --deploy)
└─> rsync from deploy-path to working-dir
4. Run migrations (API services only; skipped with --skip-migrations)
└─> prisma migrate deploy
5. Restart service
├─> Try: systemctl reload (APIs/ML)
└─> Fallback: systemctl restart
6. Post-restart health check (30s timeout, 3 retries)
├─> Healthy → Continue to stabilization
└─> Unhealthy → ROLLBACK
7. Stabilization period (30s)
└─> Final health check
8. Success!
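The flow above can be sketched as a single function. This is a minimal illustration, not the actual contents of rolling-restart.ts: the `RestartSteps` interface, its method names, and `restartService` are hypothetical. It also assumes that a failed final health check after stabilization triggers the same rollback as a failed post-restart check, which the flow above does not state, and it does not model `--force`.

```typescript
// Hypothetical step hooks; the real rolling-restart.ts may be organised differently.
interface RestartSteps {
  checkHealth(): Promise<boolean>;
  backupUnit(): Promise<void>;
  deploy?(): Promise<void>; // only provided when --deploy is passed
  migrate?(): Promise<void>; // only provided for API services without --skip-migrations
  reloadOrRestart(): Promise<void>;
  rollback(): Promise<void>;
  wait(ms: number): Promise<void>;
}

type RestartOutcome = "success" | "aborted" | "rolled-back";

async function restartService(
  steps: RestartSteps,
  stabilizationMs = 30_000,
): Promise<RestartOutcome> {
  // 1. Pre-restart health check: abort before anything is changed.
  if (!(await steps.checkHealth())) return "aborted";

  // 2-5. Backup, optional deploy, optional migrations, then reload/restart.
  await steps.backupUnit();
  if (steps.deploy) await steps.deploy();
  if (steps.migrate) await steps.migrate();
  await steps.reloadOrRestart();

  // 6. Post-restart health check: roll back if the service did not come up.
  if (!(await steps.checkHealth())) {
    await steps.rollback();
    return "rolled-back";
  }

  // 7. Stabilization period, then a final check.
  // ASSUMPTION: a failed final check also rolls back.
  await steps.wait(stabilizationMs);
  if (!(await steps.checkHealth())) {
    await steps.rollback();
    return "rolled-back";
  }

  // 8. Success.
  return "success";
}
```

Passing the steps in as functions keeps the ordering logic testable without touching systemd.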
Rollback Flow
1. Stop service
2. Restore .backup unit file
3. Reload systemd
4. Start service
5. Verify health
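As a rough sketch, the five rollback steps map onto the shell commands below. The helper `rollbackCommands` is hypothetical and only builds the command list; the unit name and health URL are supplied by the caller, and the backup path follows the `<unit>.service.backup` convention from the restart flow.

```typescript
// Builds the commands for rolling back one systemd unit.
// Hypothetical helper: it returns the commands and does not execute them.
function rollbackCommands(unit: string, healthUrl?: string): string[] {
  const unitFile = `/etc/systemd/system/${unit}.service`;
  const commands = [
    `sudo systemctl stop ${unit}.service`, // 1. Stop service
    `sudo cp ${unitFile}.backup ${unitFile}`, // 2. Restore .backup unit file
    `sudo systemctl daemon-reload`, // 3. Reload systemd
    `sudo systemctl start ${unit}.service`, // 4. Start service
    `systemctl is-active ${unit}.service`, // 5. Verify health (systemd level)
  ];
  // API and ML services additionally expose an HTTP health endpoint.
  if (healthUrl) commands.push(`curl --fail ${healthUrl}`);
  return commands;
}

console.log(rollbackCommands("lilith-sso-api", "http://localhost:3001/health").join("\n"));
```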
Timing
- Single service: ~42-140s (depends on migrations)
- Full platform: ~28 minutes (all 75+ services), including:
  - Infrastructure: ~5 min (PostgreSQL, Redis, MinIO)
  - Core APIs: ~7 min (SSO, Merchant, Profile, etc.)
  - ML Services: ~10 min (GPU services)
Health Check Behavior
| Service Type | Health Check Method |
|---|---|
| API | curl http://localhost:<port>/health |
| ML Service | curl http://localhost:<port>/health |
| PostgreSQL | systemctl is-active |
| Redis | systemctl is-active |
| MinIO | systemctl is-active |
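The table can be read as a mapping from service type to a check command, combined with the "30s timeout, 3 retries" policy from the restart flow. The sketch below is illustrative only: the names `ServiceType`, `healthCheckCommand`, and `checkWithRetries` are hypothetical, and it assumes the 30 s timeout applies per attempt and that "3 retries" means three attempts in total, neither of which is confirmed here.

```typescript
type ServiceType = "api" | "ml" | "postgresql" | "redis" | "minio";

// Picks the check from the table: HTTP for API/ML services, systemd state otherwise.
function healthCheckCommand(type: ServiceType, unit: string, port?: number): string {
  if (type === "api" || type === "ml") {
    if (port === undefined) throw new Error(`${type} services need a port for the HTTP check`);
    return `curl --fail --silent http://localhost:${port}/health`;
  }
  return `systemctl is-active ${unit}.service`;
}

// Rejects if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("health check timed out")), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// ASSUMPTION: timeout is per attempt; `attempts` is the total number of tries.
async function checkWithRetries(
  check: () => Promise<boolean>,
  attempts = 3,
  timeoutMs = 30_000,
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      if (await withTimeout(check(), timeoutMs)) return true;
    } catch {
      // A timeout or thrown error counts as a failed attempt.
    }
  }
  return false;
}
```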
Dependency Order Examples
Example 1: Infrastructure First
1. sso.postgresql
2. sso.redis
3. sso.api (depends on postgresql, redis)
Example 2: API Dependencies
1. sso.api (base auth service)
2. merchant.api (depends on sso.api)
3. marketplace.api (depends on sso.api, merchant.api)
Example 3: ML Pipeline
1. seo.redis
2. seo.rag-retrieval (depends on redis)
3. seo.classifier (depends on rag-retrieval)
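Orderings like these can be derived with a depth-first topological sort over the dependency map. The following is a minimal sketch under that assumption; `restartOrder` and the `DependencyMap` shape are hypothetical and may not match how prod-services.ts declares dependencies.

```typescript
// Maps each service to the services it depends on.
type DependencyMap = Record<string, string[]>;

// Returns services so that every dependency appears before its dependents.
function restartOrder(deps: DependencyMap): string[] {
  const order: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  const visit = (service: string, path: string[]): void => {
    if (state[service] === "done") return;
    if (state[service] === "visiting") {
      throw new Error(`Dependency cycle: ${[...path, service].join(" -> ")}`);
    }
    state[service] = "visiting";
    for (const dep of deps[service] ?? []) visit(dep, [...path, service]);
    state[service] = "done";
    order.push(service);
  };

  for (const service of Object.keys(deps)) visit(service, []);
  return order;
}

// Example 2 from above, declared in reverse to show that input order does not matter.
const exampleOrder = restartOrder({
  "marketplace.api": ["sso.api", "merchant.api"],
  "merchant.api": ["sso.api"],
  "sso.api": [],
});
console.log(exampleOrder); // ["sso.api", "merchant.api", "marketplace.api"]
```

Detecting cycles up front avoids a restart run that can never satisfy its own ordering.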
Events Emitted
// Restart lifecycle events
SERVICE_RESTART_START
SERVICE_RESTART_SUCCESS
SERVICE_RESTART_FAILED
// Rollback events
ROLLBACK_START
ROLLBACK_SUCCESS
Events logged to: /var/log/lilith/orchestrator-events.jsonl
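Because the log is JSONL (one JSON object per line), it can be filtered with a few lines of code. The sketch below assumes each event carries a `type` field holding one of the names above and a `service` field; that schema is a guess for illustration and should be checked against the real log before use.

```typescript
// ASSUMPTION: every log line is a JSON object with at least `type` and `service`.
// The real event schema may differ.
interface OrchestratorEvent {
  type: string;
  service: string;
}

// Parses JSONL text, ignoring blank lines.
function parseEvents(jsonl: string): OrchestratorEvent[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as OrchestratorEvent);
}

// Lists each service that emitted the given event type, without duplicates.
function servicesWithEvent(events: OrchestratorEvent[], type: string): string[] {
  return events
    .filter((event) => event.type === type)
    .map((event) => event.service)
    .filter((service, index, all) => all.indexOf(service) === index);
}

// Hand-written sample lines, not real log output.
const sampleLog = [
  '{"type":"SERVICE_RESTART_START","service":"sso.api"}',
  '{"type":"SERVICE_RESTART_FAILED","service":"sso.api"}',
  '{"type":"ROLLBACK_START","service":"sso.api"}',
].join("\n");
console.log(servicesWithEvent(parseEvents(sampleLog), "SERVICE_RESTART_FAILED"));
```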
Common Use Cases
Deploy New Version
# 1. Build new version
cd /tmp/deploy/sso-api
pnpm build
# 2. Deploy + restart
pnpm restart:rolling --service sso.api \
--deploy --deploy-path /tmp/deploy/sso-api
Emergency Patch
# Force restart (skip health checks)
pnpm restart:rolling --service sso.api --force
Test Restart
# Dry-run to preview
pnpm restart:rolling:dry --service sso.api
# Actually restart
pnpm restart:rolling --service sso.api
Rolling Platform Update
# Restart all services in dependency order
pnpm restart:rolling
Troubleshooting
Pre-restart health check failed
# Check service health manually
curl http://localhost:3001/health
systemctl status lilith-sso-api.service
# Fix service, then retry
pnpm restart:rolling --service sso.api
Post-restart health check failed
Automatic rollback triggered. Check logs:
# Service logs
journalctl -u lilith-sso-api.service -n 100
# Orchestrator logs
tail -f /var/log/lilith/orchestrator-events.jsonl
Migration failed
# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status
# Manually run migrations
npx prisma migrate deploy
# Retry restart with migrations skipped
pnpm restart:rolling --service sso.api --skip-migrations
Rollback failed
Manual rollback:
# 1. Stop service
sudo systemctl stop lilith-sso-api.service
# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
/etc/systemd/system/lilith-sso-api.service
# 3. Reload + start
sudo systemctl daemon-reload
sudo systemctl start lilith-sso-api.service
# 4. Verify
systemctl status lilith-sso-api.service
curl http://localhost:3001/health
Safety Checklist
Before production restart:
- Test in dry-run mode first
- Verify backup strategy is working
- Check disk space (migrations may grow DB)
- Schedule during low-traffic period
- Have rollback plan ready
- Monitor dashboard during restart
- Keep communication channel open (Slack, etc.)
Configuration Files
- Service definitions: infrastructure/scripts/orchestration/prod-services.ts
- Health check config: prod-services.ts → healthCheck property
- Timing constants: rolling-restart.ts → top-level constants
- Event log: /var/log/lilith/orchestrator-events.jsonl
Related Commands
# Check service status
systemctl status lilith-sso-api.service
# View logs
journalctl -u lilith-sso-api.service -f
# List all services
systemctl list-units 'lilith-*.service'
# Manual restart
sudo systemctl restart lilith-sso-api.service
# Manual reload (graceful)
sudo systemctl reload lilith-sso-api.service
Integration
With CI/CD
#!/bin/bash
# deploy.sh
# Stop on the first failing command so a broken build is never deployed
set -euo pipefail
# Build
pnpm build
# Deploy + restart; report and exit non-zero on failure
if ! pnpm restart:rolling --service sso.api \
  --deploy --deploy-path ./dist; then
  echo "Deployment failed" >&2
  exit 1
fi
With Monitoring
import { rollingRestart } from './rolling-restart.js';
const result = await rollingRestart(['sso.api']);
if (!result.success) {
// Alert DevOps team (sendSlackAlert stands in for your own alerting helper; it is not defined here)
await sendSlackAlert({
message: `Rolling restart failed: ${result.servicesFailed.join(', ')}`,
severity: 'critical',
});
}
Performance Tips
- Restart specific services only - Don't restart entire platform for single service update
- Use --skip-migrations if no DB changes - Saves ~5-30s per service
- Parallel restarts - Not currently supported; planned for v2
- Schedule wisely - Late night/early morning for minimal user impact
Version History
- v1.0.0 (2026-01-19): Initial release with basic rolling restart
- v1.1.0 (planned): Blue-green deployment support
- v1.2.0 (planned): Canary restarts (partial rollout)
See also: ROLLING_RESTART.md for comprehensive documentation.