# Rolling Restart Orchestrator

Zero-downtime production restart system with comprehensive health checks, automatic rollback, and orchestrator event emission.

## Overview

The rolling restart orchestrator safely restarts production services with:

- **Pre/post-restart health validation**: Ensures services are healthy before and after restart
- **Dependency-aware ordering**: Restarts infrastructure before APIs and respects service dependencies
- **Automatic rollback**: Restores the previous state if post-restart health checks fail
- **Event emission**: Publishes orchestrator events for dashboard visibility
- **Database migrations**: Executes Prisma migrations before service restart
- **Graceful reloads**: Uses systemd reload when possible and falls back to restart
- **Stabilization period**: Waits 30s after restart to ensure service stability

## Architecture

### Restart Flow

```
For each service (in dependency order):
  1. Pre-restart health check
     └─> Fail → Abort restart

  2. Backup systemd unit file
     └─> /etc/systemd/system/<unit>.service.backup

  3. Deploy new code (if --deploy flag)
     └─> rsync from deploy path to working dir

  4. Run database migrations (if service is API)
     └─> prisma migrate deploy

  5. Graceful restart
     ├─> Try: systemctl reload (APIs/ML)
     └─> Fallback: systemctl restart

  6. Post-restart health check
     ├─> Success → Continue to stabilization
     └─> Fail → Rollback

  7. Stabilization period (30s)
     └─> Final health check

  8. Emit SUCCESS event
```

### Rollback Flow

```
On post-restart health check failure:
  1. Emit ROLLBACK_START event

  2. Stop service
     └─> systemctl stop <unit>

  3. Restore backup unit file
     └─> cp <unit>.backup <unit>
     └─> systemctl daemon-reload

  4. Start service
     └─> systemctl start <unit>

  5. Verify rollback health
     └─> Health check on restored service

  6. Emit ROLLBACK_SUCCESS/FAILED event
```

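The two flows above can be read as one control loop per service. The following is a minimal sketch of that loop, not the code in `rolling-restart.ts`: the `RestartSteps` interface, the `restartOne` function, and the outcome names are hypothetical, and each stage is injected as a function so the ordering can be exercised without touching systemd. Emitting `SERVICE_RESTART_FAILED` just before the rollback starts is an assumption.

```typescript
// Hypothetical step interface: each function stands in for one stage of the
// restart or rollback flow. None of these names come from rolling-restart.ts.
interface RestartSteps {
  healthCheck(serviceId: string): Promise<boolean>;
  backupUnitFile(serviceId: string): Promise<void>;
  deployCode(serviceId: string): Promise<void>;
  runMigrations(serviceId: string): Promise<void>;
  reloadOrRestart(serviceId: string): Promise<void>;
  stabilize(serviceId: string): Promise<void>;
  rollback(serviceId: string): Promise<boolean>;
  emit(type: string, serviceId: string): void;
}

type RestartOutcome = 'success' | 'aborted' | 'rolled_back' | 'rollback_failed';

async function restartOne(
  serviceId: string,
  steps: RestartSteps,
  opts: { deploy: boolean; isApi: boolean },
): Promise<RestartOutcome> {
  // 1. Pre-restart health check: abort before changing anything.
  if (!(await steps.healthCheck(serviceId))) return 'aborted';

  steps.emit('SERVICE_RESTART_START', serviceId);
  await steps.backupUnitFile(serviceId);                // 2. backup unit file
  if (opts.deploy) await steps.deployCode(serviceId);   // 3. deploy (optional)
  if (opts.isApi) await steps.runMigrations(serviceId); // 4. migrations (APIs)
  await steps.reloadOrRestart(serviceId);               // 5. reload or restart

  // 6. Post-restart health check, then 7. stabilization and final check.
  let healthy = await steps.healthCheck(serviceId);
  if (healthy) {
    await steps.stabilize(serviceId);
    healthy = await steps.healthCheck(serviceId);
  }
  if (healthy) {
    steps.emit('SERVICE_RESTART_SUCCESS', serviceId);   // 8. success event
    return 'success';
  }

  // Rollback flow: emit start, restore, emit the result.
  steps.emit('SERVICE_RESTART_FAILED', serviceId);
  steps.emit('ROLLBACK_START', serviceId);
  const restored = await steps.rollback(serviceId);
  steps.emit(restored ? 'ROLLBACK_SUCCESS' : 'ROLLBACK_FAILED', serviceId);
  return restored ? 'rolled_back' : 'rollback_failed';
}
```

Injecting the steps keeps the ordering logic separate from the shell commands, which makes the abort and rollback branches easy to test with fakes.
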
## Usage

### Basic Usage

```bash
# Restart all services
pnpm restart:rolling

# Restart specific service
pnpm restart:rolling --service sso.api

# Dry-run (preview without executing)
pnpm restart:rolling:dry
pnpm restart:rolling --dry-run

# Force mode (skip health checks - EMERGENCY ONLY)
pnpm restart:rolling --force

# Skip database migrations
pnpm restart:rolling --skip-migrations
```

### Deploy with Restart

```bash
# Deploy new code and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

# Deploy multiple services
pnpm restart:rolling --deploy --deploy-path /var/www/lilith/deploy
```

### Programmatic Usage

```typescript
import { rollingRestart, restartService } from './rolling-restart.js';

// Restart all services
const result = await rollingRestart();

if (result.success) {
  console.log(`Restarted ${result.servicesRestarted.length} services`);
} else {
  console.error(`Failed services: ${result.servicesFailed.join(', ')}`);
}

// Restart single service with options
const success = await restartService('sso.api', {
  dryRun: false,
  force: false,
  skipMigrations: false,
  deployCode: true,
  deployPath: '/tmp/deploy/sso-api',
});
```

## Configuration

### Health Check Configuration

Health checks are defined in `prod-services.ts` per service:

```typescript
{
  serviceId: 'sso.api',
  healthCheck: {
    url: 'http://localhost:3001/health', // HTTP endpoint
    interval: 30,                        // Seconds between checks
  },
}

// OR

{
  serviceId: 'sso.postgresql',
  healthCheck: {
    command: 'pg_isready -h localhost',  // Command-based check
    interval: 30,
  },
}
```

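To illustrate how the two config shapes might be consumed, here is a small sketch that picks the check to run for a service. The `HealthCheckConfig` and `ServiceDefinition` types and the `healthCheckCommand` helper are hypothetical, modeled only on the examples above; the fallback to `systemctl is-active` mirrors the Systemd Status Checks section of this document.

```typescript
// Assumed shapes, inferred from the two config examples; the real types in
// prod-services.ts may differ.
interface HealthCheckConfig {
  url?: string;      // HTTP endpoint, for API and ML services
  command?: string;  // shell command, for infrastructure services
  interval: number;  // seconds between checks
}

interface ServiceDefinition {
  serviceId: string;
  healthCheck?: HealthCheckConfig;
}

// Hypothetical helper: returns the shell command that checks one service.
// Values are assumed to come from trusted config, so no shell quoting is done.
function healthCheckCommand(service: ServiceDefinition, unitName: string): string {
  const check = service.healthCheck;
  if (check?.url) return `curl -sf ${check.url}`;
  if (check?.command) return check.command;
  // No explicit check configured: fall back to the systemd unit state.
  return `systemctl is-active ${unitName}`;
}
```

For example, a service with only `url` set resolves to a `curl -sf` call, while a service with no `healthCheck` entry resolves to `systemctl is-active <unit>`.
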
### Timing Configuration

Edit constants in `rolling-restart.ts`:

```typescript
const HEALTH_CHECK_TIMEOUT = 30000;   // 30s - Max time for health check
const HEALTH_CHECK_INTERVAL = 2000;   // 2s  - Time between retry attempts
const STABILIZATION_PERIOD = 30000;   // 30s - Wait after restart
const SYSTEMD_GRACE_PERIOD = 10000;   // 10s - Systemd command timeout
const MAX_RETRY_ATTEMPTS = 3;         // 3   - Health check retries
const RETRY_DELAY = 5000;             // 5s  - Delay between retries
```

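As a rough illustration of how `MAX_RETRY_ATTEMPTS` and `RETRY_DELAY` could drive a health check, consider the sketch below. The `checkWithRetries` and `sleep` helpers are hypothetical, and how the real implementation combines these two constants with `HEALTH_CHECK_INTERVAL` and `HEALTH_CHECK_TIMEOUT` is not shown in this document.

```typescript
// Values copied from the constants above.
const MAX_RETRY_ATTEMPTS = 3;
const RETRY_DELAY = 5000; // milliseconds

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs `check` up to `attempts` times and waits
// `delayMs` between failed attempts. Resolves to true on the first success.
async function checkWithRetries(
  check: () => Promise<boolean>,
  attempts: number = MAX_RETRY_ATTEMPTS,
  delayMs: number = RETRY_DELAY,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    // A check that throws (for example, connection refused) counts as a failure.
    const ok = await check().catch(() => false);
    if (ok) return true;
    // Do not sleep after the last attempt; there is nothing left to wait for.
    if (attempt < attempts) await sleep(delayMs);
  }
  return false;
}
```

With the defaults, a check that never succeeds takes about 10s of waiting (two delays of 5s) plus the time of the three attempts themselves.
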
## Dependency Ordering

Services are automatically sorted by dependencies before restart:

```
Example Order:
1. sso.postgresql (infrastructure)
2. sso.redis (infrastructure)
3. sso.api (depends on sso.postgresql, sso.redis)
4. merchant.api (depends on sso.api)
5. marketplace.api (depends on sso.api, merchant.api)
```

Dependencies are defined in `prod-services.ts`:

```typescript
function getServiceDependencies(serviceId: string): string[] {
  if (serviceId === 'marketplace.api') {
    return [
      'network.target',
      getSystemdUnitName('sso.api'),
      getSystemdUnitName('merchant.api'),
      getSystemdUnitName('profile.api'),
    ];
  }
  // ...
}
```

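Dependency-aware ordering is typically a topological sort. The sketch below shows one way to do it with a depth-first traversal; it is an illustration, not necessarily the algorithm used by the orchestrator. The `sortByDependencies` function and the map-based input are hypothetical, and the example data is taken from the example order above.

```typescript
// Hypothetical helper: orders services so that every service appears after
// the services it depends on. Input maps serviceId -> ids it depends on.
function sortByDependencies(deps: Map<string, string[]>): string[] {
  const ordered: string[] = [];
  const state = new Map<string, 'visiting' | 'done'>();

  const visit = (id: string, path: string[]): void => {
    if (state.get(id) === 'done') return;
    if (state.get(id) === 'visiting') {
      throw new Error(`Dependency cycle: ${[...path, id].join(' -> ')}`);
    }
    state.set(id, 'visiting');
    for (const dep of deps.get(id) ?? []) {
      // Skip external units (such as network.target) that are not managed here.
      if (!deps.has(dep)) continue;
      visit(dep, [...path, id]);
    }
    state.set(id, 'done');
    ordered.push(id); // all dependencies are already in `ordered`
  };

  for (const id of deps.keys()) visit(id, []);
  return ordered;
}

// Usage with the services from the example order.
const exampleOrder = sortByDependencies(
  new Map<string, string[]>([
    ['marketplace.api', ['sso.api', 'merchant.api']],
    ['merchant.api', ['sso.api']],
    ['sso.api', ['sso.postgresql', 'sso.redis']],
    ['sso.redis', []],
    ['sso.postgresql', []],
  ]),
);
```

Failing loudly on a cycle is a deliberate choice in this sketch: a circular dependency has no valid restart order, so it is better surfaced before any service is touched.
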
## Event Emission

Events are emitted for orchestrator dashboard visibility:

```typescript
interface OrchestratorEvent {
  type: 'SERVICE_RESTART_START' | 'SERVICE_RESTART_SUCCESS' |
        'SERVICE_RESTART_FAILED' | 'ROLLBACK_START' | 'ROLLBACK_SUCCESS' |
        'ROLLBACK_FAILED';
  serviceId: string;
  timestamp: number;
  metadata?: Record<string, unknown>;
}
```

Events are logged to `/var/log/lilith/orchestrator-events.jsonl`, one JSON object per line, with `timestamp` serialized as an ISO 8601 string:

```json
{"type":"SERVICE_RESTART_START","serviceId":"sso.api","timestamp":"2026-01-19T12:00:00.000Z"}
{"type":"SERVICE_RESTART_SUCCESS","serviceId":"sso.api","timestamp":"2026-01-19T12:00:45.000Z"}
```

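The interface declares `timestamp` as a number while the log lines carry ISO 8601 strings, so some conversion has to happen when an event is written. A minimal sketch of that serialization step follows; the `toJsonLine` helper is hypothetical, and treating the numeric timestamp as epoch milliseconds is an assumption.

```typescript
// Local copy of the event shape, with `type` widened to string for brevity.
interface LoggedEvent {
  type: string;
  serviceId: string;
  timestamp: number; // assumed to be epoch milliseconds
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: renders one event as a single JSONL line.
function toJsonLine(event: LoggedEvent): string {
  const record = {
    type: event.type,
    serviceId: event.serviceId,
    timestamp: new Date(event.timestamp).toISOString(),
    // Omit the key entirely when there is no metadata.
    ...(event.metadata ? { metadata: event.metadata } : {}),
  };
  // JSON.stringify never emits raw newlines, so one event is always one line.
  return JSON.stringify(record) + '\n';
}
```

Appending the returned string with Node's `fs.appendFileSync` would keep the file valid JSONL, since every call adds exactly one newline-terminated record.
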
**Integration with @lilith/domain-events**:

To integrate with the domain events system:

```typescript
import { DomainEventsEmitter } from '@lilith/domain-events/emitter';

function emitEvent(event: OrchestratorEvent): void {
  const emitter = DomainEventsEmitter.getInstance();

  emitter.emit('orchestrator.service.restart', {
    serviceId: event.serviceId,
    status: event.type,
    timestamp: new Date(event.timestamp),
    metadata: event.metadata,
  });
}
```

## Health Checks

### HTTP Health Checks

For API and ML services:

```bash
curl -sf http://localhost:3001/health

# Expected response:
#   HTTP 200
#   Body: { "status": "healthy" }
```

### Command Health Checks

For infrastructure services:

```bash
# PostgreSQL
pg_isready -h localhost -p 5432

# Redis
redis-cli -h localhost -p 6379 ping

# MinIO
curl -sf http://localhost:9000/minio/health/live
```

### Systemd Status Checks

For services without explicit health checks:

```bash
systemctl is-active lilith-sso-api.service
# Output: active | inactive | failed
```

## Rollback Mechanism

### When Rollback Triggers

- Post-restart health check fails after `MAX_RETRY_ATTEMPTS`
- Service crashes during stabilization period
- Systemd reports service as failed

### Rollback Process

1. **Stop current service**:
   ```bash
   sudo systemctl stop lilith-sso-api.service
   ```

2. **Restore backup unit file**:
   ```bash
   sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
           /etc/systemd/system/lilith-sso-api.service
   sudo systemctl daemon-reload
   ```

3. **Start restored service**:
   ```bash
   sudo systemctl start lilith-sso-api.service
   ```

4. **Verify rollback**:
   ```bash
   # Health check on restored service
   curl -sf http://localhost:3001/health
   ```

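The four steps above can be sketched as a sequence that stops at the first failing command, so a failed rollback reports exactly where it stopped. The `rollbackUnit` function, the `CommandRunner` type, and the `RollbackResult` shape are hypothetical; the command runner is injected so the sketch does not execute anything by itself.

```typescript
// Hypothetical runner: executes one shell command and returns true on exit code 0.
type CommandRunner = (command: string) => boolean;

interface RollbackResult {
  success: boolean;
  failedCommand?: string; // set only when a step failed
}

function rollbackUnit(unit: string, healthUrl: string, run: CommandRunner): RollbackResult {
  const unitPath = `/etc/systemd/system/${unit}`;
  const commands = [
    `sudo systemctl stop ${unit}`,            // 1. stop current service
    `sudo cp ${unitPath}.backup ${unitPath}`, // 2. restore backup unit file
    'sudo systemctl daemon-reload',           //    and make systemd re-read it
    `sudo systemctl start ${unit}`,           // 3. start restored service
    `curl -sf ${healthUrl}`,                  // 4. verify rollback
  ];
  for (const command of commands) {
    // Stop at the first failure; later steps depend on the earlier ones.
    if (!run(command)) return { success: false, failedCommand: command };
  }
  return { success: true };
}
```

A caller could map `success` to the `ROLLBACK_SUCCESS` or `ROLLBACK_FAILED` event and include `failedCommand` in the event metadata to guide a manual rollback.
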
### Manual Rollback

If automatic rollback fails:

```bash
# 1. Stop service
sudo systemctl stop lilith-sso-api.service

# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
        /etc/systemd/system/lilith-sso-api.service

# 3. Reload systemd
sudo systemctl daemon-reload

# 4. Start service
sudo systemctl start lilith-sso-api.service

# 5. Check status
sudo systemctl status lilith-sso-api.service
```

## Database Migrations

Migrations are automatically executed before service restart for API services.

### Migration Process

```bash
cd /var/www/lilith/codebase/features/sso/backend-api
./node_modules/.bin/prisma migrate deploy
```

### Skip Migrations

```bash
pnpm restart:rolling --skip-migrations
```

### Manual Migration

```bash
cd /var/www/lilith/codebase/features/<feature>/backend-api
npx prisma migrate deploy
```

## Monitoring and Logs

### Orchestrator Logs

```bash
# View restart logs
journalctl -u lilith-orchestrator -f

# View orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl
```

### Service Logs

```bash
# View service logs
journalctl -u lilith-sso-api.service -f

# View recent restarts
journalctl -u lilith-sso-api.service --since "1 hour ago" | grep restart
```

### Health Check Status

```bash
# Check all services (--all also lists inactive units)
for service in $(systemctl list-units 'lilith-*.service' --all --plain --no-legend | awk '{print $1}'); do
  echo -n "$service: "
  systemctl is-active "$service"
done
```

## Troubleshooting

### Service Won't Start

```bash
# Check service status
sudo systemctl status lilith-sso-api.service

# Check logs
journalctl -u lilith-sso-api.service -n 50

# Check dependencies
systemctl list-dependencies lilith-sso-api.service

# Manually start
sudo systemctl start lilith-sso-api.service
```

### Health Check Failing

```bash
# Test health endpoint manually
curl -v http://localhost:3001/health

# Check if service is listening
ss -tlnp | grep 3001

# Check environment variables
sudo systemctl show lilith-sso-api.service --property=Environment
```

### Rollback Failed

```bash
# Check if backup exists
ls -l /etc/systemd/system/lilith-sso-api.service.backup

# Manually restore (see Manual Rollback section)

# Check for conflicting processes
sudo lsof -i :3001
```

### Database Migration Failed

```bash
# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status

# Manually run migrations
npx prisma migrate deploy

# Mark a failed migration as rolled back in the migration history (if needed).
# This does not revert schema changes; undo those manually first.
npx prisma migrate resolve --rolled-back <migration-name>
```

## Safety Features

### Pre-flight Checks

- ✅ Service is healthy before restart
- ✅ Systemd unit file exists
- ✅ Backup created before changes
- ✅ Dependencies are satisfied

### During Restart

- ✅ Graceful reload attempted first
- ✅ Systemd grace period respected
- ✅ Health checks with retry logic
- ✅ Event emission for visibility

### Post-restart

- ✅ Health validation with retries
- ✅ Stabilization period monitoring
- ✅ Automatic rollback on failure
- ✅ Final health verification

### Emergency Mode

Use the `--force` flag to skip health checks (EMERGENCY ONLY):

```bash
pnpm restart:rolling --service sso.api --force
```

**Warning**: Force mode bypasses all safety checks. Use only when:
- Service is completely down and needs immediate restart
- Health checks are broken but service is functional
- Emergency security patch requires immediate deployment

## Performance

### Timing Breakdown

Typical restart for a single API service:

```
1. Pre-restart health check:     ~2s (with retries: ~15s max)
2. Backup unit file:             ~0.1s
3. Deploy code (if requested):   ~5-30s (depends on code size)
4. Database migrations:          ~1-60s (depends on migrations)
5. Systemd reload/restart:       ~2-5s
6. Post-restart health check:    ~2s (with retries: ~15s max)
7. Stabilization period:         30s

Total: ~42-130s per service (up to ~155s if both health checks use all retries)
```

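The total is just the sum of the per-step figures, which the following sketch makes explicit. The `StepEstimate` type, the `RESTART_STEPS` table, and the `estimateTotal` function are hypothetical; the numbers are copied from the breakdown above, using the with-retries figure as the upper bound for the two health checks.

```typescript
interface StepEstimate {
  name: string;
  minSeconds: number;
  maxSeconds: number;
  optional?: boolean; // true for steps that only run when requested
}

// Figures copied from the timing breakdown.
const RESTART_STEPS: StepEstimate[] = [
  { name: 'pre-restart health check', minSeconds: 2, maxSeconds: 15 },
  { name: 'backup unit file', minSeconds: 0.1, maxSeconds: 0.1 },
  { name: 'deploy code', minSeconds: 5, maxSeconds: 30, optional: true },
  { name: 'database migrations', minSeconds: 1, maxSeconds: 60 },
  { name: 'systemd reload/restart', minSeconds: 2, maxSeconds: 5 },
  { name: 'post-restart health check', minSeconds: 2, maxSeconds: 15 },
  { name: 'stabilization period', minSeconds: 30, maxSeconds: 30 },
];

// Sums the best and worst case, rounded to one decimal place.
function estimateTotal(
  steps: StepEstimate[],
  includeOptional: boolean,
): { minSeconds: number; maxSeconds: number } {
  const used = steps.filter((step) => includeOptional || !step.optional);
  const round = (value: number): number => Math.round(value * 10) / 10;
  return {
    minSeconds: round(used.reduce((sum, step) => sum + step.minSeconds, 0)),
    maxSeconds: round(used.reduce((sum, step) => sum + step.maxSeconds, 0)),
  };
}
```

With every step included, the best case comes to about 42s and the worst case to about 155s; without the optional deploy step both bounds drop by the deploy time.
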
### Full Platform Restart

Typical timing for complete platform restart:

```
Infrastructure (6 services):  ~5 min   (PostgreSQL, Redis, MinIO)
Core APIs (4 services):       ~6 min   (SSO, Merchant, Profile, Analytics)
ML Services (5 services):     ~10 min  (SEO ML, CoT, RAG, Classifier, Imajin)
Feature APIs (4 services):    ~6 min   (Landing, Marketplace, SEO, Admin)

Total: ~27 minutes
```

## Best Practices

### Development

1. **Always test with --dry-run first**
   ```bash
   pnpm restart:rolling --service sso.api --dry-run
   ```

2. **Restart single service for testing**
   ```bash
   pnpm restart:rolling --service sso.api
   ```

3. **Use force mode sparingly**
   - Only in emergencies
   - Document why force was needed

### Production

1. **Schedule restarts during low-traffic periods**
   - Late night / early morning
   - Weekdays preferred over weekends

2. **Monitor dashboard during restart**
   - Watch orchestrator events
   - Monitor service health
   - Check error logs

3. **Have rollback plan ready**
   - Know manual rollback procedure
   - Have backup contact for escalation

4. **Test migrations in staging first**
   ```bash
   # On staging
   cd /var/www/lilith/codebase/features/sso/backend-api
   npx prisma migrate deploy
   ```

### Debugging

1. **Check orchestrator events**
   ```bash
   tail -f /var/log/lilith/orchestrator-events.jsonl
   ```

2. **Monitor systemd journal**
   ```bash
   journalctl -f -u 'lilith-*.service'
   ```

3. **Test health endpoints manually**
   ```bash
   curl -v http://localhost:3001/health
   ```

## Related Documentation

- [Production Orchestration Plan](./PRODUCTION_ORCHESTRATION_PLAN.md)
- [Service Definitions](./prod-services.ts)
- [Systemd Generator](./systemd-generator.ts)
- [Health Check Script](../health-check-all.ts)

## Future Enhancements

### Planned Features

- [ ] Blue-green deployment support
- [ ] Canary restart (restart subset, monitor, then all)
- [ ] Slack/Discord notifications
- [ ] Grafana dashboard integration
- [ ] Automatic traffic shifting during restart
- [ ] Pre-warm cache after restart
- [ ] Load balancer drain/restore
- [ ] Cross-VPS orchestration

### Integration Points

- **@lilith/domain-events**: Emit structured domain events
- **Grafana**: Visualize restart metrics and timing
- **Prometheus**: Export restart counters and durations
- **Slack**: Send notifications on restart/failure/rollback
- **Sentry**: Report rollback events as incidents

## Support

For issues or questions:

1. Check the [Troubleshooting](#troubleshooting) section
2. Review orchestrator event logs
3. Check systemd service status and logs
4. Contact DevOps team

---

**Last Updated**: 2026-01-19
**Version**: 1.0.0
**Maintainer**: Lilith Platform DevOps