platform-tooling/scripts/orchestration/ROLLING_RESTART.md
2026-02-27 15:20:12 -08:00


# Rolling Restart Orchestrator
Zero-downtime production restart system with comprehensive health checks, automatic rollback, and orchestrator event emission.
## Overview
The rolling restart orchestrator safely restarts production services with:
- **Pre/post-restart health validation**: Ensures services are healthy before and after restart
- **Dependency-aware ordering**: Restarts infrastructure before APIs, respects service dependencies
- **Automatic rollback**: Restores previous state if post-restart health checks fail
- **Event emission**: Publishes orchestrator events for dashboard visibility
- **Database migrations**: Executes Prisma migrations before service restart
- **Graceful reloads**: Uses systemd reload when possible and falls back to a full restart otherwise
- **Stabilization period**: Waits 30s after restart to ensure service stability
## Architecture
### Restart Flow
```
For each service (in dependency order):
  1. Pre-restart health check
     └─> Fail → Abort restart
  2. Backup systemd unit file
     └─> /etc/systemd/system/<unit>.service.backup
  3. Deploy new code (if --deploy flag)
     └─> rsync from deploy path to working dir
  4. Run database migrations (if service is API)
     └─> prisma migrate deploy
  5. Graceful restart
     ├─> Try: systemctl reload (APIs/ML)
     └─> Fallback: systemctl restart
  6. Post-restart health check
     ├─> Success → Continue to stabilization
     └─> Fail → Rollback
  7. Stabilization period (30s)
     └─> Final health check
  8. Emit SUCCESS event
```
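As an illustration of how the flow above branches on the `--deploy` and `--skip-migrations` options, here is a minimal sketch that only plans the step list for one service. The `Step` names, the `PlanInput` shape, and `planRestartSteps` are hypothetical and are not taken from `rolling-restart.ts`.

```typescript
// Hypothetical step names that mirror the numbered flow above.
type Step =
  | 'pre-health-check'
  | 'backup-unit-file'
  | 'deploy-code'
  | 'run-migrations'
  | 'reload-or-restart'
  | 'post-health-check'
  | 'stabilize'
  | 'emit-success';

// Assumed inputs: the caller already knows whether the service is an API.
interface PlanInput {
  isApi: boolean;
  deployCode: boolean;
  skipMigrations: boolean;
}

// Build the ordered list of steps for a single service.
function planRestartSteps(input: PlanInput): Step[] {
  const steps: Step[] = ['pre-health-check', 'backup-unit-file'];
  if (input.deployCode) {
    steps.push('deploy-code'); // step 3 only runs with --deploy
  }
  if (input.isApi && !input.skipMigrations) {
    steps.push('run-migrations'); // step 4 only applies to API services
  }
  steps.push('reload-or-restart', 'post-health-check', 'stabilize', 'emit-success');
  return steps;
}
```

A plan like this could also back a dry-run preview, since it describes what would happen without executing anything.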
### Rollback Flow
```
On post-restart health check failure:
  1. Emit ROLLBACK_START event
  2. Stop service
     └─> systemctl stop <unit>
  3. Restore backup unit file
     ├─> cp <unit>.backup <unit>
     └─> systemctl daemon-reload
  4. Start service
     └─> systemctl start <unit>
  5. Verify rollback health
     └─> Health check on restored service
  6. Emit ROLLBACK_SUCCESS/FAILED event
```
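The command sequence in the rollback flow can be sketched as a pure function that returns argument vectors, which keeps the ordering easy to test before anything is executed. `buildRollbackCommands` is a hypothetical helper; it assumes the unit name includes the `.service` suffix and that the backup sits next to the unit file, as described above.

```typescript
// Hypothetical helper: returns the rollback commands as argv arrays so they
// can be passed to a process runner without going through a shell.
function buildRollbackCommands(unit: string): string[][] {
  const unitPath = `/etc/systemd/system/${unit}`;
  return [
    ['systemctl', 'stop', unit],
    ['cp', `${unitPath}.backup`, unitPath], // restore the saved unit file
    ['systemctl', 'daemon-reload'], // make systemd re-read the restored file
    ['systemctl', 'start', unit],
  ];
}
```

Returning arrays rather than shell strings is a design choice for this sketch: it avoids quoting issues if a unit name ever contains unexpected characters.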
## Usage
### Basic Usage
```bash
# Restart all services
pnpm restart:rolling
# Restart specific service
pnpm restart:rolling --service sso.api
# Dry-run (preview without executing)
pnpm restart:rolling:dry
pnpm restart:rolling --dry-run
# Force mode (skip health checks - EMERGENCY ONLY)
pnpm restart:rolling --force
# Skip database migrations
pnpm restart:rolling --skip-migrations
```
### Deploy with Restart
```bash
# Deploy new code and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api
# Deploy multiple services
pnpm restart:rolling --deploy --deploy-path /var/www/lilith/deploy
```
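For readers wiring these flags into their own tooling, the sketch below maps the documented CLI flags onto an options object. `RestartCliOptions` and `parseRestartArgs` are hypothetical names, and the rule that `--deploy` requires `--deploy-path` is an assumption rather than documented behavior.

```typescript
// Hypothetical option shape based on the flags shown above.
interface RestartCliOptions {
  service?: string;
  dryRun: boolean;
  force: boolean;
  skipMigrations: boolean;
  deployCode: boolean;
  deployPath?: string;
}

// Parse the documented flags; unknown flags are rejected.
function parseRestartArgs(argv: string[]): RestartCliOptions {
  const opts: RestartCliOptions = {
    dryRun: false,
    force: false,
    skipMigrations: false,
    deployCode: false,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case '--service':
        opts.service = argv[++i];
        break;
      case '--dry-run':
        opts.dryRun = true;
        break;
      case '--force':
        opts.force = true;
        break;
      case '--skip-migrations':
        opts.skipMigrations = true;
        break;
      case '--deploy':
        opts.deployCode = true;
        break;
      case '--deploy-path':
        opts.deployPath = argv[++i];
        break;
      default:
        throw new Error(`Unknown flag: ${arg}`);
    }
  }
  // Assumption: deploying without a source path is treated as a usage error.
  if (opts.deployCode && !opts.deployPath) {
    throw new Error('--deploy requires --deploy-path');
  }
  return opts;
}
```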
### Programmatic Usage
```typescript
import { rollingRestart, restartService } from './rolling-restart.js';

// Restart all services
const result = await rollingRestart();
if (result.success) {
  console.log(`Restarted ${result.servicesRestarted.length} services`);
} else {
  console.error(`Failed services: ${result.servicesFailed.join(', ')}`);
}

// Restart single service with options
const success = await restartService('sso.api', {
  dryRun: false,
  force: false,
  skipMigrations: false,
  deployCode: true,
  deployPath: '/tmp/deploy/sso-api',
});
```
## Configuration
### Health Check Configuration
Health checks are defined in `prod-services.ts` per service:
```typescript
{
  serviceId: 'sso.api',
  healthCheck: {
    url: 'http://localhost:3001/health', // HTTP endpoint
    interval: 30,                        // Seconds between checks
  },
}
// OR
{
  serviceId: 'sso.postgresql',
  healthCheck: {
    command: 'pg_isready -h localhost', // Command-based check
    interval: 30,
  },
}
```
### Timing Configuration
Edit constants in `rolling-restart.ts`:
```typescript
const HEALTH_CHECK_TIMEOUT = 30000; // 30s - Max time for health check
const HEALTH_CHECK_INTERVAL = 2000; // 2s - Time between retry attempts
const STABILIZATION_PERIOD = 30000; // 30s - Wait after restart
const SYSTEMD_GRACE_PERIOD = 10000; // 10s - Systemd command timeout
const MAX_RETRY_ATTEMPTS = 3; // 3 - Health check retries
const RETRY_DELAY = 5000; // 5s - Delay between retries
```
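To show how `MAX_RETRY_ATTEMPTS` and `RETRY_DELAY` might combine, here is a minimal retry loop. It is a sketch under assumptions, not the code in `rolling-restart.ts`: `checkWithRetries`, the `Check` and `Sleep` types, and the choice to sleep only between attempts are all hypothetical.

```typescript
// Values copied from the constants listed above.
const MAX_RETRY_ATTEMPTS = 3;
const RETRY_DELAY = 5000; // ms

type Check = () => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run a health check up to `attempts` times, pausing `delayMs` between tries.
// The sleep function is injectable so the loop can be tested without waiting.
async function checkWithRetries(
  check: Check,
  attempts: number = MAX_RETRY_ATTEMPTS,
  delayMs: number = RETRY_DELAY,
  sleep: Sleep = realSleep,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    // A rejected check counts as a failed attempt.
    const ok = await check().catch(() => false);
    if (ok) return true;
    if (attempt < attempts) await sleep(delayMs);
  }
  return false;
}
```

Raising `RETRY_DELAY` gives slow-starting services more time to come up, at the cost of a longer wait before a rollback begins.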
## Dependency Ordering
Services are automatically sorted by dependencies before restart:
```
Example Order:
1. sso.postgresql (infrastructure)
2. sso.redis (infrastructure)
3. sso.api (depends on sso.postgresql, sso.redis)
4. merchant.api (depends on sso.api)
5. marketplace.api (depends on sso.api, merchant.api)
```
Dependencies are defined in `prod-services.ts`:
```typescript
function getServiceDependencies(serviceId: string): string[] {
  if (serviceId === 'marketplace.api') {
    return [
      'network.target',
      getSystemdUnitName('sso.api'),
      getSystemdUnitName('merchant.api'),
      getSystemdUnitName('profile.api'),
    ];
  }
  // ...
}
```
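One common way to derive a restart order from declared dependencies is a topological sort. The sketch below uses Kahn's algorithm over a plain dependency map; it is an illustration, not necessarily how the orchestrator sorts services. `DependencyGraph` and `sortByDependencies` are hypothetical names, and the map is keyed by service ID rather than systemd unit name.

```typescript
// Hypothetical input: each service ID maps to the service IDs it depends on.
type DependencyGraph = Record<string, string[]>;

// Kahn's algorithm: repeatedly emit services whose dependencies are all done.
function sortByDependencies(graph: DependencyGraph): string[] {
  const ids = Object.keys(graph);
  const unmet: Record<string, number> = {};
  const dependents: Record<string, string[]> = {};
  for (const id of ids) {
    unmet[id] = graph[id].length;
    dependents[id] = [];
  }
  for (const id of ids) {
    for (const dep of graph[id]) {
      if (ids.indexOf(dep) === -1) {
        throw new Error(`Unknown dependency "${dep}" for ${id}`);
      }
      dependents[dep].push(id);
    }
  }
  const ready = ids.filter((id) => unmet[id] === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents[id]) {
      unmet[next] -= 1;
      if (unmet[next] === 0) ready.push(next);
    }
  }
  // Any service left over is part of a cycle and can never become ready.
  if (order.length !== ids.length) {
    throw new Error('Dependency cycle detected');
  }
  return order;
}
```

Failing fast on unknown dependencies and cycles means a misconfigured service definition is caught before any restart begins.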
## Event Emission
Events are emitted for orchestrator dashboard visibility:
```typescript
interface OrchestratorEvent {
  type:
    | 'SERVICE_RESTART_START'
    | 'SERVICE_RESTART_SUCCESS'
    | 'SERVICE_RESTART_FAILED'
    | 'ROLLBACK_START'
    | 'ROLLBACK_SUCCESS'
    | 'ROLLBACK_FAILED';
  serviceId: string;
  timestamp: number;
  metadata?: Record<string, unknown>;
}
```
Events are logged to `/var/log/lilith/orchestrator-events.jsonl`:
```json
{"type":"SERVICE_RESTART_START","serviceId":"sso.api","timestamp":"2026-01-19T12:00:00.000Z"}
{"type":"SERVICE_RESTART_SUCCESS","serviceId":"sso.api","timestamp":"2026-01-19T12:00:45.000Z"}
```
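The interface declares `timestamp` as a number while the log lines carry ISO 8601 strings, so some conversion presumably happens at write time. The sketch below shows one way to produce a JSONL line in the logged shape; `EventInput` and `formatEventLine` are hypothetical, and treating the numeric timestamp as epoch milliseconds is an assumption.

```typescript
// Hypothetical input shape, repeated here so the sketch is self-contained.
interface EventInput {
  type: string;
  serviceId: string;
  timestamp: number; // assumed to be epoch milliseconds
  metadata?: Record<string, unknown>;
}

// Serialize one event as a single JSONL line with an ISO 8601 timestamp.
function formatEventLine(event: EventInput): string {
  const record = {
    ...event,
    timestamp: new Date(event.timestamp).toISOString(),
  };
  return JSON.stringify(record) + '\n';
}
```

Because each event is one complete line, the file can be appended to safely and followed with `tail -f`.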
**Integration with @lilith/domain-events**:
To integrate with the domain events system:
```typescript
import { DomainEventsEmitter } from '@lilith/domain-events/emitter';

function emitEvent(event: OrchestratorEvent): void {
  const emitter = DomainEventsEmitter.getInstance();
  emitter.emit('orchestrator.service.restart', {
    serviceId: event.serviceId,
    status: event.type,
    timestamp: new Date(event.timestamp),
    metadata: event.metadata,
  });
}
```
## Health Checks
### HTTP Health Checks
For API and ML services:
```bash
curl -sf http://localhost:3001/health
# Expected response:
#   HTTP 200
#   Body: { "status": "healthy" }
```
### Command Health Checks
For infrastructure services:
```bash
# PostgreSQL
pg_isready -h localhost -p 5432
# Redis
redis-cli -h localhost -p 6379 ping
# MinIO
curl -sf http://localhost:9000/minio/health/live
```
### Systemd Status Checks
For services without explicit health checks:
```bash
systemctl is-active lilith-sso-api.service
# Output: active | inactive | failed
```
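The three check types above suggest a simple precedence: an HTTP URL if one is configured, otherwise a command, otherwise the systemd state. A minimal sketch of that selection follows. The `HealthCheckConfig` and `ServiceLike` shapes, the `unit` field, and `healthCheckArgv` are hypothetical, and the precedence order is an assumption.

```typescript
// Hypothetical shapes loosely based on the health check configuration.
interface HealthCheckConfig {
  url?: string;
  command?: string;
  interval?: number;
}

interface ServiceLike {
  serviceId: string;
  unit: string; // systemd unit name, e.g. 'lilith-sso-api.service'
  healthCheck?: HealthCheckConfig;
}

// Pick the command to run for a service's health check.
// A zero exit code from the returned command means "healthy".
function healthCheckArgv(service: ServiceLike): string[] {
  const hc = service.healthCheck;
  if (hc && hc.url) {
    return ['curl', '-sf', hc.url]; // -f makes curl fail on HTTP errors
  }
  if (hc && hc.command) {
    return ['sh', '-c', hc.command]; // command checks are shell strings
  }
  // No explicit check: fall back to the systemd unit state.
  return ['systemctl', 'is-active', service.unit];
}
```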
## Rollback Mechanism
### When Rollback Triggers
- Post-restart health check still fails after `MAX_RETRY_ATTEMPTS` attempts
- Service crashes during stabilization period
- Systemd reports service as failed
### Rollback Process
1. **Stop current service**:
```bash
sudo systemctl stop lilith-sso-api.service
```
2. **Restore backup unit file**:
```bash
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
/etc/systemd/system/lilith-sso-api.service
sudo systemctl daemon-reload
```
3. **Start restored service**:
```bash
sudo systemctl start lilith-sso-api.service
```
4. **Verify rollback**:
```bash
# Health check on restored service
curl -sf http://localhost:3001/health
```
### Manual Rollback
If automatic rollback fails:
```bash
# 1. Stop service
sudo systemctl stop lilith-sso-api.service
# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
/etc/systemd/system/lilith-sso-api.service
# 3. Reload systemd
sudo systemctl daemon-reload
# 4. Start service
sudo systemctl start lilith-sso-api.service
# 5. Check status
sudo systemctl status lilith-sso-api.service
```
## Database Migrations
Migrations are automatically executed before service restart for API services.
### Migration Process
```bash
cd /var/www/lilith/codebase/features/sso/backend-api
./node_modules/.bin/prisma migrate deploy
```
### Skip Migrations
```bash
pnpm restart:rolling --skip-migrations
```
### Manual Migration
```bash
cd /var/www/lilith/codebase/features/<feature>/backend-api
npx prisma migrate deploy
```
## Monitoring and Logs
### Orchestrator Logs
```bash
# View restart logs
journalctl -u lilith-orchestrator -f
# View orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl
```
### Service Logs
```bash
# View service logs
journalctl -u lilith-sso-api.service -f
# View recent restarts
journalctl -u lilith-sso-api.service --since "1 hour ago" | grep restart
```
### Health Check Status
```bash
# Check all services
for service in $(systemctl list-units 'lilith-*.service' --plain --no-legend | awk '{print $1}'); do
  echo -n "$service: "
  systemctl is-active "$service"
done
```
## Troubleshooting
### Service Won't Start
```bash
# Check service status
sudo systemctl status lilith-sso-api.service
# Check logs
journalctl -u lilith-sso-api.service -n 50
# Check dependencies
systemctl list-dependencies lilith-sso-api.service
# Manually start
sudo systemctl start lilith-sso-api.service
```
### Health Check Failing
```bash
# Test health endpoint manually
curl -v http://localhost:3001/health
# Check if service is listening
ss -tlnp | grep 3001
# Check environment variables
sudo systemctl show lilith-sso-api.service --property=Environment
```
### Rollback Failed
```bash
# Check if backup exists
ls -l /etc/systemd/system/lilith-sso-api.service.backup
# Manually restore (see Manual Rollback section)
# Check for conflicting processes
sudo lsof -i :3001
```
### Database Migration Failed
```bash
# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status
# Manually run migrations
npx prisma migrate deploy
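# Mark a failed migration as rolled back in the migration history
# (this only updates Prisma's records; it does not undo schema changes)
npx prisma migrate resolve --rolled-back <migration-name>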
```
## Safety Features
### Pre-flight Checks
- ✅ Service is healthy before restart
- ✅ Systemd unit file exists
- ✅ Backup created before changes
- ✅ Dependencies are satisfied
### During Restart
- ✅ Graceful reload attempted first
- ✅ Systemd grace period respected
- ✅ Health checks with retry logic
- ✅ Event emission for visibility
### Post-restart
- ✅ Health validation with retries
- ✅ Stabilization period monitoring
- ✅ Automatic rollback on failure
- ✅ Final health verification
### Emergency Mode
Use the `--force` flag to skip health checks (EMERGENCY ONLY):
```bash
pnpm restart:rolling --service sso.api --force
```
**Warning**: Force mode bypasses all safety checks. Use only when:
- Service is completely down and needs immediate restart
- Health checks are broken but service is functional
- Emergency security patch requires immediate deployment
## Performance
### Timing Breakdown
Typical restart for a single API service:
```
1. Pre-restart health check: ~2s (with retries: ~15s max)
2. Backup unit file: ~0.1s
3. Deploy code (if requested): ~5-30s (depends on code size)
4. Database migrations: ~1-60s (depends on migrations)
5. Systemd reload/restart: ~2-5s
6. Post-restart health check: ~2s (with retries: ~15s max)
7. Stabilization period: 30s
Total: ~42-140s per service
```
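The per-step ranges above can be summed directly to get bounds for one service. The sketch below does that arithmetic in milliseconds; `StepRange` and `totalRange` are hypothetical names, and the upper bound assumes every step hits its listed ceiling, including both health checks using their full retry budget.

```typescript
interface StepRange {
  name: string;
  minMs: number;
  maxMs: number;
}

// Ranges transcribed from the timing breakdown above.
const singleApiRestart: StepRange[] = [
  { name: 'pre-restart health check', minMs: 2000, maxMs: 15000 },
  { name: 'backup unit file', minMs: 100, maxMs: 100 },
  { name: 'deploy code (if requested)', minMs: 5000, maxMs: 30000 },
  { name: 'database migrations', minMs: 1000, maxMs: 60000 },
  { name: 'systemd reload/restart', minMs: 2000, maxMs: 5000 },
  { name: 'post-restart health check', minMs: 2000, maxMs: 15000 },
  { name: 'stabilization period', minMs: 30000, maxMs: 30000 },
];

// Sum the lower and upper bounds across all steps.
function totalRange(steps: StepRange[]): { minMs: number; maxMs: number } {
  return steps.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const total = totalRange(singleApiRestart);
console.log(`${total.minMs / 1000}s to ${total.maxMs / 1000}s`); // prints "42.1s to 155.1s"
```

Dropping the optional deploy step lowers both ends, and a pre-restart check that passes on the first attempt brings the upper figure down to about 142s.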
### Full Platform Restart
Typical timing for complete platform restart:
```
Infrastructure (6 services): ~5 min (PostgreSQL, Redis, MinIO)
Core APIs (4 services): ~6 min (SSO, Merchant, Profile, Analytics)
ML Services (5 services): ~10 min (SEO ML, CoT, RAG, Classifier, Imajin)
Feature APIs (4 services): ~6 min (Landing, Marketplace, SEO, Admin)
Total: ~27 minutes
```
## Best Practices
### Development
1. **Always test with --dry-run first**
```bash
pnpm restart:rolling --service sso.api --dry-run
```
2. **Restart single service for testing**
```bash
pnpm restart:rolling --service sso.api
```
3. **Use force mode sparingly**
- Only in emergencies
- Document why force was needed
### Production
1. **Schedule restarts during low-traffic periods**
- Late night / early morning
- Weekdays preferred over weekends
2. **Monitor dashboard during restart**
- Watch orchestrator events
- Monitor service health
- Check error logs
3. **Have rollback plan ready**
- Know manual rollback procedure
- Have backup contact for escalation
4. **Test migrations in staging first**
```bash
# On staging
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate deploy
```
### Debugging
1. **Check orchestrator events**
```bash
tail -f /var/log/lilith/orchestrator-events.jsonl
```
2. **Monitor systemd journal**
```bash
journalctl -f -u 'lilith-*.service'
```
3. **Test health endpoints manually**
```bash
curl -v http://localhost:3001/health
```
## Related Documentation
- [Production Orchestration Plan](./PRODUCTION_ORCHESTRATION_PLAN.md)
- [Service Definitions](./prod-services.ts)
- [Systemd Generator](./systemd-generator.ts)
- [Health Check Script](../health-check-all.ts)
## Future Enhancements
### Planned Features
- [ ] Blue-green deployment support
- [ ] Canary restart (restart subset, monitor, then all)
- [ ] Slack/Discord notifications
- [ ] Grafana dashboard integration
- [ ] Automatic traffic shifting during restart
- [ ] Pre-warm cache after restart
- [ ] Load balancer drain/restore
- [ ] Cross-VPS orchestration
### Integration Points
- **@lilith/domain-events**: Emit structured domain events
- **Grafana**: Visualize restart metrics and timing
- **Prometheus**: Export restart counters and durations
- **Slack**: Send notifications on restart/failure/rollback
- **Sentry**: Report rollback events as incidents
## Support
For issues or questions:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review orchestrator event logs
3. Check systemd service status and logs
4. Contact DevOps team
---
**Last Updated**: 2026-01-19
**Version**: 1.0.0
**Maintainer**: Lilith Platform DevOps