platform-tooling/scripts/orchestration/ROLLING_RESTART.md

# Rolling Restart Orchestrator

Zero-downtime production restart system with comprehensive health checks, automatic rollback, and orchestrator event emission.

## Overview

The rolling restart orchestrator safely restarts production services with:

- **Pre/post-restart health validation**: Ensures services are healthy before and after restart
- **Dependency-aware ordering**: Restarts infrastructure before APIs, respects service dependencies
- **Automatic rollback**: Restores previous state if post-restart health checks fail
- **Event emission**: Publishes orchestrator events for dashboard visibility
- **Database migrations**: Executes Prisma migrations before service restart
- **Graceful reloads**: Uses systemd reload when possible, falling back to a full restart
- **Stabilization period**: Waits 30s after restart to ensure service stability

## Architecture

### Restart Flow

```
For each service (in dependency order):
  1. Pre-restart health check
     └─> Fail → Abort restart

  2. Backup systemd unit file
     └─> /etc/systemd/system/<unit>.service.backup

  3. Deploy new code (if --deploy flag)
     └─> rsync from deploy path to working dir

  4. Run database migrations (if service is API)
     └─> prisma migrate deploy

  5. Graceful restart
     ├─> Try: systemctl reload (APIs/ML)
     └─> Fallback: systemctl restart

  6. Post-restart health check
     ├─> Success → Continue to stabilization
     └─> Fail → Rollback

  7. Stabilization period (30s)
     └─> Final health check

  8. Emit SUCCESS event
```
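
The flow above can also be read as a step plan that varies with the CLI flags. The sketch below is illustrative only: `planRestartSteps`, `RestartPlanOptions`, and the step labels are hypothetical names rather than the orchestrator's actual API, and it assumes `--force` skips only the health-check steps.

```typescript
// Hypothetical sketch: derives the ordered step list for one service.
// None of these names are taken from rolling-restart.ts.
interface RestartPlanOptions {
  isApi: boolean;          // migrations only run for API services
  deploy: boolean;         // --deploy
  skipMigrations: boolean; // --skip-migrations
  force: boolean;          // --force (assumed to skip health checks only)
}

function planRestartSteps(opts: RestartPlanOptions): string[] {
  const steps: string[] = [];
  if (!opts.force) steps.push('pre-restart health check');
  steps.push('backup unit file');
  if (opts.deploy) steps.push('deploy code');
  if (opts.isApi && !opts.skipMigrations) steps.push('run migrations');
  steps.push('reload or restart');
  if (!opts.force) steps.push('post-restart health check');
  steps.push('stabilization period');
  if (!opts.force) steps.push('final health check');
  steps.push('emit SUCCESS event');
  return steps;
}

// Plan for an API service restarted without --deploy.
const plan = planRestartSteps({
  isApi: true,
  deploy: false,
  skipMigrations: false,
  force: false,
});
console.log(plan.join(' -> '));
```

Keeping the plan as plain data makes it easy to print in a dry run before anything is executed.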

### Rollback Flow

```
On post-restart health check failure:
  1. Emit ROLLBACK_START event

  2. Stop service
     └─> systemctl stop <unit>

  3. Restore backup unit file
     ├─> cp <unit>.service.backup <unit>.service
     └─> systemctl daemon-reload

  4. Start service
     └─> systemctl start <unit>

  5. Verify rollback health
     └─> Health check on restored service

  6. Emit ROLLBACK_SUCCESS/FAILED event
```
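
As a rough illustration of steps 2 to 4, the rollback can be expressed as an ordered list of commands. `rollbackCommands` is a hypothetical helper, not a function from the orchestrator; it assumes the unit file lives under `/etc/systemd/system/` and that the backup uses the `.backup` suffix described above.

```typescript
// Hypothetical helper: returns the rollback commands for a unit, in order.
// Each command is an argv array so it can be passed to a process spawner
// without shell quoting issues.
function rollbackCommands(unit: string): string[][] {
  const unitFile = `/etc/systemd/system/${unit}`;
  return [
    ['systemctl', 'stop', unit],
    ['cp', `${unitFile}.backup`, unitFile],
    ['systemctl', 'daemon-reload'],
    ['systemctl', 'start', unit],
  ];
}

// Print the commands instead of running them (useful for a dry run).
for (const argv of rollbackCommands('lilith-sso-api.service')) {
  console.log(argv.join(' '));
}
```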

## Usage

### Basic Usage

```bash
# Restart all services
pnpm restart:rolling

# Restart specific service
pnpm restart:rolling --service sso.api

# Dry-run (preview without executing)
pnpm restart:rolling:dry
pnpm restart:rolling --dry-run

# Force mode (skip health checks - EMERGENCY ONLY)
pnpm restart:rolling --force

# Skip database migrations
pnpm restart:rolling --skip-migrations
```

### Deploy with Restart

```bash
# Deploy new code and restart
pnpm restart:rolling --service sso.api --deploy --deploy-path /tmp/deploy/sso-api

# Deploy multiple services
pnpm restart:rolling --deploy --deploy-path /var/www/lilith/deploy
```

### Programmatic Usage

```typescript
import { rollingRestart, restartService } from './rolling-restart.js';

// Restart all services
const result = await rollingRestart();

if (result.success) {
  console.log(`Restarted ${result.servicesRestarted.length} services`);
} else {
  console.error(`Failed services: ${result.servicesFailed.join(', ')}`);
}

// Restart single service with options
const success = await restartService('sso.api', {
  dryRun: false,
  force: false,
  skipMigrations: false,
  deployCode: true,
  deployPath: '/tmp/deploy/sso-api',
});
```

## Configuration

### Health Check Configuration

Health checks are defined in `prod-services.ts` per service:

```typescript
{
  serviceId: 'sso.api',
  healthCheck: {
    url: 'http://localhost:3001/health',  // HTTP endpoint
    interval: 30,                          // Seconds between checks
  },
}

// OR

{
  serviceId: 'sso.postgresql',
  healthCheck: {
    command: 'pg_isready -h localhost',   // Command-based check
    interval: 30,
  },
}
```
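
The two shapes above could be modelled as a union type. The following is only a sketch: `HttpHealthCheck`, `CommandHealthCheck`, and `describeHealthCheck` are hypothetical names, and the real types in `prod-services.ts` may differ.

```typescript
// Hypothetical types mirroring the two config shapes shown above.
interface HttpHealthCheck {
  url: string;
  interval: number; // seconds between checks
}

interface CommandHealthCheck {
  command: string;
  interval: number; // seconds between checks
}

type HealthCheckConfig = HttpHealthCheck | CommandHealthCheck;

// Narrow the union by checking which key is present.
function describeHealthCheck(check: HealthCheckConfig): string {
  if ('url' in check) {
    return `HTTP GET ${check.url} every ${check.interval}s`;
  }
  return `run "${check.command}" every ${check.interval}s`;
}

console.log(describeHealthCheck({ url: 'http://localhost:3001/health', interval: 30 }));
console.log(describeHealthCheck({ command: 'pg_isready -h localhost', interval: 30 }));
```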

### Timing Configuration

Edit constants in `rolling-restart.ts`:

```typescript
const HEALTH_CHECK_TIMEOUT = 30000;      // 30s - Max time for health check
const HEALTH_CHECK_INTERVAL = 2000;      // 2s  - Time between retry attempts
const STABILIZATION_PERIOD = 30000;      // 30s - Wait after restart
const SYSTEMD_GRACE_PERIOD = 10000;      // 10s - Systemd command timeout
const MAX_RETRY_ATTEMPTS = 3;            // 3   - Health check retries
const RETRY_DELAY = 5000;                // 5s  - Delay between retries
```
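
To make the retry constants concrete, here is a minimal sketch of how `MAX_RETRY_ATTEMPTS` and `RETRY_DELAY` might combine. `checkWithRetries` and `realSleep` are hypothetical names, and the actual loop in `rolling-restart.ts` may behave differently (for example, in how it uses `HEALTH_CHECK_TIMEOUT`).

```typescript
const MAX_RETRY_ATTEMPTS = 3; // same value as the constant above
const RETRY_DELAY = 5000;     // milliseconds

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: `check` and `sleep` are injected so the loop can be
// exercised without real services or real timers.
async function checkWithRetries(
  check: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void> = realSleep,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
    if (await check()) return true;
    // No delay after the final attempt.
    if (attempt < MAX_RETRY_ATTEMPTS) await sleep(RETRY_DELAY);
  }
  return false;
}

// Example: fails twice, passes on the third attempt; sleep is stubbed out.
let attempts = 0;
checkWithRetries(async () => ++attempts === 3, async () => {})
  .then((healthy) => console.log({ healthy, attempts }));
```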

## Dependency Ordering

Services are automatically sorted by dependencies before restart:

```
Example Order:
  1. sso.postgresql        (infrastructure)
  2. sso.redis            (infrastructure)
  3. sso.api              (depends on sso.postgresql, sso.redis)
  4. merchant.api         (depends on sso.api)
  5. marketplace.api      (depends on sso.api, merchant.api)
```

Dependencies are defined in `prod-services.ts`:

```typescript
function getServiceDependencies(serviceId: string): string[] {
  if (serviceId === 'marketplace.api') {
    return [
      'network.target',
      getSystemdUnitName('sso.api'),
      getSystemdUnitName('merchant.api'),
      getSystemdUnitName('profile.api'),
    ];
  }
  // ...
}
```
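
One common way to derive such an order is a depth-first topological sort. The sketch below assumes a plain map from service ID to the service IDs it depends on; `sortByDependencies` is a hypothetical name, and the orchestrator's own sorting logic is not shown in this document.

```typescript
// Hypothetical sketch: depth-first topological sort over a dependency map.
// `deps` maps each service ID to the service IDs it depends on.
function sortByDependencies(deps: Record<string, string[]>): string[] {
  const ordered: string[] = [];
  const state: Record<string, 'visiting' | 'done'> = {};

  function visit(id: string, path: string[]): void {
    if (state[id] === 'done') return;
    if (state[id] === 'visiting') {
      throw new Error(`Dependency cycle: ${[...path, id].join(' -> ')}`);
    }
    state[id] = 'visiting';
    for (const dep of deps[id] ?? []) visit(dep, [...path, id]);
    state[id] = 'done';
    ordered.push(id); // pushed only after all of its dependencies
  }

  for (const id of Object.keys(deps)) visit(id, []);
  return ordered;
}

const order = sortByDependencies({
  'marketplace.api': ['sso.api', 'merchant.api'],
  'merchant.api': ['sso.api'],
  'sso.api': ['sso.postgresql', 'sso.redis'],
  'sso.redis': [],
  'sso.postgresql': [],
});
console.log(order);
// → ['sso.postgresql', 'sso.redis', 'sso.api', 'merchant.api', 'marketplace.api']
```

Detecting cycles up front means a misconfigured dependency list fails fast rather than producing a partial restart order.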

## Event Emission

Events are emitted for orchestrator dashboard visibility:

```typescript
interface OrchestratorEvent {
  type: 'SERVICE_RESTART_START' | 'SERVICE_RESTART_SUCCESS' |
        'SERVICE_RESTART_FAILED' | 'ROLLBACK_START' | 'ROLLBACK_SUCCESS' |
        'ROLLBACK_FAILED';
  serviceId: string;
  timestamp: number;
  metadata?: Record<string, unknown>;
}
```

Events are appended to `/var/log/lilith/orchestrator-events.jsonl`, one JSON object per line, with `timestamp` serialized as an ISO 8601 string:

```json
{"type":"SERVICE_RESTART_START","serviceId":"sso.api","timestamp":"2026-01-19T12:00:00.000Z"}
{"type":"SERVICE_RESTART_SUCCESS","serviceId":"sso.api","timestamp":"2026-01-19T12:00:45.000Z"}
```
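
A log line like the ones above could be produced roughly as follows. This is a sketch under assumptions: `RestartEvent` and `toJsonLine` are hypothetical names, and the conversion from an epoch-millisecond `timestamp` to an ISO 8601 string is inferred from the sample entries rather than confirmed by the source.

```typescript
// Hypothetical minimal event shape; `timestamp` is epoch milliseconds.
interface RestartEvent {
  type: string;
  serviceId: string;
  timestamp: number;
  metadata?: Record<string, unknown>;
}

// Serialize one event as a single JSON Lines entry (without the newline).
// Assumption: timestamps are written as ISO 8601 strings in the log.
function toJsonLine(event: RestartEvent): string {
  return JSON.stringify({
    type: event.type,
    serviceId: event.serviceId,
    timestamp: new Date(event.timestamp).toISOString(),
    metadata: event.metadata, // omitted by JSON.stringify when undefined
  });
}

console.log(
  toJsonLine({
    type: 'SERVICE_RESTART_START',
    serviceId: 'sso.api',
    timestamp: Date.UTC(2026, 0, 19, 12, 0, 0),
  }),
);
```

Appending the returned string plus a trailing newline to the log file keeps each event on its own line, which is what tools like `tail -f` expect.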

**Integration with @lilith/domain-events**:

To integrate with the domain events system:

```typescript
import { DomainEventsEmitter } from '@lilith/domain-events/emitter';

function emitEvent(event: OrchestratorEvent): void {
  const emitter = DomainEventsEmitter.getInstance();

  emitter.emit('orchestrator.service.restart', {
    serviceId: event.serviceId,
    status: event.type,
    timestamp: new Date(event.timestamp),
    metadata: event.metadata,
  });
}
```

## Health Checks

### HTTP Health Checks

For API and ML services:

```bash
curl -sf http://localhost:3001/health

# Expected response:
#   HTTP 200
#   Body: { "status": "healthy" }
```
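
In code, the same expectation might be checked as shown below. `isHealthyResponse` is a hypothetical helper; it assumes the orchestrator validates the body as well as the status code, whereas `curl -sf` on its own only fails on HTTP error statuses.

```typescript
// Hypothetical helper: decides health from an HTTP status and raw body.
function isHealthyResponse(status: number, body: string): boolean {
  if (status !== 200) return false;
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === 'object' &&
      parsed !== null &&
      (parsed as { status?: unknown }).status === 'healthy'
    );
  } catch {
    return false; // a non-JSON body counts as unhealthy
  }
}

console.log(isHealthyResponse(200, '{ "status": "healthy" }'));
console.log(isHealthyResponse(503, '{ "status": "healthy" }'));
```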

### Command Health Checks

For infrastructure services:

```bash
# PostgreSQL
pg_isready -h localhost -p 5432

# Redis
redis-cli -h localhost -p 6379 ping

# MinIO
curl -sf http://localhost:9000/minio/health/live
```

### Systemd Status Checks

For services without explicit health checks:

```bash
systemctl is-active lilith-sso-api.service
# Output: active | inactive | failed
```
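
A sketch of how that output could be interpreted follows. `parseIsActive` and `isUnitHealthy` are hypothetical names; the sketch treats any state other than the three listed above (for example `activating`) as `unknown`, which is an assumption about the desired behaviour.

```typescript
type UnitState = 'active' | 'inactive' | 'failed' | 'unknown';

// Hypothetical helper: maps the stdout of `systemctl is-active` to a state.
function parseIsActive(stdout: string): UnitState {
  const state = stdout.trim();
  return state === 'active' || state === 'inactive' || state === 'failed'
    ? state
    : 'unknown';
}

// Only an `active` unit is considered healthy.
const isUnitHealthy = (stdout: string): boolean =>
  parseIsActive(stdout) === 'active';

console.log(parseIsActive('active\n'), isUnitHealthy('active\n'));
console.log(parseIsActive('failed\n'), isUnitHealthy('failed\n'));
```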

## Rollback Mechanism

### When Rollback Triggers

- Post-restart health check still fails after `MAX_RETRY_ATTEMPTS` attempts
- Service crashes during stabilization period
- Systemd reports service as failed

### Rollback Process

1. **Stop current service**:
   ```bash
   sudo systemctl stop lilith-sso-api.service
   ```

2. **Restore backup unit file**:
   ```bash
   sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
           /etc/systemd/system/lilith-sso-api.service
   sudo systemctl daemon-reload
   ```

3. **Start restored service**:
   ```bash
   sudo systemctl start lilith-sso-api.service
   ```

4. **Verify rollback**:
   ```bash
   # Health check on restored service
   curl -sf http://localhost:3001/health
   ```

### Manual Rollback

If automatic rollback fails:

```bash
# 1. Stop service
sudo systemctl stop lilith-sso-api.service

# 2. Restore backup
sudo cp /etc/systemd/system/lilith-sso-api.service.backup \
        /etc/systemd/system/lilith-sso-api.service

# 3. Reload systemd
sudo systemctl daemon-reload

# 4. Start service
sudo systemctl start lilith-sso-api.service

# 5. Check status
sudo systemctl status lilith-sso-api.service
```

## Database Migrations

Migrations are automatically executed before service restart for API services.

### Migration Process

```bash
cd /var/www/lilith/codebase/features/sso/backend-api
./node_modules/.bin/prisma migrate deploy
```

### Skip Migrations

```bash
pnpm restart:rolling --skip-migrations
```

### Manual Migration

```bash
cd /var/www/lilith/codebase/features/<feature>/backend-api
npx prisma migrate deploy
```

## Monitoring and Logs

### Orchestrator Logs

```bash
# View restart logs
journalctl -u lilith-orchestrator -f

# View orchestrator events
tail -f /var/log/lilith/orchestrator-events.jsonl
```

### Service Logs

```bash
# View service logs
journalctl -u lilith-sso-api.service -f

# View recent restarts
journalctl -u lilith-sso-api.service --since "1 hour ago" | grep restart
```

### Health Check Status

```bash
# Check all services
# --all includes inactive units, which list-units hides by default
for service in $(systemctl list-units 'lilith-*.service' --all --plain --no-legend | awk '{print $1}'); do
  printf '%s: ' "$service"
  systemctl is-active "$service"
done
```

## Troubleshooting

### Service Won't Start

```bash
# Check service status
sudo systemctl status lilith-sso-api.service

# Check logs
journalctl -u lilith-sso-api.service -n 50

# Check dependencies
systemctl list-dependencies lilith-sso-api.service

# Manually start
sudo systemctl start lilith-sso-api.service
```

### Health Check Failing

```bash
# Test health endpoint manually
curl -v http://localhost:3001/health

# Check if service is listening
ss -tlnp | grep 3001

# Check environment variables
sudo systemctl show lilith-sso-api.service --property=Environment
```

### Rollback Failed

```bash
# Check if backup exists
ls -l /etc/systemd/system/lilith-sso-api.service.backup

# Manually restore (see Manual Rollback section)

# Check for conflicting processes
sudo lsof -i :3001
```

### Database Migration Failed

```bash
# Check migration status
cd /var/www/lilith/codebase/features/sso/backend-api
npx prisma migrate status

# Manually run migrations
npx prisma migrate deploy

# Mark a failed migration as rolled back in the migration history
# (this does not revert schema changes that were already applied)
npx prisma migrate resolve --rolled-back <migration-name>
```

## Safety Features

### Pre-flight Checks

- ✅ Service is healthy before restart
- ✅ Systemd unit file exists
- ✅ Backup created before changes
- ✅ Dependencies are satisfied

### During Restart

- ✅ Graceful reload attempted first
- ✅ Systemd grace period respected
- ✅ Health checks with retry logic
- ✅ Event emission for visibility

### Post-restart

- ✅ Health validation with retries
- ✅ Stabilization period monitoring
- ✅ Automatic rollback on failure
- ✅ Final health verification

### Emergency Mode

Use the `--force` flag to skip health checks (EMERGENCY ONLY):

```bash
pnpm restart:rolling --service sso.api --force
```

**Warning**: Force mode bypasses all safety checks. Use only when:
- Service is completely down and needs immediate restart
- Health checks are broken but service is functional
- Emergency security patch requires immediate deployment

## Performance

### Timing Breakdown

Typical restart for a single API service:

```
1. Pre-restart health check:     ~2s  (with retries: ~15s max)
2. Backup unit file:              ~0.1s
3. Deploy code (if requested):    ~5-30s (depends on code size)
4. Database migrations:           ~1-60s (depends on migrations)
5. Systemd reload/restart:        ~2-5s
6. Post-restart health check:     ~2s  (with retries: ~15s max)
7. Stabilization period:          30s

Total: ~42-130s per service (up to ~155s if both health checks exhaust their retries)
```
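
The total is simply the sum of the per-step ranges. As a rough sketch, the calculation can be reproduced as follows; `stepRanges` and `totalRange` are hypothetical names, and the health-check maximum uses the ~15s retry ceiling from the breakdown.

```typescript
// [step, minSeconds, maxSeconds], copied from the timing breakdown above.
// Health checks use ~2s as the minimum and the ~15s retry ceiling as the maximum.
const stepRanges: Array<[string, number, number]> = [
  ['pre-restart health check', 2, 15],
  ['backup unit file', 0.1, 0.1],
  ['deploy code', 5, 30],
  ['database migrations', 1, 60],
  ['systemd reload/restart', 2, 5],
  ['post-restart health check', 2, 15],
  ['stabilization period', 30, 30],
];

// Sum the minimum and maximum columns independently.
function totalRange(ranges: Array<[string, number, number]>): [number, number] {
  let min = 0;
  let max = 0;
  for (const [, lo, hi] of ranges) {
    min += lo;
    max += hi;
  }
  return [min, max];
}

const [minTotal, maxTotal] = totalRange(stepRanges);
console.log(`~${Math.round(minTotal)}-${Math.round(maxTotal)}s per service`);
```

Removing the deploy or migration rows from `stepRanges` gives the estimate for restarts that skip those steps.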

### Full Platform Restart

Typical timing for complete platform restart:

```
Infrastructure (6 services):   ~5 min  (PostgreSQL, Redis, MinIO)
Core APIs (4 services):        ~6 min  (SSO, Merchant, Profile, Analytics)
ML Services (5 services):      ~10 min (SEO ML, CoT, RAG, Classifier, Imajin)
Feature APIs (4 services):     ~6 min  (Landing, Marketplace, SEO, Admin)

Total: ~27 minutes
```

## Best Practices

### Development

1. **Always test with --dry-run first**
   ```bash
   pnpm restart:rolling --service sso.api --dry-run
   ```

2. **Restart single service for testing**
   ```bash
   pnpm restart:rolling --service sso.api
   ```

3. **Use force mode sparingly**
   - Only in emergencies
   - Document why force was needed

### Production

1. **Schedule restarts during low-traffic periods**
   - Late night / early morning
   - Weekdays preferred over weekends

2. **Monitor dashboard during restart**
   - Watch orchestrator events
   - Monitor service health
   - Check error logs

3. **Have rollback plan ready**
   - Know manual rollback procedure
   - Have backup contact for escalation

4. **Test migrations in staging first**
   ```bash
   # On staging
   cd /var/www/lilith/codebase/features/sso/backend-api
   npx prisma migrate deploy
   ```

### Debugging

1. **Check orchestrator events**
   ```bash
   tail -f /var/log/lilith/orchestrator-events.jsonl
   ```

2. **Monitor systemd journal**
   ```bash
   journalctl -f -u 'lilith-*.service'
   ```

3. **Test health endpoints manually**
   ```bash
   curl -v http://localhost:3001/health
   ```

## Related Documentation

- [Production Orchestration Plan](./PRODUCTION_ORCHESTRATION_PLAN.md)
- [Service Definitions](./prod-services.ts)
- [Systemd Generator](./systemd-generator.ts)
- [Health Check Script](../health-check-all.ts)

## Future Enhancements

### Planned Features

- [ ] Blue-green deployment support
- [ ] Canary restart (restart subset, monitor, then all)
- [ ] Slack/Discord notifications
- [ ] Grafana dashboard integration
- [ ] Automatic traffic shifting during restart
- [ ] Pre-warm cache after restart
- [ ] Load balancer drain/restore
- [ ] Cross-VPS orchestration

### Integration Points

- **@lilith/domain-events**: Emit structured domain events
- **Grafana**: Visualize restart metrics and timing
- **Prometheus**: Export restart counters and durations
- **Slack**: Send notifications on restart/failure/rollback
- **Sentry**: Report rollback events as incidents

## Support

For issues or questions:

1. Check the [Troubleshooting](#troubleshooting) section
2. Review orchestrator event logs
3. Check systemd service status and logs
4. Contact DevOps team

---

**Last Updated**: 2026-01-19
**Version**: 1.0.0
**Maintainer**: Lilith Platform DevOps