platform-codebase/features/conversation-assistant/LOGGING.md

7.5 KiB
Executable file

Structured Logging Implementation

This document describes the structured logging implementation across all conversation-assistant services.

Overview

We implement comprehensive structured logging with:

  • JSON format for production (machine-readable, log aggregation friendly)
  • Human-readable format for development (colored console output)
  • Request correlation via request IDs
  • Contextual information (device IDs, user IDs, model versions)

Services

1. NestJS Server (server/)

Implementation:

  • Custom Logger service extending NestJS logger interface
  • AsyncLocalStorage for request context propagation
  • Logging interceptor for automatic request/response logging

Key Files:

  • src/common/logger.service.ts - Core logger implementation
  • src/common/logging.interceptor.ts - HTTP request logging

Configuration:

// Production (JSON output)
NODE_ENV=production

// Development (pretty console)
NODE_ENV=development

Usage Example:

import { createLogger } from '../../common';

class MyService {
  private readonly logger = createLogger(MyService.name);

  async myMethod() {
    this.logger.logWithData('info', 'Processing request', {
      userId: '123',
      operation: 'create',
    });
  }
}

Log Structure:

{
  "timestamp": "2025-12-28T10:30:45.123Z",
  "level": "info",
  "message": "Device registration successful",
  "context": "DevicesService",
  "requestId": "a1b2c3d4",
  "deviceId": "device-123",
  "platform": "macos"
}

2. ML Service (ml-service/)

Implementation:

  • structlog for structured logging
  • Custom logging configuration with JSON/text formats
  • Request ID middleware for correlation
  • Context variables for request tracking

Key Files:

  • src/logging_config.py - Logging configuration
  • src/main.py - Request logging middleware

Configuration:

# Environment variables
ML_SERVICE_LOG_LEVEL=INFO     # DEBUG, INFO, WARNING, ERROR, CRITICAL
ML_SERVICE_LOG_FORMAT=json    # json or text

Usage Example:

from .logging_config import get_logger

logger = get_logger(__name__)

logger.info("Model loaded successfully",
           model_id="ministral-3b",
           gpu_layers=-1,
           context_size=4096)

Log Structure:

{
  "timestamp": "2025-12-28T10:30:45.123456",
  "level": "info",
  "event": "Generation completed",
  "logger": "src.main",
  "request_id": "a1b2c3d4-e5f6-7890",
  "tokens_used": 128,
  "confidence": 0.85,
  "duration": "1.23s"
}

3. Frontend (frontend/)

Implementation:

  • Enhanced ErrorBoundary with structured error logging
  • Console logging with structured context
  • Hook for external error reporting services

Key Files:

  • src/components/ErrorBoundary.tsx - Error boundary with structured logging

Log Structure:

{
  "timestamp": "2025-12-28T10:30:45.123Z",
  "level": "error",
  "message": "React component error",
  "error": {
    "name": "TypeError",
    "message": "Cannot read property 'x' of undefined",
    "stack": "..."
  },
  "componentStack": "...",
  "userAgent": "Mozilla/5.0...",
  "url": "http://localhost:5173/devices"
}

Logged Events

NestJS Server

DevicesService:

  • Device registration (new/existing)
  • Verification attempts (success/failure)
  • Rate limiting triggers
  • Device lockouts

ResponsesService:

  • Generation requests
  • ML service calls (with timing)
  • Cache hits/misses
  • Generation failures

TrainingService:

  • Training job creation
  • Sample selection
  • ML service submission
  • Progress polling
  • Job completion/failure

ML Service

Lifecycle:

  • Service startup/shutdown
  • Model loading/unloading
  • Redis connection status

Generation:

  • Request receipt
  • Cache hits/misses
  • LLM generation (with timing)
  • Response caching
  • Errors

Training:

  • Job processing
  • Data preparation
  • Progress updates
  • Completion/failure

Request Correlation

All logs within a request context include the same request_id:

NestJS:

// Automatically set by LoggingInterceptor
// Propagated via AsyncLocalStorage
requestId: "a1b2c3d4-e5f6-7890"

ML Service:

# Set by LoggingRoute middleware
# Propagated via structlog contextvars
request_id: "a1b2c3d4-e5f6-7890"

Log Aggregation

For production deployments, structured JSON logs can be aggregated using:

  • ELK Stack: Elasticsearch, Logstash, Kibana
  • Grafana Loki: Log aggregation for Grafana
  • CloudWatch Logs: AWS native solution
  • DataDog: Commercial APM solution

Example Logstash configuration:

input {
  file {
    path => "/var/log/conversation-assistant/*.log"
    codec => json
  }
}

filter {
  if [level] == "error" {
    mutate {
      add_tag => [ "error" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "conversation-assistant-%{+YYYY.MM.dd}"
  }
}

Querying Logs

NestJS (Development Console)

# Follow logs in development
npm run start:dev | grep "DevicesService"

# Filter by level
npm run start:dev | grep "ERROR"

ML Service (Development)

# Text format (development)
ML_SERVICE_LOG_FORMAT=text python -m src.main

# JSON format (production)
ML_SERVICE_LOG_FORMAT=json python -m src.main | jq .

Production (JSON)

# Filter by request ID
cat app.log | jq 'select(.request_id == "abc123")'

# Filter by error level
cat app.log | jq 'select(.level == "error")'

# Extract specific fields
cat app.log | jq '{timestamp, message, error}'

# Group by context
cat app.log | jq 'select(.context == "ResponsesService")'

Performance Considerations

  • Development: Colored console output has minimal overhead
  • Production: JSON serialization adds ~1-2ms per log entry
  • Request context: AsyncLocalStorage/contextvars have negligible overhead
  • Log volume: Adjust log levels in production (INFO/WARNING instead of DEBUG)

Best Practices

  1. Use appropriate log levels:

    • DEBUG: Detailed diagnostic information
    • INFO: General informational messages
    • WARNING: Warning messages (degraded but working)
    • ERROR: Error messages (operation failed)
  2. Include context:

    logger.logWithData('info', 'Operation completed', {
      userId: user.id,
      operation: 'create',
      duration: elapsedMs,
    });
    
  3. Log timing for operations:

    const start = Date.now();
    // ... operation ...
    const duration = Date.now() - start;
    logger.logWithData('info', 'Operation timing', { duration });
    
  4. Don't log sensitive data:

    • Avoid logging passwords, tokens, personal data
    • Hash or redact sensitive fields
  5. Use structured data over string interpolation:

    // Good
    logger.logWithData('info', 'User logged in', { userId });
    
    // Avoid
    logger.log(`User ${userId} logged in`);
    

Monitoring & Alerts

Set up alerts based on log patterns:

  • Error rate spike: More than 10 errors/minute
  • ML service latency: Generation > 5 seconds
  • Cache miss rate: < 50% cache hit rate
  • Training failures: Any training job failure

Example alert query (for log aggregation system):

level:error AND context:ResponsesService
| count by 1m
| alert if count > 10

Future Enhancements

  • Add distributed tracing (OpenTelemetry)
  • Implement log sampling for high-volume endpoints
  • Add performance metrics (Prometheus)
  • Integrate with error tracking service (Sentry)
  • Add log rotation for local development
  • Implement audit logging for sensitive operations