nvidia-oc/TODOs.md
2026-01-14 12:30:45 -08:00

NVIDIA OC Development TODOs

Project: @infrastructure/nvidia-oc - NVIDIA GPU Overclocking Control Panel
Status: In Development (Phase 2)
Start Date: 2026-01-14
Target Release: v0.1.0 (2-3 weeks)


Phase 1: Package Scaffolding COMPLETED

1.1 Directory Structure

  • Create package directory: @infrastructure/nvidia-oc/
  • Create backend Python structure: backend/nvidia_oc/{core,api,cli,config,utils}/
  • Create frontend React structure: frontend/src/{components,hooks,api,types}/
  • Create configs directory: configs/
  • Create systemd directory: systemd/
  • Create tests directory: backend/tests/

1.2 Package Metadata

  • Generate pyproject.toml with all Python dependencies
  • Generate package.json with @ui workspace dependencies
  • Create .gitignore for Python + Node
  • Create README.md with user documentation
  • Create ARCHITECTURE.md with technical design
  • Create docs/PROJECT_OVERVIEW.md
  • Create docs/IMPLEMENTATION_GUIDE.md

1.3 Python Module Init Files

  • backend/nvidia_oc/__init__.py
  • backend/nvidia_oc/core/__init__.py
  • backend/nvidia_oc/api/__init__.py
  • backend/nvidia_oc/api/routes/__init__.py
  • backend/nvidia_oc/cli/__init__.py
  • backend/nvidia_oc/cli/commands/__init__.py
  • backend/nvidia_oc/config/__init__.py
  • backend/nvidia_oc/utils/__init__.py

1.4 Default Profile Configs

  • configs/default.yaml - Factory reset profile
  • configs/quiet.yaml - Stock clocks, low fan (30-80%)
  • configs/balanced.yaml - +100 MHz core / +500 MHz memory, moderate fan (50-100%)
  • configs/performance.yaml - +150 MHz core / +700 MHz memory, aggressive fan (70-100%)

1.5 Git Repository

  • Initialize git repo
  • Add .gitignore
  • Let auto-commit service handle first commit

Phase 2: Python Core Library 🔄 IN PROGRESS

2.1 GPU Management (core/gpu.py)

  • Implement GPUDevice dataclass
    • Fields: index, name, uuid, handle
    • __repr__ method for debugging
  • Implement GPUManager class
    • __init__() - Initialize NVML
    • list_devices() - Enumerate all GPUs
    • get_device(index) - Get specific GPU by index
    • _refresh_devices() - Query NVML for devices
    • __del__() - Shutdown NVML
    • Error handling for missing NVIDIA driver
    • Error handling for NVML initialization failures
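
A minimal sketch of the 2.1 items, assuming the pynvml bindings. GPUError, the injectable nvml argument, and close() are illustrative names that do not come from this plan; close() stands in for the planned __del__() so that shutdown is explicit and testable.

```python
from dataclasses import dataclass, field
from typing import Any, List


class GPUError(RuntimeError):
    """Hypothetical error type for missing driver / bad device index."""


@dataclass
class GPUDevice:
    index: int
    name: str
    uuid: str
    handle: Any = field(repr=False)  # opaque NVML handle, hidden from repr


def _text(value: Any) -> str:
    # Older pynvml releases return bytes, newer ones return str.
    return value.decode() if isinstance(value, bytes) else str(value)


class GPUManager:
    def __init__(self, nvml: Any = None) -> None:
        if nvml is None:
            try:
                import pynvml as nvml  # lazy import so a fake can be injected
            except ImportError as exc:
                raise GPUError("pynvml is not installed") from exc
        self._nvml = nvml
        self._devices: List[GPUDevice] = []
        try:
            self._nvml.nvmlInit()
        except Exception as exc:  # e.g. NVMLError when the driver is missing
            raise GPUError(f"NVML initialization failed: {exc}") from exc
        self._refresh_devices()

    def _refresh_devices(self) -> None:
        n = self._nvml
        self._devices = []
        for i in range(n.nvmlDeviceGetCount()):
            h = n.nvmlDeviceGetHandleByIndex(i)
            self._devices.append(
                GPUDevice(i, _text(n.nvmlDeviceGetName(h)), _text(n.nvmlDeviceGetUUID(h)), h)
            )

    def list_devices(self) -> List[GPUDevice]:
        return list(self._devices)

    def get_device(self, index: int) -> GPUDevice:
        if not 0 <= index < len(self._devices):
            raise GPUError(f"no GPU with index {index}")
        return self._devices[index]

    def close(self) -> None:
        self._nvml.nvmlShutdown()
```

Passing a fake object with the same six nvml* methods is one way to cover the "Mock NVML calls" item in 2.7 without real hardware.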

2.2 Clock Control (core/clock.py)

  • Implement ClockInfo dataclass
    • Fields: core, memory, shader
  • Implement ClockController class
    • get_clocks(device) - Read current clock speeds
    • set_clock_offset(device, core, memory) - Apply offsets via nvidia-settings
    • reset_clocks(device) - Reset to defaults
    • check_coolbits() - Verify Coolbits enabled
    • Error handling for nvidia-settings subprocess
    • Validation before applying offsets
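
The nvidia-settings call behind set_clock_offset could look roughly like this. The two attribute names are an assumption about recent drivers (older drivers expose per-level forms such as GPUGraphicsClockOffset[3]), and build_offset_command / apply_offsets are hypothetical helper names. Range validation is assumed to happen before this point.

```python
import subprocess
from typing import List

# Assumed attribute names; verify against `nvidia-settings -q all` on the target driver.
CORE_ATTR = "GPUGraphicsClockOffsetAllPerformanceLevels"
MEM_ATTR = "GPUMemoryTransferRateOffsetAllPerformanceLevels"


def build_offset_command(gpu_index: int, core: int, memory: int) -> List[str]:
    """Return the argv that assigns both clock offsets (MHz) on one GPU."""
    target = f"[gpu:{gpu_index}]"
    return [
        "nvidia-settings",
        "-a", f"{target}/{CORE_ATTR}={core}",
        "-a", f"{target}/{MEM_ATTR}={memory}",
    ]


def apply_offsets(gpu_index: int, core: int, memory: int) -> None:
    """Run nvidia-settings and turn subprocess failures into RuntimeError."""
    cmd = build_offset_command(gpu_index, core, memory)
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError as exc:
        raise RuntimeError("nvidia-settings not found on PATH") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"nvidia-settings failed: {exc.stderr.strip()}") from exc
```

Keeping the argv builder separate from the subprocess call lets tests/test_clock.py assert on the command without mocking a process.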

2.3 Fan Control (core/fan.py)

  • Define FanCurve type alias: List[Tuple[int, int]]
  • Implement FanController class
    • get_fan_speed(device) - Read current fan %
    • set_fan_speed(device, speed) - Manual fan control
    • apply_curve(device, curve) - Start curve monitoring task
    • _curve_monitor(device, curve) - Background task (async)
    • _interpolate_curve(temp, curve) - Linear interpolation
    • enable_auto(device) - Re-enable automatic fan
    • Error handling for NVML fan API calls
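
One possible shape for _interpolate_curve, shown as a free function. It assumes curve points are (temperature °C, fan %) pairs with strictly increasing temperatures, clamps outside the curve, and rounds to a whole percent; the sample curve is made up for illustration.

```python
from typing import List, Tuple

FanCurve = List[Tuple[int, int]]  # (temperature °C, fan speed %)


def interpolate_curve(temp: float, curve: FanCurve) -> int:
    """Linearly interpolate the fan speed for `temp`, clamping at both ends."""
    if not curve:
        raise ValueError("fan curve must contain at least one point")
    if any(a[0] >= b[0] for a, b in zip(curve, curve[1:])):
        raise ValueError("curve temperatures must be strictly increasing")
    if temp <= curve[0][0]:
        return curve[0][1]   # below the first point: hold the lowest speed
    if temp >= curve[-1][0]:
        return curve[-1][1]  # above the last point: hold the highest speed
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= temp <= t1:
            frac = (temp - t0) / (t1 - t0)
            return round(s0 + frac * (s1 - s0))
    raise AssertionError("unreachable for a validated curve")


curve = [(40, 30), (60, 50), (80, 100)]  # hypothetical example points
print(interpolate_curve(70, curve))  # 75, halfway between 50% and 100%
```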

2.4 Telemetry Collection (core/telemetry.py)

  • Implement GPUMetrics dataclass
    • Fields: timestamp, temperature, fan_speed, power_draw, core_clock, memory_clock, utilization, memory_used, memory_total
  • Implement TelemetryCollector class
    • collect(device) - One-time metrics snapshot
    • stream(device, interval) - Async generator for single GPU
    • stream_all(devices, interval) - Async generator for all GPUs
    • Error handling for NVML metric queries
    • Unit conversion (mW → W, bytes → MB)
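
A sketch of the metrics record, the two unit conversions, and a single-GPU stream. The collect callable is a stand-in for the planned collect(device), and "MB" is assumed to mean 1024 * 1024 bytes here.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable


@dataclass
class GPUMetrics:
    timestamp: float
    temperature: int     # °C
    fan_speed: int       # %
    power_draw: float    # W
    core_clock: int      # MHz
    memory_clock: int    # MHz
    utilization: int     # %
    memory_used: float   # MB
    memory_total: float  # MB


def mw_to_w(milliwatts: int) -> float:
    """NVML reports power in milliwatts."""
    return milliwatts / 1000.0


def bytes_to_mb(n_bytes: int) -> float:
    """NVML reports memory in bytes; 1 MB is taken as 1024 * 1024 bytes."""
    return n_bytes / (1024 * 1024)


async def stream(
    collect: Callable[[], GPUMetrics], interval: float = 1.0
) -> AsyncIterator[GPUMetrics]:
    """Yield one snapshot per interval until the consumer stops iterating."""
    while True:
        yield collect()
        await asyncio.sleep(interval)
```

A consumer would iterate with `async for metrics in stream(...)` and break (or be cancelled) to stop.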

2.5 Profile Management (core/profile.py)

  • Implement ProfileConfig Pydantic model
    • Fields: name, description, core_offset, memory_offset, power_limit, fan_curve
    • Validators for offset ranges
    • Validators for fan curve format
  • Implement ProfileManager class
    • load(path) - Load profile from YAML
    • save(profile, path) - Save profile to YAML
    • apply(device, profile) - Apply profile to GPU
    • capture(device) - Capture current settings
    • list_profiles(directory) - List available profiles
    • Error handling for YAML parsing
    • Profile validation with Pydantic
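
The validation rules can be prototyped without dependencies. The sketch below uses a standard-library dataclass as a stand-in for the planned Pydantic model, so it is not the Pydantic implementation itself. The limits follow the ±200 / ±1000 MHz ranges used elsewhere in this plan; treating power_limit as optional and the sample curve values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ProfileConfig:
    name: str
    description: str = ""
    core_offset: int = 0               # MHz
    memory_offset: int = 0             # MHz
    power_limit: Optional[int] = None  # W; None = leave unchanged (assumption)
    fan_curve: List[Tuple[int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not -200 <= self.core_offset <= 200:
            raise ValueError("core_offset must be within ±200 MHz")
        if not -1000 <= self.memory_offset <= 1000:
            raise ValueError("memory_offset must be within ±1000 MHz")
        # YAML loaders return lists; normalise each point to an (int, int) tuple.
        self.fan_curve = [(int(t), int(s)) for t, s in self.fan_curve]
        temps = [t for t, _ in self.fan_curve]
        if temps != sorted(set(temps)):
            raise ValueError("fan_curve temperatures must be strictly increasing")
        if any(not 0 <= s <= 100 for _, s in self.fan_curve):
            raise ValueError("fan_curve speeds must be within 0-100%")


def profile_from_mapping(data: Dict[str, Any]) -> ProfileConfig:
    """Build a profile from the mapping a YAML loader would return."""
    return ProfileConfig(**data)


balanced = profile_from_mapping({
    "name": "balanced",
    "core_offset": 100,
    "memory_offset": 500,
    "fan_curve": [[40, 50], [80, 100]],  # hypothetical curve points
})
```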

2.6 Configuration & Utilities

  • Implement config/schema.py
    • OCConfig Pydantic model for app settings
  • Implement config/defaults.py
    • DEFAULT_FAN_CURVE constant
    • DEFAULT_PROFILES dict
  • Implement utils/output.py
    • console - Rich Console instance
    • print_table() - Format GPU status as table
    • print_gpu_status() - Single GPU status formatter
    • Color-coded temperature output (green/yellow/red)
  • Implement utils/validation.py
    • validate_clock_offset(offset, domain) - Check limits (±200 MHz core, ±1000 MHz memory)
    • validate_fan_speed(speed) - Check 0-100 range
    • validate_temperature_threshold(temp) - Check reasonable range

2.7 Unit Tests

  • tests/test_gpu.py
    • Mock NVML calls
    • Test device enumeration
    • Test error handling
  • tests/test_clock.py
    • Mock nvidia-settings subprocess
    • Test validation logic
    • Test Coolbits check
  • tests/test_fan.py
    • Mock NVML fan API
    • Test curve interpolation algorithm
    • Test background task lifecycle
  • tests/test_telemetry.py
    • Mock NVML metric queries
    • Test async streaming
    • Test unit conversions
  • tests/test_profile.py
    • Test YAML loading/saving
    • Test Pydantic validation
    • Test profile application

Phase 3: CLI Tool 📋 PENDING

3.1 CLI Entry Point (cli/main.py)

  • Create Click group cli()
  • Add version option
  • Add global options (--verbose, --gpu-id)
  • Register all commands
  • Error handling with Rich formatting

3.2 Status Command (cli/commands/status.py)

  • Implement status command
    • --watch flag for live monitoring
    • Display GPU info, temp, fan, clocks, utilization
    • Use Rich Table for formatting
    • Color-coded temperatures
    • Handle multiple GPUs side-by-side
    • Update every 1 second in watch mode

3.3 Clock Commands (cli/commands/set_clock.py)

  • Implement set-clock command
    • --gpu <id> option (required)
    • --core <offset> option (required unless --reset is given)
    • --memory <offset> option (required unless --reset is given)
    • --reset flag to reset to defaults
    • Confirmation prompt for large offsets
    • Progress spinner while applying
    • Success/error Rich output

3.4 Fan Commands (cli/commands/set_fan.py)

  • Implement set-fan command
    • --gpu <id> option (required)
    • --speed <percent> option for manual control
    • --auto flag to re-enable automatic
    • --curve <file> option to apply custom curve
    • Validation for speed range
    • Progress spinner while applying
    • Success/error Rich output

3.5 Profile Commands (cli/commands/profile.py)

  • Implement profile list subcommand
    • Show system profiles (configs/)
    • Show user profiles (~/.config/nvidia-oc/profiles/)
    • Display as Rich table with descriptions
  • Implement profile apply <name> subcommand
    • Load profile from YAML
    • Apply to specified GPU (or all)
    • Show progress for each step
    • Confirm successful application
  • Implement profile save <name> subcommand
    • Capture current GPU settings
    • Save to user profile directory
    • Confirm file written
  • Implement profile delete <name> subcommand
    • Confirmation prompt
    • Delete user profile file
    • Prevent deletion of system profiles
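
The system-profile guard for profile delete might be sketched as follows. It assumes one <name>.yaml file per profile in each directory; delete_profile and its parameters are hypothetical names.

```python
from pathlib import Path


def delete_profile(name: str, user_dir: Path, system_dir: Path) -> Path:
    """Delete a user profile by name and return the removed path.

    Raises ValueError for names containing path components,
    PermissionError for system profiles, FileNotFoundError otherwise.
    """
    if not name or Path(name).name != name:
        raise ValueError("profile name must be a plain file stem")
    user_path = user_dir / f"{name}.yaml"
    if user_path.is_file():
        user_path.unlink()
        return user_path
    if (system_dir / f"{name}.yaml").is_file():
        raise PermissionError(f"'{name}' is a system profile and cannot be deleted")
    raise FileNotFoundError(f"no profile named '{name}'")
```

Rejecting names with path separators up front keeps an input like ../configs/quiet from reaching the system directory.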

3.6 Daemon Command (cli/commands/daemon.py) [Optional]

  • Implement daemon command
    • Start background monitoring service
    • Apply fan curves continuously
    • Thermal protection monitoring
    • Write PID file
    • Graceful shutdown on SIGTERM
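
A bare-bones daemon loop covering the PID file and SIGTERM items, using only the standard library. run_daemon and tick are hypothetical names; tick stands for whatever per-interval work (fan curve update, thermal check) the daemon performs. It must be called from the main thread, since that is where Python installs signal handlers.

```python
import os
import signal
import threading
from pathlib import Path
from typing import Callable


def run_daemon(pid_file: Path, tick: Callable[[], None], interval: float = 1.0) -> None:
    """Run tick() every `interval` seconds until SIGTERM arrives."""
    stop = threading.Event()
    # The handler only sets a flag; the loop exits at its next check.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    pid_file.write_text(f"{os.getpid()}\n")
    try:
        while not stop.is_set():
            tick()
            stop.wait(interval)  # sleeps, but wakes early once stop is set
    finally:
        pid_file.unlink(missing_ok=True)  # always remove the PID file on exit
```

Using Event.wait instead of time.sleep lets the loop end promptly after the signal rather than finishing a full interval.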

3.7 CLI Testing

  • Test all commands with mock data
  • Test error handling (missing GPU, permission denied)
  • Test Rich output rendering
  • Test watch mode interrupt (Ctrl+C)

Phase 4: FastAPI Backend 🌐 PENDING

4.1 Main App (api/main.py)

  • Create FastAPI app instance
  • Add CORS middleware
  • Add startup event handler
    • Initialize GPUManager
    • Initialize TelemetryCollector
  • Add shutdown event handler
    • Cleanup NVML
  • Serve static frontend files at /
  • Add exception handlers

4.2 GPU Routes (api/routes/gpu.py)

  • GET /api/gpus - List all GPUs
    • Return list of GPU metadata
  • GET /api/gpus/{gpu_id} - Get single GPU info
    • Return detailed GPU info
  • GET /api/gpus/{gpu_id}/status - Get current metrics
    • Return GPUMetrics
  • POST /api/gpus/{gpu_id}/clock - Set clock offsets
    • Request body: {core: int, memory: int}
    • Validate offsets
    • Apply via ClockController
    • Return success/error
  • POST /api/gpus/{gpu_id}/fan - Set fan speed
    • Request body: {speed: int} or {auto: true}
    • Apply via FanController
    • Return success/error
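
The fan endpoint accepts either {speed: int} or {auto: true}. One way to enforce "exactly one of the two" is shown below as a plain function, independent of FastAPI; parse_fan_request is a hypothetical name, and returning None for automatic control is an illustrative convention.

```python
from typing import Any, Dict, Optional


def parse_fan_request(body: Dict[str, Any]) -> Optional[int]:
    """Return the target fan speed in percent, or None for automatic control."""
    has_speed = "speed" in body
    wants_auto = body.get("auto") is True
    if has_speed == wants_auto:
        # Either both were sent or neither was.
        raise ValueError("send exactly one of {speed: int} or {auto: true}")
    if wants_auto:
        return None
    speed = body["speed"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(speed, bool) or not isinstance(speed, int):
        raise ValueError("speed must be an integer")
    if not 0 <= speed <= 100:
        raise ValueError("speed must be within 0-100")
    return speed
```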

4.3 Profile Routes (api/routes/profile.py)

  • GET /api/profiles - List profiles
    • Return system + user profiles
  • GET /api/profiles/{name} - Get profile details
    • Return ProfileConfig
  • POST /api/profiles/{name}/apply - Apply profile
    • Request body: {gpu_id?: int} (optional, default all)
    • Apply via ProfileManager
    • Return success/error
  • POST /api/profiles - Create new profile
    • Request body: ProfileConfig
    • Save to user directory
    • Return success/error
  • DELETE /api/profiles/{name} - Delete profile
    • Only allow user profiles
    • Return success/error

4.4 WebSocket Telemetry (api/routes/telemetry.py)

  • WS /ws/telemetry - Stream live telemetry
    • Accept WebSocket connection
    • Stream GPUMetrics at 1Hz (configurable)
    • Send JSON: {timestamp, gpus: [...]}
    • Handle client disconnection gracefully
    • Error handling for NVML failures
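
The {timestamp, gpus: [...]} frame can be built separately from the socket handling. In this sketch the per-GPU keys are placeholders, and telemetry_message is a hypothetical helper name.

```python
import json
import time
from typing import Any, Dict, List, Optional


def telemetry_message(gpus: List[Dict[str, Any]], timestamp: Optional[float] = None) -> str:
    """Serialise one telemetry frame as compact JSON."""
    payload = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "gpus": gpus,
    }
    return json.dumps(payload, separators=(",", ":"))


frame = telemetry_message([{"index": 0, "temperature": 65, "fan_speed": 50}], timestamp=1.5)
```

Inside the WebSocket route the string would then be sent once per interval, for example with Starlette's `await websocket.send_text(frame)`.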

4.5 API Models (api/models.py)

  • ClockRequest - Request body for clock updates
  • FanRequest - Request body for fan updates
  • ProfileApplyRequest - Request body for profile application
  • GPUResponse - Response schema for GPU info
  • MetricsResponse - Response schema for telemetry
  • ErrorResponse - Standard error response

4.6 Integration Tests

  • Test all REST endpoints with TestClient
  • Test WebSocket connection and streaming
  • Test error responses (404, 400, 500)
  • Test CORS headers
  • Test concurrent requests

Phase 5: React Frontend 🎨 PENDING

5.1 Project Setup

  • Create frontend/vite.config.ts
    • Configure React plugin
    • Configure proxy to backend (port 8000)
    • Configure build output directory
  • Create frontend/tsconfig.json
    • Extend @lilith/typescript-config-react
    • Configure path aliases
  • Create frontend/index.html
    • Basic HTML shell
    • Import main.tsx

5.2 Main App (frontend/src/App.tsx)

  • Setup ThemeProvider with cyberpunk adapter
  • Setup ToastProvider for notifications
  • Setup Navigation component
  • Create main layout with Container + Grid
  • Implement error boundary
  • Implement loading states

5.3 Custom Hooks

  • hooks/useWebSocket.ts
    • Establish WebSocket connection
    • Parse incoming telemetry messages
    • Handle connection states (connecting, connected, disconnected, error)
    • Automatic reconnection on disconnect
    • Return: {metrics, connectionState, error}
  • hooks/useGPUData.ts
    • Fetch GPU list from /api/gpus
    • Provide mutation functions (updateClock, updateFan)
    • Handle loading/error states
    • Return: {gpus, loading, error, updateClock, updateFan}
  • hooks/useProfiles.ts
    • Fetch profile list from /api/profiles
    • Provide apply/save/delete functions
    • Handle loading/error states
    • Return: {profiles, loading, applyProfile, saveProfile, deleteProfile}

5.4 API Client (frontend/src/api/client.ts)

  • Create axios instance with base URL
  • fetchGPUs() - GET /api/gpus
  • fetchGPUStatus(id) - GET /api/gpus/{id}/status
  • updateClock(id, core, memory) - POST /api/gpus/{id}/clock
  • updateFan(id, speed) - POST /api/gpus/{id}/fan
  • fetchProfiles() - GET /api/profiles
  • applyProfile(name, gpuId?) - POST /api/profiles/{name}/apply
  • Error handling and retries

5.5 Components

  • components/GPUCard.tsx
    • Display GPU name, index
    • StatCard for temp, fan, power, clocks
    • Color-coded temperature indicator
    • Utilization bar
    • Memory usage bar
  • components/ClockControl.tsx
    • LabeledSlider for core offset (-200 to +200)
    • LabeledSlider for memory offset (-1000 to +1000)
    • Apply button
    • Reset button
    • Current values display
  • components/FanControl.tsx
    • LabeledSlider for manual fan speed (0-100)
    • Auto button to re-enable automatic
    • Current fan speed display
    • Fan curve editor (future)
  • components/TelemetryChart.tsx
    • LineChart or AreaChart from @ui/ui-charts
    • Configurable metric (temp, power, clock, etc.)
    • Rolling window (last 60 data points)
    • X-axis: time, Y-axis: metric value
  • components/ProfileManager.tsx
    • DataTable listing all profiles
    • Apply button per profile
    • Delete button for user profiles
    • Create new profile button
  • components/StatusIndicator.tsx
    • SystemHealthIndicator from @ui/ui-admin
    • Show GPU health status
    • Show connection status
    • Show safety threshold warnings

5.6 Types (frontend/src/types/index.ts)

  • GPU interface
  • GPUMetrics interface
  • Profile interface
  • ConnectionState enum
  • ClockUpdate interface
  • FanUpdate interface

5.7 Entry Point (frontend/src/main.tsx)

  • Create root with React 19 createRoot
  • Wrap App with StrictMode
  • Mount to #root element

5.8 Frontend Testing

  • Test components with React Testing Library
  • Test hooks with renderHook
  • Test WebSocket connection mocking
  • Test API client error handling

Phase 6: Integration & Testing 🧪 PENDING

6.1 End-to-End Integration

  • Install Python package in editable mode
  • Install frontend dependencies with pnpm
  • Start backend dev server (uvicorn --reload)
  • Start frontend dev server (vite)
  • Test full flow: CLI → API → Web UI
  • Verify telemetry streaming works
  • Verify profile switching works
  • Test on actual NVIDIA GPU hardware

6.2 Systemd Services

  • Create systemd/nvidia-oc.service
    • ExecStart: uvicorn serving on 0.0.0.0:8000
    • User: root (required for GPU control)
    • Restart: always
    • After: network.target, nvidia-persistenced.service
  • Create systemd/nvidia-oc-daemon.service (optional)
    • Background fan curve monitoring
    • Thermal protection
  • Test service installation
    • systemctl enable nvidia-oc
    • systemctl start nvidia-oc
    • Verify web UI accessible
    • Verify service survives reboot

6.3 Stability Testing

  • Conservative OC Test (4 hours)
    • Apply balanced profile (+100/+500)
    • Run ML training workload
    • Monitor for crashes, CUDA errors
    • Verify temps stay below 80°C
  • Aggressive OC Test (8 hours)
    • Apply performance profile (+150/+700)
    • Run stress test (FurMark or similar)
    • Monitor for instability
    • Verify temps stay below 75°C
  • 24-Hour Burn-In
    • Apply final stable profile
    • Run continuous workload
    • Monitor metrics every hour
    • Verify 0 crashes, 0 errors
    • Document final stable settings

6.4 Multi-GPU Testing

  • Test with 2x RTX 3090 setup
  • Verify independent control of each GPU
  • Test applying different profiles to different GPUs
  • Test that the WebSocket streams both GPUs correctly
  • Test that CLI status displays both GPUs

6.5 Documentation Finalization

  • Update README with installation instructions
  • Add API documentation (OpenAPI/Swagger)
  • Add troubleshooting section
  • Add performance tuning guide
  • Add safety warnings
  • Record demo video/GIF

Phase 7: Production Deployment 🚀 PENDING

7.1 Bluefin LTS Deployment (Primary Workstation)

  • Enable Coolbits: sudo nvidia-xconfig -a --cool-bits=28
  • Reboot or restart display-manager
  • Install package: pip install -e . or pip install lilith-nvidia-oc
  • Copy systemd service to /etc/systemd/system/
  • Enable and start service
  • Test web UI access from browser
  • Test CLI commands
  • Apply balanced profile
  • Monitor for 24 hours

7.2 Ubuntu Headless Server Deployment

  • Install NVIDIA drivers
  • Enable Coolbits (may need virtual X)
  • Install package via pip
  • Configure systemd service
  • Test remote access from workstation
  • Test WebSocket streaming
  • Apply performance profile
  • Monitor for 48 hours

7.3 Performance Validation

  • Measure baseline performance (stock clocks, auto fan)
    • Record GPU temps under load
    • Record training iterations/sec
    • Record fan noise level
  • Measure with balanced profile
    • Record GPU temps (target: 7 to 10°C below baseline)
    • Record training iterations/sec (target: 5-7% above baseline)
    • Record fan noise level
  • Measure with performance profile
    • Record GPU temps (target: 10 to 15°C below baseline)
    • Record training iterations/sec (target: 8-12% above baseline)
    • Record fan noise level
  • Document all results in README
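
Comparing a profile against the baseline comes down to two differences. The helper names and the sample numbers below are illustrative only, not measured results.

```python
def percent_change(baseline: float, measured: float) -> float:
    """Relative change versus baseline, in percent (positive = faster)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


def temperature_delta(baseline_c: float, measured_c: float) -> float:
    """Temperature change versus baseline, in °C (negative = cooler)."""
    return measured_c - baseline_c


# Hypothetical readings: 8.0 -> 8.5 iterations/sec, 80°C -> 72°C under load.
print(percent_change(8.0, 8.5))   # 6.25
print(temperature_delta(80, 72))  # -8
```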

7.4 Publishing to Forgejo

  • Build Python package: python -m build
  • Publish to Forgejo PyPI: twine upload --repository-url ...
  • Build frontend: cd frontend && pnpm build
  • Tag release: git tag v0.1.0
  • Push to Forgejo: git push origin v0.1.0
  • Create release notes on Forgejo

7.5 Monitoring & Maintenance

  • Set up log monitoring (journalctl -u nvidia-oc -f)
  • Monitor for errors or warnings
  • Track GPU temps over 7 days
  • Verify no performance degradation
  • Collect user feedback
  • Plan v0.2.0 features

Future Enhancements (Post-MVP)

v0.2.0

  • Historical metrics database (SQLite)
  • Profile scheduler (auto-switch by time/load)
  • Voltage control (advanced Coolbits)
  • Email/webhook alerts on thermal events
  • Docker containerization
  • Prometheus metrics exporter

v0.3.0

  • Multi-node cluster support
  • Authentication (JWT + HTTPS)
  • Power limit curves (dynamic)
  • Mobile-responsive UI improvements
  • Profile sharing (import/export)
  • Fan curve visual editor in web UI

v1.0.0

  • Stable API contract
  • Packaging for Fedora/Ubuntu/Arch
  • Security audit
  • Production systemd hardening
  • Comprehensive test suite (>90% coverage)
  • Localization (i18n)

Blockers & Risks

Active Blockers

  • None currently

Known Risks

  1. Coolbits dependency - Users may not enable Coolbits
    • Mitigation: Clear documentation, graceful fallback (read-only mode)
  2. nvidia-settings X11 requirement - Clock writes need X server
    • Mitigation: Virtual X on headless servers, document workaround
  3. Driver compatibility - NVML API may vary between driver versions
    • Mitigation: Test on multiple driver versions (535.x, 545.x, 550.x)
  4. Hardware variance - Different GPU models may behave differently
    • Mitigation: Test on multiple GPUs (3090, 4090, etc.)

Notes

  • Development Environment: Bluefin LTS (Fedora-based), Wayland, 2x RTX 3090
  • Target Users: ML engineers, Linux gamers, GPU server admins
  • Primary Goal: Provide MSI Afterburner equivalent for Linux
  • Success Metric: 24h stable operation at +100/+500 overclock

Last Updated: 2026-01-14
Current Phase: Phase 2 (Core Library Implementation)
Estimated Completion: 2-3 weeks from start