nvidia-oc/docs/PROJECT_OVERVIEW.md
2026-01-14 12:30:45 -08:00

15 KiB

NVIDIA GPU Overclocking Control Panel - Project Overview

Mission Statement

Build a production-grade, network-accessible GPU overclocking control panel that provides:

  • Safety-first overclocking with validation and thermal protection
  • Multi-interface access via CLI, REST API, and web dashboard
  • Real-time monitoring with 1Hz telemetry streaming
  • Profile management for quick switching between performance modes
  • Cross-platform support for Wayland, X11, and headless environments

Problem Statement

Current Pain Points

  1. MSI Afterburner doesn't work on Linux

    • Windows-only GUI tool
    • Wine compatibility is unreliable
  2. nvidia-settings requires X11

    • Breaks on Wayland
    • Unusable on headless servers
    • No remote access capabilities
  3. Manual nvidia-smi is tedious

    • Can't control fans
    • No profile system
    • No real-time monitoring dashboard
  4. Existing Linux tools are fragmented

    • CLI-only with poor UX
    • No web interface
    • Limited to single-GPU setups
    • No safety mechanisms

Our Solution

A unified, full-stack solution that:

  • Uses NVML Python bindings (works on Wayland + headless)
  • Provides three interfaces: CLI, REST API, Web UI
  • Real-time WebSocket telemetry dashboard
  • Profile-based configuration with YAML
  • Multi-GPU support with independent control
  • Built-in safety thresholds and validation

Target Users

Primary Users

  1. ML Engineers - Running training workloads on local GPUs
  2. Gamers on Linux - Want MSI Afterburner equivalent
  3. Cryptocurrency Miners - Need 24/7 stable overclocking
  4. System Administrators - Managing GPU servers remotely

Use Cases

Use Case 1: Daily ML Training

  • User: ML engineer with 2x RTX 3090
  • Need: Better thermals during 8-hour training runs
  • Solution: Apply "Balanced" profile via CLI before training
  • Benefit: 10°C cooler GPUs, 5% faster training

Use Case 2: Remote Server Management

  • User: DevOps managing 10 GPU servers
  • Need: Monitor and adjust GPU settings remotely
  • Solution: Web dashboard accessible from laptop
  • Benefit: No SSH required, live telemetry, quick profile switching

Use Case 3: Gaming Performance

  • User: Linux gamer wanting max FPS
  • Need: MSI Afterburner alternative
  • Solution: "Performance" profile + web UI for monitoring
  • Benefit: +10% FPS, real-time temp/clock monitoring

Use Case 4: Quiet Computing

  • User: Developer in shared office
  • Need: Reduce fan noise during compiling
  • Solution: "Quiet" profile with low fan curve
  • Benefit: Barely audible GPU fans, stock performance

Project Goals

MVP (v0.1.0) - 2-3 Weeks

  • Python core library (NVML wrappers)
  • CLI tool with all commands
  • FastAPI backend with WebSocket
  • React web UI with @ui components
  • 4 default profiles (quiet, balanced, performance, default)
  • Multi-GPU support
  • Safety validation

v0.2.0 - 1 Month After MVP

  • Historical metrics database (SQLite)
  • Profile scheduler (auto-switch by time/load)
  • Voltage control (advanced Coolbits)
  • Email/webhook alerts on thermal events
  • Docker containerization

v0.3.0 - 2 Months After MVP

  • Multi-node cluster support
  • Authentication (JWT + HTTPS)
  • Power limit curves (dynamic based on load)
  • Mobile-responsive UI
  • Prometheus metrics exporter

v1.0.0 - Stable Release

  • Stable API contract
  • Packaging for major distros (Fedora, Ubuntu, Arch AUR)
  • Comprehensive documentation
  • Production-ready systemd hardening
  • Security audit

Technical Architecture

Stack Overview

┌─────────────────────────────────────────────────────────────┐
│  User Interfaces                                             │
├───────────────┬─────────────────┬───────────────────────────┤
│  CLI          │  Web Dashboard  │  REST API                 │
│  (Click)      │  (React 19)     │  (FastAPI)                │
└───────────────┴─────────────────┴───────────────────────────┘
                       │
┌─────────────────────────────────────────────────────────────┐
│  Core Library (Python 3.11+)                                 │
├───────────────┬─────────────────┬───────────────────────────┤
│  GPUManager   │  ClockController│  FanController            │
│  Telemetry    │  ProfileManager │  Validation               │
└───────────────┴─────────────────┴───────────────────────────┘
                       │
┌─────────────────────────────────────────────────────────────┐
│  NVML (nvidia-ml-py)                                         │
│  - nvmlDeviceGetHandleByIndex()                              │
│  - nvmlDeviceSetFanSpeed_v2()                                │
│  - nvmlDeviceGetClockInfo()                                  │
└─────────────────────────────────────────────────────────────┘
                       │
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA GPU Hardware                                         │
│  (RTX 30-series, 40-series, etc.)                            │
└─────────────────────────────────────────────────────────────┘

Key Technologies

Backend:

  • Python 3.11+ (type hints, dataclasses)
  • nvidia-ml-py (NVML bindings)
  • FastAPI (REST + WebSocket)
  • Pydantic (validation)
  • Click (CLI framework)
  • Rich (terminal formatting)
  • PyYAML (config loading)

Frontend:

  • React 19 (UI framework)
  • TypeScript 5.7 (type safety)
  • Vite (dev server + build)
  • @ui/* packages (workspace components)
  • styled-components (CSS-in-JS)
  • WebSocket (real-time telemetry)

Infrastructure:

  • systemd (service management)
  • pnpm workspace (monorepo)
  • pytest (testing)
  • Forgejo Actions (CI/CD)

Core Features

1. GPU Control

Clock Offsets:

  • Core clock: ±200 MHz (validation enforced)
  • Memory clock: ±1000 MHz (validation enforced)
  • Reset to defaults with single command

Fan Control:

  • Manual speed: 0-100%
  • Temperature-based curves (linear interpolation)
  • Automatic mode (driver-managed)
  • Per-GPU independent control

Power Management:

  • Power limit adjustment (50-150% TDP)
  • Real-time power draw monitoring
  • Power efficiency metrics

2. Telemetry & Monitoring

Metrics Collected:

  • Temperature (°C)
  • Fan speed (%)
  • Power draw (W)
  • Core clock (MHz)
  • Memory clock (MHz)
  • GPU utilization (%)
  • Memory usage (MB / total MB)
  • Timestamp

Streaming:

  • WebSocket push at 1Hz (configurable)
  • JSON format for easy parsing
  • Batch updates for all GPUs in single message

3. Profile System

Profile Storage:

  • YAML format (human-readable)
  • Located in ~/.config/nvidia-oc/profiles/
  • System profiles in package configs/

Profile Contents:

  • Name and description
  • Clock offsets (core, memory)
  • Power limit
  • Fan curve (temp/speed pairs)
  • Safety thresholds
  • Metadata (recommended_for, warnings)

Profile Operations:

  • List available profiles
  • Apply profile to GPU(s)
  • Save current settings as profile
  • Delete custom profiles
  • Reset to default profile

4. Safety Mechanisms

Validation:

  • Clock offsets within safe ranges
  • Fan speeds 0-100%
  • Temperature thresholds respected
  • Coolbits enabled check

Thermal Protection:

  • Emergency fan to 100% at 85°C
  • Clock reset at 90°C (critical)
  • Configurable per-profile thresholds

Error Handling:

  • Graceful NVML errors
  • Automatic fallback to safe settings
  • User notifications (Toast/CLI output)

5. Multi-Interface Access

CLI (nvidia-oc):

nvidia-oc status --watch
nvidia-oc set-clock --gpu 0 --core +100 --memory +500
nvidia-oc set-fan --gpu 0 --speed 70
nvidia-oc profile apply balanced

REST API:

GET  /api/gpus                    # List GPUs
GET  /api/gpus/{id}/status        # Get metrics
POST /api/gpus/{id}/clock         # Set clocks
POST /api/gpus/{id}/fan           # Set fan
GET  /api/profiles                # List profiles
POST /api/profiles/{name}/apply   # Apply profile

Web Dashboard:

  • Live telemetry cards (StatCard)
  • Interactive sliders (ClockControl, FanControl)
  • Real-time charts (temperature, power)
  • Profile manager (DataTable)
  • Multi-GPU side-by-side view

Success Metrics

Technical Metrics

  • CLI functional with all commands
  • API response time < 100ms (excluding hardware)
  • WebSocket latency < 50ms
  • UI render time < 16ms (60 FPS)
  • Test coverage > 80%
  • Zero memory leaks during 24h run

Performance Metrics

  • 🎯 Temperature reduction: 7-10°C vs auto fan
  • 🎯 Performance improvement: 5-7% compute throughput
  • 🎯 Stability: 0 crashes in 24h stress test
  • 🎯 Multi-GPU latency: <100ms per GPU

User Experience Metrics

  • 🎯 CLI install to first use: < 2 minutes
  • 🎯 Web UI page load: < 2 seconds
  • 🎯 Profile switch time: < 1 second
  • 🎯 Documentation readability: Beginner-friendly

Non-Goals (Out of Scope for v0.1.0)

Voltage control - Requires advanced Coolbits, high risk Automatic overclocking - Too complex, instability risk Multi-node clustering - Later version Authentication - Trusted LAN assumption for MVP Historical metrics DB - Later version Mobile app - Web UI is mobile-responsive instead AMD GPU support - NVIDIA-only for MVP

Development Principles

Code Quality

  • SOLID principles - Single responsibility, dependency inversion
  • DRY - No code duplication
  • Type safety - Python type hints, TypeScript strict mode
  • Error handling - Validate inputs, handle edge cases
  • Testing - Unit tests for core logic, integration tests for API

User Experience

  • No surprises - Clear warnings before destructive actions
  • Instant feedback - Loading states, progress indicators
  • Graceful degradation - Work even if Coolbits not enabled (read-only)
  • Accessibility - Keyboard navigation, screen reader support

Performance

  • Async operations - Non-blocking GPU control
  • Efficient polling - 1Hz telemetry (not wasteful)
  • React optimization - Memoization, lazy loading
  • Resource limits - Cap memory usage, CPU usage

Security

  • Input validation - Pydantic models, TypeScript types
  • Privilege escalation - Document root requirement clearly
  • Network isolation - Optional localhost-only mode
  • No secrets in code - Config files for sensitive data

Deployment Strategy

Development

# Backend
pip install -e backend/
uvicorn nvidia_oc.api.main:app --reload --host 0.0.0.0

# Frontend
cd frontend && pnpm install && pnpm dev

Production

# Install
pip install lilith-nvidia-oc

# Enable Coolbits
sudo nvidia-xconfig -a --cool-bits=28
sudo systemctl restart display-manager

# Run as service
sudo systemctl enable nvidia-oc
sudo systemctl start nvidia-oc

# Access
http://localhost:8000  # or http://192.168.x.x:8000

Risk Assessment

High Risk

  • Hardware damage - Incorrect settings could overheat GPU
    • Mitigation: Hard-coded safety limits, thermal protection
  • System instability - Aggressive OC causes crashes
    • Mitigation: Conservative defaults, gradual testing

Medium Risk

  • Driver conflicts - NVML issues with certain driver versions
    • Mitigation: Test on multiple driver versions, document compatibility
  • Coolbits not enabled - Users skip setup step
    • Mitigation: Clear warnings, graceful read-only fallback

Low Risk

  • Network security - Unauthorized access to control panel
    • Mitigation: Document LAN-only use, add auth in v0.3.0
  • Browser compatibility - Web UI breaks on older browsers
    • Mitigation: Target modern browsers (2023+), progressive enhancement

Success Criteria for MVP

Must Have (Blocking Release)

  1. All core modules implemented (gpu, clock, fan, telemetry, profile)
  2. CLI tool with all commands working
  3. FastAPI backend serving REST + WebSocket
  4. React web UI with live telemetry
  5. 4 default profiles working
  6. Multi-GPU support verified (2+ GPUs)
  7. Safety mechanisms active (thermal protection, validation)
  8. 24-hour stability test passed (0 crashes)
  9. Documentation complete (README, ARCHITECTURE, API docs)
  10. Package published to Forgejo registry

Should Have (Desirable for Release)

  • Unit tests with >70% coverage
  • Integration tests for API endpoints
  • Systemd service unit files
  • Installation script
  • Video demo/tutorial

Nice to Have (Post-MVP)

  • Docker container
  • Historical metrics
  • Email alerts
  • Grafana dashboard

Timeline

Week 1: Core Library + CLI

  • Days 1-2: Implement core modules (gpu, clock, fan, telemetry, profile)
  • Day 3: Unit tests for core modules
  • Day 4: Implement CLI commands (status, set-clock, set-fan, profile)
  • Day 5: CLI testing and refinement

Week 2: API + Frontend

  • Days 6-7: FastAPI backend (REST endpoints + WebSocket)
  • Days 8-9: React webapp (components, hooks, layout)
  • Day 10: Frontend-backend integration

Week 3: Testing + Deployment

  • Days 11-12: Integration testing
  • Day 13: Systemd service setup
  • Day 14: Documentation finalization
  • Day 15: Stability testing (24h burn-in)

Total: ~15 days part-time or ~10 days full-time

References


Last Updated: 2026-01-14 Status: In Development Version: 0.1.0-alpha