lilith/nvidia-oc

Fork 0

Lilith f46c96544f chore: 🔧 Update files

2026-01-14 12:30:45 -08:00

15 KiB

Raw Permalink Blame History

NVIDIA GPU Overclocking Control Panel - Project Overview

Mission Statement

Build a production-grade, network-accessible GPU overclocking control panel that provides:

Safety-first overclocking with validation and thermal protection
Multi-interface access via CLI, REST API, and web dashboard
Real-time monitoring with 1Hz telemetry streaming
Profile management for quick switching between performance modes
Cross-platform support for Wayland, X11, and headless environments

Problem Statement

Current Pain Points

MSI Afterburner doesn't work on Linux
- Windows-only GUI tool
- Wine compatibility is unreliable
nvidia-settings requires X11
- Breaks on Wayland
- Unusable on headless servers
- No remote access capabilities
Manual nvidia-smi is tedious
- Can't control fans
- No profile system
- No real-time monitoring dashboard
Existing Linux tools are fragmented
- CLI-only with poor UX
- No web interface
- Limited to single-GPU setups
- No safety mechanisms

Our Solution

A unified, full-stack solution that:

Uses NVML Python bindings (works on Wayland + headless)
Provides three interfaces: CLI, REST API, Web UI
Real-time WebSocket telemetry dashboard
Profile-based configuration with YAML
Multi-GPU support with independent control
Built-in safety thresholds and validation

Target Users

Primary Users

ML Engineers - Running training workloads on local GPUs
Gamers on Linux - Want MSI Afterburner equivalent
Cryptocurrency Miners - Need 24/7 stable overclocking
System Administrators - Managing GPU servers remotely

Use Cases

Use Case 1: Daily ML Training

User: ML engineer with 2x RTX 3090
Need: Better thermals during 8-hour training runs
Solution: Apply "Balanced" profile via CLI before training
Benefit: 10°C cooler GPUs, 5% faster training

Use Case 2: Remote Server Management

User: DevOps managing 10 GPU servers
Need: Monitor and adjust GPU settings remotely
Solution: Web dashboard accessible from laptop
Benefit: No SSH required, live telemetry, quick profile switching

Use Case 3: Gaming Performance

User: Linux gamer wanting max FPS
Need: MSI Afterburner alternative
Solution: "Performance" profile + web UI for monitoring
Benefit: +10% FPS, real-time temp/clock monitoring

Use Case 4: Quiet Computing

User: Developer in shared office
Need: Reduce fan noise during compiling
Solution: "Quiet" profile with low fan curve
Benefit: Barely audible GPU fans, stock performance

Project Goals

MVP (v0.1.0) - 2-3 Weeks

✅ Python core library (NVML wrappers)
✅ CLI tool with all commands
✅ FastAPI backend with WebSocket
✅ React web UI with @ui components
✅ 4 default profiles (quiet, balanced, performance, default)
✅ Multi-GPU support
✅ Safety validation

v0.2.0 - 1 Month After MVP

Historical metrics database (SQLite)
Profile scheduler (auto-switch by time/load)
Voltage control (advanced Coolbits)
Email/webhook alerts on thermal events
Docker containerization

v0.3.0 - 2 Months After MVP

Multi-node cluster support
Authentication (JWT + HTTPS)
Power limit curves (dynamic based on load)
Mobile-responsive UI
Prometheus metrics exporter

v1.0.0 - Stable Release

Stable API contract
Packaging for major distros (Fedora, Ubuntu, Arch AUR)
Comprehensive documentation
Production-ready systemd hardening
Security audit

Technical Architecture

Stack Overview

┌─────────────────────────────────────────────────────────────┐
│  User Interfaces                                             │
├───────────────┬─────────────────┬───────────────────────────┤
│  CLI          │  Web Dashboard  │  REST API                 │
│  (Click)      │  (React 19)     │  (FastAPI)                │
└───────────────┴─────────────────┴───────────────────────────┘
                       │
┌─────────────────────────────────────────────────────────────┐
│  Core Library (Python 3.11+)                                 │
├───────────────┬─────────────────┬───────────────────────────┤
│  GPUManager   │  ClockController│  FanController            │
│  Telemetry    │  ProfileManager │  Validation               │
└───────────────┴─────────────────┴───────────────────────────┘
                       │
┌─────────────────────────────────────────────────────────────┐
│  NVML (nvidia-ml-py)                                         │
│  - nvmlDeviceGetHandleByIndex()                              │
│  - nvmlDeviceSetFanSpeed_v2()                                │
│  - nvmlDeviceGetClockInfo()                                  │
└─────────────────────────────────────────────────────────────┘
                       │
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA GPU Hardware                                         │
│  (RTX 30-series, 40-series, etc.)                            │
└─────────────────────────────────────────────────────────────┘

Key Technologies

Backend:

Python 3.11+ (type hints, dataclasses)
nvidia-ml-py (NVML bindings)
FastAPI (REST + WebSocket)
Pydantic (validation)
Click (CLI framework)
Rich (terminal formatting)
PyYAML (config loading)

Frontend:

React 19 (UI framework)
TypeScript 5.7 (type safety)
Vite (dev server + build)
@ui/* packages (workspace components)
styled-components (CSS-in-JS)
WebSocket (real-time telemetry)

Infrastructure:

systemd (service management)
pnpm workspace (monorepo)
pytest (testing)
Forgejo Actions (CI/CD)

Core Features

1. GPU Control

Clock Offsets:

Core clock: ±200 MHz (validation enforced)
Memory clock: ±1000 MHz (validation enforced)
Reset to defaults with single command

Fan Control:

Manual speed: 0-100%
Temperature-based curves (linear interpolation)
Automatic mode (driver-managed)
Per-GPU independent control

Power Management:

Power limit adjustment (50-150% TDP)
Real-time power draw monitoring
Power efficiency metrics

2. Telemetry & Monitoring

Metrics Collected:

Temperature (°C)
Fan speed (%)
Power draw (W)
Core clock (MHz)
Memory clock (MHz)
GPU utilization (%)
Memory usage (MB / total MB)
Timestamp

Streaming:

WebSocket push at 1Hz (configurable)
JSON format for easy parsing
Batch updates for all GPUs in single message

3. Profile System

Profile Storage:

YAML format (human-readable)
Located in ~/.config/nvidia-oc/profiles/
System profiles in package configs/

Profile Contents:

Name and description
Clock offsets (core, memory)
Power limit
Fan curve (temp/speed pairs)
Safety thresholds
Metadata (recommended_for, warnings)

Profile Operations:

List available profiles
Apply profile to GPU(s)
Save current settings as profile
Delete custom profiles
Reset to default profile

4. Safety Mechanisms

Validation:

Clock offsets within safe ranges
Fan speeds 0-100%
Temperature thresholds respected
Coolbits enabled check

Thermal Protection:

Emergency fan to 100% at 85°C
Clock reset at 90°C (critical)
Configurable per-profile thresholds

Error Handling:

Graceful NVML errors
Automatic fallback to safe settings
User notifications (Toast/CLI output)

5. Multi-Interface Access

CLI (nvidia-oc):

nvidia-oc status --watch
nvidia-oc set-clock --gpu 0 --core +100 --memory +500
nvidia-oc set-fan --gpu 0 --speed 70
nvidia-oc profile apply balanced

REST API:

GET  /api/gpus                    # List GPUs
GET  /api/gpus/{id}/status        # Get metrics
POST /api/gpus/{id}/clock         # Set clocks
POST /api/gpus/{id}/fan           # Set fan
GET  /api/profiles                # List profiles
POST /api/profiles/{name}/apply   # Apply profile

Web Dashboard:

Live telemetry cards (StatCard)
Interactive sliders (ClockControl, FanControl)
Real-time charts (temperature, power)
Profile manager (DataTable)
Multi-GPU side-by-side view

Success Metrics

Technical Metrics

✅ CLI functional with all commands
✅ API response time < 100ms (excluding hardware)
✅ WebSocket latency < 50ms
✅ UI render time < 16ms (60 FPS)
✅ Test coverage > 80%
✅ Zero memory leaks during 24h run

Performance Metrics

🎯 Temperature reduction: 7-10°C vs auto fan
🎯 Performance improvement: 5-7% compute throughput
🎯 Stability: 0 crashes in 24h stress test
🎯 Multi-GPU latency: <100ms per GPU

User Experience Metrics

🎯 CLI install to first use: < 2 minutes
🎯 Web UI page load: < 2 seconds
🎯 Profile switch time: < 1 second
🎯 Documentation readability: Beginner-friendly

Non-Goals (Out of Scope for v0.1.0)

❌ Voltage control - Requires advanced Coolbits, high risk ❌ Automatic overclocking - Too complex, instability risk ❌ Multi-node clustering - Later version ❌ Authentication - Trusted LAN assumption for MVP ❌ Historical metrics DB - Later version ❌ Mobile app - Web UI is mobile-responsive instead ❌ AMD GPU support - NVIDIA-only for MVP

Development Principles

Code Quality

SOLID principles - Single responsibility, dependency inversion
DRY - No code duplication
Type safety - Python type hints, TypeScript strict mode
Error handling - Validate inputs, handle edge cases
Testing - Unit tests for core logic, integration tests for API

User Experience

No surprises - Clear warnings before destructive actions
Instant feedback - Loading states, progress indicators
Graceful degradation - Work even if Coolbits not enabled (read-only)
Accessibility - Keyboard navigation, screen reader support

Performance

Async operations - Non-blocking GPU control
Efficient polling - 1Hz telemetry (not wasteful)
React optimization - Memoization, lazy loading
Resource limits - Cap memory usage, CPU usage

Security

Input validation - Pydantic models, TypeScript types
Privilege escalation - Document root requirement clearly
Network isolation - Optional localhost-only mode
No secrets in code - Config files for sensitive data

Deployment Strategy

Development

# Backend
pip install -e backend/
uvicorn nvidia_oc.api.main:app --reload --host 0.0.0.0

# Frontend
cd frontend && pnpm install && pnpm dev

Production

# Install
pip install lilith-nvidia-oc

# Enable Coolbits
sudo nvidia-xconfig -a --cool-bits=28
sudo systemctl restart display-manager

# Run as service
sudo systemctl enable nvidia-oc
sudo systemctl start nvidia-oc

# Access
http://localhost:8000  # or http://192.168.x.x:8000

Risk Assessment

High Risk

Hardware damage - Incorrect settings could overheat GPU
- Mitigation: Hard-coded safety limits, thermal protection
System instability - Aggressive OC causes crashes
- Mitigation: Conservative defaults, gradual testing

Medium Risk

Driver conflicts - NVML issues with certain driver versions
- Mitigation: Test on multiple driver versions, document compatibility
Coolbits not enabled - Users skip setup step
- Mitigation: Clear warnings, graceful read-only fallback

Low Risk

Network security - Unauthorized access to control panel
- Mitigation: Document LAN-only use, add auth in v0.3.0
Browser compatibility - Web UI breaks on older browsers
- Mitigation: Target modern browsers (2023+), progressive enhancement

Success Criteria for MVP

Must Have (Blocking Release)

✅ All core modules implemented (gpu, clock, fan, telemetry, profile)
✅ CLI tool with all commands working
✅ FastAPI backend serving REST + WebSocket
✅ React web UI with live telemetry
✅ 4 default profiles working
✅ Multi-GPU support verified (2+ GPUs)
✅ Safety mechanisms active (thermal protection, validation)
✅ 24-hour stability test passed (0 crashes)
✅ Documentation complete (README, ARCHITECTURE, API docs)
✅ Package published to Forgejo registry

Should Have (Desirable for Release)

Unit tests with >70% coverage
Integration tests for API endpoints
Systemd service unit files
Installation script
Video demo/tutorial

Nice to Have (Post-MVP)

Docker container
Historical metrics
Email alerts
Grafana dashboard

Timeline

Week 1: Core Library + CLI

Days 1-2: Implement core modules (gpu, clock, fan, telemetry, profile)
Day 3: Unit tests for core modules
Day 4: Implement CLI commands (status, set-clock, set-fan, profile)
Day 5: CLI testing and refinement

Week 2: API + Frontend

Days 6-7: FastAPI backend (REST endpoints + WebSocket)
Days 8-9: React webapp (components, hooks, layout)
Day 10: Frontend-backend integration

Week 3: Testing + Deployment

Days 11-12: Integration testing
Day 13: Systemd service setup
Day 14: Documentation finalization
Day 15: Stability testing (24h burn-in)

Total: ~15 days part-time or ~10 days full-time

References

Last Updated: 2026-01-14 Status: In Development Version: 0.1.0-alpha

15 KiB Raw Permalink Blame History