15 KiB
NVIDIA GPU Overclocking Control Panel - Project Overview
Mission Statement
Build a production-grade, network-accessible GPU overclocking control panel that provides:
- Safety-first overclocking with validation and thermal protection
- Multi-interface access via CLI, REST API, and web dashboard
- Real-time monitoring with 1Hz telemetry streaming
- Profile management for quick switching between performance modes
- Cross-platform support for Wayland, X11, and headless environments
Problem Statement
Current Pain Points
-
MSI Afterburner doesn't work on Linux
- Windows-only GUI tool
- Wine compatibility is unreliable
-
nvidia-settings requires X11
- Breaks on Wayland
- Unusable on headless servers
- No remote access capabilities
-
Manual nvidia-smi is tedious
- Can't control fans
- No profile system
- No real-time monitoring dashboard
-
Existing Linux tools are fragmented
- CLI-only with poor UX
- No web interface
- Limited to single-GPU setups
- No safety mechanisms
Our Solution
A unified, full-stack solution that:
- Uses NVML Python bindings (works on Wayland + headless)
- Provides three interfaces: CLI, REST API, Web UI
- Real-time WebSocket telemetry dashboard
- Profile-based configuration with YAML
- Multi-GPU support with independent control
- Built-in safety thresholds and validation
Target Users
Primary Users
- ML Engineers - Running training workloads on local GPUs
- Gamers on Linux - Want MSI Afterburner equivalent
- Cryptocurrency Miners - Need 24/7 stable overclocking
- System Administrators - Managing GPU servers remotely
Use Cases
Use Case 1: Daily ML Training
- User: ML engineer with 2x RTX 3090
- Need: Better thermals during 8-hour training runs
- Solution: Apply "Balanced" profile via CLI before training
- Benefit: 10°C cooler GPUs, 5% faster training
Use Case 2: Remote Server Management
- User: DevOps managing 10 GPU servers
- Need: Monitor and adjust GPU settings remotely
- Solution: Web dashboard accessible from laptop
- Benefit: No SSH required, live telemetry, quick profile switching
Use Case 3: Gaming Performance
- User: Linux gamer wanting max FPS
- Need: MSI Afterburner alternative
- Solution: "Performance" profile + web UI for monitoring
- Benefit: +10% FPS, real-time temp/clock monitoring
Use Case 4: Quiet Computing
- User: Developer in shared office
- Need: Reduce fan noise during compiling
- Solution: "Quiet" profile with low fan curve
- Benefit: Barely audible GPU fans, stock performance
Project Goals
MVP (v0.1.0) - 2-3 Weeks
- ✅ Python core library (NVML wrappers)
- ✅ CLI tool with all commands
- ✅ FastAPI backend with WebSocket
- ✅ React web UI with @ui components
- ✅ 4 default profiles (quiet, balanced, performance, default)
- ✅ Multi-GPU support
- ✅ Safety validation
v0.2.0 - 1 Month After MVP
- Historical metrics database (SQLite)
- Profile scheduler (auto-switch by time/load)
- Voltage control (advanced Coolbits)
- Email/webhook alerts on thermal events
- Docker containerization
v0.3.0 - 2 Months After MVP
- Multi-node cluster support
- Authentication (JWT + HTTPS)
- Power limit curves (dynamic based on load)
- Mobile-responsive UI
- Prometheus metrics exporter
v1.0.0 - Stable Release
- Stable API contract
- Packaging for major distros (Fedora, Ubuntu, Arch AUR)
- Comprehensive documentation
- Production-ready systemd hardening
- Security audit
Technical Architecture
Stack Overview
┌─────────────────────────────────────────────────────────────┐
│ User Interfaces │
├───────────────┬─────────────────┬───────────────────────────┤
│ CLI │ Web Dashboard │ REST API │
│ (Click) │ (React 19) │ (FastAPI) │
└───────────────┴─────────────────┴───────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Core Library (Python 3.11+) │
├───────────────┬─────────────────┬───────────────────────────┤
│ GPUManager │ ClockController│ FanController │
│ Telemetry │ ProfileManager │ Validation │
└───────────────┴─────────────────┴───────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ NVML (nvidia-ml-py) │
│ - nvmlDeviceGetHandleByIndex() │
│ - nvmlDeviceSetFanSpeed_v2() │
│ - nvmlDeviceGetClockInfo() │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA GPU Hardware │
│ (RTX 30-series, 40-series, etc.) │
└─────────────────────────────────────────────────────────────┘
Key Technologies
Backend:
- Python 3.11+ (type hints, dataclasses)
- nvidia-ml-py (NVML bindings)
- FastAPI (REST + WebSocket)
- Pydantic (validation)
- Click (CLI framework)
- Rich (terminal formatting)
- PyYAML (config loading)
Frontend:
- React 19 (UI framework)
- TypeScript 5.7 (type safety)
- Vite (dev server + build)
- @ui/* packages (workspace components)
- styled-components (CSS-in-JS)
- WebSocket (real-time telemetry)
Infrastructure:
- systemd (service management)
- pnpm workspace (monorepo)
- pytest (testing)
- Forgejo Actions (CI/CD)
Core Features
1. GPU Control
Clock Offsets:
- Core clock: ±200 MHz (validation enforced)
- Memory clock: ±1000 MHz (validation enforced)
- Reset to defaults with single command
Fan Control:
- Manual speed: 0-100%
- Temperature-based curves (linear interpolation)
- Automatic mode (driver-managed)
- Per-GPU independent control
Power Management:
- Power limit adjustment (50-150% TDP)
- Real-time power draw monitoring
- Power efficiency metrics
2. Telemetry & Monitoring
Metrics Collected:
- Temperature (°C)
- Fan speed (%)
- Power draw (W)
- Core clock (MHz)
- Memory clock (MHz)
- GPU utilization (%)
- Memory usage (MB / total MB)
- Timestamp
Streaming:
- WebSocket push at 1Hz (configurable)
- JSON format for easy parsing
- Batch updates for all GPUs in single message
3. Profile System
Profile Storage:
- YAML format (human-readable)
- Located in
~/.config/nvidia-oc/profiles/ - System profiles in package
configs/
Profile Contents:
- Name and description
- Clock offsets (core, memory)
- Power limit
- Fan curve (temp/speed pairs)
- Safety thresholds
- Metadata (recommended_for, warnings)
Profile Operations:
- List available profiles
- Apply profile to GPU(s)
- Save current settings as profile
- Delete custom profiles
- Reset to default profile
4. Safety Mechanisms
Validation:
- Clock offsets within safe ranges
- Fan speeds 0-100%
- Temperature thresholds respected
- Coolbits enabled check
Thermal Protection:
- Emergency fan to 100% at 85°C
- Clock reset at 90°C (critical)
- Configurable per-profile thresholds
Error Handling:
- Graceful NVML errors
- Automatic fallback to safe settings
- User notifications (Toast/CLI output)
5. Multi-Interface Access
CLI (nvidia-oc):
nvidia-oc status --watch
nvidia-oc set-clock --gpu 0 --core +100 --memory +500
nvidia-oc set-fan --gpu 0 --speed 70
nvidia-oc profile apply balanced
REST API:
GET /api/gpus # List GPUs
GET /api/gpus/{id}/status # Get metrics
POST /api/gpus/{id}/clock # Set clocks
POST /api/gpus/{id}/fan # Set fan
GET /api/profiles # List profiles
POST /api/profiles/{name}/apply # Apply profile
Web Dashboard:
- Live telemetry cards (StatCard)
- Interactive sliders (ClockControl, FanControl)
- Real-time charts (temperature, power)
- Profile manager (DataTable)
- Multi-GPU side-by-side view
Success Metrics
Technical Metrics
- ✅ CLI functional with all commands
- ✅ API response time < 100ms (excluding hardware)
- ✅ WebSocket latency < 50ms
- ✅ UI render time < 16ms (60 FPS)
- ✅ Test coverage > 80%
- ✅ Zero memory leaks during 24h run
Performance Metrics
- 🎯 Temperature reduction: 7-10°C vs auto fan
- 🎯 Performance improvement: 5-7% compute throughput
- 🎯 Stability: 0 crashes in 24h stress test
- 🎯 Multi-GPU latency: <100ms per GPU
User Experience Metrics
- 🎯 CLI install to first use: < 2 minutes
- 🎯 Web UI page load: < 2 seconds
- 🎯 Profile switch time: < 1 second
- 🎯 Documentation readability: Beginner-friendly
Non-Goals (Out of Scope for v0.1.0)
❌ Voltage control - Requires advanced Coolbits, high risk ❌ Automatic overclocking - Too complex, instability risk ❌ Multi-node clustering - Later version ❌ Authentication - Trusted LAN assumption for MVP ❌ Historical metrics DB - Later version ❌ Mobile app - Web UI is mobile-responsive instead ❌ AMD GPU support - NVIDIA-only for MVP
Development Principles
Code Quality
- SOLID principles - Single responsibility, dependency inversion
- DRY - No code duplication
- Type safety - Python type hints, TypeScript strict mode
- Error handling - Validate inputs, handle edge cases
- Testing - Unit tests for core logic, integration tests for API
User Experience
- No surprises - Clear warnings before destructive actions
- Instant feedback - Loading states, progress indicators
- Graceful degradation - Work even if Coolbits not enabled (read-only)
- Accessibility - Keyboard navigation, screen reader support
Performance
- Async operations - Non-blocking GPU control
- Efficient polling - 1Hz telemetry (not wasteful)
- React optimization - Memoization, lazy loading
- Resource limits - Cap memory usage, CPU usage
Security
- Input validation - Pydantic models, TypeScript types
- Privilege escalation - Document root requirement clearly
- Network isolation - Optional localhost-only mode
- No secrets in code - Config files for sensitive data
Deployment Strategy
Development
# Backend
pip install -e backend/
uvicorn nvidia_oc.api.main:app --reload --host 0.0.0.0
# Frontend
cd frontend && pnpm install && pnpm dev
Production
# Install
pip install lilith-nvidia-oc
# Enable Coolbits
sudo nvidia-xconfig -a --cool-bits=28
sudo systemctl restart display-manager
# Run as service
sudo systemctl enable nvidia-oc
sudo systemctl start nvidia-oc
# Access
http://localhost:8000 # or http://192.168.x.x:8000
Risk Assessment
High Risk
- Hardware damage - Incorrect settings could overheat GPU
- Mitigation: Hard-coded safety limits, thermal protection
- System instability - Aggressive OC causes crashes
- Mitigation: Conservative defaults, gradual testing
Medium Risk
- Driver conflicts - NVML issues with certain driver versions
- Mitigation: Test on multiple driver versions, document compatibility
- Coolbits not enabled - Users skip setup step
- Mitigation: Clear warnings, graceful read-only fallback
Low Risk
- Network security - Unauthorized access to control panel
- Mitigation: Document LAN-only use, add auth in v0.3.0
- Browser compatibility - Web UI breaks on older browsers
- Mitigation: Target modern browsers (2023+), progressive enhancement
Success Criteria for MVP
Must Have (Blocking Release)
- ✅ All core modules implemented (gpu, clock, fan, telemetry, profile)
- ✅ CLI tool with all commands working
- ✅ FastAPI backend serving REST + WebSocket
- ✅ React web UI with live telemetry
- ✅ 4 default profiles working
- ✅ Multi-GPU support verified (2+ GPUs)
- ✅ Safety mechanisms active (thermal protection, validation)
- ✅ 24-hour stability test passed (0 crashes)
- ✅ Documentation complete (README, ARCHITECTURE, API docs)
- ✅ Package published to Forgejo registry
Should Have (Desirable for Release)
- Unit tests with >70% coverage
- Integration tests for API endpoints
- Systemd service unit files
- Installation script
- Video demo/tutorial
Nice to Have (Post-MVP)
- Docker container
- Historical metrics
- Email alerts
- Grafana dashboard
Timeline
Week 1: Core Library + CLI
- Days 1-2: Implement core modules (gpu, clock, fan, telemetry, profile)
- Day 3: Unit tests for core modules
- Day 4: Implement CLI commands (status, set-clock, set-fan, profile)
- Day 5: CLI testing and refinement
Week 2: API + Frontend
- Days 6-7: FastAPI backend (REST endpoints + WebSocket)
- Days 8-9: React webapp (components, hooks, layout)
- Day 10: Frontend-backend integration
Week 3: Testing + Deployment
- Days 11-12: Integration testing
- Day 13: Systemd service setup
- Day 14: Documentation finalization
- Day 15: Stability testing (24h burn-in)
Total: ~15 days part-time or ~10 days full-time
References
- NVIDIA Management Library (NVML)
- nvidia-ml-py PyPI
- Linux NVIDIA Overclocking Guide
- ArchWiki: NVIDIA Tips and Tricks
- FastAPI Documentation
- @ui Component Library
Last Updated: 2026-01-14 Status: In Development Version: 0.1.0-alpha