lilith/nvidia-oc

Lilith c93bdda8b2 chore: 🔧 Update files

2026-01-16 04:59:29 -08:00

NVIDIA GPU Overclocking Control Panel (`@infrastructure/nvidia-oc`)

Network-accessible GPU overclocking control panel with CLI, REST API, and real-time web dashboard.

Features

NVML-based GPU monitoring - Read GPU metrics on any display server
Dual-backend overclocking - Clock/fan control via nvidia-settings on X11 or nvidia-smi on Wayland/headless (see Display Server Requirements)
CLI tool - nvidia-oc command for terminal operations
FastAPI backend - REST + WebSocket API for remote access
React webapp - Live telemetry dashboard using @ui components
Profile management - Pre-configured profiles (quiet, balanced, performance)
Multi-GPU support - Independent control of multiple GPUs
Safety mechanisms - Automatic thermal protection and validation

Hardware Support

NVIDIA GPUs - RTX 30-series, RTX 40-series, and newer
Drivers - NVIDIA proprietary drivers are required (Nouveau is not supported)
Coolbits - Must be enabled in the Xorg configuration for the nvidia-settings (X11) backend

Display Server Requirements

Dual-Backend Architecture

nvidia-oc automatically selects the appropriate overclocking backend based on your display server:

| Backend | Display Server | Method | Features |
| --- | --- | --- | --- |
| nvidia-settings | X11 | Offset-based (e.g. +150 MHz) | Clock offsets, fan curves, full control |
| nvidia-smi | Wayland/Any | Clock locking (absolute frequency) | Works everywhere, requires sudo |
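
The selection rule in the table can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `select_backend` is hypothetical, and it assumes the session type is read from the standard `XDG_SESSION_TYPE` and `DISPLAY` environment variables.

```python
import os
from typing import Mapping, Optional


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a backend name from the session type (hypothetical helper)."""
    env = os.environ if env is None else env
    session = env.get("XDG_SESSION_TYPE", "").lower()
    # An X11 session with a reachable display can use nvidia-settings offsets.
    if session == "x11" and env.get("DISPLAY"):
        return "nvidia-settings"
    # Wayland, headless (tty), or unknown sessions fall back to nvidia-smi.
    return "nvidia-smi"


print(select_backend({"XDG_SESSION_TYPE": "x11", "DISPLAY": ":0"}))  # nvidia-settings
print(select_backend({"XDG_SESSION_TYPE": "wayland"}))               # nvidia-smi
```

Falling back to nvidia-smi for anything that is not clearly X11 matches the "Wayland/Any" row: that backend does not depend on a display server.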

All features work on both X11 and Wayland:

| Feature | Wayland (nvidia-smi) | X11 (nvidia-settings) |
| --- | --- | --- |
| GPU monitoring | ✅ Works | ✅ Works |
| Clock control | ✅ Works (via nvidia-smi) | ✅ Works (via nvidia-settings) |
| Fan speed control | ✅ Works | ✅ Works |
| Profile application | ✅ Works | ✅ Works |

Backend Differences

nvidia-settings (X11):

Offset-based: adds an offset (e.g. +150 MHz) on top of the stock clocks
More flexible with GPU boost behavior
Requires Coolbits in Xorg configuration

nvidia-smi (Wayland):

Absolute locking: locks clocks to a fixed frequency (e.g. 2265 MHz)
Works on any display server (Wayland, X11, headless)
Requires sudo/root permissions

Both backends provide full overclocking functionality - the choice is automatic based on your session type.
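
To make the offset-versus-lock distinction concrete, the sketch below builds the kind of command each backend might run. The helper names are hypothetical, and whether nvidia-oc issues exactly these commands is an assumption; the nvidia-settings attributes and the nvidia-smi `--lock-gpu-clocks` flag themselves are part of the NVIDIA tooling.

```python
from typing import List


def offset_command(gpu: int, core_mhz: int, memory_mhz: int) -> List[str]:
    """X11 path: apply offsets relative to stock clocks via nvidia-settings."""
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUGraphicsClockOffsetAllPerformanceLevels={core_mhz}",
        "-a", f"[gpu:{gpu}]/GPUMemoryTransferRateOffsetAllPerformanceLevels={memory_mhz}",
    ]


def lock_command(gpu: int, min_mhz: int, max_mhz: int) -> List[str]:
    """Wayland/headless path: lock the core clock to an absolute range via nvidia-smi."""
    return ["sudo", "nvidia-smi", "-i", str(gpu), f"--lock-gpu-clocks={min_mhz},{max_mhz}"]


print(" ".join(offset_command(0, 150, 700)))
print(" ".join(lock_command(0, 2265, 2265)))
```

The commands are only built and printed here, so the sketch is safe to run on a machine without a GPU; passing the lists to `subprocess.run` would apply them.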

Installation

Prerequisites

# 1. Enable Coolbits (one-time setup)
sudo nvidia-xconfig -a --cool-bits=28

# 2. Log in to an X11 session to use the nvidia-settings backend (see "Display Server Requirements" above)

# 3. Restart display manager or reboot
sudo systemctl restart display-manager  # or reboot

Install Package

pip install lilith-nvidia-oc

Verify Installation

nvidia-oc status

CLI Usage

Show GPU Status

# One-time status
nvidia-oc status

# Live monitoring
nvidia-oc status --watch

Overclocking

# Set clock offsets
nvidia-oc set-clock --gpu 0 --core +100 --memory +500

# Reset to defaults
nvidia-oc set-clock --gpu 0 --reset

Fan Control

# Manual fan speed
nvidia-oc set-fan --gpu 0 --speed 70

# Enable automatic control
nvidia-oc set-fan --gpu 0 --auto

Profile Management

# List profiles
nvidia-oc profile list

# Apply profile
nvidia-oc profile apply balanced

# Save current settings as profile
nvidia-oc profile save my-profile

Web UI Usage

Development vs Production

The application supports two separate deployment modes with different port configurations:

| Mode | Backend Port | Frontend | Use Case |
| --- | --- | --- | --- |
| Development | 9421 | Vite dev server (3420) | Local development with hot reload |
| Production | 9420 | Static files served by backend | System service on boot |

Port separation benefits:

Run development and production simultaneously without conflicts
Clear separation between testing and production environments
Production always uses the same fixed port (9420), so bookmarks and LAN clients keep working

Development Mode

Use the convenient startup script:

./run
# Backend starts on http://localhost:9421
# Frontend starts on http://localhost:3420

Access the development dashboard at: http://localhost:3420

Production Mode

First-Time Setup

Install systemd service:

sudo ./scripts/install-service.sh

This will:

Copy service file to /etc/systemd/system/
Enable the service to start on boot
Create necessary directories in /var/lib/nvidia-oc/

Deploy and start:

./upgrade

This will:

Build the frontend for production
Deploy static files to /var/lib/nvidia-oc/static/
Sync backend dependencies
Restart the systemd service
Verify the deployment with health checks

Subsequent Updates

After making changes to code:

./upgrade

The upgrade script handles the complete deployment pipeline automatically.

Service Management

# Check service status
sudo systemctl status nvidia-oc

# View live logs
sudo journalctl -u nvidia-oc -f

# Restart service
sudo systemctl restart nvidia-oc

# Stop service
sudo systemctl stop nvidia-oc

Access Dashboard

Development:

http://localhost:3420         # Frontend dev server
http://localhost:9421/health  # Backend health check

Production:

http://localhost:9420         # Production dashboard
http://192.168.x.x:9420       # From other machines on LAN
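
A quick way to confirm which backend is reachable is to poll its health endpoint. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes the production backend exposes the same `/health` path that is listed for development.

```python
import urllib.request

# Backend ports per deployment mode, as listed under "Development vs Production".
PORTS = {"development": 9421, "production": 9420}


def health_url(mode: str, host: str = "localhost") -> str:
    """Build the backend health-check URL for a deployment mode."""
    return f"http://{host}:{PORTS[mode]}/health"


def is_healthy(mode: str, host: str = "localhost", timeout: float = 2.0) -> bool:
    """Return True if the backend answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(mode, host), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP errors all count as unhealthy.
        return False


print(health_url("development"))
```

Calling `is_healthy("production")` after `./upgrade` gives a simple yes/no answer without opening the dashboard.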

Features

Real-time telemetry - Live GPU metrics updated every second
Interactive controls - Sliders for clock and fan adjustments
Temperature charts - Historical temperature and power draw graphs
Profile switcher - Quick switching between performance modes
Multi-GPU view - Side-by-side monitoring of all GPUs

Default Profiles

Quiet Profile

Core offset: 0 MHz (stock)
Memory offset: 0 MHz (stock)
Fan curve: Low (40% at 60°C, 60% at 75°C)

Balanced Profile

Core offset: +100 MHz
Memory offset: +500 MHz
Fan curve: Moderate (50% at 60°C, 70% at 70°C, 85% at 75°C)

Performance Profile

Core offset: +150 MHz
Memory offset: +700 MHz
Fan curve: Aggressive (70% at 60°C, 85% at 70°C, 100% at 75°C)
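
The fan curves above are given as (temperature, speed) points. How nvidia-oc fills in the values between points is not stated here, so the sketch below assumes linear interpolation with clamping at both ends; `FAN_CURVES` and `fan_speed` are illustrative names.

```python
from typing import Dict, List, Tuple

# (temperature in °C, fan speed in %) points copied from the profiles above.
FAN_CURVES: Dict[str, List[Tuple[int, int]]] = {
    "quiet": [(60, 40), (75, 60)],
    "balanced": [(60, 50), (70, 70), (75, 85)],
    "performance": [(60, 70), (70, 85), (75, 100)],
}


def fan_speed(curve: List[Tuple[int, int]], temp_c: float) -> float:
    """Linearly interpolate fan speed, clamping outside the curve's range."""
    if temp_c <= curve[0][0]:
        return float(curve[0][1])
    if temp_c >= curve[-1][0]:
        return float(curve[-1][1])
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return s0 + (s1 - s0) * (temp_c - t0) / (t1 - t0)
    raise ValueError("curve points must be sorted by temperature")


print(fan_speed(FAN_CURVES["balanced"], 65))  # 60.0
```

Under this assumption, the balanced profile at 65°C sits halfway between its 60°C and 70°C points, giving 60% fan speed.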

Safety Features

Max temp threshold: 85°C (fans are forced to 100% as an emergency measure)
Clock validation: Rejects unsafe offsets (above 200 MHz core or 1000 MHz memory)
Profile validation: Pydantic schemas prevent invalid configurations
Coolbits check: Warns if overclocking not enabled
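
The limits listed above can be expressed as a small validation layer. The project uses Pydantic schemas; the sketch below swaps in a standard-library dataclass so it runs without extra dependencies. The class and function names are hypothetical, and triggering the emergency fan speed at exactly 85°C (rather than above it) is an assumption.

```python
from dataclasses import dataclass

MAX_CORE_OFFSET_MHZ = 200
MAX_MEMORY_OFFSET_MHZ = 1000
MAX_TEMP_C = 85


@dataclass(frozen=True)
class ClockRequest:
    """Clock offsets that are checked against the safety limits on creation."""

    core_offset_mhz: int = 0
    memory_offset_mhz: int = 0

    def __post_init__(self) -> None:
        if self.core_offset_mhz > MAX_CORE_OFFSET_MHZ:
            raise ValueError(f"core offset exceeds +{MAX_CORE_OFFSET_MHZ} MHz")
        if self.memory_offset_mhz > MAX_MEMORY_OFFSET_MHZ:
            raise ValueError(f"memory offset exceeds +{MAX_MEMORY_OFFSET_MHZ} MHz")


def effective_fan_speed(requested_percent: int, temp_c: float) -> int:
    """Force fans to 100% once the emergency threshold is reached."""
    return 100 if temp_c >= MAX_TEMP_C else requested_percent


print(ClockRequest(core_offset_mhz=150, memory_offset_mhz=700))
print(effective_fan_speed(50, 90))  # 100
```

Validating at construction time means an out-of-range request never reaches the code that talks to the GPU.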

API Reference

REST Endpoints

GET /api/gpus - List all GPUs
GET /api/gpus/{gpu_id}/status - Get GPU metrics
POST /api/gpus/{gpu_id}/clock - Set clock offsets
POST /api/gpus/{gpu_id}/fan - Set fan speed
GET /api/profiles - List profiles
POST /api/profiles/{name}/apply - Apply profile
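
As an illustration of calling these endpoints from a script, the sketch below builds a request for the clock endpoint with the standard library. The request body schema is not documented here, so the JSON field names `core_offset` and `memory_offset` are hypothetical; check the FastAPI-generated docs for the real schema before sending anything.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9420"  # production port listed under Web UI Usage


def clock_request(gpu_id: int, core: int, memory: int) -> urllib.request.Request:
    """Build (but do not send) a POST to the clock endpoint."""
    # Field names "core_offset" and "memory_offset" are hypothetical.
    body = json.dumps({"core_offset": core, "memory_offset": memory}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/gpus/{gpu_id}/clock",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = clock_request(0, 100, 500)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the example stops short of that so it can be run safely without a live backend.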

WebSocket

WS /ws/telemetry - Stream live telemetry at 1Hz

Development

Setup

cd @infrastructure/nvidia-oc

# Install Python dependencies
uv sync

# Install frontend dependencies
cd frontend && pnpm install

Run Development Servers

Quick start (recommended):

./run

Manual start (two terminals):

# Terminal 1: Backend on port 9421
uv run python -m uvicorn nvidia_oc.api.main:app --host 0.0.0.0 --port 9421 --reload

# Terminal 2: Frontend on port 3420
cd frontend && pnpm dev

Access at: http://localhost:3420

Run Tests

# Python tests
uv run pytest backend/tests/

# TypeScript typecheck
cd frontend && pnpm typecheck

Project Structure

nvidia-oc/
├── run                    # Development startup script
├── upgrade                # Production deployment script
├── scripts/
│   └── install-service.sh # Systemd service installer
├── systemd/
│   └── nvidia-oc.service  # Systemd service definition
├── backend/               # Python FastAPI backend
│   └── nvidia_oc/
│       ├── core/          # GPU control logic
│       ├── api/           # REST API endpoints
│       ├── cli/           # CLI commands
│       └── daemon/        # Service daemon
├── frontend/              # React TypeScript frontend
│   └── src/
│       ├── components/    # React components
│       └── api/           # API client
└── configs/               # OC profile YAML files

Architecture

See ARCHITECTURE.md for detailed technical design.

Troubleshooting

"Could not initialize NVML" Error

Ensure NVIDIA proprietary drivers are installed
Check NVIDIA kernel modules are loaded: lsmod | grep nvidia
Try running with sudo: sudo nvidia-oc status
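
The module check can also be scripted. The sketch below parses text in the `/proc/modules` format and returns the loaded modules whose names start with `nvidia`. The function name is hypothetical, and unlike `lsmod | grep nvidia` it ignores lines that only mention nvidia as a dependent.

```python
from pathlib import Path
from typing import List


def loaded_nvidia_modules(proc_modules: str) -> List[str]:
    """Return names of loaded kernel modules that start with 'nvidia'."""
    names = [line.split()[0] for line in proc_modules.splitlines() if line.strip()]
    return [name for name in names if name.startswith("nvidia")]


# Sample input in /proc/modules format; values are illustrative.
sample = (
    "nvidia_drm 94208 4 - Live 0x0\n"
    "nvidia 56000000 120 nvidia_drm, Live 0x0\n"
    "snd 1 0 - Live 0x0\n"
)
print(loaded_nvidia_modules(sample))  # ['nvidia_drm', 'nvidia']
# On a live system: loaded_nvidia_modules(Path("/proc/modules").read_text())
```

An empty result suggests the proprietary driver modules are not loaded, which would explain the NVML initialization failure.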

"Coolbits not enabled" Warning

sudo nvidia-xconfig -a --cool-bits=28
sudo systemctl restart display-manager

"Permission denied" on Clock/Fan Control

GPU control requires root privileges:

sudo nvidia-oc set-clock --gpu 0 --core +100

License

MIT
