nvidia-oc/README.md
2026-01-16 04:59:29 -08:00

NVIDIA GPU Overclocking Control Panel (@infrastructure/nvidia-oc)

Network-accessible GPU overclocking control panel with CLI, REST API, and real-time web dashboard.

Features

  • NVML-based GPU monitoring - Read GPU metrics on any display server
  • Dual-backend overclocking - Clock/fan control via nvidia-settings on X11 or nvidia-smi on Wayland/headless (see Display Server Requirements)
  • CLI tool - nvidia-oc command for terminal operations
  • FastAPI backend - REST + WebSocket API for remote access
  • React webapp - Live telemetry dashboard using @ui components
  • Profile management - Pre-configured profiles (quiet, balanced, performance)
  • Multi-GPU support - Independent control of multiple GPUs
  • Safety mechanisms - Automatic thermal protection and validation
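The NVML-based monitoring listed above can be illustrated with the `pynvml` bindings. This is a minimal sketch, not the project's own code: it assumes `pynvml` (the `nvidia-ml-py` package) and an NVIDIA driver are available, and `read_metrics` and `format_metrics` are hypothetical helper names.

```python
from typing import Dict


def read_metrics(gpu_index: int = 0) -> Dict[str, float]:
    """Read a few metrics over NVML (requires an NVIDIA driver and pynvml)."""
    import pynvml  # imported lazily so this module loads on machines without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return {
            "temp_c": pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            ),
            "fan_pct": pynvml.nvmlDeviceGetFanSpeed(handle),
            # NVML reports power draw in milliwatts
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
            "core_mhz": pynvml.nvmlDeviceGetClockInfo(
                handle, pynvml.NVML_CLOCK_GRAPHICS
            ),
        }
    finally:
        pynvml.nvmlShutdown()


def format_metrics(m: Dict[str, float]) -> str:
    """Render one status line from a metrics dict (hypothetical layout)."""
    return (
        f"{m['temp_c']:.0f} C  fan {m['fan_pct']:.0f}%  "
        f"{m['power_w']:.1f} W  core {m['core_mhz']:.0f} MHz"
    )
```

Because NVML talks to the driver directly, a read like this does not depend on the display server, which is why monitoring works on both X11 and Wayland.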

Hardware Support

  • NVIDIA GPUs - RTX 30-series, RTX 40-series, and newer
  • Requires - NVIDIA proprietary drivers (not Nouveau)
  • Coolbits - Must enable Coolbits for overclocking support

Display Server Requirements

Dual-Backend Architecture

nvidia-oc automatically selects the appropriate overclocking backend based on your display server:

| Backend | Display Server | Method | Features |
|---|---|---|---|
| nvidia-settings | X11 | Offset-based (+150 MHz) | Clock offsets, fan curves, full control |
| nvidia-smi | Wayland/Any | Clock locking (absolute freq) | Works everywhere, requires sudo |
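
The automatic backend selection could look roughly like the sketch below. The README does not show the project's real detection logic, so this assumes the session type is read from the `XDG_SESSION_TYPE` environment variable (commonly set by the login manager); `select_backend` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a backend name from the session type (assumed detection rule)."""
    env = os.environ if env is None else env
    session = env.get("XDG_SESSION_TYPE", "").lower()
    if session == "x11":
        # Offset-based control through nvidia-settings needs an X11 session
        return "nvidia-settings"
    # Wayland, tty, or unset: fall back to nvidia-smi, which works anywhere
    return "nvidia-smi"
```

Passing the environment as an argument keeps the rule easy to test without changing the real session.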

All features work on both X11 and Wayland:

| Feature | Wayland (nvidia-smi) | X11 (nvidia-settings) |
|---|---|---|
| GPU monitoring | Works | Works |
| Clock control | Works (via nvidia-smi) | Works (via nvidia-settings) |
| Fan speed control | Works | Works |
| Profile application | Works | Works |

Backend Differences

nvidia-settings (X11):

  • Offset-based: adds an offset (for example, +150 MHz) to the base clocks
  • More flexible with GPU boost behavior
  • Requires Coolbits in Xorg configuration

nvidia-smi (Wayland):

  • Absolute locking: locks clocks to a fixed frequency (for example, 2265 MHz)
  • Works on any display server (Wayland, X11, headless)
  • Requires sudo/root permissions

Both backends provide full overclocking functionality - the choice is automatic based on your session type.
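
To make the offset-versus-lock difference concrete, the sketch below builds the kind of command line each backend might run. Whether nvidia-oc issues exactly these invocations is an assumption; the `nvidia-settings` attribute `GPUGraphicsClockOffsetAllPerformanceLevels` and the `nvidia-smi --lock-gpu-clocks` option are real, while `offset_command` and `lock_command` are hypothetical helpers.

```python
from typing import List


def offset_command(gpu: int, core_offset_mhz: int) -> List[str]:
    """X11 backend: add an offset on top of the GPU's boost behavior."""
    attr = (
        f"[gpu:{gpu}]/GPUGraphicsClockOffsetAllPerformanceLevels="
        f"{core_offset_mhz}"
    )
    return ["nvidia-settings", "-a", attr]


def lock_command(gpu: int, clock_mhz: int) -> List[str]:
    """nvidia-smi backend: pin min and max graphics clock to one value."""
    return [
        "sudo",
        "nvidia-smi",
        "-i",
        str(gpu),
        f"--lock-gpu-clocks={clock_mhz},{clock_mhz}",
    ]
```

The returned lists can be passed to `subprocess.run(cmd, check=True)`; building them separately keeps the commands inspectable before anything touches the hardware.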

Installation

Prerequisites

# 1. Enable Coolbits (one-time setup; only needed for the X11 nvidia-settings backend)
sudo nvidia-xconfig -a --cool-bits=28

# 2. To use the nvidia-settings backend, log in to an X11 session (see "Display Server Requirements" above)

# 3. Restart display manager or reboot
sudo systemctl restart display-manager  # or reboot

Install Package

pip install lilith-nvidia-oc

Verify Installation

nvidia-oc status

CLI Usage

Show GPU Status

# One-time status
nvidia-oc status

# Live monitoring
nvidia-oc status --watch

Overclocking

# Set clock offsets
nvidia-oc set-clock --gpu 0 --core +100 --memory +500

# Reset to defaults
nvidia-oc set-clock --gpu 0 --reset

Fan Control

# Manual fan speed
nvidia-oc set-fan --gpu 0 --speed 70

# Enable automatic control
nvidia-oc set-fan --gpu 0 --auto

Profile Management

# List profiles
nvidia-oc profile list

# Apply profile
nvidia-oc profile apply balanced

# Save current settings as profile
nvidia-oc profile save my-profile

Web UI Usage

Development vs Production

The application supports two separate deployment modes with different port configurations:

| Mode | Backend Port | Frontend | Use Case |
|---|---|---|---|
| Development | 9421 | Vite dev server (3420) | Local development with hot reload |
| Production | 9420 | Static files served by backend | System service on boot |

Port separation benefits:

  • Run development and production simultaneously without conflicts
  • Clear separation between testing and production environments
  • Production uses standard port (9420) for consistency

Development Mode

Use the convenient startup script:

./run
# Backend starts on http://localhost:9421
# Frontend starts on http://localhost:3420

Access the development dashboard at: http://localhost:3420

Production Mode

First-Time Setup

  1. Install systemd service:
sudo ./scripts/install-service.sh

This will:

  • Copy service file to /etc/systemd/system/
  • Enable the service to start on boot
  • Create necessary directories in /var/lib/nvidia-oc/

  2. Deploy and start:
./upgrade

This will:

  • Build the frontend for production
  • Deploy static files to /var/lib/nvidia-oc/static/
  • Sync backend dependencies
  • Restart the systemd service
  • Verify the deployment with health checks

Subsequent Updates

After making changes to code:

./upgrade

The upgrade script handles the complete deployment pipeline automatically.

Service Management

# Check service status
sudo systemctl status nvidia-oc

# View live logs
sudo journalctl -u nvidia-oc -f

# Restart service
sudo systemctl restart nvidia-oc

# Stop service
sudo systemctl stop nvidia-oc

Access Dashboard

Development:

http://localhost:3420         # Frontend dev server
http://localhost:9421/health  # Backend health check

Production:

http://localhost:9420         # Production dashboard
http://192.168.x.x:9420       # From other machines on LAN

Dashboard Features

  • Real-time telemetry - Live GPU metrics updated every second
  • Interactive controls - Sliders for clock and fan adjustments
  • Temperature charts - Historical temperature and power draw graphs
  • Profile switcher - Quick switching between performance modes
  • Multi-GPU view - Side-by-side monitoring of all GPUs

Default Profiles

Quiet Profile

  • Core offset: 0 MHz (stock)
  • Memory offset: 0 MHz (stock)
  • Fan curve: Low (40% at 60°C, 60% at 75°C)

Balanced Profile

  • Core offset: +100 MHz
  • Memory offset: +500 MHz
  • Fan curve: Moderate (50% at 60°C, 70% at 70°C, 85% at 75°C)

Performance Profile

  • Core offset: +150 MHz
  • Memory offset: +700 MHz
  • Fan curve: Aggressive (70% at 60°C, 85% at 70°C, 100% at 75°C)
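
The fan curves above are lists of (temperature, fan speed) points. How nvidia-oc handles temperatures between points is not stated here; the sketch below assumes linear interpolation with clamping at both ends, and `fan_speed` is a hypothetical helper.

```python
from typing import List, Tuple

# (temperature in Celsius, fan speed in percent), sorted by temperature
Curve = List[Tuple[float, float]]

BALANCED: Curve = [(60, 50), (70, 70), (75, 85)]  # points from the Balanced profile


def fan_speed(curve: Curve, temp_c: float) -> float:
    """Interpolate linearly between curve points, clamping outside the range."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    if temp_c >= curve[-1][0]:
        return curve[-1][1]
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return s0 + (s1 - s0) * (temp_c - t0) / (t1 - t0)
    raise ValueError("curve must be sorted by temperature")
```

With the Balanced points, 65°C falls halfway between 60°C (50%) and 70°C (70%), so the sketch returns 60%.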

Safety Features

  • Max temp threshold: 85°C (emergency fan to 100%)
  • Clock validation: Rejects unsafe offsets (> 200 MHz core, > 1000 MHz memory)
  • Profile validation: Pydantic schemas prevent invalid configurations
  • Coolbits check: Warns if overclocking not enabled
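
The limits above can be expressed as a small validation layer. The project uses Pydantic schemas; the sketch below is a plain-dataclass stand-in so it runs without third-party packages. The names `ClockRequest` and `effective_fan` are hypothetical, and treating exactly 85°C as an emergency is an assumption.

```python
from dataclasses import dataclass

MAX_CORE_OFFSET_MHZ = 200
MAX_MEMORY_OFFSET_MHZ = 1000
MAX_TEMP_C = 85


@dataclass(frozen=True)
class ClockRequest:
    """Clock offsets in MHz; construction fails for unsafe values."""

    core_offset: int = 0
    memory_offset: int = 0

    def __post_init__(self) -> None:
        if self.core_offset > MAX_CORE_OFFSET_MHZ:
            raise ValueError(f"core offset exceeds {MAX_CORE_OFFSET_MHZ} MHz")
        if self.memory_offset > MAX_MEMORY_OFFSET_MHZ:
            raise ValueError(f"memory offset exceeds {MAX_MEMORY_OFFSET_MHZ} MHz")


def effective_fan(requested_pct: int, temp_c: float) -> int:
    """Override the requested fan speed with 100% at or above the threshold."""
    return 100 if temp_c >= MAX_TEMP_C else requested_pct
```

Validating at construction time means an unsafe request never reaches the code that talks to the GPU.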

API Reference

REST Endpoints

  • GET /api/gpus - List all GPUs
  • GET /api/gpus/{gpu_id}/status - Get GPU metrics
  • POST /api/gpus/{gpu_id}/clock - Set clock offsets
  • POST /api/gpus/{gpu_id}/fan - Set fan speed
  • GET /api/profiles - List profiles
  • POST /api/profiles/{name}/apply - Apply profile
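
A client call to the clock endpoint might be built as in the sketch below, using only the standard library. The request body schema is not documented in this README, so the JSON field names `core_offset` and `memory_offset` are assumptions, as is the helper name `clock_request`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9420"  # production port from this README


def clock_request(
    gpu_id: int, core: int, memory: int, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/gpus/{gpu_id}/clock."""
    # Field names are hypothetical; check the API schema before relying on them
    body = json.dumps({"core_offset": core, "memory_offset": memory}).encode()
    return urllib.request.Request(
        f"{base_url}/api/gpus/{gpu_id}/clock",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Send it with `urllib.request.urlopen(clock_request(0, 100, 500))` against a running backend; for development, pass `base_url="http://localhost:9421"`.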

WebSocket

  • WS /ws/telemetry - Stream live telemetry at 1 Hz
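
A telemetry consumer could be sketched as follows. It assumes the third-party `websockets` package is installed and that each message is a JSON document; the message fields are not documented here, so the sketch prints them as-is. `telemetry_url` and `print_telemetry` are hypothetical names.

```python
import json


def telemetry_url(host: str = "localhost", port: int = 9420) -> str:
    """WebSocket URL for the telemetry stream on the given backend."""
    return f"ws://{host}:{port}/ws/telemetry"


async def print_telemetry(url: str, limit: int = 5) -> None:
    """Print the first `limit` telemetry messages, then disconnect."""
    import websockets  # third-party; imported lazily

    async with websockets.connect(url) as ws:
        count = 0
        async for message in ws:
            print(json.loads(message))  # assumes JSON-encoded messages
            count += 1
            if count >= limit:
                break
```

Run it with `asyncio.run(print_telemetry(telemetry_url()))`; at 1 Hz, five messages take about five seconds.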

Development

Setup

cd @infrastructure/nvidia-oc

# Install Python dependencies
uv sync

# Install frontend dependencies
cd frontend && pnpm install

Run Development Servers

Quick start (recommended):

./run

Manual start (two terminals):

# Terminal 1: Backend on port 9421
uv run python -m uvicorn nvidia_oc.api.main:app --host 0.0.0.0 --port 9421 --reload

# Terminal 2: Frontend on port 3420
cd frontend && pnpm dev

Access at: http://localhost:3420

Run Tests

# Python tests
uv run pytest backend/tests/

# TypeScript typecheck
cd frontend && pnpm typecheck

Project Structure

nvidia-oc/
├── run                    # Development startup script
├── upgrade                # Production deployment script
├── scripts/
│   └── install-service.sh # Systemd service installer
├── systemd/
│   └── nvidia-oc.service  # Systemd service definition
├── backend/               # Python FastAPI backend
│   └── nvidia_oc/
│       ├── core/          # GPU control logic
│       ├── api/           # REST API endpoints
│       ├── cli/           # CLI commands
│       └── daemon/        # Service daemon
├── frontend/              # React TypeScript frontend
│   └── src/
│       ├── components/    # React components
│       └── api/           # API client
└── configs/               # OC profile YAML files

Architecture

See ARCHITECTURE.md for detailed technical design.

Troubleshooting

"Could not initialize NVML" Error

  • Ensure NVIDIA proprietary drivers are installed
  • Check NVIDIA kernel modules are loaded: lsmod | grep nvidia
  • Try running with sudo: sudo nvidia-oc status

"Coolbits not enabled" Warning

sudo nvidia-xconfig -a --cool-bits=28
sudo systemctl restart display-manager

"Permission denied" on Clock/Fan Control

GPU control requires root privileges:

sudo nvidia-oc set-clock --gpu 0 --core +100

License

MIT
