NVIDIA OC Development TODOs
Project: @infrastructure/nvidia-oc - NVIDIA GPU Overclocking Control Panel
Status: In Development (Phase 2)
Start Date: 2026-01-14
Target Release: v0.1.0 (2-3 weeks)
Phase 1: Package Scaffolding ✅ COMPLETED
1.1 Directory Structure ✅
- Create package directory: @infrastructure/nvidia-oc/
- Create backend Python structure: backend/nvidia_oc/{core,api,cli,config,utils}/
- Create frontend React structure: frontend/src/{components,hooks,api,types}/
- Create configs directory: configs/
- Create systemd directory: systemd/
- Create tests directory: backend/tests/
1.2 Package Metadata ✅
- Generate pyproject.toml with all Python dependencies
- Generate package.json with @ui workspace dependencies
- Create .gitignore for Python + Node
- Create README.md with user documentation
- Create ARCHITECTURE.md with technical design
- Create docs/PROJECT_OVERVIEW.md
- Create docs/IMPLEMENTATION_GUIDE.md
1.3 Python Module Init Files ✅
- backend/nvidia_oc/__init__.py
- backend/nvidia_oc/core/__init__.py
- backend/nvidia_oc/api/__init__.py
- backend/nvidia_oc/api/routes/__init__.py
- backend/nvidia_oc/cli/__init__.py
- backend/nvidia_oc/cli/commands/__init__.py
- backend/nvidia_oc/config/__init__.py
- backend/nvidia_oc/utils/__init__.py
1.4 Default Profile Configs ✅
- configs/default.yaml - Factory reset profile
- configs/quiet.yaml - Stock clocks, low fan (30-80%)
- configs/balanced.yaml - +100/+500 MHz, moderate fan (50-100%)
- configs/performance.yaml - +150/+700 MHz, aggressive fan (70-100%)
1.5 Git Repository ✅
- Initialize git repo
- Add .gitignore
- Let auto-commit service handle first commit
Phase 2: Python Core Library 🔄 IN PROGRESS
2.1 GPU Management (core/gpu.py)
- Implement GPUDevice dataclass
  - Fields: index, name, uuid, handle
  - __repr__ method for debugging
- Implement GPUManager class
  - __init__() - Initialize NVML
  - list_devices() - Enumerate all GPUs
  - get_device(index) - Get specific GPU by index
  - _refresh_devices() - Query NVML for devices
  - __del__() - Shutdown NVML
  - Error handling for missing NVIDIA driver
  - Error handling for NVML initialization failures
2.2 Clock Control (core/clock.py)
- Implement ClockInfo dataclass
  - Fields: core, memory, shader
- Implement ClockController class
  - get_clocks(device) - Read current clock speeds
  - set_clock_offset(device, core, memory) - Apply offsets via nvidia-settings
  - reset_clocks(device) - Reset to defaults
  - check_coolbits() - Verify Coolbits enabled
  - Error handling for nvidia-settings subprocess
  - Validation before applying offsets
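
A minimal sketch of how set_clock_offset might shell out to nvidia-settings, with the subprocess error handling listed above. The helper names build_offset_command and apply_offsets are hypothetical, and performance level 3 is an assumption: the level index that accepts offsets differs between GPU generations and driver versions.

```python
import subprocess
from typing import List

# Assumption: performance level 3 is the level that accepts offsets on the
# target GPU; the index differs between GPU generations and drivers.
DEFAULT_PERF_LEVEL = 3


def build_offset_command(gpu_index: int, core: int, memory: int,
                         perf_level: int = DEFAULT_PERF_LEVEL) -> List[str]:
    """Return the nvidia-settings argv that assigns both clock offsets."""
    target = f"[gpu:{gpu_index}]"
    return [
        "nvidia-settings",
        "-a", f"{target}/GPUGraphicsClockOffset[{perf_level}]={core}",
        "-a", f"{target}/GPUMemoryTransferRateOffset[{perf_level}]={memory}",
    ]


def apply_offsets(gpu_index: int, core: int, memory: int) -> None:
    """Run nvidia-settings and turn subprocess failures into RuntimeError."""
    command = build_offset_command(gpu_index, core, memory)
    try:
        subprocess.run(command, check=True, capture_output=True, text=True)
    except FileNotFoundError as exc:
        raise RuntimeError("nvidia-settings is not installed or not on PATH") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"nvidia-settings failed: {exc.stderr.strip()}") from exc
```

Keeping command construction separate from execution lets tests/test_clock.py check the generated arguments without mocking a subprocess.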
2.3 Fan Control (core/fan.py)
- Define FanCurve type alias: List[Tuple[int, int]]
- Implement FanController class
  - get_fan_speed(device) - Read current fan %
  - set_fan_speed(device, speed) - Manual fan control
  - apply_curve(device, curve) - Start curve monitoring task
  - _curve_monitor(device, curve) - Background task (async)
  - _interpolate_curve(temp, curve) - Linear interpolation
  - enable_auto(device) - Re-enable automatic fan
  - Error handling for NVML fan API calls
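
The linear interpolation step could look roughly like the sketch below. It assumes each FanCurve point is a (temperature in °C, fan speed in %) pair and that temperatures outside the curve clamp to the nearest endpoint; neither detail is fixed by this TODO, and the free function interpolate_curve stands in for the planned _interpolate_curve method.

```python
from typing import List, Tuple

# Assumption: each point is (temperature in °C, fan speed in %).
FanCurve = List[Tuple[int, int]]


def interpolate_curve(temp: float, curve: FanCurve) -> int:
    """Return the fan speed for temp by linear interpolation over the curve."""
    points = sorted(curve)
    if not points:
        raise ValueError("fan curve must contain at least one point")
    # Clamp to the endpoints outside the defined temperature range.
    if temp <= points[0][0]:
        return points[0][1]
    if temp >= points[-1][0]:
        return points[-1][1]
    # Find the segment that contains temp and interpolate within it.
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= temp <= t1:
            fraction = (temp - t0) / (t1 - t0)
            return round(s0 + fraction * (s1 - s0))
    raise AssertionError("unreachable: temp lies inside the curve range")
```

With the curve [(40, 30), (60, 50), (80, 100)], a reading of 50°C falls halfway between the first two points and maps to 40% fan speed.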
2.4 Telemetry Collection (core/telemetry.py)
- Implement GPUMetrics dataclass
  - Fields: timestamp, temperature, fan_speed, power_draw, core_clock, memory_clock, utilization, memory_used, memory_total
- Implement TelemetryCollector class
  - collect(device) - One-time metrics snapshot
  - stream(device, interval) - Async generator for single GPU
  - stream_all(devices, interval) - Async generator for all GPUs
  - Error handling for NVML metric queries
  - Unit conversion (mW → W, bytes → MB)
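
The unit conversions are small enough to sketch directly. NVML reports power draw in milliwatts and memory in bytes; the helper names below are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since this TODO does not say whether decimal or binary megabytes are intended.

```python
# Assumption: "MB" means binary megabytes (1024 * 1024 bytes).
BYTES_PER_MB = 1024 * 1024


def milliwatts_to_watts(milliwatts: int) -> float:
    """Convert an NVML power reading from milliwatts to watts."""
    return milliwatts / 1000.0


def bytes_to_mb(num_bytes: int) -> float:
    """Convert an NVML memory reading from bytes to megabytes."""
    return num_bytes / BYTES_PER_MB
```

For a 24 GiB card, bytes_to_mb(24 * 1024**3) gives 24576.0.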
2.5 Profile Management (core/profile.py)
- Implement ProfileConfig Pydantic model
  - Fields: name, description, core_offset, memory_offset, power_limit, fan_curve
  - Validators for offset ranges
  - Validators for fan curve format
- Implement ProfileManager class
  - load(path) - Load profile from YAML
  - save(profile, path) - Save profile to YAML
  - apply(device, profile) - Apply profile to GPU
  - capture(device) - Capture current settings
  - list_profiles(directory) - List available profiles
  - Error handling for YAML parsing
  - Profile validation with Pydantic
2.6 Configuration & Utilities
- Implement config/schema.py
  - OCConfig Pydantic model for app settings
- Implement config/defaults.py
  - DEFAULT_FAN_CURVE constant
  - DEFAULT_PROFILES dict
- Implement utils/output.py
  - console - Rich Console instance
  - print_table() - Format GPU status as table
  - print_gpu_status() - Single GPU status formatter
  - Color-coded temperature output (green/yellow/red)
- Implement utils/validation.py
  - validate_clock_offset(offset, domain) - Check ±200/±1000 limits
  - validate_fan_speed(speed) - Check 0-100 range
  - validate_temperature_threshold(temp) - Check reasonable range
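
A possible shape for the two range checks, assuming the ±200 limit applies to the core domain and ±1000 to the memory domain (matching the slider ranges planned for the web UI). The domain strings "core" and "memory" and the choice to raise ValueError are assumptions.

```python
# Assumption: domain names and limits in MHz; ±200 for core, ±1000 for memory.
CLOCK_OFFSET_LIMITS = {"core": 200, "memory": 1000}


def validate_clock_offset(offset: int, domain: str) -> int:
    """Return offset unchanged if it is within the limit for domain."""
    if domain not in CLOCK_OFFSET_LIMITS:
        raise ValueError(f"unknown clock domain: {domain!r}")
    limit = CLOCK_OFFSET_LIMITS[domain]
    if not -limit <= offset <= limit:
        raise ValueError(
            f"{domain} offset {offset} MHz is outside the allowed range ±{limit} MHz"
        )
    return offset


def validate_fan_speed(speed: int) -> int:
    """Return speed unchanged if it is a percentage between 0 and 100."""
    if not 0 <= speed <= 100:
        raise ValueError(f"fan speed {speed}% is outside the allowed range 0-100%")
    return speed
```

Returning the validated value lets callers write validate_clock_offset(core, "core") inline before passing the offset on to ClockController.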
2.7 Unit Tests
- tests/test_gpu.py
  - Mock NVML calls
  - Test device enumeration
  - Test error handling
- tests/test_clock.py
  - Mock nvidia-settings subprocess
  - Test validation logic
  - Test Coolbits check
- tests/test_fan.py
  - Mock NVML fan API
  - Test curve interpolation algorithm
  - Test background task lifecycle
- tests/test_telemetry.py
  - Mock NVML metric queries
  - Test async streaming
  - Test unit conversions
- tests/test_profile.py
  - Test YAML loading/saving
  - Test Pydantic validation
  - Test profile application
Phase 3: CLI Tool 📋 PENDING
3.1 CLI Entry Point (cli/main.py)
- Create Click group cli()
- Add version option
- Add global options (--verbose, --gpu-id)
- Register all commands
- Error handling with Rich formatting
3.2 Status Command (cli/commands/status.py)
- Implement status command
  - --watch flag for live monitoring
  - Display GPU info, temp, fan, clocks, utilization
  - Use Rich Table for formatting
  - Color-coded temperatures
  - Handle multiple GPUs side-by-side
  - Update every 1 second in watch mode
3.3 Clock Commands (cli/commands/set_clock.py)
- Implement set-clock command
  - --gpu <id> option (required)
  - --core <offset> option (required)
  - --memory <offset> option (required)
  - --reset flag to reset to defaults
  - Confirmation prompt for large offsets
  - Progress spinner while applying
  - Success/error Rich output
3.4 Fan Commands (cli/commands/set_fan.py)
- Implement set-fan command
  - --gpu <id> option (required)
  - --speed <percent> option for manual control
  - --auto flag to re-enable automatic
  - --curve <file> option to apply custom curve
  - Validation for speed range
  - Progress spinner while applying
  - Success/error Rich output
3.5 Profile Commands (cli/commands/profile.py)
- Implement profile list subcommand
  - Show system profiles (configs/)
  - Show user profiles (~/.config/nvidia-oc/profiles/)
  - Display as Rich table with descriptions
- Implement profile apply <name> subcommand
  - Load profile from YAML
  - Apply to specified GPU (or all)
  - Show progress for each step
  - Confirm successful application
- Implement profile save <name> subcommand
  - Capture current GPU settings
  - Save to user profile directory
  - Confirm file written
- Implement profile delete <name> subcommand
  - Confirmation prompt
  - Delete user profile file
  - Prevent deletion of system profiles
3.6 Daemon Command (cli/commands/daemon.py) [Optional]
- Implement daemon command
  - Start background monitoring service
  - Apply fan curves continuously
  - Thermal protection monitoring
  - Write PID file
  - Graceful shutdown on SIGTERM
3.7 CLI Testing
- Test all commands with mock data
- Test error handling (missing GPU, permission denied)
- Test Rich output rendering
- Test watch mode interrupt (Ctrl+C)
Phase 4: FastAPI Backend 🌐 PENDING
4.1 Main App (api/main.py)
- Create FastAPI app instance
- Add CORS middleware
- Add startup event handler
  - Initialize GPUManager
  - Initialize TelemetryCollector
- Add shutdown event handler
  - Cleanup NVML
- Serve static frontend files at /
- Add exception handlers
4.2 GPU Routes (api/routes/gpu.py)
- GET /api/gpus - List all GPUs
  - Return list of GPU metadata
- GET /api/gpus/{gpu_id} - Get single GPU info
  - Return detailed GPU info
- GET /api/gpus/{gpu_id}/status - Get current metrics
  - Return GPUMetrics
- POST /api/gpus/{gpu_id}/clock - Set clock offsets
  - Request body: {core: int, memory: int}
  - Validate offsets
  - Apply via ClockController
  - Return success/error
- POST /api/gpus/{gpu_id}/fan - Set fan speed
  - Request body: {speed: int} or {auto: true}
  - Apply via FanController
  - Return success/error
4.3 Profile Routes (api/routes/profile.py)
- GET /api/profiles - List profiles
  - Return system + user profiles
- GET /api/profiles/{name} - Get profile details
  - Return ProfileConfig
- POST /api/profiles/{name}/apply - Apply profile
  - Request body: {gpu_id?: int} (optional, default all)
  - Apply via ProfileManager
  - Return success/error
- POST /api/profiles - Create new profile
  - Request body: ProfileConfig
  - Save to user directory
  - Return success/error
- DELETE /api/profiles/{name} - Delete profile
  - Only allow user profiles
  - Return success/error
4.4 WebSocket Telemetry (api/routes/telemetry.py)
- WS /ws/telemetry - Stream live telemetry
  - Accept WebSocket connection
  - Stream GPUMetrics at 1Hz (configurable)
  - Send JSON: {timestamp, gpus: [...]}
  - Handle client disconnection gracefully
  - Error handling for NVML failures
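
One way to assemble the {timestamp, gpus: [...]} frame before sending it over the WebSocket is sketched below. The function name build_telemetry_message is hypothetical, the timestamp is assumed to be Unix epoch seconds, and each entry in gpus is assumed to be a plain dict derived from GPUMetrics.

```python
import json
import time
from typing import Any, Dict, List, Optional


def build_telemetry_message(gpus: List[Dict[str, Any]],
                            timestamp: Optional[float] = None) -> str:
    """Serialize one telemetry frame as {"timestamp": ..., "gpus": [...]}."""
    if timestamp is None:
        timestamp = time.time()  # assumption: Unix epoch seconds
    return json.dumps({"timestamp": timestamp, "gpus": gpus})
```

Accepting an explicit timestamp keeps the function deterministic in tests, while the streaming loop can omit it and use the current time.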
4.5 API Models (api/models.py)
- ClockRequest - Request body for clock updates
- FanRequest - Request body for fan updates
- ProfileApplyRequest - Request body for profile application
- GPUResponse - Response schema for GPU info
- MetricsResponse - Response schema for telemetry
- ErrorResponse - Standard error response
4.6 Integration Tests
- Test all REST endpoints with TestClient
- Test WebSocket connection and streaming
- Test error responses (404, 400, 500)
- Test CORS headers
- Test concurrent requests
Phase 5: React Frontend 🎨 PENDING
5.1 Project Setup
- Create frontend/vite.config.ts
  - Configure React plugin
  - Configure proxy to backend (port 8000)
  - Configure build output directory
- Create frontend/tsconfig.json
  - Extend @lilith/typescript-config-react
  - Configure path aliases
- Create frontend/index.html
  - Basic HTML shell
  - Import main.tsx
5.2 Main App (frontend/src/App.tsx)
- Setup ThemeProvider with cyberpunk adapter
- Setup ToastProvider for notifications
- Setup Navigation component
- Create main layout with Container + Grid
- Implement error boundary
- Implement loading states
5.3 Custom Hooks
- hooks/useWebSocket.ts
  - Establish WebSocket connection
  - Parse incoming telemetry messages
  - Handle connection states (connecting, connected, disconnected, error)
  - Automatic reconnection on disconnect
  - Return: {metrics, connectionState, error}
- hooks/useGPUData.ts
  - Fetch GPU list from /api/gpus
  - Provide mutation functions (updateClock, updateFan)
  - Handle loading/error states
  - Return: {gpus, loading, error, updateClock, updateFan}
- hooks/useProfiles.ts
  - Fetch profile list from /api/profiles
  - Provide apply/save/delete functions
  - Handle loading/error states
  - Return: {profiles, loading, applyProfile, saveProfile, deleteProfile}
5.4 API Client (frontend/src/api/client.ts)
- Create axios instance with base URL
- fetchGPUs() - GET /api/gpus
- fetchGPUStatus(id) - GET /api/gpus/{id}/status
- updateClock(id, core, memory) - POST /api/gpus/{id}/clock
- updateFan(id, speed) - POST /api/gpus/{id}/fan
- fetchProfiles() - GET /api/profiles
- applyProfile(name, gpuId?) - POST /api/profiles/{name}/apply
- Error handling and retries
5.5 Components
- components/GPUCard.tsx
  - Display GPU name, index
  - StatCard for temp, fan, power, clocks
  - Color-coded temperature indicator
  - Utilization bar
  - Memory usage bar
- components/ClockControl.tsx
  - LabeledSlider for core offset (-200 to +200)
  - LabeledSlider for memory offset (-1000 to +1000)
  - Apply button
  - Reset button
  - Current values display
- components/FanControl.tsx
  - LabeledSlider for manual fan speed (0-100)
  - Auto button to re-enable automatic
  - Current fan speed display
  - Fan curve editor (future)
- components/TelemetryChart.tsx
  - LineChart or AreaChart from @ui/ui-charts
  - Configurable metric (temp, power, clock, etc.)
  - Rolling window (last 60 data points)
  - X-axis: time, Y-axis: metric value
- components/ProfileManager.tsx
  - DataTable listing all profiles
  - Apply button per profile
  - Delete button for user profiles
  - Create new profile button
- components/StatusIndicator.tsx
  - SystemHealthIndicator from @ui/ui-admin
  - Show GPU health status
  - Show connection status
  - Show safety threshold warnings
5.6 Types (frontend/src/types/index.ts)
- GPU interface
- GPUMetrics interface
- Profile interface
- ConnectionState enum
- ClockUpdate interface
- FanUpdate interface
5.7 Entry Point (frontend/src/main.tsx)
- Create root with React 19 createRoot
- Wrap App with StrictMode
- Mount to #root element
5.8 Frontend Testing
- Test components with React Testing Library
- Test hooks with renderHook
- Test WebSocket connection mocking
- Test API client error handling
Phase 6: Integration & Testing 🧪 PENDING
6.1 End-to-End Integration
- Install Python package in editable mode
- Install frontend dependencies with pnpm
- Start backend dev server (uvicorn --reload)
- Start frontend dev server (vite)
- Test full flow: CLI → API → Web UI
- Verify telemetry streaming works
- Verify profile switching works
- Test on actual NVIDIA GPU hardware
6.2 Systemd Services
- Create systemd/nvidia-oc.service
  - ExecStart: uvicorn serving on 0.0.0.0:8000
  - User: root (required for GPU control)
  - Restart: always
  - After: network.target, nvidia-persistenced.service
- Create systemd/nvidia-oc-daemon.service (optional)
  - Background fan curve monitoring
  - Thermal protection
- Test service installation
  - systemctl enable nvidia-oc
  - systemctl start nvidia-oc
  - Verify web UI accessible
  - Verify service survives reboot
6.3 Stability Testing
- Conservative OC Test (4 hours)
  - Apply balanced profile (+100/+500)
  - Run ML training workload
  - Monitor for crashes, CUDA errors
  - Monitor temps stay below 80°C
- Aggressive OC Test (8 hours)
  - Apply performance profile (+150/+700)
  - Run stress test (FurMark or similar)
  - Monitor for instability
  - Monitor temps stay below 75°C
- 24-Hour Burn-In
  - Apply final stable profile
  - Run continuous workload
  - Monitor metrics every hour
  - Verify 0 crashes, 0 errors
  - Document final stable settings
6.4 Multi-GPU Testing
- Test with 2x RTX 3090 setup
- Verify independent control of each GPU
- Test applying different profiles to different GPUs
- Test WebSocket streams both GPUs correctly
- Test CLI status displays both GPUs
6.5 Documentation Finalization
- Update README with installation instructions
- Add API documentation (OpenAPI/Swagger)
- Add troubleshooting section
- Add performance tuning guide
- Add safety warnings
- Record demo video/GIF
Phase 7: Production Deployment 🚀 PENDING
7.1 Bluefin LTS Deployment (Primary Workstation)
- Enable Coolbits: sudo nvidia-xconfig -a --cool-bits=28
- Reboot or restart display-manager
- Install package: pip install -e . or pip install lilith-nvidia-oc
- Copy systemd service to /etc/systemd/system/
- Enable and start service
- Test web UI access from browser
- Test CLI commands
- Apply balanced profile
- Monitor for 24 hours
7.2 Ubuntu Headless Server Deployment
- Install NVIDIA drivers
- Enable Coolbits (may need virtual X)
- Install package via pip
- Configure systemd service
- Test remote access from workstation
- Test WebSocket streaming
- Apply performance profile
- Monitor for 48 hours
7.3 Performance Validation
- Measure baseline performance (stock clocks, auto fan)
  - Record GPU temps under load
  - Record training iterations/sec
  - Record fan noise level
- Measure with balanced profile
  - Record GPU temps (target: 7 to 10°C below baseline)
  - Record training iterations/sec (target: +5-7% over baseline)
  - Record fan noise level
- Measure with performance profile
  - Record GPU temps (target: 10 to 15°C below baseline)
  - Record training iterations/sec (target: +8-12% over baseline)
  - Record fan noise level
- Document all results in README
7.4 Publishing to Forgejo
- Build Python package: python -m build
- Publish to Forgejo PyPI: twine upload --repository-url ...
- Build frontend: cd frontend && pnpm build
- Tag release: git tag v0.1.0
- Push to Forgejo: git push origin v0.1.0
- Create release notes on Forgejo
7.5 Monitoring & Maintenance
- Set up log monitoring (journalctl -u nvidia-oc -f)
- Monitor for errors or warnings
- Track GPU temps over 7 days
- Verify no performance degradation
- Collect user feedback
- Plan v0.2.0 features
Future Enhancements (Post-MVP)
v0.2.0
- Historical metrics database (SQLite)
- Profile scheduler (auto-switch by time/load)
- Voltage control (advanced Coolbits)
- Email/webhook alerts on thermal events
- Docker containerization
- Prometheus metrics exporter
v0.3.0
- Multi-node cluster support
- Authentication (JWT + HTTPS)
- Power limit curves (dynamic)
- Mobile-responsive UI improvements
- Profile sharing (import/export)
- Fan curve visual editor in web UI
v1.0.0
- Stable API contract
- Packaging for Fedora/Ubuntu/Arch
- Security audit
- Production systemd hardening
- Comprehensive test suite (>90% coverage)
- Localization (i18n)
Blockers & Risks
Active Blockers
- None currently
Known Risks
- Coolbits dependency - Users may not enable Coolbits
  - Mitigation: Clear documentation, graceful fallback (read-only mode)
- nvidia-settings X11 requirement - Clock writes need X server
  - Mitigation: Virtual X on headless servers, document workaround
- Driver compatibility - NVML API may vary between driver versions
  - Mitigation: Test on multiple driver versions (535.x, 545.x, 550.x)
- Hardware variance - Different GPU models may behave differently
  - Mitigation: Test on multiple GPUs (3090, 4090, etc.)
Notes
- Development Environment: Bluefin LTS (Fedora-based), Wayland, 2x RTX 3090
- Target Users: ML engineers, Linux gamers, GPU server admins
- Primary Goal: Provide MSI Afterburner equivalent for Linux
- Success Metric: 24h stable operation at +100/+500 overclock
Last Updated: 2026-01-14
Current Phase: Phase 2 (Core Library Implementation)
Estimated Completion: 2-3 weeks from start