Lilith f46c96544f chore: 🔧 Update files

2026-01-14 12:30:45 -08:00


NVIDIA OC Development TODOs

Project: @infrastructure/nvidia-oc - NVIDIA GPU Overclocking Control Panel
Status: In Development (Phase 2)
Start Date: 2026-01-14
Target Release: v0.1.0 (2-3 weeks)

Phase 1: Package Scaffolding ✅ COMPLETED

1.1 Directory Structure ✅

Create package directory: @infrastructure/nvidia-oc/
Create backend Python structure: backend/nvidia_oc/{core,api,cli,config,utils}/
Create frontend React structure: frontend/src/{components,hooks,api,types}/
Create configs directory: configs/
Create systemd directory: systemd/
Create tests directory: backend/tests/

1.2 Package Metadata ✅

Generate pyproject.toml with all Python dependencies
Generate package.json with @ui workspace dependencies
Create .gitignore for Python + Node
Create README.md with user documentation
Create ARCHITECTURE.md with technical design
Create docs/PROJECT_OVERVIEW.md
Create docs/IMPLEMENTATION_GUIDE.md

1.3 Python Module Init Files ✅

backend/nvidia_oc/__init__.py
backend/nvidia_oc/core/__init__.py
backend/nvidia_oc/api/__init__.py
backend/nvidia_oc/api/routes/__init__.py
backend/nvidia_oc/cli/__init__.py
backend/nvidia_oc/cli/commands/__init__.py
backend/nvidia_oc/config/__init__.py
backend/nvidia_oc/utils/__init__.py

1.4 Default Profile Configs ✅

configs/default.yaml - Factory reset profile
configs/quiet.yaml - Stock clocks, low fan (30-80%)
configs/balanced.yaml - +100/+500 MHz, moderate fan (50-100%)
configs/performance.yaml - +150/+700 MHz, aggressive fan (70-100%)

1.5 Git Repository ✅

Initialize git repo
Add .gitignore
Let auto-commit service handle first commit

Phase 2: Python Core Library 🔄 IN PROGRESS

2.1 GPU Management (core/gpu.py)

Implement GPUDevice dataclass
- Fields: index, name, uuid, handle
- __repr__ method for debugging
Implement GPUManager class
- __init__() - Initialize NVML
- list_devices() - Enumerate all GPUs
- get_device(index) - Get specific GPU by index
- _refresh_devices() - Query NVML for devices
- __del__() - Shutdown NVML
- Error handling for missing NVIDIA driver
- Error handling for NVML initialization failures
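
The GPUManager items above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: it assumes the manager receives an object exposing pynvml-style functions (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, nvmlDeviceGetName, nvmlDeviceGetUUID, nvmlShutdown), which lets tests pass in a fake. GPUError, _text and close() are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List


class GPUError(RuntimeError):
    """Raised when NVML cannot be initialised or a GPU index is unknown."""


def _text(value: Any) -> str:
    # Older pynvml releases return bytes, newer ones return str.
    return value.decode() if isinstance(value, bytes) else str(value)


@dataclass
class GPUDevice:
    index: int
    name: str
    uuid: str
    handle: Any = field(repr=False)  # opaque NVML handle, kept out of repr


class GPUManager:
    def __init__(self, nvml: Any) -> None:
        # `nvml` is assumed to be the pynvml module or a compatible fake.
        self._nvml = nvml
        self._devices: List[GPUDevice] = []
        self._open = False
        try:
            nvml.nvmlInit()
        except Exception as exc:  # e.g. driver or NVML library missing
            raise GPUError(f"NVML initialisation failed: {exc}") from exc
        self._open = True
        self._refresh_devices()

    def _refresh_devices(self) -> None:
        nvml = self._nvml
        found = []
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            found.append(GPUDevice(
                index=i,
                name=_text(nvml.nvmlDeviceGetName(handle)),
                uuid=_text(nvml.nvmlDeviceGetUUID(handle)),
                handle=handle,
            ))
        self._devices = found

    def list_devices(self) -> List[GPUDevice]:
        return list(self._devices)

    def get_device(self, index: int) -> GPUDevice:
        for device in self._devices:
            if device.index == index:
                return device
        raise GPUError(f"no GPU with index {index}")

    def close(self) -> None:
        # Idempotent shutdown; __del__ only acts as a safety net.
        if self._open:
            self._nvml.nvmlShutdown()
            self._open = False

    def __del__(self) -> None:
        self.close()
```

In production code the argument would presumably be the imported pynvml module; an explicit close() is included because relying on __del__ alone makes shutdown timing unpredictable.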

2.2 Clock Control (core/clock.py)

Implement ClockInfo dataclass
- Fields: core, memory, shader
Implement ClockController class
- get_clocks(device) - Read current clock speeds
- set_clock_offset(device, core, memory) - Apply offsets via nvidia-settings
- reset_clocks(device) - Reset to defaults
- check_coolbits() - Verify Coolbits enabled
- Error handling for nvidia-settings subprocess
- Validation before applying offsets
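
One way set_clock_offset could drive nvidia-settings is sketched below. It assumes the GPUGraphicsClockOffset and GPUMemoryTransferRateOffset attributes at performance level 3; the correct level index (or attribute variant) depends on the GPU and driver, so treat that default as an assumption. ClockError, build_offset_command and apply_offsets are hypothetical names, and the injectable run parameter exists only so the subprocess call can be mocked.

```python
import subprocess
from typing import Callable, List


class ClockError(RuntimeError):
    """Raised when nvidia-settings is missing, hangs, or reports failure."""


def build_offset_command(gpu_index: int, core: int, memory: int,
                         perf_level: int = 3) -> List[str]:
    # perf_level=3 is an assumption; the valid index varies by GPU/driver.
    target = f"[gpu:{gpu_index}]"
    return [
        "nvidia-settings",
        "-a", f"{target}/GPUGraphicsClockOffset[{perf_level}]={core}",
        "-a", f"{target}/GPUMemoryTransferRateOffset[{perf_level}]={memory}",
    ]


def apply_offsets(gpu_index: int, core: int, memory: int,
                  run: Callable = subprocess.run) -> None:
    cmd = build_offset_command(gpu_index, core, memory)
    try:
        result = run(cmd, capture_output=True, text=True, timeout=10)
    except FileNotFoundError as exc:
        raise ClockError("nvidia-settings not found on PATH") from exc
    except subprocess.TimeoutExpired as exc:
        raise ClockError("nvidia-settings timed out") from exc
    if result.returncode != 0:
        raise ClockError(f"nvidia-settings failed: {result.stderr.strip()}")
```

Passing the command as a list (rather than a shell string) avoids quoting problems with the bracketed attribute syntax.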

2.3 Fan Control (core/fan.py)

Define FanCurve type alias: List[Tuple[int, int]]
Implement FanController class
- get_fan_speed(device) - Read current fan %
- set_fan_speed(device, speed) - Manual fan control
- apply_curve(device, curve) - Start curve monitoring task
- _curve_monitor(device, curve) - Background task (async)
- _interpolate_curve(temp, curve) - Linear interpolation
- enable_auto(device) - Re-enable automatic fan
- Error handling for NVML fan API calls
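
The linear interpolation step for _interpolate_curve might look like the sketch below. It assumes each curve point is a (temperature in °C, fan speed in %) pair and clamps to the first and last points outside the curve's range; the sample curve in the usage note is hypothetical.

```python
from typing import List, Tuple

FanCurve = List[Tuple[int, int]]  # (temperature °C, fan speed %)


def interpolate_curve(temp: float, curve: FanCurve) -> int:
    """Return the fan speed for `temp`, interpolating between curve points."""
    if not curve:
        raise ValueError("fan curve must contain at least one point")
    points = sorted(curve)
    # Clamp below the first point and above the last one.
    if temp <= points[0][0]:
        return points[0][1]
    if temp >= points[-1][0]:
        return points[-1][1]
    # Find the segment [t0, t1) containing temp and interpolate linearly.
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= temp < t1:
            fraction = (temp - t0) / (t1 - t0)
            return round(s0 + fraction * (s1 - s0))
    raise AssertionError("unreachable: temp lies within the curve range")
```

With the hypothetical curve [(40, 30), (60, 50), (80, 100)], a reading of 50°C falls halfway between the first two points and maps to 40% fan speed.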

2.4 Telemetry Collection (core/telemetry.py)

Implement GPUMetrics dataclass
- Fields: timestamp, temperature, fan_speed, power_draw, core_clock, memory_clock, utilization, memory_used, memory_total
Implement TelemetryCollector class
- collect(device) - One-time metrics snapshot
- stream(device, interval) - Async generator for single GPU
- stream_all(devices, interval) - Async generator for all GPUs
- Error handling for NVML metric queries
- Unit conversion (mW → W, bytes → MB)
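
A sketch of the metrics record, the unit conversions, and the single-GPU stream follows. NVML reports power in milliwatts and memory in bytes, hence the helpers. Assumptions: "MB" is taken to mean 1024 * 1024 bytes, the field units in the comments are guesses, and stream() takes a zero-argument collect callable (a hypothetical simplification of collect(device)).

```python
import asyncio
import time
from dataclasses import dataclass
from typing import AsyncIterator, Callable


@dataclass
class GPUMetrics:
    timestamp: float      # seconds since the epoch
    temperature: int      # °C
    fan_speed: int        # %
    power_draw: float     # W
    core_clock: int       # MHz
    memory_clock: int     # MHz
    utilization: int      # %
    memory_used: float    # MB
    memory_total: float   # MB


def mw_to_watts(milliwatts: int) -> float:
    return milliwatts / 1000.0


def bytes_to_mb(n: int) -> float:
    # Assumption: "MB" here means 1024 * 1024 bytes.
    return n / (1024 * 1024)


async def stream(collect: Callable[[], GPUMetrics],
                 interval: float = 1.0) -> AsyncIterator[GPUMetrics]:
    """Yield one snapshot per interval until the consumer stops iterating."""
    while True:
        yield collect()
        await asyncio.sleep(interval)
```

Consumers iterate with async for and break (or cancel the task) to stop; the generator itself never terminates.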

2.5 Profile Management (core/profile.py)

Implement ProfileConfig Pydantic model
- Fields: name, description, core_offset, memory_offset, power_limit, fan_curve
- Validators for offset ranges
- Validators for fan curve format
Implement ProfileManager class
- load(path) - Load profile from YAML
- save(profile, path) - Save profile to YAML
- apply(device, profile) - Apply profile to GPU
- capture(device) - Capture current settings
- list_profiles(directory) - List available profiles
- Error handling for YAML parsing
- Profile validation with Pydantic
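
The validation rules for ProfileConfig can be illustrated without third-party dependencies. In the sketch below a stdlib dataclass stands in for the planned Pydantic model, so it shows the checks rather than the Pydantic API. The ±200/±1000 limits come from the validation items in section 2.6; treating power_limit as optional and requiring strictly increasing curve temperatures are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProfileConfig:
    name: str
    description: str = ""
    core_offset: int = 0          # MHz
    memory_offset: int = 0        # MHz
    power_limit: Optional[int] = None  # assumption: None leaves it unchanged
    fan_curve: List[Tuple[int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("profile name must not be empty")
        if not -200 <= self.core_offset <= 200:
            raise ValueError("core_offset must be within ±200 MHz")
        if not -1000 <= self.memory_offset <= 1000:
            raise ValueError("memory_offset must be within ±1000 MHz")
        # YAML loaders produce lists, so normalise each point to a tuple.
        self.fan_curve = [tuple(point) for point in self.fan_curve]
        if any(len(point) != 2 for point in self.fan_curve):
            raise ValueError("fan_curve points must be (temp, speed) pairs")
        temps = [temp for temp, _ in self.fan_curve]
        if temps != sorted(set(temps)):
            raise ValueError("fan_curve temperatures must strictly increase")
        if any(not 0 <= speed <= 100 for _, speed in self.fan_curve):
            raise ValueError("fan_curve speeds must be within 0-100%")
```

Constructing ProfileConfig(name="balanced", core_offset=100, memory_offset=500) succeeds, while an offset of 250 raises ValueError before anything touches the GPU.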

2.6 Configuration & Utilities

Implement config/schema.py
- OCConfig Pydantic model for app settings
Implement config/defaults.py
- DEFAULT_FAN_CURVE constant
- DEFAULT_PROFILES dict
Implement utils/output.py
- console - Rich Console instance
- print_table() - Format GPU status as table
- print_gpu_status() - Single GPU status formatter
- Color-coded temperature output (green/yellow/red)
Implement utils/validation.py
- validate_clock_offset(offset, domain) - Check limits of ±200 MHz (core) and ±1000 MHz (memory)
- validate_fan_speed(speed) - Check 0-100 range
- validate_temperature_threshold(temp) - Check reasonable range
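
The two validators with stated limits could be as small as the sketch below. The "core" and "memory" domain names and the CLOCK_LIMITS table are assumptions; validate_temperature_threshold is left out because the TODO does not say what range counts as reasonable.

```python
from typing import Dict

# Maximum absolute offset per clock domain, in MHz.
CLOCK_LIMITS: Dict[str, int] = {"core": 200, "memory": 1000}


def validate_clock_offset(offset: int, domain: str) -> int:
    """Return the offset unchanged, or raise if it is out of range."""
    if domain not in CLOCK_LIMITS:
        raise ValueError(f"unknown clock domain: {domain!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(offset, bool) or not isinstance(offset, int):
        raise TypeError("offset must be an integer number of MHz")
    limit = CLOCK_LIMITS[domain]
    if abs(offset) > limit:
        raise ValueError(f"{domain} offset {offset} exceeds ±{limit} MHz")
    return offset


def validate_fan_speed(speed: int) -> int:
    """Return the speed unchanged, or raise if it is outside 0-100%."""
    if isinstance(speed, bool) or not isinstance(speed, int):
        raise TypeError("fan speed must be an integer percentage")
    if not 0 <= speed <= 100:
        raise ValueError(f"fan speed {speed} is outside 0-100%")
    return speed
```

Returning the validated value lets callers write core = validate_clock_offset(core, "core") inline.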

2.7 Unit Tests

tests/test_gpu.py
- Mock NVML calls
- Test device enumeration
- Test error handling
tests/test_clock.py
- Mock nvidia-settings subprocess
- Test validation logic
- Test Coolbits check
tests/test_fan.py
- Mock NVML fan API
- Test curve interpolation algorithm
- Test background task lifecycle
tests/test_telemetry.py
- Mock NVML metric queries
- Test async streaming
- Test unit conversions
tests/test_profile.py
- Test YAML loading/saving
- Test Pydantic validation
- Test profile application
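
As an illustration of the "mock nvidia-settings subprocess" item, the sketch below patches subprocess.run with the standard library's unittest.mock. The reset_clocks function is a hypothetical stand-in for the real code under test, the attribute names and performance level index are assumptions, and the project may well prefer pytest over unittest.

```python
import subprocess
import unittest
from unittest import mock


def reset_clocks(gpu_index: int) -> bool:
    """Hypothetical function under test: zero both offsets."""
    target = f"[gpu:{gpu_index}]"
    cmd = [
        "nvidia-settings",
        "-a", f"{target}/GPUGraphicsClockOffset[3]=0",
        "-a", f"{target}/GPUMemoryTransferRateOffset[3]=0",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).returncode == 0


class ResetClocksTest(unittest.TestCase):
    @mock.patch("subprocess.run")
    def test_success_builds_expected_command(self, run):
        run.return_value = mock.Mock(returncode=0)
        self.assertTrue(reset_clocks(1))
        cmd = run.call_args[0][0]  # first positional argument
        self.assertEqual(cmd[0], "nvidia-settings")
        self.assertIn("[gpu:1]/GPUGraphicsClockOffset[3]=0", cmd)

    @mock.patch("subprocess.run")
    def test_nonzero_exit_reports_failure(self, run):
        run.return_value = mock.Mock(returncode=1)
        self.assertFalse(reset_clocks(0))
```

Because the mock replaces subprocess.run entirely, these tests run on machines without an NVIDIA driver or X server.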

Phase 3: CLI Tool 📋 PENDING

3.1 CLI Entry Point (cli/main.py)

Create Click group cli()
Add version option
Add global options (--verbose, --gpu-id)
Register all commands
Error handling with Rich formatting

3.2 Status Command (cli/commands/status.py)

Implement status command
- --watch flag for live monitoring
- Display GPU info, temp, fan, clocks, utilization
- Use Rich Table for formatting
- Color-coded temperatures
- Handle multiple GPUs side-by-side
- Update every 1 second in watch mode

3.3 Clock Commands (cli/commands/set_clock.py)

Implement set-clock command
- --gpu <id> option (required)
- --core <offset> option (required unless --reset is given)
- --memory <offset> option (required unless --reset is given)
- --reset flag to reset to defaults
- Confirmation prompt for large offsets
- Progress spinner while applying
- Success/error Rich output

3.4 Fan Commands (cli/commands/set_fan.py)

Implement set-fan command
- --gpu <id> option (required)
- --speed <percent> option for manual control
- --auto flag to re-enable automatic
- --curve <file> option to apply custom curve
- Validation for speed range
- Progress spinner while applying
- Success/error Rich output

3.5 Profile Commands (cli/commands/profile.py)

Implement profile list subcommand
- Show system profiles (configs/)
- Show user profiles (~/.config/nvidia-oc/profiles/)
- Display as Rich table with descriptions
Implement profile apply <name> subcommand
- Load profile from YAML
- Apply to specified GPU (or all)
- Show progress for each step
- Confirm successful application
Implement profile save <name> subcommand
- Capture current GPU settings
- Save to user profile directory
- Confirm file written
Implement profile delete <name> subcommand
- Confirmation prompt
- Delete user profile file
- Prevent deletion of system profiles
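
The delete rules above (user profiles only, system profiles protected) might be enforced along these lines. This is a sketch under assumptions: profiles are stored as <name>.yaml, and ProfileError and delete_profile are hypothetical names; the confirmation prompt is left to the CLI layer.

```python
from pathlib import Path


class ProfileError(Exception):
    """Raised when a profile cannot be deleted."""


def delete_profile(name: str, system_dir: Path, user_dir: Path) -> Path:
    """Delete a user profile and return its path; never touch system_dir."""
    # Reject empty names and anything containing path separators.
    if not name or Path(name).name != name:
        raise ProfileError(f"invalid profile name: {name!r}")
    user_path = user_dir / f"{name}.yaml"
    if user_path.is_file():
        user_path.unlink()
        return user_path
    if (system_dir / f"{name}.yaml").is_file():
        raise ProfileError(f"'{name}' is a system profile and cannot be deleted")
    raise ProfileError(f"profile '{name}' not found")
```

Checking the user directory first means a user profile that shadows a system profile of the same name can still be removed.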

3.6 Daemon Command (cli/commands/daemon.py) [Optional]

Implement daemon command
- Start background monitoring service
- Apply fan curves continuously
- Thermal protection monitoring
- Write PID file
- Graceful shutdown on SIGTERM
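
The PID file and SIGTERM handling could be sketched as follows. run_daemon and its tick callback are hypothetical: tick stands for one round of fan curve and thermal checks. The sketch assumes it runs in the main thread, since Python only allows signal handlers to be installed there.

```python
import os
import signal
import threading
from pathlib import Path
from typing import Callable


def run_daemon(pid_file: Path, tick: Callable[[], None],
               interval: float = 1.0) -> None:
    """Call tick() every `interval` seconds until SIGTERM arrives."""
    stop = threading.Event()

    def _on_sigterm(signum, frame):
        # Only set a flag; the loop below exits at its next check.
        stop.set()

    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    pid_file.write_text(f"{os.getpid()}\n")
    try:
        while not stop.is_set():
            tick()
            stop.wait(interval)
    finally:
        # Always remove the PID file and restore the old handler.
        pid_file.unlink(missing_ok=True)
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Keeping the handler to a single Event.set() call avoids doing real work inside a signal handler; cleanup happens in the finally block on the normal code path.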

3.7 CLI Testing

Test all commands with mock data
Test error handling (missing GPU, permission denied)
Test Rich output rendering
Test watch mode interrupt (Ctrl+C)

Phase 4: FastAPI Backend 🌐 PENDING

4.1 Main App (api/main.py)

Create FastAPI app instance
Add CORS middleware
Add startup event handler
- Initialize GPUManager
- Initialize TelemetryCollector
Add shutdown event handler
- Cleanup NVML
Serve static frontend files at /
Add exception handlers

4.2 GPU Routes (api/routes/gpu.py)

GET /api/gpus - List all GPUs
- Return list of GPU metadata
GET /api/gpus/{gpu_id} - Get single GPU info
- Return detailed GPU info
GET /api/gpus/{gpu_id}/status - Get current metrics
- Return GPUMetrics
POST /api/gpus/{gpu_id}/clock - Set clock offsets
- Request body: {core: int, memory: int}
- Validate offsets
- Apply via ClockController
- Return success/error
POST /api/gpus/{gpu_id}/fan - Set fan speed
- Request body: {speed: int} or {auto: true}
- Apply via FanController
- Return success/error

4.3 Profile Routes (api/routes/profile.py)

GET /api/profiles - List profiles
- Return system + user profiles
GET /api/profiles/{name} - Get profile details
- Return ProfileConfig
POST /api/profiles/{name}/apply - Apply profile
- Request body: {gpu_id?: int} (optional, default all)
- Apply via ProfileManager
- Return success/error
POST /api/profiles - Create new profile
- Request body: ProfileConfig
- Save to user directory
- Return success/error
DELETE /api/profiles/{name} - Delete profile
- Only allow user profiles
- Return success/error

4.4 WebSocket Telemetry (api/routes/telemetry.py)

WS /ws/telemetry - Stream live telemetry
- Accept WebSocket connection
- Stream GPUMetrics at 1Hz (configurable)
- Send JSON: {timestamp, gpus: [...]}
- Handle client disconnection gracefully
- Error handling for NVML failures
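
Independent of the web framework, the streaming loop and the {timestamp, gpus: [...]} payload could look like the sketch below. The send and collect_all callables are hypothetical injection points, the per-GPU dictionary keys are not specified by this TODO, and ConnectionError stands in for whatever disconnect exception the framework raises.

```python
import asyncio
import json
import time
from typing import Any, Awaitable, Callable, Dict, List, Optional


def build_message(gpus: List[Dict[str, Any]],
                  timestamp: Optional[float] = None) -> str:
    """Serialise one telemetry frame as JSON text."""
    stamp = time.time() if timestamp is None else timestamp
    return json.dumps({"timestamp": stamp, "gpus": gpus})


async def stream_telemetry(
    send: Callable[[str], Awaitable[None]],
    collect_all: Callable[[], List[Dict[str, Any]]],
    interval: float = 1.0,  # 1 Hz by default
    max_messages: Optional[int] = None,
) -> int:
    """Send frames until the client disconnects; return how many were sent."""
    sent = 0
    while max_messages is None or sent < max_messages:
        try:
            await send(build_message(collect_all()))
        except ConnectionError:
            break  # client went away; stop quietly
        sent += 1
        await asyncio.sleep(interval)
    return sent
```

In a FastAPI route the send argument would presumably wrap the WebSocket's text-sending method, with the except clause adjusted to the framework's disconnect exception.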

4.5 API Models (api/models.py)

ClockRequest - Request body for clock updates
FanRequest - Request body for fan updates
ProfileApplyRequest - Request body for profile application
GPUResponse - Response schema for GPU info
MetricsResponse - Response schema for telemetry
ErrorResponse - Standard error response

4.6 Integration Tests

Test all REST endpoints with TestClient
Test WebSocket connection and streaming
Test error responses (404, 400, 500)
Test CORS headers
Test concurrent requests

Phase 5: React Frontend 🎨 PENDING

5.1 Project Setup

Create frontend/vite.config.ts
- Configure React plugin
- Configure proxy to backend (port 8000)
- Configure build output directory
Create frontend/tsconfig.json
- Extend @lilith/typescript-config-react
- Configure path aliases
Create frontend/index.html
- Basic HTML shell
- Import main.tsx

5.2 Main App (frontend/src/App.tsx)

Setup ThemeProvider with cyberpunk adapter
Setup ToastProvider for notifications
Setup Navigation component
Create main layout with Container + Grid
Implement error boundary
Implement loading states

5.3 Custom Hooks

hooks/useWebSocket.ts
- Establish WebSocket connection
- Parse incoming telemetry messages
- Handle connection states (connecting, connected, disconnected, error)
- Automatic reconnection on disconnect
- Return: {metrics, connectionState, error}
hooks/useGPUData.ts
- Fetch GPU list from /api/gpus
- Provide mutation functions (updateClock, updateFan)
- Handle loading/error states
- Return: {gpus, loading, error, updateClock, updateFan}
hooks/useProfiles.ts
- Fetch profile list from /api/profiles
- Provide apply/save/delete functions
- Handle loading/error states
- Return: {profiles, loading, applyProfile, saveProfile, deleteProfile}

5.4 API Client (frontend/src/api/client.ts)

Create axios instance with base URL
fetchGPUs() - GET /api/gpus
fetchGPUStatus(id) - GET /api/gpus/{id}/status
updateClock(id, core, memory) - POST /api/gpus/{id}/clock
updateFan(id, speed) - POST /api/gpus/{id}/fan
fetchProfiles() - GET /api/profiles
applyProfile(name, gpuId?) - POST /api/profiles/{name}/apply
Error handling and retries

5.5 Components

components/GPUCard.tsx
- Display GPU name, index
- StatCard for temp, fan, power, clocks
- Color-coded temperature indicator
- Utilization bar
- Memory usage bar
components/ClockControl.tsx
- LabeledSlider for core offset (-200 to +200)
- LabeledSlider for memory offset (-1000 to +1000)
- Apply button
- Reset button
- Current values display
components/FanControl.tsx
- LabeledSlider for manual fan speed (0-100)
- Auto button to re-enable automatic
- Current fan speed display
- Fan curve editor (future)
components/TelemetryChart.tsx
- LineChart or AreaChart from @ui/ui-charts
- Configurable metric (temp, power, clock, etc.)
- Rolling window (last 60 data points)
- X-axis: time, Y-axis: metric value
components/ProfileManager.tsx
- DataTable listing all profiles
- Apply button per profile
- Delete button for user profiles
- Create new profile button
components/StatusIndicator.tsx
- SystemHealthIndicator from @ui/ui-admin
- Show GPU health status
- Show connection status
- Show safety threshold warnings

5.6 Types (frontend/src/types/index.ts)

GPU interface
GPUMetrics interface
Profile interface
ConnectionState enum
ClockUpdate interface
FanUpdate interface

5.7 Entry Point (frontend/src/main.tsx)

Create root with React 19 createRoot
Wrap App with StrictMode
Mount to #root element

5.8 Frontend Testing

Test components with React Testing Library
Test hooks with renderHook
Test WebSocket connection mocking
Test API client error handling

Phase 6: Integration & Testing 🧪 PENDING

6.1 End-to-End Integration

Install Python package in editable mode
Install frontend dependencies with pnpm
Start backend dev server (uvicorn --reload)
Start frontend dev server (vite)
Test full flow: CLI → API → Web UI
Verify telemetry streaming works
Verify profile switching works
Test on actual NVIDIA GPU hardware

6.2 Systemd Services

Create systemd/nvidia-oc.service
- ExecStart: uvicorn serving on 0.0.0.0:8000
- User: root (required for GPU control)
- Restart: always
- After: network.target, nvidia-persistenced.service
Create systemd/nvidia-oc-daemon.service (optional)
- Background fan curve monitoring
- Thermal protection
Test service installation
- systemctl enable nvidia-oc
- systemctl start nvidia-oc
- Verify web UI accessible
- Verify service survives reboot

6.3 Stability Testing

Conservative OC Test (4 hours)
- Apply balanced profile (+100/+500)
- Run ML training workload
- Monitor for crashes, CUDA errors
- Monitor temps stay below 80°C
Aggressive OC Test (8 hours)
- Apply performance profile (+150/+700)
- Run stress test (FurMark or similar)
- Monitor for instability
- Monitor temps stay below 75°C
24-Hour Burn-In
- Apply final stable profile
- Run continuous workload
- Monitor metrics every hour
- Verify 0 crashes, 0 errors
- Document final stable settings

6.4 Multi-GPU Testing

Test with 2x RTX 3090 setup
Verify independent control of each GPU
Test applying different profiles to different GPUs
Test WebSocket streams both GPUs correctly
Test CLI status displays both GPUs

6.5 Documentation Finalization

Update README with installation instructions
Add API documentation (OpenAPI/Swagger)
Add troubleshooting section
Add performance tuning guide
Add safety warnings
Record demo video/GIF

Phase 7: Production Deployment 🚀 PENDING

7.1 Bluefin LTS Deployment (Primary Workstation)

Enable Coolbits: sudo nvidia-xconfig -a --cool-bits=28
Reboot or restart display-manager
Install package: pip install -e . or pip install lilith-nvidia-oc
Copy systemd service to /etc/systemd/system/
Enable and start service
Test web UI access from browser
Test CLI commands
Apply balanced profile
Monitor for 24 hours

7.2 Ubuntu Headless Server Deployment

Install NVIDIA drivers
Enable Coolbits (may need virtual X)
Install package via pip
Configure systemd service
Test remote access from workstation
Test WebSocket streaming
Apply performance profile
Monitor for 48 hours

7.3 Performance Validation

Measure baseline performance (stock clocks, auto fan)
- Record GPU temps under load
- Record training iterations/sec
- Record fan noise level
Measure with balanced profile
- Record GPU temps (target: 7 to 10°C below baseline)
- Record training iterations/sec (target: 5-7% above baseline)
- Record fan noise level
Measure with performance profile
- Record GPU temps (target: 10 to 15°C below baseline)
- Record training iterations/sec (target: 8-12% above baseline)
- Record fan noise level
Document all results in README
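
Comparing a profile run against the baseline is simple arithmetic; a small helper such as the sketch below could keep the README numbers consistent. The function names and the sample readings in the usage note are hypothetical, not measured results.

```python
from typing import Dict


def percent_change(baseline: float, measured: float) -> float:
    """Relative change versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


def summarize(baseline_temp: float, profile_temp: float,
              baseline_ips: float, profile_ips: float) -> Dict[str, float]:
    """Temperature delta (°C, negative is cooler) and throughput gain (%)."""
    return {
        "temp_delta_c": profile_temp - baseline_temp,
        "throughput_gain_pct": percent_change(baseline_ips, profile_ips),
    }


def meets_targets(summary: Dict[str, float], min_temp_drop_c: float,
                  min_gain_pct: float) -> bool:
    """True if the run is at least this much cooler and this much faster."""
    return (summary["temp_delta_c"] <= -min_temp_drop_c
            and summary["throughput_gain_pct"] >= min_gain_pct)
```

For example, hypothetical readings of 78°C and 10.0 iterations/sec at baseline against 70°C and 10.6 iterations/sec under a profile give a delta of -8°C and a gain of about 6%, which would satisfy the balanced profile's lower bounds (7°C, 5%).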

7.4 Publishing to Forgejo

Build Python package: python -m build
Publish to Forgejo PyPI: twine upload --repository-url ...
Build frontend: cd frontend && pnpm build
Tag release: git tag v0.1.0
Push to Forgejo: git push origin v0.1.0
Create release notes on Forgejo

7.5 Monitoring & Maintenance

Set up log monitoring (journalctl -u nvidia-oc -f)
Monitor for errors or warnings
Track GPU temps over 7 days
Verify no performance degradation
Collect user feedback
Plan v0.2.0 features

Future Enhancements (Post-MVP)

v0.2.0

Historical metrics database (SQLite)
Profile scheduler (auto-switch by time/load)
Voltage control (advanced Coolbits)
Email/webhook alerts on thermal events
Docker containerization
Prometheus metrics exporter

v0.3.0

Multi-node cluster support
Authentication (JWT + HTTPS)
Power limit curves (dynamic)
Mobile-responsive UI improvements
Profile sharing (import/export)
Fan curve visual editor in web UI

v1.0.0

Stable API contract
Packaging for Fedora/Ubuntu/Arch
Security audit
Production systemd hardening
Comprehensive test suite (>90% coverage)
Localization (i18n)

Blockers & Risks

Active Blockers

None currently

Known Risks

Coolbits dependency - Users may not enable Coolbits
- Mitigation: Clear documentation, graceful fallback (read-only mode)
nvidia-settings X11 requirement - Clock writes need X server
- Mitigation: Virtual X on headless servers, document workaround
Driver compatibility - NVML API may vary between driver versions
- Mitigation: Test on multiple driver versions (535.x, 545.x, 550.x)
Hardware variance - Different GPU models may behave differently
- Mitigation: Test on multiple GPUs (3090, 4090, etc.)

Notes

Development Environment: Bluefin LTS (Fedora-based), Wayland, 2x RTX 3090
Target Users: ML engineers, Linux gamers, GPU server admins
Primary Goal: Provide MSI Afterburner equivalent for Linux
Success Metric: 24h stable operation at +100/+500 overclock

Last Updated: 2026-01-14
Current Phase: Phase 2 (Core Library Implementation)
Estimated Completion: 2-3 weeks from start
