GPU Monitoring

Oodle's GPU Monitoring gives you full visibility into your GPU fleet: utilization, memory allocation, temperature, power draw, and per-process resource consumption. It works with NVIDIA GPUs through either nvidia_gpu_exporter (nvidia-smi based) or dcgm-exporter.

Overview Tab

The Overview tab provides a high-level health summary of your entire GPU fleet.

[Image: GPU Overview]

Summary Cards

| Card | Description |
|---|---|
| Total GPUs | Number of monitored GPU devices |
| Active GPUs | GPUs with utilization > 5% |
| Total Hosts | Number of hosts with GPUs |
| Avg Utilization | Fleet-wide average GPU utilization |
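
As an illustration of how the four summary cards relate to per-GPU data, here is a minimal Python sketch. The `GpuSample` type, its field names, and the sample values are hypothetical and are not part of Oodle's API; only the 5% activity threshold comes from the table above.

```python
from dataclasses import dataclass
from statistics import mean

# From the Summary Cards table: a GPU counts as active when utilization > 5%.
ACTIVE_THRESHOLD_PCT = 5.0


@dataclass
class GpuSample:
    """Hypothetical per-GPU sample; field names are illustrative only."""
    uuid: str
    host: str
    utilization_pct: float


def summarize_fleet(samples):
    """Compute the four summary-card values from a list of GpuSample."""
    return {
        "total_gpus": len(samples),
        "active_gpus": sum(
            1 for s in samples if s.utilization_pct > ACTIVE_THRESHOLD_PCT
        ),
        "total_hosts": len({s.host for s in samples}),
        # An empty fleet has no meaningful average; report 0.0 instead of failing.
        "avg_utilization": (
            mean(s.utilization_pct for s in samples) if samples else 0.0
        ),
    }


# Made-up sample data for demonstration.
fleet = [
    GpuSample("GPU-a", "gpu-train-01", 92.0),
    GpuSample("GPU-b", "gpu-train-01", 3.0),
    GpuSample("GPU-c", "gpu-train-02", 55.0),
]
summary = summarize_fleet(fleet)
```

With this sample fleet, two of the three GPUs are active and both hosts are counted once, regardless of how many GPUs each one holds.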

Alerts & Health

| Alert | Trigger |
|---|---|
| Idle GPUs | Devices with < 5% utilization |
| Thermal Throttling | Devices with active thermal throttle or temp ≥ 85°C |
| ECC / XID Errors | Devices reporting memory or XID errors |
| PCIe Degradation | Active devices with zero PCIe throughput |
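
The alert triggers above can be read as simple per-device predicates. The following Python sketch shows one way to express them; the `DeviceState` type and its fields are hypothetical, and "active" is assumed to mean utilization above 5%, matching the Summary Cards definition. It is not Oodle's actual alerting logic.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    """Hypothetical snapshot of one GPU; field names are illustrative only."""
    utilization_pct: float
    temperature_c: float
    thermal_throttle_active: bool
    ecc_errors: int
    xid_errors: int
    pcie_throughput_bytes: float


def alerts_for(device):
    """Return the names of the alerts that the given device would trigger."""
    alerts = []
    if device.utilization_pct < 5.0:
        alerts.append("Idle GPUs")
    if device.thermal_throttle_active or device.temperature_c >= 85.0:
        alerts.append("Thermal Throttling")
    if device.ecc_errors > 0 or device.xid_errors > 0:
        alerts.append("ECC / XID Errors")
    # Assumption: "active" means utilization > 5%, as in the summary cards.
    if device.utilization_pct > 5.0 and device.pcie_throughput_bytes == 0:
        alerts.append("PCIe Degradation")
    return alerts


# Made-up device states for demonstration.
hot_device = DeviceState(80.0, 86.0, False, 0, 0, 0.0)
idle_device = DeviceState(1.0, 40.0, False, 0, 0, 0.0)
hot_alerts = alerts_for(hot_device)
idle_alerts = alerts_for(idle_device)
```

Note that an idle device with zero PCIe throughput does not raise the PCIe alert in this sketch, because the trigger applies to active devices only.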

Hosts Tab

The Hosts tab lists every host that has at least one GPU installed.

| Column | Description |
|---|---|
| Host | Hostname (instance label) |
| Device | GPU model name |
| Health | Healthy, Throttled, or Error state |
| Active | Number of active / total GPUs |
| CPU | Host CPU utilization % |
| Memory | Host system memory utilization % |
| GPU | Average GPU compute utilization % |
| GPU Mem | Average GPU VRAM allocation % |
| ECC | ECC error count |
| XID | XID error count |
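
Several of these columns are per-host aggregates of device-level values. The Python sketch below shows how such a rollup could be computed; the dictionary keys and sample values are hypothetical, and summing error counts across a host's devices is an assumption rather than documented behavior.

```python
from collections import defaultdict
from statistics import mean


def rollup_by_host(devices):
    """Aggregate per-device dicts into one row per host.

    Each device dict is assumed to carry: host, gpu_util, gpu_mem, ecc, xid.
    """
    by_host = defaultdict(list)
    for device in devices:
        by_host[device["host"]].append(device)

    rows = {}
    for host, devs in by_host.items():
        # Assumption: "active" means utilization > 5%, as in the summary cards.
        active = sum(1 for d in devs if d["gpu_util"] > 5.0)
        rows[host] = {
            "active": f"{active} / {len(devs)}",
            "gpu": mean(d["gpu_util"] for d in devs),
            "gpu_mem": mean(d["gpu_mem"] for d in devs),
            # Assumption: host-level error counts are sums over devices.
            "ecc": sum(d["ecc"] for d in devs),
            "xid": sum(d["xid"] for d in devs),
        }
    return rows


# Made-up device data for demonstration.
devices = [
    {"host": "gpu-train-01", "gpu_util": 90.0, "gpu_mem": 80.0, "ecc": 0, "xid": 1},
    {"host": "gpu-train-01", "gpu_util": 2.0, "gpu_mem": 10.0, "ecc": 2, "xid": 0},
    {"host": "gpu-train-02", "gpu_util": 50.0, "gpu_mem": 40.0, "ecc": 0, "xid": 0},
]
host_rows = rollup_by_host(devices)
```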

Click any host row to open a detail drawer with devices, processes, and embedded Grafana dashboards.

[Image: GPU Host Detail Drawer]

Devices Tab

The Devices tab lists every individual GPU device across all hosts.

[Image: GPU Devices]

| Column | Description |
|---|---|
| UUID | GPU device UUID |
| Model | GPU model name |
| Host | Host the GPU is installed in |
| Health | Health status |
| GPU Util | Compute utilization % |
| GPU Mem | VRAM allocation % |
| Temp | Current temperature in °C |
| Power | Current power draw in watts |
| ECC | ECC error count |

Click any device row to open a detail drawer with GPU-specific Grafana dashboards and process information.

[Image: GPU Device Detail Drawer]

Process Monitor

The process monitor shows per-process resource usage on GPU hosts using process_exporter metrics.

| Metric | Description |
|---|---|
| CPU Rate | CPU cores consumed |
| Resident Memory | Physical memory usage |
| Read / Write Bytes | Disk I/O rate |
| Context Switches | Rate of context switches |
| FD Ratio | File descriptor usage as % of limit |
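
Most of these values are rates derived from cumulative counters, and FD Ratio is a simple percentage. The Python sketch below shows the arithmetic under those assumptions; the function names are hypothetical, and the counter-reset handling mirrors common practice rather than anything Oodle documents.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate of a cumulative counter between two samples.

    Applied to cumulative CPU seconds, the result is the number of CPU cores
    consumed; applied to cumulative bytes read, it is the disk read rate.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, which usually means the process
        # restarted. Treat the current value as the increase since the reset.
        delta = curr_value
    return delta / interval_seconds


def fd_ratio_pct(open_fds, fd_limit):
    """File descriptor usage as a percentage of the process limit."""
    if fd_limit <= 0:
        raise ValueError("fd_limit must be positive")
    return 100.0 * open_fds / fd_limit


# Made-up samples: 30 CPU-seconds consumed over a 15-second interval.
cpu_cores = counter_rate(120.0, 150.0, 15.0)
fd_usage = fd_ratio_pct(256, 1024)
```

In this example the process used two full cores during the interval and a quarter of its file descriptor limit.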

Setup

Navigate to Integrations → GPU Monitoring to access the setup wizard, which guides you through:

  1. Choose your exporter: nvidia_gpu_exporter or dcgm-exporter
  2. Install the exporter: Docker, systemd, or binary commands provided
  3. Install process exporter: for per-process visibility
  4. Configure scraping: Prometheus or VMAgent configuration with correct instance labeling
  5. Verify data: confirm metrics are flowing

Supported Exporters

| Exporter | Metrics Prefix | Use Case |
|---|---|---|
| nvidia_gpu_exporter | nvidia_smi_* | Simple setup, covers most use cases |
| dcgm-exporter | DCGM_FI_* | Advanced profiling metrics |
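
Because each exporter uses a distinct metric prefix, a quick way to check which one is reporting is to look at the metric names you receive. The Python sketch below does this for a list of names; the `detect_exporters` helper is hypothetical, and the sample metric names are only examples of names carrying each prefix.

```python
# Prefixes taken from the Supported Exporters table above.
EXPORTER_PREFIXES = {
    "nvidia_gpu_exporter": "nvidia_smi_",
    "dcgm-exporter": "DCGM_FI_",
}


def detect_exporters(metric_names):
    """Return the set of exporters whose prefix appears in metric_names."""
    found = set()
    for name in metric_names:
        for exporter, prefix in EXPORTER_PREFIXES.items():
            if name.startswith(prefix):
                found.add(exporter)
    return found


# Example metric names; any name with the right prefix would match.
observed = [
    "nvidia_smi_utilization_gpu_ratio",
    "up",
    "DCGM_FI_DEV_GPU_UTIL",
]
exporters = detect_exporters(observed)
```

An empty result means neither prefix was seen, which suggests the exporter is not being scraped yet.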

Instance Label

The GPU page identifies hosts by the instance label. The setup wizard configures relabeling to strip the port from the target address, so you get clean hostnames (e.g. gpu-train-01 instead of gpu-train-01:9835).
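
The effect of that relabeling can be sketched in Python as follows. This only illustrates the transformation, not the configuration the wizard generates; the `strip_port` helper is hypothetical and handles plain `host:port` addresses, not bare IPv6 addresses without brackets.

```python
import re

# Matches a trailing ":<digits>" port suffix on a target address.
_PORT_SUFFIX = re.compile(r":\d+$")


def strip_port(target_address):
    """Return the address without its trailing port, if it has one."""
    return _PORT_SUFFIX.sub("", target_address)


instance = strip_port("gpu-train-01:9835")
```

Addresses that already lack a port pass through unchanged, so applying the same rule to every exporter on a host yields one consistent instance label.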

Grafana Dashboards

Four built-in dashboards are embedded in the UI, accessible from the host and device detail drawers and the Process tab:

| Dashboard | Used In |
|---|---|
| GPU Host Overview | Host drawer |
| NVIDIA GPU Metrics | Device drawer (nvidia_smi) |
| DCGM GPU Metrics | Device drawer (DCGM) |
| GPU Process Monitor | Process tab |

The NVIDIA GPU Metrics dashboard provides a deep dive into a single GPU — real-time utilization, clock speeds, memory allocation, power draw, fan speed, and throttle reasons. It automatically filters to the selected device when opened from the Devices tab.

[Image: NVIDIA GPU Metrics Dashboard]

Best Practices

  • Use consistent instance labels: strip ports so the same host appears as one entity across all metrics.
  • Monitor VRAM allocation: high usage (>90%) can lead to out-of-memory (OOM) kills.
  • Watch thermal throttling: sustained temperatures above 80°C can reduce performance.
  • Track idle GPUs: idle GPUs waste expensive compute; use the Idle GPUs alert to right-size your fleet.
  • Enable process_exporter: without it, you lose visibility into which processes consume resources.

Support

If you need assistance or have any questions, please reach out to us through: