GPU overview
Each node lists its 8 GPUs with utilization, VRAM, temperature, power, and a per-GPU health badge. See Health checks for what each badge state means.
Time range
Pick a time range: 1h, 6h, 12h, 1d, 7d, or 30d. For deeper analysis, open the linked Grafana dashboard.
Metric charts
Metrics are grouped into 6 sections with 25 charts total.
Utilization (2 charts)
Track whether GPUs are actively executing tasks and how much memory each one holds.
| Chart | What it shows |
|---|---|
| GPU Utilization | Percentage of time the GPU is executing tasks. Low utilization may indicate over-provisioned GPU resources or that upstream vCPUs cannot keep the GPU fed. |
| GPU Memory Used | Amount of frame buffer (VRAM) currently in use on each GPU. A sudden drop to 0 typically indicates that the process using the GPU crashed or exited inside the VM; a steady upward trend may signal a memory leak. |
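
The two GPU Memory Used patterns described above (a sudden drop to 0 and a steady upward trend) can be told apart programmatically. The following is a minimal sketch, assuming you have exported one GPU's VRAM samples as a list of numbers; the function name `classify_vram_series` and the `min_growth_ratio` threshold are hypothetical and not part of the dashboard.

```python
from typing import Sequence


def classify_vram_series(samples_mib: Sequence[float],
                         min_growth_ratio: float = 0.9) -> str:
    """Label one GPU's VRAM samples as 'drop_to_zero', 'steady_growth', or 'normal'.

    min_growth_ratio is an assumed threshold: the fraction of consecutive
    steps that must increase before the series counts as steadily growing.
    """
    if len(samples_mib) < 2:
        return "normal"

    # Sudden drop: the latest sample is 0 right after a non-zero sample.
    if samples_mib[-1] == 0 and samples_mib[-2] > 0:
        return "drop_to_zero"

    # Steady growth: most steps rise and the series ends above where it started.
    steps = [b - a for a, b in zip(samples_mib, samples_mib[1:])]
    rising = sum(1 for step in steps if step > 0)
    if rising / len(steps) >= min_growth_ratio and samples_mib[-1] > samples_mib[0]:
        return "steady_growth"

    return "normal"
```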
System (7 charts)
Spot bottlenecks outside the GPU — vCPU saturation, memory pressure, disk, network, or the VM’s overall health.

| Chart | What it shows |
|---|---|
| CPU Usage | Percentage of time the VM’s vCPUs spend executing tasks. Sustained high values indicate vCPUs are saturated; if GPUs are idle at the same time, vCPUs are likely the bottleneck. |
| Load Average (5m) | Average number of processes running on or waiting for the VM’s vCPUs over the last 5 minutes. Sustained values above the vCPU count may indicate VM overload. |
| System Memory Usage | Percentage of the VM’s allocated RAM in use. Sustained high values may trigger the OOM (Out of Memory) killer inside the VM, and swap usage causes severe slowdowns. |
| Root Disk Usage | Percentage of the VM’s root filesystem (/) capacity in use. Near capacity may cause write failures for any process needing disk space. |
| Network RX | Inbound throughput on the VM’s primary ethernet interface. Sustained 0 B/s during expected activity may indicate an ethernet outage, separate from the InfiniBand fabric. |
| Network TX | Outbound throughput on the VM’s primary ethernet interface. Carries control-plane traffic such as API calls and logs, separate from the InfiniBand fabric used for high-bandwidth workloads. |
| Node Uptime | Node availability over time: 1 = up (no critical alert firing), 0 = down (a critical alert is firing, such as XID errors, ECC DBE, or IB link down). Gaps mean the node stopped reporting entirely. |
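
The CPU Usage row suggests reading it together with GPU Utilization: saturated vCPUs alongside idle GPUs point to a vCPU bottleneck. Here is a minimal sketch of that cross-check, assuming the two series are sampled at the same timestamps. The function name and both thresholds (`cpu_high`, `gpu_low`) are illustrative assumptions, not values defined by the dashboard.

```python
from typing import Sequence


def vcpu_bottleneck_suspected(cpu_pct: Sequence[float],
                              gpu_pct: Sequence[float],
                              cpu_high: float = 90.0,
                              gpu_low: float = 30.0) -> bool:
    """Return True when every paired sample shows busy vCPUs and an idle GPU.

    cpu_high and gpu_low are assumed thresholds; tune them for your workload.
    """
    paired = list(zip(cpu_pct, gpu_pct))
    if not paired:
        return False
    # "Sustained" is modeled here as: the condition holds for all samples.
    return all(cpu >= cpu_high and gpu <= gpu_low for cpu, gpu in paired)
```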
Temperature & power (3 charts)
Watch for thermal throttling and power-related instability.
| Chart | What it shows |
|---|---|
| GPU Temperature | GPU chip temperature in degrees Celsius. Sustained values approaching the GPU’s thermal limit trigger thermal throttling. |
| Memory Temperature | HBM (High Bandwidth Memory) temperature in degrees Celsius. Cross-check with GPU temperature to identify the thermal hotspot. |
| Power Usage | Current power draw per GPU in watts. Active compute draws close to the GPU’s rated TDP; 0 W typically indicates the VM does not recognize the GPU, and sharp swings may signal unstable load. |
Memory & clock detail (3 charts)
Identify throttling and HBM activity issues.
| Chart | What it shows |
|---|---|
| Memory Utilization | Percentage of time the GPU’s memory engine is active. This measures activity time, not bandwidth (GB/s) — 100% doesn’t necessarily mean max throughput. |
| Memory Clock | Current frequency of the GPU’s HBM clock. Values below the base clock may indicate throttling. Cross-check with Temperature & power. |
| SM Clock | Current frequency of the SM (Streaming Multiprocessor) clock. Values below the base clock may indicate power or thermal throttling. |
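
The clock rows above recommend cross-checking a below-base clock against temperature. The sketch below shows one way to express that check for a single sample. The base clock and thermal limit depend on the GPU model and must be supplied by the caller; the function name and the `margin_c` default are hypothetical.

```python
def throttling_hint(sm_clock_mhz: float,
                    base_clock_mhz: float,
                    gpu_temp_c: float,
                    thermal_limit_c: float,
                    margin_c: float = 5.0) -> str:
    """Classify one sample as 'none', 'thermal', or 'power_or_other'.

    margin_c is an assumed cushion: temperatures within this many degrees
    of the thermal limit are treated as approaching it.
    """
    # A clock at or above base gives no sign of throttling.
    if sm_clock_mhz >= base_clock_mhz:
        return "none"
    # Clock is below base: temperature near the limit suggests thermal throttling.
    if gpu_temp_c >= thermal_limit_c - margin_c:
        return "thermal"
    # Otherwise look at the Power Usage chart or other causes.
    return "power_or_other"
```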
ECC & errors (4 charts)
Catch GPU hardware faults early — before they take down a training run.
| Chart | What it shows |
|---|---|
| ECC SBE (Correctable) | Single-bit memory errors corrected by hardware in the last 5 minutes. Occasional events are normal, but a sharp upward trend may precede uncorrectable errors. |
| ECC DBE (Uncorrectable) | Uncorrectable memory errors in the last 5 minutes. The healthy count is 0; a single event may corrupt in-flight computation or crash the process using the GPU. |
| Remapped Rows (Correctable Errors) | Correctable HBM row remap events in the last 5 minutes. An upward trend may signal HBM degradation. |
| Remapped Rows (Uncorrectable Errors) | Uncorrectable HBM row remap events in the last 5 minutes. The healthy count is 0; any occurrence is critical and may require GPU replacement. |
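
The guidance in this table can be summarized as a simple triage rule: any uncorrectable event is critical, and a rising count of correctable events is a warning. The following is a minimal sketch of that rule, assuming each 5-minute window is exported as four counters. The `EccWindow` type, the `ecc_severity` function, and the "three consecutive rising windows" trend test are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EccWindow:
    """Counters for one GPU over one 5-minute window (hypothetical shape)."""
    sbe: int
    dbe: int
    remap_correctable: int
    remap_uncorrectable: int


def _rising(values: Sequence[int]) -> bool:
    """True when each value is strictly greater than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))


def ecc_severity(windows: Sequence[EccWindow]) -> str:
    """Return 'critical', 'warning', or 'ok' for a time-ordered list of windows."""
    if not windows:
        return "ok"
    latest = windows[-1]
    # The healthy count for uncorrectable events is 0.
    if latest.dbe > 0 or latest.remap_uncorrectable > 0:
        return "critical"
    # Assumed trend test: correctable counts rose across the last three windows.
    recent = windows[-3:]
    if len(recent) == 3 and (
        _rising([w.sbe for w in recent])
        or _rising([w.remap_correctable for w in recent])
    ):
        return "warning"
    return "ok"
```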
InfiniBand detail (6 charts)
Monitor the node-to-node fabric used by multi-node distributed training — throughput, errors, and link stability. Each chart is plotted per HCA (Host Channel Adapter).

| Chart | What it shows |
|---|---|
| IB TX Throughput | Outbound InfiniBand throughput per HCA attached to the VM. Significant imbalance between HCAs may indicate a cable or switch issue; sustained 0 Gbps may indicate the link is down. |
| IB RX Throughput | Inbound InfiniBand throughput per HCA on the VM. Asymmetry against TX may indicate routing issues; per-HCA lines help isolate a faulty port. |
| IB Receive Errors | InfiniBand receive errors per HCA in the last 5 minutes. Sustained values above 0 may indicate cable or port damage, disrupting multi-VM workloads. |
| IB Symbol Errors | InfiniBand symbol errors on the VM’s HCAs in the last 5 minutes. May indicate cable aging or optical module issues. |
| IB Link Error Recovery | InfiniBand link recovery events on the VM’s HCAs in the last 5 minutes. A rising trend signals link instability and may escalate to Link Downed. |
| IB Link Downed | InfiniBand link-down events per HCA in the last 5 minutes. The healthy count is 0; any occurrence disconnects the VM from the IB fabric. |
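
The throughput rows above note that a significant imbalance between HCAs may indicate a cable or switch issue. One way to spot the outlier is to compare each HCA against the median of all HCAs on the node. This is a minimal sketch under that assumption; the function name, the `ratio` threshold, and the HCA names in the usage example are hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_lagging_hcas(throughput_gbps: Dict[str, float],
                      ratio: float = 0.5) -> List[str]:
    """Return HCAs whose throughput is below ratio * median of all HCAs.

    ratio is an assumed threshold for what counts as a significant imbalance.
    """
    if not throughput_gbps:
        return []
    mid = median(throughput_gbps.values())
    # If the fabric is idle overall, imbalance is not meaningful.
    if mid == 0:
        return []
    return sorted(name for name, value in throughput_gbps.items()
                  if value < ratio * mid)
```

For example, `find_lagging_hcas({"hca0": 380.0, "hca1": 390.0, "hca2": 0.0, "hca3": 385.0})` returns `["hca2"]`, which is the HCA to check first on the IB Link Downed and IB Receive Errors charts.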
