Each node exposes a per-GPU overview plus time-series charts for GPU, system, network, and InfiniBand metrics. To open a node’s detail page, go to Active cluster → Node management and click a node.

GPU overview

Each node lists its 8 GPUs with utilization, VRAM, temperature, power, and a per-GPU health badge. See Health checks for what each badge state means.
Figure: Node detail page header and the GPU overview, showing all 8 GPUs with utilization, VRAM, temperature, and power

Time range

Pick a time range from 1h, 6h, 12h, 1d, 7d, or 30d. For deeper analysis, open the linked Grafana dashboard.

Metric charts

Metrics are grouped into 6 sections with 25 charts total.

Utilization (2 charts)

Track whether GPUs are actively executing tasks and how much memory each one is using.
Figure: Utilization section, showing the GPU Utilization and GPU Memory Used time-series charts

| Chart | What it shows |
| --- | --- |
| GPU Utilization | Percentage of time the GPU is executing tasks. Low utilization may indicate over-provisioned GPU resources or that upstream vCPUs cannot keep the GPU fed. |
| GPU Memory Used | Amount of frame buffer (VRAM) currently in use on each GPU. A sudden drop to 0 indicates a process crash inside the VM; a steady upward trend may signal a memory leak. |
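The two GPU Memory Used patterns described above (a sudden drop to 0 and a steady upward trend) can be checked on exported samples. The following is a minimal sketch, not part of the product: the function name `classify_vram_series`, the MiB unit, and the slope threshold are all assumptions chosen for illustration.

```python
from statistics import mean


def classify_vram_series(samples_mib, leak_slope_mib_per_sample=1.0):
    """Classify one GPU's VRAM-in-use series as 'crash', 'possible_leak', or 'ok'.

    samples_mib: evenly spaced readings, oldest first (hypothetical export format).
    leak_slope_mib_per_sample: assumed threshold; tune it for your workload.
    """
    if len(samples_mib) < 2:
        return "ok"

    # A crash shows up as in-use memory falling from a non-zero value to 0.
    for prev, cur in zip(samples_mib, samples_mib[1:]):
        if prev > 0 and cur == 0:
            return "crash"

    # Least-squares slope over the window, in MiB per sample.
    n = len(samples_mib)
    x_mean = (n - 1) / 2
    y_mean = mean(samples_mib)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples_mib))
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = num / den

    return "possible_leak" if slope >= leak_slope_mib_per_sample else "ok"


print(classify_vram_series([40000, 40500, 0, 0]))      # drop to 0
print(classify_vram_series([1000, 1100, 1200, 1300]))  # steady growth
print(classify_vram_series([5000, 5000, 5000]))        # flat
```

A slope over the whole window is deliberately crude: a workload that legitimately ramps up memory during warm-up would also match, so treat a `possible_leak` result as a prompt to look at the chart, not as a diagnosis.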

System (7 charts)

Spot bottlenecks outside the GPU — vCPU saturation, memory pressure, disk, network, or the VM’s overall health.
Figure: System section (top), showing the CPU Usage, Load Average, System Memory Usage, and Root Disk Usage charts
Figure: System section (continued), showing the Network RX, Network TX, and Node Uptime charts

| Chart | What it shows |
| --- | --- |
| CPU Usage | Percentage of time the VM’s vCPUs spend executing tasks. Sustained high values indicate vCPUs are saturated; if GPUs are idle at the same time, vCPUs are likely the bottleneck. |
| Load Average (5m) | Average number of processes running or waiting to run inside the VM over the last 5 minutes. Values that stay above the vCPU count may indicate VM overload. |
| System Memory Usage | Percentage of the VM’s allocated RAM in use. Sustained high values may trigger the OOM (Out of Memory) killer inside the VM, and swap usage causes severe slowdowns. |
| Root Disk Usage | Percentage of the VM’s root filesystem (/) capacity in use. A filesystem near capacity may cause write failures for any process that needs disk space. |
| Network RX | Inbound throughput on the VM’s primary ethernet interface. Sustained 0 B/s during expected activity may indicate an ethernet outage, separate from the InfiniBand fabric. |
| Network TX | Outbound throughput on the VM’s primary ethernet interface. Carries control-plane traffic such as API calls and logs, separate from the InfiniBand fabric used for high-bandwidth workloads. |
| Node Uptime | Node status over time. 1 = up (no critical alerts firing), 0 = down (a critical alert is firing, such as XID errors, ECC DBE, or IB link down). Gaps mean the node stopped reporting entirely. |
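The cross-check between CPU Usage and GPU Utilization (saturated vCPUs while GPUs sit idle) can be expressed as a simple rule. This is a sketch under stated assumptions: the function name and the thresholds (`cpu_high`, `gpu_idle`, `min_fraction`) are hypothetical and are not values used by the dashboard.

```python
def likely_vcpu_bottleneck(cpu_pct, gpu_pct,
                           cpu_high=90.0, gpu_idle=20.0, min_fraction=0.8):
    """Return True when vCPUs look saturated while GPUs are mostly idle.

    cpu_pct, gpu_pct: percentage samples taken at the same timestamps.
    cpu_high, gpu_idle, min_fraction: assumed thresholds, not product defaults.
    """
    pairs = list(zip(cpu_pct, gpu_pct))
    if not pairs:
        return False

    # Count timestamps where CPU is saturated and the GPU is idle together.
    hits = sum(1 for cpu, gpu in pairs if cpu >= cpu_high and gpu <= gpu_idle)
    return hits / len(pairs) >= min_fraction


# vCPUs pinned near 100% while the GPU barely works.
print(likely_vcpu_bottleneck([95, 97, 99, 96, 98], [5, 10, 8, 12, 15]))
# Both busy: the GPU is being fed, so vCPUs are not the limiting factor.
print(likely_vcpu_bottleneck([95, 97], [90, 85]))
```

Requiring the condition to hold for most of the window (`min_fraction`) reflects the word "sustained" in the table: a single busy sample is not enough to call the vCPUs a bottleneck.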

Temperature & power (3 charts)

Watch for thermal throttling and power-related instability.
Figure: Temperature & power section, showing the GPU temperature, memory temperature, and power usage charts

| Chart | What it shows |
| --- | --- |
| GPU Temperature | GPU chip temperature in degrees Celsius. Sustained values approaching the GPU’s thermal limit trigger thermal throttling. |
| Memory Temperature | HBM (High Bandwidth Memory) temperature in degrees Celsius. Cross-check with GPU temperature to identify the thermal hotspot. |
| Power Usage | Current power draw per GPU in watts. Active compute draws close to the GPU’s rated TDP; 0 W typically indicates the VM does not recognize the GPU, and sharp swings may signal unstable load. |

Memory & clock detail (3 charts)

Identify throttling and HBM activity issues.
Figure: Memory & clock detail section, showing the memory utilization, memory clock, and SM clock charts

| Chart | What it shows |
| --- | --- |
| Memory Utilization | Percentage of time the GPU’s memory engine is active. This measures activity time, not bandwidth (GB/s), so 100% doesn’t necessarily mean maximum throughput. |
| Memory Clock | Current frequency of the GPU’s HBM memory clock. Values below the base clock may indicate throttling. Cross-check with Temperature & power. |
| SM Clock | Current frequency of the SM (Streaming Multiprocessor) clock. Values below the base clock may indicate power or thermal throttling. |
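The cross-check between clock charts and Temperature & power can be sketched as a per-sample rule: a clock below base with the GPU near its thermal limit points to thermal throttling; otherwise power is the next thing to inspect. Everything here is illustrative: the function name, the return labels, the `margin_c` default, and the example numbers are assumptions, and the base clock and thermal limit must come from your GPU’s specification.

```python
def throttle_hint(sm_clock_mhz, gpu_temp_c, base_clock_mhz, thermal_limit_c,
                  margin_c=5.0):
    """Return 'none', 'likely_thermal', or 'check_power' for one sample.

    base_clock_mhz and thermal_limit_c are GPU-specific inputs (not defined
    by the dashboard); margin_c is an assumed "approaching the limit" window.
    """
    if sm_clock_mhz >= base_clock_mhz:
        return "none"            # clock at or above base: no throttling signal
    if gpu_temp_c >= thermal_limit_c - margin_c:
        return "likely_thermal"  # low clock while close to the thermal limit
    return "check_power"         # low clock but cool: inspect Power Usage next


# Hypothetical values, for illustration only.
print(throttle_hint(sm_clock_mhz=1200, gpu_temp_c=88,
                    base_clock_mhz=1500, thermal_limit_c=90))
print(throttle_hint(sm_clock_mhz=1200, gpu_temp_c=60,
                    base_clock_mhz=1500, thermal_limit_c=90))
```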

ECC & errors (4 charts)

Catch GPU hardware faults early — before they take down a training run.
Figure: ECC & errors section, showing the ECC SBE/DBE and remapped rows charts

| Chart | What it shows |
| --- | --- |
| ECC SBE (Correctable) | Single-bit memory errors corrected by hardware in the last 5 minutes. Occasional events are normal, but a sharp upward trend may precede uncorrectable errors. |
| ECC DBE (Uncorrectable) | Uncorrectable memory errors in the last 5 minutes. The healthy count is 0; a single event may corrupt in-flight computation or crash the process using the GPU. |
| Remapped Rows (Correctable Errors) | Correctable HBM row remap events in the last 5 minutes. An upward trend may signal HBM degradation. |
| Remapped Rows (Uncorrectable Errors) | Uncorrectable HBM row remap events in the last 5 minutes. The healthy count is 0; any occurrence is critical and may require GPU replacement. |
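The severity rules in this section (any uncorrectable event is critical, a sharp rise in correctable errors is a warning) can be combined into one triage function. This is a sketch, not the logic behind the health badges: the function name, the input format (one count per 5-minute window, oldest first), and the definition of "sharp" as a multiple of the earlier average (`sbe_spike_factor`) are assumptions.

```python
def triage_ecc(sbe_counts, dbe_counts, remap_uncorrectable_counts,
               sbe_spike_factor=3.0):
    """Return 'critical', 'warning', or 'ok' for one GPU.

    Each argument is a list of per-window event counts, oldest first
    (hypothetical export format). sbe_spike_factor is an assumed threshold.
    """
    # The healthy count for uncorrectable events is 0, so any event is critical.
    if any(c > 0 for c in dbe_counts):
        return "critical"
    if any(c > 0 for c in remap_uncorrectable_counts):
        return "critical"

    # Occasional correctable errors are normal; flag only a sharp rise
    # of the latest window over the average of the earlier windows.
    if len(sbe_counts) >= 2:
        *history, latest = sbe_counts
        baseline = sum(history) / len(history)
        if latest >= sbe_spike_factor * max(baseline, 1.0):
            return "warning"

    return "ok"


print(triage_ecc([0, 1, 0], [0, 0, 0], [0, 0, 0]))  # occasional SBE only
print(triage_ecc([1, 1, 9], [0, 0, 0], [0, 0, 0]))  # SBE spike
print(triage_ecc([0, 0], [0, 1], [0, 0]))           # one DBE
```

The `max(baseline, 1.0)` floor keeps a single correctable error after a quiet period from counting as a spike.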

InfiniBand detail (6 charts)

Monitor the node-to-node fabric used by multi-node distributed training — throughput, errors, and link stability. Each chart is plotted per HCA (Host Channel Adapter).
Figure: InfiniBand detail section, showing the per-HCA throughput, receive, and symbol error charts
Figure: InfiniBand detail section, showing the IB Link Error Recovery and IB Link Downed charts

| Chart | What it shows |
| --- | --- |
| IB TX Throughput | Outbound InfiniBand throughput per HCA attached to the VM. Significant imbalance between HCAs may indicate a cable or switch issue; sustained 0 Gbps may indicate the link is down. |
| IB RX Throughput | Inbound InfiniBand throughput per HCA on the VM. Asymmetry against TX may indicate routing issues; per-HCA lines help isolate a faulty port. |
| IB Receive Errors | InfiniBand receive errors per HCA in the last 5 minutes. Sustained values above 0 may indicate cable or port damage, disrupting multi-VM workloads. |
| IB Symbol Errors | InfiniBand symbol errors on the VM’s HCAs in the last 5 minutes. May indicate cable aging or optical module issues. |
| IB Link Error Recovery | InfiniBand link recovery events on the VM’s HCAs in the last 5 minutes. A rising trend signals link instability and may escalate to Link Downed. |
| IB Link Downed | InfiniBand link-down events per HCA in the last 5 minutes. The healthy count is 0; any occurrence disconnects the VM from the IB fabric. |
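Because every InfiniBand chart is plotted per HCA, imbalance can be spotted by comparing each HCA’s average throughput against the others. The sketch below flags an HCA at 0 Gbps as a possible down link and an HCA well below the median as imbalanced. The function name, the HCA names (`hca0` and so on), the input format, and the `imbalance_ratio` threshold are hypothetical.

```python
from statistics import median


def flag_hcas(throughput_gbps, imbalance_ratio=0.5):
    """Flag HCAs whose average throughput stands out from their peers.

    throughput_gbps: dict mapping an HCA name to a list of Gbps samples
    (hypothetical export format). imbalance_ratio is an assumed threshold.
    Returns a dict of HCA name -> 'possible_link_down' or 'imbalanced'.
    """
    averages = {hca: sum(v) / len(v) for hca, v in throughput_gbps.items() if v}
    if not averages:
        return {}

    typical = median(averages.values())
    flags = {}
    for hca, avg in averages.items():
        if avg == 0:
            # Sustained 0 Gbps may indicate the link is down.
            flags[hca] = "possible_link_down"
        elif avg < imbalance_ratio * typical:
            # Far below its peers: may indicate a cable or switch issue.
            flags[hca] = "imbalanced"
    return flags


samples = {
    "hca0": [380, 390],
    "hca1": [385, 395],
    "hca2": [100, 110],
    "hca3": [0, 0],
}
print(flag_hcas(samples))
```

Comparing against the median instead of the mean keeps one dead or slow HCA from dragging the reference value down. A node that is idle on every HCA will show all links as `possible_link_down`, so run the check only over windows where fabric traffic is expected.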