Node overview
The cluster detail page shows a Node overview strip with every node as a colored cell, so you can read the whole cluster’s health at a glance.
Statuses
| Status | Color | Meaning |
|---|---|---|
| Healthy | 🟢 | No active alarms. |
| Warning | 🟡 | At least one warning-level alarm. The node is usable but something needs attention. |
| Critical | 🔴 | At least one critical alarm. The node may be unusable or producing bad results. |
| Unknown | ⚪ | We can’t judge the node right now — see Unknown below. |
Levels
Health is tracked at two levels: each GPU has its own status, and the node has an overall status that combines node-level checks with the worst per-GPU status.

| Level | Where you see it | What’s checked |
|---|---|---|
| GPU (per device) | The 8-GPU grid on the node detail page | XID errors, ECC, temperature, throttling |
| Node | Node overview cell and the node row dot | Memory, disk, kernel modules, InfiniBand fabric, observability gaps |
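
The way the two levels combine can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the `Status` enum and the `worst` and `node_status` helpers are hypothetical names, and Unknown is left out because it applies only when a node can't be judged at all.

```python
from enum import IntEnum


class Status(IntEnum):
    """Statuses ordered so that a higher value means worse health."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def worst(statuses):
    """Return the worst status in an iterable, or HEALTHY if it is empty."""
    return max(statuses, default=Status.HEALTHY)


def node_status(gpu_statuses, node_check_statuses):
    """Combine the worst per-GPU status with node-level check results."""
    return worst([worst(gpu_statuses), worst(node_check_statuses)])


# Seven healthy GPUs and one GPU with a warning.
gpus = [Status.HEALTHY] * 7 + [Status.WARNING]

print(node_status(gpus, [Status.CRITICAL]).name)  # CRITICAL
print(node_status(gpus, []).name)                 # WARNING
```

Ordering the statuses numerically keeps the rule to a single `max` call at each level.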
GPU checks
Each GPU’s status is the worst condition among the checks below. The node picks up the color of its worst GPU (worst wins).

| Code | Condition | Status |
|---|---|---|
| gpu_high_temp | GPU temperature ≥ 85°C (sustained) | 🟡 Warning |
| gpu_high_temp | GPU temperature ≥ 95°C | 🔴 Critical |
| gpu_memory_high_temp | HBM (memory) temperature ≥ 85°C | 🟡 Warning |
| gpu_memory_high_temp | HBM temperature ≥ 95°C | 🔴 Critical |
| gpu_hw_throttle | Hardware throttling | 🟡 Warning |
| gpu_no_process_high_util | High GPU utilization with no attached process | 🟡 Warning |
| row_remap_nearing_limit | HBM row remaps approaching the safety limit | 🟡 Warning |
| gpu_ecc_dbe | Uncorrectable (DBE) ECC error | 🔴 Critical |
| gpu_row_remap_failure | HBM row remap failure | 🔴 Critical |
| gpu_remapped_rows_pending | HBM row remaps pending (reboot required) | 🔴 Critical |
| gpu_count_mismatch | Fewer than 8 GPUs detected by the driver | 🔴 Critical |
| gpu_smi_unhealthy | GPU reported unhealthy by nvidia-smi | 🔴 Critical |
| gpu_driver_pri_bus_fault | GPU driver / PCIe bus fault | 🔴 Critical |
| gpu_recovery_action_required | GPU recovery action required | 🔴 Critical |
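
As a rough illustration of the two temperature thresholds in the table, the sketch below classifies a single reading. The function name and the `sustained` flag are assumptions made for this example; the table does not say how long a reading must hold to count as sustained, so the caller decides that here.

```python
WARNING_C = 85.0   # table: >= 85°C is Warning (sustained, for GPU temperature)
CRITICAL_C = 95.0  # table: >= 95°C is Critical


def temp_status(temp_c, sustained=True):
    """Classify one temperature reading in degrees Celsius.

    `sustained` is a hypothetical flag: pass False for a brief spike
    that should not raise a warning yet.
    """
    if temp_c >= CRITICAL_C:
        return "critical"
    if temp_c >= WARNING_C and sustained:
        return "warning"
    return "healthy"


print(temp_status(84.9))                   # healthy
print(temp_status(85.0))                   # warning
print(temp_status(90.0, sustained=False))  # healthy
print(temp_status(95.0, sustained=False))  # critical
```

The critical threshold is checked first, so a reading at or above 95°C is never reported as a mere warning.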
XID errors
GPU XID codes are categorized by severity.

| XID | Code | Meaning | Status |
|---|---|---|---|
| 38 | gpu_xid38_driver_firmware_mismatch | Driver / firmware mismatch | 🟡 Warning |
| 62 | gpu_xid62_internal_fw_breakpoint | Internal firmware breakpoint | 🟡 Warning |
| 95 | gpu_xid95_uncontained_ecc_reboot | Uncontained ECC (recoverable) | 🟡 Warning |
| 48 | gpu_xid48_dbe_row_remap | DBE row remap | 🔴 Critical |
| 64 | gpu_xid64_ecc_row_remap_failure | ECC row remap failure | 🔴 Critical |
| 74 | gpu_xid74_nvlink_error | NVLink error | 🔴 Critical |
| 79 | gpu_xid79_fallen_off_bus | GPU fallen off the bus | 🔴 Critical |
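
If you process XID events in your own tooling, the table above maps naturally onto a lookup. The following is a minimal sketch under that assumption; `XID_ALARMS` and `classify_xid` are hypothetical names, and returning `None` for XIDs not listed here is a choice made for the example, not documented behavior.

```python
# XID number -> (alarm code, status), copied from the table above.
XID_ALARMS = {
    38: ("gpu_xid38_driver_firmware_mismatch", "warning"),
    62: ("gpu_xid62_internal_fw_breakpoint", "warning"),
    95: ("gpu_xid95_uncontained_ecc_reboot", "warning"),
    48: ("gpu_xid48_dbe_row_remap", "critical"),
    64: ("gpu_xid64_ecc_row_remap_failure", "critical"),
    74: ("gpu_xid74_nvlink_error", "critical"),
    79: ("gpu_xid79_fallen_off_bus", "critical"),
}


def classify_xid(xid):
    """Return (code, status) for a listed XID, or None if it is not listed."""
    return XID_ALARMS.get(xid)


print(classify_xid(79))  # ('gpu_xid79_fallen_off_bus', 'critical')
print(classify_xid(13))  # None
```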
Node checks
These contribute to the node status. They affect the node dot and the Node overview cell; they don’t change any individual GPU’s status. Hover over the node’s health indicator on the cluster detail page to see which specific alarm is firing.
System and kernel
| Code | Condition | Status |
|---|---|---|
| memory_low | Low system memory | 🟡 Warning |
| disk_low | Low root disk space | 🟡 Warning |
| peermem_not_loaded | peermem kernel module not loaded | 🔴 Critical |
InfiniBand
| Code | Condition | Status |
|---|---|---|
| ib_symbol_errors | Symbol errors on an InfiniBand HCA (physical-layer corruption) | 🟡 Warning |
| ib_link_flap / ib_storage_link_flap | Link flap on an InfiniBand HCA | 🟡 Warning |
| ib_transport_retries_exceeded | Transport retries exceeded on multiple InfiniBand HCAs | 🟡 Warning |
| ib_port_down | One or more InfiniBand ports are down (multi-node paths fail) | 🔴 Critical |
Observability gaps
When part of a node’s health data can’t be collected, the node shows Warning (or Critical, if it stays unreachable) with one of these labels:

| Label | What it means | Status |
|---|---|---|
| Node unreachable | The node isn’t responding, so its health can’t be checked. | 🟡 Warning, escalating to 🔴 Critical the longer it stays unreachable |
| GPU metrics unavailable | GPU metrics aren’t being collected, so GPU health can’t be checked. | 🟡 Warning |
| InfiniBand metrics unavailable | InfiniBand metrics aren’t being collected, so link health can’t be checked. | 🟡 Warning |
Unknown (gray)
A node is Unknown (⚪) only when its health can’t be judged at all. Two cases:
- The node is in Rebooting state. We don’t judge a node until it’s fully running. This is normal during reboot or initial provisioning.
- The node isn’t set up for monitoring at all. Uncommon; if a running node stays this way, contact support.
When a node is unhealthy
- Reboot the node — often clears transient faults.
- Wait for recovery — our engineers monitor nodes and recover them manually. During Beta there’s no recovery-time SLA.
- Contact support if the problem persists.
