Every node in your cluster is health-checked continuously. Each GPU gets its own status, and node-level conditions (memory, disk, kernel modules, InfiniBand, observability) roll up into a single status for the node, refreshed about once a minute.

Node overview

The cluster detail page shows a Node overview strip with every node as a colored cell, so you can read the whole cluster’s health at a glance.

Statuses

Status | Color | Meaning
Healthy | 🟢 | No active alarms.
Warning | 🟡 | At least one warning-level alarm. The node is usable but something needs attention.
Critical | 🔴 | At least one critical alarm. The node may be unusable or producing bad results.
Unknown | ⚪ | We can’t judge the node right now; see Unknown (gray) below.
A node’s status is worst-wins: it takes the highest severity across all of its alarms and per-GPU statuses. Having more warnings does not escalate the node to Critical on its own.
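
The worst-wins rule can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the `Status` enum and the `node_status` function are hypothetical names, and Unknown is left out because it is decided separately from the severity roll-up.

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical enum: values are ordered by severity so max() picks the worst.
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def node_status(node_alarms, gpu_statuses):
    """Return the highest severity among node-level alarms and per-GPU statuses."""
    return max([*node_alarms, *gpu_statuses], default=Status.HEALTHY)


# Many warnings stay Warning; a single Critical GPU makes the node Critical.
many_warnings = node_status([Status.WARNING] * 5, [Status.WARNING] * 8)
one_critical = node_status([], [Status.HEALTHY] * 7 + [Status.CRITICAL])
print(many_warnings.name, one_critical.name)
```

Taking a maximum rather than counting alarms is what keeps a pile of warnings from escalating on its own.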

Levels

Health is tracked at two levels — each GPU has its own status, and the node has an overall status that combines node-level checks with the worst per-GPU status.
Level | Where you see it | What’s checked
GPU (per device) | The 8-GPU grid on the node detail page | XID errors, ECC, temperature, throttling
Node | Node overview cell and the node row dot | Memory, disk, kernel modules, InfiniBand fabric, observability gaps

GPU checks

Each GPU has its own status — the worst condition among the checks below. The node picks up the worst GPU’s color using worst-wins.
Code | Condition | Status
gpu_high_temp | GPU temperature ≥ 85°C (sustained) | 🟡 Warning
gpu_high_temp | GPU temperature ≥ 95°C | 🔴 Critical
gpu_memory_high_temp | HBM (memory) temperature ≥ 85°C | 🟡 Warning
gpu_memory_high_temp | HBM temperature ≥ 95°C | 🔴 Critical
gpu_hw_throttle | Hardware throttling | 🟡 Warning
gpu_no_process_high_util | High GPU utilization with no attached process | 🟡 Warning
row_remap_nearing_limit | HBM row remaps approaching the safety limit | 🟡 Warning
gpu_ecc_dbe | Uncorrectable (DBE) ECC error | 🔴 Critical
gpu_row_remap_failure | HBM row remap failure | 🔴 Critical
gpu_remapped_rows_pending | HBM row remaps pending (reboot required) | 🔴 Critical
gpu_count_mismatch | Fewer than 8 GPUs detected by the driver | 🔴 Critical
gpu_smi_unhealthy | GPU reported unhealthy by nvidia-smi | 🔴 Critical
gpu_driver_pri_bus_fault | GPU driver / PCIe bus fault | 🔴 Critical
gpu_recovery_action_required | GPU recovery action required | 🔴 Critical
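
The temperature rows use the same two thresholds for the GPU core and for HBM. A minimal sketch of that mapping follows. The function name `temp_status` and the constants are hypothetical, and the sketch ignores the "sustained" qualifier on the 85°C GPU warning, since the page does not say how long a reading must persist.

```python
WARNING_C = 85.0   # Warning threshold from the table, in degrees Celsius
CRITICAL_C = 95.0  # Critical threshold from the table, in degrees Celsius


def temp_status(temp_c: float) -> str:
    """Map one temperature reading to a status using the documented thresholds."""
    if temp_c >= CRITICAL_C:
        return "Critical"
    if temp_c >= WARNING_C:
        return "Warning"
    return "Healthy"


for reading in (72.0, 85.0, 96.5):
    print(reading, temp_status(reading))
```

Checking the Critical threshold first matters: a reading of 96°C also satisfies the Warning condition, and worst-wins means the higher severity is the one reported.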

XID errors

GPU XID codes are categorized by severity.
XID | Code | Meaning | Status
38 | gpu_xid38_driver_firmware_mismatch | Driver / firmware mismatch | 🟡 Warning
62 | gpu_xid62_internal_fw_breakpoint | Internal firmware breakpoint | 🟡 Warning
95 | gpu_xid95_uncontained_ecc_reboot | Uncontained ECC (recoverable) | 🟡 Warning
48 | gpu_xid48_dbe_row_remap | DBE row remap | 🔴 Critical
64 | gpu_xid64_ecc_row_remap_failure | ECC row remap failure | 🔴 Critical
74 | gpu_xid74_nvlink_error | NVLink error | 🔴 Critical
79 | gpu_xid79_fallen_off_bus | GPU fallen off the bus | 🔴 Critical
For codes not listed here, see the NVIDIA XID catalog. On the node detail page, each XID alert links to its catalog entry.
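
If you triage alerts in your own tooling, the XID table can be kept as a simple lookup. The sketch below copies the codes and severities from the table. The `XID_ALARMS` dictionary and the `classify_xid` function are hypothetical helpers, not part of any provided API, and returning `None` for unlisted XIDs is an assumption about how you might handle them.

```python
from typing import Optional, Tuple

# XID number -> (alarm code, status), copied from the table above.
XID_ALARMS = {
    38: ("gpu_xid38_driver_firmware_mismatch", "Warning"),
    62: ("gpu_xid62_internal_fw_breakpoint", "Warning"),
    95: ("gpu_xid95_uncontained_ecc_reboot", "Warning"),
    48: ("gpu_xid48_dbe_row_remap", "Critical"),
    64: ("gpu_xid64_ecc_row_remap_failure", "Critical"),
    74: ("gpu_xid74_nvlink_error", "Critical"),
    79: ("gpu_xid79_fallen_off_bus", "Critical"),
}


def classify_xid(xid: int) -> Optional[Tuple[str, str]]:
    """Return (alarm code, status) for a listed XID, or None if it is not listed."""
    return XID_ALARMS.get(xid)


print(classify_xid(79))
print(classify_xid(13))  # not in the table, so look it up in the NVIDIA XID catalog
```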

Node checks

These contribute to the node status. They affect the node dot and the Node overview cell — they don’t change any individual GPU’s status. Hover the node’s health indicator on the cluster detail page to see which specific alarm is firing.

System and kernel

Code | Condition | Status
memory_low | Low system memory | 🟡 Warning
disk_low | Low root disk space | 🟡 Warning
peermem_not_loaded | peermem kernel module not loaded | 🔴 Critical

InfiniBand

Code | Condition | Status
ib_symbol_errors | Symbol errors on an InfiniBand HCA (physical-layer corruption) | 🟡 Warning
ib_link_flap / ib_storage_link_flap | Link flap on an InfiniBand HCA | 🟡 Warning
ib_transport_retries_exceeded | Transport retries exceeded on multiple InfiniBand HCAs | 🟡 Warning
ib_port_down | One or more InfiniBand ports are down (multi-node paths fail) | 🔴 Critical

Observability gaps

When part of a node’s health data can’t be collected, the node shows at least Warning, with one of these labels:
Label | What it means | Status
Node unreachable | The node isn’t responding, so its health can’t be checked. | 🟡 Warning, escalating to 🔴 Critical the longer it stays unreachable
GPU metrics unavailable | GPU metrics aren’t being collected, so GPU health can’t be checked. | 🟡 Warning
InfiniBand metrics unavailable | InfiniBand metrics aren’t being collected, so link health can’t be checked. | 🟡 Warning
Partial data loss does not turn the node Unknown — gray is reserved for the cases below.

Unknown (gray)

A node is Unknown (⚪) only when its health can’t be judged at all. Two cases:
  • The node is in the Rebooting state. We don’t judge a node until it’s fully running. This is normal during reboot or initial provisioning.
  • The node isn’t set up for monitoring at all. Uncommon; if a running node stays this way, contact support.
In both cases, every cell on the node detail page is gray and the node dot is gray.
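
Put together, Unknown takes precedence over whatever the alarms say, and partial data loss never produces it. A minimal sketch of that precedence, under stated assumptions: `displayed_status` is a hypothetical function, `"Rebooting"` is the state named above, `"Running"` is an assumed name for a fully running node, and `monitoring_configured` is an assumed flag for whether the node is set up for monitoring.

```python
def displayed_status(node_state: str, monitoring_configured: bool,
                     rolled_up_status: str) -> str:
    """Return the status shown for a node.

    rolled_up_status is the worst-wins result ("Healthy", "Warning" or
    "Critical"), which already includes any observability-gap alarms.
    """
    # Gray is reserved for nodes whose health cannot be judged at all.
    if node_state == "Rebooting" or not monitoring_configured:
        return "Unknown"
    # Partial data loss shows up as Warning or Critical, never Unknown.
    return rolled_up_status


print(displayed_status("Rebooting", True, "Critical"))
print(displayed_status("Running", True, "Warning"))
```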

When a node is unhealthy

  • Reboot the node — often clears transient faults.
  • Wait for recovery — our engineers monitor nodes and recover them manually. During Beta there’s no recovery-time SLA.
  • Contact support if the problem persists.
If a node fails due to hardware, our engineers may replace it — replacement is a manual decision, not automatic. There’s no fixed maintenance schedule today; if maintenance is needed, we’ll notify you in advance.