VESSL Cloud Documentation

Every node in your cluster is health-checked continuously. Each GPU gets its own status, and node-level conditions (memory, disk, kernel modules, InfiniBand, observability) roll up into a single status for the node, refreshed about once a minute.

Node overview

The cluster detail page shows a Node overview strip with every node as a colored cell, so you can read the whole cluster’s health at a glance.

Statuses

Status	Color	Meaning
Healthy	🟢	No active alarms.
Warning	🟡	At least one warning-level alarm. The node is usable but something needs attention.
Critical	🔴	At least one critical alarm. The node may be unusable or producing bad results.
Unknown	⚪	The node’s health can’t be judged right now. See Unknown (gray) below.

A node’s status is worst-wins: it takes the highest severity across all of its alarms and per-GPU statuses. Having more warnings does not escalate the node to Critical on its own.
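The worst-wins rule can be sketched in a few lines of Python. This is an illustrative model rather than the platform's actual implementation: the `Status` enum, its numeric ordering, and the `node_status` function are hypothetical names chosen for this example. Unknown is left out on purpose, since it describes a node that cannot be judged rather than a severity.

```python
from enum import IntEnum
from typing import Iterable


class Status(IntEnum):
    """Hypothetical severity ordering; a higher value means a worse status."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def node_status(alarm_severities: Iterable[Status],
                gpu_statuses: Iterable[Status]) -> Status:
    """Return the worst severity across node-level alarms and per-GPU statuses."""
    combined = list(alarm_severities) + list(gpu_statuses)
    # With no active alarms and no unhealthy GPUs, the node is Healthy.
    return max(combined, default=Status.HEALTHY)
```

Under this model, three warning alarms on a node with eight healthy GPUs still yield Warning, while a single Critical GPU makes the whole node Critical.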

Levels

Health is tracked at two levels — each GPU has its own status, and the node has an overall status that combines node-level checks with the worst per-GPU status.

Level	Where you see it	What’s checked
GPU (per device)	The 8-GPU grid on the node detail page	XID errors, ECC, temperature, throttling
Node	Node overview cell and the node row dot	Memory, disk, kernel modules, InfiniBand fabric, observability gaps

GPU checks

Each GPU has its own status, which is the worst condition among the checks below. The node then inherits the worst GPU’s status under the worst-wins rule.

Code	Condition	Status
`gpu_high_temp`	GPU temperature ≥ 85°C (sustained)	🟡 Warning
`gpu_high_temp`	GPU temperature ≥ 95°C	🔴 Critical
`gpu_memory_high_temp`	HBM (memory) temperature ≥ 85°C	🟡 Warning
`gpu_memory_high_temp`	HBM temperature ≥ 95°C	🔴 Critical
`gpu_hw_throttle`	Hardware throttling	🟡 Warning
`gpu_no_process_high_util`	High GPU utilization with no attached process	🟡 Warning
`row_remap_nearing_limit`	HBM row remaps approaching the safety limit	🟡 Warning
`gpu_ecc_dbe`	Uncorrectable (DBE) ECC error	🔴 Critical
`gpu_row_remap_failure`	HBM row remap failure	🔴 Critical
`gpu_remapped_rows_pending`	HBM row remaps pending (reboot required)	🔴 Critical
`gpu_count_mismatch`	Fewer than 8 GPUs detected by the driver	🔴 Critical
`gpu_smi_unhealthy`	GPU reported unhealthy by `nvidia-smi`	🔴 Critical
`gpu_driver_pri_bus_fault`	GPU driver / PCIe bus fault	🔴 Critical
`gpu_recovery_action_required`	GPU recovery action required	🔴 Critical
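As a rough illustration of how the two temperature thresholds in the table map to statuses, here is a minimal sketch. The function name and the `sustained` flag are assumptions made for this example. The table does not say how long a reading must persist to count as sustained, so the caller decides that here. The table marks only the `gpu_high_temp` warning as sustained, so for HBM readings you would leave `sustained` at its default.

```python
def temperature_status(temp_c: float, sustained: bool = True) -> str:
    """Classify a GPU or HBM temperature using the thresholds in the table.

    Assumption: `sustained` says whether the reading has persisted long
    enough to count; the documentation does not define that window.
    """
    if temp_c >= 95:
        # The critical threshold applies regardless of duration.
        return "Critical"
    if temp_c >= 85 and sustained:
        return "Warning"
    return "Healthy"
```

For example, `temperature_status(90)` falls in the Warning band, while a brief spike passed as `temperature_status(90, sustained=False)` is treated as Healthy in this sketch.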

XID errors

GPU XID codes are categorized by severity.

XID	Code	Meaning	Status
38	`gpu_xid38_driver_firmware_mismatch`	Driver / firmware mismatch	🟡 Warning
62	`gpu_xid62_internal_fw_breakpoint`	Internal firmware breakpoint	🟡 Warning
95	`gpu_xid95_uncontained_ecc_reboot`	Uncontained ECC (recoverable)	🟡 Warning
48	`gpu_xid48_dbe_row_remap`	DBE row remap	🔴 Critical
64	`gpu_xid64_ecc_row_remap_failure`	ECC row remap failure	🔴 Critical
74	`gpu_xid74_nvlink_error`	NVLink error	🔴 Critical
79	`gpu_xid79_fallen_off_bus`	GPU fallen off the bus	🔴 Critical

For codes not listed here, see the NVIDIA XID catalog. On the node detail page, each XID alert links to its catalog entry.
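If you process XID events in your own tooling, the table above can be expressed as a simple lookup. This is a sketch only: the dictionary and function names are invented for illustration, while the codes and severities are copied from the table. XIDs that are not listed return `None` rather than a guessed severity.

```python
from typing import Dict, Optional, Tuple

# XID number -> (alarm code, status), copied from the table above.
XID_ALARMS: Dict[int, Tuple[str, str]] = {
    38: ("gpu_xid38_driver_firmware_mismatch", "Warning"),
    62: ("gpu_xid62_internal_fw_breakpoint", "Warning"),
    95: ("gpu_xid95_uncontained_ecc_reboot", "Warning"),
    48: ("gpu_xid48_dbe_row_remap", "Critical"),
    64: ("gpu_xid64_ecc_row_remap_failure", "Critical"),
    74: ("gpu_xid74_nvlink_error", "Critical"),
    79: ("gpu_xid79_fallen_off_bus", "Critical"),
}


def classify_xid(xid: int) -> Optional[Tuple[str, str]]:
    """Return (code, status) for a listed XID, or None if it is not in the table."""
    return XID_ALARMS.get(xid)
```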

Node checks

These contribute to the node status. They affect the node dot and the Node overview cell — they don’t change any individual GPU’s status. Hover the node’s health indicator on the cluster detail page to see which specific alarm is firing.

System and kernel

Code	Condition	Status
`memory_low`	Low system memory	🟡 Warning
`disk_low`	Low root disk space	🟡 Warning
`peermem_not_loaded`	`peermem` kernel module not loaded	🔴 Critical

InfiniBand

Code	Condition	Status
`ib_symbol_errors`	Symbol errors on an InfiniBand HCA (physical-layer corruption)	🟡 Warning
`ib_link_flap` / `ib_storage_link_flap`	Link flap on an InfiniBand HCA	🟡 Warning
`ib_transport_retries_exceeded`	Transport retries exceeded on multiple InfiniBand HCAs	🟡 Warning
`ib_port_down`	One or more InfiniBand ports are down (multi-node paths fail)	🔴 Critical

Observability gaps

When part of a node’s health data can’t be collected, the node shows at least Warning, with one of these labels:

Label	What it means	Status
Node unreachable	The node isn’t responding, so its health can’t be checked.	🟡 Warning, escalating to 🔴 Critical the longer it stays unreachable
GPU metrics unavailable	GPU metrics aren’t being collected, so GPU health can’t be checked.	🟡 Warning
InfiniBand metrics unavailable	InfiniBand metrics aren’t being collected, so link health can’t be checked.	🟡 Warning

Partial data loss does not turn the node Unknown — gray is reserved for the cases below.

Unknown (gray)

A node is Unknown (⚪) only when its health can’t be judged at all. Two cases:

The node is in the Rebooting state. We don’t judge a node until it’s fully running, so this is normal during a reboot or initial provisioning.
The node isn’t set up for monitoring at all. Uncommon; if a running node stays this way, contact support.

In both cases, every cell on the node detail page is gray and the node dot is gray.
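The difference between Unknown and an observability gap can be sketched as follows. The `NodeState` fields are hypothetical stand-ins for whatever the monitoring system tracks internally; only the two Unknown cases and the gap labels come from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    """Hypothetical inputs; the field names are illustrative only."""
    rebooting: bool = False
    monitoring_configured: bool = True
    gpu_metrics_available: bool = True
    ib_metrics_available: bool = True


def is_unknown(node: NodeState) -> bool:
    """Gray only when the node's health cannot be judged at all."""
    return node.rebooting or not node.monitoring_configured


def gap_labels(node: NodeState) -> List[str]:
    """Labels for partial data loss, which yields Warning rather than Unknown."""
    if is_unknown(node):
        # An Unknown node is not judged, so no gap labels are reported.
        return []
    labels = []
    if not node.gpu_metrics_available:
        labels.append("GPU metrics unavailable")
    if not node.ib_metrics_available:
        labels.append("InfiniBand metrics unavailable")
    return labels
```

In this model, a running node that has lost only its GPU metrics is not Unknown; it carries the "GPU metrics unavailable" label and shows Warning.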

When a node is unhealthy

Reboot the node — often clears transient faults.
Wait for recovery — our engineers monitor nodes and recover them manually. During Beta there’s no recovery-time SLA.
Contact support if the problem persists.

If a node fails due to hardware, our engineers may replace it — replacement is a manual decision, not automatic. There’s no fixed maintenance schedule today; if maintenance is needed, we’ll notify you in advance.
