VESSL Cloud Documentation

Each node exposes a per-GPU overview plus time-series charts for GPU, system, network, and InfiniBand metrics. To open a node’s detail page, go to the Active cluster, open the Node management tab, and click a node.

GPU overview

Each node lists its 8 GPUs with utilization, VRAM, temperature, power, and a per-GPU health badge. See Health checks for what each badge state means.

Time range

Pick a time range from 1h, 6h, 12h, 1d, 7d, or 30d. For deeper analysis, open the linked Grafana dashboard.

Metric charts

Metrics are grouped into 6 sections with 25 charts total.

Utilization (2 charts)

Track whether GPUs are actively executing tasks and how much memory each one holds.

Utilization section: GPU Utilization and GPU Memory Used time-series charts

Chart	What it shows
GPU Utilization	Percentage of time the GPU is executing tasks. Low utilization may indicate over-provisioned GPU resources or that upstream vCPUs cannot keep the GPU fed.
GPU Memory Used	Amount of frame buffer (VRAM) currently in use on each GPU. A sudden drop to 0 usually means the process using the GPU exited or crashed inside the VM; a steady upward trend may signal a memory leak.
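
The two GPU Memory Used patterns described above can be checked mechanically on exported samples. The following is a minimal sketch, not part of VESSL Cloud: the function name `classify_memory_series`, the MiB unit, and the 1024 MiB growth threshold are all assumptions chosen for illustration.

```python
from typing import Sequence


def classify_memory_series(
    samples_mib: Sequence[float],
    leak_min_growth_mib: float = 1024.0,  # hypothetical threshold, tune per workload
) -> str:
    """Label a GPU Memory Used series as 'dropped_to_zero', 'steady_growth', or 'ok'."""
    if len(samples_mib) < 2:
        return "ok"
    # Sudden drop to 0: the latest sample is 0 while the previous one was in use.
    if samples_mib[-1] == 0 and samples_mib[-2] > 0:
        return "dropped_to_zero"
    # Steady upward trend: the series never decreases and total growth is large.
    non_decreasing = all(b >= a for a, b in zip(samples_mib, samples_mib[1:]))
    if non_decreasing and samples_mib[-1] - samples_mib[0] >= leak_min_growth_mib:
        return "steady_growth"
    return "ok"
```

A "steady_growth" result is only a hint. Memory that grows during warm-up and then plateaus is normal, so a longer time range gives a more reliable signal.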

System (7 charts)

Spot bottlenecks outside the GPU — vCPU saturation, memory pressure, disk, network, or the VM’s overall health.

System section (top): CPU Usage, Load Average, System Memory Usage, and Root Disk Usage charts

System section (continued): Network RX, Network TX, and Node Uptime charts

Chart	What it shows
CPU Usage	Percentage of time the VM’s vCPUs spend executing tasks. Sustained high values indicate vCPUs are saturated; if GPUs are idle at the same time, vCPUs are likely the bottleneck.
Load Average (5m)	Average number of processes running on or waiting for vCPUs inside the VM over the last 5 minutes. Sustained values above the VM’s vCPU count may indicate VM overload.
System Memory Usage	Percentage of the VM’s allocated RAM in use. Sustained high values may trigger the OOM (Out of Memory) killer inside the VM, and swap usage causes severe slowdowns.
Root Disk Usage	Percentage of the VM’s root filesystem (`/`) capacity in use. Near capacity may cause write failures for any process needing disk space.
Network RX	Inbound throughput on the VM’s primary ethernet interface. Sustained 0 B/s during expected activity may indicate an ethernet outage, separate from the InfiniBand fabric.
Network TX	Outbound throughput on the VM’s primary ethernet interface. Carries control-plane traffic such as API calls and logs, separate from the InfiniBand fabric used for high-bandwidth workloads.
Node Uptime	Node up/down state over time. 1 = up (no critical alerts firing), 0 = down (a critical alert is firing, such as XID errors, ECC DBE, or IB link down). Gaps mean the node stopped reporting entirely.
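
The CPU Usage and GPU Utilization guidance can be combined into one rule of thumb: busy vCPUs next to idle GPUs point at the vCPUs. The sketch below illustrates that reasoning only. The function name `diagnose_bottleneck` and the 90% and 30% thresholds are assumptions, not values defined by VESSL Cloud.

```python
def diagnose_bottleneck(
    cpu_usage_pct: float,
    gpu_util_pct: float,
    cpu_high_pct: float = 90.0,  # hypothetical "saturated" threshold
    gpu_low_pct: float = 30.0,   # hypothetical "idle" threshold
) -> str:
    """Classify one window of averaged CPU Usage and GPU Utilization values."""
    if cpu_usage_pct >= cpu_high_pct and gpu_util_pct <= gpu_low_pct:
        # vCPUs are saturated while GPUs wait: vCPUs are the likely bottleneck.
        return "vcpu_bottleneck"
    if gpu_util_pct <= gpu_low_pct:
        # GPUs are idle but vCPUs are not saturated: look at disk, network, or sizing.
        return "gpu_underused"
    return "gpu_busy"
```

Feeding the function averages over the selected time range, rather than single samples, avoids reacting to short spikes.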

Temperature & power (3 charts)

Watch for thermal throttling and power-related instability.

Temperature & power section: GPU temperature, memory temperature, and power usage charts

Chart	What it shows
GPU Temperature	GPU chip temperature in degrees Celsius. Sustained values approaching the GPU’s thermal limit trigger thermal throttling.
Memory Temperature	HBM (High Bandwidth Memory) temperature in degrees Celsius. Cross-check with GPU temperature to identify the thermal hotspot.
Power Usage	Current power draw per GPU in watts. Active compute draws close to the GPU’s rated TDP; 0 W typically indicates the VM does not recognize the GPU, and sharp swings may signal unstable load.
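
As a rough illustration of the temperature and power checks above, the sketch below flags a single GPU sample. It is an assumption-laden example: `gpu_power_thermal_flag` is a hypothetical helper, and the thermal limit must come from the GPU's own specification, since this page does not state one.

```python
from typing import Optional


def gpu_power_thermal_flag(
    temp_c: float,
    power_w: float,
    thermal_limit_c: float,  # caller-supplied; take it from the GPU's specification
    margin_c: float = 5.0,   # hypothetical safety margin below the limit
) -> Optional[str]:
    """Return a flag name for one GPU sample, or None when nothing stands out."""
    if power_w == 0:
        # 0 W typically means the VM does not recognize the GPU.
        return "gpu_not_recognized"
    if temp_c >= thermal_limit_c - margin_c:
        # Close enough to the limit that thermal throttling becomes likely.
        return "near_thermal_limit"
    return None
```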

Memory & clock detail (3 charts)

Identify throttling and HBM activity issues.

Memory & clock detail section: memory utilization, memory clock, and SM clock charts

Chart	What it shows
Memory Utilization	Percentage of time the GPU’s memory engine is active. This measures activity time, not bandwidth (GB/s) — 100% doesn’t necessarily mean max throughput.
Memory Clock	Current frequency of the GPU’s HBM clock. Values below the base clock may indicate throttling. Cross-check with Temperature & power.
SM Clock	Current frequency of the SM (Streaming Multiprocessor) clock. Values below the base clock may indicate power or thermal throttling.
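
One way to quantify "values below the base clock" is the share of samples that fall under it. This is a minimal sketch under stated assumptions: `fraction_below_base` is a hypothetical helper, and the base clock is supplied by the caller because it depends on the GPU model.

```python
from typing import Sequence


def fraction_below_base(clock_mhz: Sequence[float], base_clock_mhz: float) -> float:
    """Return the fraction of clock samples that are below the base clock."""
    if not clock_mhz:
        return 0.0
    below = sum(1 for c in clock_mhz if c < base_clock_mhz)
    return below / len(clock_mhz)
```

The result is most meaningful for windows in which the GPU was under load, since GPUs commonly lower their clocks when idle. Cross-checking against GPU Utilization helps separate idle periods from throttling.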

ECC & errors (4 charts)

Catch GPU hardware faults early — before they take down a training run.

ECC & errors section: ECC SBE/DBE and remapped rows charts

Chart	What it shows
ECC SBE (Correctable)	Single-bit memory errors corrected by hardware in the last 5 minutes. Occasional events are normal, but a sharp upward trend may precede uncorrectable errors.
ECC DBE (Uncorrectable)	Uncorrectable memory errors in the last 5 minutes. The healthy count is 0; a single event may corrupt in-flight computation or crash the process using the GPU.
Remapped Rows (Correctable Errors)	Correctable HBM row remap events in the last 5 minutes. An upward trend may signal HBM degradation.
Remapped Rows (Uncorrectable Errors)	Uncorrectable HBM row remap events in the last 5 minutes. The healthy count is 0; any occurrence is critical and may require GPU replacement.
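
The rules in this table (any uncorrectable event is critical, a sharp rise in correctable errors is a warning) can be sketched as a small classifier over 5-minute counts. The function name `ecc_status` and the spike factor of 3 are illustrative assumptions, not thresholds used by VESSL Cloud's health checks.

```python
from typing import Sequence


def ecc_status(
    sbe_counts: Sequence[int],
    dbe_counts: Sequence[int],
    uncorrectable_remaps: Sequence[int],
    sbe_spike_factor: float = 3.0,  # hypothetical definition of a "sharp" rise
) -> str:
    """Classify per-window error counts as 'critical', 'warning', or 'ok'."""
    # The healthy count for DBE and uncorrectable row remaps is 0.
    if any(v > 0 for v in dbe_counts) or any(v > 0 for v in uncorrectable_remaps):
        return "critical"
    # Warn when the latest SBE count far exceeds the average of earlier windows.
    if len(sbe_counts) >= 2:
        baseline = sum(sbe_counts[:-1]) / (len(sbe_counts) - 1)
        # Floor the baseline at 1 so occasional single events do not trigger a warning.
        if sbe_counts[-1] > sbe_spike_factor * max(baseline, 1.0):
            return "warning"
    return "ok"
```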

InfiniBand detail (6 charts)

Monitor the node-to-node fabric used by multi-node distributed training — throughput, errors, and link stability. Each chart is plotted per HCA (Host Channel Adapter).

InfiniBand detail section: per-HCA throughput, receive, and symbol error charts

InfiniBand detail section: IB Link Error Recovery and IB Link Downed charts

Chart	What it shows
IB TX Throughput	Outbound InfiniBand throughput per HCA attached to the VM. Significant imbalance between HCAs may indicate a cable or switch issue; sustained 0 Gbps may indicate the link is down.
IB RX Throughput	Inbound InfiniBand throughput per HCA on the VM. Asymmetry against TX may indicate routing issues; per-HCA lines help isolate a faulty port.
IB Receive Errors	InfiniBand receive errors per HCA in the last 5 minutes. Sustained values above 0 may indicate cable or port damage, disrupting multi-VM workloads.
IB Symbol Errors	InfiniBand symbol errors on the VM’s HCAs in the last 5 minutes. May indicate cable aging or optical module issues.
IB Link Error Recovery	InfiniBand link recovery events on the VM’s HCAs in the last 5 minutes. A rising trend signals link instability and may escalate to Link Downed.
IB Link Downed	InfiniBand link-down events per HCA in the last 5 minutes. The healthy count is 0; any occurrence disconnects the VM from the IB fabric.
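
Per-HCA imbalance, as described for IB TX Throughput, can be approximated by comparing each HCA against the median of its peers. The sketch below is illustrative only: `flag_hcas`, the 0.5 ratio, and the HCA names in the usage example are assumptions.

```python
from statistics import median
from typing import Dict, Mapping


def flag_hcas(
    tx_gbps: Mapping[str, float],
    imbalance_ratio: float = 0.5,  # hypothetical: below half the median counts as imbalanced
) -> Dict[str, str]:
    """Return a flag per HCA whose throughput is 0 or far below the median."""
    if not tx_gbps:
        return {}
    med = median(tx_gbps.values())
    flags: Dict[str, str] = {}
    for hca, gbps in tx_gbps.items():
        if gbps == 0:
            # Sustained 0 Gbps may indicate the link is down.
            flags[hca] = "possible_link_down"
        elif med > 0 and gbps < imbalance_ratio * med:
            # Well below peers: may indicate a cable or switch issue.
            flags[hca] = "imbalanced"
    return flags


# Usage with made-up HCA names and values:
sample = {"hca0": 380.0, "hca1": 390.0, "hca2": 120.0, "hca3": 0.0}
print(flag_hcas(sample))
```

The check only makes sense while a multi-node workload is actually running, because all HCAs legitimately sit near 0 Gbps when the fabric is idle.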
