A Job runs a command to completion on a specified GPU or CPU resource. Unlike workspaces, jobs are non-interactive — they execute, produce output, and terminate automatically.
Jobs are ideal for:
- Model training and fine-tuning
- Batch inference and evaluation
- Data preprocessing pipelines
- Hyperparameter sweeps (submit multiple jobs in parallel)
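A sweep is just a loop of job submissions. A minimal sketch, assuming placeholder cluster, spec, and learning-rate values; the leading `echo` prints each command instead of running it, so you can inspect the sweep before submitting:

```shell
# Hypothetical sweep over three learning rates. Cluster name, spec name,
# and learning rates are placeholders -- drop "echo" to actually submit.
for lr in 0.001 0.0005 0.0001; do
  echo vesslctl job create \
    --name "sweep-lr-${lr}" \
    --cluster my-cluster \
    --resource-spec gpu-1x \
    --image quay.io/vessl-ai/torch:2.9.1-cuda13.0.1-py3.13-slim \
    --cmd "python train.py --lr ${lr}"
done
```

Each job is scheduled independently, so the runs execute in parallel as soon as capacity is available.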
## Jobs vs Workspaces
| | Job | Workspace |
|---|---|---|
| Interaction | Non-interactive (runs a command) | Interactive (SSH, JupyterLab) |
| Lifecycle | Starts → runs → completes automatically | Stays running until you pause or terminate |
| Billing | Only while running | While running or paused (at reduced rate) |
| Best for | Training, batch processing, sweeps | Development, debugging, exploration |
## Creating a job
You can create jobs from the VESSL Cloud console or the CLI.
From the console: Navigate to Jobs in the sidebar and click Create Job. Select a cluster, resource spec, container image, and enter the command to run.
From the CLI:
```shell
vesslctl job create \
  --name my-training-job \
  --cluster <cluster-name> \
  --resource-spec <spec-name> \
  --image quay.io/vessl-ai/torch:2.9.1-cuda13.0.1-py3.13-slim \
  --cmd "python train.py --epochs 10"
```
Run `vesslctl cluster list` and `vesslctl resource-spec list` to see available clusters and GPU specs. See `vesslctl job` for the full CLI reference.
## Monitoring jobs
Once a job is submitted, you can monitor its progress from the Jobs list page. Each job shows its status, resources (GPU type and count), duration, and creator. Click on a job to view detailed logs and resource utilization.
From the CLI:
```shell
vesslctl job list --state running
vesslctl job show <job-slug>
```
## Job statuses
| Status | Meaning |
|---|---|
| `scheduling` | Waiting for resources to become available. The job shows a reason such as "Waiting for GPU capacity" while it queues. |
| `running` | Your command is actively executing on the allocated resources. |
| `completed` | The command exited successfully (exit code 0). Output written to mounted volumes is preserved. |
| `failed` | The command exited with a non-zero code, or the container crashed (for example, OOMKilled). Check the logs to debug. |
| `terminated` | You manually cancelled the job before it finished. |
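The `completed`/`failed` split is driven purely by the exit code of your command. A quick local sketch of the rule (nothing VESSL-specific here):

```shell
# A job's final status mirrors the command's exit code:
# 0 -> completed, anything else -> failed.
# An OOM-killed container typically surfaces exit code 137 (128 + SIGKILL).
sh -c 'exit 0' && echo "exit 0 -> job would be marked completed"
sh -c 'exit 1' || echo "exit 1 -> job would be marked failed"
```

If your training script swallows exceptions and exits 0, the job will report `completed` even though the run went wrong, so let failures propagate to the exit code.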
## Persist job output
Jobs run in ephemeral containers — anything written outside a mounted volume disappears when the job ends. Attach at least one persistent volume so your outputs survive.
- Object storage (`--object-volume`): Shared across clusters, ideal for final artifacts such as trained models and evaluation metrics. Mount at a dedicated path such as `/output`.
- Cluster storage (`--cluster-volume`): Fast in-cluster storage, ideal for intermediate checkpoints during long training runs. Mount at `/workspace` or a similar path.
```shell
vesslctl job create \
  --name my-training-job \
  --resource-spec <spec-name> \
  --image quay.io/vessl-ai/torch:2.9.1-cuda13.0.1-py3.13-slim \
  --object-volume <output-volume-slug>:/output \
  --cmd "python train.py --output /output"
```
Temporary storage is cleared when a job ends, even when it completes successfully. If your training script writes to `/tmp` or the current working directory without a mounted volume, those results are lost.
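You can simulate the distinction locally. A sketch using temp directories to stand in for the container filesystem and a mounted volume (paths and filenames are illustrative):

```shell
# Stand-ins: SCRATCH mimics the container's ephemeral filesystem,
# VOLUME mimics a mounted /output volume.
SCRATCH=$(mktemp -d)
VOLUME=$(mktemp -d)
echo "step=1000 loss=0.12" > "$SCRATCH/metrics.txt"  # written outside any volume
echo "step=1000 loss=0.12" > "$VOLUME/metrics.txt"   # written to the mounted path
rm -rf "$SCRATCH"   # job ends: the container filesystem is discarded
ls "$VOLUME"        # the volume copy survives: metrics.txt
```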
## View job logs
Stream logs while a job runs or after it finishes:
```shell
vesslctl job logs <job-slug> --follow     # stream in real time
vesslctl job logs <job-slug> --limit 500  # last 500 lines
```
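Because logs stream to stdout, they compose with standard shell tools. For example, to surface error-looking lines from recent output (the grep pattern is just an illustration):

```shell
# Pull the last 500 lines and keep only lines that look like errors
vesslctl job logs <job-slug> --limit 500 | grep -iE "error|traceback"
```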
Logs are also available on the job's detail page under the Logs tab.
## Cancel a running job
Terminate a job from the console (kebab menu → Terminate) or from the CLI:
```shell
vesslctl job terminate <job-slug>
```
Cancellation stops compute billing immediately. Data already written to mounted volumes is preserved.
## Next steps