# Load data into a volume

> Upload to Object storage from the CLI, or load Cluster storage from inside a workspace that mounts it.

Containers in workspaces and jobs are ephemeral. To make a dataset, model checkpoint, or large config file available to your training scripts, put it in a [persistent volume](/member/volume/overview) and mount that volume into the workload. This guide walks through the practical ways to get data **into** a volume.

<Danger>
  Do not encode data into the job command itself. The API rejects large inline payloads such as a gzip+base64 blob or a long heredoc pasted into `--cmd`: a command body over **256 KiB**, an environment variable value over **8 KiB**, or more than **128 environment variables** returns a 4xx error. Use one of the patterns below instead.
</Danger>

## At a glance

The right approach depends on which kind of volume you target.

| Volume kind                                                          | How to load data                                                                                                          |
| -------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| **[Object storage](/member/volume/overview#using-object-storage)**   | Upload directly from your machine with `vesslctl volume upload`, or any S3-compatible client via `vesslctl volume token`. |
| **[Cluster storage](/member/volume/overview#using-cluster-storage)** | Mount the volume into a workspace, then bring the data into the mount path from inside that workspace.                    |

***

## Object storage — upload from your machine

`vesslctl volume upload` is the primary path. Files stream from your machine straight to S3, so transfer size is limited only by your network and storage quota.

<Steps>
  <Step title="Pick or create an Object storage volume">
    ```bash theme={null}
    vesslctl volume list --type object
    ```

    If you need a new volume, see [Create a volume](/member/volume/create) or use the CLI:

    ```bash theme={null}
    vesslctl volume create \
      --name training-data \
      --storage <object-storage-slug> \
      --teams <team-name>
    ```

    `--teams` controls which teams can mount this volume; it is required for Object storage.
  </Step>

  <Step title="Upload local files">
    ```bash theme={null}
    vesslctl volume upload <volume-slug> ./dataset/ \
      --remote-prefix datasets/v1/ \
      --exclude "*.pyc" \
      --exclude "__pycache__"
    ```

    `--dry-run` previews the file list without transferring anything. `--overwrite` replaces existing remote keys; without it, files whose remote key already exists are skipped. See [`vesslctl volume upload`](/cli/commands/volume#upload) for the full flag reference.
  </Step>

  <Step title="Verify and mount">
    ```bash theme={null}
    vesslctl volume ls <volume-slug> --prefix datasets/v1/
    ```

    Mount the volume into a [workspace](/member/workspace/create#persistent-volume) or pass it to a job with `--object-volume <volume-slug>:/shared`.
  </Step>
</Steps>
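
A listing confirms that keys exist, not that their contents arrived intact. One way to check contents, sketched here, is to record SHA-256 checksums before uploading and re-check them wherever the volume is mounted. The helper names `write_manifest` and `check_manifest` and the `MANIFEST.sha256` filename are illustrative and not part of `vesslctl`; the sketch assumes GNU coreutils and findutils (on macOS, `shasum -a 256` is the usual substitute for `sha256sum`).

```bash
# Hypothetical helpers, not part of vesslctl. Assumes GNU coreutils/findutils.

# write_manifest DIR: record a SHA-256 checksum for every file under DIR
# in DIR/MANIFEST.sha256, with paths relative to DIR.
write_manifest() {
  (
    cd "$1" || exit 1
    find . -type f ! -name MANIFEST.sha256 -print0 \
      | sort -z \
      | xargs -0 -r sha256sum > MANIFEST.sha256
  )
}

# check_manifest DIR: re-hash the files listed in DIR/MANIFEST.sha256.
# Returns non-zero if any listed file is missing or differs.
check_manifest() {
  (
    cd "$1" || exit 1
    sha256sum --check --quiet MANIFEST.sha256
  )
}
```

Run `write_manifest ./dataset` before uploading so the manifest travels with the data, then run `check_manifest` on the mounted directory (for example `/shared/datasets/v1`, assuming the upload keeps relative paths under the prefix).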

<Tip>
  Need to drive the upload from another tool — `aws s3 cp`, `rclone`, [DVC](https://dvc.org/), or a custom pipeline? Run `vesslctl volume token <volume-slug>` to get temporary S3 credentials and an endpoint URL scoped to just that volume. See [`vesslctl volume token`](/cli/commands/volume#token).
</Tip>

***

## Cluster storage — load through a workspace

`vesslctl volume upload` does **not** support Cluster storage volumes. Instead, mount the Cluster storage volume into a workspace and bring the data into the mount path from inside that workspace.

<Steps>
  <Step title="Create a workspace that mounts the volume">
    Mount the Cluster storage volume at a clear path under [Persistent volume](/member/workspace/create#persistent-volume), for example `/data`:

    ```bash theme={null}
    vesslctl workspace create \
      --name data-loader \
      --cluster <cluster-slug> \
      --resource-spec <spec-slug> \
      --image quay.io/vessl-ai/torch:2.9.1-cuda13.0.1-py3.13-slim \
      --cluster-volume <cluster-volume-slug>:/data
    ```

    Any container image with the tools you need (`curl`, `wget`, `aws`, `huggingface-cli`, `git-lfs`, …) works. To minimize hourly cost while you move data, pick a **CPU-only** spec from `vesslctl resource-spec list`.
  </Step>

  <Step title="Connect to the workspace">
    Wait until the workspace is `running`, then connect over SSH or in JupyterLab. See [Connect to a workspace](/member/workspace/connect).

    Once connected, `cd /data` (or whatever mount path you chose). Anything you write below this path lands in the Cluster storage volume and persists after the workspace is paused or terminated.
  </Step>

  <Step title="Bring the data in (pick a pattern below)">
    Several patterns work. Pick the one that matches where the data lives.
  </Step>

  <Step title="Pause the workspace when you are done">
    Cluster storage data persists past `pause` — you do not need a running workspace to keep the data alive. Pause to stop compute billing:

    ```bash theme={null}
    vesslctl workspace pause <workspace-slug>
    ```

    Resume with `vesslctl workspace start <workspace-slug>` later if you need to add or modify data.
  </Step>
</Steps>
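
Before starting a large download it can help to confirm that the mount has room for it. The helper below is a minimal sketch: `has_space` is a hypothetical name, and it relies only on the POSIX output format of `df`. Treat the result as a rough guide, since a quota enforced by the platform may not be visible to `df`.

```bash
# Hypothetical helper: has_space PATH REQUIRED_KIB
# Succeeds when the filesystem holding PATH reports at least
# REQUIRED_KIB KiB available.
has_space() {
  local avail_kib
  # With -P (POSIX format) and -k (1 KiB blocks), column 4 of the
  # second output line is the available space.
  avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge "$2" ]
}
```

For example, `has_space /data $((200 * 1024 * 1024)) || echo "less than 200 GiB free on /data"` warns before a 200 GiB transfer begins.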

### Pattern A — Pull from the public internet

The simplest and most common case: the data is already at a public (or token-authenticated) URL.

```bash theme={null}
# Inside the workspace shell, with the volume mounted at /data
cd /data

# Plain HTTP(S) downloads
wget https://example.com/datasets/imagenet-subset.tar.gz
tar -xzf imagenet-subset.tar.gz

# Hugging Face datasets / model repos
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<repo> --local-dir ./hf-cache --repo-type dataset

# S3 / GCS / Azure (use the matching CLI)
aws s3 sync s3://my-bucket/datasets/v1/ ./datasets/v1/
```

For very large transfers, `aria2c -x 16 -s 16` splits an HTTP download across up to 16 connections (without `-s`, aria2 uses at most 5), and `rclone copy` handles cloud-storage providers with built-in retries and checksum verification.
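
Long downloads tend to fail partway through, so a retry wrapper is a common companion to these commands. The function below is a sketch rather than a platform feature: `retry` and the `RETRY_BASE_DELAY` variable are hypothetical names chosen for this example.

```bash
# Hypothetical helper: retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the wait between attempts (1 s, 2 s, 4 s, ...).
# Set RETRY_BASE_DELAY to change the initial wait in seconds.
retry() {
  local max_attempts=$1 attempt=1 delay=${RETRY_BASE_DELAY:-1}
  shift
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

Pair it with a resumable command so each attempt continues where the last one stopped, for example `retry 5 wget -c https://example.com/datasets/imagenet-subset.tar.gz`.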

### Pattern B — Push from your laptop over SSH

When the data exists only on your laptop and you do not want to stage it at an intermediate URL or bucket first, copy it over SSH directly into the mount path.

```bash theme={null}
# scp a single file (add -r to copy a directory)
scp -i /path/to/<key> -P <port> ./dataset.tar.gz \
  root@<workspace-host>:/data/

# rsync (resumable, incremental; recommended for large trees)
rsync -avh --partial --progress -e "ssh -i /path/to/<key> -p <port>" \
  ./dataset/ root@<workspace-host>:/data/
```

The host, port, and key path come from the workspace **Connect** tab; see [Connect to a workspace](/member/workspace/connect#ssh-key-connection). `rsync` is preferable for anything multi-gigabyte: with `--partial` it keeps partially transferred files so a re-run can resume after a dropped connection, and it retransmits only the files that changed.

### Pattern C — Stage through Object storage

When you want a one-time copy from your laptop into Cluster storage on a different cluster (or from one cluster to another), use Object storage as a portable intermediate. Object storage is reachable from any cluster.

```bash theme={null}
# 1. From your laptop: upload to an Object storage volume
vesslctl volume upload <object-volume-slug> ./dataset/ \
  --remote-prefix v1/

# 2. Create the data-loader workspace mounting BOTH volumes
vesslctl workspace create \
  --name data-loader \
  --cluster <cluster-slug> \
  --resource-spec <spec-slug> \
  --image quay.io/vessl-ai/torch:2.9.1-cuda13.0.1-py3.13-slim \
  --object-volume <object-volume-slug>:/shared \
  --cluster-volume <cluster-volume-slug>:/data

# 3. Inside the workspace: copy from /shared into /data
mkdir -p /data/v1   # make sure the destination directory exists
cp -r /shared/v1/. /data/v1/
```

After the copy completes, you can delete the staging copy from the Object storage volume if you no longer need it, or keep it as a backup.
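
If you plan to delete the staging copy, it is worth confirming first that the two trees really match. The sketch below uses a hypothetical helper named `trees_match` built on `diff -rq`, which reads every file on both sides; that is thorough but can be slow on very large datasets.

```bash
# Hypothetical helper: trees_match SRC DST
# Succeeds only when both directories contain the same files with
# identical contents; otherwise diff lists what differs.
trees_match() {
  if diff -rq "$1" "$2"; then
    echo "OK: $2 matches $1"
  else
    echo "MISMATCH: keep the staging copy until this is resolved" >&2
    return 1
  fi
}
```

Inside the workspace, `trees_match /shared/v1 /data/v1` prints `OK` only when the copy is complete.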

### Pattern D — Open a custom HTTP port

Need a browser drag-and-drop, a sync server, or a temporary webhook into the workspace? Open a custom HTTP or TCP port when you create the workspace (see [Workspace ports](/member/workspace/create#ports)) and serve directly from the mount path.

```bash theme={null}
# Option 1: a small browser upload form on a custom HTTP port (for example, 8000).
# These packages are only the building blocks; you still write and run the app.
pip install --no-cache-dir uvicorn fastapi python-multipart
# … or use a ready-made upload server such as filebrowser or miniserve --upload-files

# Option 2: rclone serve webdav exposes /data for writes over WebDAV
# (rclone serve http is read-only, so it cannot accept uploads)
rclone serve webdav /data --addr :8000  # then point your laptop's rclone at it
```

Use this when neither the CLI upload (Object storage) nor the SSH copy (Pattern B) fits, for example when a teammate without SSH access needs to drop files in, or when an external service pushes data to the workspace.

<Warning>
  Custom ports are reachable by anyone who can reach the workspace URL, so treat them like any other public endpoint. Require basic auth or a one-time token, or close the port as soon as the load is finished.
</Warning>
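
A one-time token can be generated on the spot instead of reusing a password. This is a minimal sketch: `new_token` is a hypothetical helper, and it assumes a Linux workspace where `/dev/urandom` is available.

```bash
# Hypothetical helper: print a random 32-character hex token
# (16 bytes read from the kernel's random source).
new_token() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

Store it with `UPLOAD_TOKEN=$(new_token)` and hand it to whichever server you run as its password; rclone's `serve` subcommands, for instance, accept `--user` and `--pass`. Discard the token when the load is finished.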

***

## Anti-pattern: do not embed data in `--cmd`

A pattern that looks tempting — especially to LLM coding agents — is to gzip+base64 a dataset into a single shell line and pass it via `--cmd`:

```bash theme={null}
# DON'T DO THIS — the API rejects --cmd over 256 KiB,
# and even shorter inline blobs make jobs hard to reproduce and observe.
vesslctl job create \
  --cmd "echo 'H4sIAA...<3 MiB of base64>...' | base64 -d | gunzip > /tmp/data && python eval.py"
```

The API rejects this: `Job.command` is capped at 256 KiB, each environment variable value at 8 KiB, and the total environment variable count at 128. Requests beyond these thresholds return 4xx.

Always upload the data to a volume (this page) and mount it instead.
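
When commands are generated by a script or an agent, a local preflight check can catch an oversized payload before the request is sent. The sketch below is not part of `vesslctl`: the helper names are hypothetical, the limits are the ones quoted on this page, and it assumes the API measures size in bytes.

```bash
# Hypothetical preflight checks, not part of vesslctl.
# Limits are taken from this page; byte-based measurement is an assumption.
MAX_CMD_BYTES=$((256 * 1024))
MAX_ENV_VALUE_BYTES=$((8 * 1024))

# byte_len STRING: print the length of STRING in bytes.
byte_len() {
  printf '%s' "$1" | wc -c
}

# cmd_fits COMMAND: succeeds when COMMAND is within the command limit.
cmd_fits() {
  [ "$(byte_len "$1")" -le "$MAX_CMD_BYTES" ]
}

# env_value_fits VALUE: succeeds when VALUE is within the per-value limit.
env_value_fits() {
  [ "$(byte_len "$1")" -le "$MAX_ENV_VALUE_BYTES" ]
}
```

A wrapper script could run `cmd_fits "$JOB_CMD" || { echo "command too large; upload the data to a volume instead" >&2; exit 1; }` before submitting, where `JOB_CMD` is whatever variable holds the command string.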

***

## Next steps

* [Understand storage](/member/volume/overview) — Cluster vs Object storage characteristics and pricing.
* [Create a volume](/member/volume/create) — provision a new Object storage volume from the console.
* [Create a workspace](/member/workspace/create) — attach volumes during workspace creation.
* [`vesslctl volume`](/cli/commands/volume) — full CLI reference for upload, download, token, and management.
