This guide walks you through fine-tuning Gemma 4 E4B using QLoRA and Unsloth on VESSL Cloud. By the end, you will have a fine-tuned adapter saved to shared storage, ready for inference or team collaboration.

Prerequisites

Before starting, make sure you have:
  • A VESSL Cloud account with credits (sign up)
  • An organization with access to A100 SXM 80 GB GPU instances
  • Basic familiarity with Python and Hugging Face Transformers
New to VESSL Cloud? Complete the Member quickstart first to set up your account, payment, and storage.

Create a workspace

Step 1: Set up storage volumes

You need two types of storage for this workflow:
| Storage type | Mount path | Purpose |
| --- | --- | --- |
| Cluster storage | /root | Home directory. Pip packages and conda environments persist across workspace restarts. |
| Object storage | /shared | Model checkpoints and outputs. Accessible from any cluster, shareable with teammates. |
Why Cluster storage at /root? Your home directory ($HOME) is where pip installs packages by default. Mounting Cluster storage here means you only run pip install once — packages survive workspace pause/resume cycles.

Why Object storage at /shared? Fine-tuned model weights need to be accessible from other workspaces or clusters. Object storage is S3-backed and reachable from anywhere, making it easy to share results with your team or deploy from a different region.

Create both volumes before launching the workspace:
  • Cluster storage: Go to Cluster storage in the sidebar and click Create new volume. See Storage overview for details.
  • Object storage: Go to Object storage in the sidebar and click Create new volume. See Create a volume for details.
Do not mount Object storage at /root. Object storage is slower than Cluster storage and is not suitable as your primary workspace path. Use /shared or another separate mount point.
Step 2: Launch the workspace

Create a new workspace with the following configuration:
| Setting | Value |
| --- | --- |
| GPU | A100 SXM 80 GB |
| GPU count | 1 |
| Image | pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel |
| Cluster storage | Your cluster volume mounted at /root |
| Object storage | Your object volume mounted at /shared |
See Create a workspace for the full creation flow.
4-bit quantization (QLoRA) keeps VRAM usage around 10-14 GB for the E4B model, well within the 80 GB available on an A100. This leaves ample headroom for larger batch sizes or longer sequences if needed.
Step 3: Connect to the workspace

Once the workspace shows Running, connect via JupyterLab or SSH. See Connect to a workspace.

Install packages

Open a terminal in your workspace and install the required libraries:
pip install unsloth trl transformers datasets
If you mounted Cluster storage at /root, these packages persist across workspace pause/resume. You only need to run this once.

Load the model

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=2048,
    load_in_4bit=True,
    full_finetuning=False,
)
What load_in_4bit=True does: Instead of loading each parameter as a 16-bit float (the default), 4-bit quantization compresses the frozen base weights to 4 bits using QLoRA's NF4 format. This cuts the weight memory roughly 4x: a ~4B-parameter model drops from roughly 8 GB of weights in 16-bit to about 2-3 GB in 4-bit. The quality loss is minimal because only the frozen base weights are quantized; the LoRA adapter trains in full precision.
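As a sanity check on those numbers, here is a back-of-envelope weight-memory calculation. It is illustrative only: the ~4B parameter count is an assumption, and real training also needs optimizer states, activations, and quantization constants on top of the weights.

```python
# Rough weight-only memory estimate. The parameter count (~4B) is an
# assumption for illustration, not an exact figure for this model.
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """GiB needed to store the weights alone."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

params = 4e9
fp16 = weight_memory_gib(params, 16)
nf4 = weight_memory_gib(params, 4)
print(f"16-bit: {fp16:.2f} GiB, 4-bit: {nf4:.2f} GiB ({fp16 / nf4:.0f}x smaller)")
# 16-bit: 7.45 GiB, 4-bit: 1.86 GiB (4x smaller)
```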

Configure LoRA adapter

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    bias="none",
    random_state=3407,
)

Hyperparameter reference

| Parameter | Value | Explanation |
| --- | --- | --- |
| r | 8 | LoRA rank — controls the capacity of the low-rank adapter. Higher values (16, 32) capture more complex patterns but use more memory and risk overfitting on small datasets. Start with 8 for general tasks. |
| lora_alpha | 8 | Scaling factor — controls how much the adapter output is amplified. Typically set equal to r. The effective learning rate of the adapter scales as lora_alpha / r. |
| lora_dropout | 0 | Dropout rate — probability of zeroing adapter outputs during training. Set to 0 for small datasets; increase to 0.05-0.1 if you see overfitting on larger datasets. |
| finetune_vision_layers | False | Gemma 4 is a multimodal model. Set to False for text-only training to skip vision encoder layers entirely. |
| finetune_language_layers | True | Train the language model layers. |
| finetune_attention_modules | True | Train the attention (Q, K, V, O) projection matrices. |
| finetune_mlp_modules | True | Train the feed-forward (MLP) layers. Together with attention modules, this covers the most impactful parts of the model. |
| bias | "none" | Do not train bias terms. Keeps the adapter small without measurable quality loss. |
| random_state | 3407 | Random seed for reproducibility. |
When to increase r:
  • Complex domain adaptation (medical, legal, code): try r=16
  • Large, diverse datasets (100k+ samples): try r=16 or r=32
  • Simple style transfer or format following: r=8 is usually sufficient
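To see why higher r costs memory, here is a sketch of the trainable-parameter arithmetic. The 2048 hidden dimension is a placeholder for illustration, not the model's actual width.

```python
# LoRA replaces a frozen d_out x d_in weight update with two low-rank
# factors, A (r x d_in) and B (d_out x r), so each adapted matrix adds
# r * (d_in + d_out) trainable parameters.
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# The adapter output is scaled by lora_alpha / r before being added back.
def lora_scaling(lora_alpha: int, r: int) -> float:
    return lora_alpha / r

print(lora_trainable_params(2048, 2048, r=8))    # 32768 per adapted matrix
print(lora_trainable_params(2048, 2048, r=16))   # 65536, doubling r doubles it
print(lora_scaling(8, 8))                        # 1.0 when alpha == r
```

This is also why lora_alpha is typically raised alongside r: keeping the lora_alpha / r ratio constant keeps the adapter's effective scale unchanged as capacity grows.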

Prepare the dataset

This example uses the FineTome-100k dataset with 3,000 samples for a quick demo. For production use, train on the full dataset or substitute your own data.
from unsloth.chat_templates import get_chat_template, standardize_data_formats
from datasets import load_dataset

tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:3000]")
dataset = standardize_data_formats(dataset)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        ).removeprefix("<bos>")
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
Your dataset should be a JSON file where each entry has a conversations field — a list of message objects with role and content:
[
  {
    "conversations": [
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."}
    ]
  },
  {
    "conversations": [
      {"role": "user", "content": "Explain photosynthesis briefly."},
      {"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen."}
    ]
  }
]
Load it with:
dataset = load_dataset("json", data_files="/shared/my-data.json", split="train")
dataset = standardize_data_formats(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)
The 3,000-sample subset is for demonstration only. For meaningful quality improvements, use the full 100k dataset or at least 10k-20k high-quality samples of your own data.
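Before training on your own file, it is worth validating that every entry matches the expected shape. The helper below is a hypothetical sketch, not part of Unsloth or datasets; run it on the parsed JSON before calling load_dataset.

```python
import json

def validate_conversations(entries):
    """Check each entry has a 'conversations' list of {role, content} dicts."""
    for i, entry in enumerate(entries):
        convo = entry.get("conversations")
        assert isinstance(convo, list) and convo, f"entry {i}: missing conversations"
        for msg in convo:
            assert msg.get("role") in {"system", "user", "assistant"}, \
                f"entry {i}: unexpected role {msg.get('role')!r}"
            assert isinstance(msg.get("content"), str), \
                f"entry {i}: content must be a string"
    return True

sample = json.loads(
    '[{"conversations": [{"role": "user", "content": "Hi"},'
    ' {"role": "assistant", "content": "Hello!"}]}]'
)
print(validate_conversations(sample))  # True
```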

Train

from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",
        output_dir="/shared/gemma4-finetuned",
    ),
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)

trainer_stats = trainer.train()

Training parameter reference

| Parameter | Value | Explanation |
| --- | --- | --- |
| per_device_train_batch_size | 1 | Number of samples per GPU per forward pass. Kept at 1 to fit within VRAM. |
| gradient_accumulation_steps | 4 | Accumulate gradients over 4 steps before updating weights. Effective batch size = 1 x 4 = 4. |
| warmup_steps | 5 | Linearly ramp up learning rate for the first 5 steps to stabilize early training. |
| max_steps | 60 | Total training steps. This is for demo purposes — for full training runs, remove max_steps and set num_train_epochs=1 (or more). |
| learning_rate | 2e-4 | Standard learning rate for QLoRA fine-tuning. |
| optim | "adamw_8bit" | 8-bit AdamW optimizer. Uses quantized optimizer states to save ~2 GB of VRAM compared to standard AdamW. |
| weight_decay | 0.001 | L2 regularization to prevent overfitting. |
| lr_scheduler_type | "linear" | Linearly decay the learning rate to zero over training. |
| report_to | "none" | Disable external logging (Weights & Biases, etc.). Set to "wandb" if you want experiment tracking. |
| output_dir | "/shared/gemma4-finetuned" | Save checkpoints to Object storage so they persist and are shareable. |
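One consequence of these settings worth spelling out: with max_steps=60 and an effective batch size of 4, the demo run consumes only a small slice of the 3,000 loaded samples. A quick sketch of the arithmetic:

```python
def samples_seen(max_steps: int, per_device_bs: int, grad_accum: int,
                 num_gpus: int = 1) -> int:
    """Total training examples consumed over the run."""
    return max_steps * per_device_bs * grad_accum * num_gpus

print(samples_seen(60, 1, 4))  # 240 of the 3,000 loaded samples
```

This is why the demo settings produce a visible loss curve but not a meaningfully adapted model; a real run should cover the dataset at least once.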
What train_on_responses_only does: By default, the loss is computed over the entire conversation (user + assistant turns). This option masks the user turns so the model only learns to generate the assistant responses. This improves training efficiency and prevents the model from memorizing user prompts.
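Conceptually, the masking works like the sketch below (assumed mechanics for illustration, not Unsloth's actual implementation): label positions belonging to user turns are set to -100, the index that Hugging Face cross-entropy losses ignore.

```python
IGNORE_INDEX = -100  # label value skipped by Transformers' loss computation

def mask_non_response(labels: list[int], response_start: int) -> list[int]:
    """Replace labels before the response span with IGNORE_INDEX."""
    return [tok if i >= response_start else IGNORE_INDEX
            for i, tok in enumerate(labels)]

# Toy token ids: positions 0-2 are the user turn, 3-5 the model response.
print(mask_non_response([5, 6, 7, 8, 9, 10], response_start=3))
# [-100, -100, -100, 8, 9, 10]
```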

Evaluate

After training, check how the model performs qualitatively and quantitatively.

Check training loss

The trainer_stats object returned by trainer.train() holds the final average training loss, while trainer.state.log_history records per-step metrics. A decreasing loss curve indicates the model is learning:
import json

# Print final training loss
print(f"Final training loss: {trainer_stats.training_loss:.4f}")

# Inspect per-step loss from the trainer's log history
for entry in trainer.state.log_history[-5:]:
    if "loss" in entry:
        print(f"  Step {entry['step']}: loss = {entry['loss']:.4f}")
A healthy loss curve starts high (2-3+) and decreases steadily. If loss plateaus very early, consider increasing r or using more data. If loss spikes or diverges, reduce the learning rate.
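If you want an automated version of that eyeball check, a small helper over the logged loss values can flag both failure modes. This is a hypothetical helper with arbitrary thresholds; tune them for your run length and dataset.

```python
def loss_trend(losses: list[float], window: int = 5) -> str:
    """Crude classification of the recent loss direction."""
    recent = losses[-window:]
    if recent[-1] > recent[0] * 1.5:
        return "diverging: reduce the learning rate"
    if abs(recent[-1] - recent[0]) < 0.01:
        return "plateau: try more data or a higher r"
    return "decreasing"

print(loss_trend([3.1, 2.4, 1.9, 1.5, 1.2]))  # decreasing
print(loss_trend([1.0, 1.3, 1.9, 2.6, 3.4]))  # diverging: reduce the learning rate
```

In practice you would feed it the loss values collected from trainer.state.log_history after training.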

Compare before and after

Run the same prompt through the fine-tuned model to see the effect:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Explain quantum computing in simple terms."}
    ]}
]

_ = model.generate(
    **tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to("cuda"),
    max_new_tokens=256,
    use_cache=True,
    temperature=0.7,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
For a more rigorous evaluation, split your data upfront and compute loss on the held-out portion:
# Before training, split the dataset
split = dataset.train_test_split(test_size=0.1, seed=3407)
train_dataset = split["train"]
eval_dataset = split["test"]

# Pass eval_dataset to SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=SFTConfig(
        # ... same args as above, plus:
        eval_strategy="steps",
        eval_steps=20,
        # ...
    ),
)
This reports evaluation loss at regular intervals, letting you detect overfitting (training loss decreases but eval loss increases).

Save the model

Save the fine-tuned adapter and tokenizer to Object storage:
model.save_pretrained("/shared/gemma4-finetuned/final")
tokenizer.save_pretrained("/shared/gemma4-finetuned/final")
Since /shared is backed by Object storage, the saved model is:
  • Persistent — survives workspace termination
  • Cross-cluster — accessible from workspaces in any region
  • Team-shareable — any team member with access to the volume can load the adapter
To load the adapter later from another workspace:
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=2048,
    load_in_4bit=True,
    full_finetuning=False,
)

from peft import PeftModel
model = PeftModel.from_pretrained(model, "/shared/gemma4-finetuned/final")

Gemma 4 model comparison

Choose the right Gemma 4 variant based on your GPU and task complexity:
| Model | Parameters | Recommended GPU | QLoRA VRAM estimate | Best for |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2B | T4 16 GB, L4 24 GB | ~6-8 GB | Quick experiments, edge deployment |
| Gemma 4 E4B | 4B | A100 40/80 GB, L4 24 GB | ~10-14 GB | Best balance of quality and efficiency |
| Gemma 4 12B | 12B | A100 80 GB | ~18-24 GB | Higher quality, single-GPU fine-tuning |
| Gemma 4 27B | 27B | A100 80 GB (tight), 2xA100 | ~32-40 GB | Near-frontier quality, requires more VRAM |
VRAM estimates assume 4-bit quantization with QLoRA, batch size 1, and sequence length 2048. Actual usage varies with batch size, sequence length, and gradient accumulation settings.

Next steps

  • Use your own data — Replace FineTome-100k with domain-specific conversation data for targeted improvements.
  • Try DPO or ORPO — After SFT, apply preference optimization (DPO/ORPO) to further align the model with desired behavior.
  • Scale up — Move to the 12B or 27B model for higher quality. Use r=16 or r=32 with larger datasets.
  • Export to GGUF — Convert the fine-tuned model to GGUF format for local inference with llama.cpp or Ollama.
  • Deploy as a batch job — Use VESSL Cloud batch jobs to run fine-tuning as a scheduled, reproducible pipeline.
  • Automate with vesslctl — Turn this interactive run into a one-line command with vesslctl job create. Swap the Jupyter cells for a train.py and submit it from your terminal or a CI pipeline.