This guide walks you through fine-tuning Gemma 4 E4B using QLoRA and Unsloth on VESSL Cloud. By the end, you will have a fine-tuned adapter saved to shared storage, ready for inference or team collaboration.

Prerequisites

Before starting, make sure you have:
  • A VESSL Cloud account with credits (sign up)
  • An organization with access to A100 SXM 80 GB GPU instances
  • Basic familiarity with Python and Hugging Face Transformers
New to VESSL Cloud? Complete the Member quickstart first to set up your account, payment, and storage.

Create a workspace

Step 1: Set up storage volumes

You need two types of storage for this workflow:
| Storage type | Mount path | Purpose |
| --- | --- | --- |
| Cluster storage | /root | Home directory. Pip packages and conda environments persist across workspace restarts. |
| Object storage | /shared | Model checkpoints and outputs. Accessible from any cluster, shareable with teammates. |
Why Cluster storage at /root? Your home directory ($HOME) is where pip installs packages by default. Mounting Cluster storage here means you only run pip install once — packages survive workspace pause/resume cycles.

Why Object storage at /shared? Fine-tuned model weights need to be accessible from other workspaces or clusters. Object storage is S3-backed and reachable from anywhere, making it easy to share results with your team or deploy from a different region.

Create both volumes before launching the workspace:
  • Cluster storage: Go to Cluster storage in the sidebar and click Create new volume. See Storage overview for details.
  • Object storage: Go to Object storage in the sidebar and click Create new volume. See Create a volume for details.
Do not mount Object storage at /root. Object storage is slower than Cluster storage and is not suitable as your primary workspace path. Use /shared or another separate mount point.
Step 2: Launch the workspace

Create a new workspace with the following configuration:
| Setting | Value |
| --- | --- |
| GPU | A100 SXM 80 GB |
| GPU count | 1 |
| Image | pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel |
| Cluster storage | Your cluster volume mounted at /root |
| Object storage | Your object volume mounted at /shared |
See Create a workspace for the full creation flow.
4-bit quantization (QLoRA) keeps VRAM usage around 10-14 GB for the E4B model, well within the 80 GB available on an A100. This leaves ample headroom for larger batch sizes or longer sequences if needed.
Step 3: Connect to the workspace

Once the workspace shows Running, connect via JupyterLab or SSH. See Connect to a workspace.

Install packages

Open a terminal in your workspace and install the required libraries:
pip install unsloth trl transformers datasets
If you mounted Cluster storage at /root, these packages persist across workspace pause/resume. You only need to run this once.

Load the model

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=2048,
    load_in_4bit=True,
    full_finetuning=False,
)
What load_in_4bit=True does: Instead of loading each parameter as a 16-bit float (the default), 4-bit quantization compresses the frozen base weights to 4 bits using QLoRA's NF4 format. This cuts the weight memory roughly 4x: a ~4B-parameter model drops from roughly 8 GB of weights in 16-bit to about 2-3 GB in 4-bit. The quality loss is minimal because only the frozen base weights are quantized; the LoRA adapter trains in full precision.
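As a sanity check on those numbers, here is a back-of-envelope weight-memory calculation. It is illustrative only: the ~4B parameter count is an assumption, and real training also needs optimizer states, activations, and quantization constants on top of the weights.

```python
# Rough weight-only memory estimate. The parameter count (~4B) is an
# assumption for illustration, not an exact figure for this model.
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """GiB needed to store the weights alone."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

params = 4e9
fp16 = weight_memory_gib(params, 16)
nf4 = weight_memory_gib(params, 4)
print(f"16-bit: {fp16:.2f} GiB, 4-bit: {nf4:.2f} GiB ({fp16 / nf4:.0f}x smaller)")
# 16-bit: 7.45 GiB, 4-bit: 1.86 GiB (4x smaller)
```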

Configure LoRA adapter

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    bias="none",
    random_state=3407,
)

Hyperparameter reference

| Parameter | Value | Explanation |
| --- | --- | --- |
| r | 8 | LoRA rank — controls the capacity of the low-rank adapter. Higher values (16, 32) capture more complex patterns but use more memory and risk overfitting on small datasets. Start with 8 for general tasks. |
| lora_alpha | 8 | Scaling factor — controls how much the adapter output is amplified. Typically set equal to r. The effective learning rate of the adapter scales as lora_alpha / r. |
| lora_dropout | 0 | Dropout rate — probability of zeroing adapter outputs during training. Set to 0 for small datasets; increase to 0.05-0.1 if you see overfitting on larger datasets. |
| finetune_vision_layers | False | Gemma 4 is a multimodal model. Set to False for text-only training to skip vision encoder layers entirely. |
| finetune_language_layers | True | Train the language model layers. |
| finetune_attention_modules | True | Train the attention (Q, K, V, O) projection matrices. |
| finetune_mlp_modules | True | Train the feed-forward (MLP) layers. Together with attention modules, this covers the most impactful parts of the model. |
| bias | "none" | Do not train bias terms. Keeps the adapter small without measurable quality loss. |
| random_state | 3407 | Random seed for reproducibility. |
When to increase r:
  • Complex domain adaptation (medical, legal, code): try r=16
  • Large, diverse datasets (100k+ samples): try r=16 or r=32
  • Simple style transfer or format following: r=8 is usually sufficient
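To see why higher r costs memory, here is a sketch of the trainable-parameter arithmetic. The 2048 hidden dimension is a placeholder for illustration, not the model's actual width.

```python
# LoRA replaces a frozen d_out x d_in weight update with two low-rank
# factors, A (r x d_in) and B (d_out x r), so each adapted matrix adds
# r * (d_in + d_out) trainable parameters.
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# The adapter output is scaled by lora_alpha / r before being added back.
def lora_scaling(lora_alpha: int, r: int) -> float:
    return lora_alpha / r

print(lora_trainable_params(2048, 2048, r=8))    # 32768 per adapted matrix
print(lora_trainable_params(2048, 2048, r=16))   # 65536, doubling r doubles it
print(lora_scaling(8, 8))                        # 1.0 when alpha == r
```

This is also why lora_alpha is typically raised alongside r: keeping the lora_alpha / r ratio constant keeps the adapter's effective scale unchanged as capacity grows.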

Prepare the dataset

This example uses the FineTome-100k dataset with 3,000 samples for a quick demo. For production use, train on the full dataset or substitute your own data.
from unsloth.chat_templates import get_chat_template, standardize_data_formats
from datasets import load_dataset

tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:3000]")
dataset = standardize_data_formats(dataset)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        ).removeprefix("<bos>")
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
Your dataset should be a JSON file where each entry has a conversations field — a list of message objects with role and content:
[
  {
    "conversations": [
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."}
    ]
  },
  {
    "conversations": [
      {"role": "user", "content": "Explain photosynthesis briefly."},
      {"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen."}
    ]
  }
]
Load it with:
dataset = load_dataset("json", data_files="/shared/my-data.json", split="train")
dataset = standardize_data_formats(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)
The 3,000-sample subset is for demonstration only. For meaningful quality improvements, use the full 100k dataset or at least 10k-20k high-quality samples of your own data.
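Before training on your own file, it is worth validating that every entry matches the expected shape. The helper below is a hypothetical sketch, not part of Unsloth or datasets; run it on the parsed JSON before calling load_dataset.

```python
import json

def validate_conversations(entries):
    """Check each entry has a 'conversations' list of {role, content} dicts."""
    for i, entry in enumerate(entries):
        convo = entry.get("conversations")
        assert isinstance(convo, list) and convo, f"entry {i}: missing conversations"
        for msg in convo:
            assert msg.get("role") in {"system", "user", "assistant"}, \
                f"entry {i}: unexpected role {msg.get('role')!r}"
            assert isinstance(msg.get("content"), str), \
                f"entry {i}: content must be a string"
    return True

sample = json.loads(
    '[{"conversations": [{"role": "user", "content": "Hi"},'
    ' {"role": "assistant", "content": "Hello!"}]}]'
)
print(validate_conversations(sample))  # True
```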

Train

from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",
        output_dir="/shared/gemma4-finetuned",
    ),
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)

trainer_stats = trainer.train()

Training parameter reference

| Parameter | Value | Explanation |
| --- | --- | --- |
| per_device_train_batch_size | 1 | Number of samples per GPU per forward pass. Kept at 1 to fit within VRAM. |
| gradient_accumulation_steps | 4 | Accumulate gradients over 4 steps before updating weights. Effective batch size = 1 x 4 = 4. |
| warmup_steps | 5 | Linearly ramp up learning rate for the first 5 steps to stabilize early training. |
| max_steps | 60 | Total training steps. This is for demo purposes — for full training runs, remove max_steps and set num_train_epochs=1 (or more). |
| learning_rate | 2e-4 | Standard learning rate for QLoRA fine-tuning. |
| optim | "adamw_8bit" | 8-bit AdamW optimizer. Uses quantized optimizer states to save ~2 GB of VRAM compared to standard AdamW. |
| weight_decay | 0.001 | L2 regularization to prevent overfitting. |
| lr_scheduler_type | "linear" | Linearly decay the learning rate to zero over training. |
| report_to | "none" | Disable external logging (Weights & Biases, etc.). Set to "wandb" if you want experiment tracking. |
| output_dir | "/shared/gemma4-finetuned" | Save checkpoints to Object storage so they persist and are shareable. |
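One consequence of these settings worth spelling out: with max_steps=60 and an effective batch size of 4, the demo run consumes only a small slice of the 3,000 loaded samples. A quick sketch of the arithmetic:

```python
def samples_seen(max_steps: int, per_device_bs: int, grad_accum: int,
                 num_gpus: int = 1) -> int:
    """Total training examples consumed over the run."""
    return max_steps * per_device_bs * grad_accum * num_gpus

print(samples_seen(60, 1, 4))  # 240 of the 3,000 loaded samples
```

This is why the demo settings produce a visible loss curve but not a meaningfully adapted model; a real run should cover the dataset at least once.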
What train_on_responses_only does: By default, the loss is computed over the entire conversation (user + assistant turns). This option masks the user turns so the model only learns to generate the assistant responses. This improves training efficiency and prevents the model from memorizing user prompts.
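Conceptually, the masking works like the sketch below (assumed mechanics for illustration, not Unsloth's actual implementation): label positions belonging to user turns are set to -100, the index that Hugging Face cross-entropy losses ignore.

```python
IGNORE_INDEX = -100  # label value skipped by Transformers' loss computation

def mask_non_response(labels: list[int], response_start: int) -> list[int]:
    """Replace labels before the response span with IGNORE_INDEX."""
    return [tok if i >= response_start else IGNORE_INDEX
            for i, tok in enumerate(labels)]

# Toy token ids: positions 0-2 are the user turn, 3-5 the model response.
print(mask_non_response([5, 6, 7, 8, 9, 10], response_start=3))
# [-100, -100, -100, 8, 9, 10]
```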

Evaluate

After training, check how the model performs qualitatively and quantitatively.

Check training loss

The trainer_stats object returned by trainer.train() holds the final average training loss, while trainer.state.log_history records per-step metrics. A decreasing loss curve indicates the model is learning:
import json

# Print final training loss
print(f"Final training loss: {trainer_stats.training_loss:.4f}")

# Inspect per-step loss from the trainer's log history
for entry in trainer.state.log_history[-5:]:
    if "loss" in entry:
        print(f"  Step {entry['step']}: loss = {entry['loss']:.4f}")
A healthy loss curve starts high (2-3+) and decreases steadily. If loss plateaus very early, consider increasing r or using more data. If loss spikes or diverges, reduce the learning rate.
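If you want an automated version of that eyeball check, a small helper over the logged loss values can flag both failure modes. This is a hypothetical helper with arbitrary thresholds; tune them for your run length and dataset.

```python
def loss_trend(losses: list[float], window: int = 5) -> str:
    """Crude classification of the recent loss direction."""
    recent = losses[-window:]
    if recent[-1] > recent[0] * 1.5:
        return "diverging: reduce the learning rate"
    if abs(recent[-1] - recent[0]) < 0.01:
        return "plateau: try more data or a higher r"
    return "decreasing"

print(loss_trend([3.1, 2.4, 1.9, 1.5, 1.2]))  # decreasing
print(loss_trend([1.0, 1.3, 1.9, 2.6, 3.4]))  # diverging: reduce the learning rate
```

In practice you would feed it the loss values collected from trainer.state.log_history after training.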

Compare before and after

Run the same prompt through the fine-tuned model to see the effect:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Explain quantum computing in simple terms."}
    ]}
]

_ = model.generate(
    **tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to("cuda"),
    max_new_tokens=256,
    use_cache=True,
    temperature=0.7,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
For a more rigorous evaluation, split your data upfront and compute loss on the held-out portion:
# Before training, split the dataset
split = dataset.train_test_split(test_size=0.1, seed=3407)
train_dataset = split["train"]
eval_dataset = split["test"]

# Pass eval_dataset to SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=SFTConfig(
        # ... same args as above, plus:
        eval_strategy="steps",
        eval_steps=20,
        # ...
    ),
)
This reports evaluation loss at regular intervals, letting you detect overfitting (training loss decreases but eval loss increases).

Save the model

Save the fine-tuned adapter and tokenizer to Object storage:
model.save_pretrained("/shared/gemma4-finetuned/final")
tokenizer.save_pretrained("/shared/gemma4-finetuned/final")
Since /shared is backed by Object storage, the saved model is:
  • Persistent — survives workspace termination
  • Cross-cluster — accessible from workspaces in any region
  • Team-shareable — any team member with access to the volume can load the adapter
To load the adapter later from another workspace:
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it",
    max_seq_length=2048,
    load_in_4bit=True,
    full_finetuning=False,
)

from peft import PeftModel
model = PeftModel.from_pretrained(model, "/shared/gemma4-finetuned/final")

Gemma 4 model comparison

Choose the right Gemma 4 variant based on your GPU and task complexity:
| Model | Parameters | Recommended GPU | QLoRA VRAM estimate | Best for |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2B | T4 16 GB, L4 24 GB | ~6-8 GB | Quick experiments, edge deployment |
| Gemma 4 E4B | 4B | A100 40/80 GB, L4 24 GB | ~10-14 GB | Best balance of quality and efficiency |
| Gemma 4 12B | 12B | A100 80 GB | ~18-24 GB | Higher quality, single-GPU fine-tuning |
| Gemma 4 27B | 27B | A100 80 GB (tight), 2xA100 | ~32-40 GB | Near-frontier quality, requires more VRAM |
VRAM estimates assume 4-bit quantization with QLoRA, batch size 1, and sequence length 2048. Actual usage varies with batch size, sequence length, and gradient accumulation settings.

Next steps

  • Use your own data — Replace FineTome-100k with domain-specific conversation data for targeted improvements.
  • Try DPO or ORPO — After SFT, apply preference optimization (DPO/ORPO) to further align the model with desired behavior.
  • Scale up — Move to the 12B or 27B model for higher quality. Use r=16 or r=32 with larger datasets.
  • Export to GGUF — Convert the fine-tuned model to GGUF format for local inference with llama.cpp or Ollama.
  • Deploy as a batch job — Use VESSL Cloud batch jobs to run fine-tuning as a scheduled, reproducible pipeline.
  • Automate with vesslctl — Turn this interactive run into a one-line command with vesslctl job create. Swap the Jupyter cells for a train.py and submit it from your terminal or a CI pipeline.