Prerequisites
Before starting, make sure you have:

- A VESSL Cloud account with credits (sign up)
- An organization with access to A100 SXM 80 GB GPU instances
- Basic familiarity with Python and Hugging Face Transformers
Create a workspace
Set up storage volumes
You need two types of storage for this workflow:
| Storage type | Mount path | Purpose |
|---|---|---|
| Cluster storage | /root | Home directory. Pip packages and conda environments persist across workspace restarts. |
| Object storage | /shared | Model checkpoints and outputs. Accessible from any cluster, shareable with teammates. |

Why Cluster storage at /root? Your home directory ($HOME) is where pip installs packages by default. Mounting Cluster storage here means you only run pip install once — packages survive workspace pause/resume cycles.

Why Object storage at /shared? Fine-tuned model weights need to be accessible from other workspaces or clusters. Object storage is S3-backed and reachable from anywhere, making it easy to share results with your team or deploy from a different region.

Create both volumes before launching the workspace:

- Cluster storage: Go to Cluster storage in the sidebar and click Create new volume. See Storage overview for details.
- Object storage: Go to Object storage in the sidebar and click Create new volume. See Create a volume for details.
Launch the workspace
Create a new workspace with the following configuration:
See Create a workspace for the full creation flow.
| Setting | Value |
|---|---|
| GPU | A100 SXM 80 GB |
| GPU count | 1 |
| Image | pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel |
| Cluster storage | Your cluster volume mounted at /root |
| Object storage | Your object volume mounted at /shared |
4-bit quantization (QLoRA) keeps VRAM usage around 18-22 GB for the E4B model, well within the 80 GB available on an A100. This leaves headroom for larger batch sizes or longer sequences if needed.
Connect to the workspace
Once the workspace shows Running, connect via JupyterLab or SSH. See Connect to a workspace.
Install packages
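The original install cell is not shown in this guide; as a sketch, an Unsloth-style QLoRA stack would be installed roughly like this (the exact package list is an assumption, so adjust it to your setup):

```shell
# Assumed package list for a QLoRA fine-tuning workflow; verify versions
# against your framework's documentation before relying on them.
pip install --upgrade unsloth transformers datasets trl peft bitsandbytes accelerate
```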
Open a terminal in your workspace and install the required libraries.

If you mounted Cluster storage at /root, these packages persist across workspace pause/resume. You only need to run this once.

Load the model
What load_in_4bit=True does: Instead of loading each parameter as a 16-bit float (the default), 4-bit quantization compresses the weights to 4 bits using the QLoRA (NF4) technique. This reduces the model’s memory footprint by roughly 4x, allowing a model that would normally require ~32 GB of VRAM to fit in ~8-10 GB. The quality loss is minimal because only the frozen base weights are quantized — the LoRA adapter trains in full precision.
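The arithmetic behind that 4x claim can be checked directly. This rough estimate counts weight memory only (no activations, gradients, or optimizer states) and uses a hypothetical 16B-parameter model to match the ~32 GB figure above:

```python
def weight_footprint_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8

# Hypothetical 16B-parameter model: 16-bit floats vs. 4-bit NF4 quantization.
fp16 = weight_footprint_gb(16, 16)  # 32.0 GB
nf4 = weight_footprint_gb(16, 4)    # 8.0 GB
print(fp16, nf4, fp16 / nf4)        # 32.0 8.0 4.0
```

Real usage lands a little above the 4-bit number because quantization constants, the LoRA adapter, and activations add overhead, which is why the text says ~8-10 GB rather than exactly 8.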
Configure LoRA adapter
Hyperparameter reference
| Parameter | Value | Explanation |
|---|---|---|
| r | 8 | LoRA rank — controls the capacity of the low-rank adapter. Higher values (16, 32) capture more complex patterns but use more memory and risk overfitting on small datasets. Start with 8 for general tasks. |
| lora_alpha | 8 | Scaling factor — controls how much the adapter output is amplified. Typically set equal to r. The effective learning rate of the adapter scales as lora_alpha / r. |
| lora_dropout | 0 | Dropout rate — probability of zeroing adapter outputs during training. Set to 0 for small datasets; increase to 0.05-0.1 if you see overfitting on larger datasets. |
| finetune_vision_layers | False | Gemma 4 is a multimodal model. Set to False for text-only training to skip vision encoder layers entirely. |
| finetune_language_layers | True | Train the language model layers. |
| finetune_attention_modules | True | Train the attention (Q, K, V, O) projection matrices. |
| finetune_mlp_modules | True | Train the feed-forward (MLP) layers. Together with attention modules, this covers the most impactful parts of the model. |
| bias | "none" | Do not train bias terms. Keeps the adapter small without measurable quality loss. |
| random_state | 3407 | Random seed for reproducibility. |
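To make the capacity/memory tradeoff of r concrete, here is the trainable parameter count of one LoRA adapter pair next to the frozen matrix it wraps (the 4096x4096 projection size is a hypothetical example, not a Gemma dimension from this guide):

```python
def lora_adapter_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two low-rank matrices: A (d_in x r) and B (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096                               # frozen weight: 16,777,216 params
adapter = lora_adapter_params(4096, 4096, r=8)   # 65,536 trainable params
print(adapter, adapter / full)                   # ~0.4% of the full matrix
```

Doubling r to 16 doubles the adapter size; with lora_alpha kept equal to r, the lora_alpha / r scaling stays at 1.0, which is why the two are usually moved together.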
Prepare the dataset
This example uses the FineTome-100k dataset with 3,000 samples for a quick demo. For production use, train on the full dataset or substitute your own data.

Using your own data

Your dataset should be a JSON file where each entry has a conversations field — a list of message objects with role and content.

The 3,000-sample subset is for demonstration only. For meaningful quality improvements, use the full 100k dataset or at least 10k-20k high-quality samples of your own data.
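As a sketch of that record shape (field names come from the text above; the sample content is made up), written and read back with only the standard library:

```python
import json

# One record in the expected shape: a "conversations" list of role/content messages.
record = {
    "conversations": [
        {"role": "user", "content": "What is QLoRA?"},
        {"role": "assistant", "content": "QLoRA trains LoRA adapters on top of a 4-bit-quantized base model."},
    ]
}

# The dataset file is a JSON array of such records.
with open("dataset.json", "w") as f:
    json.dump([record], f)

with open("dataset.json") as f:
    data = json.load(f)

print(data[0]["conversations"][0]["role"])  # user
```

In practice you would point your dataset loader at this file; the loader call itself is omitted here since the guide does not show it.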
Train
Training parameter reference
| Parameter | Value | Explanation |
|---|---|---|
| per_device_train_batch_size | 1 | Number of samples per GPU per forward pass. Kept at 1 to fit within VRAM. |
| gradient_accumulation_steps | 4 | Accumulate gradients over 4 steps before updating weights. Effective batch size = 1 x 4 = 4. |
| warmup_steps | 5 | Linearly ramp up learning rate for the first 5 steps to stabilize early training. |
| max_steps | 60 | Total training steps. This is for demo purposes — for full training runs, remove max_steps and set num_train_epochs=1 (or more). |
| learning_rate | 2e-4 | Standard learning rate for QLoRA fine-tuning. |
| optim | "adamw_8bit" | 8-bit AdamW optimizer. Uses quantized optimizer states to save ~2 GB of VRAM compared to standard AdamW. |
| weight_decay | 0.001 | L2 regularization to prevent overfitting. |
| lr_scheduler_type | "linear" | Linearly decay the learning rate to zero over training. |
| report_to | "none" | Disable external logging (Weights & Biases, etc.). Set to "wandb" if you want experiment tracking. |
| output_dir | "/shared/gemma4-finetuned" | Save checkpoints to Object storage so they persist and are shareable. |
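The effective-batch-size claim in the table can be verified with a toy example: accumulating the gradients of 4 micro-batches of size 1, each scaled by 1 / gradient_accumulation_steps, reproduces the gradient of a single batch of 4 (a pure-Python sketch with a made-up quadratic loss, no framework required):

```python
samples = [1.0, 2.0, 3.0, 4.0]
w = 0.5  # a single scalar "weight"

def grad(w: float, x: float) -> float:
    # Gradient of the per-sample loss 0.5 * (w - x)**2 with respect to w.
    return w - x

# One real batch of 4: gradient of the mean loss over the batch.
full_batch_grad = sum(grad(w, x) for x in samples) / len(samples)

# Four micro-batches of size 1, dividing each by gradient_accumulation_steps.
accum_steps = 4
accumulated = 0.0
for x in samples:
    accumulated += grad(w, x) / accum_steps

print(full_batch_grad, accumulated)  # -2.0 -2.0
```

This is why batch size 1 plus accumulation fits in VRAM: only one sample's activations are live at a time, yet the optimizer sees the statistics of a batch of 4.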
What train_on_responses_only does: By default, the loss is computed over the entire conversation (user + assistant turns). This option masks the user turns so the model only learns to generate the assistant responses. This improves training efficiency and prevents the model from memorizing user prompts.
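The masking idea can be sketched with plain lists. The -100 ignore index follows the common PyTorch cross-entropy convention; the token values here are made up:

```python
# Each token is tagged with the conversation turn it came from.
tokens = [("user", 11), ("user", 12), ("assistant", 21), ("assistant", 22)]

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

# Keep labels only for assistant tokens; mask everything the user wrote.
labels = [tok if role == "assistant" else IGNORE_INDEX for role, tok in tokens]
print(labels)  # [-100, -100, 21, 22]
```

The model still attends to the user turns as context; they are simply excluded from the loss.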
Evaluate
After training, check how the model performs qualitatively and quantitatively.

Check training loss

The trainer_stats object contains the training log. A decreasing loss curve indicates the model is learning:
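The exact structure of trainer_stats depends on your trainer; as a hedged sketch, assuming a Hugging Face-style log_history list of dicts with a loss key, the check might look like:

```python
# A made-up log in the shape Hugging Face-style trainers record.
log_history = [
    {"step": 10, "loss": 2.31},
    {"step": 20, "loss": 1.84},
    {"step": 30, "loss": 1.52},
]

# Pull out the loss values and confirm the curve trends downward.
losses = [entry["loss"] for entry in log_history if "loss" in entry]
is_decreasing = all(b < a for a, b in zip(losses, losses[1:]))
print(losses, is_decreasing)  # [2.31, 1.84, 1.52] True
```

A strictly monotone check is stricter than you need in practice; comparing the average of the first and last few steps is more robust to step-to-step noise.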
Compare before and after
Run the same prompt through the fine-tuned model to see the effect.

Evaluating on a held-out test split

For a more rigorous evaluation, split your data upfront and compute loss on the held-out portion. This reports evaluation loss at regular intervals, letting you detect overfitting (training loss decreases but eval loss increases).
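A minimal sketch of the split and the overfitting check, using a seeded shuffle from the standard library (the 90/10 ratio is an assumption, not a value from this guide):

```python
import random

def train_eval_split(data, eval_fraction=0.1, seed=3407):
    """Deterministically shuffle and split a dataset into train/eval portions."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_eval = max(1, int(len(items) * eval_fraction))
    return items[n_eval:], items[:n_eval]

train, evaluation = train_eval_split(range(3000))
print(len(train), len(evaluation))  # 2700 300

def is_overfitting(train_losses, eval_losses):
    """Flag the classic pattern: train loss still falling while eval loss rises."""
    return train_losses[-1] < train_losses[0] and eval_losses[-1] > eval_losses[0]

print(is_overfitting([1.9, 1.5, 1.2], [1.8, 1.7, 1.9]))  # True
```

Fixing the seed (3407, matching the random_state used for the adapter) keeps the split reproducible across runs, so eval numbers stay comparable.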
Save the model
Save the fine-tuned adapter and tokenizer to Object storage. Because /shared is backed by Object storage, the saved model is:
- Persistent — survives workspace termination
- Cross-cluster — accessible from workspaces in any region
- Team-shareable — any team member with access to the volume can load the adapter
Gemma 4 model comparison
Choose the right Gemma 4 variant based on your GPU and task complexity:

| Model | Parameters | Recommended GPU | QLoRA VRAM estimate | Best for |
|---|---|---|---|---|
| Gemma 4 E2B | 2B | T4 16 GB, L4 24 GB | ~6-8 GB | Quick experiments, edge deployment |
| Gemma 4 E4B | 4B | A100 40/80 GB, L4 24 GB | ~10-14 GB | Best balance of quality and efficiency |
| Gemma 4 12B | 12B | A100 80 GB | ~18-24 GB | Higher quality, single-GPU fine-tuning |
| Gemma 4 27B | 27B | A100 80 GB (tight), 2xA100 | ~32-40 GB | Near-frontier quality, requires more VRAM |
VRAM estimates assume 4-bit quantization with QLoRA, batch size 1, and sequence length 2048. Actual usage varies with batch size, sequence length, and gradient accumulation settings.
Next steps
- Use your own data — Replace FineTome-100k with domain-specific conversation data for targeted improvements.
- Try DPO or ORPO — After SFT, apply preference optimization (DPO/ORPO) to further align the model with desired behavior.
- Scale up — Move to the 12B or 27B model for higher quality. Use r=16 or r=32 with larger datasets.
- Export to GGUF — Convert the fine-tuned model to GGUF format for local inference with llama.cpp or Ollama.
- Deploy as a batch job — Use VESSL Cloud batch jobs to run fine-tuning as a scheduled, reproducible pipeline.
- Automate with vesslctl — Turn this interactive run into a one-line command with vesslctl job create. Swap the Jupyter cells for a train.py and submit it from your terminal or a CI pipeline.
