Open-web LLMs look better at financial prediction than they are: their pretraining corpus already contains information from after any historical test window, so part of the score is look-ahead leakage rather than skill. The remedy studied in Kelly, Malamud, Schwab & Xu, “Scaling Point-in-Time Language Models” (SSRN 6681860) is point-in-time training: train only on text published before a hard cutoff date.

This guide walks through a small, reproducible version of that idea on VESSL Cloud. The recipe continues pretraining Qwen/Qwen3.5-35B-A3B-Base on a FineWeb slice clipped to Common Crawl dumps published on or before 2017-06-30, then measures the leakage premium on Japanese equities: the R² gap between an honest chronological split and a stock-disjoint split that lets temporal leakage back in.

The example is a vehicle for three VESSL Cloud patterns you can reuse for any long-running training workload:
  • One Object storage volume as the persistent backbone. The tokenized corpus, the evaluation data, and the trained checkpoints all live on a single volume that every batch job mounts.
  • Terminal-driven training with vesslctl. Edit train.py locally, submit with a wrapper script, and read metrics back from the job logs. No notebook, and no GPU lease held while you think.
  • Two arms on the same primitives. A single-H100 LoRA arm and an 8-GPU full-weight arm share the base model, the data, and the evaluation, so “train 2.6% of the parameters” and “train 100% of the parameters” can be compared head-to-head.
The full code lives in the aqr-finance cookbook recipe. This page focuses on the VESSL Cloud workflow.
Figure: Diagram of the two recipe arms: a single-H100 LoRA job and an 8-GPU full-weight FSDP2 job, both mounting one Object storage volume and scored by the same leakage evaluation.
What this experiment found.
  1. Leakage is real. Both the base model and the continued-pretrained checkpoint show a statistically significant leakage premium on Kaggle JPX equities (both 95% CIs exclude zero).
  2. One continued-pretraining pass does not significantly reduce it. Both the LoRA and full-weight arms have a premium_reduction CI that includes zero.
  3. The evaluation protocol is the next lever. A ChronoGPT control showed the premium is nearly independent of the base model’s knowledge cutoff year, which implies evaluation construction drives much of the signal.
The companion blog post covers the full analysis — ChronoGPT control, robustness checks, and what it means for a trading firm.

Prerequisites

  • A VESSL Cloud account with credits (sign up)
  • An organization with access to H100 SXM ×1 for the LoRA arm; the full-weight arm additionally needs a single-node H100 SXM ×8 spec
  • vesslctl installed and authenticated (vesslctl auth status)
  • Git and a local shell — the submit scripts run on your machine and shell out to vesslctl
  • Kaggle credentials for the evaluation data — export KAGGLE_USERNAME and KAGGLE_KEY before data prep so the job can download the JPX Tokyo Stock Exchange Prediction dataset
New to VESSL Cloud? Complete the Member quickstart first to set up your account, payment, and storage.

Run the single-H100 LoRA arm

1. Create the cache volume

Everything that must survive between jobs — the tokenized FineWeb shards, the JPX prices, the trained adapter — lives on one Object storage volume:
vesslctl volume create \
  --name aqr-finance-cache \
  --storage <your-object-storage-slug> \
  --teams <your-team>

vesslctl volume list
export AQR_CACHE_VOLUME=objvol-...        # ID of the volume you just created
Then pick a single-H100 resource spec for the training job:
vesslctl resource-spec list
export AQR_RESOURCE_SPEC=resourcespec-... # ID of the single-H100 spec you picked
The volume mount is a hard prerequisite, not an optimization. A batch job’s own disk is ephemeral — anything written there is gone when the pod terminates. A 23-hour training run that saves its checkpoint to ephemeral disk produces nothing you can evaluate.
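One way to make the persistence rule hard to violate is to have the training script refuse any checkpoint path outside the mount. The sketch below is illustrative and not taken from the cookbook: it assumes the volume is mounted at /root/.cache/aqr-finance (the path submit.sh uses), and the `checkpoints/<run_tag>` layout and both function names are hypothetical.

```python
from pathlib import Path

# Mount point that submit.sh uses for the cache volume (from this guide).
CACHE_ROOT = Path("/root/.cache/aqr-finance")


def on_cache_volume(path: str, cache_root: Path = CACHE_ROOT) -> bool:
    """Return True if `path` resolves to a location inside the mounted volume."""
    root = cache_root.resolve()
    resolved = Path(path).resolve()
    return resolved == root or root in resolved.parents


def checkpoint_dir(run_tag: str, cache_root: Path = CACHE_ROOT) -> Path:
    """Build a per-run checkpoint directory on the volume (hypothetical layout)."""
    target = cache_root / "checkpoints" / run_tag
    if not on_cache_volume(str(target), cache_root):
        # Anything outside the mount lives on ephemeral disk and is lost
        # when the pod terminates.
        raise ValueError(f"{target} is outside {cache_root}")
    return target
```

Calling `checkpoint_dir("my-run")` at the top of a training script fails fast, before any GPU hours are spent, if the save path would land on ephemeral disk.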
2. Run data prep once

Clone the cookbook and submit the one-off prep job. It streams FineWeb, keeps only Common Crawl dumps published on or before 2017-06-30, tokenizes the text with the Qwen3.5 tokenizer, and writes the shards to the volume:
git clone https://github.com/vessl-ai/vessl-cloud-cookbook.git
cd vessl-cloud-cookbook/aqr-finance
bash batch-job/prep.sh
The prep job runs on a CPU spec and takes 30 to 60 minutes. Every training job after this mounts the same cache and skips data prep entirely.
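The point-in-time filter at the heart of the prep step can be sketched in a few lines. This is a minimal illustration, not the cookbook's implementation. It assumes each FineWeb record carries a `dump` label of the form `CC-MAIN-YYYY-WW` and approximates a dump's date by the Monday of that ISO week; the rule prep.sh actually applies may differ, and all function names here are hypothetical.

```python
import re
from datetime import date

CUTOFF = date(2017, 6, 30)  # hard cutoff used in this guide
_DUMP_RE = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def dump_date(dump: str) -> date:
    """Approximate a dump's date as the Monday of its ISO week (an assumption)."""
    match = _DUMP_RE.match(dump)
    if match is None:
        raise ValueError(f"unrecognized dump label: {dump!r}")
    year, week = int(match.group(1)), int(match.group(2))
    return date.fromisocalendar(year, week, 1)  # Python 3.8+


def is_point_in_time(dump: str, cutoff: date = CUTOFF) -> bool:
    """True if the dump falls on or before the cutoff."""
    return dump_date(dump) <= cutoff


def filter_records(records, cutoff: date = CUTOFF):
    """Yield only records whose dump falls on or before the cutoff."""
    for record in records:
        if is_point_in_time(record["dump"], cutoff):
            yield record
```

Under these assumptions `CC-MAIN-2017-26` (the week starting 2017-06-26) is kept and `CC-MAIN-2017-30` is dropped. Filtering by whole dump is deliberately coarse: a record is excluded by its crawl label, never by anything in its text.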
3. Submit the training batch job

Each run lives on its own branch, so the container clones exactly the code you submitted:
git checkout -b aqr-finance/my-run
bash batch-job/submit.sh > run.log 2>&1
grep -E "^(r2_leakage_off|r2_leakage_on|val_loss_final|peak_vram_mb):" run.log
submit.sh pushes your branch, calls vesslctl job create with the cache volume mounted at /root/.cache/aqr-finance, polls until the job reaches a terminal state, and writes the full job log to run.log. Inside the container, the job clones your branch, installs dependencies, runs python train.py (about 23 hours on one H100), then python eval.py (about 2 hours). To try a variant, edit train.py, commit on a fresh aqr-finance/<tag> branch, and rerun submit.sh.
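If you want the metrics as numbers rather than grep output, a small parser over run.log is enough. The sketch below assumes each metric is logged as `key: value` at the start of a line with a numeric value, which is inferred from the grep pattern above and not confirmed against the cookbook. It also assumes `r2_leakage_on` is the stock-disjoint score and `r2_leakage_off` the chronological one. Both function names are hypothetical.

```python
import re

# Metric keys taken from the grep pattern in this guide.
_LINE_RE = re.compile(
    r"^(r2_leakage_off|r2_leakage_on|val_loss_final|peak_vram_mb):\s*(\S+)"
)


def parse_metrics(log_text: str) -> dict:
    """Collect the last reported value for each metric key in a job log."""
    metrics = {}
    for line in log_text.splitlines():
        match = _LINE_RE.match(line)
        if match is None:
            continue
        try:
            metrics[match.group(1)] = float(match.group(2))
        except ValueError:
            continue  # skip values that are not plain numbers
    return metrics


def leakage_premium(metrics: dict) -> float:
    """R² with leakage allowed minus R² with leakage blocked (assumed naming)."""
    return metrics["r2_leakage_on"] - metrics["r2_leakage_off"]
```

Typical use is `leakage_premium(parse_metrics(open("run.log").read()))`, which makes it easy to tabulate several runs side by side.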
The 35B Mixture-of-Experts (MoE) base trains as a single process on one H100 80 GB: LoRA touches about 945M parameters (~2.6%) and VRAM peaks around 74 GB. The LoRA target list is the recipe’s main gotcha — Qwen3.5 mixes Gated DeltaNet and standard attention layers, and targeting only the standard q/k/v/o projections freezes 75% of the model, so the loss diverges. The cookbook README documents the verified 12-entry target list.

What a run costs and reports

Measured on VESSL Cloud, H100 SXM ×1. The full table is in the cookbook’s benchmarks.md:
| Metric | Value |
| --- | --- |
| Wall time | ~23 h train + ~2 h eval |
| Cost | ~$60 (~$56 train + ~$4 eval at $2.39/hr) |
| Peak VRAM | ~74 GB |
| Trained parameters | ~945M (~2.6%) |
| Final train loss | 2.26 |
The evaluation builds a date-conditional embedding for each (stock, date) pair, fits a Ridge regression to next-day returns, and reports R² on two splits:
  • Chronological split — train on years up to 2020, test on 2021 and later. This is the honest score; no future information can leak in.
  • Stock-disjoint GroupKFold split — a stock never appears in both train and test, so memorizing stock identities earns nothing, but time periods mix freely, which lets temporal leakage back in.
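The difference between the two splits is easiest to see in code. The sketch below is a simplified stand-in for eval.py, not a copy of it. It assumes rows are dicts with a `stock` key and an ISO-formatted `date` string, and it assigns stocks to folds round-robin, whereas a library GroupKFold balances fold sizes. All names are hypothetical.

```python
def chronological_split(rows, last_train_year=2020):
    """Train on dates up to the end of `last_train_year`, test on later dates."""
    boundary = f"{last_train_year}-12-31"  # ISO dates compare correctly as strings
    train = [r for r in rows if r["date"] <= boundary]
    test = [r for r in rows if r["date"] > boundary]
    return train, test


def stock_disjoint_folds(rows, n_folds=5):
    """Yield (train, test) pairs where each stock lands in exactly one test fold."""
    stocks = sorted({r["stock"] for r in rows})
    fold_of = {stock: i % n_folds for i, stock in enumerate(stocks)}
    for fold in range(n_folds):
        train = [r for r in rows if fold_of[r["stock"]] != fold]
        test = [r for r in rows if fold_of[r["stock"]] == fold]
        yield train, test


def shared_dates(train, test):
    """Dates present on both sides of a split; non-empty means time periods mix."""
    return {r["date"] for r in train} & {r["date"] for r in test}
```

For the chronological split `shared_dates` is always empty. For the stock-disjoint folds it is generally non-empty, which is the channel through which temporal leakage returns.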
The gap between the two is the leakage premium, and a clustered bootstrap over stocks puts a 95% confidence interval (CI) on it:
  • Leakage is real. The premium is statistically significant for both the base model (0.22) and the LoRA adapter (0.14); both CIs exclude zero.
  • One pass does not remove it. The premium reduction (0.08) has a CI of [-0.04, 0.32], which crosses zero. Treat the recipe as a measuring instrument, not a leakage fix.
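A clustered bootstrap over stocks can be sketched with the standard library alone. The version below is an illustration under stated assumptions, not the code in eval.py: it takes out-of-sample `(y_true, y_pred)` pairs grouped by stock for the leaky split and the honest split, resamples whole stocks with replacement, and reports a percentile interval. The function names and the percentile method are choices made for this example.

```python
import random


def r_squared(pairs):
    """R² = 1 - SSE/SST for a list of (y_true, y_pred) pairs."""
    mean_y = sum(y for y, _ in pairs) / len(pairs)
    sse = sum((y - p) ** 2 for y, p in pairs)
    sst = sum((y - mean_y) ** 2 for y, _ in pairs)
    return 1.0 - sse / sst


def premium(stocks, leaky, honest):
    """Leakage premium for a multiset of stocks: leaky R² minus honest R²."""
    leaky_pairs = [pair for s in stocks for pair in leaky[s]]
    honest_pairs = [pair for s in stocks for pair in honest[s]]
    return r_squared(leaky_pairs) - r_squared(honest_pairs)


def clustered_bootstrap_ci(leaky, honest, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper), resampling whole stocks."""
    rng = random.Random(seed)
    stocks = sorted(leaky)  # both dicts are assumed to share the same keys
    draws = sorted(
        premium(rng.choices(stocks, k=len(stocks)), leaky, honest)
        for _ in range(n_boot)
    )
    lower = draws[int((alpha / 2) * n_boot)]
    upper = draws[int((1 - alpha / 2) * n_boot) - 1]
    return premium(stocks, leaky, honest), lower, upper
```

Resampling stocks, not individual rows, keeps each stock's observations together, so correlation within a stock does not make the interval look narrower than it is. "The CI excludes zero" then corresponds to `lower > 0`.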

Scale up to the 8-GPU full-weight arm

A single 80 GB H100 cannot full-weight train a 35B model: weights, gradients, and optimizer state together exceed 140 GB. The companion multigpu recipe uses axolotl to shard them with FSDP2 (PyTorch Fully Sharded Data Parallel) across a single H100 SXM ×8 node, training all 35B parameters at about 51 GB per GPU:
export AQR_CACHE_VOLUME=objvol-...        # the same volume as the LoRA arm
export AQR_RESOURCE_SPEC=resourcespec-... # an 8-GPU H100 SXM single-node spec
bash multigpu/stage3_real_v2_submit.sh
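The memory claim behind this arm can be checked with back-of-the-envelope arithmetic. The sketch below assumes bf16 weights and gradients (2 bytes per parameter each) and decimal gigabytes. The recipe's actual precisions are not stated in this guide, and optimizer state is left as a parameter because it depends on the optimizer. Activations and buffers are ignored.

```python
GB = 1e9  # decimal gigabytes


def training_state_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=0):
    """Estimate persistent training state; bytes per parameter are assumptions."""
    return {
        "weights": n_params * weight_bytes / GB,
        "gradients": n_params * grad_bytes / GB,
        "optimizer": n_params * optim_bytes / GB,
    }


def fits_on_one_gpu(n_params, gpu_gb=80, **bytes_per_param):
    """True if the estimated state alone fits in a single GPU's memory."""
    return sum(training_state_gb(n_params, **bytes_per_param).values()) <= gpu_gb


state = training_state_gb(35e9)
weights_plus_grads = state["weights"] + state["gradients"]  # 70 GB + 70 GB
```

Under these assumptions weights and gradients already reach 140 GB before any optimizer state is counted, so `fits_on_one_gpu(35e9)` is false and the state has to be sharded across devices.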
From the platform side this is the same pattern as the LoRA arm — one batch job, one volume mount, metrics from the job logs:
| | LoRA arm | Full-weight arm |
| --- | --- | --- |
| GPU | H100 SXM ×1 | H100 SXM ×8 (single node) |
| Trained parameters | ~945M (~2.6%) | 35B (100%) |
| Wall time | ~23 h | ~18.6 h |
| Cost | ~$60 | ~$386 |
| Peak VRAM | ~74 GB | ~51 GB per GPU |
The FSDP2 configuration that makes a 35B full-weight run fit on 8×80 GB is non-obvious. The multigpu README documents each invariant:
  • Gradient accumulation must stay at 1. Anything higher accumulates unsharded full-model gradients and blows the VRAM budget.
  • 8-bit optimizers fail on FSDP2’s DTensor.
  • Turning activation checkpointing off trades ~9 GB of headroom for about +32% throughput.
  • The checkpoint merge has to run as a separate single-GPU batch job to avoid NCCL watchdog timeouts.
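Two of these invariants can be checked mechanically before an 8-GPU job is submitted. The sketch below is a hypothetical preflight check over a plain config dict, not part of the multigpu recipe. The key names `gradient_accumulation_steps` and `optimizer` are assumptions about how the config is laid out, so adjust them to match the real file.

```python
def check_fsdp2_invariants(cfg: dict) -> list:
    """Return human-readable violations of the invariants described above."""
    problems = []
    # Gradient accumulation above 1 accumulates unsharded full-model gradients.
    if cfg.get("gradient_accumulation_steps", 1) != 1:
        problems.append("gradient_accumulation_steps must be 1")
    # Normalize names such as "adamw_bnb_8bit" or "AdamW-8bit" before matching.
    optimizer = str(cfg.get("optimizer", "")).lower()
    if "8bit" in optimizer.replace("-", "").replace("_", ""):
        problems.append("8-bit optimizers fail on FSDP2's DTensor")
    return problems
```

An empty list means the config passes both checks. Running this in the submit script costs nothing and avoids paying for an 8-GPU job that fails minutes in.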
Because both arms share eval.py, the full-weight checkpoint and the LoRA adapter are scored on an identical measurement:
Figure: Forest plot of the leakage premium with bootstrap 95% confidence intervals: base model 0.20, full-weight checkpoint 0.13, and a reduction of 0.07 whose interval crosses zero.

Things worth knowing

  • Per-run cost is real: ~$60 for the LoRA arm and ~$386 for the full-weight arm. At ~23 hours, a LoRA run is effectively a full-day slot, so treat each run as expensive and check spend with vesslctl billing show.
  • This is a measurement, not an alpha recipe. Both arms’ chronological R² values are negative — neither model predicts JPX returns out-of-sample, which is expected for a leakage probe.
  • More trainable parameters did not reduce leakage. The full-weight arm’s premium reduction (0.07, CI [-0.06, 0.22]) also crosses zero. On this measurement, the lever is the evaluation protocol, not the trainable-parameter count.
  • The leakage measurement has its own precision limits. GroupKFold separates by stock identity but lets dates mix across folds, so the measured premium includes both temporal leakage and evaluation-construction noise. Embargoed and walk-forward robustness checks confirm the premium is real and not a split-edge artifact — but replacing the leaky-side GroupKFold with a purged time-series CV (a rolling window with a date-embargo gap) is still the open refinement lever. The companion blog post covers the ChronoGPT control experiment and the robustness checks in detail.
  • Kaggle credentials must be staged before prep. If prep.sh skipped the JPX download, eval.py refuses to run. Export KAGGLE_USERNAME and KAGGLE_KEY before running prep.sh (the script forwards them into the container), or stage stock_prices.csv on the cache volume manually.
  • The base model needs trust_remote_code=True. Qwen3.5’s hybrid DeltaNet and Gated Attention architecture is not yet upstream in transformers. Pin the model revision for production use.
  • Branch hygiene. Every run pushes its own aqr-finance/<tag> branch to origin. Use a fresh tag per run so reruns don’t clobber an earlier run’s commits.
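The Kaggle credential requirement above is cheap to verify locally before prep.sh is submitted. A minimal sketch with a hypothetical helper name:

```python
import os

REQUIRED = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

If `missing_kaggle_credentials()` returns a non-empty list, export the listed variables before running prep.sh so the JPX download is not skipped.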

Next steps