How GPT Audited Six ML Momentum Models — and Why Five of Them Failed

26.05.26 04:46 PM

Research Notes · 6 min read

TL;DR: We tested six machine learning models on 30 U.S. equities with walk-forward validation and embargo. Initial results from several models looked excellent. We then used GPT to systematically audit each pipeline for lookahead bias and data leakage. Five of the six models failed the audit. The one that survived — Gradient Boosting — produced modest but structurally sound results: 15.4% CAGR vs SPY’s 9.5%, with a slightly lower maximum drawdown. The bigger story isn’t the returns. It’s how GPT compressed a research audit cycle from weeks to hours.

The problem with most ML momentum backtests

A pattern we keep seeing in published quant research: the headline results look extraordinary, the methodology is described in two paragraphs, and somewhere buried in the implementation is a subtle form of lookahead bias that makes the entire result an artifact of leakage rather than signal.

The mechanisms are well-known to anyone who has built these systems professionally:

Future data in features: calculating a rolling statistic over a window that extends past the prediction date
Survivorship bias in the universe: testing on the stocks that exist today, ignoring the ones that delisted
Information leakage through preprocessing: fitting a scaler or imputer on the full dataset before splitting train and test
Target-adjacent columns: features that accidentally encode the label, whatever their names suggest
Time-split integrity violations: train and validation windows that touch or overlap, with no embargo period between them

Each one is technically subtle. None of them produce visible errors. All of them inflate backtested returns in ways that disappear immediately when the model is deployed.
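Because none of these patterns raises an error, a check has to be built deliberately. For the first pattern, one option is a perturbation test: change every observation from a cutoff date onward and confirm that no feature dated at or before the cutoff moves. The sketch below is a minimal illustration, not the pipeline used in this study. The function names, the toy price series, and the convention that a feature for date t may use only data strictly before t are all assumptions made for the example.

```python
from statistics import fmean

def trailing_mean(values, window):
    """Feature for date t: mean of the `window` observations strictly before t."""
    return [
        fmean(values[t - window:t]) if t >= window else None
        for t in range(len(values))
    ]

def leaky_mean(values, window):
    """Off-by-one variant: the window slides one step forward and includes t."""
    return [
        fmean(values[t - window + 1:t + 1]) if t + 1 >= window else None
        for t in range(len(values))
    ]

def lookahead_detected(feature_fn, values, window, cut):
    """Perturb every observation from index `cut` onward and report whether
    any feature value dated at or before `cut` changes as a result."""
    perturbed = values[:cut] + [v + 100.0 for v in values[cut:]]
    before = feature_fn(values, window)
    after = feature_fn(perturbed, window)
    return before[:cut + 1] != after[:cut + 1]

prices = [float(p) for p in range(1, 21)]  # toy series, illustration only
print(lookahead_detected(trailing_mean, prices, window=4, cut=10))  # clean feature
print(lookahead_detected(leaky_mean, prices, window=4, cut=10))     # leaky feature
```

The clean feature reports False and the leaky one reports True. The same test applies to any feature function, which makes it a cheap regression check to keep alongside a pipeline.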

The traditional defense is code review — a second pair of eyes going through the pipeline line by line. It works, but it’s slow, and it depends entirely on the reviewer recognizing the specific anti-patterns. Most reviewers miss at least one.

What we tested

We ran six machine learning models against the same setup:

Universe: 30 U.S. equities, weekly rebalancing
Strategy: Long/Cash (no shorting)
Validation: Walk-forward with embargo period between train and test windows
Period: Multi-year out-of-sample window

The six models: Gradient Boosting, Random Forest, XGBoost, Logistic Regression, Linear SVM, and a small Neural Network. Each was given the same features, the same labels, and the same validation framework.
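For readers unfamiliar with the validation scheme, here is a minimal sketch of a rolling walk-forward splitter with an embargo. The window lengths are placeholders, since this post does not state the ones used, and the function name `walk_forward_splits` is illustrative.

```python
def walk_forward_splits(n_periods, train_size, test_size, embargo=1):
    """Yield (train_idx, test_idx) lists for a rolling walk-forward scheme.

    `embargo` periods are skipped between the end of each training window
    and the start of the test window that follows it.
    """
    if embargo < 0:
        raise ValueError("embargo must be non-negative")
    start = 0
    while start + train_size + embargo + test_size <= n_periods:
        train_end = start + train_size        # exclusive
        test_start = train_end + embargo      # first index after the embargo gap
        test_end = test_start + test_size     # exclusive
        yield list(range(start, train_end)), list(range(test_start, test_end))
        start += test_size                    # roll forward by one test block

# Placeholder sizes: 20 weekly periods, 8-week train, 4-week test, 1-week embargo.
for train_idx, test_idx in walk_forward_splits(20, train_size=8, test_size=4, embargo=1):
    print(train_idx[0], train_idx[-1], "->", test_idx[0], test_idx[-1])
```

Making the embargo an explicit argument, and asserting on the gap in a unit test, is one way to keep it from silently becoming zero.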

Initial results were mixed but interesting. Several models produced backtested CAGRs in the low-to-mid teens. Two showed in-sample Sharpe ratios above 1.0.

We then ran the GPT audit.

What GPT caught

We used GPT-4 with a structured prompting framework to systematically check each model’s pipeline for the leakage patterns above. The prompts walk through the data flow at each stage and ask targeted questions:

Are any features calculated using a window that includes data after the prediction date?
Is any preprocessing step fit on data that includes the test set?
Does the universe construction depend on knowing which stocks survive to the present day?
Is the target column derivable from any feature column through a deterministic transformation?
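To make the idea of a structured checklist concrete, the sketch below assembles one prompt per pipeline stage and question. It is a hypothetical illustration: the prompt wording, the function name `build_audit_prompts`, and the stage names are assumptions, not the actual framework, and the sketch makes no API call.

```python
AUDIT_QUESTIONS = [
    "Are any features calculated using a window that includes data after the prediction date?",
    "Is any preprocessing step fit on data that includes the test set?",
    "Does the universe construction depend on knowing which stocks survive to the present day?",
    "Is the target column derivable from any feature column through a deterministic transformation?",
]

def build_audit_prompts(model_name, stage_descriptions):
    """Return one prompt string per (pipeline stage, question) pair.

    `stage_descriptions` maps a stage name to the code or prose describing it.
    """
    prompts = []
    for stage, description in stage_descriptions.items():
        for question in AUDIT_QUESTIONS:
            prompts.append(
                f"Model: {model_name}\n"
                f"Pipeline stage: {stage}\n"
                f"Stage details:\n{description}\n\n"
                f"Question: {question}\n"
                "Answer yes or no, then cite the exact lines that justify the answer."
            )
    return prompts

# Hypothetical stage descriptions, for illustration only.
stages = {
    "feature engineering": "rolling 12-week return, rolling 4-week volatility",
    "train/test split": "walk-forward, 1-week embargo",
}
prompts = build_audit_prompts("Random Forest", stages)
print(len(prompts))
print(prompts[0])
```

Asking each question against each stage separately keeps the answers narrow and easy to verify by hand.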

Five of the six models had at least one violation that GPT identified within the first audit pass. Specific findings:

Two models had a rolling-window feature whose lookback period extended forward by one period — a classic off-by-one error in the windowing logic
One model had a preprocessing step (a robust scaler) fit on the entire dataset before the train/test split, leaking distributional information about future periods into the training set
One model had a feature that was a deterministic transformation of the target variable — invisible because the column names were unrelated, but mathematically equivalent
One model had an embargo period of zero weeks instead of the intended one week: another off-by-one error, this time in the validation split
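The third finding, a feature that is a disguised copy of the target, can also be screened for mechanically. A minimal sketch, assuming binary labels: for each feature, find the best accuracy any single threshold achieves, and flag columns that reproduce the label almost perfectly. This only catches monotone transformations, and the column names, data, and tolerance below are made up for illustration.

```python
def best_threshold_accuracy(feature, labels):
    """Highest accuracy any single cut on `feature` achieves on binary `labels`."""
    pairs = sorted(zip(feature, labels))
    n = len(pairs)
    best = 0.0
    for k in range(n + 1):
        if 0 < k < n and pairs[k - 1][0] == pairs[k][0]:
            continue  # a cut cannot separate equal feature values
        # Rule: predict 0 for the k smallest values and 1 for the rest.
        correct = sum(1 for _, y in pairs[:k] if y == 0)
        correct += sum(1 for _, y in pairs[k:] if y == 1)
        best = max(best, correct / n, (n - correct) / n)  # also try the inverted rule
    return best

def flag_label_proxies(features, labels, tol=0.99):
    """Return names of feature columns that almost perfectly reproduce the label."""
    return [
        name for name, column in features.items()
        if best_threshold_accuracy(column, labels) >= tol
    ]

# Toy data: the label is the sign of next week's return.
next_week_return = [0.02, -0.01, 0.03, -0.04, 0.01, -0.02, 0.05, -0.03]
labels = [1 if r > 0 else 0 for r in next_week_return]
features = {
    "vol_adj_score": [3 * r + 0.5 for r in next_week_return],  # disguised target
    "momentum_12w": [0.1, 0.3, -0.2, 0.4, -0.1, 0.2, 0.0, -0.3],
}
print(flag_label_proxies(features, labels))
```

On this toy data only `vol_adj_score` is flagged, even though nothing in its name hints at the target. A real screen would run on training data only and treat a flag as a reason to trace the column's derivation, not as proof of leakage.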

Could a careful human have found all of these? Yes, eventually. Each one is a known pattern. None of them required novel insight. The point isn’t that GPT is smarter than a senior quant researcher. The point is that GPT can run in two hours an audit checklist that would take a human reviewer two weeks, and it doesn’t get tired or skip steps.

What survived

The one model that passed the audit cleanly was Gradient Boosting (specifically HistGradientBoostingClassifier from scikit-learn). Its walk-forward validated results:

CAGR: 15.4% vs SPY’s 9.5%
Sharpe ratio: 0.717 vs SPY’s 0.385
Maximum drawdown: −22.6% vs SPY’s −23.9%
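For reference, the three metrics above can be computed from a series of weekly strategy returns as sketched below. The annualization factor of 52 and the zero risk-free rate are assumptions, since this post does not state the conventions used, and the sample returns are invented.

```python
import math
from statistics import fmean, stdev

PERIODS_PER_YEAR = 52  # weekly data; the annualization convention is an assumption

def cagr(returns, periods_per_year=PERIODS_PER_YEAR):
    """Compound annual growth rate from simple per-period returns."""
    growth = math.prod(1.0 + r for r in returns)
    years = len(returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0

def sharpe(returns, risk_free=0.0, periods_per_year=PERIODS_PER_YEAR):
    """Annualized Sharpe ratio; `risk_free` is an annual rate (assumed 0 here)."""
    excess = [r - risk_free / periods_per_year for r in returns]
    return fmean(excess) / stdev(excess) * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a negative number."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

# Invented weekly returns, for illustration only.
weekly_returns = [0.012, -0.008, 0.005, 0.020, -0.015, 0.007, 0.003, -0.004]
print(round(cagr(weekly_returns), 4))
print(round(sharpe(weekly_returns), 3))
print(round(max_drawdown(weekly_returns), 4))
```

Applying the same three functions to the benchmark's return series over the same dates keeps the comparison on equal terms.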

The CAGR is fine. The Sharpe is fine. The number that matters is the max drawdown.

A strategy that outperforms the benchmark while drawing down slightly less isn’t just profitable; it’s structurally sound. Strategies that beat the market by taking more risk are easy to construct; the question is always whether the risk premium is genuine or whether the model has fit to a risk factor that hasn’t shown up yet. A model that produces better returns with lower drawdown over a multi-year out-of-sample window is the kind of result that survives the transition from research to production.

The other five models were rejected, not adjusted. We didn’t try to fix the leakage — we just removed them from the analysis. This is intentional. The point of validation isn’t to find the best version of every model; it’s to find the models whose first-run pipeline survives audit. Anything else is a slippery slope toward retrofitting the validation to the result.

The bigger insight

The Gradient Boosting numbers are nice. They’re not the point.

The point is that we ran a six-model audit in hours, not weeks. We caught leakage bugs in five of the six pipelines, the kind that can slip past a standard code review. And we ended up with a single model whose result we genuinely trust, not because the numbers are flashy, but because the path from data to result has been examined at every step.

This compressibility is what changes when you treat GPT as part of the research pipeline rather than as a code-completion tool. The economics of careful research shift dramatically. A workflow that previously required weeks of senior researcher time can now be run in an afternoon, which means the threshold for “is this worth auditing thoroughly?” drops to nearly zero. You audit everything. You reject most things. You ship what survives.

That’s the operating model we’ve been refining. The Research Kit below shows the full methodology, the Gradient Boosting result, and a Python walk-forward template you can run in Colab in under 60 seconds.

Get the Research Kit (free)

Two PDFs and a runnable Python notebook that walks through the methodology end-to-end:

Results one-pager — the full Gradient Boosting comparison vs SPY across CAGR, Sharpe, max drawdown, and the rolling Sharpe trajectory
Methodology one-pager — the GPT audit framework, the leakage patterns we check for, and how the prompts are structured
Walk-forward Python template — runs in Google Colab, no setup, no API keys, sample data included


If you build ML systems for equity research and run into the leakage patterns above, the methodology PDF is worth the two minutes to skim. If you adapt the framework to a different asset class or frequency, I’d genuinely appreciate a note about what you find.

Mehrzad Mahdavi, PhD — Founder, Digital Hub Insights. 30 years in quantitative research and financial technology. Former Executive Director, Financial Data Professional Institute.