5.5 Backtesting Properly

Understand how to test a strategy on historical data without lying to yourself. Avoid the seven deadly sins of backtesting that make profitable backtests fail in live trading.

Layer 5: The Meta Game, Chapter 5

The Core Idea

A bad backtest will lie to you with mathematical confidence.

You can build a backtest that shows 500% annual returns and a 95% win rate. Then you go live and lose money on the first 20 trades. What went wrong? Not the live market. The backtest was contaminated with one or more of the biases below.

This chapter teaches you how to backtest in a way you can actually trust.

What a Backtest Is

A backtest applies your strategy rules to historical data and simulates what would have happened if you had traded it.

What you need

A precise rule set. Entry, exit, stop, size — no judgment calls.
Historical data. Price + volume, ideally with enough granularity (daily for swing, intraday for scalping).
Realistic execution assumptions. Slippage, commissions, fills at realistic prices.
A measurable outcome. P&L, win rate, drawdown, Sharpe ratio.

What you produce

Equity curve over the test period
Distribution of trades (wins/losses, sizes, durations)
Max drawdown
Performance vs benchmark (typically S&P 500)

The Seven Deadly Sins of Backtesting

Sin 1: Lookahead Bias

You use information that wasn't available at the time of the trade.

Examples:

Calculating an MA crossover using today's close, then "buying" at yesterday's close
Using earnings results to predict the earnings reaction
Filtering for "stocks that doubled this year" before they doubled

Fix: At every decision point, only use data available strictly before that point. Be paranoid about this. Even one bar of lookahead destroys the validity.
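
A minimal sketch of the fix for the MA crossover case, assuming daily bars held as plain lists of opens and closes. The function names, the 3/5 periods, and the sample prices are illustrative, not part of any library. The signal is computed from closes up to bar t, and the fill is taken at the open of bar t + 1.

```python
def sma(values, period, end):
    """Average of the `period` values ending at index `end` (inclusive)."""
    window = values[end - period + 1 : end + 1]
    return sum(window) / period


def crossover_entries(opens, closes, fast=3, slow=5):
    """Find long entries for a fast/slow SMA crossover without lookahead.

    The signal on bar t uses closes up to and including bar t.
    The fill is taken at the OPEN of bar t + 1, the first price
    that is actually tradable once the signal exists.
    """
    entries = []
    # Start at `slow` so the previous bar also has a full slow window;
    # stop one bar early because the fill needs a next bar.
    for t in range(slow, len(closes) - 1):
        was_above = sma(closes, fast, t - 1) > sma(closes, slow, t - 1)
        is_above = sma(closes, fast, t) > sma(closes, slow, t)
        if is_above and not was_above:
            entries.append((t + 1, opens[t + 1]))
    return entries


# Hypothetical bars: the crossover is confirmed by the close of bar 8.
closes = [10, 10, 10, 10, 10, 10, 9, 9, 13, 13, 14]
opens = [10, 10, 10, 10, 10, 10, 10, 9, 9.5, 13.2, 13.5]
entries = crossover_entries(opens, closes)
```

With these numbers the only entry is on bar 9 at that bar's open. A version that "bought" at the close of bar 7 would have booked a much better price that was never available after the signal.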

Sin 2: Survivorship Bias

You backtest on stocks that exist today, ignoring stocks that went bankrupt, got delisted, or got acquired during the test period.

Why it lies: Your universe is the "winners." You're systematically excluding failures. A strategy might look amazing because every stock that disappeared isn't in your test set.

Famous example: Backtesting "buy and hold" on the current S&P 500 looks great because you're holding the survivors. Past S&P 500 constituents include Enron, Lehman Brothers, and Eastman Kodak, all of which went to zero or near-zero.

Fix: Use survivorship-bias-free data sets (e.g., CRSP, certain Polygon/IB data feeds). For retail testing, at minimum acknowledge the bias is present in free data like yfinance.

Sin 3: Look-back Optimization (Curve Fitting)

You tweak your parameters on the same data you're testing on, until you find the magic settings.

Example: You test every MA period from 5 to 40. The 23-period MA gives the best returns. You declare 23 is the magic number.

Why it lies: You found the parameter that best fits this specific historical sequence. Run on different data, it underperforms because 23 had no real edge — it just happened to fit the noise of that period.

Fix: Out-of-sample testing.

Split your data: 70% in-sample, 30% out-of-sample.
Optimize parameters on the in-sample data only.
Test those parameters (no further tweaking) on the out-of-sample data.
If out-of-sample matches in-sample, you may have a real edge. If out-of-sample is much worse, you curve-fit.

Sin 4: P-Hacking / Multiple Comparisons

You test 100 different strategies. Five of them show "significant" returns. You declare those five work.

Why it lies: If none of the 100 strategies has a real edge, you'd still expect ~5 to show "significance" at 95% confidence (p < 0.05) purely by chance. The five winners might be statistical luck, not real edges.

Fix:

Have a thesis BEFORE running the backtest, not after
Apply Bonferroni or similar correction if you must test multiple variants
Use out-of-sample data to validate
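
To make the arithmetic concrete, here is a small sketch of the expected number of false positives and of a Bonferroni correction. The p-values are made up for illustration, and the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected count of 'significant' results when no test has a real edge."""
    return n_tests * alpha


def bonferroni_survivors(p_values, alpha=0.05):
    """Indices of tests that stay significant after a Bonferroni correction.

    Bonferroni divides the significance threshold by the number of tests,
    so each individual test must clear a much stricter bar.
    """
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical results from 100 strategy variants:
# five look "significant" at p = 0.03, one is very strong, the rest are noise.
p_values = [0.03] * 5 + [0.0001] + [0.5] * 94
naive_winners = [i for i, p in enumerate(p_values) if p < 0.05]
corrected_winners = bonferroni_survivors(p_values)
```

Here six variants pass the naive 0.05 cutoff, but only the one with p = 0.0001 clears the corrected threshold of 0.05 / 100 = 0.0005. Bonferroni is conservative, so treat it as a filter, not a proof.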

Sin 5: Unrealistic Execution Assumptions

Your backtest assumes you got filled at exact prices with no slippage or commissions.

Reality:

Stop-loss orders fill at the next available price, often worse than your stop
Market orders incur bid-ask spread + slippage (often 5-50 bps in liquid stocks, more in illiquid)
Commissions, while now near-zero for stocks, exist for options
Gaps mean your stop at $48 fills at $42 if the stock opens there

Fix:

Assume 5-15 bps slippage on entries and exits in liquid stocks
Assume 50 bps+ in less liquid stocks
Model gap risk explicitly for swing trades held overnight
For options, model bid-ask spread (often 5-10% of contract value in lower-volume strikes)
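
One way to sketch these assumptions in code: a slippage adjustment in basis points, plus a stop-fill rule that respects gaps. Both functions are illustrative helpers for a long position on daily bars, not part of any backtesting library.

```python
def apply_slippage(price, side, bps):
    """Worsen a fill price by `bps` basis points.

    Buys pay more than the quoted price, sells receive less.
    """
    adjustment = price * bps / 10_000
    return price + adjustment if side == "buy" else price - adjustment


def stop_fill_price(stop, bar_open, bar_low):
    """Fill price for a long position's sell stop on one bar, or None.

    If the bar opens at or below the stop (a gap down), the order
    fills at the open, not at the stop price.
    """
    if bar_open <= stop:
        return bar_open      # gapped through the stop
    if bar_low <= stop:
        return stop          # traded down through the stop intraday
    return None              # stop not touched on this bar


# The gap case from the text: stop at $48, stock opens at $42.
gap_fill = stop_fill_price(stop=48.0, bar_open=42.0, bar_low=41.0)
# Then apply 10 bps of slippage on top of the gap fill (assumed figure).
realistic_exit = apply_slippage(gap_fill, "sell", bps=10)
```

The intraday case still assumes a fill exactly at the stop before slippage, which is optimistic in fast markets. That is why the slippage adjustment is applied on top of it.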

Sin 6: Regime Blindness

You backtest a strategy that only works in one market regime.

Example: "Buy any pullback" worked spectacularly from 2009-2021. From 1973-1982, it would have died. From 2000-2003, same. A backtest covering only 2009-2021 will show this as a guaranteed strategy.

Why it lies: You're testing on one market regime. The strategy isn't robust; it's regime-dependent.

Fix:

Test across multiple market regimes: bull market, bear market, choppy/range-bound, high volatility, low volatility
Look at performance separately in different regimes
If the strategy only works in bull markets, know that and trade accordingly (and stop in bears)
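
A sketch of looking at performance separately by regime, assuming each trade has already been tagged with a regime label. How you assign the labels (for example, index above or below a long moving average) is your own rule and is not defined here. The trade data is made up.

```python
from collections import defaultdict


def performance_by_regime(trades):
    """Summarize trades per regime.

    `trades` is an iterable of (regime_label, trade_return) pairs,
    where trade_return is a fraction (0.02 means +2%).
    """
    buckets = defaultdict(list)
    for regime, trade_return in trades:
        buckets[regime].append(trade_return)

    summary = {}
    for regime, returns in buckets.items():
        summary[regime] = {
            "trades": len(returns),
            "win_rate": sum(r > 0 for r in returns) / len(returns),
            "avg_return": sum(returns) / len(returns),
        }
    return summary


# Hypothetical tagged trades.
trades = [
    ("bull", 0.04), ("bull", 0.02), ("bull", -0.01),
    ("bear", -0.03), ("bear", 0.01),
]
summary = performance_by_regime(trades)
```

If one regime carries all the profit while another shows a negative average return, the blended numbers are hiding a regime-dependent strategy.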

Sin 7: Selection Bias on the Trade Universe

You test only on "stocks like NVDA, TSLA, AMD" because they fit your idea. But you cherry-picked the test universe based on what you already know happened.

Example: You test a momentum strategy on the top 20 performers of the decade. It looks amazing. Of course — you handpicked the winners.

Fix: Define your universe rule (e.g., "all stocks with avg daily volume >$50M and price >$10") at the START of the test period, not based on hindsight.
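
As a rough sketch, the universe rule can be a plain filter applied to a snapshot taken on the first day of the test period. The tickers and numbers below are placeholders, and the snapshot format (ticker mapped to average daily dollar volume and price) is an assumption for illustration.

```python
def build_universe(snapshot, min_dollar_volume=50_000_000, min_price=10.0):
    """Select tickers using ONLY data known at the start of the test period.

    `snapshot` maps ticker -> (avg_daily_dollar_volume, price), both
    measured as of the first day of the backtest.
    """
    return sorted(
        ticker
        for ticker, (dollar_volume, price) in snapshot.items()
        if dollar_volume > min_dollar_volume and price > min_price
    )


# Hypothetical snapshot as of the test start date.
start_snapshot = {
    "AAA": (120_000_000, 45.0),   # passes both filters
    "BBB": (80_000_000, 6.5),     # liquid, but price too low
    "CCC": (10_000_000, 150.0),   # price fine, but too illiquid
    "DDD": (55_000_000, 10.5),    # passes both filters
}
universe = build_universe(start_snapshot)
```

The point of the design is that the function never sees anything that happened after the start date, so it cannot select on later performance.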

In-Sample / Out-of-Sample Split

The single most important backtesting concept.

How to do it

In-sample period: 70% of your data. You can optimize parameters here.
Out-of-sample period: 30% of your data. You touch this exactly once, after you've finalized your parameters.
Compare results. If out-of-sample performance is significantly worse, you've curve-fit. Restart with simpler parameters or different logic.

Example

Data: 2015-2024 (10 years)
In-sample: 2015-2021 (7 years). Optimize here.
Out-of-sample: 2022-2024 (3 years). Test final parameters here.
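
The split above can be sketched as a date cutoff. The bar format and the placeholder data are assumptions. The key detail is that the split is chronological, never a random shuffle, so nothing from the future leaks into the in-sample set.

```python
from datetime import date


def chronological_split(bars, first_oos_date):
    """Split (date, close) bars into in-sample and out-of-sample by date.

    Everything before `first_oos_date` is in-sample; everything on or
    after it is out-of-sample. The data is never shuffled.
    """
    in_sample = [bar for bar in bars if bar[0] < first_oos_date]
    out_of_sample = [bar for bar in bars if bar[0] >= first_oos_date]
    return in_sample, out_of_sample


# Placeholder data: one bar per year, 2015 through 2024.
bars = [(date(year, 6, 30), 100.0 + i) for i, year in enumerate(range(2015, 2025))]
in_sample, out_of_sample = chronological_split(bars, date(2022, 1, 1))
```

Optimize only on `in_sample`. Keep `out_of_sample` unopened until the parameters are final.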

Discipline required

The hardest part is not peeking at the out-of-sample data. Once you do, it's no longer out-of-sample. If you fail out-of-sample and then "fix" your strategy and re-test, your "out-of-sample" data has now become in-sample. You need new data.

Walk-Forward Analysis

A more rigorous version of out-of-sample testing for strategies that need periodic re-optimization.

How it works

Take a 1-year training window.
Optimize parameters on that window.
Test on the next 3 months.
Slide forward 3 months.
Repeat.

Over 5 years of test periods, you produce 20 separate tests (this needs 6 years of data, because the first year is used only for training). If the strategy works in most of them, it's robust. If it works in 5 out of 20, it's regime-dependent.

This is more honest than a single in/out split, but takes more work. Worth doing for serious strategies.
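
The sliding-window procedure can be sketched as a generator of (train, test) index ranges. This version counts in months for simplicity. The function name and the half-open range convention are illustrative choices.

```python
def walk_forward_windows(total_months, train_months=12, test_months=3):
    """Build (train_range, test_range) pairs for walk-forward analysis.

    Ranges are half-open month indices: (start, end) covers
    start, start + 1, ..., end - 1. Each step slides forward by
    the length of the test window.
    """
    windows = []
    start = 0
    while start + train_months + test_months <= total_months:
        train = (start, start + train_months)
        test = (start + train_months, start + train_months + test_months)
        windows.append((train, test))
        start += test_months
    return windows


# 72 months of data: 12 months of initial training + 60 months of tests.
windows = walk_forward_windows(72)
```

With 72 months of data this yields 20 test windows. The first is trained on months 0-11 and tested on months 12-14, and the test windows never overlap. In each window you would re-optimize on the train range and record results only from the test range.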

Backtest Metrics That Matter

Don't just look at total return.

Metric	What It Tells You
Total return	How much money you'd have made (alone, misleading)
CAGR (annual return)	Annualized growth rate
Max drawdown	The worst peak-to-trough loss
Sharpe ratio	Excess return (over the risk-free rate) per unit of volatility
Sortino ratio	Return per unit of downside volatility
Calmar ratio	CAGR / Max Drawdown — return per unit of pain
Win rate	% of trades that won
Avg win / Avg loss	The "payoff" ratio
Profit factor	Gross profit / Gross loss
Time in market	What % of the time you're holding positions
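
A minimal sketch of several metrics from the table, computed from an equity curve and a list of per-trade P&L. The annualization factor of 252 trading days and the zero risk-free rate are assumptions you should adjust to your data.

```python
import math
from statistics import mean, stdev


def cagr(equity, years):
    """Compound annual growth rate from first to last equity value."""
    return (equity[-1] / equity[0]) ** (1 / years) - 1


def max_drawdown(equity):
    """Worst peak-to-trough loss as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualized Sharpe ratio from per-period returns."""
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def profit_factor(trade_pnls):
    """Gross profit divided by gross loss."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return gross_profit / gross_loss if gross_loss else math.inf


def win_rate(trade_pnls):
    """Fraction of trades with positive P&L."""
    return sum(p > 0 for p in trade_pnls) / len(trade_pnls)


def calmar_ratio(equity, years):
    """CAGR divided by max drawdown."""
    return cagr(equity, years) / max_drawdown(equity)


# Hypothetical inputs.
equity = [100.0, 120.0, 90.0, 130.0]
trade_pnls = [100.0, -50.0, 200.0, -100.0]
```

For this equity curve the max drawdown is the fall from 120 to 90, which is 25%. The trade list has a profit factor of 300 / 150 = 2 with a 50% win rate, which shows why win rate alone says little.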

Red flags to look for

CAGR is great but max drawdown is 70%. Real humans can't trade through 70% drawdowns.
High return but only 5 trades per year. Small sample. Might be luck.
Win rate is 90%+ on hundreds of trades. Almost certainly wrong (lookahead bias or simulation error).
Sharpe ratio above 3. Unrealistic for retail strategies. Probably curve-fit.

The Quick Sanity Check Backtest

Before building a full backtest framework, do a "back-of-the-envelope" sanity check.

Open a chart.
Manually identify 20 recent instances of your setup.
Mark entry, stop, target on each.
Tally wins, losses, average R.

This is crude but catches obvious issues fast. Does the setup even occur enough to trade? Do you have selection bias when you "see" it?

If the manual check looks promising, then invest in a proper coded backtest.
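
The manual tally can be done in a few lines once each instance is written down as an R multiple (profit or loss divided by the initial risk). The numbers below are made up, and the function name is hypothetical.

```python
def tally(r_multiples):
    """Summarize a manual check where each trade is recorded in R.

    +2.0 means the trade made twice the amount risked;
    -1.0 means it was stopped out for a full loss.
    """
    wins = [r for r in r_multiples if r > 0]
    return {
        "trades": len(r_multiples),
        "wins": len(wins),
        "losses": len(r_multiples) - len(wins),
        "win_rate": len(wins) / len(r_multiples),
        "avg_r": sum(r_multiples) / len(r_multiples),
    }


# Hypothetical result of marking up 20 chart instances:
# 9 hit a 2R target, 11 were stopped out at -1R.
manual_check = tally([2.0] * 9 + [-1.0] * 11)
```

In this example the win rate is only 45%, but the average is +0.35R per trade. Twenty hand-picked instances are still a tiny sample, so treat the result as a go/no-go signal for building the coded backtest, not as evidence of an edge.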

Tools for Retail Backtesting

Tool	Best For	Notes
TradingView Pine Script	Visual strategies, simple logic	Limited data, but accessible
Python + Backtrader	Full flexibility	Steep learning curve
Python + vectorbt	Fast vectorized backtests	More advanced
Python + zipline	Quantopian-style	Less actively maintained
QuantConnect	Cloud-based, multi-asset	Free tier limited
Custom Python	Maximum control	What pros use

For you, with your engineering background and planned tech stack (yfinance, Polygon, Alpaca), Python + a simple custom backtester or backtrader is the right tool. We'll cover this in Bonus chapter B.3.

The Honest Backtest Workflow

Articulate the thesis in plain English BEFORE coding. "I believe stocks pulling back to the 20EMA in an uptrend bounce more often than fail."
Define rules precisely. No ambiguity. A machine could trade them.
Define your universe at the START of the test period (no hindsight selection).
Get clean data. Survivorship-bias-free if possible. Check for splits, dividends, errors.
Code the strategy with no lookahead. Use only data available at the simulated time.
Split data: 70/30 in-sample/out-of-sample.
Optimize on in-sample only. Then test ONE TIME on out-of-sample.
Add realistic execution costs. Slippage, commissions, gaps.
Compute all metrics — total return, drawdown, Sharpe, Calmar, win rate, profit factor.
Run Monte Carlo simulation on the trade results to understand distribution.
Test across regimes. Bull, bear, sideways. Calculate performance in each separately.
Be skeptical. If results are amazing, look harder for the bug.
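
For the Monte Carlo step, one common approach (an assumption here, since the workflow does not name a method) is to bootstrap the trade list: resample the per-trade returns with replacement many times and look at the spread of max drawdowns. The trade returns below are invented.

```python
import random


def max_drawdown_from_returns(trade_returns):
    """Max drawdown of the equity curve built by compounding the trades.

    Assumes every return is greater than -1 (no single trade loses
    more than the whole account).
    """
    equity = peak = 1.0
    worst = 0.0
    for r in trade_returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = max(worst, 1 - equity / peak)
    return worst


def monte_carlo_drawdowns(trade_returns, n_sims=1000, seed=0):
    """Bootstrap the trade list and return sorted max drawdowns."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    results = []
    for _ in range(n_sims):
        sample = rng.choices(trade_returns, k=len(trade_returns))
        results.append(max_drawdown_from_returns(sample))
    return sorted(results)


def percentile(sorted_values, q):
    """Simple nearest-rank lookup into an already sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


# Hypothetical per-trade returns from a backtest.
trade_returns = [0.03, -0.01, 0.02, -0.02, 0.05, -0.01, -0.015, 0.04, -0.02, 0.01]
drawdowns = monte_carlo_drawdowns(trade_returns)
median_dd = percentile(drawdowns, 0.50)
bad_case_dd = percentile(drawdowns, 0.95)
```

The gap between the backtest's single drawdown and `bad_case_dd` is a rough measure of how much luck was in the historical ordering. Resampling treats trades as independent, which ignores losing streaks that cluster in bad regimes, so the real worst case can be worse.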

Common Mistakes

"This must be working — look at the equity curve!" Without out-of-sample validation, you've proven nothing.
Re-testing after a parameter tweak on the same data. That's just more in-sample fitting.
Looking at total return as if it were the only metric. Drawdown, Sharpe, and consistency matter.
Excluding "obvious bad data" without rigorous rules. "I'll skip the COVID crash because it's not normal." Now your test is biased.
Manually skipping certain trades during the backtest because they "felt wrong." That's not a backtest, that's a fantasy.
Treating one backtest run as truth. It's one sample of a noisy distribution. Run Monte Carlo.
Believing in 5-year backtests as if they cover all market regimes. They don't.

A Mental Model: The Drug Trial

A pharma company doesn't trust a drug based on one small uncontrolled study. They run:

Phase 1 (safety on a few people)
Phase 2 (efficacy on a larger group, controlled)
Phase 3 (large randomized trial, double-blind)

Even after all that, drugs fail in real-world use because the trial population didn't match real patients.

Backtesting is your Phase 1-3 trial. Out-of-sample data is your Phase 3 control group. Live trading is the real world.

Most retail traders skip straight from Phase 1 ("looks good in my Excel sheet") to giving the drug to themselves at full dose. Then they're shocked when it doesn't work.

Be the pharma company. Be rigorous. Save your money.

Practical Takeaways

Do an out-of-sample test or you have no result. Period.
Don't tweak parameters after seeing out-of-sample results. Either accept it or start over with new data.
Add realistic costs. Slippage and gaps eat far more than commissions in modern markets.
Test across regimes. A bull-market-only strategy is a half-strategy.
Be skeptical of high win rates. Above 65% on hundreds of trades is suspicious. Look for bugs.
Aim for profit factor > 2 and Sharpe > 1.5. Much higher is suspect.
Run Monte Carlo on the trade list to understand worst-case scenarios.
Don't fall in love with the equity curve. Look at the trade distribution. Are wins concentrated in 3 trades? That's not robust.
Document everything. Future you needs to know what assumptions current you made.
A good backtest is one you tried hard to break and couldn't. Adversarial testing finds the truth.

Quick Self-Check

I understand all seven deadly sins of backtesting
I will use in-sample / out-of-sample splits, not optimize on the whole dataset
I will model realistic slippage and gap risk
I will test across multiple market regimes
I will run a Monte Carlo on trade results, not just trust one equity curve
I know to be skeptical of "too good" results
I have a plan for which tool I'll use (TradingView, Python, etc.)
I will document my thesis BEFORE running the backtest
