5.5 Backtesting Properly
Layer 5: The Meta Game, Chapter 5
Goal: Understand how to test a strategy on historical data without lying to yourself. Avoid the seven deadly sins of backtesting that make profitable backtests fail in live trading.
The Core Idea
A bad backtest will lie to you with mathematical confidence.
You can build a backtest that shows 500% annual returns and a 95% win rate. Then you go live and lose money on the first 20 trades. What went wrong? Not the live market. The backtest was contaminated with one or more of the biases below.
This chapter teaches you how to backtest in a way you can actually trust.
What a Backtest Is
A backtest applies your strategy rules to historical data and simulates what would have happened if you had traded it.
What you need
- A precise rule set. Entry, exit, stop, size — no judgment calls.
- Historical data. Price + volume, ideally with enough granularity (daily for swing, intraday for scalping).
- Realistic execution assumptions. Slippage, commissions, fills at realistic prices.
- A measurable outcome. P&L, win rate, drawdown, Sharpe ratio.
What you produce
- Equity curve over the test period
- Distribution of trades (wins/losses, sizes, durations)
- Max drawdown
- Performance vs benchmark (typically S&P 500)
The Seven Deadly Sins of Backtesting
Sin 1: Lookahead Bias
You use information that wasn't available at the time of the trade.
Examples:
- Calculating an MA crossover using today's close, then "buying" at yesterday's close
- Using earnings results to predict the earnings reaction
- Filtering for "stocks that doubled this year" before they doubled
Fix: At every decision point, only use data available strictly before that point. Be paranoid about this. Even one bar of lookahead destroys the validity.
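One way to enforce this in code is to compute the signal from closes up to and including bar t, then fill at the open of bar t + 1. The sketch below is a minimal illustration, not a full backtester: the function names, the 3/5 periods, and the price lists are all made up for the example.

```python
from statistics import mean

def sma(values, period):
    """Simple moving average of the last `period` values, or None if too few."""
    if len(values) < period:
        return None
    return mean(values[-period:])

def crossover_entries(closes, opens, fast=3, slow=5):
    """Return (bar_index, fill_price) for each long entry.

    The signal on bar t uses closes[0..t] only. The fill is at opens[t + 1],
    the first tradable price after the signal is known.
    """
    entries = []
    for t in range(slow, len(closes) - 1):
        fast_now, slow_now = sma(closes[: t + 1], fast), sma(closes[: t + 1], slow)
        fast_prev, slow_prev = sma(closes[:t], fast), sma(closes[:t], slow)
        crossed_up = fast_prev <= slow_prev and fast_now > slow_now
        if crossed_up:
            entries.append((t + 1, opens[t + 1]))
    return entries

# Hypothetical prices, for illustration only.
closes = [10, 10, 10, 10, 10, 9, 9, 10, 12, 13, 14]
opens = [10, 10, 10, 10, 10, 10, 9, 9, 10.5, 12.5, 13.2]
entries = crossover_entries(closes, opens)
print(entries)
```

A useful self-test for lookahead: change the prices after a signal bar and confirm that the earlier entries do not move. If they do, the strategy is reading the future.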
Sin 2: Survivorship Bias
You backtest on stocks that exist today, ignoring stocks that went bankrupt, got delisted, or got acquired during the test period.
Why it lies: Your universe is the "winners." You're systematically excluding failures. A strategy might look amazing because every stock that disappeared isn't in your test set.
Famous example: Backtesting "buy and hold" on the current S&P 500 looks great because you're holding the survivors. Past S&P 500 constituents include Enron, Lehman Brothers, and Eastman Kodak, all of which went to zero or near-zero and then dropped out of the index.
Fix: Use survivorship-bias-free data sets (e.g., CRSP, certain Polygon/IB data feeds). For retail testing, at minimum acknowledge the bias is present in free data like yfinance.
Sin 3: Look-back Optimization (Curve Fitting)
You tweak your parameters on the same data you're testing on, until you find the magic settings.
Example: You test every MA period from 5 to 40. The 23-period MA gives the best returns. You declare 23 is the magic number.
Why it lies: You found the parameter that best fits this specific historical sequence. Run on different data, it underperforms because 23 had no real edge — it just happened to fit the noise of that period.
Fix: Out-of-sample testing.
- Split your data chronologically: the earliest 70% is in-sample, the most recent 30% is out-of-sample.
- Optimize parameters on the in-sample data only.
- Test those parameters (no further tweaking) on the out-of-sample data.
- If out-of-sample results are close to in-sample, you may have a real edge. If out-of-sample is much worse, you have curve-fit.
Sin 4: P-Hacking / Multiple Comparisons
You test 100 different strategies. Five of them show "significant" returns. You declare those five work.
Why it lies: If you run 100 tests on strategies that have no edge at all, you'd still expect about 5 to show "significance" at 95% confidence purely by chance. The five winners might be statistical luck, not real edges.
Fix:
- Have a thesis BEFORE running the backtest, not after
- Apply Bonferroni or similar correction if you must test multiple variants
- Use out-of-sample data to validate
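The arithmetic behind this sin is short enough to check directly. The sketch below assumes the 100 tests are independent and that none of the strategies has a real edge. Both are simplifying assumptions, and the variable names are illustrative.

```python
n_tests = 100   # number of strategy variants tried
alpha = 0.05    # significance level for "95% confidence"

# Expected number of false positives if no strategy has an edge.
expected_false_positives = n_tests * alpha

# Probability that at least one test looks "significant" by luck alone.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: each individual test must clear this stricter level.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")  # about 0.994
print(f"per-test threshold after Bonferroni: {bonferroni_alpha:.4f}")
```

Bonferroni is conservative, especially when the variants are highly correlated (for example, neighbouring MA periods), so treat it as a floor on skepticism, not a precise adjustment.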
Sin 5: Unrealistic Execution Assumptions
Your backtest assumes you got filled at exact prices with no slippage or commissions.
Reality:
- Stop-loss orders fill at the next available price, often worse than your stop
- Market orders incur bid-ask spread + slippage (often 5-15 bps in liquid stocks, 50 bps or more in illiquid ones)
- Commissions, while now near-zero for stocks, exist for options
- Gaps mean your stop at $48 fills at $42 if the stock opens there
Fix:
- Assume 5-15 bps slippage on entries and exits in liquid stocks
- Assume 50 bps+ in less liquid stocks
- Model gap risk explicitly for swing trades held overnight
- For options, model bid-ask spread (often 5-10% of contract value in lower-volume strikes)
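The slippage and gap rules above can be sketched as two small helpers. This is a simplified model for long positions only, and the function names and the bar fields (open, low) are assumptions for the example, not any broker's or library's API.

```python
def apply_slippage(price, side, slippage_bps):
    """Worsen a fill price by `slippage_bps` basis points (1 bp = 0.01%)."""
    adjustment = price * slippage_bps / 10_000
    # Buys fill higher than quoted, sells fill lower.
    return price + adjustment if side == "buy" else price - adjustment

def stop_fill_price(stop_price, bar_open, bar_low):
    """Fill price for a sell stop on a long position, or None if not hit."""
    if bar_open <= stop_price:
        return bar_open      # gapped through the stop: fill at the open
    if bar_low <= stop_price:
        return stop_price    # traded down through the stop intraday
    return None              # stop never touched on this bar

# The gap case from the text: stop at $48, stock opens at $42.
print(stop_fill_price(48.0, bar_open=42.0, bar_low=41.0))
# An ordinary intraday stop-out fills at the stop price.
print(stop_fill_price(48.0, bar_open=50.0, bar_low=47.5))
# 10 bps of slippage on a $100 buy.
print(apply_slippage(100.0, "buy", 10))
```

In practice you would also apply slippage on top of the stop fill, since a triggered stop becomes a market order.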
Sin 6: Regime Blindness
You backtest a strategy that only works in one market regime.
Example: "Buy any pullback" worked spectacularly from 2009-2021. From 1973-1982, it would have died. From 2000-2003, same. A backtest covering only 2009-2021 will show this as a guaranteed strategy.
Why it lies: You're testing on one market regime. The strategy isn't robust; it's regime-dependent.
Fix:
- Test across multiple market regimes: bull market, bear market, choppy/range-bound, high volatility, low volatility
- Look at performance separately in different regimes
- If the strategy only works in bull markets, know that and trade accordingly (and stop in bears)
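Looking at regimes separately can be as simple as tagging each trade with a regime label and grouping. The sketch below assumes you have already labelled each trade by some rule of your own (the labels and returns here are invented), and it reports trade count, win rate, and average return per regime.

```python
from collections import defaultdict
from statistics import mean

def performance_by_regime(trades):
    """Group (regime, return) pairs and summarize each regime separately."""
    grouped = defaultdict(list)
    for regime, ret in trades:
        grouped[regime].append(ret)
    return {
        regime: {
            "trades": len(rets),
            "win_rate": sum(r > 0 for r in rets) / len(rets),
            "avg_return": mean(rets),
        }
        for regime, rets in grouped.items()
    }

# Hypothetical trades: (regime label, fractional return per trade).
trades = [
    ("bull", 0.04), ("bull", 0.02), ("bull", -0.01),
    ("bear", -0.03), ("bear", -0.02), ("bear", 0.01),
]
summary = performance_by_regime(trades)
for regime, stats in summary.items():
    print(regime, stats)
```

The regime label must itself be free of lookahead: decide it from data available at entry, not from how the following months turned out.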
Sin 7: Selection Bias on the Trade Universe
You test only on "stocks like NVDA, TSLA, AMD" because they fit your idea. But you cherry-picked the test universe based on what you already know happened.
Example: You test a momentum strategy on the top 20 performers of the decade. It looks amazing. Of course — you handpicked the winners.
Fix: Define your universe rule (e.g., "all stocks with avg daily volume >$50M and price >$10") at the START of the test period, not based on hindsight.
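A universe rule like this is just a filter applied to a snapshot taken on the first day of the test. The sketch below assumes a hypothetical `snapshot` dict holding each ticker's price and average daily dollar volume as of the start date. The tickers and numbers are invented.

```python
def build_universe(snapshot, min_dollar_volume=50_000_000, min_price=10.0):
    """Return tickers that pass the liquidity and price rule on the start date."""
    return sorted(
        ticker
        for ticker, row in snapshot.items()
        if row["avg_dollar_volume"] > min_dollar_volume and row["price"] > min_price
    )

# Hypothetical snapshot as of the first day of the test period.
snapshot = {
    "AAA": {"price": 42.0, "avg_dollar_volume": 120_000_000},
    "BBB": {"price": 6.5, "avg_dollar_volume": 300_000_000},   # fails price rule
    "CCC": {"price": 88.0, "avg_dollar_volume": 9_000_000},    # fails volume rule
}
universe = build_universe(snapshot)
print(universe)
```

The important part is the date of the snapshot, not the filter itself: the same rule applied to today's data would quietly select stocks that grew into the thresholds.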
In-Sample / Out-of-Sample Split
The single most important backtesting concept.
How to do it
- In-sample period: 70% of your data. You can optimize parameters here.
- Out-of-sample period: 30% of your data. You touch this exactly once, after you've finalized your parameters.
- Compare results. If out-of-sample performance is significantly worse, you've curve-fit. Restart with simpler parameters or different logic.
Example
- Data: 2015-2024 (10 years)
- In-sample: 2015-2021 (7 years). Optimize here.
- Out-of-sample: 2022-2024 (3 years). Test final parameters here.
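The split in this example can be sketched as a chronological cut. This is a minimal illustration that assumes each bar is a `(date, close)` tuple. The helper name and the one-bar-per-year data are made up to keep the output readable.

```python
from datetime import date

def chronological_split(bars, in_sample_fraction=0.7):
    """Split bars by time: earliest part in-sample, most recent out-of-sample."""
    ordered = sorted(bars, key=lambda bar: bar[0])
    cut = round(len(ordered) * in_sample_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical data: one bar per year, 2015 through 2024.
bars = [(date(year, 1, 2), 100.0 + i) for i, year in enumerate(range(2015, 2025))]
in_sample, out_of_sample = chronological_split(bars)

print([bar[0].year for bar in in_sample])
print([bar[0].year for bar in out_of_sample])
```

Never shuffle before splitting. A random split leaks future bars into the in-sample set, which is lookahead bias under another name.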
Discipline required
The hardest part is not peeking at the out-of-sample data. Once you do, it's no longer out-of-sample. If you fail out-of-sample and then "fix" your strategy and re-test, your "out-of-sample" data has now become in-sample. You need new data.
Walk-Forward Analysis
A more rigorous version of out-of-sample testing for strategies that need periodic re-optimization.
How it works
- Take a 1-year training window.
- Optimize parameters on that window.
- Test on the next 3 months.
- Slide forward 3 months.
- Repeat.
With 6 years of data (1 year to seed the first training window, then 5 years of rolling tests), you produce 20 separate test periods. If the strategy works in most of them, it's robust. If it works in only 5 out of 20, it's regime-dependent.
This is more honest than a single in/out split, but takes more work. Worth doing for serious strategies.
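The window bookkeeping is where walk-forward code usually goes wrong, so it helps to generate the windows explicitly. The sketch below works in month indices (an assumption for simplicity) and returns half-open ranges. The function name is illustrative.

```python
def walk_forward_windows(n_months, train_months=12, test_months=3):
    """Return (train_start, train_end, test_start, test_end) month indices.

    Ranges are half-open: training covers [train_start, train_end) and the
    test covers [test_start, test_end), so the two never overlap.
    """
    windows = []
    start = 0
    while start + train_months + test_months <= n_months:
        train_end = start + train_months
        windows.append((start, train_end, train_end, train_end + test_months))
        start += test_months  # slide forward by one test period
    return windows

windows = walk_forward_windows(72)  # 6 years of monthly data
print(len(windows))
print(windows[0], windows[-1])
```

Note that the first training year produces no test results, so 72 months of data yields 20 test periods while 60 months yields 16.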
Backtest Metrics That Matter
Don't just look at total return.
| Metric | What It Tells You |
|---|---|
| Total return | How much money you'd have made (alone, misleading) |
| CAGR (annual return) | Annualized growth rate |
| Max drawdown | The worst peak-to-trough loss |
| Sharpe ratio | Return per unit of volatility |
| Sortino ratio | Return per unit of downside volatility |
| Calmar ratio | CAGR / Max Drawdown — return per unit of pain |
| Win rate | % of trades that won |
| Avg win / Avg loss | The "payoff" ratio |
| Profit factor | Gross profit / Gross loss |
| Time in market | What % of the time you're holding positions |
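Several of these metrics are a few lines each. The sketch below uses common textbook definitions, which is an assumption: platforms differ on details such as the risk-free rate and annualization. The function names and sample numbers are illustrative.

```python
from statistics import mean, stdev

def cagr(equity, years):
    """Compound annual growth rate from first to last equity value."""
    return (equity[-1] / equity[0]) ** (1 / years) - 1

def max_drawdown(equity):
    """Worst peak-to-trough loss as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def sharpe_ratio(period_returns, periods_per_year=252, risk_free_per_period=0.0):
    """Annualized mean excess return divided by its standard deviation."""
    excess = [r - risk_free_per_period for r in period_returns]
    return mean(excess) / stdev(excess) * periods_per_year ** 0.5

def profit_factor(trade_pnls):
    """Gross profit divided by gross loss."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return gross_profit / gross_loss if gross_loss else float("inf")

# Hypothetical inputs.
equity = [100, 120, 90, 130, 121]
print(f"max drawdown: {max_drawdown(equity):.2%}")
print(f"CAGR over 2 years, 100 -> 121: {cagr([100, 121], years=2):.2%}")
print(f"profit factor: {profit_factor([300, -100, 200, -150]):.2f}")
```

Calmar then follows as `cagr(...) / max_drawdown(...)`, and Sortino replaces the standard deviation in the Sharpe formula with the deviation of negative returns only.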
Red flags to look for
- CAGR is great but max drawdown is 70%. Real humans can't trade through 70% drawdowns.
- High return but only 5 trades per year. Small sample. Might be luck.
- Win rate is 90%+ on hundreds of trades. Almost certainly wrong (lookahead bias or simulation error).
- Sharpe ratio above 3. Unrealistic for retail strategies. Probably curve-fit.
The Quick Sanity Check Backtest
Before building a full backtest framework, do a "back-of-the-envelope" sanity check.
- Open a chart.
- Manually identify 20 recent instances of your setup.
- Mark entry, stop, target on each.
- Tally wins, losses, average R.
This is crude but catches obvious issues fast. Does the setup even occur enough to trade? Do you have selection bias when you "see" it?
If the manual check looks promising, then invest in a proper coded backtest.
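The tally from the manual check fits in a few lines once each instance is written down as an R-multiple (profit or loss divided by the initial risk). The 20 values below are invented to show the shape of the calculation, and the function name is illustrative.

```python
from statistics import mean

def tally_r_multiples(r_multiples):
    """Summarize a list of R-multiples: counts, win rate, and average R."""
    wins = [r for r in r_multiples if r > 0]
    losses = [r for r in r_multiples if r <= 0]
    return {
        "trades": len(r_multiples),
        "wins": len(wins),
        "losses": len(losses),
        "win_rate": len(wins) / len(r_multiples),
        "avg_r": mean(r_multiples),
    }

# Hypothetical manual review: 9 targets hit at +2R, 11 stops hit at -1R.
r_multiples = [2.0] * 9 + [-1.0] * 11
tally = tally_r_multiples(r_multiples)
print(tally)
```

With these made-up numbers the setup wins less than half the time and still averages a positive R per trade, which is why win rate alone is not enough. Twenty hand-picked instances remain a small and biased sample either way.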
Tools for Retail Backtesting
| Tool | Best For | Notes |
|---|---|---|
| TradingView Pine Script | Visual strategies, simple logic | Limited data, but accessible |
| Python + Backtrader | Full flexibility | Steep learning curve |
| Python + vectorbt | Fast vectorized backtests | More advanced |
| Python + zipline | Quantopian-style | Less actively maintained |
| QuantConnect | Cloud-based, multi-asset | Free tier limited |
| Custom Python | Maximum control | What pros use |
For you, with your engineering background and planned tech stack (yfinance, Polygon, Alpaca), Python + a simple custom backtester or backtrader is the right tool. We'll cover this in Bonus chapter B.3.
The Honest Backtest Workflow
- Articulate the thesis in plain English BEFORE coding. "I believe stocks pulling back to the 20EMA in an uptrend bounce more often than fail."
- Define rules precisely. No ambiguity. A machine could trade them.
- Define your universe at the START of the test period (no hindsight selection).
- Get clean data. Survivorship-bias-free if possible. Check for splits, dividends, errors.
- Code the strategy with no lookahead. Use only data available at the simulated time.
- Split data: 70/30 in-sample/out-of-sample.
- Optimize on in-sample only. Then run the final parameters ONE TIME on out-of-sample.
- Add realistic execution costs. Slippage, commissions, gaps.
- Compute all metrics: total return, drawdown, Sharpe, Calmar, win rate, profit factor.
- Run Monte Carlo simulation on the trade results to understand the distribution of outcomes.
- Test across regimes. Bull, bear, sideways. Calculate performance in each separately.
- Be skeptical. If results are amazing, look harder for the bug.
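The Monte Carlo step in this workflow can be sketched with the standard library. The version below assumes bootstrap resampling of per-trade returns with replacement, which is one common choice rather than the only one (shuffling the original order is another). The trade returns and function names are hypothetical.

```python
import random

def max_drawdown_from_trades(trade_returns, start_equity=1.0):
    """Compound trade returns in order and return the worst peak-to-trough loss."""
    equity = peak = start_equity
    worst = 0.0
    for r in trade_returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

def monte_carlo_drawdowns(trade_returns, n_runs=1000, seed=42):
    """Resample the trade list with replacement and collect max drawdowns."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    results = []
    for _ in range(n_runs):
        resampled = rng.choices(trade_returns, k=len(trade_returns))
        results.append(max_drawdown_from_trades(resampled))
    return sorted(results)

# Hypothetical per-trade fractional returns from a backtest.
trade_returns = [0.05, -0.02, 0.03, -0.04, 0.06, -0.01, 0.02, -0.03]
drawdowns = monte_carlo_drawdowns(trade_returns)
median_dd = drawdowns[len(drawdowns) // 2]
p95_dd = drawdowns[len(drawdowns) * 95 // 100]
print(f"median max drawdown: {median_dd:.2%}")
print(f"95th percentile max drawdown: {p95_dd:.2%}")
```

The single backtest drawdown is one draw from this distribution. Sizing positions against the 95th percentile is more conservative than sizing against the one path you happened to observe.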
Common Mistakes
- "This must be working — look at the equity curve!" Without out-of-sample validation, you've proven nothing.
- Re-testing after a parameter tweak on the same data. That's just more in-sample fitting.
- Looking at total return as if it were the only metric. Drawdown, Sharpe, and consistency matter.
- Excluding "obvious bad data" without rigorous rules. "I'll skip the COVID crash because it's not normal." Now your test is biased.
- Manually skipping certain trades during the backtest because they "felt wrong." That's not a backtest, that's a fantasy.
- Treating one backtest run as truth. It's one sample of a noisy distribution. Run Monte Carlo.
- Believing in 5-year backtests as if they cover all market regimes. They don't.
A Mental Model: The Drug Trial
A pharma company doesn't trust a drug based on one small uncontrolled study. They run:
- Phase 1 (safety on a few people)
- Phase 2 (efficacy on a larger group, controlled)
- Phase 3 (large randomized trial, double-blind)
Even after all that, drugs fail in real-world use because the trial population didn't match real patients.
Backtesting is your Phase 1-3 trial. Out-of-sample data is your Phase 3 control group. Live trading is the real world.
Most retail traders skip straight from Phase 1 ("looks good in my Excel sheet") to giving the drug to themselves at full dose. Then they're shocked when it doesn't work.
Be the pharma company. Be rigorous. Save your money.
Practical Takeaways
- Do an out-of-sample test or you have no result. Period.
- Don't tweak parameters after seeing out-of-sample results. Either accept it or start over with new data.
- Add realistic costs. Slippage and gaps eat far more than commissions in modern markets.
- Test across regimes. A bull-market-only strategy is a half-strategy.
- Be skeptical of high win rates. Above 65% on hundreds of trades is suspicious. Look for bugs.
- Profit factor > 2 and Sharpe > 1.5 is the goal. Much higher is suspect.
- Run Monte Carlo on the trade list to understand worst-case scenarios.
- Don't fall in love with the equity curve. Look at the trade distribution. Are wins concentrated in 3 trades? That's not robust.
- Document everything. Future you needs to know what assumptions current you made.
- A good backtest is one you tried hard to break and couldn't. Adversarial testing finds the truth.
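The "wins concentrated in 3 trades" check is easy to automate. The sketch below measures what share of gross profit comes from the largest few winners. The function name, the choice of gross (not net) profit, and the P&L figures are assumptions for the example.

```python
def top_n_profit_share(trade_pnls, n=3):
    """Fraction of gross profit contributed by the n largest winning trades."""
    winners = sorted((p for p in trade_pnls if p > 0), reverse=True)
    gross_profit = sum(winners)
    if gross_profit == 0:
        return 0.0
    return sum(winners[:n]) / gross_profit

# Hypothetical trade list: three big winners carry the whole result.
trade_pnls = [900, 800, 700, 50, 50, -100, -100]
share = top_n_profit_share(trade_pnls)
print(f"top 3 trades = {share:.0%} of gross profit")
```

A related adversarial test is to delete the best few trades and recompute the metrics. If the strategy stops being profitable, the edge rests on a handful of outliers.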
Quick Self-Check
- I understand all seven deadly sins of backtesting
- I will use in-sample / out-of-sample splits, not optimize on the whole dataset
- I will model realistic slippage and gap risk
- I will test across multiple market regimes
- I will run a Monte Carlo on trade results, not just trust one equity curve
- I know to be skeptical of "too good" results
- I have a plan for which tool I'll use (TradingView, Python, etc.)
- I will document my thesis BEFORE running the backtest