Refusing to fool yourself: deflated Sharpe under 4,550 trials

A scanner that ranks 13 strategies across 350 tickers by Sharpe ratio has a problem that no amount of careful backtesting fixes, because it is not a bug. It is a property of ranking. The top of that leaderboard is the maximum of 4,550 draws from a noisy distribution, and the expected maximum of 4,550 draws is high even when none of them has any real edge. Most of the height of the number at the top is luck.
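
To put a rough size on that, here is a minimal simulation sketch. It assumes each trial's Sharpe estimate is an independent standard-normal draw with no edge, which is a simplification (real scan results are correlated across tickers and strategies). The function name `expected_max_of_noise` is mine, for illustration only.

```python
import random
import statistics


def expected_max_of_noise(n_trials: int, n_repeats: int = 200, seed: int = 0) -> float:
    """Average, over n_repeats simulated scans, of the best of n_trials pure-noise scores."""
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(0.0, 1.0) for _ in range(n_trials))
        for _ in range(n_repeats)
    ]
    return statistics.fmean(maxima)


if __name__ == "__main__":
    # One trial, one ticker across 13 strategies, one strategy across 350 tickers, full scan.
    for n in (1, 13, 350, 4550):
        print(f"N={n:>5}  expected best-of-N noise score = {expected_max_of_noise(n):.2f}")
```

Under these assumptions the best of 4,550 no-edge trials lands around three and a half standard deviations above zero, and that is before any strategy has done anything real.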

This is the bias every “best of N” surface carries: A/B-test dashboards, model leaderboards, hyperparameter sweeps, feature-importance rankings. The more candidates you search, the more the winner’s headline number overstates its true quality. A scanner that ranks by raw Sharpe and stops there will confidently recommend noise.

This is a writeup of how the Thoth scanner (case page) refuses to do that. None of it is novel statistics. The point is that the machinery for not fooling yourself is cheap, it is testable, and most tools skip it.

Deflated Sharpe, with the trial count passed in

The standard correction is the Deflated Sharpe Ratio (Bailey and López de Prado, 2014). The idea: take the Sharpe you observed and ask how likely a Sharpe that high is to appear by chance, given how many strategies you tried. The more trials, the higher the bar.

Two pieces make it work, and Thoth implements both from primitives rather than importing a black box:

The expected maximum under the null hypothesis. Across N trials with no edge, the best Sharpe you would expect from luck alone grows with N. Thoth computes that expected maximum with a Gumbel approximation, scaled by the standard deviation of the Sharpe estimator (the square root of its variance). That is the bar the observed Sharpe has to clear.
A Sharpe estimator that knows the returns are not Gaussian. Trade returns are skewed and fat-tailed, which makes a naive Sharpe standard error wrong. The probabilistic Sharpe ratio corrects the estimator’s variance using the return series’ skew and excess kurtosis. The inverse normal CDF that the deflation needs is an Acklam rational approximation, again written out rather than pulled from a library.
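
The two pieces can be sketched in a few lines of Python. This is not Thoth's code: it follows the formulas as published by Bailey and López de Prado, it uses the standard library's `NormalDist` where Thoth writes out an Acklam approximation, and every function and parameter name here is an assumption of mine. Sharpe values are per-period (not annualized), and `sr_std` is the standard deviation of the Sharpe estimates across trials.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()  # stand-in for a hand-written Acklam inverse normal


def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Approximate expected maximum of n_trials no-edge Sharpe estimates."""
    if n_trials < 1:
        raise ValueError("n_trials must be >= 1")
    if n_trials == 1:
        return 0.0  # a single hypothesis only has to beat zero
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)


def probabilistic_sharpe(sr: float, benchmark: float, n_obs: int,
                         skew: float = 0.0, excess_kurt: float = 0.0) -> float:
    """P(true Sharpe > benchmark), with the standard error corrected for skew and tails."""
    if n_obs < 2:
        raise ValueError("need at least two observations")
    denom_sq = 1.0 - skew * sr + (excess_kurt + 2.0) / 4.0 * sr * sr
    if denom_sq <= 0.0:
        raise ValueError("moments are inconsistent with a valid return distribution")
    z = (sr - benchmark) * math.sqrt(n_obs - 1) / math.sqrt(denom_sq)
    return _STD_NORMAL.cdf(z)


def deflated_sharpe(sr: float, n_obs: int, n_trials: int, sr_std: float,
                    skew: float = 0.0, excess_kurt: float = 0.0) -> float:
    """Probabilistic Sharpe measured against the luck-only bar for n_trials."""
    bar = expected_max_sharpe(n_trials, sr_std)
    return probabilistic_sharpe(sr, bar, n_obs, skew, excess_kurt)


if __name__ == "__main__":
    n_obs = 500
    sr_std = 1.0 / math.sqrt(n_obs - 1)  # assumed: null-case spread of the estimator
    for n in (1, 50, 4550):
        print(n, round(deflated_sharpe(0.15, n_obs, n, sr_std), 4))
```

The same observed Sharpe scores lower as `n_trials` grows, and negative skew or fat tails lower it further by widening the estimator's standard error.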

The load-bearing detail is the trial count. Thoth passes it through explicitly: the scanner computes tickers × strategies, and that number flows into the deflation. A 350-ticker by 13-strategy scan deflates against N = 4,550. The scanner ranks by the deflated number, not the raw one, and a row whose deflated p-value sits above a threshold is filtered out before ranking, so a one-trade, 100%-win-rate fluke cannot top the board.
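
A minimal sketch of that flow, with the deflation itself injected as a function so the plumbing is visible. `ScanRow`, `rank_scan`, `DeflateFn` and the 0.05 cutoff are hypothetical names and values of mine, not Thoth's API, and I am assuming the deflated p-value is one minus the deflated Sharpe probability.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ScanRow:
    ticker: str
    strategy: str
    sharpe: float    # raw, undeflated Sharpe
    n_trades: int    # observations behind that Sharpe


# (sharpe, n_obs, n_trials) -> deflated Sharpe probability in [0, 1]
DeflateFn = Callable[[float, int, int], float]


def rank_scan(rows: Iterable[ScanRow], n_tickers: int, n_strategies: int,
              deflate: DeflateFn, max_p_value: float = 0.05) -> List[Tuple[float, ScanRow]]:
    """Filter on the deflated p-value, then rank survivors by the deflated number."""
    n_trials = n_tickers * n_strategies  # the whole search, not just the rows that look good
    survivors = []
    for row in rows:
        dsr = deflate(row.sharpe, row.n_trades, n_trials)
        p_value = 1.0 - dsr
        if p_value > max_p_value:
            continue  # dropped before ranking, however large the raw Sharpe
        survivors.append((dsr, row))
    survivors.sort(key=lambda pair: pair[0], reverse=True)
    return survivors
```

The design point is that `n_trials` is computed once from the shape of the scan and handed to every row, so no row can be scored as if it were the only hypothesis tested.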

The same penalty, applied where it is most tempting to cheat

Selection bias does not only happen across the scan. It happens again, quietly, every time you tune a strategy’s parameters. Grid-search a strategy over fifty parameter combinations, keep the best, report its Sharpe, and you have run a fifty-trial search whose winner is inflated by exactly the same effect.

Most backtesters report the winning cell’s in-sample number and move on. Thoth’s parameter-sensitivity search deflates the winning cell by n_trials = K (the size of the grid) and returns the whole distribution of results across the grid, not just the maximum: the mean, the median, and the interquartile range. A parameter set that only looks good at one lucky point in the grid shows up as a spike against a flat field. The system declines to fool itself at the one place it is most tempted to.
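
As a sketch of what such a result might look like, assuming the grid has already been backtested into a mapping from parameter set to Sharpe. `summarize_grid` and its field names are illustrative, and the deflation function is passed in rather than assumed.

```python
import statistics
from typing import Callable, Dict, Hashable, Mapping


def summarize_grid(grid_sharpes: Mapping[Hashable, float], n_obs: int,
                   deflate: Callable[[float, int, int], float]) -> Dict[str, object]:
    """Deflate the best cell by the grid size and report the spread of the whole grid."""
    k = len(grid_sharpes)
    if k < 2:
        raise ValueError("a sensitivity search needs at least two grid cells")
    best_params, best_sharpe = max(grid_sharpes.items(), key=lambda kv: kv[1])
    values = list(grid_sharpes.values())
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "best_params": best_params,
        "best_sharpe": best_sharpe,
        "n_trials": k,                                   # every cell counts as a trial
        "deflated": deflate(best_sharpe, n_obs, k),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }
```

A winner far above the median, with a small interquartile range around that median, is the spike against a flat field; a winner sitting close to a high median is a plateau, and far more believable.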

Point estimates lie about themselves

A deflated Sharpe is still a point estimate, and a point estimate says nothing about how wide its own error bars are. So every metric the scanner reports (Sharpe, Sortino, win rate, max drawdown) comes with a confidence interval (CI) from a stationary block bootstrap over the trade-return series.

The “block” part matters. A plain bootstrap resamples individual trades, which destroys any short-horizon autocorrelation in the series and, when that autocorrelation is positive, produces intervals that are too tight. A stationary block bootstrap resamples contiguous blocks (here sized around the square root of the trade count, with wrap-around), so the autocorrelation structure survives and the intervals stay honest. The scanner then ranks on the lower bound of the Sharpe interval, not the point estimate: a strategy with a high but wildly uncertain Sharpe loses to one with a slightly lower but tightly estimated Sharpe, which is the correct preference when the number sizes a position. Separately, a Monte-Carlo shuffle of the trade order separates “this strategy has edge” from “this strategy got lucky in the sequencing of its wins,” reported as a drawdown distribution.
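
Both resampling steps fit in a short sketch. This follows the textbook stationary bootstrap (random block lengths with a mean of about the square root of the trade count, wrapping around the end of the series) and a percentile interval. Whether Thoth uses random or fixed block lengths, and how it builds the interval, I am not asserting; the function names and the compounding assumption in `max_drawdown` are mine.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def sharpe(returns: Sequence[float]) -> float:
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def stationary_bootstrap_sample(returns: Sequence[float], mean_block: float,
                                rng: random.Random) -> List[float]:
    """One resample: contiguous runs of geometric length, wrapping past the end."""
    n = len(returns)
    p_new_block = 1.0 / mean_block
    idx = rng.randrange(n)
    out = []
    while len(out) < n:
        out.append(returns[idx])
        if rng.random() < p_new_block:
            idx = rng.randrange(n)   # start a new block somewhere else
        else:
            idx = (idx + 1) % n      # continue the current block, with wrap-around
    return out


def sharpe_ci(returns: Sequence[float], n_boot: int = 1000, alpha: float = 0.05,
              seed: int = 0) -> Tuple[float, float]:
    """Percentile confidence interval for the Sharpe ratio."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")
    rng = random.Random(seed)
    mean_block = max(1.0, math.sqrt(n))
    stats = sorted(sharpe(stationary_bootstrap_sample(returns, mean_block, rng))
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def max_drawdown(returns: Sequence[float]) -> float:
    """Worst peak-to-trough loss, assuming returns compound trade by trade."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def shuffled_drawdowns(returns: Sequence[float], n_shuffles: int = 1000,
                       seed: int = 0) -> List[float]:
    """Drawdown distribution over random orderings of the same trades."""
    rng = random.Random(seed)
    trades = list(returns)
    results = []
    for _ in range(n_shuffles):
        rng.shuffle(trades)
        results.append(max_drawdown(trades))
    return sorted(results)
```

Ranking on the lower bound is then a sort on `sharpe_ci(returns)[0]`. The shuffle keeps the set of trades fixed and changes only their order, so the spread of `shuffled_drawdowns` shows how much of the realized drawdown was sequencing.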

The proof is the test, not the prose

Anyone can write the paragraphs above. The reason I trust this layer is that its behaviour is pinned by tests that assert the headline claim directly. From the engine’s stats test:

A strategy with a Sharpe of 0.5 clears the bar as a single hypothesis (n_trials = 1) and fails it once it is the best of 2,500 trials. That is selection bias made mechanical: the same number, significant on its own, is noise once you admit how hard you searched for it.
A strategy with a genuine Sharpe of 4.0 survives the 2,500-trial deflation. The correction is not a blanket haircut that kills everything; it lets a real edge through.
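
For readers who want to see the shape of such a test, here is a self-contained sketch, not the engine's actual test file. The source does not state the sample size, the Sharpe units, or the pass level, so I am assuming annualized Sharpe on 15 years of daily observations, Gaussian returns, and a 0.95 pass level. The deflation is restated compactly so the block runs on its own, with the standard library's `NormalDist` standing in for a hand-written inverse normal.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329
PERIODS_PER_YEAR = 252            # assumption: daily observations
N_OBS = 15 * PERIODS_PER_YEAR     # assumption: 15 years of history
PASS_LEVEL = 0.95                 # assumption: deflated probability needed to pass


def dsr(annual_sharpe: float, n_trials: int, n_obs: int = N_OBS) -> float:
    """Deflated Sharpe probability for Gaussian returns (skew 0, excess kurtosis 0)."""
    sr = annual_sharpe / math.sqrt(PERIODS_PER_YEAR)  # back to per-period units
    sr_std = 1.0 / math.sqrt(n_obs - 1)               # assumed null-case estimator spread
    if n_trials == 1:
        bar = 0.0
    else:
        bar = sr_std * (
            (1.0 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
        )
    z = (sr - bar) * math.sqrt(n_obs - 1) / math.sqrt(1.0 + 0.5 * sr * sr)
    return _STD_NORMAL.cdf(z)


def test_mediocre_edge_dies_under_search() -> None:
    assert dsr(0.5, n_trials=1) > PASS_LEVEL      # fine as a single hypothesis
    assert dsr(0.5, n_trials=2500) < PASS_LEVEL   # noise as the best of 2,500


def test_strong_edge_survives_search() -> None:
    assert dsr(4.0, n_trials=2500) > PASS_LEVEL


if __name__ == "__main__":
    test_mediocre_edge_dies_under_search()
    test_strong_edge_survives_search()
```

With a much shorter history the 0.5 case would fail even at `n_trials = 1`, so the sample size is part of what the test pins down.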

A test that asserts “a mediocre edge dies under search and a strong one survives” is what “honest under selection pressure” means operationally. It is the difference between a tool that claims rigour and one that can prove it on demand.

Why this is the layer to build first

For any system that surfaces the best of many candidates, the deflation-and-CI machinery is the same regardless of domain. Swap “strategy on a ticker” for “variant in an A/B test,” “configuration in a sweep,” or “feature in a ranking,” and the bias and its fix are identical. It is cheap to build, it is testable, and it is almost always missing, which means the tool that has it can tell you something its competitors cannot.

This pairs with the other half of Thoth’s honesty work, the calibration layer. They fix two different lies. The deflation here corrects the bias the backtester introduces against its own data: the gap between the best of 4,550 and the truth. Calibration corrects the gap between that data and live reality: the gap between the backtest and the fill. A research tool needs the first. A tool you act on with real money needs both.