The calibration layer that makes a backtester safe to act on

Every backtester eventually publishes a number that turns out to be a lie. A 1.8 Sharpe becomes 0.6 in live trading. A +3.5% expected per-trade return becomes −0.4% realised. The gap is not noise; it’s data. The interesting engineering question isn’t how to make it go away (you mostly can’t); it’s how to make sure the system shipping the numbers also ships its own honesty about them.

This is a writeup of the calibration layer I built into the Thoth backtester / scanner (case page), because it’s the single piece of that project I’d defend most strongly as transferable to any quantitative system that publishes decisions people are going to act on.

The shape of the lie

The mechanism is well-understood: a backtester reports an in-sample optimum over a search space whose width the user didn’t pay for. Run thirteen strategies × three hundred and fifty tickers and rank them by Sharpe; the top of the leaderboard is the maximum of 4,550 samples from a noisy distribution. Most of the height of that maximum is sampling variability, not edge.
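To put a rough size on that maximum, here is a minimal sketch of the expected-maximum approximation used in the Deflated Sharpe Ratio literature. It assumes the trials are independent and the Sharpe estimates are roughly normal; the function name is mine, not part of Thoth. The result is in units of the standard deviation of the Sharpe estimates across trials.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_of_noise(n_trials: int) -> float:
    """Approximate E[max] of n_trials independent standard-normal draws.

    Hypothetical helper: uses the two-quantile approximation that the
    Deflated Sharpe Ratio relies on. Independence is an assumption.
    """
    if n_trials < 2:
        return 0.0
    z = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * math.e)))


n_trials = 13 * 350  # strategies x tickers, as in the scan described above
print(n_trials, round(expected_max_of_noise(n_trials), 2))
```

Real strategies on overlapping tickers are correlated, so the effective trial count is lower than 4,550 and this overstates the bar somewhat. Even so, a zero-edge leaderboard of that width should be expected to produce a top entry several standard deviations above zero.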

I had already implemented the standard fixes for this layer:

Deflated Sharpe Ratio with the trial count passed explicitly (4,550 in this scan) rather than implicitly assumed to be 1.
Stationary block bootstrap over the per-trade return series, producing confidence intervals (CIs) on Sharpe / Sortino / win-rate / max drawdown. The scanner uses the lower bound of the Sharpe CI as its actionable column, not the point estimate.
Walk-forward validation to catch parameter-fitting overfit per strategy.
A survivorship haircut that subtracts a Sharpe-units penalty per year of backtest window beyond a threshold, because delisted-and-gone tickers don’t appear in the current universe and longer windows therefore overstate edge.
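As an illustration of the second item, here is a minimal sketch of a stationary block bootstrap with a percentile confidence interval on per-trade Sharpe. It is not the Thoth implementation: the function names, the mean block length of 5, the resample count, and the synthetic return series are all assumptions chosen for the example.

```python
import random
from statistics import mean, stdev


def stationary_bootstrap_sample(series, mean_block_len, rng):
    """One resample: blocks of geometric length, wrapping around the series."""
    n = len(series)
    p_new_block = 1.0 / mean_block_len
    idx = rng.randrange(n)
    out = []
    for _ in range(n):
        out.append(series[idx])
        if rng.random() < p_new_block:
            idx = rng.randrange(n)  # start a new block at a random position
        else:
            idx = (idx + 1) % n     # continue the current block
    return out


def per_trade_sharpe(returns):
    """Mean over standard deviation of per-trade returns (not annualised)."""
    s = stdev(returns)
    return mean(returns) / s if s > 0 else 0.0


def sharpe_ci(returns, mean_block_len=5, n_resamples=500, alpha=0.05, seed=0):
    """Percentile CI on per-trade Sharpe from stationary-bootstrap resamples."""
    rng = random.Random(seed)
    stats = sorted(
        per_trade_sharpe(stationary_bootstrap_sample(returns, mean_block_len, rng))
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic per-trade returns in percent, purely for demonstration.
demo_rng = random.Random(42)
returns = [demo_rng.gauss(0.5, 5.0) for _ in range(110)]
lo, hi = sharpe_ci(returns)
print("actionable column (lower bound):", round(lo, 3))
```

The point of ranking on `lo` rather than the point estimate is that a candidate with few or erratic trades gets a wide interval and therefore a low actionable score.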

All of those are correct and necessary. None of them caught the gap I’m actually worried about. They reduce the bias the backtester introduces relative to the backtest’s own data-generating process. They have nothing to say about the gap between that data-generating process and live trading, which is a different and much larger problem.

A specific number that triggered the layer

A sweep run in early 2026 covered the full universe across two and a half years of history with all of the above corrections enabled. Aggregated across the 110 trades the sweep took, the published expected per-trade return was +6.21%. The realised return, once the trades had had time to play out under the same engine that generated them and the same accounting that priced them, was +0.15%.

Two-point-four percent of the predicted edge, materialising. That’s not a “tune the parameters” gap; that’s the system being wrong about the size of its own claim by a factor of roughly forty. Walk-forward had caught the in-sample / out-of-sample drift, the deflated Sharpe had corrected for trial count, and the bootstrap CIs had honestly reported that the lower bound was thinner than the point estimate. None of those layers was positioned to catch the gap between the engine’s view of execution (deterministic fills at the next open, perfect slippage estimates) and the messy reality of intra-day prints, liquidity timing, and the dozen-and-one things the engine doesn’t model.

So I built a layer that does catch it.

The calibration ratio

Per strategy (and globally), compute:

ratio = mean(realized_pct) / mean(predicted_pct)

over closed trades from the live journal. The numerator comes from a trade_outcomes row written at close: realised profit and loss (P&L) in local currency, converted to Danish kroner (DKK) at the exit-date foreign-exchange (FX) rate, divided by the entry-date FX-converted cost basis. The denominator comes from a trade_decisions row written at decision time: the full signal_snapshot JSONB captures avg_return_pct exactly as the scanner published it, so the predicted number is the one the user actually saw on the screen.

The pair is durable. Even if the scanner cache is gone in six months, the decision and outcome row pair lives in Postgres and is replayable forever.
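A minimal sketch of the ratio computation over paired rows follows. The dataclass and field names (other than decision_id and avg_return_pct) are hypothetical stand-ins for the journal columns, and returning None when the mean prediction is not positive is my own guard, not something the journal schema dictates.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Decision:
    decision_id: int
    strategy: str
    avg_return_pct: float  # as published, read from signal_snapshot


@dataclass(frozen=True)
class Outcome:
    decision_id: int
    pnl_local: float         # realised P&L in local currency
    cost_basis_local: float  # entry cost in local currency
    fx_entry: float          # DKK per unit of local currency on entry date
    fx_exit: float           # DKK per unit of local currency on exit date


def realized_pct(o: Outcome) -> float:
    """Realised return in percent, measured in DKK."""
    pnl_dkk = o.pnl_local * o.fx_exit
    cost_dkk = o.cost_basis_local * o.fx_entry
    return 100.0 * pnl_dkk / cost_dkk


def raw_ratio(decisions, outcomes) -> Optional[float]:
    """mean(realized_pct) / mean(predicted_pct) over closed, paired trades."""
    by_id = {d.decision_id: d for d in decisions}
    pairs = [(realized_pct(o), by_id[o.decision_id].avg_return_pct)
             for o in outcomes if o.decision_id in by_id]
    if not pairs:
        return None
    predicted_mean = mean(p for _, p in pairs)
    if predicted_mean <= 0:
        return None  # assumption: no meaningful ratio without positive predicted edge
    return mean(r for r, _ in pairs) / predicted_mean


decisions = [Decision(1, "demo", 6.0), Decision(2, "demo", 4.0)]
outcomes = [Outcome(1, 10.0, 1000.0, 7.0, 7.0), Outcome(2, 0.0, 1000.0, 7.0, 7.0)]
print(raw_ratio(decisions, outcomes))
```

Note that this is a ratio of means, not a mean of per-trade ratios, so a trade with a tiny predicted return cannot blow up the result. Plugging in the sweep figures gives 0.15 / 6.21, about 0.024.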

Three properties of the ratio that matter:

Clamped to [0.1, 1.0]. If realised exceeds predicted, the system is allowed to publish at most “predicted held”; we don’t reward the tool for lucky outcomes the predictor had nothing to do with. If realised collapses, we floor at 0.1 so sizing degrades safely toward zero but doesn’t go all the way there on a single bad month.
Blended with a bootstrap prior of fixed weight PRIOR_N = 110. That number is the trade count of the historical sweep cited above, so the blend behaves naturally as fresh evidence accumulates: at zero live trades the published ratio is the prior; at exactly PRIOR_N live trades the live and prior weights are equal; past 2 × PRIOR_N the source field flips to "live", by which point live trades carry at least two-thirds of the weight and the prior no longer dominates.
Per-strategy with global fallback. Strategies with no live trades inherit the global ratio. Strategies with a meaningful sample size publish their own.
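The three properties can be sketched together as follows. This assumes a pseudo-count blend (the prior counts as PRIOR_N trades), clamping applied after blending, and "prior" / "blended" as the other values of the source field; only PRIOR_N, the [0.1, 1.0] bounds, and the "live" label come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PRIOR_N = 110
RATIO_FLOOR, RATIO_CEIL = 0.1, 1.0


@dataclass(frozen=True)
class Calibration:
    ratio: float
    source: str  # "prior", "blended" (both hypothetical labels) or "live"


def clamp(x: float) -> float:
    return max(RATIO_FLOOR, min(RATIO_CEIL, x))


def blend(live_ratio: Optional[float], n_live: int, prior_ratio: float) -> Calibration:
    """Weight the live ratio by n_live and the prior by PRIOR_N, then clamp."""
    if live_ratio is None or n_live == 0:
        return Calibration(clamp(prior_ratio), "prior")
    w_live = n_live / (n_live + PRIOR_N)
    blended = w_live * live_ratio + (1.0 - w_live) * prior_ratio
    source = "live" if n_live > 2 * PRIOR_N else "blended"
    return Calibration(clamp(blended), source)


def ratio_for(strategy: str,
              per_strategy: Dict[str, Calibration],
              global_cal: Calibration) -> Calibration:
    """Strategies without their own calibration inherit the global one."""
    return per_strategy.get(strategy, global_cal)


global_cal = blend(live_ratio=0.4, n_live=55, prior_ratio=0.1)
print(ratio_for("unseen_strategy", {}, global_cal))
```

With this weighting the prior's share is PRIOR_N / (n_live + PRIOR_N), so it falls to one half at PRIOR_N live trades and one third at 2 × PRIOR_N.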

What the ratio does in the system

The deflated Sharpe times the calibration ratio is the calibrated Sharpe. That’s what the scanner ranks by, and that’s what the Morning Brief reads when deciding which opportunities to surface to the user.

Downstream, the risk budget consumes the global ratio too: half-Kelly sizing gets scaled by the ratio, so when the system is currently overshooting realised edge by a factor of three, position sizes shrink by that factor before any heat / sector / currency cap binds. The user does not have to remember to size down; the system does it for them.
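In code, both uses of the ratio are a single multiplication. The sketch below assumes the textbook Kelly fraction for a win/loss bet as the sizing input; how Thoth actually estimates its Kelly fraction is not covered here, and all function names are illustrative.

```python
def calibrated_sharpe(deflated_sharpe: float, ratio: float) -> float:
    """Ranking statistic: deflated Sharpe scaled by the calibration ratio."""
    return deflated_sharpe * ratio


def kelly_fraction(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Textbook Kelly fraction f* = p - (1 - p) / b, with b = avg_win / avg_loss.

    Assumption: a simple two-outcome model, used here only for illustration.
    """
    if avg_win <= 0 or avg_loss <= 0:
        return 0.0
    b = avg_win / avg_loss
    return max(0.0, win_rate - (1.0 - win_rate) / b)


def position_fraction(win_rate: float, avg_win: float, avg_loss: float,
                      global_ratio: float) -> float:
    """Half-Kelly, scaled by the global calibration ratio, before any caps."""
    return 0.5 * kelly_fraction(win_rate, avg_win, avg_loss) * global_ratio


# Predicted edge overshooting realised edge by a factor of three -> ratio of 1/3.
full_trust = position_fraction(0.6, 1.0, 1.0, global_ratio=1.0)
low_trust = position_fraction(0.6, 1.0, 1.0, global_ratio=1.0 / 3.0)
print(full_trust, low_trust)
```

Heat, sector, and currency caps would then apply to the already-shrunk fraction, so the calibration scaling never loosens a cap; it can only make the position smaller.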

And visibly: when the global ratio is below 30%, the scanner emits a low_trust_mode warning string that the Morning Brief renders as a caution banner above the ranked opportunities: “The system is currently in low-trust mode. Realised edge is X% of predicted (< 30%).” That banner is the tool telling its user not to trust its own raw numbers. It is, I think, the single most important piece of UX in the whole stack.
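The warning itself is a few lines. This sketch assumes the scanner returns either the banner string or nothing; the function and constant names are hypothetical, while the 30% threshold and the wording follow the description above.

```python
from typing import Optional

LOW_TRUST_THRESHOLD = 0.30  # global ratio below this triggers the banner


def low_trust_warning(global_ratio: float) -> Optional[str]:
    """Return the low_trust_mode banner text, or None when trust is adequate."""
    if global_ratio >= LOW_TRUST_THRESHOLD:
        return None
    return (
        "The system is currently in low-trust mode. "
        f"Realised edge is {global_ratio:.0%} of predicted "
        f"(< {LOW_TRUST_THRESHOLD:.0%})."
    )


print(low_trust_warning(0.1))
```

Returning None rather than an empty string keeps the rendering side simple: the Morning Brief only draws a banner when there is something to say.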

Why the ratio matters more than the corrections

Imagine a tool that’s selection-bias-corrected, bootstrap-CI’d, walk-forward-validated, survivorship-discounted, and ranks 4,550 candidates by lower-bound Sharpe. It looks rigorous. It reports a 1.4 Sharpe with a 1.1 lower bound on the top candidate. The user sizes a position at quarter-Kelly against that 1.1.

Then the live trade returns 12% of predicted. The user thinks the trade was unlucky and takes another one. And another. The system has no built-in mechanism to learn from those outcomes; the scanner is stateless across runs, the journal lives in a separate schema nobody reads on Monday morning. The user keeps sizing against the published number until they personally notice the gap, which on a noisy edge takes months and many losing trades to recognise.

The calibration layer collapses that loop. Every closed position adjusts the ratio. Every adjusted ratio adjusts the calibrated Sharpe. Every adjusted calibrated Sharpe re-ranks the leaderboard. The system is in feedback with its own performance, on the timescale of the actual trade rather than the timescale of the user noticing.

The pieces that have to exist for this to work

Three durable records, each with its own table:

trade_decisions: what the system predicted at decision time. Includes a signal_snapshot JSONB so the published edge survives any subsequent change to the scanner code, the strategy parameters, or the universe.
trade_outcomes: what actually happened. Realised P&L in local currency, in DKK with explicit FX dates, exit reason, hold time. Linked to the decision via decision_id.
scan_snapshots: what the system was telling the user across whole scans, not just on the rows they acted on. Useful for retrospective questions about what the scanner would have told you to do.

And one transactional flow: closing a position has to atomically write the outcome row and update the position. If those two get out of step, the calibration math is permanently biased, with some closes recorded and others silently missing. The /portfolio/close endpoint is a single transaction for exactly that reason.
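A minimal sketch of that atomic close, using an in-memory SQLite database as a stand-in for Postgres so the example is self-contained. The positions table, the column names, and the close_position function are assumptions for illustration; only trade_outcomes and decision_id come from the schema described above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE positions (
    id INTEGER PRIMARY KEY,
    decision_id INTEGER NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE trade_outcomes (
    decision_id INTEGER PRIMARY KEY,
    pnl_dkk REAL NOT NULL,
    exit_reason TEXT NOT NULL
);
"""


def close_position(conn, position_id: int, pnl_dkk: float, exit_reason: str) -> None:
    """Mark the position closed and write its outcome row in one transaction."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        row = conn.execute(
            "SELECT decision_id FROM positions WHERE id = ? AND status = 'open'",
            (position_id,),
        ).fetchone()
        if row is None:
            raise ValueError(f"position {position_id} is not open")
        conn.execute(
            "UPDATE positions SET status = 'closed' WHERE id = ?", (position_id,)
        )
        conn.execute(
            "INSERT INTO trade_outcomes (decision_id, pnl_dkk, exit_reason) "
            "VALUES (?, ?, ?)",
            (row[0], pnl_dkk, exit_reason),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO positions VALUES (1, 1, 'open')")
conn.commit()
close_position(conn, 1, 12.5, "target")
```

If the outcome insert fails after the position update, the rollback restores the position to open, so the journal never holds a closed position without its outcome row.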

What this is not

Calibration is not a backstop for bad strategies. It will dutifully report a 0.1 ratio on a strategy that’s losing money in live trading, and the calibrated Sharpe will correctly be a tenth of the predicted Sharpe. But the system will still publish the strategy, just smaller, and the user will still take some of those trades. The right response to a sustained calibration ratio collapse is to retire the strategy, not to let the calibration layer permanently throttle it.

The strategy-decay detector is a separate layer that flags a strategy whose recent walk-forward folds have deteriorated by N sigma versus the baseline on consecutive folds. That’s what handles the retirement case. Calibration handles the gap between predicted and realised for a strategy that is still in regime; decay detection handles the strategy walking out of its regime. Both are necessary; neither subsumes the other.

The takeaway

A research backtester ranks candidates by an expected statistic. A live trading tool ranks candidates by an expected statistic that has been held accountable to what actually materialised. The two are different systems, and the difference is one table and a multiplication.

For any quantitative system whose output the user is going to act on, whether a forecaster, a recommender, a scoring model, or a trading signal, the layer that closes this loop is the one I’d build before tuning the model itself.