
Equities backtester + risk budget

Thoth

An equities backtester built with the statistical honesty of a real-money system: selection-bias-aware deflated Sharpe with explicit trial count, stationary block-bootstrap confidence intervals, predicted-vs-realized calibration against a live trade journal, correlation-adjusted quarter-Kelly sizing with heat / sector / currency caps. The engine work (pure Polars strategies, vectorised regime detection with hysteresis, threaded scanner) is what makes the trust layer cheap enough to actually run every morning.

Optimization Fullstack Data
FastAPI Polars PostgreSQL React Vite TanStack

DSR + CI

Selection-bias-aware ranking

Live calibration

Predicted vs realized, journal-blended

Quarter-Kelly

Correlation-adjusted with heat caps

<7 s

13-strategy universe scan

What it is

A personal equities research platform I'm building toward the bar I'd want a tool to clear before I let it size positions with real money. Thoth used to be three separate pages: a scanner, a chart, a portfolio view. Each was useful in isolation. None of them answered the only question I actually have on a Monday morning:

Given my portfolio, my cash, my open risk, and today's regime, what (if anything) should I do today?

The platform is organised around answering that one question honestly. The engineering effort that I'm proudest of, and that I think is most transferable to any quantitative system under selection pressure, is the layer between "the backtest looked great" and "the live trade actually made money." Below is what's in that layer, and why each piece is there.

The financial trust layer

Trustworthy expected profit and loss (P&L) has three preconditions, and each one has to be independently honest: a selection-bias-aware estimate of edge, calibrated uncertainty around that estimate, and a persistent record that holds those estimates accountable to what later actually happened.

Deflated Sharpe Ratio with an explicit trial count

Bailey & López de Prado (2014). A raw Sharpe ratio is a strongly upward-biased statistic when you've tried many strategies on many tickers and reported the best, because selection bias swamps any actual edge. The Deflated Sharpe Ratio (DSR) penalises the observed Sharpe by the expected maximum Sharpe of N independent trials drawn from a null distribution, with the test statistic scaled by the higher-order moments (skew, excess kurtosis) of the return distribution. Thoth passes the trial count to the statistic explicitly: every scanner row is one (ticker, strategy) trial, so a 350-ticker × 13-strategy scan has N = 4,550 trials, and that number is what the deflation is calibrated to. The scanner ranks by DSR, not raw Sharpe. A p-value comes out of the same calculation; results whose p-value exceeds a configurable threshold can be filtered out before ranking, so a one-trade 100% win-rate fluke doesn't lap the board. A unit test pins the behaviour: a 0.5 Sharpe that clears the bar as a single hypothesis fails it once it is the best of 2,500 trials, while a genuine 4.0 Sharpe survives.
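
A minimal sketch of the deflation, assuming per-period (non-annualised) Sharpe inputs and the closed-form expected-maximum approximation from the paper. The function names, the `sharpe_std` argument (the standard deviation of Sharpe estimates across trials), and the example numbers are illustrative assumptions, not Thoth's actual API.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Expected best Sharpe among n_trials independent zero-edge trials."""
    if n_trials < 2:
        return 0.0  # a single hypothesis carries no selection penalty
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(
    sharpe: float,          # observed per-period Sharpe of the selected trial
    n_obs: int,             # number of return observations behind it
    n_trials: int,          # explicit trial count, e.g. tickers x strategies
    sharpe_std: float,      # cross-trial std of Sharpe estimates (assumed input)
    skew: float = 0.0,
    excess_kurtosis: float = 0.0,
) -> float:
    """Probability that the true Sharpe exceeds the selection-bias benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    # (kurtosis - 1) / 4 written in terms of excess kurtosis.
    var_term = 1.0 - skew * sharpe + (excess_kurtosis + 2.0) / 4.0 * sharpe**2
    if var_term <= 0.0:
        raise ValueError("moment inputs give a non-positive variance term")
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / math.sqrt(var_term)
    return _STD_NORMAL.cdf(z)


# The same observed Sharpe looks very different once it is the best of many.
single = deflated_sharpe(0.1, n_obs=252, n_trials=1, sharpe_std=0.05)
best_of_many = deflated_sharpe(0.1, n_obs=252, n_trials=2500, sharpe_std=0.05)
```

The matching p-value is `1 - deflated_sharpe(...)`, which is the quantity a significance filter would threshold on.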

Stationary block bootstrap CIs and Monte-Carlo trade-order shuffle

Point estimates lie about themselves. Every metric the scanner reports (Sharpe, Sortino, win-rate, max drawdown) is accompanied by bootstrap confidence intervals (CIs) from a stationary block bootstrap on the trade-return series. Blocks preserve any short-horizon autocorrelation that a simple independent and identically distributed (IID) bootstrap would destroy. The scanner uses the lower bound of the Sharpe CI as its actionable column, not the point estimate. Separately, a Monte-Carlo trade-order shuffle resamples the order of wins and losses to separate "this strategy has edge" from "this strategy got lucky with the timing of its wins," reported as P5/P50/P95 drawdown and terminal return.
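
The resampling step can be sketched as below, assuming a stationary bootstrap with geometrically distributed block lengths and wrap-around indexing, followed by a plain percentile interval. The helper names, the default mean block length, and the sample returns are hypothetical; only the standard library is used.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def stationary_bootstrap_sample(
    series: Sequence[float], mean_block: float, rng: random.Random
) -> List[float]:
    """One resample: blocks of random (geometric) length, wrapping at the end."""
    n = len(series)
    restart_prob = 1.0 / mean_block
    idx = rng.randrange(n)
    out = []
    for _ in range(n):
        out.append(series[idx])
        if rng.random() < restart_prob:
            idx = rng.randrange(n)   # start a new block at a random position
        else:
            idx = (idx + 1) % n      # continue the current block
    return out


def sharpe(returns: Sequence[float]) -> float:
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def bootstrap_ci(
    series: Sequence[float],
    stat: Callable[[Sequence[float]], float],
    n_boot: int = 1000,
    mean_block: float = 5.0,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval of `stat` under the stationary bootstrap."""
    rng = random.Random(seed)
    draws = sorted(
        stat(stationary_bootstrap_sample(series, mean_block, rng))
        for _ in range(n_boot)
    )
    lo = draws[int(alpha / 2 * n_boot)]
    hi = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


trade_returns = [0.01, -0.02, 0.03, 0.0, 0.015, -0.005, 0.02, -0.01, 0.005, 0.012]
sharpe_lo, sharpe_hi = bootstrap_ci(trade_returns, sharpe, n_boot=200, mean_block=3)
```

Here `sharpe_lo` plays the role of the actionable lower-bound column; the same `bootstrap_ci` call works for any other metric passed as `stat`.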

The calibration engine: realized vs predicted edge

This is the keystone, and the part I think makes Thoth actually distinctive. Every decision the scanner publishes goes into a trade_decisions table with a JSONB signal_snapshot capturing the full predicted edge at decision time. When a position closes, an atomic write inserts a paired trade_outcomes row with realized P&L, converted to Danish kroner (DKK) at the exit-date foreign-exchange (FX) rate. A calibration job computes, per strategy and globally:

ratio = mean(realized_pct) / mean(predicted_pct)

The ratio is blended with a fixed-weight bootstrap prior (PRIOR_N = 110 virtual trades from a closed historical sweep), so a brand-new strategy doesn't degenerate to zero or to infinity. Live evidence dominates the prior once the accumulated trade count exceeds PRIOR_N. The scanner then publishes calibrated_sharpe = deflated Sharpe × calibration ratio, and that number ranks the leaderboard.

The painful, honest truth this layer published when it was first turned on: a 2026 sweep showed predicted ~+6.2% per trade becoming realized ~+0.15%. Calibration ratio ≈ 0.024, clamped to a 0.1 floor so position sizing degrades safely rather than going to zero on a single bad month. The scanner emits a "low-trust mode" warning on the Morning Brief when the global ratio is below 30%, which it currently is. That is the system telling its user not to trust its raw numbers, and instructing the rest of the pipeline (Kelly sizing, heat budget) to size down accordingly. The point isn't that the strategies don't work; the point is that the backtest overshoots reality, and the system that ships these numbers should be the one telling you that before you size a position on them.
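
A minimal sketch of the blend, the floor, and the low-trust check described above. It assumes the prior enters as a pseudo-count weighted average of ratios; the exact blending rule, the `prior_ratio` value, and the function names are assumptions for illustration. PRIOR_N, the 0.1 floor, and the 30% threshold come from the text.

```python
from statistics import fmean
from typing import Sequence

PRIOR_N = 110            # virtual trades carried by the bootstrap prior
RATIO_FLOOR = 0.1        # sizing degrades safely instead of collapsing to zero
LOW_TRUST_BELOW = 0.30   # global ratio below this triggers the warning


def calibration_ratio(
    realized_pct: Sequence[float],
    predicted_pct: Sequence[float],
    prior_ratio: float,   # ratio implied by the historical sweep (assumed input)
) -> float:
    """Blend the live realized/predicted ratio with the prior, then apply the floor."""
    n_live = len(realized_pct)
    mean_predicted = fmean(predicted_pct) if n_live else 0.0
    if n_live == 0 or mean_predicted == 0.0:
        live_ratio = prior_ratio   # no usable live evidence yet
    else:
        live_ratio = fmean(realized_pct) / mean_predicted
    # Pseudo-count blend: live evidence outweighs the prior once n_live > PRIOR_N.
    blended = (PRIOR_N * prior_ratio + n_live * live_ratio) / (PRIOR_N + n_live)
    return max(RATIO_FLOOR, blended)


def is_low_trust(global_ratio: float) -> bool:
    return global_ratio < LOW_TRUST_BELOW


def calibrated_sharpe(deflated_sharpe_value: float, ratio: float) -> float:
    return deflated_sharpe_value * ratio


# 110 live trades at +0.15% realized vs +6.2% predicted, with a made-up prior of 0.5.
ratio = calibration_ratio([0.15] * 110, [6.2] * 110, prior_ratio=0.5)
```

With equal live and prior weight the blended ratio sits halfway between the prior and the live value, and it keeps moving toward the live value as trades accumulate.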

Walk-forward validation, strategy decay, and survivorship

Walk-forward cross-validation (CV) runs per strategy on a sliding window. Train on [t0, T), test on a slab immediately after, slide forward. An overfit ratio (in-sample edge vs out-of-sample edge) is recorded per fold; the scanner renders it as a column and dims any row whose recent folds have decayed beyond N consecutive sigma. A separate haircut subtracts a fixed Sharpe-units penalty per year of backtest window beyond a threshold, on the principle that delisted-and-gone tickers don't appear in the current universe and longer windows therefore over-state edge by an amount that grows with the window. The haircut is applied after the calibration so the raw deflated number stays diagnostic and the visible ranking is net of survivorship.
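
The sliding-window split can be sketched as follows. The fold sizes, the step default, and the direction of the overfit ratio (in-sample divided by out-of-sample) are assumptions; the text only says the two edges are compared per fold.

```python
from typing import Iterator, Optional, Tuple


def walk_forward_folds(
    n_bars: int, train_size: int, test_size: int, step: Optional[int] = None
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges: train on [t0, T), test on the slab after."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_bars:
        split = start + train_size
        yield range(start, split), range(split, split + test_size)
        start += step


def overfit_ratio(in_sample_edge: float, out_of_sample_edge: float) -> float:
    """Values well above 1 mean the edge shrank out of sample."""
    if out_of_sample_edge <= 0:
        return float("inf")
    return in_sample_edge / out_of_sample_edge


folds = list(walk_forward_folds(n_bars=10, train_size=4, test_size=2))
```

Each test slab starts exactly where its training window ends, so no test bar is ever visible to the fit that is scored on it.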

Sharpe annualisation, a worked example of "look up one level"

Earlier versions of the engine annualised Sharpe by sqrt(n_trades). That looks reasonable until you notice it's biased by turnover rate: under that convention a 100-trades/year strategy would report a Sharpe ratio √5 ≈ 2.24× that of an equal-edge 20-trades/year strategy, which is plainly nonsense. Sharpe is supposed to be a property of the edge, not a property of how often you swing the bat. The current engine builds a daily-return series (each trade distributes its net_return_pct uniformly across the trading days it was held; idle days earn zero) and annualises by sqrt(252). The Sortino denominator was simultaneously rebuilt against the canonical formula: downside deviation is computed over all observations, not only the negative ones, because restricting it to the negative ones inflates Sortino for strategies that rarely lose.
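
A minimal sketch of the daily-series convention, assuming trades are given as (entry_day, exit_day, net_return_pct) tuples with the exit day exclusive and a zero target return for Sortino. The tuple layout and function names are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple

TRADING_DAYS = 252


def daily_return_series(
    trades: Sequence[Tuple[int, int, float]], n_days: int
) -> List[float]:
    """Spread each trade's return evenly over the days held; idle days stay 0."""
    daily = [0.0] * n_days
    for entry_day, exit_day, net_return_pct in trades:
        held = exit_day - entry_day
        if held <= 0:
            raise ValueError("a trade must be held for at least one day")
        per_day = net_return_pct / held
        for day in range(entry_day, exit_day):
            daily[day] += per_day
    return daily


def annualised_sharpe(daily: Sequence[float]) -> float:
    sd = statistics.stdev(daily)
    if sd == 0:
        return 0.0
    return statistics.fmean(daily) / sd * math.sqrt(TRADING_DAYS)


def annualised_sortino(daily: Sequence[float]) -> float:
    # Downside deviation divides by ALL observations, not just the losing ones.
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in daily) / len(daily))
    mean = statistics.fmean(daily)
    if downside == 0:
        return float("inf") if mean > 0 else 0.0
    return mean / downside * math.sqrt(TRADING_DAYS)


series = daily_return_series([(0, 2, 4.0)], n_days=4)
```

Because the series has one entry per trading day regardless of how many trades occurred, two strategies with the same edge and different turnover are annualised by the same factor.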

Portfolio sizing: Kelly with caps and correlations

Even with a calibrated, deflated, bootstrapped expected edge, raw Kelly sizing is too aggressive in practice. Thoth's /portfolio/proposals endpoint takes a ranked candidate list and an existing book and returns a sized buy list plus a per-candidate rejection log. The sizing math:

  • Continuous Kelly f* = mean / variance on per-trade returns, with Bessel-style sample-size shrinkage (n-2)/n so a 3-trade Kelly doesn't size a position to insolvency. Quarter-Kelly is the default; the raw value is exposed for diagnostics.
  • Correlation-adjusted via a return-correlation matrix loaded in a single batched SQL call across "held + candidate" symbols. Two highly correlated candidates don't both get full quarter-Kelly; the second one is sized down proportional to the correlation already present in the book.
  • Caps: 6% portfolio heat (sum of stop_distance × position_size), 40% per sector, 70% per currency. Each rejection carries the exact cap that bound the proposal so the user can read the reason.
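
The sizing steps in the list above can be sketched as below, assuming per-trade returns expressed as fractions. The (n-2)/n shrinkage and the quarter-Kelly multiplier follow the text; the specific correlation haircut (scale by one minus the largest positive correlation with the book) and the function names are assumptions made for illustration.

```python
import statistics
from typing import Iterable, Sequence


def kelly_fraction(
    trade_returns: Sequence[float], kelly_multiplier: float = 0.25
) -> float:
    """Continuous Kelly mean/variance, shrunk by (n-2)/n, then scaled down."""
    n = len(trade_returns)
    if n < 3:
        return 0.0   # (n-2)/n is zero or negative; no size on so little data
    mean = statistics.fmean(trade_returns)
    variance = statistics.variance(trade_returns)
    if variance == 0 or mean <= 0:
        return 0.0   # no measurable edge, so no position
    raw_kelly = mean / variance
    shrunk = raw_kelly * (n - 2) / n
    return kelly_multiplier * shrunk


def correlation_adjusted(size: float, correlations_with_book: Iterable[float]) -> float:
    """Hypothetical haircut: shrink by the largest positive correlation held."""
    worst = max((max(0.0, c) for c in correlations_with_book), default=0.0)
    return size * (1.0 - worst)


quarter_kelly = kelly_fraction([0.02, -0.01, 0.03, 0.01])
sized = correlation_adjusted(quarter_kelly, [0.8, -0.2])
```

The result is a fraction of capital before caps; the heat, sector, and currency limits would then be applied on top, and a small sample like this one shows why raw per-trade Kelly is exposed only for diagnostics.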

FX-aware DKK P&L

A book that holds USD, SEK, NOK, EUR, GBP, and CHF positions cannot pretend FX rates are 1.0. Thoth ingests daily FX rates per currency and records P&L in DKK at the actual rates that bracketed the trade: entry-date FX for cost basis, exit-date FX for proceeds. Missing FX surfaces as an explicit warning in the portfolio pulse, never as a silent 1.0. realized_pnl_dkk is populated atomically with realized_pnl_local on close, so the calibration layer that feeds back into edge estimation reads honest, FX-corrected, slippage-included outcomes.
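
As a worked sketch of the bracketing-rate rule, assuming FX rates are quoted as DKK per one unit of the local currency and ignoring fees for brevity. The function name and signature are hypothetical.

```python
from typing import Optional


def realized_pnl_dkk(
    quantity: float,
    entry_price: float,          # local currency
    exit_price: float,           # local currency
    entry_fx: Optional[float],   # DKK per 1 local unit on the entry date
    exit_fx: Optional[float],    # DKK per 1 local unit on the exit date
) -> float:
    """Cost basis at entry-date FX, proceeds at exit-date FX."""
    if entry_fx is None or exit_fx is None:
        # Surface the gap instead of silently assuming a rate of 1.0.
        raise ValueError("missing FX rate for this trade")
    cost_dkk = quantity * entry_price * entry_fx
    proceeds_dkk = quantity * exit_price * exit_fx
    return proceeds_dkk - cost_dkk


# 10 shares bought at 100 USD (7.0 DKK/USD), sold at 110 USD (6.5 DKK/USD).
pnl = realized_pnl_dkk(10, 100.0, 110.0, entry_fx=7.0, exit_fx=6.5)
```

The local gain is 100 USD, yet the DKK gain is only 150 because the kroner strengthened over the holding period; converting the whole local P&L at a single rate would miss that.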

Regime detection with hysteresis

Strategies declare a regime affinity (trending_up, volatile_bull, trending_down, correction, ranging); the scanner gates entry signals on the current regime so a mean-reversion strategy doesn't fire in a trend, and vice versa. Computing regime cleanly turns out to be the hard part: a sharp cutoff on whether price sits above its 200-day simple moving average (SMA200) flips every bar a noisy index nicks SMA200 by a tick. Two filters fix this:

  • Band hysteresis. Price must clear SMA200 by +0.5% to enter trending_up, and fall 0.5% below SMA200 to leave it. The volatility ratio uses asymmetric enter/exit thresholds (1.5× to enter elevated, 1.3× to exit).
  • Persistence. A candidate regime only commits after it has held for N consecutive bars. A single-bar blip back into trending_up in the middle of a correction doesn't flip the regime; three bars in a row do.

Both filters are vectorised on Polars. The committed regime per bar feeds the join that gates strategies during the scan.
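
For clarity, the two filters are sketched here as a plain scalar loop rather than the vectorised Polars form the engine uses. The function names are hypothetical, and the sketch covers only the SMA200 band and the persistence rule, not the volatility ratio.

```python
from typing import List, Sequence


def band_hysteresis(
    prices: Sequence[float], sma200: Sequence[float], band: float = 0.005
) -> List[bool]:
    """True while 'above trend': enter above SMA*(1+band), leave below SMA*(1-band)."""
    above = False
    out = []
    for price, sma in zip(prices, sma200):
        if not above and price > sma * (1 + band):
            above = True
        elif above and price < sma * (1 - band):
            above = False
        out.append(above)
    return out


def persist(candidates: Sequence[str], n_bars: int = 3) -> List[str]:
    """Commit a new regime only after it has held for n_bars consecutive bars."""
    if not candidates:
        return []
    committed = candidates[0]
    pending, streak = committed, 0
    out = []
    for candidate in candidates:
        if candidate == pending:
            streak += 1
        else:
            pending, streak = candidate, 1
        if pending != committed and streak >= n_bars:
            committed = pending
        out.append(committed)
    return out


above_trend = band_hysteresis(
    [100.0, 100.4, 100.6, 100.2, 99.6, 99.4], [100.0] * 6
)
regimes = persist(
    ["correction", "correction", "trending_up", "correction", "correction",
     "trending_up", "trending_up", "trending_up", "trending_up"]
)
```

In `regimes`, the lone trending_up bar at index 2 is ignored, and the switch commits only on the third consecutive trending_up bar.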

The engine and the scanner

Everything above presupposes that the underlying backtest is cheap enough to run the full universe several times a day. That's where the loop-structural optimisation work lives. None of it is micro-tuning; all of it is making sure Polars gets to do what Polars is built for.

BacktestEngine: vectorised, then trade-iterated

Naive backtests iterate over price rows. The faster shape iterates over trades, and trades are sparse. BacktestEngine.run takes a Polars frame with pre-computed entry_signal and exit_signal boolean columns, filters for entry candidates (typically a handful per ticker per year), and scans forward only between entries. Two micro-decisions make the inner scan cheap: columns are pre-extracted to Python lists before the loop because Polars columns are not random-access-cheap and the hot scan needs closes[i]; per-symbol cost profiles are looked up once and passed in, not resolved inside the trade loop. The engine averages under 2 ms per ticker on the standard universe, including stop-loss / time-limit / trailing-stop logic that respects gap-throughs at the open separately from intraday piercing.
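
The trade-iterated shape can be sketched as below, with plain lists standing in for columns already extracted from the frame. It assumes one position at a time, close-only prices, and a simple percentage stop, so it omits the gap-through, trailing-stop, and cost-profile logic. `run_trades` and its parameters are hypothetical, not BacktestEngine.run itself.

```python
from typing import List, Sequence, Tuple

Trade = Tuple[int, int, float, str]  # (entry_idx, exit_idx, return, exit_reason)


def run_trades(
    closes: Sequence[float],
    entry_signal: Sequence[bool],
    exit_signal: Sequence[bool],
    stop_pct: float = 0.10,
    max_hold: int = 20,
) -> List[Trade]:
    """Iterate over entry candidates, scanning forward only while in a trade."""
    last = len(closes) - 1
    entries = [i for i, flag in enumerate(entry_signal) if flag]  # sparse
    trades: List[Trade] = []
    next_free = 0
    for i in entries:
        if i < next_free:
            continue   # still inside the previous trade
        if i >= last:
            break      # no bars left to hold through
        entry_px = closes[i]
        exit_idx = min(i + max_hold, last)
        reason = "time_limit_or_end"
        for j in range(i + 1, exit_idx + 1):
            if closes[j] <= entry_px * (1 - stop_pct):
                exit_idx, reason = j, "stop_loss"
                break
            if exit_signal[j]:
                exit_idx, reason = j, "exit_signal"
                break
        trades.append((i, exit_idx, closes[exit_idx] / entry_px - 1, reason))
        next_free = exit_idx + 1
    return trades


trades = run_trades(
    closes=[10.0, 11.0, 12.0, 9.0, 10.0, 11.0],
    entry_signal=[True, False, False, False, True, False],
    exit_signal=[False, False, True, False, False, False],
    stop_pct=0.2,
)
```

The inner loop touches only bars between an entry and its exit, so the work scales with the number and length of trades rather than with the full price history.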

The scanner runs all 13 strategies against the full ticker universe, regime-gated, and ranks the survivors by the deflated and calibrated number rather than raw Sharpe. Keeping that scan cheap enough to run several times a day is what lets the statistics above be recomputed live instead of cached and trusted.

BulkRunner: concurrent execution with SSE progress

For research workflows I want to run a strategy against hundreds of tickers and watch progress as it happens. BulkRunner uses a 16-worker ThreadPoolExecutor (Polars releases Python's global interpreter lock, the GIL, on the compute), pushes per-ticker updates into a bounded Queue, and the frontend consumes that as Server-Sent Events (SSE). The UI shows a live progress bar as results stream in, rather than spinning silently for ten seconds.
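
The producer/consumer shape can be sketched with the standard library alone. This generator yields strings in the SSE wire format (`data: ...` followed by a blank line) and could be handed to a streaming HTTP response; `stream_bulk_run`, `run_one`, and the payload fields are hypothetical names, not BulkRunner's interface.

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def stream_bulk_run(
    tickers: Sequence[str],
    run_one: Callable[[str], object],   # per-ticker backtest (assumed JSON-safe)
    workers: int = 16,
    queue_size: int = 64,
) -> Iterator[str]:
    """Run tickers concurrently and yield one SSE message per finished ticker."""
    updates: "queue.Queue[dict]" = queue.Queue(maxsize=queue_size)

    def work(ticker: str) -> None:
        try:
            payload = {"ticker": ticker, "status": "ok", "result": run_one(ticker)}
        except Exception as exc:   # report the failure instead of stalling the stream
            payload = {"ticker": ticker, "status": "error", "error": str(exc)}
        updates.put(payload)       # blocks when the consumer falls behind

    total = len(tickers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for ticker in tickers:
            pool.submit(work, ticker)
        for done in range(1, total + 1):
            payload = updates.get()
            payload.update(done=done, total=total)
            yield f"data: {json.dumps(payload)}\n\n"


events = list(
    stream_bulk_run(["AAA", "BB", "C"], run_one=len, workers=2, queue_size=1)
)
```

The bounded queue gives back-pressure: workers pause on `put` when the consumer is slow. One caveat of this sketch is that a consumer that abandons the generator early can leave workers blocked on a full queue, so a real endpoint would need a cancellation path.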

Predicted vs realized edge, the bootstrap prior, the clamp, and why I built this layer before any production-scale alpha are covered in the calibration deep-dive →

The deflated Sharpe with its explicit trial count, the block-bootstrap CIs, and the test that proves a real edge survives selection pressure are covered in the selection-bias deep-dive →

The decision surface tying it together

Every layer above feeds a single landing page, the Morning Brief. Regime pills (US + Nordic) at the top, a portfolio pulse with current heat coloured against the 6% cap, capital deployed, top-3 concentration, per-position stop distance. Three ranked action lists (OPEN / CLOSE / WATCH) filtered by deflated-Sharpe significance, signal status (ACTIVE / RECENT / NONE), and the calibration warning. Data-quality and FX warnings surface inline rather than being silently suppressed. Drill-downs into the scanner, the chart, the signals view, and the journal sit one click away.

Inline take/decline buttons on every opportunity write a decision into the journal regardless of whether I act on it. Declined signals are data too: if the strategy decays, I want to see whether the trades I didn't take would have worked, not only the ones I did.

What this project is deliberately not

Not a claim about returns. I have forward-test data and opinions about which strategies hold up; the calibration layer says, today, "the system is currently overshooting realized edge by more than 3×, do not treat the rankings as tradeable." Posting in-sample numbers from a system that publishes that warning about itself would be dishonest. The engineering is the artefact: the trust layer, the engine, the scanner, the portfolio sizer, the journal. Returns are what those layers exist to keep honest.

Surrounding stack

FastAPI + Polars on the backend, PostgreSQL for journal and price storage. Vite + React 19 + TanStack Router + lightweight-charts on the frontend with native multi-pane support. TanStack Query at the root with a centralised query-keys registry so cache invalidations don't drift across components. Husky + lint-staged + Biome gate formatting on commit.

Related work