Equities backtester + risk budget
Thoth
An equities backtester built with the statistical honesty of a real-money system: selection-bias-aware deflated Sharpe with explicit trial count, stationary block-bootstrap confidence intervals, predicted-vs-realized calibration against a live trade journal, correlation-adjusted quarter-Kelly sizing with heat / sector / currency caps. The engine work (pure Polars strategies, vectorised regime detection with hysteresis, threaded scanner) is what makes the trust layer cheap enough to actually run every morning.
DSR + CI
Selection-bias-aware ranking
Live calibration
Predicted vs realized, journal-blended
Quarter-Kelly
Correlation-adjusted with heat caps
<7 s
13-strategy universe scan
What it is
A personal equities research platform I'm building toward the bar I'd want a tool to clear before I let it size positions with real money. Thoth used to be three separate pages: a scanner, a chart, a portfolio view. Each was useful in isolation. None of them answered the only question I actually have on a Monday morning:
Given my portfolio, my cash, my open risk, and today's regime, what (if anything) should I do today?
The platform is organised around answering that one question honestly. The engineering effort that I'm proudest of, and that I think is most transferable to any quantitative system under selection pressure, is the layer between "the backtest looked great" and "the live trade actually made money." Below is what's in that layer, and why each piece is there.
The financial trust layer
Trustworthy expected profit and loss (P&L) has three preconditions, and each one has to be independently honest: a selection-bias-aware estimate of edge, calibrated uncertainty around that estimate, and a persistent record that holds those estimates accountable to what later actually happened.
Deflated Sharpe Ratio with an explicit trial count
Bailey & López de Prado (2014). A raw Sharpe ratio is a strongly upward-biased
statistic when you've tried many strategies on many tickers and reported the
best, because selection bias swamps any actual edge. The Deflated Sharpe Ratio (DSR)
applies the expected maximum of N independent trials drawn from a
null distribution as a penalty, scaled by the higher-order moments (skew, excess
kurtosis) of the return distribution. Thoth passes the trial count to the
statistic explicitly: every scanner row is one (ticker, strategy) trial,
so a 350-ticker × 13-strategy scan has N = 4,550 trials, and that
number is what the deflation is calibrated to. The scanner ranks by DSR, not raw
Sharpe. A p-value comes out of the same calculation; results above a configurable
threshold can be filtered out before ranking, so a one-trade 100% win-rate fluke
doesn't lap the board. A unit test pins the behaviour: a 0.5 Sharpe that clears the bar
as a single hypothesis fails it once it is the best of 2,500 trials, while a genuine
4.0 Sharpe survives.
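A minimal sketch of the deflation step, following the Bailey & López de Prado formulation. It is not Thoth's actual code: the function names, the `trial_sharpe_std` parameter (the standard deviation of Sharpe estimates across trials), and the example numbers are assumptions for illustration. Sharpe values here are per observation, not annualised.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials: int, trial_sharpe_std: float) -> float:
    """Expected best Sharpe among n_trials independent zero-edge trials."""
    if n_trials < 2:
        return 0.0  # a single hypothesis carries no selection penalty
    z = NormalDist()
    a = z.inv_cdf(1.0 - 1.0 / n_trials)
    b = z.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return trial_sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sharpe: float, n_obs: int, n_trials: int,
                    trial_sharpe_std: float,
                    skew: float = 0.0, kurtosis: float = 3.0) -> float:
    """Probability that the observed Sharpe beats the selection-bias benchmark."""
    sr0 = expected_max_sharpe(n_trials, trial_sharpe_std)
    # Variance correction for non-normal returns (kurtosis is raw, normal = 3).
    denom_sq = 1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2
    if denom_sq <= 0.0:
        raise ValueError("moment correction is non-positive for these inputs")
    stat = (sharpe - sr0) * math.sqrt(n_obs - 1) / math.sqrt(denom_sq)
    return NormalDist().cdf(stat)


# Same Sharpe, judged alone and as the best of a 350 x 13 scan.
alone = deflated_sharpe(0.05, n_obs=500, n_trials=1, trial_sharpe_std=0.05)
best_of_scan = deflated_sharpe(0.05, n_obs=500, n_trials=4550, trial_sharpe_std=0.05)
```

The one-sided p-value mentioned above is `1 - deflated_sharpe(...)` in this sketch.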
Stationary block bootstrap CIs and Monte-Carlo trade-order shuffle
Point estimates lie about themselves. Every metric the scanner reports (Sharpe, Sortino, win-rate, max drawdown) is accompanied by bootstrap confidence intervals (CIs) from a stationary block bootstrap on the trade-return series. Blocks preserve any short-horizon autocorrelation that a simple independent and identically distributed (IID) bootstrap would destroy. The scanner uses the lower bound of the Sharpe CI as its actionable column, not the point estimate. Separately, a Monte-Carlo trade-order shuffle resamples the order of wins and losses to separate "this strategy has edge" from "this strategy got lucky with the timing of its wins," reported as P5/P50/P95 drawdown and terminal return.
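A rough sketch of a stationary (Politis-Romano style) block bootstrap for the Sharpe CI, using only the standard library. The function names, the mean block length of 5, and the percentile-interval choice are assumptions for illustration, not Thoth's settings.

```python
import random
import statistics


def stationary_bootstrap_indices(n: int, mean_block: float, rng: random.Random) -> list[int]:
    """Resample n indices in blocks whose lengths are geometric with the given mean."""
    p_new_block = 1.0 / mean_block
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p_new_block:
            idx.append(rng.randrange(n))      # start a new block at a random point
        else:
            idx.append((idx[-1] + 1) % n)     # continue the block, wrapping around
    return idx


def sharpe(returns: list[float]) -> float:
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def sharpe_ci(returns: list[float], mean_block: float = 5.0, n_boot: int = 1000,
              alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile confidence interval for the per-observation Sharpe ratio."""
    rng = random.Random(seed)
    n = len(returns)
    stats = sorted(
        sharpe([returns[i] for i in stationary_bootstrap_indices(n, mean_block, rng)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi
```

Under this sketch, the actionable column would be the first element of `sharpe_ci(trade_returns)`.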
The calibration engine: realized vs predicted edge
This is the keystone, and the part I think makes Thoth actually distinctive. Every
decision the scanner publishes goes into a trade_decisions table with
a JSONB signal_snapshot capturing the full predicted edge at decision
time. When a position closes, an atomic write inserts a paired
trade_outcomes row with realized P&L, converted to Danish kroner (DKK) at the exit-date foreign-exchange (FX) rate. A calibration job computes, per strategy and globally:
ratio = mean(realized_pct) / mean(predicted_pct)
The ratio is blended with a fixed-weight bootstrap prior (PRIOR_N = 110 virtual
trades from a closed historical sweep), so a brand-new strategy doesn't degenerate
to zero or to infinity. Live evidence dominates the prior once the accumulated trade
count exceeds it. The scanner then publishes calibrated_sharpe, the
deflated Sharpe × the calibration ratio. That number ranks the leaderboard.
The painful, honest truth this layer published when it was first turned on: a 2026 sweep showed predicted ~+6.2% per trade becoming realized ~+0.15%. Calibration ratio ≈ 0.024, clamped to a 0.1 floor so position sizing degrades safely rather than going to zero on a single bad month. The scanner emits a "low-trust mode" warning on the Morning Brief when the global ratio is below 30%, which it currently is. That is the system telling its user not to trust its raw numbers, and instructing the rest of the pipeline (Kelly sizing, heat budget) to size down accordingly. The point isn't that the strategies don't work; the point is that the backtest overshoots reality, and the system that ships these numbers should be the one telling you that before you size a position on them.
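One way the blend, the floor, and the low-trust flag could fit together, as a sketch. `PRIOR_N`, the 0.1 floor, and the 30% threshold come from the text; the count-weighted average, the `prior_ratio` parameter, and the function name are assumptions, since the exact blending rule isn't spelled out here.

```python
from statistics import fmean

PRIOR_N = 110            # virtual trades carried by the prior (from the text)
RATIO_FLOOR = 0.1        # clamp floor (from the text)
LOW_TRUST_BELOW = 0.30   # warning threshold (from the text)


def calibration_ratio(realized_pct: list[float], predicted_pct: list[float],
                      prior_ratio: float) -> tuple[float, bool]:
    """Return (clamped ratio, low_trust flag) for one strategy or the whole book."""
    if len(realized_pct) != len(predicted_pct):
        raise ValueError("each realized outcome needs its paired prediction")
    n = len(realized_pct)
    mean_pred = fmean(predicted_pct) if n else 0.0
    if n == 0 or mean_pred == 0.0:
        blended = prior_ratio                     # no usable live evidence yet
    else:
        live = fmean(realized_pct) / mean_pred
        # Assumed rule: weight prior and live ratio by their trade counts.
        blended = (PRIOR_N * prior_ratio + n * live) / (PRIOR_N + n)
    low_trust = blended < LOW_TRUST_BELOW
    return max(blended, RATIO_FLOOR), low_trust
```

With enough live trades at predicted +6.2% and realized +0.15%, the live ratio is 0.15 / 6.2 ≈ 0.024, the blend falls below the floor, and the function returns `(0.1, True)`.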
Walk-forward validation, strategy decay, and survivorship
Walk-forward cross-validation (CV) runs per strategy on a sliding window. Train on
[t0, T), test on a slab immediately after, slide forward. An overfit
ratio (in-sample edge vs out-of-sample edge) is recorded per fold; the scanner
renders it as a column and dims any row whose recent folds have decayed beyond
N consecutive sigma. A separate haircut subtracts a fixed Sharpe-units
penalty per year of backtest window beyond a threshold, on the principle that
delisted-and-gone tickers don't appear in the current universe and longer
windows therefore over-state edge by an amount that grows with the window. The
haircut is applied after the calibration so the raw deflated number
stays diagnostic and the visible ranking is net of survivorship.
Sharpe annualisation, a worked example of "look up one level"
Earlier versions of the engine annualised Sharpe by sqrt(n_trades).
That looks reasonable until you notice it's biased by turnover rate: a 100-trades/year
strategy would report a Sharpe √5 ≈ 2.24× that of an equal-edge 20-trades/year
strategy under that convention, which is plainly nonsense. Sharpe is supposed to be
a property of the edge, not a property of how often you swing the bat. The current
engine builds a daily-return series (each trade distributes its net_return_pct
uniformly across the trading days it was held; idle days earn zero) and
annualises by sqrt(252). The Sortino denominator was rebuilt at the same
time against the canonical formula: downside deviation over all
observations, not only the negative ones. Restricting it to the negative
ones inflates Sortino for strategies that rarely lose.
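A small sketch of the daily-series convention described above. The `(entry_day, exit_day, net_return_pct)` tuple layout, the additive split of a trade's return across its held days, and the function names are assumptions; the sqrt(252) factor and the all-observations downside deviation follow the text.

```python
import math
from statistics import fmean, stdev

TRADING_DAYS = 252


def daily_returns(trades: list[tuple[int, int, float]], n_days: int) -> list[float]:
    """Spread each trade's net return evenly over the days it was held; idle days stay 0."""
    daily = [0.0] * n_days
    for entry_day, exit_day, net_return_pct in trades:
        held = max(exit_day - entry_day, 1)
        for day in range(entry_day, entry_day + held):
            daily[day] += net_return_pct / held
    return daily


def annualised_sharpe(daily: list[float]) -> float:
    sd = stdev(daily)
    return fmean(daily) / sd * math.sqrt(TRADING_DAYS) if sd > 0 else 0.0


def annualised_sortino(daily: list[float]) -> float:
    # Downside deviation over ALL observations: positive days contribute zero
    # to the numerator but still count in the denominator.
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in daily) / len(daily))
    return fmean(daily) / downside * math.sqrt(TRADING_DAYS) if downside > 0 else math.inf
```

Because the series has one entry per trading day whether or not a trade was open, turnover no longer enters the annualisation factor.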
Portfolio sizing: Kelly with caps and correlations
Even with a calibrated, deflated, bootstrapped expected edge, raw Kelly sizing is
too aggressive in practice. Thoth's /portfolio/proposals endpoint
takes a ranked candidate list and an existing book and returns a sized buy list
plus a per-candidate rejection log. The sizing math:
- Continuous Kelly f* = mean / variance on per-trade returns, with Bessel-style sample-size shrinkage (n-2)/n so a 3-trade Kelly doesn't size a position to insolvency. Quarter-Kelly is the default; the raw value is exposed for diagnostics.
- Correlation-adjusted via a return-correlation matrix loaded in a single batched SQL call across "held + candidate" symbols. Two highly correlated candidates don't both get full quarter-Kelly; the second one is sized down proportional to the correlation already present in the book.
- Caps: 6% portfolio heat (sum of stop_distance × position_size), 40% per sector, 70% per currency. Each rejection carries the exact cap that bound the proposal so the user can read the reason.
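The sizing steps in the list above, sketched in a few functions. The shrinkage factor, the quarter-Kelly default, and the 6% heat cap are from the text; the sample-variance choice, the "scale by one minus the worst correlation" rule, and all names are assumptions standing in for the real adjustment.

```python
from statistics import fmean, variance

HEAT_CAP = 0.06   # 6% portfolio heat (from the text)


def shrunk_kelly(returns: list[float], fraction: float = 0.25) -> float:
    """Fractional Kelly f* = mean / variance, shrunk by (n - 2) / n."""
    n = len(returns)
    if n < 3:
        return 0.0                       # shrinkage is zero or undefined below 3 trades
    var = variance(returns)              # sample variance (assumed convention)
    if var <= 0.0:
        return 0.0
    raw = fmean(returns) / var
    return max(raw, 0.0) * (n - 2) / n * fraction


def correlation_adjusted(size: float, correlations_with_book: list[float]) -> float:
    """Hypothetical rule: scale down by the strongest correlation already held."""
    worst = max((abs(c) for c in correlations_with_book), default=0.0)
    return size * (1.0 - min(worst, 1.0))


def heat_after(book: list[tuple[float, float]], candidate: tuple[float, float]) -> float:
    """Heat = sum of stop_distance x position_size, both as portfolio fractions."""
    return sum(stop * size for stop, size in book + [candidate])


def fits_heat_cap(book: list[tuple[float, float]], candidate: tuple[float, float]) -> bool:
    return heat_after(book, candidate) <= HEAT_CAP
```

A three-trade sample keeps only a third of its raw Kelly before the quarter fraction is applied, which is the point of the shrinkage.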
FX-aware DKK P&L
A book that holds USD, SEK, NOK, EUR, GBP, and CHF positions cannot pretend FX
rates are 1.0. Thoth ingests daily FX rates per currency and records P&L in
DKK at the actual rates that bracketed the trade: entry-date FX for cost basis,
exit-date FX for proceeds. Missing FX surfaces as an explicit warning in the
portfolio pulse, never as a silent 1.0. realized_pnl_dkk is populated
atomically with realized_pnl_local on close, so the calibration layer
that feeds back into edge estimation reads honest, FX-corrected, slippage-included
outcomes.
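A short sketch of the entry-rate / exit-rate convention, with a missing rate returned as an explicit warning rather than a default. The function name, argument names, and rates (quoted as DKK per unit of local currency) are hypothetical.

```python
from typing import Optional


def realized_pnl_dkk(cost_local: float, proceeds_local: float,
                     fx_entry: Optional[float],
                     fx_exit: Optional[float]) -> tuple[Optional[float], list[str]]:
    """Convert cost at the entry-date rate and proceeds at the exit-date rate."""
    warnings = []
    if fx_entry is None:
        warnings.append("missing entry-date FX rate")
    if fx_exit is None:
        warnings.append("missing exit-date FX rate")
    if warnings:
        return None, warnings            # never fall back to a silent 1.0
    return proceeds_local * fx_exit - cost_local * fx_entry, []


# A +100 USD trade while USD/DKK slid from 7.0 to 6.8:
pnl, notes = realized_pnl_dkk(1000.0, 1100.0, fx_entry=7.0, fx_exit=6.8)
```

In that example the DKK result is about 480, not the 680 that converting the local profit at the exit rate alone would suggest.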
Regime detection with hysteresis
Strategies declare a regime affinity (trending_up, volatile_bull,
trending_down, correction, ranging); the
scanner gates entry signals on the current regime so a mean-reversion strategy
doesn't fire in a trend, and vice versa. Computing regime cleanly turns out to be
the hard part: a sharp cutoff on whether price sits above its 200-day simple moving average (SMA200) flips every bar a noisy
index nicks SMA200 by a tick. Two filters fix this:
- Band hysteresis. Price must clear SMA200 by +0.5% to enter trending_up, and fall to −0.5% below SMA200 to leave it. Volatility ratio uses an asymmetric enter/exit (1.5× to enter elevated, 1.3× to exit).
- Persistence. A candidate regime only commits after it has held for N consecutive bars. A single-bar blip back into trending_up in the middle of a correction doesn't flip the regime; three bars in a row does.
Both filters are vectorised on Polars. The committed regime per bar feeds the join that gates strategies during the scan.
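For readability, here are the two filters as plain Python loops. This is a sketch of the logic only, not the vectorised Polars version the engine uses; the function names and the default of three bars are assumptions taken from the example in the text.

```python
def band_hysteresis(prices: list[float], sma200: list[float],
                    band: float = 0.005) -> list[bool]:
    """True while price is 'above trend': enter at +band, leave only at -band."""
    above = False
    out = []
    for price, sma in zip(prices, sma200):
        if not above and price > sma * (1 + band):
            above = True
        elif above and price < sma * (1 - band):
            above = False
        out.append(above)
    return out


def persist(raw: list[str], n_bars: int = 3) -> list[str]:
    """Commit a new regime only after it has held for n_bars consecutive bars."""
    if not raw:
        return []
    committed = raw[0]
    candidate, run = raw[0], 0
    out = []
    for label in raw:
        if label == committed:
            candidate, run = committed, 0      # blip ended, reset the counter
        elif label == candidate:
            run += 1
        else:
            candidate, run = label, 1          # a different challenger starts over
        if candidate != committed and run >= n_bars:
            committed, run = candidate, 0
        out.append(committed)
    return out
```

A price that hovers within half a percent of SMA200 leaves `band_hysteresis` unchanged, and a one-bar excursion never survives `persist`.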
The engine and the scanner
Everything above presupposes that the underlying backtest is cheap enough to run the full universe several times a day. That's where the loop-structural optimisation work lives. None of it is micro-tuning; all of it is making sure Polars gets to do what Polars is built for.
BacktestEngine: vectorised, then trade-iterated
Naive backtests iterate over price rows. The faster shape iterates over
trades, and trades are sparse. BacktestEngine.run takes a
Polars frame with pre-computed entry_signal and exit_signal
boolean columns, filters for entry candidates (typically a handful per ticker per
year), and scans forward only between entries. Two micro-decisions make the inner
scan cheap: columns are pre-extracted to Python lists before the loop because
Polars columns are not random-access-cheap and the hot scan needs
closes[i]; per-symbol cost profiles are looked up once and passed in,
not resolved inside the trade loop. The engine averages under 2 ms per ticker on
the standard universe, including stop-loss / time-limit / trailing-stop logic that
respects gap-throughs at the open separately from intraday piercing.
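The trade-iterated shape can be illustrated with plain lists. This sketch fills at the close and exits on the exit signal or a time limit only; costs, stop-losses, trailing stops, and gap handling are left out, and the function name and `max_hold` parameter are assumptions.

```python
def run_trades(closes: list[float], entry_signal: list[bool],
               exit_signal: list[bool], max_hold: int = 20) -> list[tuple[int, int, float]]:
    """Iterate over entry candidates, scanning forward only while a trade is open."""
    n = len(closes)
    entries = [i for i, flag in enumerate(entry_signal) if flag]   # sparse
    trades = []
    next_free = 0                      # first bar on which a new entry is allowed
    for i in entries:
        if i < next_free or i >= n - 1:
            continue                   # already in a trade, or no bar left to exit on
        last = min(i + max_hold, n - 1)
        j = i + 1
        while j < last and not exit_signal[j]:
            j += 1                     # hot scan over pre-extracted lists
        trades.append((i, j, closes[j] / closes[i] - 1.0))
        next_free = j + 1
    return trades
```

In a setup like the one described, `closes`, `entry_signal`, and `exit_signal` would come from the Polars frame's columns converted to lists once, before the loop.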
The scanner runs all 13 strategies against the full ticker universe, regime-gated, and ranks the survivors by the deflated and calibrated number rather than raw Sharpe. Keeping that scan cheap enough to run several times a day is what lets the statistics above be recomputed live instead of cached and trusted.
BulkRunner: concurrent execution with SSE progress
For research workflows I want to run a strategy against hundreds of tickers and
watch progress as it happens. BulkRunner uses a 16-worker
ThreadPoolExecutor (Polars releases Python's global interpreter lock, the GIL, on the compute), pushes
per-ticker updates into a bounded Queue, and the frontend consumes
that as Server-Sent Events (SSE). The UI shows a live progress bar as results stream
in, rather than spinning silently for ten seconds.
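A sketch of the worker-pool-to-SSE hand-off using only the standard library. The 16 workers and the bounded queue follow the text; the function name, the queue size, the payload fields, and the `run_one` callback are hypothetical, and the result is assumed to be JSON-serialisable.

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def stream_progress(tickers: list[str], run_one: Callable[[str], object],
                    max_workers: int = 16, queue_size: int = 64) -> Iterator[str]:
    """Yield one Server-Sent Events frame per finished ticker."""
    updates: queue.Queue = queue.Queue(maxsize=queue_size)

    def worker(ticker: str) -> None:
        # Always enqueue something, so the consumer's count cannot stall on a failure.
        try:
            payload = {"ticker": ticker, "ok": True, "result": run_one(ticker)}
        except Exception as exc:
            payload = {"ticker": ticker, "ok": False, "error": str(exc)}
        updates.put(payload)           # blocks if the consumer falls behind

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for ticker in tickers:
            pool.submit(worker, ticker)
        for done in range(1, len(tickers) + 1):
            payload = updates.get()
            payload["done"] = done
            payload["total"] = len(tickers)
            yield f"data: {json.dumps(payload)}\n\n"   # SSE frame: data line + blank line
```

A generator like this could be handed to a streaming HTTP response with the `text/event-stream` media type. One caveat of this sketch: if the client disconnects early, workers blocked on a full queue would need a timeout or cancellation path that is not shown.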
Predicted vs realized edge, the bootstrap prior, the clamp, and why I built this layer before any production-scale alpha are covered in the calibration deep-dive →
The deflated Sharpe with its explicit trial count, the block-bootstrap CIs, and the test that proves a real edge survives selection pressure are covered in the selection-bias deep-dive →
The decision surface tying it together
Every layer above feeds a single landing page, the Morning Brief. Regime pills (US + Nordic) at the top, a portfolio pulse with current heat coloured against the 6% cap, capital deployed, top-3 concentration, per-position stop distance. Three ranked action lists (OPEN / CLOSE / WATCH) filtered by deflated-Sharpe significance, signal status (ACTIVE / RECENT / NONE), and the calibration warning. Data-quality and FX warnings surface inline rather than being silently suppressed. Drill-downs into the scanner, the chart, the signals view, and the journal sit one click away.
Inline take/decline buttons on every opportunity write a decision into the journal regardless of whether I act on it. Declined signals are data too: if the strategy decays, I want to see whether the trades I didn't take would have worked, not only the ones I did.
What this project is deliberately not
Not a claim about returns. I have forward-test data and opinions about which strategies hold up; the calibration layer says, today, "the system is currently overshooting realized edge by more than 3×, do not treat the rankings as tradeable." Posting in-sample numbers from a system that publishes that warning about itself would be dishonest. The engineering is the artefact: the trust layer, the engine, the scanner, the portfolio sizer, the journal. Returns are what those layers exist to keep honest.
Surrounding stack
FastAPI + Polars on the backend, PostgreSQL for journal and price storage. Vite +
React 19 + TanStack Router + lightweight-charts on the frontend with
native multi-pane support.
TanStack Query at the root with a centralised query-keys registry so cache
invalidations don't drift across components. Husky + lint-staged + Biome gate
formatting on commit.
Related work
Anonymised, long-catalogue specialty e-commerce
Inventory decision engine
Replacing a legacy 121K-line per-SKU integer-programming procurement system, whose actual demand forecaster was this-year-over-last-year, with a two-stage decision engine: a LightGBM quantile demand forecaster feeding a HiGHS LP capital allocator. A four-way ablation cleanly attributes wins between the forecaster and the allocator, on a simulation engine that runs ~30× faster than the Python-idiomatic baseline and is locked by nine source-level invariants.
Optimization Fullstack Data
Two acts on a parish-admin platform
Provstiskyen: optimising then rewriting a 10-year SaaS
Two acts on a 44,000-line R Shiny platform that runs about half of Denmark's deaneries. Act I cut cold start from 50s to 18s and deploys from 35min to 80s on the existing codebase. Act II, once the architecture itself was the ceiling, is a full rewrite onto FastAPI, Polars, and React: performant by default, far more maintainable, with the legacy app retiring as the last module ports across.
DevOps Optimization Fullstack