
Backtesting engine and scanner

Thoth

Hot-loop discipline at the small scale: a 13-strategy backtest of the US equities universe finishes in seconds. Pure Polars expressions, threaded bulk runner, regime-gated strategies. The kind of code-level performance work I bring to bigger systems.

Optimization · Fullstack · Data
FastAPI · Polars · PostgreSQL · TimescaleDB · Astro · React

13 auto-discovered strategies · <2 ms per-ticker backtest · 16 concurrent workers · 5.7× scanner join speed-up

What it is

A personal backtesting platform I'm building for my own equities research. The goal is engineering, not finance: take a portable strategy specification (an entry rule, an exit rule, some risk parameters), run it against the daily price history of the entire US universe, and have results back fast enough that exploring the parameter space is a conversation rather than a batch job.

This page is about the engine: what's inside it, why the inner loop is shaped the way it is, and where the speed-ups came from. I'm deliberately not making claims about market returns here; the engine is the artefact.

Engine shape

BacktestEngine.run: vectorised, then trade-iterated

The naïve backtest iterates over price rows. The faster shape iterates over trades, and trades are sparse. BacktestEngine.run takes a Polars frame with pre-computed entry_signal and exit_signal boolean columns, filters for the entry candidates (typically a handful per ticker per year), and scans forward only between entries.

The two micro-decisions that make this work:

  • Columns are pre-extracted to Python lists before the inner loop. The hot scan needs random index access for closes[i], dates[i], exit_signals[i]; indexing a Polars Series element by element carries per-call overhead, while Python list indexing is cheap constant-time. The conversion is paid once at the start of run().
  • Cost profiles are looked up once per symbol. Commission structures vary by venue; resolving them inside the trade loop would mean a dictionary lookup per trade. Hoisting it outside cuts that out.

Combined, the engine averages under 2 ms per ticker on the standard universe.
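
A minimal sketch of that inner-loop shape. The names (run_single, cost_per_trade, the dict-shaped trade record) are illustrative, not the real engine's API:

```python
import polars as pl

def run_single(df: pl.DataFrame, cost_per_trade: float) -> list[dict]:
    """Trade-iterated backtest over a frame that already carries boolean
    entry_signal / exit_signal columns. Illustrative sketch only."""
    # Pay the column -> Python list conversion once, up front; the hot
    # loop below then gets cheap random indexing.
    closes = df["close"].to_list()
    dates = df["date"].to_list()
    entries = df["entry_signal"].to_list()
    exits = df["exit_signal"].to_list()

    # Entries are sparse (a handful per ticker per year), so iterate over
    # those, not over every price row.
    entry_idx = [i for i, flag in enumerate(entries) if flag]

    trades: list[dict] = []
    last_exit = -1
    for i in entry_idx:
        if i <= last_exit:          # still inside the previous trade
            continue
        j = i + 1                   # scan forward only until the next exit signal
        while j < len(exits) and not exits[j]:
            j += 1
        if j >= len(exits):
            break                   # no exit before the end of history
        trades.append({
            "entry_date": dates[i],
            "exit_date": dates[j],
            # cost profile resolved once per symbol, outside this loop
            "pnl": closes[j] - closes[i] - cost_per_trade,
        })
        last_exit = j
    return trades
```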

BulkRunner: concurrent execution with progress streaming

For research workflows I want to run a strategy against hundreds of tickers and watch progress as it happens. BulkRunner uses a ThreadPoolExecutor with 16 workers (Polars releases the GIL on the actual compute), pushes per-ticker updates into a bounded Queue, and the frontend consumes that as Server-Sent Events. The UI shows a live progress bar as results stream in, rather than spinning for ten seconds then displaying everything at once.
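
The concurrency shape, roughly. Names are hypothetical, and the real BulkRunner also carries strategy configuration; this is only the executor-plus-queue skeleton:

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator

def run_bulk(tickers: Iterable[str],
             backtest_one: Callable[[str], dict],
             workers: int = 16) -> Iterator[dict]:
    """Yield per-ticker results as they complete; the caller can wrap each
    yielded dict in a Server-Sent Event. Illustrative sketch."""
    results: queue.Queue = queue.Queue(maxsize=64)   # bounded: backpressure
    tickers = list(tickers)

    def worker(ticker: str) -> None:
        try:
            results.put({"ticker": ticker, "result": backtest_one(ticker)})
        except Exception as exc:                     # a bad ticker must not stall the stream
            results.put({"ticker": ticker, "error": str(exc)})

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for t in tickers:
            pool.submit(worker, t)
        # Polars releases the GIL on the heavy compute, so the workers
        # genuinely overlap; meanwhile each finished result is streamed
        # out as soon as it lands on the queue.
        for _ in range(len(tickers)):
            yield results.get()
```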

Scanner: strategies × tickers without paying for it twice

The scanner runs multiple strategies against every ticker, with regime gating (a mean-reversion strategy shouldn't trade in a trending market; a momentum strategy shouldn't trade in chop). Naïvely that means computing the market regime 13 times per ticker, once per strategy, and joining the regime frame onto OHLCV inside every strategy's hot path.

The regime is a function of the ticker's price history, not of the strategy. The scanner hoists the regime computation and the regime-join up one level, so each ticker pays for it exactly once and every strategy reads the joined frame for free. That single change made the join phase of a 350-ticker scan roughly 5.7× faster. The full argument is in the deep-dive writeup linked below.
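
In loop-structure terms, the fix looks roughly like this; compute_regime, generate_signals, and the META attribute are stand-in names, and the 200-day-mean classifier is only a placeholder:

```python
import polars as pl

def compute_regime(ohlcv: pl.DataFrame) -> pl.DataFrame:
    # Stand-in classifier: call it trending when price sits above its 200-day mean.
    return ohlcv.select(
        "date",
        regime=pl.when(pl.col("close") > pl.col("close").rolling_mean(200))
                 .then(pl.lit("trending"))
                 .otherwise(pl.lit("ranging")),
    )

def scan_ticker(ohlcv: pl.DataFrame, strategies: list) -> dict[str, pl.DataFrame]:
    # Hoisted: the regime is computed and joined once per ticker,
    # not once per strategy.
    enriched = ohlcv.join(compute_regime(ohlcv), on="date", how="left")
    signals = {}
    for strat in strategies:
        # Every strategy reads the already-joined frame and gates its own
        # entries on the regime column; no join inside the hot path.
        df = strat.generate_signals(enriched)
        signals[strat.META.name] = df.with_columns(
            entry_signal=pl.col("entry_signal") & (pl.col("regime") == strat.META.regime)
        )
    return signals
```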

Strategies are pure Polars expressions, auto-discovered

Each of the 13 strategies (volatility, mean-reversion, momentum, trend, multi-factor) is a single file in app/strategies/ with a StrategyMeta attribute that declares its regime affinity. They're discovered at startup via pkgutil; adding a new strategy means dropping a file in the directory. The body of each is pure Polars expressions over the OHLCV frame; no row loops, no iterrows, no Python iteration over price bars.
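
A sketch of what one such file and the startup discovery step might look like. The META attribute name, the z-score rule, and discover_strategies are illustrative, not the actual strategies:

```python
# app/strategies/mean_reversion.py (illustrative): one strategy, one file
from dataclasses import dataclass
import polars as pl

@dataclass(frozen=True)
class StrategyMeta:
    name: str
    regime: str            # the regime this strategy is allowed to trade in

META = StrategyMeta(name="mean_reversion", regime="ranging")

def generate_signals(df: pl.DataFrame) -> pl.DataFrame:
    # Pure Polars expressions over the OHLCV frame: no row loops, no iterrows.
    zscore = (pl.col("close") - pl.col("close").rolling_mean(20)) / pl.col("close").rolling_std(20)
    return df.with_columns(
        entry_signal=zscore < -2.0,
        exit_signal=zscore > 0.0,
    )


# elsewhere, at startup (illustrative): any module in app/strategies that
# exposes META and generate_signals is registered automatically
import importlib
import pkgutil
import app.strategies

def discover_strategies() -> dict:
    found = {}
    for info in pkgutil.iter_modules(app.strategies.__path__):
        mod = importlib.import_module(f"app.strategies.{info.name}")
        if hasattr(mod, "META") and hasattr(mod, "generate_signals"):
            found[mod.META.name] = mod
    return found
```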

Surrounding architecture

  • Router → service → dataloader. Same shape I use everywhere else. Routers do HTTP, services own business logic, dataloaders return Polars frames. Tests mock at the dataloader boundary.
  • 5-minute TTL cache on OHLCV reads. Strategies running in sequence against the same universe don't re-hit the database; the second one reads from the cache. Simple in-process dict, no Redis; sketched just after this list.
  • Nordnet bucket fees per trade. The accounting module knows the Danish broker's commission structure (Nordnet has tiered bucket pricing, not a flat per-trade fee). I want backtests to match what live execution would actually cost.
  • Earnings blackout. Symbols are excluded from entries for a configurable window around earnings announcements. The default is 5 days before and 1 day after; the historical earnings calendar is loaded from a separate provider and joined onto the universe at scan time.
  • Frontend. Astro 6 + React 19 + lightweight-charts. QueryProvider at the root; three pages (sector scanner, single-symbol chart, signals view), each a single client-only island.
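
The cache bullet above, as a minimal sketch. load_ohlcv stands in for the real dataloader call:

```python
import time
from typing import Callable
import polars as pl

_CACHE: dict[str, tuple[float, pl.DataFrame]] = {}
TTL_SECONDS = 300           # 5-minute TTL

def load_ohlcv_cached(ticker: str, load_ohlcv: Callable[[str], pl.DataFrame]) -> pl.DataFrame:
    """Sequential strategy runs against the same universe read from this
    dict instead of re-hitting the database. Illustrative sketch."""
    now = time.monotonic()
    hit = _CACHE.get(ticker)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: still inside the TTL
    frame = load_ohlcv(ticker)             # miss: real read at the dataloader boundary
    _CACHE[ticker] = (now, frame)
    return frame
```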

What's interesting about this as a project

Two things, both engineering rather than finance:

  1. The optimisation work that matters is always loop-structural. None of the wins here came from micro-optimising a Polars expression. They came from noticing that a join was at the wrong loop level, or that a column was being read in a way that made random access expensive, or that the engine was iterating over rows when it should have been iterating over trades.
  2. Real-money discipline forces engineering discipline. Because this is a system I want to eventually trust with my own capital, the engineering bar is higher than it would be for a portfolio-piece backtest. Per-symbol commission models, earnings blackouts, an accounting module that tracks slippage and effective entry/exit prices: the kind of thing that's tempting to skip if the only consumer is a screenshot.

What this project deliberately is not

It's not a claim about market returns. I have forward tests running and opinions about which strategies hold up; that's research that belongs in a research document, not a portfolio page. What's on this page is the engine and the choices that went into it.

Want the full argument for why moving the regime join out of the inner loop was the entire 5.7× win, and why a cache would have been the wrong fix? See the deep-dive writeup →

Related work