
Writing · 12 April 2026 · 7 min read

Moving a join out of an inner loop: a 5.7× scanner speed-up

The regime frame was joined onto OHLCV inside every strategy. Moving it up one nesting level — joining once per ticker instead of once per strategy-ticker pair — was the whole optimisation.

Optimization · Data

The Thoth scanner runs thirteen strategies across hundreds of US-equities tickers in under seven seconds. Most of that budget belongs to the strategies themselves: pure Polars expressions over OHLCV frames, vectorised, no row loops. But the scanner used to take three times longer than it does now, and the bug wasn’t in any of the strategies. It was in where the loop was nested.

This is a short writeup about that one specific change, because it illustrates a pattern I keep hitting in production code: the most valuable optimisations are usually the ones that move work out of the inner loop, not the ones that make the inner loop faster.
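For a sense of what those vectorised strategy expressions look like, here is a sketch. It is not one of the real thirteen, just a hypothetical momentum-style filter with made-up windows and thresholds, shown only to give the per-strategy work a concrete shape:

import polars as pl

def momentum_signal(ohlcv: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical strategy: flag days where the 20-day mean close sits
    # above the 50-day mean close while volume is expanding.
    return ohlcv.with_columns(
        fast=pl.col("close").rolling_mean(window_size=20),
        slow=pl.col("close").rolling_mean(window_size=50),
        vol_ratio=pl.col("volume") / pl.col("volume").rolling_mean(window_size=20),
    ).with_columns(
        signal=(pl.col("fast") > pl.col("slow")) & (pl.col("vol_ratio") > 1.0),
    )

Everything is a column expression; Polars evaluates it without a single Python-level row loop.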

The naive shape

The scanner’s job, in pseudo-code:

for ticker in universe:               # ~350 tickers
    ohlcv = load_ohlcv(ticker)
    for strategy in strategies:       # 13 strategies
        if strategy.regime_affinity_matches(ticker):
            regime = compute_regime(ohlcv)
            ohlcv_with_regime = ohlcv.join(regime, on="date")
            result = strategy.run(ohlcv_with_regime)
            yield (ticker, strategy, result)

The regime computation looks at the OHLCV frame and tags each day as trending_up, volatile_bull, ranging, etc. Strategies gate on it — a mean-reversion strategy shouldn’t run in a trending regime, a momentum strategy shouldn’t run in a ranging regime. So the join was where it needed to be, computed inside the strategy loop, for the strategy to use.
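compute_regime itself isn't the subject of this post, but for concreteness, here is a minimal sketch of the idea: trend from a fast-versus-slow rolling mean, volatility from rolling return dispersion, with illustrative labels and thresholds rather than the scanner's real ones.

import polars as pl

def compute_regime(ohlcv: pl.DataFrame) -> pl.DataFrame:
    # Sketch only: classify each day from trend and realised volatility.
    # The cut-offs below are placeholders, not the production values.
    enriched = ohlcv.with_columns(
        ret=pl.col("close").pct_change(),
        fast=pl.col("close").rolling_mean(window_size=20),
        slow=pl.col("close").rolling_mean(window_size=100),
    ).with_columns(
        vol=pl.col("ret").rolling_std(window_size=20),
    )
    return enriched.select(
        "date",
        regime=pl.when((pl.col("fast") > pl.col("slow")) & (pl.col("vol") > 0.03))
        .then(pl.lit("volatile_bull"))
        .when(pl.col("fast") > pl.col("slow"))
        .then(pl.lit("trending_up"))
        .otherwise(pl.lit("ranging")),
    )

The property that matters for what follows is the signature: the output depends only on the OHLCV frame, one regime frame per ticker, joinable back on date.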

Total joins per scan: ~350 tickers × 13 strategies = 4,550 joins.

The profile

Before changing anything, I profiled. The hot path looked like this (proportions, not absolute):

Phase                                % of scan time
OHLCV loading (cached)                            5
Regime computation                                6
Polars joins (the in-loop ones)                  62
Strategy expression evaluation                   22
Aggregation and output                            5
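To reproduce a table like this, a perf_counter accumulator per phase is enough. The snippet below is a generic sketch of that approach; the timed helper and the phase names are illustrative, not the scanner's actual instrumentation.

from collections import defaultdict
from contextlib import contextmanager
import time

phase_totals: dict[str, float] = defaultdict(float)

@contextmanager
def timed(phase: str):
    # Accumulate wall-clock time per phase across all loop iterations.
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start

# In the scan loop:
#     with timed("join"):
#         ohlcv_with_regime = ohlcv.join(regime, on="date")
# Proportions are then phase_totals[p] / sum(phase_totals.values()) for each phase p.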

The joins were the dominant cost, not the strategies. That was the first surprise — the expressive Polars strategy code is what looks expensive, but it’s fast. The join glue between the frames was where the time went.

The observation

Look at the loop again:

for ticker in universe:
    ohlcv = load_ohlcv(ticker)
    for strategy in strategies:
        regime = compute_regime(ohlcv)            # ← same answer every iteration
        ohlcv_with_regime = ohlcv.join(regime, on="date")  # ← same join every iteration
        ...

The regime is a function of the OHLCV frame, not the strategy. Inside the inner loop, both regime and ohlcv_with_regime are computed with the same inputs thirteen times. They get exactly the same result.

This is the optimisation. The join — and the regime computation that feeds it — belongs one level up.

The fix

for ticker in universe:
    ohlcv = load_ohlcv(ticker)
    regime = compute_regime(ohlcv)
    ohlcv_with_regime = ohlcv.join(regime, on="date")
    for strategy in strategies:
        if strategy.regime_affinity_matches(ticker):
            result = strategy.run(ohlcv_with_regime)
            yield (ticker, strategy, result)

Two lines moved up one level. Total joins per scan: ~350 tickers × 1 = 350 joins.

The measurement

I wrote a microbench that runs the join phase in isolation, on 350 tickers × 13 strategy iterations of the synthetic OHLCV fixture. Before the move: 3.1 seconds. After: 0.54 seconds. 5.7× faster on the join work alone.
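The microbench isn't reproduced here, but its shape is roughly the sketch below: time only the joins, with regimes precomputed so nothing else leaks into the measurement. The make_synthetic_ohlcv helper stands in for the synthetic fixture and is hypothetical.

import time
import polars as pl

def bench_join_phase(tickers: list[str], n_strategies: int = 13, hoisted: bool = False) -> float:
    # Precompute frames and regimes up front so only the join work is timed.
    frames: dict[str, pl.DataFrame] = {t: make_synthetic_ohlcv(t) for t in tickers}  # hypothetical fixture
    regimes = {t: compute_regime(f) for t, f in frames.items()}
    start = time.perf_counter()
    for t in tickers:
        # One join per ticker when hoisted, one per strategy iteration otherwise.
        for _ in range(1 if hoisted else n_strategies):
            frames[t].join(regimes[t], on="date")
    return time.perf_counter() - start

Calling it with hoisted=False and then hoisted=True gives the before/after numbers for the join phase alone.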

End-to-end scan time dropped less dramatically — the rest of the work isn't joins — but the overall scan went from ~12 seconds to under 7. Downstream of the scanner, the selection of which strategies to run live is now bottlenecked on strategy evaluation rather than on infrastructure plumbing.

The pattern

This is the same pattern as every other “move work out of the inner loop” optimisation:

  1. Find the loop with the highest iteration count.
  2. Identify expressions inside it whose inputs don’t depend on the loop variable.
  3. Move them outward until they depend on something the outer loop changes.

Computer science 101 calls this loop-invariant code motion. Compilers do it for scalar expressions. They do not do it for Polars dataframe joins, or for any I/O-bound computation that crosses a non-obvious abstraction boundary. The vast majority of high-leverage optimisation work I’ve done in Polars and pandas codebases has been this pattern, applied to operations the runtime is happy to recompute redundantly because it has no way to know the developer didn’t intend them to be recomputed.

What didn’t work

Two things I tried before noticing the join-nesting issue:

  1. Caching the regime per ticker, keyed by ticker name. This would have worked, but it adds state to the scanner, and that state needs to be invalidated whenever OHLCV data refreshes. The structural fix (move the call site, no cache) was cleaner; a sketch of what the cache would have looked like follows this list.

  2. Caching the join result. Same problem, twice over: invalidation is hard, and the cache is itself memory pressure. Better to not have the redundant computation in the first place.
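For concreteness, the per-ticker cache from the first item would have been only a few lines, and every one of them is state the structural fix never has to carry (names here are illustrative):

import polars as pl

_regime_cache: dict[str, pl.DataFrame] = {}

def regime_for(ticker: str, ohlcv: pl.DataFrame) -> pl.DataFrame:
    # Works, but the cache holds a frame per ticker for the whole scan
    # and goes stale the moment that ticker's OHLCV refreshes.
    if ticker not in _regime_cache:
        _regime_cache[ticker] = compute_regime(ohlcv)
    return _regime_cache[ticker]

def invalidate(ticker: str) -> None:
    # The hidden obligation: something has to call this on every data refresh.
    _regime_cache.pop(ticker, None)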

Caches are sometimes the right answer. But before reaching for a cache, ask whether the expensive operation needs to happen at all. In this case, the expensive operation didn’t need to happen 12 of the 13 times — they were duplicates.

Why this matters for production analytics

Polars and similar columnar engines make individual operations fast. They do not, on their own, make poorly-structured pipelines fast. The framework gives you primitives that scale; the structure of your code decides whether the framework gets to use them.

The scanner’s strategies are exactly the kind of code Polars is built for: vectorised expressions, no Python-level loops, no row-by-row work. None of that helped before the join nesting was fixed. The performance ceiling of the system was determined by where the joins were called, not by how fast each join ran.

The lesson, two ways:

  • Profile first. The naive expectation was that the strategies would be the cost. They weren’t.
  • Look up one level before optimising the level you're on. The join was fast as a primitive; calling it thirteen times per ticker instead of once was the cost.

That’s the whole story. One block of code moved up one indentation level. 5.7× win on the join phase. The scanner runs the universe in under seven seconds, and the time budget now goes where it actually belongs: to the strategies themselves.