
Writing · 28 April 2026 · 11 min read

Locking a Polars hot loop with source-level invariants

Optimisations get reverted when someone refactors them later. Nine source-level invariants guard the inventory-arena simulator's hot loop from regressing, and from being “fixed” by accident.


A backtesting simulator has a tick loop. The tick loop runs millions of times across multi-year backtests. Every line of Python in that loop is a tax — paid per tick, per strategy, per backtest. The win from a 5% improvement in the tick body is repeated tens of millions of times in a single CI run.

The inventory-arena simulator runs seven competing procurement strategies head-to-head against years of real demand data. Its tick loop is in engine.py (case page). After several rounds of profiling and re-architecting, the loop runs roughly twenty times faster than its original Python-idiomatic version.

The problem with optimisations like that is they decay. Someone refactors a year later, in good faith. The refactor reads cleaner. The tick loop is now half as fast as it was. Nobody notices for a month because the simulator still works — it just takes 20 minutes instead of 10. The original optimisation is gone, and the reason for it is gone with it.

This post is about how I stopped that from happening here, by writing the optimisation contract into the test suite, not just into the code.

The nine invariants

The hot loop has nine invariants. Each one is a deliberate design choice, each one is the load-bearing part of a specific optimisation, and each one is enforced by tests/engine/test_invariants.py. The test file mixes source-level assertions (it greps the engine file for specific patterns) with behavioural assertions (it constructs a known input, runs the engine, asserts that the output matches what a correct implementation would produce).

In rough order of how much they matter:

1. Pre-built immutable references

Static offers and the flattened orderable-offers table are built once at simulation startup and handed to the tick loop by reference. The tick loop never reconstructs them, never copies them, never mutates them.

The naive version did rebuild these per-tick because the original developer thought of them as derived state. They’re not — they’re invariant for the simulation’s lifetime. Building once at startup, then passing references through SystemState, eliminates hundreds of microseconds per tick.

The invariant: a source-level grep for static_offers = or orderable_offers_flat = inside the inner loop scope fails the test. The construction must happen outside it.
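
A minimal sketch of what such a test can look like. The engine path, the loop marker comment, and the helper are illustrative assumptions, not the project's actual test code:

from pathlib import Path

ENGINE = Path("src/engine.py")  # assumed location

def tick_loop_source() -> str:
    # Crude scope heuristic: everything after an assumed marker comment
    # that opens the tick loop. The real test presumably does something
    # sturdier, but the idea is the same.
    source = ENGINE.read_text()
    return source[source.index("# --- tick loop ---"):]

def test_static_tables_built_outside_tick_loop():
    body = tick_loop_source()
    for pattern in ("static_offers =", "orderable_offers_flat ="):
        assert pattern not in body, (
            f"{pattern!r} found inside the tick loop; these tables are "
            "built once at startup and passed by reference (invariant 1)."
        )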

2. Polars partition_by for demand grouping

Demand for the day comes in as a Polars DataFrame with hundreds of rows. The naive grouping is a Python loop over rows — which means crossing the Polars boundary thousands of times per tick.

partition_by is the vectorised primitive: it returns a list of sub-frames, one per group, in a single Polars operation that stays in the engine’s columnar representation. Replacing the Python loop with partition_by was a 4× win on the demand-grouping phase alone.

The invariant: any Python-level loop over pl.DataFrame.iter_rows() inside the demand phase fails the test. The grouping must be vectorised.
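
For concreteness, the shape of the change on a toy frame (column names are assumptions, not the real schema):

import polars as pl

demand = pl.DataFrame({
    "sku": ["A", "A", "B", "C"],  # illustrative columns
    "qty": [3, 1, 5, 2],
})

# Naive: one boundary crossing per row.
totals: dict[str, int] = {}
for row in demand.iter_rows(named=True):
    totals[row["sku"]] = totals.get(row["sku"], 0) + row["qty"]

# Vectorised: one Polars call, one sub-frame per sku, with the
# grouping done entirely inside the columnar engine.
totals_vec = {
    part["sku"][0]: part["qty"].sum()
    for part in demand.partition_by("sku")
}
assert totals_vec == totals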

3. Bisect-sorted pending orders for O(log n) arrivals

Pending orders are a list. Each tick we ask: which ones arrive today? The naive answer is a linear scan: iterate every pending order, check its arrival date against today.

With hundreds of orders over a multi-year backtest, that scan dominates. The fix is to keep the list sorted by arrival date and use bisect.bisect_right to find the split point — O(log n) instead of O(n). Insertions use bisect.insort to maintain the order.

The invariant: the engine must import bisect and the add_pending_order function must call bisect.insort. A grep for pending_orders.append( outside the construction phase fails the test.
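
A sketch of the sorted-list mechanics. PendingOrderSketch is a stand-in for the real domain object, and the key= form of bisect requires Python 3.10+:

import bisect
from dataclasses import dataclass
from datetime import date

@dataclass
class PendingOrderSketch:
    arrival: date
    qty: int

pending_orders: list[PendingOrderSketch] = []

def add_pending_order(order: PendingOrderSketch) -> None:
    # insort keeps the list permanently sorted by arrival date; this is
    # the line the grep insists on in place of .append().
    bisect.insort(pending_orders, order, key=lambda o: o.arrival)

def pop_arrivals(today: date) -> list[PendingOrderSketch]:
    # O(log n) split point: everything before it arrives today or earlier.
    split = bisect.bisect_right(pending_orders, today, key=lambda o: o.arrival)
    arrived = pending_orders[:split]
    del pending_orders[:split]
    return arrived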

4. Incremental inv_value_dkk

Inventory value (in DKK) is a summary statistic the engine reports per day. The naive implementation re-sums it from scratch every tick. With thousands of inventory positions, that’s another inner loop that doesn’t need to exist.

Instead, inv_value_dkk is updated incrementally: on arrival, add the value of the new items; on sale, subtract the value of the sold items. The state on day N is computed from the state on day N-1 and the deltas, never from a full re-summation.

The invariant: a behavioural test runs the simulator on a small synthetic dataset and asserts that inv_value_dkk matches the value computed by an independent reference function. The behavioural check alone would not catch someone “fixing” the code by reintroducing a per-tick re-sum (the value would still be right), so the test also greps the engine file to assert the re-summing call never appears in the hot path.
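
The incremental shape, next to the reference re-sum the behavioural test compares against (names are illustrative):

def reference_inv_value(positions) -> float:
    # Slow, obviously correct re-sum. It lives only in the test suite;
    # the grep half of the invariant keeps it out of the engine.
    return sum(p["on_hand"] * p["unit_cost_dkk"] for p in positions)

class StateSketch:
    def __init__(self) -> None:
        self.inv_value_dkk = 0.0

    def on_arrival(self, qty: int, unit_cost_dkk: float) -> None:
        self.inv_value_dkk += qty * unit_cost_dkk  # delta in, no re-sum

    def on_sale(self, qty: int, unit_cost_dkk: float) -> None:
        self.inv_value_dkk -= qty * unit_cost_dkk  # delta out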

5. model_construct on hot-path domain objects

The simulator uses Pydantic for its domain model: InventoryPosition, SystemState, PurchaseOrder, PendingOrder. Pydantic’s normal constructor runs validation on every field. That validation is exactly what we want at the simulation boundaries (when loading from disk, when receiving HTTP requests). It’s exactly what we don’t want inside the tick loop, where every object is constructed from data that’s already been validated upstream.

model_construct is Pydantic’s escape hatch: build the object without running any validation, trusting that the caller knows what they’re doing. It’s 5–10× faster than the validating constructor for these specific objects.

The invariant: a source-level grep for InventoryPosition(, SystemState(, PurchaseOrder(, PendingOrder( inside the tick loop fails the test. The hot path must use *.model_construct(...).

6. Holding cost paid daily, not aggregated

Holding cost is inv_value_dkk * holding_rate / 365. The naive shape is to compute it as a single annualised number at the end of the simulation. The correct shape is to pay it daily, deducting from cash each tick.

Why? Because the strategy decisions on day N depend on cash-on-hand at day N. If holding cost is bookkept only at the end, strategies that should have run out of cash mid-backtest appear viable. The simulation is wrong.

This isn’t quite an optimisation invariant — it’s a correctness invariant — but it’s in the same source-level test because the failure mode is identical: a future refactor that “simplifies” by aggregating would silently break results without breaking any test.

The invariant: the daily settlement function must include the holding-cost deduction, and the end-of-simulation reporting must not contain it.
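
The daily settlement, roughly (field names are assumptions):

def settle_day(state) -> None:
    # Annual holding_rate paid out pro rata each tick, so cash-on-hand
    # on day N already reflects every day of holding cost before it.
    daily_holding_cost = state.inv_value_dkk * state.holding_rate / 365.0
    state.cash_dkk -= daily_holding_cost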

7. Censored demand from real production data

The demand model is not synthetic — it’s derived from page_views_daily.parquet and conversion_rates.parquet, both produced from real production logs. Censoring (capping demand when inventory is insufficient) happens at simulation time, not in the input data.

The invariant: the engine constructor must read both parquet files. The test asserts that removing either of them causes the engine to refuse to initialise.
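
The censoring itself is a one-liner at serve time; a hedged sketch with illustrative names:

def serve_demand(on_hand: int, demand_today: int) -> tuple[int, int]:
    # Cap (censor) demand against available inventory at simulation
    # time; the parquet inputs stay uncensored.
    served = min(demand_today, on_hand)
    lost = demand_today - served
    return served, lost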

8. SystemState passed by reference, never copied

SystemState is a Pydantic model containing every running statistic the simulator tracks. Naive code copies it per tick (for serialisation, for “snapshot” purposes, for fear of mutability). The hot loop must hold a single instance and mutate it in place.

The invariant: a grep for system_state.model_copy( or copy.deepcopy(system_state inside the tick loop fails the test.
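
A behavioural companion to that grep could assert object identity across ticks (the engine API shown is an assumption):

def test_system_state_mutated_in_place(engine):
    before = engine.system_state
    engine.run_ticks(100)                   # assumed API
    assert engine.system_state is before    # same instance, no copies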

9. Pending orders are bisect-inserted, not list-appended

Closely related to invariant 3 but worth its own check: the test specifically verifies that the engine never does pending_orders.append(new_order). That single line, in a single place, was responsible for an O(n²) regression in an earlier version. The grep is the single-line firewall.

Why source-level tests, not just benchmarks?

The argument against source-level grep tests is that they’re brittle. Rename a function, break a test. Format the code differently, break a test. Pin the code to a particular shape rather than to its semantics.

I think this argument is right in general and wrong here, for one specific reason: the failure mode I’m defending against is “someone reverts the optimisation in a refactor that looks correct.” Benchmarks would catch this, but benchmarks have to be run, have to be compared to a baseline, have to have a regression threshold tuned — and in practice nobody runs benchmarks in a code review. The source-level grep tests run on every commit.

The brittleness has been worth it. When the test fails, it fails at the moment of the change that introduced the regression, which means the engineer making the change is the one who has to understand why the rule exists. That’s how the rule survives. Nobody is going to look at a 20% perf regression a month later and remember why partition_by was there.

What goes in the test, what goes in a comment

A useful framing: the test file is for invariants that a careful refactor would silently break. The comment block above each invariant in the engine is for context — what the optimisation does, why it matters, what the alternative looks like.

# Invariant 5: model_construct on hot-path domain objects.
# Pydantic's validating constructor is 5–10× slower than model_construct for
# these specific models, measured on the standard benchmark fixture. The
# inputs to these constructors are already validated at simulation boundaries
# (CSV/parquet loaders, HTTP request parsing); revalidating per tick is a tax
# that pays for nothing.
#
# Locked by tests/engine/test_invariants.py::test_no_validating_constructor.
position = InventoryPosition.model_construct(
    sku=row["sku"],
    on_hand=row["on_hand"],
    on_order=row["on_order"],
)

The comment names the test that locks the rule. Anyone wondering “why is this written weird” gets pointed at the test, which explains the what; the code comment explains the why. Future-me reading either one has enough context to either keep the rule or challenge it deliberately.

When invariants outlive their reason

This pattern has a failure mode. An invariant whose underlying reason has gone away — say, because Pydantic releases a version where __init__ is as fast as model_construct — is just a pinned-shape rule with no payoff. It’s overhead.

The discipline that goes with the pattern is to periodically re-justify each invariant. Once a quarter, re-benchmark the alternative. If the gap has closed, retire the invariant. The test file’s docstring lists the date each rule was last validated. If a rule is older than the most recent Pydantic release, that’s the signal to retest.
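
The docstring convention, sketched (the dates are placeholders, not the project's real validation log):

"""Source-level invariants for the engine hot loop.

Each rule records when its naive alternative was last re-benchmarked;
a rule older than the latest Pydantic or Polars release is due a retest.

    1. pre-built immutable references .... last validated YYYY-MM-DD
    2. partition_by demand grouping ...... last validated YYYY-MM-DD
    5. model_construct on hot path ....... last validated YYYY-MM-DD
       (retire if a future Pydantic closes the constructor gap)
"""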

I haven’t had to retire one yet. But the rule for retirement is part of the pattern, because pinned-shape tests with no rationale are how codebases ossify into something nobody wants to touch.

Takeaway

If you have a hot loop whose performance matters, and a team that’s going to keep editing the code around it, the two-line version of this post is:

Lock the optimisations in source-level tests. Pair each test with a comment explaining the rule and naming the test. Re-justify each invariant on a cadence.

It’s a small amount of work upfront, and it’s the only thing I’ve seen reliably stop “someone refactored it back” — which is the real failure mode you defend against once your code outlives your full attention.