Inventory simulation arena

What it is

This project was built for a specialty Danish retailer with a long catalogue of slow-moving items, hand-picked suppliers, and a stockroom that is both the backbone of the business and the place where most of the working capital sits. The procurement question they kept asking was the one every catalogue-heavy retailer eventually asks: are we ordering the right things in the right quantities?

Their existing answer was experience and gut feel, which had built a working business but had no way of saying whether a particular policy was costing or saving money. They wanted to know what a different procurement approach would have done against the same actual customer demand. I built a simulator that answers exactly that question.

The arena

Seven candidate procurement strategies compete head-to-head against years of real page-view and conversion data drawn from the production logs. Each tick simulates one day: incoming demand (censored when stock would have run out), arriving purchase orders, sales, holding cost, and replenishment decisions. The engine replays years of history this way, one day per tick, and each strategy is scored on a combination of working capital tied up, stock-out frequency, and gross margin delivered.
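As a rough illustration of the tick order described above, here is a minimal single-SKU sketch. Everything in it is an assumption made for illustration: the State class, the tick signature, and the reorder-point policy are hypothetical stand-ins, not the engine's actual types or one of the seven strategies. It also takes one reading of "censoring", namely that sales are capped at stock on hand and the remainder is counted as lost demand.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical single-SKU state; not the engine's real SystemState."""
    on_hand: int
    unit_cost: float
    pending: list[tuple[int, int]] = field(default_factory=list)  # (arrival_day, qty)
    units_sold: int = 0
    lost_units: int = 0
    holding_cost: float = 0.0


def tick(state: State, day: int, demand: int, *, daily_holding_rate: float,
         reorder_point: int, order_qty: int, lead_time: int) -> None:
    # 1. Receive purchase orders due today. A linear scan keeps the sketch
    #    short; the hot-loop section describes a sorted-list approach instead.
    state.on_hand += sum(qty for d, qty in state.pending if d <= day)
    state.pending = [(d, qty) for d, qty in state.pending if d > day]

    # 2. Censor demand: sales cannot exceed what is on the shelf.
    sold = min(demand, state.on_hand)
    state.on_hand -= sold
    state.units_sold += sold
    state.lost_units += demand - sold

    # 3. Charge holding cost on the stock left at the end of the day.
    state.holding_cost += state.on_hand * state.unit_cost * daily_holding_rate

    # 4. Replenishment decision (simple reorder-point rule as a placeholder).
    on_order = sum(qty for _, qty in state.pending)
    if state.on_hand + on_order <= reorder_point:
        state.pending.append((day + lead_time, order_qty))


state = State(on_hand=3, unit_cost=10.0)
for day, demand in enumerate([2, 2, 0, 1]):
    tick(state, day, demand, daily_holding_rate=0.001,
         reorder_point=1, order_qty=5, lead_time=2)

print(state.on_hand, state.units_sold, state.lost_units)
```

In this toy run the second day's demand exceeds the remaining stock, so one unit is recorded as lost rather than sold; that lost-sales count is the kind of signal a stock-out metric would be built from.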

Strategies are not graded on revenue alone. The point is to find the policy with the best balance: a strategy that maximises sales by holding twice the inventory is not better than one that delivers 95% of those sales with half the capital. The reporting surface makes that trade-off legible.

Hot-loop architecture

The tick loop runs millions of times across multi-year backtests. Every line in that loop is a tax paid per tick, per strategy, per backtest. The interesting engineering work was making it cheap enough to run sweeps of parameter space and not just single comparisons.

Pre-built immutable references. Static offers and the flattened orderable-offers table are built once at simulation startup and passed by reference into SystemState. The tick loop never reconstructs them. Zero per-tick allocation for these structures.
Polars partition_by for demand grouping. Vectorised grouping replaces what would otherwise be a Python loop over thousands of demand rows per tick.
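Bisect-sorted pending orders. Pending orders live in a list kept sorted by arrival date, so finding the orders due on a given day is an O(log n) bisect lookup, not a linear scan. (Inserting into a Python list still shifts the elements after the insertion point, but locating that point is logarithmic.)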
Incremental inventory valuation. Inventory value is updated on each arrival and sale, never re-summed from scratch.
Pydantic model_construct. Hot-path domain objects skip Pydantic validation, which delivers a 5–10× construction speed-up in the inner loop. The inputs were already validated at the simulation boundary.
Censored demand from real data. Daily page views and conversion rates from production logs feed the demand model, rather than a synthetic distribution. The simulator answers what the policy would have done against the demand they actually saw.

Why the invariants are tested

Each design choice above is locked in by tests/engine/test_invariants.py, which mixes source-level grep assertions ("the engine must call model_construct and not InventoryPosition(...)") with behavioural assertions on simulation output. If any of the nine invariants regresses, the test suite fails; the rule is to investigate the regression rather than to change the test.

This pattern exists because optimisations decay. Optimisations land, someone refactors them later in good faith, the refactor reads cleaner, and the original win is gone. Source-level tests catch the shape of a regression the moment it's introduced, when the engineer making the change is the one who has to understand why the rule exists.
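A source-level assertion of the kind quoted above might look roughly like the following. This is a sketch, not the contents of test_invariants.py: the ENGINE_SOURCE string is a hypothetical stand-in for the engine module's text (a real test would read the file from disk), and the helper name and regex are assumptions.

```python
import re

# Hypothetical stand-in for the engine module's source text.
ENGINE_SOURCE = '''
def apply_arrival(position, qty):
    return InventoryPosition.model_construct(
        sku=position.sku, on_hand=position.on_hand + qty
    )
'''

# Matches a direct, validating call such as `InventoryPosition(`.
# `InventoryPosition.model_construct(` does not match because a `.` follows
# the class name. Caveat: a `class InventoryPosition(...)` definition would
# also match, so this check suits modules that only import the class.
VALIDATING_CALL = re.compile(r"\bInventoryPosition\s*\(")


def check_hot_path_skips_validation(source: str) -> list[str]:
    """Return readable violations; an empty list means the invariant holds."""
    violations = []
    if "model_construct" not in source:
        violations.append("engine no longer calls model_construct")
    for lineno, line in enumerate(source.splitlines(), start=1):
        if VALIDATING_CALL.search(line):
            violations.append(f"line {lineno}: validating InventoryPosition(...) call")
    return violations


def test_hot_path_skips_validation():
    assert check_hot_path_skips_validation(ENGINE_SOURCE) == []


# Simulate a well-meaning refactor that swaps model_construct for the
# ordinary constructor, and confirm the check notices.
regressed = ENGINE_SOURCE.replace(".model_construct(", "(")
print(check_hot_path_skips_validation(regressed))
```

Returning a list of messages rather than a bare boolean means the failure output names the offending line, which helps the person who tripped the test find out why the rule exists.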

Determinism as a feature

The simulator runs deterministically under a fixed seed. A standalone test runs a two-week backtest twice and asserts byte-for-byte identical output. This is the property that makes the arena meaningful: strategies can only be compared head-to-head if the only thing varying between runs is the strategy.
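The shape of such a test can be sketched as follows. The run_backtest function here is a hypothetical toy, not the real engine; the point is the pattern of a locally seeded random generator, a stable serialisation, and a comparison of raw bytes.

```python
import json
import random


def run_backtest(seed: int, days: int = 14) -> bytes:
    """Hypothetical toy backtest standing in for the real engine."""
    rng = random.Random(seed)  # local generator: no shared global state
    on_hand = 20
    rows = []
    for day in range(days):
        demand = rng.randint(0, 5)
        sold = min(demand, on_hand)
        on_hand -= sold
        if on_hand < 5:
            on_hand += 15      # instant replenishment, for brevity only
        rows.append({"day": day, "demand": demand, "sold": sold, "on_hand": on_hand})
    # sort_keys and fixed separators keep the serialised form stable.
    return json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")


def test_backtest_is_deterministic():
    first = run_backtest(seed=42)
    second = run_backtest(seed=42)
    assert first == second     # byte-for-byte identical output


test_backtest_is_deterministic()
```

Comparing serialised bytes rather than in-memory objects also catches nondeterminism that creeps in at the output layer, such as unordered keys or unstable float formatting.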

Surrounding stack

FastAPI + Polars on the backend, with Pixi for environment management. The dashboard is a pure React + Vite + Bun SPA with Tailwind 4. There is no Astro, because the page is 100% interactive (live backtest comparisons, parameter sweeps, animation of inventory state across simulated time), and Astro's static-shell model would add layers without benefit in this app.

Want the full enumeration (what each invariant does, what optimisation it locks, why source-level grep tests instead of benchmarks)? See the deep-dive →

If your business has this shape

Catalogue-driven retailers with a long tail of slow-moving SKUs (specialty books, audio gear, replacement parts, niche food and drink) all hit a version of this problem. The model is portable: the unique inputs are demand data and procurement constraints, not the kind of product. If you're sitting on years of order history and you've never been able to ask "what would have happened if we'd done this differently," the arena pattern is a small project that pays for itself the first time the numbers say something you didn't expect.