Anonymised Danish specialty retailer
Inventory simulation arena
Polars-based simulation engine where seven candidate inventory strategies compete on years of real demand data. Nine source-level invariants guard the hot loop from accidental refactor regressions: the optimisation contract is written into the test suite.
7 · Competing strategies
9 · Hot-loop invariants locked
5–10× · model_construct vs validated Pydantic construction
Byte-equal · Determinism under the same seed
What it is
A specialty Danish retailer with a long catalogue of slow-moving items, hand-picked suppliers, and a stockroom that's both the backbone of the business and the place where most of the working capital sits. The procurement question they kept asking was the one every catalogue-heavy retailer eventually asks: are we ordering the right things in the right quantities?
Their existing answer was experience and gut feel, which had built a working business but had no way of saying whether a particular policy was costing or saving money. They wanted to know what a different procurement approach would have done against the same actual customer demand. I built a simulator that answers exactly that question.
The arena
Seven candidate procurement strategies compete head-to-head against years of real page-view and conversion data drawn from the production logs. Each tick simulates one day: incoming demand (censored where stock would have run out), arriving purchase orders, sales, holding cost, replenishment decisions. The engine ticks through years of history a day at a time, and the strategies are scored on a combination of working capital tied up, stock-out frequency, and gross margin delivered.
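To make the tick's shape concrete, here is a minimal runnable sketch under assumed names (SimState, run_tick, the tuple layout of pending orders); the production engine's types and API differ.

```python
"""Minimal sketch of the one-day tick. All names are illustrative,
not the engine's actual API."""
from dataclasses import dataclass, field


@dataclass
class SimState:
    stock: dict[str, int]                  # units on hand per SKU
    unit_cost: dict[str, float]            # cost per unit per SKU
    inventory_value: float = 0.0           # maintained incrementally
    pending: list[tuple[int, str, int]] = field(default_factory=list)
    # (arrival_day, sku, qty), kept sorted by arrival_day


def run_tick(state: SimState, demand: dict[str, int], day: int,
             daily_holding_rate: float = 0.0005) -> float:
    """Advance one simulated day; return the holding cost charged."""
    # 1. Receive purchase orders arriving today; the list is sorted
    #    by arrival day, so today's arrivals are a prefix.
    while state.pending and state.pending[0][0] <= day:
        _, sku, qty = state.pending.pop(0)
        state.stock[sku] = state.stock.get(sku, 0) + qty
        state.inventory_value += qty * state.unit_cost.get(sku, 0.0)

    # 2. Fulfil demand, censored by what is actually on the shelf.
    for sku, wanted in demand.items():
        sold = min(wanted, state.stock.get(sku, 0))
        state.stock[sku] = state.stock.get(sku, 0) - sold
        state.inventory_value -= sold * state.unit_cost.get(sku, 0.0)

    # 3. Charge holding cost on the capital still tied up in stock.
    #    (The strategy's replenishment decision would be queued into
    #    state.pending here; see the bisect sketch further down.)
    return state.inventory_value * daily_holding_rate
```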
Strategies are not graded on revenue alone. The point is to find the policy with the best balance: a strategy that maximises sales by holding twice the inventory is not better than one that delivers 95% of those sales with half the capital. The reporting surface makes that trade-off legible.
Hot-loop architecture
The tick loop runs millions of times across multi-year backtests. Every line in that loop is a tax paid per tick, per strategy, per backtest. The interesting engineering work was making it cheap enough to run sweeps of parameter space and not just single comparisons.
- Pre-built immutable references. Static offers and the flattened orderable-offers table are built once at simulation startup and passed by reference into SystemState. The tick loop never reconstructs them: zero per-tick allocation for these structures.
- Polars partition_by for demand grouping. Vectorised grouping replaces what would otherwise be a Python loop over thousands of demand rows per tick (first sketch after this list).
- Bisect-sorted pending orders. Order arrivals are an O(log n) bisect insert/lookup against a sorted list, not a linear scan (second sketch after this list).
- Incremental inventory valuation. Inventory value is updated on each arrival and sale, never re-summed from scratch.
- Pydantic model_construct. Hot-path domain objects skip Pydantic validation, which delivers a 5–10× construction speed-up in the inner loop; the inputs were already validated at the simulation boundary (third sketch after this list).
- Censored demand from real data. Daily page views and conversion rates from production logs feed the demand model, not a synthetic distribution. The simulator answers what the policy would have done against the demand they actually saw.
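The partition_by point is easiest to see in miniature. A generic Polars sketch with assumed column names (sku, units), not the engine's schema:

```python
import polars as pl

# One simulated day's demand rows; column names are illustrative.
demand = pl.DataFrame({
    "sku":   ["A", "A", "B", "C", "B"],
    "units": [2, 1, 4, 1, 3],
})

# One vectorised call yields a sub-frame per SKU in native code,
# replacing a Python-level loop over thousands of rows.
for group in demand.partition_by("sku", maintain_order=True):
    sku = group["sku"][0]
    total_units = group["units"].sum()  # demand for this SKU today
```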
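The sorted pending-order list is plain standard library. A sketch assuming an (arrival_day, sku, qty) tuple layout:

```python
import bisect

# Tuples compare lexicographically, so keying on arrival_day first
# keeps the list sorted by day without a custom key.
pending: list[tuple[int, str, int]] = []

bisect.insort(pending, (17, "SKU-123", 40))  # O(log n) position lookup
bisect.insort(pending, (12, "SKU-987", 10))

# Today's arrivals are a prefix of the list; no linear scan over
# every outstanding order is needed.
day = 12
while pending and pending[0][0] <= day:
    arrival_day, sku, qty = pending.pop(0)
```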
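And the model_construct trade-off is standard Pydantic v2; the model and field names below are illustrative:

```python
from pydantic import BaseModel


class InventoryPosition(BaseModel):
    sku: str
    on_hand: int
    on_order: int


# Boundary: validate once, on the way into the simulation.
pos = InventoryPosition(sku="SKU-123", on_hand=5, on_order=40)

# Hot loop: inputs are already trusted, so skip validation entirely.
# model_construct bypasses validators and coercion, which is where
# the quoted 5-10x construction speed-up comes from.
fast = InventoryPosition.model_construct(sku="SKU-123", on_hand=4, on_order=40)
```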
Why the invariants are tested
Each design choice above is locked in by tests/engine/test_invariants.py, which mixes source-level grep assertions ("the engine must call model_construct and not InventoryPosition(...)") with behavioural assertions on simulation output. If any of the nine invariants regresses, the test suite stops; the rule is to investigate rather than to fix the test.
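A sketch of what one such source-level assertion can look like; the helper below is a stand-in, since only the test file's path and the model_construct rule are given above:

```python
def _engine_source() -> str:
    # Stand-in for reading the real hot-loop module from disk, e.g.
    # Path("src/engine/tick.py").read_text() (path is hypothetical).
    return "pos = InventoryPosition.model_construct(sku=sku, on_hand=qty)"


def test_hot_path_skips_validation():
    src = _engine_source()
    assert "model_construct(" in src        # fast constructor is present
    assert "InventoryPosition(" not in src  # validating ctor is absent
    # (The model class itself is assumed to live in another module,
    # so its definition does not trip the second assertion.)
```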
This pattern exists because optimisations decay: they land, someone refactors them later in good faith, the refactor reads cleaner, and the original win is gone. Source-level tests catch the shape of a regression the moment it's introduced, while the engineer making the change is still the one who has to understand why the rule exists.
Determinism as a feature
The simulator runs deterministically under a fixed seed. A standalone test runs a two-week backtest twice and asserts byte-for-byte identical output. This is the property that makes the arena meaningful: strategies can only be compared head-to-head if the only thing varying between runs is the strategy.
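The check itself is short. A self-contained stand-in is sketched below; run_backtest here is a toy replacement for the real entry point, which drives the actual engine and compares its serialised output:

```python
import json
import random


def run_backtest(days: int, seed: int) -> bytes:
    """Toy stand-in for the real entry point: deterministic given the seed."""
    rng = random.Random(seed)
    ticks = [rng.randint(0, 100) for _ in range(days)]
    return json.dumps(ticks).encode()


def test_backtest_is_byte_identical_under_fixed_seed():
    # Two identical runs must produce byte-for-byte identical output.
    assert run_backtest(days=14, seed=42) == run_backtest(days=14, seed=42)
```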
Surrounding stack
FastAPI + Polars on the backend, Pixi for environment management. The dashboard is a pure React + Vite + Bun SPA with Tailwind 4 (no Astro, because the page is 100% interactive: live backtest comparisons, parameter sweeps, animation of inventory state across simulated time). Astro's static-shell model would add layers without benefit in that specific app.
Want the full enumeration (what each invariant does, what optimisation it locks, why source-level grep tests instead of benchmarks)? See the deep-dive →
If your business has this shape
Catalogue-driven retailers with a long tail of slow-moving SKUs (specialty books, audio gear, replacement parts, niche food and drink) all hit a version of this problem. The model is portable: the unique inputs are demand data and procurement constraints, not the kind of product. If you're sitting on years of order history and you've never been able to ask "what would have happened if we'd done this differently," the arena pattern is a small project that pays for itself the first time the numbers say something you didn't expect.
Related work
Backtesting engine and scanner
Thoth
Hot-loop discipline at the small scale: a 13-strategy backtest of the US equities universe finishes in seconds. Pure Polars expressions, threaded bulk runner, regime-gated strategies. The kind of code-level performance work I bring to bigger systems.
Optimization · Fullstack · Data Engineer
On the optimisation arc
Provstiskyen: performance work on a 10-year SaaS
Profiled and fixed the cold-start path on a 44,000-line R Shiny production app: 50-second logins cut to 18 seconds, and 35-minute deploys cut to 80 seconds, all on the existing codebase. The full rewrite that came later was made possible by a year of targeted optimisation work first.
Optimization · Fullstack · DevOps