Anonymised, long-catalogue specialty e-commerce
Inventory decision engine
Replacing a legacy 121K-line per-SKU integer-programming procurement system, whose actual demand forecaster was this-year-over-last-year, with a two-stage decision engine: a LightGBM quantile demand forecaster feeding a HiGHS LP capital allocator. A four-way ablation cleanly attributes wins between the forecaster and the allocator, on a simulation engine that runs ~30× faster than the Python-idiomatic baseline and is locked by nine source-level invariants.
27%
Lower tail-forecast error vs baseline
+21%
Canonical vs baseline backtest profit
30×
Engine speedup, full backtest
9
Source-level hot-loop invariants
The business problem
A long-catalogue specialty e-commerce business with a stockroom that's both the backbone of the operation and the place where most of the working capital sits. Revenue has been growing steadily for years. The publicly filed annual reports tell a less reassuring story underneath:
- Inventory grew roughly 4.3× over five fiscal years.
- Gross profit grew roughly 3× over the same window.
- Net profit collapsed by an order of magnitude, on bank debt at ~9.6%.
Inventory growing faster than gross profit, financed by bank debt, eats profitability. The procurement system, which decides what to order, from which vendor, in what quantities, at what cadence, is the central lever. It hadn't kept up.
The legacy system being replaced
The existing procurement codebase was a long-lived binary integer program run per stock-keeping unit (SKU): roughly 121K lines of code, ~2,700 commits, 50+ cost components in the profit matrix, 250+ configuration parameters. Substantial machinery. The hidden punchline is that under all of that, the demand forecaster driving the optimisation is this year's sales divided by last year's sales. Linear extrapolation with guard rails. ARIMA (autoregressive integrated moving average) and Prophet were implemented at one point and never used in production. The system optimises each SKU independently with no portfolio-level capital allocation: every line item competes against itself, never against the others.
That's the system the replacement is benchmarked against. The point isn't that the legacy code is bad; it's that you cannot fix the demand-forecasting layer by adding more cost components to the cost matrix.
The replacement: two-stage decision engine
A deliberate decoupling. One module predicts demand as a distribution. A second module decides what to buy given those predictions, today's capital, lead times, and holding costs. A better forecaster automatically improves the allocation; a better allocator automatically benefits from the forecasts.
- Forecaster (codename MINERVA). Per (SKU, day) over the next 60
days, predicts a daily-demand distribution as three quantiles
q10/q50/q90. LightGBM gradient-boosted decision trees, one booster per quantile. - Allocator (codename HEPHAESTUS). A linear program (LP) solved with HiGHS that turns the quantile forecasts plus the current capital, stock, and vendor costs into integer order quantities. Globally optimal under a shared capital constraint, not locally optimal per SKU.
MINERVA: quantile demand forecasting
Point estimates are useless for procurement: "expected demand 3 units" tells you
nothing about whether stocking 3 is safe or whether the right number is 8. The
allocator needs the shape of the distribution to reason about stockout
risk. Three quantile boosters cover this cheaply: LightGBM's objective="quantile"
with alpha set to 0.1, 0.5, 0.9. Three models, ~3 seconds of training each.
The single sharpest engineering choice in MINERVA is in the training data, not the
model. Raw daily sales counts actual sales, not actual demand: when
a SKU was out of stock, the demand was suppressed, and training on those rows
teaches the model that stockouts cause low demand, which is the opposite of useful.
The catalog data already carries a per-(SKU, day) orderable flag from
the vendor side. So the training grid uses only orderable (SKU, day) cells; censored
rows are discarded. The model learns
E[units sold | orderable=True, features], which is exactly the
conditional the allocator needs. No imputation, no synthetic demand layer.
24 features in four families: lagged demand and rolling sums and exponentially weighted moving averages (EWMAs), the dominant signal; page-view counts and conversion rates (the leading indicator, since people browse before they buy); availability and vendor side (orderable share, number of vendors with stock, minimum vendor price); calendar and catalog (day-of-week, day-of-month, days-to-Christmas, release year, age, genres, formats). SKU identity is a categorical feature; LightGBM target-encodes natively without one-hot explosion.
Validation is 12-fold walk-forward. Train on [t0, T), test on a 60-day
slab immediately after, slide forward by 90 days. Random k-fold leaks the future
into the past and is forbidden on time-series data. Aggregate metrics over the
12-fold window:
- Pinball-90 (tail-forecast error): ~27% lower than the zero baseline. This is the metric that matters most. For procurement, the cost of mis-sizing buffer stock lives in the right tail of demand, not at the median.
- Coverage-80: 0.980 against a 0.80 target. The
[q10, q90]interval is well-calibrated. - Pinball-50: marginally below zero baseline. Demand at the (SKU, day) granularity is ~97% zeros; "always predict zero" is a punishingly tough median baseline, and any lift over it on the tail comes from real signal on the high-demand SKUs.
HEPHAESTUS: capital allocation as an LP
A greedy allocator can rank SKUs by expected profit-per-unit-cost and fill top-down. That's easy to reason about, but it can't trade off locally: "buy 5 more units of A or 2 units of B?" is a global question because the capital constraint is shared across every SKU on the same day. Linear programming solves the global trade-off exactly. The secondary reason: minimum order quantities and multi-vendor selection are coming, and at that point the problem becomes a genuine mixed-integer program (MIP). LP now → MIP later is a straight-line upgrade; a heuristic would have to be rewritten from scratch.
The formulation in words. Per SKU i: decision variable x_i
(units to order today), auxiliary variable s_i (units expected to sell
over the 60-day horizon), auxiliary variable u_i (unmet demand under
the pessimistic q90 scenario). Maximise revenue from expected sales,
minus order cost, minus holding cost on unsold stock, minus a λ-weighted
stockout penalty. Subject to a shared capital constraint, stock-cap on sales (you
can't sell what you don't have), demand-cap on sales (capped at q50),
and unmet-demand growth under q90.
Built directly through highspy, HiGHS's Python bindings, as a
column-major sparse matrix. The first implementation used PuLP as the modelling
layer; profiling found PuLP was eating ~90% of solve wall time on Python-side
LpVariable / LpAffineExpression construction and the MPS-style
serialisation underneath. Rebuilding the problem in plain Python lists and handing
them to HiGHS in one call is roughly 5× faster at our ~5K-SKU
problem size. LP is continuous; integer order quantities come out via
largest-remainder rounding under the capital cap. Floor everyone, then hand the
leftover budget to the SKUs with the largest fractional remainder, until budget
is exhausted.
The four-way ablation
Where exactly does the win come from: the forecaster, the allocator, or both? The cleanest answer is to run all four combinations head-to-head:
- Baseline: EWMA forecast + greedy allocator (the sanity floor)
- Forecaster only: trained MINERVA + greedy allocator
- Allocator only: EWMA forecast + HEPHAESTUS LP
- Canonical: trained MINERVA + HEPHAESTUS LP
On a 2024-H1 backtest with five noise seeds, the canonical configuration leads by roughly +21% over the EWMA-plus-greedy baseline. The forecaster-only variant captures most of that lift on its own; the allocator-only variant underperforms the baseline when paired with a weak EWMA forecast. LP needs sharp inputs to beat greedy, because greedy's structural simplicity hides forecast noise that LP propagates. That's not a bug; it's the answer to "where the win comes from" being made unambiguous by the ablation. The forecaster does most of the lifting; the LP adds a small but consistent margin conditional on the forecast being good.
The simulation arena underneath
Both the legacy system and the replacement are evaluated against the same engine: a
discrete-event simulator that replays years of real history day-by-day. Each tick
applies vendor-offer deltas, processes arrivals, sells to real historical demand,
injects synthetic demand for out-of-stock SKUs via page-views × conversion-rates,
builds the algorithm's view of the world, calls algorithm.decide(state),
executes the resulting orders, deducts holding cost, snapshots key performance indicators (KPIs).
Censored-demand injection is the subtle piece. Raw demand parquet only records sales. If we didn't stock something, we don't see the customers who wanted it. Daily page-view counts joined to per-SKU conversion rates synthesise an estimated loss; that loss feeds the velocity signal seen by the next tick's algorithm decision, but doesn't move revenue. Without it, an algorithm that stocks nothing looks like a market where nobody wants the product.
Time-varying vendor prices come in from a separate ~370 MB parquet exported from
production MongoDB. Vendors change prices and stock status every day; the engine
pre-builds a per-date delta list at startup (~20K deltas/day) and applies only that
day's diffs into cheapest_offer_by_item, an incremental
O(log n) maintenance that the inner loop reads as a dict.
Hot loop performance: 12–22× per decision, ~30× over a full backtest
Roughly 170K SKUs × 2,000 days = 340M events for a 5+ year backtest. Per-phase wall-clock instrumentation is recorded on every run; here's the impact breakdown, in order of leverage:
-
build_statewent 444× (7,101 ms → 16 ms over a 6-month run).InventoryPositionobjects are built once at startup with stable identities. Adirty_positionsset persists across ticks; only positions touched by today's activity are synced, and that sync is direct__dict__writes rather than Pydantic__setattr__.SystemStateitself usesmodel_constructto skip Pydantic validation on the hot path. The inputs originate from typed internal code, so the validation contract is upheld at the simulation boundary, not on every tick. -
algorithm_decidewent 12–22× per algorithm. Engine-maintained sparse dicts (on_hand_by_sku,on_order_by_sku,cheapest_offer_by_item) replace per-tick Pydantic attribute-access scans over 53K positions. Deque-based rolling velocity replaces anO(items_sold_ever × window)list-filter. -
offers_deltawent 7×. Pre-resolved(vendor_id, item_id) → OfferSummaryreferences, differential emission (fields that didn't change stored asNoneso the hot loop skips the!=check), incremental cheapest-offer maintenance. Most deltas resolve via a singlenew_price < current.retailcomparison. -
Bisect-sorted pending orders. Arrivals split in
O(log n)viabisect.bisect_rightagainst a list kept sorted by arrival date.bisect.insortmaintains the order on insert. -
Incremental
inv_value_dkk: updated on arrival and on sale, never re-summed across the full inventory. -
Polars
partition_byfor the per-day demand grouping. Compiled Rust, single-instruction, multiple-data (SIMD) on AVX-512 targets, returns plain Python lists forO(1)tuple unpack in the inner loop.
The target hardware is an AMD 9950X3D with a 144 MB L3 V-Cache. The entire engine working set (on-hand dict, demand lists per day, offer summaries, per-algorithm velocity caches) fits inside the L3 cache, so dict lookups are L3-latency rather than memory-latency. Polars runs compiled Rust with AVX-512 for the columnar pre-processing. No NumPy; Polars outperforms it on this workload.
Locking the wins with source-level invariants
Optimisations decay. Someone refactors a year later in good faith, the refactor
reads cleaner, the tick loop is now half as fast, and nobody notices for a month.
The nine highest-leverage invariants above are locked into a dedicated test module
that mixes behavioural assertions (run the engine, check the output) with
source-level grep assertions (model_construct appears here,
bisect.insort is called, the construction line for
orderable_offers_flat sits outside the loop). The rule for a failing
invariant is to investigate, not to fix the test.
What each of the nine invariants does, and why source-level grep beats benchmark regression for catching refactor decay, is covered in the hot-loop invariants writeup →
Why excluding censored cells from MINERVA's training grid gives you a forecaster whose output is exactly the conditional the allocator needs is covered in the quantile forecaster deep-dive →
Parallel sweeps and hyperparameter search
The platform supports batched multi-run execution with a forkserver pool so each
worker starts from a primed shared state (trained MINERVA model, parquet caches,
per-date offer deltas). Shared state ships through three hooks on the algorithm
(warm_shared_state, get_shared_state_snapshot,
restore_shared_state), so workers don't each retrain or reload from
disk. A fork-after-OpenMP deadlock with LightGBM was debugged out of the way to
get this working reliably on Linux. Each worker peaks at ~10 GB resident;
concurrency on a 32 GB workstation is capped at three, instrumented in
BatchSummary.peak_worker_rss_mb.
On top of that runs Optuna for hyperparameter tuning: each trial wraps a
start_batch call, persists into SQLite, and is resumable across
sessions. The search space is scaffolded and ready; the search itself waits on a
single piece of stakeholder input (the exact stockout penalty λ the
business wants to optimise against). That's the same constraint that gates the
full multi-year calibration run. The engine is ready; the question is what
objective the business actually wants to maximise.
Determinism as a feature
The simulator runs deterministically under a fixed seed. A standalone test runs a two-week backtest twice and asserts byte-for-byte identical output. This is the property that makes the arena meaningful: strategies can only be compared head-to-head if the only thing varying between runs is the strategy. The seed is also the entry point for stochastic robustness runs: same strategy, multiple noise seeds on the vendor lead-time jitter, to measure variance under realistic execution uncertainty.
Surrounding stack
FastAPI + Polars on the backend, Pixi for environment management, MariaDB as the
source of truth, MongoDB exports for vendor-side history. The dashboard is a pure
React + Vite + Bun single-page application (SPA) on TanStack Router (file-based routing) with TanStack Query
for cache-aware reads. LightGBM for the forecaster, HiGHS (via direct
highspy bindings) for the allocator. No NumPy anywhere.
If your business has this shape
Catalogue-driven retailers with a long tail of slow-moving SKUs (specialty books, audio gear, replacement parts, niche food and drink) all hit a version of this problem. The architecture is portable: the unique inputs are demand data and procurement constraints, not the kind of product. Two-stage decoupled forecasting-then-allocation lets you replace either side independently as better models or better solvers appear. The ablation pattern makes "is the new forecaster worth it?" answerable with numbers rather than vibes. If you're sitting on years of order history and you've never been able to ask "what would have happened if we'd done this differently," the arena is a small project that pays for itself the first time the numbers say something you didn't expect.
Related work
Equities backtester + risk budget
Thoth
An equities backtester built with the statistical honesty of a real-money system: selection-bias-aware deflated Sharpe with explicit trial count, stationary block-bootstrap confidence intervals, predicted-vs-realized calibration against a live trade journal, correlation-adjusted quarter-Kelly sizing with heat / sector / currency caps. The engine work (pure Polars strategies, vectorised regime detection with hysteresis, threaded scanner) is what makes the trust layer cheap enough to actually run every morning.
Optimization Fullstack DataTwo acts on a parish-admin platform
Provstiskyen: optimising then rewriting a 10-year SaaS
Two acts on a 44,000-line R Shiny platform that runs about half of Denmark's deaneries. Act I cut cold start from 50s to 18s and deploys from 35min to 80s on the existing codebase. Act II, once the architecture itself was the ceiling, is a full rewrite onto FastAPI, Polars, and React: performant by default, far more maintainable, with the legacy app retiring as the last module ports across.
DevOps Optimization Fullstack