Skip to content

Anonymised, long-catalogue specialty e-commerce

Inventory decision engine

Replacing a legacy 121K-line per-SKU integer-programming procurement system, whose actual demand forecaster was this-year-over-last-year, with a two-stage decision engine: a LightGBM quantile demand forecaster feeding a HiGHS LP capital allocator. A four-way ablation cleanly attributes wins between the forecaster and the allocator, on a simulation engine that runs ~30× faster than the Python-idiomatic baseline and is locked by nine source-level invariants.

Optimization Fullstack Data
FastAPI Polars LightGBM HiGHS MariaDB MongoDB React TanStack

27%

Lower tail-forecast error vs baseline

+21%

Canonical vs baseline backtest profit

30×

Engine speedup, full backtest

9

Source-level hot-loop invariants

The business problem

A long-catalogue specialty e-commerce business with a stockroom that's both the backbone of the operation and the place where most of the working capital sits. Revenue has been growing steadily for years. The publicly filed annual reports tell a less reassuring story underneath:

  • Inventory grew roughly 4.3× over five fiscal years.
  • Gross profit grew roughly over the same window.
  • Net profit collapsed by an order of magnitude, on bank debt at ~9.6%.

Inventory growing faster than gross profit, financed by bank debt, eats profitability. The procurement system, which decides what to order, from which vendor, in what quantities, at what cadence, is the central lever. It hadn't kept up.

The legacy system being replaced

The existing procurement codebase was a long-lived binary integer program run per stock-keeping unit (SKU): roughly 121K lines of code, ~2,700 commits, 50+ cost components in the profit matrix, 250+ configuration parameters. Substantial machinery. The hidden punchline is that under all of that, the demand forecaster driving the optimisation is this year's sales divided by last year's sales. Linear extrapolation with guard rails. ARIMA (autoregressive integrated moving average) and Prophet were implemented at one point and never used in production. The system optimises each SKU independently with no portfolio-level capital allocation: every line item competes against itself, never against the others.

That's the system the replacement is benchmarked against. The point isn't that the legacy code is bad; it's that you cannot fix the demand-forecasting layer by adding more cost components to the cost matrix.

The replacement: two-stage decision engine

A deliberate decoupling. One module predicts demand as a distribution. A second module decides what to buy given those predictions, today's capital, lead times, and holding costs. A better forecaster automatically improves the allocation; a better allocator automatically benefits from the forecasts.

  • Forecaster (codename MINERVA). Per (SKU, day) over the next 60 days, predicts a daily-demand distribution as three quantiles q10/q50/q90. LightGBM gradient-boosted decision trees, one booster per quantile.
  • Allocator (codename HEPHAESTUS). A linear program (LP) solved with HiGHS that turns the quantile forecasts plus the current capital, stock, and vendor costs into integer order quantities. Globally optimal under a shared capital constraint, not locally optimal per SKU.

MINERVA: quantile demand forecasting

Point estimates are useless for procurement: "expected demand 3 units" tells you nothing about whether stocking 3 is safe or whether the right number is 8. The allocator needs the shape of the distribution to reason about stockout risk. Three quantile boosters cover this cheaply: LightGBM's objective="quantile" with alpha set to 0.1, 0.5, 0.9. Three models, ~3 seconds of training each.

The single sharpest engineering choice in MINERVA is in the training data, not the model. Raw daily sales counts actual sales, not actual demand: when a SKU was out of stock, the demand was suppressed, and training on those rows teaches the model that stockouts cause low demand, which is the opposite of useful. The catalog data already carries a per-(SKU, day) orderable flag from the vendor side. So the training grid uses only orderable (SKU, day) cells; censored rows are discarded. The model learns E[units sold | orderable=True, features], which is exactly the conditional the allocator needs. No imputation, no synthetic demand layer.

24 features in four families: lagged demand and rolling sums and exponentially weighted moving averages (EWMAs), the dominant signal; page-view counts and conversion rates (the leading indicator, since people browse before they buy); availability and vendor side (orderable share, number of vendors with stock, minimum vendor price); calendar and catalog (day-of-week, day-of-month, days-to-Christmas, release year, age, genres, formats). SKU identity is a categorical feature; LightGBM target-encodes natively without one-hot explosion.

Validation is 12-fold walk-forward. Train on [t0, T), test on a 60-day slab immediately after, slide forward by 90 days. Random k-fold leaks the future into the past and is forbidden on time-series data. Aggregate metrics over the 12-fold window:

  • Pinball-90 (tail-forecast error): ~27% lower than the zero baseline. This is the metric that matters most. For procurement, the cost of mis-sizing buffer stock lives in the right tail of demand, not at the median.
  • Coverage-80: 0.980 against a 0.80 target. The [q10, q90] interval is well-calibrated.
  • Pinball-50: marginally below zero baseline. Demand at the (SKU, day) granularity is ~97% zeros; "always predict zero" is a punishingly tough median baseline, and any lift over it on the tail comes from real signal on the high-demand SKUs.

HEPHAESTUS: capital allocation as an LP

A greedy allocator can rank SKUs by expected profit-per-unit-cost and fill top-down. That's easy to reason about, but it can't trade off locally: "buy 5 more units of A or 2 units of B?" is a global question because the capital constraint is shared across every SKU on the same day. Linear programming solves the global trade-off exactly. The secondary reason: minimum order quantities and multi-vendor selection are coming, and at that point the problem becomes a genuine mixed-integer program (MIP). LP now → MIP later is a straight-line upgrade; a heuristic would have to be rewritten from scratch.

The formulation in words. Per SKU i: decision variable x_i (units to order today), auxiliary variable s_i (units expected to sell over the 60-day horizon), auxiliary variable u_i (unmet demand under the pessimistic q90 scenario). Maximise revenue from expected sales, minus order cost, minus holding cost on unsold stock, minus a λ-weighted stockout penalty. Subject to a shared capital constraint, stock-cap on sales (you can't sell what you don't have), demand-cap on sales (capped at q50), and unmet-demand growth under q90.

Built directly through highspy, HiGHS's Python bindings, as a column-major sparse matrix. The first implementation used PuLP as the modelling layer; profiling found PuLP was eating ~90% of solve wall time on Python-side LpVariable / LpAffineExpression construction and the MPS-style serialisation underneath. Rebuilding the problem in plain Python lists and handing them to HiGHS in one call is roughly 5× faster at our ~5K-SKU problem size. LP is continuous; integer order quantities come out via largest-remainder rounding under the capital cap. Floor everyone, then hand the leftover budget to the SKUs with the largest fractional remainder, until budget is exhausted.

The four-way ablation

Where exactly does the win come from: the forecaster, the allocator, or both? The cleanest answer is to run all four combinations head-to-head:

  • Baseline: EWMA forecast + greedy allocator (the sanity floor)
  • Forecaster only: trained MINERVA + greedy allocator
  • Allocator only: EWMA forecast + HEPHAESTUS LP
  • Canonical: trained MINERVA + HEPHAESTUS LP

On a 2024-H1 backtest with five noise seeds, the canonical configuration leads by roughly +21% over the EWMA-plus-greedy baseline. The forecaster-only variant captures most of that lift on its own; the allocator-only variant underperforms the baseline when paired with a weak EWMA forecast. LP needs sharp inputs to beat greedy, because greedy's structural simplicity hides forecast noise that LP propagates. That's not a bug; it's the answer to "where the win comes from" being made unambiguous by the ablation. The forecaster does most of the lifting; the LP adds a small but consistent margin conditional on the forecast being good.

The simulation arena underneath

Both the legacy system and the replacement are evaluated against the same engine: a discrete-event simulator that replays years of real history day-by-day. Each tick applies vendor-offer deltas, processes arrivals, sells to real historical demand, injects synthetic demand for out-of-stock SKUs via page-views × conversion-rates, builds the algorithm's view of the world, calls algorithm.decide(state), executes the resulting orders, deducts holding cost, snapshots key performance indicators (KPIs).

Censored-demand injection is the subtle piece. Raw demand parquet only records sales. If we didn't stock something, we don't see the customers who wanted it. Daily page-view counts joined to per-SKU conversion rates synthesise an estimated loss; that loss feeds the velocity signal seen by the next tick's algorithm decision, but doesn't move revenue. Without it, an algorithm that stocks nothing looks like a market where nobody wants the product.

Time-varying vendor prices come in from a separate ~370 MB parquet exported from production MongoDB. Vendors change prices and stock status every day; the engine pre-builds a per-date delta list at startup (~20K deltas/day) and applies only that day's diffs into cheapest_offer_by_item, an incremental O(log n) maintenance that the inner loop reads as a dict.

Hot loop performance: 12–22× per decision, ~30× over a full backtest

Roughly 170K SKUs × 2,000 days = 340M events for a 5+ year backtest. Per-phase wall-clock instrumentation is recorded on every run; here's the impact breakdown, in order of leverage:

  • build_state went 444× (7,101 ms → 16 ms over a 6-month run). InventoryPosition objects are built once at startup with stable identities. A dirty_positions set persists across ticks; only positions touched by today's activity are synced, and that sync is direct __dict__ writes rather than Pydantic __setattr__. SystemState itself uses model_construct to skip Pydantic validation on the hot path. The inputs originate from typed internal code, so the validation contract is upheld at the simulation boundary, not on every tick.
  • algorithm_decide went 12–22× per algorithm. Engine-maintained sparse dicts (on_hand_by_sku, on_order_by_sku, cheapest_offer_by_item) replace per-tick Pydantic attribute-access scans over 53K positions. Deque-based rolling velocity replaces an O(items_sold_ever × window) list-filter.
  • offers_delta went 7×. Pre-resolved (vendor_id, item_id) → OfferSummary references, differential emission (fields that didn't change stored as None so the hot loop skips the != check), incremental cheapest-offer maintenance. Most deltas resolve via a single new_price < current.retail comparison.
  • Bisect-sorted pending orders. Arrivals split in O(log n) via bisect.bisect_right against a list kept sorted by arrival date. bisect.insort maintains the order on insert.
  • Incremental inv_value_dkk: updated on arrival and on sale, never re-summed across the full inventory.
  • Polars partition_by for the per-day demand grouping. Compiled Rust, single-instruction, multiple-data (SIMD) on AVX-512 targets, returns plain Python lists for O(1) tuple unpack in the inner loop.

The target hardware is an AMD 9950X3D with a 144 MB L3 V-Cache. The entire engine working set (on-hand dict, demand lists per day, offer summaries, per-algorithm velocity caches) fits inside the L3 cache, so dict lookups are L3-latency rather than memory-latency. Polars runs compiled Rust with AVX-512 for the columnar pre-processing. No NumPy; Polars outperforms it on this workload.

Locking the wins with source-level invariants

Optimisations decay. Someone refactors a year later in good faith, the refactor reads cleaner, the tick loop is now half as fast, and nobody notices for a month. The nine highest-leverage invariants above are locked into a dedicated test module that mixes behavioural assertions (run the engine, check the output) with source-level grep assertions (model_construct appears here, bisect.insort is called, the construction line for orderable_offers_flat sits outside the loop). The rule for a failing invariant is to investigate, not to fix the test.

What each of the nine invariants does, and why source-level grep beats benchmark regression for catching refactor decay, is covered in the hot-loop invariants writeup →

Why excluding censored cells from MINERVA's training grid gives you a forecaster whose output is exactly the conditional the allocator needs is covered in the quantile forecaster deep-dive →

Parallel sweeps and hyperparameter search

The platform supports batched multi-run execution with a forkserver pool so each worker starts from a primed shared state (trained MINERVA model, parquet caches, per-date offer deltas). Shared state ships through three hooks on the algorithm (warm_shared_state, get_shared_state_snapshot, restore_shared_state), so workers don't each retrain or reload from disk. A fork-after-OpenMP deadlock with LightGBM was debugged out of the way to get this working reliably on Linux. Each worker peaks at ~10 GB resident; concurrency on a 32 GB workstation is capped at three, instrumented in BatchSummary.peak_worker_rss_mb.

On top of that runs Optuna for hyperparameter tuning: each trial wraps a start_batch call, persists into SQLite, and is resumable across sessions. The search space is scaffolded and ready; the search itself waits on a single piece of stakeholder input (the exact stockout penalty λ the business wants to optimise against). That's the same constraint that gates the full multi-year calibration run. The engine is ready; the question is what objective the business actually wants to maximise.

Determinism as a feature

The simulator runs deterministically under a fixed seed. A standalone test runs a two-week backtest twice and asserts byte-for-byte identical output. This is the property that makes the arena meaningful: strategies can only be compared head-to-head if the only thing varying between runs is the strategy. The seed is also the entry point for stochastic robustness runs: same strategy, multiple noise seeds on the vendor lead-time jitter, to measure variance under realistic execution uncertainty.

Surrounding stack

FastAPI + Polars on the backend, Pixi for environment management, MariaDB as the source of truth, MongoDB exports for vendor-side history. The dashboard is a pure React + Vite + Bun single-page application (SPA) on TanStack Router (file-based routing) with TanStack Query for cache-aware reads. LightGBM for the forecaster, HiGHS (via direct highspy bindings) for the allocator. No NumPy anywhere.

If your business has this shape

Catalogue-driven retailers with a long tail of slow-moving SKUs (specialty books, audio gear, replacement parts, niche food and drink) all hit a version of this problem. The architecture is portable: the unique inputs are demand data and procurement constraints, not the kind of product. Two-stage decoupled forecasting-then-allocation lets you replace either side independently as better models or better solvers appear. The ablation pattern makes "is the new forecaster worth it?" answerable with numbers rather than vibes. If you're sitting on years of order history and you've never been able to ask "what would have happened if we'd done this differently," the arena is a small project that pays for itself the first time the numbers say something you didn't expect.

Related work