Training a demand forecaster on the cells you didn't lose to stockouts

A short writeup on a single design decision in a demand-forecasting pipeline, because it captures a pattern that comes up repeatedly when modelling for a downstream optimiser: think hard about what the model’s output is supposed to mean, and shape the training target to match. The case lives inside the inventory decision engine I built for a long-catalogue specialty e-commerce business (case page); the forecaster in question feeds a linear-program (LP) allocator that picks integer order quantities under a capital constraint.

The naive training setup

You have a sales table with one row per stock-keeping unit (SKU) per day, recording the quantity sold. The obvious target for a demand-forecasting model is exactly that: take the (SKU, day) grid, join sales, fill the missing cells with zero, train. The model learns “given features, predict units sold on this (SKU, day).”

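# Every (SKU, day) pair; cells with no matching sales row become zero.
training_grid = catalog.join(calendar, how="cross")
training = training_grid.join(sales, on=["sku", "day"], how="left").fill_null(0)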

There’s nothing structurally wrong with this code. It will train. The model will converge. The predictions will look reasonable. And the model will be subtly, load-bearingly wrong.

What the model actually learns

Consider what the (SKU, day) → zero-sales rows in that training set actually represent. Some of them are days where the SKU was in stock and nobody bought it. True zero demand. The model should learn that. Others are days where the SKU was out of stock and there was no chance to register a sale even if a customer wanted to buy. That’s the classic censoring problem in survival-analysis-flavoured data. The model cannot distinguish those rows from real zeros. Both look like units_sold = 0 in the table.
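To make the ambiguity concrete, here is a toy sketch in plain Python. The rows and numbers are invented for illustration; only the names orderable and units_sold follow the post.

```python
# Toy (SKU, day) rows for a single SKU: (orderable, units_sold). Invented data.
rows = [
    (True, 2), (True, 0), (True, 1), (True, 1),      # in stock: real observations
    (False, 0), (False, 0), (False, 0), (False, 0),  # stocked out: censored zeros
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Zero-filled grid: censored zeros are averaged in as if they were real demand.
naive = mean(units for _, units in rows)

# Only the cells where a sale was possible.
conditional = mean(units for orderable, units in rows if orderable)

print(naive, conditional)  # 0.5 1.0
```

In the sales table alone, the five zero rows are indistinguishable. Only one of them is a real zero-demand day.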

For a SKU that has frequent stockouts, the training data leans heavily toward zero on exactly the days a competent procurement system would have stocked it. The model learns “this SKU sells nothing.” That belief gets fed to the allocator. The allocator orders nothing. The SKU stays out of stock. The next training pass sees more zeros. The system has talked itself into an equilibrium that maximises the chance of being trivially correct about its own forecast.
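The loop can be sketched as a toy simulation. Everything here is an assumption made for illustration: demand is a constant one unit per in-stock day, the "forecaster" is a plain mean, and the "allocator" is a simple threshold rule rather than the LP described above.

```python
def final_forecast(seed_sales, seed_orderable, true_demand, threshold, days,
                   condition_on_orderable):
    """Run a toy forecast -> allocate -> observe loop; return the last forecast."""
    sales, orderable = list(seed_sales), list(seed_orderable)
    forecast = None
    for _ in range(days):
        if condition_on_orderable:
            observed = [s for s, o in zip(sales, orderable) if o]
        else:
            observed = sales  # zero-filled history, censored days included
        forecast = sum(observed) / len(observed)
        stocked = forecast >= threshold  # hypothetical stand-in for the allocator
        orderable.append(stocked)
        sales.append(true_demand if stocked else 0)
    return forecast


# Two in-stock days, then four days of vendor-side stockout.
seed_sales = [1, 1, 0, 0, 0, 0]
seed_orderable = [True, True, False, False, False, False]

naive = final_forecast(seed_sales, seed_orderable, 1, 0.5, 10,
                       condition_on_orderable=False)
conditional = final_forecast(seed_sales, seed_orderable, 1, 0.5, 10,
                             condition_on_orderable=True)
```

In this toy setup the zero-filled forecast starts below the threshold and only falls further, because every day it declines to stock adds another zero to its own history. The forecast computed on orderable days stays at the true demand.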

Imputation is the wrong fix

The textbook response to censored data is to impute. You build a second model that estimates “latent demand,” meaning what would have sold if the SKU had been available, and you train the forecaster on the union of real sales and imputed sales. There are entire research lines around how to do this: parametric assumptions about the demand distribution, page-view-times-conversion-rate scaling, time-of-year priors.

I deliberately didn’t go this route. Two reasons:

First, imputation introduces a model you can’t validate independently. The latent-demand estimator’s outputs are by definition unobservable. You can pick reasonable-looking priors, but you can’t run a held-out set against the truth, because there is no truth. Whatever bias the imputation model has, you propagate into the forecaster, and the forecaster’s “good” cross-validation numbers say nothing about whether the imputation was right.
Second, there’s a flag already in the data that solves this directly. The catalog pipeline records per-(SKU, day) whether the SKU was orderable, meaning at least one vendor had it in stock or backorderable. That flag is the censoring indicator, no inference required.

The actual decision: drop the censored cells, change the model’s semantics

import polars as pl

# Training grid is the orderable cells only (assumes a Boolean "orderable" column).
training = page_views.filter(pl.col("orderable"))
training = training.join(sales, on=["sku", "day"], how="left").with_columns(
    pl.col("units_sold").fill_null(0)
)

One filter and one join. The training set is exactly the cells where, if a customer had wanted to buy, the system would have allowed it. Censored cells (days the SKU was unavailable) are gone. The zero-sales rows that remain are real zero-demand observations because the SKU was available and nobody bought it.

The crucial consequence is what the model learns to predict. Before this change, the model was estimating

E[units_sold | features]

which is what the sales table contains. After this change, the model is estimating

E[units_sold | orderable=True, features]

The model has not become smarter. It has become specifically calibrated to the conditional expectation the downstream allocator needs.
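One way to see the gap between the two targets is the law of total expectation. The symbols are mine, not the pipeline's: x stands for the features and O for the orderable flag, and the sketch assumes a non-orderable cell always records zero sales, which is what the zero-filled table implies.

```latex
\mathbb{E}[\text{units\_sold} \mid x]
  = \Pr(O = 1 \mid x)\,\mathbb{E}[\text{units\_sold} \mid O = 1, x]
  + \Pr(O = 0 \mid x)\cdot 0
  = \Pr(O = 1 \mid x)\,\mathbb{E}[\text{units\_sold} \mid O = 1, x]
```

Under that assumption, the unconditional target is the conditional one scaled down by the historical in-stock rate, so a SKU that was orderable half the time has its apparent demand halved.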

Why this exact conditional is what the allocator needs

The allocator’s decision is “make this SKU orderable today by buying enough of it to matter.” That decision is meaningful only in the regime where the SKU is orderable. The allocator is the thing that makes orderability happen. Asking the forecaster “what’s the expected demand for this SKU?” without conditioning on orderability is asking a question whose answer mixes two regimes: days when the SKU could sell and days when it couldn’t.

What the allocator actually wants is “if you make this SKU orderable, what will sell?” That’s the conditional. The training-data trick makes the model answer exactly that question.

Quantile losses fall out cleanly

The forecaster outputs three quantiles (q10, q50, q90) rather than a single point estimate, because the allocator needs to reason about stockout risk in the tails. Each quantile is a separate booster trained with objective="quantile" and the corresponding alpha. The training-grid fix does not depend on which quantile is being trained: all three boosters see the same orderable-only cells, so each converges toward the matching quantile of demand conditional on orderability. The allocator’s stockout penalty against q90 therefore reflects stockout risk given the allocator’s own intervention, rather than risk diluted by historical stockouts.
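For reference, the quantile objective minimises the pinball loss. The sketch below is plain Python rather than the LightGBM training code, and the sample is invented. It shows that the constant prediction minimising the loss at each alpha is the corresponding sample quantile.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss at level alpha."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction costs alpha per unit, over-prediction costs (1 - alpha).
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)


# Invented zero-heavy sample: 12 zero-demand cells and 3 cells with sales.
sample = [0] * 12 + [1, 3, 4]
candidates = sorted(set(sample))

best = {
    alpha: min(candidates,
               key=lambda c, a=alpha: pinball_loss(sample, [c] * len(sample), a))
    for alpha in (0.1, 0.5, 0.9)
}
print(best)  # {0.1: 0, 0.5: 0, 0.9: 3}
```

On zero-heavy data the q10 and q50 minimisers collapse to zero, and only the upper quantile carries information about how much sells when something does sell.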

That’s not something I could have produced by imputing latent demand and then fitting a quantile loss to the imputed series. The imputation would have systematically shifted the tails in ways the allocator’s penalty was not calibrated for.

The cross-validation result, before and after

On the 12-fold walk-forward window, this single change moved pinball-90 (the metric that matters most for procurement, because the cost of mis-sizing safety stock lives in the right tail of demand) from above the zero baseline to ~27% below it. In other words, tail-forecast error went from worse than “always predict zero” to about 27% better than it. The median-quantile pinball stayed essentially at the baseline, because demand at the (SKU, day) granularity is ~97% zeros and “always predict zero” is a brutally tough median baseline. The tail is where the signal lives, and the tail is what the training-data trick unlocked.
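A minimal sketch of how a "relative to the zero baseline" number can be computed, on invented values. This is not the evaluation code from the pipeline, and it makes no attempt to reproduce the figures above.

```python
def pinball(y_true, y_pred, alpha):
    """Mean pinball loss, written with the max() identity."""
    return sum(
        max(alpha * (y - q), (alpha - 1) * (y - q)) for y, q in zip(y_true, y_pred)
    ) / len(y_true)


# Invented hold-out cells and hypothetical q90 predictions for them.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 2, 4]
model_q90 = [1, 0, 0, 0, 0, 0, 0, 0, 1, 3]
zero_q90 = [0] * len(actual)  # the "always predict zero" baseline

model_loss = pinball(actual, model_q90, 0.9)
baseline_loss = pinball(actual, zero_q90, 0.9)

# Positive means the model beats the baseline; negative means it is worse.
relative_improvement = 1 - model_loss / baseline_loss
```

At alpha = 0.9 the zero baseline pays heavily on every cell where something sold. That is why a tail model can beat it even when the median cannot.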

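Coverage-80, the fraction of actual sales falling inside the predicted [q10, q90] band, landed at 0.980 against a 0.80 target. That is over-coverage rather than tight calibration: the band is wider than nominal, and with ~97% of cells at zero, any band whose lower edge sits at zero covers most cells before the model has done anything. Too wide is still the right side for a procurement input to err on; an over-confident tight interval would let the LP under-buy and stock out more.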
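One caveat when reading a coverage number on zero-heavy data, sketched with invented values: the share of zero cells puts a floor under coverage for any band that includes zero, so the interesting part of the metric is what happens on the non-zero cells.

```python
def coverage(y_true, lower, upper):
    """Fraction of observations inside the [lower, upper] band."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


# Invented hold-out set: 97 zero cells and 3 cells with sales.
actual = [0] * 97 + [1, 2, 5]

# A degenerate band that is [0, 0] everywhere already covers every zero cell.
degenerate = coverage(actual, [0] * 100, [0] * 100)

# Coverage restricted to the cells where something sold, for a hypothetical band.
nonzero = [y for y in actual if y > 0]
nonzero_cov = coverage(nonzero, [0] * 3, [3] * 3)

print(degenerate)  # 0.97
```

Reporting coverage on the non-zero cells alongside the overall number is one way to check whether the upper quantile is doing real work.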

The pattern, generalised

Two takeaways that I think transfer to any modelling pipeline with a downstream consumer:

The training target is part of the model’s API. The output of a forecaster is implicitly conditional on the data-generating process it was trained against. If a downstream system is going to use that output as input to a decision, you need to make sure the conditional matches the decision. “Predict demand” is the wrong specification; “predict demand under the conditions that the next system will actually create” is the right one. In this case those two are very different, and that gap is what separates a model that helps from one that quietly reinforces its own historical errors.

Look for the flag before you reach for the model. Imputation models, latent-variable estimators, dual-model architectures: all of these are sometimes the right answer. But they’re never the first thing to reach for. If your data pipeline already records the flag that distinguishes the regime you care about from the regime you don’t, use it. A filter on a flag is a fragment of a query. An imputation model is a research project.

The forecaster code, after the change, is mostly boilerplate around LightGBM. The interesting decision was made before any LightGBM was called, in the line that built the training grid.