
Writing · 14 May 2026 · 18 min read

Right-sizing 2,600 Jenkins pipelines without breaking production

How I went from 'agents waste 92% of allocated memory' to a phased migration with per-build telemetry, defensible bin sizing, and a rollback at every stage.

DevOps · Observability · Optimization · Data

A large enterprise CI cluster ran 2,600 Jenkins pipelines. Every pipeline got the same pod template: 8 GB RAM, 4 CPU. Capacity incidents were ongoing. Nobody had numbers.

This is the long version of how that ended up at ~60% memory utilization. The case study is the elevator pitch; this post is the design argument.

The interesting thing isn’t the bin sizes. It’s that the diagnostic infrastructure had to be built first — and that’s the part that’s usually missing when this kind of work hasn’t shipped already.

The problem under the problem

The previous Jenkins setup used a static agent pool: long-lived, identical pods, drawn from a named pod template via the Jenkins label mechanism. Pipelines requested an agent by label; the plugin handed back the next idle pod from the pool.

This pattern is fast (pre-warmed pods, near-instant build starts) but opaque to telemetry. The pod that runs your build has no metadata pointing back at the build. cAdvisor sees a pod named acme-jnlp-abc12 consuming memory — and that pod, over its lifetime, ran builds 47, 48, 49, and 50 of three different pipelines. The memory peak in cAdvisor cannot be attributed to any of them.

You cannot size what you cannot measure. So the first job was attribution.

Per-build pod identity

The Jenkins Kubernetes plugin has two distinct code paths:

                       label-based                               declarative inheritFrom
  Pod creation time    Template loaded from controller config    Spec rendered at build start
  Pod name             Generic, template-derived                 <job-name>-<build-number>-<random>
  Annotations          None build-specific                       runUrl, buildUrl, jenkins/label, more
  Reusable             Yes (from the pool)                       No, single-use

The declarative path is what we want. It’s literally designed to render pod specs in build context, which means the live Run object is available — the plugin can substitute job name, build number, run URL into the pod spec at the moment of creation. The price is that pods are single-use; the gain is that every pod’s identity is the identity of its build.

In the shared pipeline library:

pipeline {
    agent {
        kubernetes {
            inheritFrom 'acme-jnlp'
            idleMinutes 0
            podRetention never()
        }
    }
}

Three settings, three reasons:

  • inheritFrom selects the named pod template as a base, but renders it per build.
  • idleMinutes 0 forces teardown immediately after the build, preventing pool-style reuse.
  • podRetention never() is the explicit “don’t keep this pod alive under any condition” instruction. This is load-bearing: without it, the plugin can fall back to retention policies inherited from the template or cloud-level defaults, and pods linger for minutes after the build that you never asked for.

A sharp edge: even with all three above, pods were lingering for exactly 3 minutes + small jitter after build completion. The culprit turned out to be the cloud-level pod retention default in Jenkins’s Kubernetes cloud configuration — a setting completely outside the pipeline-level YAML, which silently overrode idleMinutes 0. Dropping the cloud default to 0 was the fix. I’d missed it the first three times. If your pods are surviving past podRetention never(), that’s where to look.

The 63-character truncation problem

Kubernetes caps resource names at 63 characters (DNS-1123 label limit). The Jenkins plugin constructs pod names as <sanitized-job-name>-<build-number>-<hash>. The hash and number take ~18 characters; with anything but the shortest job names, the plugin has to truncate something.

Some plugin versions truncate from the front, which corrupts the job name part. Pods that should have been acme-team-project-master-pipeline-47-xyz come out as team-project-master-pipeline-47-xyz. Plugin behaviour is correct (the hash is the uniqueness guarantee, so it can’t be touched), but it means the pod name is no longer a reliable key for joining metrics back to pipelines.

The fix is to stop using pod names as keys at all.

OpenTelemetry: scrape the annotations

The plugin already injects build identity as pod annotations. runUrl and buildUrl are the full Jenkins URLs that point at the specific build. They’re never truncated, and they contain the canonical job path.

If the OTel collector’s k8sattributes processor is configured to promote these annotations to metric labels, the problem disappears. The processor block in the collector config:

processors:
  k8sattributes:
    extract:
      annotations:
        - tag_name: job_name
          key: runUrl
          from: pod
        - tag_name: job_run
          key: buildUrl
          from: pod

After this lands, every metric Thanos receives for a Jenkins pod carries job_name and job_run as labels. Queries can group on them directly:

max by (job_name, job_run) (
  max_over_time(
    container_memory_working_set_bytes{
      k8s_namespace_name="acme-ci",
      job_name=~"acme-.*"
    }[7d]
  ) / (1024 * 1024)
)

One row per build, keyed on the build’s identity, no regex parsing of pod names.

A worthwhile side finding: the OTel collector was scraping application metrics but not control-plane signals (kubelet, scheduler, API server). That’s a common shape for observability stacks that grew up around app monitoring — control-plane scrape configs are usually a follow-on once someone needs them. The diagnostic surfaced exactly that need, and the scrape-config extension was filed as separate platform work. It’s the kind of finding the diagnostic phase produces incidentally, and it tends to be worth more than it looks.

The data pipeline

Three Grafana queries, exported as CSVs, joined in Polars:

# Peak RAM per build (MB)
max by (job_name, job_run) (
  max_over_time(container_memory_working_set_bytes{...}[30d]) / (1024 * 1024)
)

# Peak CPU rate per build (cores)
max by (job_name, job_run) (
  max_over_time(rate(container_cpu_time_seconds_total{...}[2m])[30d:2m])
)

# Approximate duration per build (seconds)
count_over_time(container_memory_working_set_bytes{...}[30d]) * 60

The duration query exploits a property of cAdvisor: it only emits samples while the container exists. count_over_time over a long window therefore counts the number of sample points in the container’s lifetime, and multiplying by the scrape interval gives an approximate duration. With a 60-second scrape interval the precision is ±60s, which is fine for 5–15 minute builds and not fine for shorter ones — worth documenting as a caveat.

The Polars side is small:

import polars as pl

# One row per (job_name, job_run) in each CSV export
ram = pl.read_csv("ram.csv").rename({"Value #A": "peak_ram_mb"})
cpu = pl.read_csv("cpu.csv").rename({"Value #B": "peak_cpu_cores"})
dur = pl.read_csv("duration.csv").rename({"Value #C": "duration_seconds"})

# Full outer joins keep builds that only appear in some of the exports
df = (
    ram.join(cpu, on=["job_name", "job_run"], how="full", coalesce=True)
       .join(dur, on=["job_name", "job_run"], how="full", coalesce=True)
)

# Per-pipeline distribution of peak usage across the 30-day window
stats = df.group_by("job_name").agg([
    pl.len().alias("build_count"),
    pl.col("peak_ram_mb").quantile(0.50).alias("ram_p50"),
    pl.col("peak_ram_mb").quantile(0.95).alias("ram_p95"),
    pl.col("peak_ram_mb").max().alias("ram_max"),
    pl.col("peak_cpu_cores").quantile(0.95).alias("cpu_p95"),
])

From here, the distribution can be sliced whichever way the analysis needs.

The bin-design argument

This is the section that most often goes wrong.

Out of 1,169 pipelines with enough data (≥3 builds in the 30-day window), the per-pipeline P95 RAM distribution was strongly right-skewed:

  • 50% had P95 below 500 MB
  • 73% had P95 below 600 MB
  • 89% had P95 below 800 MB
  • 94% had P95 below 1 GB
  • 99% had P95 below 2 GB
  • A single pipeline had P95 above 3 GB
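
Those shares fall straight out of the stats frame built above. A minimal sketch, continuing the same Polars session (the threshold list mirrors the cut-offs quoted):

eligible = stats.filter(pl.col("build_count") >= 3)   # the 1,169 pipelines with enough data

for threshold_mb in [500, 600, 800, 1000, 2000]:
    share = (eligible["ram_p95"] <= threshold_mb).mean()
    print(f"P95 <= {threshold_mb} MB: {share:.0%} of pipelines")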

The temptation in this kind of distribution is to reach for k-means. Don’t.

K-means is the wrong tool for 1D segmentation:

  • Non-deterministic. Random centroid initialisation means two runs on the same data give different bin boundaries unless you fix the seed. For a recommendation that ends up applied to 2,600 pipelines, “depends on the random seed” is not a defensible position.
  • Equal-variance assumption. K-means treats clusters as roughly spherical with similar widths. CI workload distributions are virtually never that shape — they’re skewed, with a dense low-end and a sparse tail.
  • Wrong objective. K-means minimises within-cluster sum of squared distances to the centroid. You don’t care about distance to a centroid; you care about the bin’s upper bound, because that determines whether the bin fits the workload.

Jenks natural breaks is the right tool. It’s specifically designed for 1D segmentation, explicitly minimises within-class variance while maximising between-class variance, is deterministic, and is the standard method cartographers have used to bin continuous data into choropleth maps for decades. Same problem shape. In Python: jenkspy.jenks_breaks(data, n_classes=4).

In practice the three methods (quantile-based, Jenks, eyeballing the histogram) all agreed on the bin boundaries for this distribution — which is exactly the validation you want. Agreement gives confidence; disagreement is itself informative. When all three methods suggest a break at the same place, the data wants the bin there.
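That cross-check is a few lines on top of the eligible frame from the sketch above; jenkspy and the quartile choice are the only assumptions here:

import jenkspy

p95_values = eligible["ram_p95"].drop_nulls().to_list()

# Jenks natural breaks: deterministic, variance-minimising 1D segmentation
jenks_breaks = jenkspy.jenks_breaks(p95_values, n_classes=4)

# Quantile breaks for comparison (quartiles here, purely for illustration)
quantile_breaks = [eligible["ram_p95"].quantile(q) for q in (0.25, 0.50, 0.75)]

print("Jenks breaks:   ", [round(b) for b in jenks_breaks])
print("Quantile breaks:", [round(b) for b in quantile_breaks])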

The final bins were chosen with explicit safety overprovisioning of 20–50% above each tier’s P95 target:

  Bin   RAM (req=lim)   CPU req     Target P95   Sized at
  XS    768 MiB         500 m       ≤ 500 MB     1.5× P95
  S     1.25 GiB        1.0 core    ≤ 1 GB       1.25× P95
  M     2 GiB           1.5 cores   ≤ 1.5 GB     1.33× P95
  L     3 GiB           2.0 cores   ≤ 2.5 GB     1.2× P95

RAM request equals limit because eviction mid-build is catastrophic (builds fail in confusing ways, no clean retry semantics). With the memory request equal to the limit, usage can never exceed the request, so the pod stays out of the first-to-evict group under node memory pressure; a build that genuinely outgrows its bin is OOM-killed at the limit instead, which is at least attributable. The marginal cluster-packing benefit of a lower memory request isn’t worth that. (Strictly, with the CPU limit unset the pods land in the Burstable QoS class rather than Guaranteed; the protection comes from the memory request=limit, not the class label.)

The CPU limit is left unset because CPU contention degrades gracefully: under pressure the kernel scheduler simply gives the pod less time, and nothing gets killed. The request is what the scheduler reserves; setting only the request means the pod is guaranteed its share but can burst above it on underutilised nodes. For build workloads this is uniformly better than fixed limits.
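
Put concretely, the XS bin’s container resources block has roughly this shape (figures from the table above; the real values live in the pod templates, not here):

resources:
  requests:
    memory: "768Mi"
    cpu: "500m"
  limits:
    memory: "768Mi"   # equal to the request: no eviction for exceeding the request
    # no cpu limit: the pod can burst above its 500m request on idle nodes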

The bins are deliberately conservative for the first migration wave. Tightening them is a follow-up project, once monitoring confirms no OOM pressure across the migrated population.

Centralised registry, override path, escape hatch

Sizing decisions live in a single YAML file inside the shared pipeline library, with a four-level resolution chain in a small Groovy class:

Map resolve(String pipelineIdentifier) {
    // 1. Explicit opt-out: pinned to the original template
    if (sizing.disabled?.contains(pipelineIdentifier)) {
        return [podTemplate: FALLBACK_TEMPLATE, source: "disabled"]
    }
    // 2. Explicit assignment to a bin
    def assignedBin = sizing.assignments?.get(pipelineIdentifier)
    if (assignedBin) {
        def template = sizing.bins?.get(assignedBin)
        if (template) {
            return [podTemplate: template, source: "assigned:${assignedBin}"]
        }
    }
    // 3. Registry default (the unmigrated sentinel during rollout)
    def defaultBin = sizing.default_bin
    def defaultTemplate = sizing.bins?.get(defaultBin)
    if (defaultTemplate) {
        return [podTemplate: defaultTemplate, source: "default:${defaultBin}"]
    }
    // 4. Hard fallback if the registry is missing or malformed
    return [podTemplate: FALLBACK_TEMPLATE, source: "fallback"]
}
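
For context, the YAML registry the resolver reads has roughly this shape. The bin and pipeline names below are illustrative, apart from acme-jnlp and acme-jnlp-extra-small, which appear elsewhere in this post:

bins:
  extra-small: acme-jnlp-extra-small
  small:       acme-jnlp-small
  medium:      acme-jnlp-medium
  large:       acme-jnlp-large
  unmigrated:  acme-jnlp            # the original 8 GB template

default_bin: unmigrated

assignments:
  acme-foo: extra-small

disabled:
  - acme-legacy-service             # hardcoded -Xmx, pinned to the original template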

Three details that matter operationally:

  1. The unmigrated sentinel as default_bin maps to the original 8 GB template. Pipelines without an explicit assignment stay on the existing setup. Opt-in migration, not opt-out — and a typo in the YAML can’t accidentally bin everyone into a smaller template.

  2. The disabled list is the explicit escape hatch. Pipelines that have been identified as bad migration candidates (memory leaks, hardcoded -Xmx, deprecated) are pinned to the original template even if someone later adds them to assignments.

  3. The echo line prints the resolved template and the source for every build: Pod template for acme-foo: acme-jnlp-extra-small (source: assigned:extra-small). During rollout this is the fastest way to spot mistakes — you can see what each pipeline is actually getting without inspecting pod YAML.
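
That echo is one line in the library. A sketch, where PodSizing and sizingYaml stand in for whatever the class and variable are actually called:

def resolved = new PodSizing(sizingYaml).resolve(env.JOB_NAME)
echo "Pod template for ${env.JOB_NAME}: ${resolved.podTemplate} (source: ${resolved.source})"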

Rollout

The migration is intentionally slow, intentionally reversible, intentionally observable.

  1. Shadow merge. The resolver itself shipped to production before any pipeline was assigned to a smaller bin. The assignments map was empty; every pipeline resolved to default:unmigrated → original template. This validated the resolver’s behaviour under real load before any resource changes happened.

  2. Single canary. One pipeline, comfortably below the XS threshold, on a fresh shared-pipeline-library branch. Watched for a week. Build success, build duration, OOM events, JVM behaviour.

  3. Staged batches: 5 → 20 → 100. Each batch runs for several days before the next begins. The batches are picked deliberately:

    • Manually triggerable master pipelines (deterministic, easy to re-run)
    • High build frequency (faster signal if something breaks)
    • Diverse teams (so a team-specific failure mode doesn’t take out one group)
    • No hardcoded -Xmx exceeding the bin (audited before inclusion)

    A pipeline that consistently runs above 85% of its bin’s RAM during the migration is flagged for promotion to the next tier. A pipeline that OOMs is reverted immediately and the cause investigated.
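
The 85% check runs against the same stats frame, refreshed over the migration window. A sketch, continuing the Polars session above; the assignment rows here are illustrative:

# RAM ceiling per bin, in the same units as peak_ram_mb (bytes / 1024²)
assignments_df = pl.DataFrame({
    "job_name":   ["acme-foo", "acme-bar"],
    "bin":        ["extra-small", "small"],
    "bin_ram_mb": [768, 1280],
})

# Migrated pipelines whose P95 already sits above 85% of their bin's RAM
flagged = (
    stats.join(assignments_df, on="job_name", how="inner")
         .filter(pl.col("ram_p95") > 0.85 * pl.col("bin_ram_mb"))
)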

What I’d do differently

  • Scrape the platform metrics first. The discovery that kubelet/scheduler/API metrics weren’t being scraped surfaced during data collection. Doing this first would have made the incident-window investigations during rollout faster, even though it wasn’t on the critical path for sizing.

  • Get the cloud-level config audit done before any merge. The pod retention default override cost a day of debugging. Reading the JCasC YAML completely up front would have caught it.

  • Audit JVM heap configuration earlier. Some pipelines hardcode -Xmx larger than even the original 8 GB pod could practically support; those won’t surface until they hit the new smaller bin. Doing a grep -rE 'JAVA_OPTS|MAVEN_OPTS|-Xmx' across the pipeline scripts repo earlier would have given a candidate list for the disabled section before the canary, not after.

The takeaway

The bins are not the value. The bins are an inevitability once the data exists.

The value is the per-build attribution, the annotation-based identity that survives pod name truncation, the data pipeline that produces honest percentile distributions, the defensible argument for why this bin and not that one, and the rollout discipline that keeps the migration reversible at every stage.

A lot of CI clusters at this scale have the same shape of problem, and the telemetry to diagnose it usually isn’t there yet. The fix itself is small; the missing prerequisite is what’s expensive. That’s where this project — and the rest of my optimisation work — tends to deliver value.

If you have a cluster like this and want to talk about the diagnostic phase specifically, get in touch.