
Enterprise CI cluster · anonymised real-work case

Jenkins pipeline right-sizing

Took 2,600 production pipelines from 8% to ~60% memory utilisation by building per-build telemetry first, then designing bins from real percentile data. Same hardware, several times the headroom, and no pipeline needed a rewrite.

DevOps · Observability · Optimization · Data
Kubernetes · Jenkins · OpenTelemetry · Thanos · Grafana · Polars · Groovy

8% → 60%

RAM utilization, before → after

2,600

Production pipelines in scope

94%

Fit in the two smallest bins (≤1.25 GiB)

1.5×

Safety margin above measured P95

The problem

A large enterprise CI cluster ran 2,600 Jenkins pipelines on a uniform pod allocation: 8 GB RAM and 4 CPU each. Capacity incidents were a recurring operational theme. Queues built up. Builds occasionally failed for unclear scheduling reasons. The general suspicion was over-allocation, but nobody had numbers.

There was a structural reason for that. The Jenkins agent setup used long-lived, generic pods drawn from a static pool — none of them advertised which job was currently running. Resource attribution per pipeline didn't exist. You could see that the cluster was loaded; you couldn't see by what.

So before sizing the bins, the diagnostic infrastructure had to be built. That ended up being the more interesting half of the project.

What I built

1. Per-build pod identity

Replaced the static label-based agent reference with the Jenkins Kubernetes plugin's declarative kubernetes { inheritFrom ... } block, combined with idleMinutes 0 and podRetention never(). This makes the plugin render a fresh pod spec at build start, with access to the live Run object: it interpolates the job name, build number, and a uniqueness hash into the pod name, and injects build metadata as pod annotations (runUrl, buildUrl).
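
In a Jenkinsfile this is a small block. A minimal sketch, assuming a shared base template named ci-base and a trivial build stage (both placeholders, not the production values):

```groovy
// Minimal declarative agent block for the Jenkins Kubernetes plugin.
// 'ci-base' and the build step are placeholders for the real shared template and job.
pipeline {
    agent {
        kubernetes {
            inheritFrom 'ci-base'   // render a fresh pod spec from the shared base template at build start
            idleMinutes 0           // never keep the pod warm after the build
            podRetention never()    // delete the pod as soon as the build finishes
        }
    }
    stages {
        stage('build') {
            steps {
                sh 'make build'
            }
        }
    }
}
```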

The result: every Jenkins build runs in a single-use pod whose Kubernetes identity deterministically points back to a specific build. cAdvisor metrics now correlate to Jenkins runs.

2. Annotation scraping in OpenTelemetry

The pod-name path had a sharp edge: Kubernetes caps names at 63 characters and the plugin truncates from the front when job names are long, so regex-based extraction was lossy.

Extended the OTel collector's k8sattributes processor to promote the pod's annotations into metric labels (job_name from runUrl, job_run from buildUrl). This gave clean, structured identifiers that are immune to truncation and queryable directly in Thanos.
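
A sketch of the relevant collector configuration, assuming the k8sattributes processor is already in the metrics pipeline; the annotation keys follow the plugin behaviour described above, but the exact key names should be verified against a live pod:

```yaml
# Sketch of the k8sattributes extension: promote the build-identifying pod
# annotations to resource attributes, which surface as metric labels in Thanos.
processors:
  k8sattributes:
    extract:
      annotations:
        - tag_name: job_name
          key: runUrl      # annotation key as described above; verify against a live pod
          from: pod
        - tag_name: job_run
          key: buildUrl
          from: pod
```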

A side finding from this work: the platform had no scrape config for kubelet, scheduler, or API server metrics. A platform improvement was filed for that separately, outside the scope of this project.

3. Data pipeline

Built a Grafana panel with three queries that produce one row per build: max_over_time of working-set memory, max_over_time(rate(...)) of CPU time, and count_over_time × scrape interval as an approximate duration. CSV-exported the three tables, joined them in Polars on (job_name, job_run), computed per-pipeline distribution statistics (P50, P75, P90, P95, P99, max, mean) for both metrics.
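
The join-and-percentile step is small. A minimal sketch with Polars in Python, assuming each CSV export carries job_name, job_run, and one value column (file and column names here are illustrative, not the exact export headers):

```python
import polars as pl

# One CSV export per Grafana query, one row per build each (illustrative file/column names).
mem = pl.read_csv("peak_memory.csv")   # job_name, job_run, peak_mem_bytes
cpu = pl.read_csv("peak_cpu.csv")      # job_name, job_run, peak_cpu_cores
dur = pl.read_csv("duration.csv")      # job_name, job_run, duration_s

# One row per build with all three measurements.
builds = (
    mem.join(cpu, on=["job_name", "job_run"])
       .join(dur, on=["job_name", "job_run"])
)

# Per-pipeline distribution statistics across that pipeline's builds.
quantiles = [0.50, 0.75, 0.90, 0.95, 0.99]
stats = builds.group_by("job_name").agg(
    [pl.col("peak_mem_bytes").quantile(q).alias(f"mem_p{int(q * 100)}") for q in quantiles]
    + [
        pl.col("peak_mem_bytes").max().alias("mem_max"),
        pl.col("peak_mem_bytes").mean().alias("mem_mean"),
        pl.col("peak_cpu_cores").quantile(0.95).alias("cpu_p95"),
    ]
)
```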

The first surprise: median peak RAM was 441 MB against an 8 GB allocation. P95 across all pipelines was 999 MB. Average utilization of the existing allocation was around 8%.

4. Defensible bin design

The temptation here is to run k-means with k=3 and ship it. K-means is the wrong tool: its result depends on the random starting centroids (so it is non-deterministic), it assumes roughly equal-variance clusters (CI workloads are heavily right-skewed), and it optimizes distance to centroids rather than how well pipelines fit inside a bin. Jenks natural breaks is the right tool for 1D segmentation: deterministic, designed to minimize within-class variance, and the standard method for choropleth binning for decades. The argument and the tooling choice are written up in the methodology deep-dive.

Final bins, each with ~50% safety overprovisioning above the measured P95 of the pipelines assigned to it:

Bin   RAM (req = limit)   CPU request   Target population   Share of population
XS    768 MiB             500 m         P95 ≤ 500 MB        ~50%
S     1.25 GiB            1.0 core      P95 ≤ 1 GB          ~44%
M     2 GiB               1.5 cores     P95 ≤ 1.5 GB        ~5%
L     3 GiB               2.0 cores     P95 ≤ 2.5 GB        <1%

RAM request equals limit, so a build can never use more memory than it requested and stays out of the first wave of kubelet evictions under node memory pressure mid-build. The CPU limit is deliberately unset, so pods can burst above their request when nodes are underutilised (CPU throttles gracefully; memory does not).
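
To make the assignment rule concrete, a sketch of the kind of helper that could sit next to the registry; the function name, the bin identifiers beyond those quoted elsewhere in this writeup, and the above-L fallback are hypothetical:

```groovy
// Hypothetical helper: map a pipeline's measured P95 peak RAM (in MB) to a bin
// from the table above. The bins' requests already carry the ~1.5x margin over
// measured P95 described in the text.
String binForMeasuredP95(double p95MemMb) {
    if (p95MemMb <= 500)  return 'extra-small'   // 768 MiB req/limit, 500m CPU request
    if (p95MemMb <= 1000) return 'small'         // 1.25 GiB, 1.0 core
    if (p95MemMb <= 1500) return 'medium'        // 2 GiB, 1.5 cores
    if (p95MemMb <= 2500) return 'large'         // 3 GiB, 2.0 cores
    return 'unmigrated'                          // heavier than L: stay on the original 8 GB template
}

assert binForMeasuredP95(441) == 'extra-small'   // the fleet-wide median lands in XS
assert binForMeasuredP95(999) == 'small'         // the fleet-wide P95 lands in S
```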

5. Centralised registry in a shared pipeline library

Sizing decisions live in a YAML resource inside the organisation's shared Jenkins pipeline library (sizing.yaml), with a four-level resolution chain: disabled list → per-pipeline assignment → default_bin fallback → ultimate fallback to the original 8 GB template. An unmigrated sentinel keeps every pipeline not yet on the list running on the original template, so the migration is strictly opt-in. Every resolved template is echoed in the build console with its source — assigned:extra-small, default:unmigrated, disabled — so what the pipeline is actually getting is visible from the first line of build output.
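
A simplified sketch of that resolution chain as shared-library Groovy; disabled, default_bin, and the unmigrated sentinel come from the description above, while the per-pipeline assignment key (pipelines) and the helper names are assumptions about the file's shape:

```groovy
// Simplified resolver sketch. 'sizing' is the parsed sizing.yaml, e.g.
//   def sizing = readYaml(text: libraryResource('sizing.yaml'))
// The per-pipeline assignment key ('pipelines') is an assumed name.
Map resolveTemplate(Map sizing, String jobName) {
    if (sizing.disabled?.contains(jobName)) {
        return [bin: 'unmigrated', source: 'disabled']          // 1. explicitly disabled
    }
    String assigned = sizing.pipelines?.get(jobName)
    if (assigned) {
        return [bin: assigned, source: "assigned:${assigned}"]  // 2. per-pipeline assignment
    }
    String fallback = sizing.default_bin ?: 'unmigrated'        // 3. default_bin, else
    return [bin: fallback, source: "default:${fallback}"]       // 4. the original 8 GB template
}

// Echoed at the top of every build so the resolved template and its source are visible:
//   def choice = resolveTemplate(sizing, env.JOB_NAME)
//   echo "resource template: ${choice.bin} (${choice.source})"
```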

Rollout discipline

The migration is intentionally slow and reversible.

  1. Shadow mode. The resolver shipped to production before any pipeline was assigned to a smaller bin — every pipeline still resolved to the original 8 GB template. This validated the lookup path under real production load.
  2. Canary. One pipeline I knew well, comfortably below the XS threshold, watched for a week: build success, build duration, peak RAM, OOM events. JVM heap settings were audited up front (a hardcoded -Xmx larger than the bin's RAM is a latent OOM-kill waiting for the heap to grow into it).
  3. Staged batches. Five pipelines, then twenty, then a hundred. Diverse teams in each batch so a team-specific failure mode (e.g. a shared JVM tuning convention) wouldn't take out one group entirely.

Things that went wrong (and how I caught them)

A case study with no surprises is a case study someone is hiding. Three real ones from this project:

  • Cloud-level retention overrode pipeline-level idleMinutes 0. Pods were sticking around for exactly 3 minutes + 5–20 seconds after each build. The 3 minutes was a default in the Jenkins cloud config that was winning against the declarative override. Fixed by adding podRetention never() explicitly and dropping the cloud default to 0.
  • The cloud-level concurrency cap caused cross-workload contention. During master rollout, unrelated pipelines couldn't spawn agents — Jenkins was rejecting them because the 30-pod cloud-level cap was held by the ephemeral master pods churning through it. The fix wasn't to add more nodes; it was to understand which limit was binding.
  • Default bin pointing at an uncreated template would have broken every unmigrated pipeline. Caught in review: the initial default_bin: medium would have resolved to a pod template that didn't exist yet. Introduced the unmigrated sentinel that maps explicitly to the original 8 GB template, so staying on the original template is the default and pipelines move only when explicitly assigned.

What this is worth

Moving utilisation from 8% to ~60% on clusters of this size represents real, ongoing infrastructure savings, at a scale where the dollar figure is worth a conversation rather than a published number. What's worth sharing publicly is the methodology and the rollout discipline; the financials belong with the organisation, not the writeup.

The takeaway I'd point at instead is structural: the most expensive part of work like this is usually the missing telemetry, not the missing decisions. Pod identity, attribution, percentile analysis, defensible bin design — those are the deliverable. The bins themselves fall out of the data once the data exists.

Want the full methodology — the bin-design argument, the OTel processor configuration, the rollout discipline? See the methodology deep-dive →