Enterprise CI cluster · anonymised real-work case
Jenkins pipeline right-sizing
Took 2,600 production pipelines from 8% to ~60% memory utilisation by building per-build telemetry, then designing bins from real percentile data. Same hardware, several multiples more headroom, no rewrite of any pipeline required.
- 8% → 60% · RAM utilisation, before → after
- 2,600 · production pipelines in scope
- 94% · fit in the two smallest bins (≤1.25 GiB)
- 1.5× · safety margin above measured P95
The problem
A large enterprise CI cluster ran 2,600 Jenkins pipelines on a uniform pod allocation: 8 GB RAM and 4 CPU each. Capacity incidents were a recurring operational theme. Queues built up. Builds occasionally failed for unclear scheduling reasons. The general suspicion was over-allocation, but nobody had numbers.
There was a structural reason for that. The Jenkins agent setup used long-lived, generic pods drawn from a static pool — none of them advertised which job was currently running. Resource attribution per pipeline didn't exist. You could see that the cluster was loaded; you couldn't see by what.
So before sizing the bins, the diagnostic infrastructure had to be built. That ended up being the more interesting half of the project.
What I built
1. Per-build pod identity
Replaced the static label-based agent reference with the Jenkins Kubernetes plugin's declarative `kubernetes { inheritFrom ... }` block, combined with `idleMinutes 0` and `podRetention never()`. This makes the plugin render a fresh pod spec at build start, with access to the live `Run` object — it interpolates the job name, build number, and a uniqueness hash into the pod name, and injects build metadata as pod annotations (`runUrl`, `buildUrl`).
The result: every Jenkins build runs in a single-use pod whose Kubernetes identity deterministically points back to a specific build. cAdvisor metrics now correlate to Jenkins runs.
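For orientation, the agent block has roughly this shape; the `ci-base` template name and the build step are placeholders, not the project's actual values:

```groovy
// Minimal sketch of the per-build agent block. 'ci-base' and the
// build step are placeholders, not the project's actual values.
pipeline {
    agent {
        kubernetes {
            inheritFrom 'ci-base'   // shared pod template from the cloud config
            idleMinutes 0           // no warm reuse: one pod per build
            podRetention never()    // delete the pod as soon as the build ends
        }
    }
    stages {
        stage('Build') {
            steps {
                sh 'make build'     // placeholder build step
            }
        }
    }
}
```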
2. Annotation scraping in OpenTelemetry
The pod-name path had a sharp edge: Kubernetes caps names at 63 characters and the plugin truncates from the front when job names are long, so regex-based extraction was lossy.
Extended the OTel collector's `k8sattributes` processor to promote the pod's annotations into metric labels (`job_name` from `runUrl`, `job_run` from `buildUrl`). This gave clean, structured identifiers, immune to truncation, queryable directly in Thanos.
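For reference, the relevant processor stanza looks roughly like this. The `tag_name` values and the assumption that the plugin's annotation keys are literally `runUrl` and `buildUrl` are illustrative; deriving the final `job_name` and `job_run` labels from those URL values happens in a later processing step.

```yaml
# Sketch of the collector stanza; tag names and keys are illustrative.
processors:
  k8sattributes:
    auth_type: serviceAccount
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.name
          - from: resource_attribute
            name: k8s.namespace.name
    extract:
      annotations:
        - from: pod
          key: runUrl          # annotation injected by the Kubernetes plugin
          tag_name: run_url    # promoted onto every metric from that pod
        - from: pod
          key: buildUrl
          tag_name: build_url
```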
A side finding from this work: the platform had no scrape config for kubelet, scheduler, or API server metrics. A separate platform improvement was filed independently of this project.
3. Data pipeline
Built a Grafana panel with three queries that produce one row per build: `max_over_time` of working-set memory, `max_over_time(rate(...))` of CPU time, and `count_over_time` × scrape interval as an approximate duration. CSV-exported the three tables, joined them in Polars on `(job_name, job_run)`, and computed per-pipeline distribution statistics (P50, P75, P90, P95, P99, max, mean) for both metrics.
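The join-and-aggregate step is small enough to show. A sketch in Polars, where the CSV file names and column names stand in for the three exports described above (the memory query itself was along the lines of `max_over_time(container_memory_working_set_bytes{...}[$__range])`):

```python
import polars as pl

# Three CSV exports, one per Grafana query; file/column names are placeholders.
mem = pl.read_csv("peak_mem.csv")    # job_name, job_run, peak_mem_bytes
cpu = pl.read_csv("peak_cpu.csv")    # job_name, job_run, peak_cpu_cores
dur = pl.read_csv("duration.csv")    # job_name, job_run, duration_seconds

keys = ["job_name", "job_run"]
builds = mem.join(cpu, on=keys).join(dur, on=keys)  # one row per build

# Per-pipeline distribution statistics over all of that pipeline's builds.
quantiles = [0.50, 0.75, 0.90, 0.95, 0.99]
stats = builds.group_by("job_name").agg(
    [pl.col("peak_mem_bytes").quantile(q).alias(f"mem_p{int(q * 100)}")
     for q in quantiles]
    + [
        pl.col("peak_mem_bytes").max().alias("mem_max"),
        pl.col("peak_mem_bytes").mean().alias("mem_mean"),
        pl.col("peak_cpu_cores").quantile(0.95).alias("cpu_p95"),
        pl.len().alias("n_builds"),
    ]
)
print(stats.sort("mem_p95", descending=True))
```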
The first surprise: median peak RAM was 441 MB against an 8 GB allocation. P95 across all pipelines was 999 MB. Average utilisation of the existing allocation was around 8%.
4. Defensible bin design
The temptation here is k-means with k=3 and ship. K-means is the wrong tool: it's non-deterministic (random initialisation), it assumes roughly equal-variance clusters (CI workloads are heavily right-skewed), and it optimises centroid distance rather than bin fit. Jenks natural breaks is the right tool for 1D segmentation — deterministic, designed to minimise within-class variance, and the standard method for choropleth binning for decades. The argument and tooling choice are written up in the methodology deep-dive.
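Once per-pipeline P95s exist, the break-finding itself is a few lines. A sketch continuing from the `stats` frame above, using the `jenkspy` implementation of Fisher-Jenks (the keyword argument has been `n_classes` in recent releases; the resulting break points were then rounded up to the human-friendly ceilings in the table below):

```python
import jenkspy

# Per-pipeline P95 peak RAM in MB, from the stats frame built earlier
# (column name and byte-to-MB conversion are assumptions).
p95_mb = (stats.get_column("mem_p95") / 1e6).to_list()

# Fisher-Jenks: deterministic 1D segmentation that minimises
# within-class variance. Returns n_classes + 1 edges, both ends included.
edges = jenkspy.jenks_breaks(p95_mb, n_classes=4)
print(edges)
```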
Final bins, each with ~50% safety overprovisioning above the measured P95 of the pipelines assigned to it:
| Bin | RAM (req = limit) | CPU request | Target population | Share of population |
|---|---|---|---|---|
| XS | 768 MiB | 500 m | P95 ≤ 500 MB | ~50% |
| S | 1.25 GiB | 1.0 core | P95 ≤ 1 GB | ~44% |
| M | 2 GiB | 1.5 cores | P95 ≤ 1.5 GB | ~5% |
| L | 3 GiB | 2.0 cores | P95 ≤ 2.5 GB | <1% |
RAM request equals limit, so a build can never consume more memory than it reserved — and because node-pressure eviction targets pods using more than their request first, these pods stay out of the first eviction bracket mid-build. The CPU limit is deliberately unset, so pods can burst above their request when nodes are underutilised (CPU throttles gracefully under contention; memory does not).
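Concretely, the S bin's container resources come out like this (a sketch of the stanza the pod template carries; values are from the table above):

```yaml
# "S" bin container resources. Memory request equals limit, so usage can
# never exceed the request; the CPU limit is omitted on purpose.
resources:
  requests:
    memory: 1280Mi   # == limit: stays out of the first eviction bracket
    cpu: "1"         # scheduler reservation only
  limits:
    memory: 1280Mi   # hard ceiling; an over-consuming build OOMs predictably
    # no cpu limit: builds may burst into idle node capacity
```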
5. Centralised registry in a shared pipeline library
Sizing decisions live in a YAML resource inside the organisation's shared Jenkins pipeline library (`sizing.yaml`), with a four-level resolution chain: disabled list → per-pipeline assignment → `default_bin` fallback → ultimate fallback to the original 8 GB template. An `unmigrated` sentinel keeps every pipeline not yet on the list running on the original template, so the migration is strictly opt-in. Every resolved template is echoed in the build console with its source — `assigned:extra-small`, `default:unmigrated`, `disabled` — so what the pipeline is actually getting is visible from the first line of build output.
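The registry's shape, roughly; field names and pipeline names here are illustrative, not the library's actual schema:

```yaml
# sizing.yaml -- illustrative shape only; real field and pipeline names differ.
default_bin: unmigrated            # sentinel: unlisted pipelines keep 8 GB

disabled:                          # never resized, regardless of assignment
  - org/legacy-release-build

pipelines:                         # explicit, opt-in assignments
  org/payments-service: extra-small
  org/search-indexer: small
  org/monolith-integration: medium
```

Resolution walks top to bottom: `disabled` wins over an assignment, an assignment wins over `default_bin`, and anything unresolvable falls through to the original template.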
Rollout discipline
The migration is intentionally slow and reversible.
- Shadow mode. The resolver shipped to production before any pipeline was assigned to a smaller bin — every pipeline still resolved to the original 8 GB template. This validated the lookup path under real production load.
- Canary. One pipeline I knew well, comfortably below the XS threshold, watched for a week: build success, build duration, peak RAM, OOM events. JVM heap behaviour was audited (any hardcoded `-Xmx` exceeding the bin's RAM would fail at startup).
- Staged batches. Five pipelines, then twenty, then a hundred. Diverse teams in each batch, so a team-specific failure mode (e.g. a shared JVM tuning convention) wouldn't take out one group entirely.
Things that went wrong (and how I caught them)
A case study with no surprises is a case study someone is hiding. Three real ones from this project:
- Cloud-level retention overrode pipeline-level `idleMinutes 0`. Pods were sticking around for exactly 3 minutes + 5–20 seconds after each build. The 3 minutes was a default in the Jenkins cloud config that was winning against the declarative override. Fixed by adding `podRetention never()` explicitly and dropping the cloud default to 0.
- The cloud-level concurrency cap caused cross-workload contention. During the master rollout, unrelated pipelines couldn't spawn agents — Jenkins was rejecting them because the 30-pod cloud-level cap was held by the ephemeral master pods churning through it. The fix wasn't to add more nodes; it was to understand which limit was binding.
- A default bin pointing at an uncreated template would have broken every unmigrated pipeline. Caught in review: the initial `default_bin: medium` would have resolved to a pod template that didn't exist yet. Introduced the `unmigrated` sentinel that maps explicitly to the original 8 GB template, so staying on the original allocation is the default and migration is opt-in.
What this is worth
Moving utilisation from 8% to ~60% on a cluster of this size represents real, ongoing infrastructure savings, at a magnitude where the dollar figure is worth a conversation rather than a published number. What's worth sharing publicly is the methodology and the rollout discipline; the financials belong with the organisation, not the writeup.
The takeaway I'd point at instead is structural: the most expensive part of work like this is usually the missing telemetry, not the missing decisions. Pod identity, attribution, percentile analysis, defensible bin design — those are the deliverable. The bins themselves fall out of the data once the data exists.
Want the full methodology — the bin-design argument, the OTel processor configuration, the rollout discipline? See the methodology deep-dive →