Enterprise CI cluster · anonymised real-work case
Jenkins pipeline right-sizing
Took 2,600 production pipelines from 8% to ~60% memory utilisation by building per-build telemetry, then designing bins from real percentile data. Same hardware, several multiples more headroom, no rewrite of any pipeline required.
8% → 60%
RAM utilization, before → after
2,600
Production pipelines in scope
94%
Fit in the two smallest (≤1.25 GiB) bins
1.5×
Safety margin above measured P95
The problem
A large enterprise CI cluster ran 2,600 Jenkins pipelines on a uniform pod allocation: 8 GB RAM and 4 CPU each. Capacity incidents were a recurring operational theme. Queues built up. Builds occasionally failed for unclear scheduling reasons. The general suspicion was over-allocation, but nobody had numbers.
There was a structural reason for that. The Jenkins agent setup used long-lived, generic pods drawn from a static pool, and none of them advertised which job was currently running. Resource attribution per pipeline didn't exist. You could see that the cluster was loaded; you couldn't see by what.
So before sizing the bins, the diagnostic infrastructure had to be built. That ended up being the more interesting half of the project.
What I built
1. Per-build pod identity
Replaced the static label-based agent reference with the Kubernetes
Jenkins plugin's declarative kubernetes { inheritFrom ... } block,
combined with idleMinutes 0 and podRetention never(). This
makes the plugin render a fresh pod spec at build start, with access to the
live Run object. It interpolates the job name, build number, and a uniqueness hash
into the pod name, and injects build metadata as pod annotations
(runUrl, buildUrl).
The result: every Jenkins build runs in a single-use pod whose Kubernetes identity deterministically points back to a specific build. cAdvisor metrics now correlate to Jenkins runs.
2. Annotation scraping in OpenTelemetry
The pod-name path had a sharp edge: Kubernetes caps names at 63 characters and the plugin truncates from the front when job names are long, so regex-based extraction was lossy.
Extended the OTel collector's k8sattributes processor to promote the
pod's annotations into metric labels (job_name from runUrl,
job_run from buildUrl). This gave clean, structured
identifiers, immune to truncation, queryable directly in Thanos.
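To make the contrast with pod-name parsing concrete, here is a minimal Python sketch of deriving the two identifiers from a URL-style annotation value. The real extraction ran inside the collector, not in Python. The URL layout (`.../job/<folder>/job/<name>/<number>/`), the example host, and the function name are assumptions for illustration only.

```python
from urllib.parse import urlparse


def parse_build_url(url: str) -> tuple[str, int]:
    """Split a Jenkins-style build URL or path into (job_name, build_number).

    Assumes the conventional layout .../job/<folder>/job/<name>/<number>/.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or not parts[-1].isdigit():
        raise ValueError(f"not a build URL: {url!r}")
    build_number = int(parts[-1])

    segments = parts[:-1]
    start = segments.index("job")  # skips any context-path prefix
    pairs = segments[start:]
    # Everything from the first marker on must be ("job", <name>) pairs.
    if len(pairs) % 2 != 0 or any(marker != "job" for marker in pairs[0::2]):
        raise ValueError(f"unexpected job path in {url!r}")
    return "/".join(pairs[1::2]), build_number


print(parse_build_url("https://ci.example.com/job/team/job/api-build/42/"))
```

Because the annotation carries the full path, a long job name survives intact, which is not true of an identifier recovered from a length-capped pod name.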
3. Data pipeline
Built a Grafana panel with three queries that produce one row per build:
max_over_time of working-set memory, max_over_time(rate(...))
of CPU time, and count_over_time × scrape interval as an approximate
duration. CSV-exported the three tables, joined them in Polars on
(job_name, job_run), computed per-pipeline distribution statistics
(P50, P75, P90, P95, P99, max, mean) for both metrics.
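The join-and-aggregate step can be sketched without dependencies. The project used Polars; the version below is a standard-library stand-in, and its function names, row shape `(job_name, job_run, value)`, and linear-interpolation percentile are assumptions rather than the original code.

```python
from statistics import mean


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty data")
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def join_builds(**tables):
    """Inner-join tables of (job_name, job_run, value) rows on (job_name, job_run)."""
    indexed = {
        name: {(job, run): value for job, run, value in rows}
        for name, rows in tables.items()
    }
    common = set.intersection(*(set(t) for t in indexed.values()))
    return {key: {name: indexed[name][key] for name in indexed} for key in sorted(common)}


def summarise(joined, metric):
    """Per-pipeline distribution statistics for one metric."""
    per_job = {}
    for (job, _run), row in joined.items():
        per_job.setdefault(job, []).append(row[metric])
    return {
        job: {
            "p50": percentile(vals, 50),
            "p95": percentile(vals, 95),
            "max": max(vals),
            "mean": mean(vals),
        }
        for job, vals in per_job.items()
    }


mem = [("a", 1, 100.0), ("a", 2, 200.0), ("a", 3, 300.0), ("b", 1, 50.0)]
cpu = [("a", 1, 0.1), ("a", 2, 0.2), ("a", 3, 0.3)]
stats = summarise(join_builds(mem=mem, cpu=cpu), "mem")
print(stats)
```

The inner join drops builds that are missing from any one table, so a pipeline only gets statistics from builds where every metric was captured.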
The first surprise: median peak RAM was 441 MB against an 8 GB allocation. P95 across all pipelines was 999 MB. Average utilization of the existing allocation was around 8%.
4. Defensible bin design
The temptation here is k-means with k=3 and ship. K-means is the wrong tool: it's non-deterministic on 1D data, assumes equal-variance clusters (CI workloads are heavily right-skewed), and optimizes centroid distance rather than bin-fit. Jenks natural breaks is the right tool for 1D segmentation: deterministic, designed to minimize within-class variance, and cited as the standard method for choropleth binning for decades.
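For readers who have not met Jenks before, the sketch below is a small exact implementation: a dynamic programme over the sorted values that minimises total within-class squared deviation. It is an illustration of the technique, not necessarily the code used in this project, and its O(k·n²) cost suits a few thousand per-pipeline values rather than large datasets.

```python
def jenks_breaks(values, n_classes):
    """Return the upper bound of each class, minimising the total
    within-class sum of squared deviations."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("need 1 <= n_classes <= len(values)")

    # Prefix sums give any segment's squared deviation in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(data):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(i, j):
        """Sum of squared deviations of data[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting data[:j] into c classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for c in range(1, n_classes + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:  # strict: first minimum wins
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk the split table backwards to recover the class boundaries.
    bounds = []
    j = n
    for c in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[c][j]
    return bounds[::-1]


print(jenks_breaks([1, 2, 3, 10, 11, 12, 50, 51], 3))
```

There is no random initialisation anywhere in this procedure, so the same input always yields the same breaks regardless of input order.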
Final bins, each with ~50% safety overprovisioning above the measured P95 of the pipelines assigned to it:
| Bin | RAM (req = limit) | CPU request | Target population | Share of population |
|---|---|---|---|---|
| XS | 768 MiB | 0.5 core | P95 ≤ 500 MB | ~50% |
| S | 1.25 GiB | 1.0 core | P95 ≤ 1 GB | ~44% |
| M | 2 GiB | 1.5 cores | P95 ≤ 1.5 GB | ~5% |
| L | 3 GiB | 2.0 cores | P95 ≤ 2.5 GB | <1% |
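RAM request equals limit, so a pod can never use more memory than it
requested, which keeps it out of the first tier of eviction candidates under
node memory pressure mid-build. CPU limit is deliberately unset, so pods can
burst above their request when nodes are underutilised (CPU throttles
gracefully; memory does not). The trade-off is that Kubernetes classes these
pods as Burstable rather than Guaranteed, since Guaranteed QoS requires the
CPU request and limit to match as well.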
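As a sketch of how the table translates into pod settings, the snippet below maps a pipeline's measured P95 to a bin and builds the matching container `resources` stanza as a plain dictionary. The thresholds and sizes come from the table. The function names, and the choice to return `None` for pipelines above the L threshold, are assumptions.

```python
# (name, max P95 in MB, memory request = limit, CPU request), from the table.
BINS = [
    ("XS", 500, "768Mi", "500m"),
    ("S", 1000, "1280Mi", "1"),  # 1.25 GiB
    ("M", 1500, "2Gi", "1500m"),
    ("L", 2500, "3Gi", "2"),
]


def assign_bin(p95_mb):
    """Smallest bin whose P95 ceiling covers the pipeline, else None."""
    for name, ceiling, _memory, _cpu in BINS:
        if p95_mb <= ceiling:
            return name
    return None  # assumption: oversized pipelines stay on the original template


def container_resources(bin_name):
    """Kubernetes-style resources: memory request equals limit, no CPU limit."""
    for name, _ceiling, memory, cpu in BINS:
        if name == bin_name:
            return {
                "requests": {"memory": memory, "cpu": cpu},
                "limits": {"memory": memory},  # CPU limit deliberately omitted
            }
    raise KeyError(bin_name)


print(assign_bin(441), container_resources("XS"))
```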
5. Centralised registry in a shared pipeline library
Sizing decisions live in a single YAML resource inside the organisation's
shared Jenkins pipeline library, with a four-level resolution chain:
disabled list → per-pipeline assignment →
default_bin fallback → ultimate fallback to the original 8 GB
template. An unmigrated sentinel keeps every pipeline not yet
on the list running on the original template, so the migration is strictly
opt-in. Every resolved template is echoed in the build console with its
source (assigned:extra-small, default:unmigrated,
disabled), so what the pipeline is actually getting is visible
from the first line of build output.
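The resolution chain is easy to state in code. The sketch below is a Python rendering for illustration; the real resolver lives in the shared Jenkins library. The registry keys `disabled` and `assignments`, the template name `original-8gb`, and the `fallback:original` source string are hypothetical. Only `default_bin`, the `unmigrated` sentinel, and the other source strings appear in the description above.

```python
ORIGINAL_TEMPLATE = "original-8gb"  # hypothetical name for the 8 GB template


def resolve_template(pipeline, registry, known_templates):
    """Return (template, source) following the four-level chain."""
    # 1. The opt-out list always wins.
    if pipeline in registry.get("disabled", []):
        return ORIGINAL_TEMPLATE, "disabled"
    # 2. Explicit per-pipeline assignment, if the template actually exists.
    assigned = registry.get("assignments", {}).get(pipeline)
    if assigned in known_templates:
        return assigned, f"assigned:{assigned}"
    # 3. default_bin, where the sentinel maps to the original template.
    default = registry.get("default_bin", "unmigrated")
    if default == "unmigrated":
        return ORIGINAL_TEMPLATE, "default:unmigrated"
    if default in known_templates:
        return default, f"default:{default}"
    # 4. Ultimate fallback: never hand out a template that does not exist.
    return ORIGINAL_TEMPLATE, "fallback:original"


registry = {
    "disabled": ["frontend-app"],
    "assignments": {"api-build": "extra-small"},
    "default_bin": "unmigrated",
}
known = {"extra-small", "small"}
for name in ("api-build", "frontend-app", "nightly-report"):
    print(name, resolve_template(name, registry, known))
```

Returning the source alongside the template is what makes the console echo possible: the caller prints both without re-deriving why a template was chosen.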
Rollout discipline
The migration is intentionally slow and reversible.
- Shadow mode. The resolver shipped to production before any pipeline was assigned to a smaller bin; every pipeline still resolved to the original 8 GB template. This validated the lookup path under real production load.
- Canary. One pipeline I knew well, comfortably below the XS threshold, watched for a week. Build success, build duration, peak RAM, OOM events. JVM heap behaviour audited (any hardcoded -Xmx exceeding the bin's RAM would fail at startup).
- Staged batches. Five pipelines, then twenty, then a hundred. Diverse teams in each batch so a team-specific failure mode (e.g. a shared JVM tuning convention) wouldn't take out one group entirely.
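The JVM heap audit from the canary step can be partly automated. The sketch below flags a hardcoded `-Xmx` that exceeds a bin's memory. The function names are hypothetical, and it assumes the last `-Xmx` on the command line is the effective one and that a bare number means bytes.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def xmx_bytes(jvm_args):
    """Return the last -Xmx value in bytes, or None if none is set."""
    found = re.findall(r"-Xmx(\d+)([kKmMgG]?)(?=\s|$)", jvm_args)
    if not found:
        return None
    number, unit = found[-1]  # assumption: the last occurrence wins
    return int(number) * _UNITS[unit.lower()]


def heap_exceeds_bin(jvm_args, bin_memory_mib):
    """True when a hardcoded max heap is larger than the bin's RAM."""
    heap = xmx_bytes(jvm_args)
    return heap is not None and heap > bin_memory_mib * 1024**2


print(heap_exceeds_bin("-Xms256m -Xmx2g", 768))
```

A check like this only covers the heap. Metaspace, thread stacks, and any child processes also count against the container limit, so passing it is necessary rather than sufficient.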
Things that went wrong (and how I caught them)
A case study with no surprises is a case study someone is hiding. Four real ones from this project:
- Cloud-level retention overrode pipeline-level idleMinutes 0. Pods were sticking around for exactly 3 minutes + 5–20 seconds after each build. The 3 minutes was a default in the Jenkins cloud config that was winning against the declarative override. Fixed by adding podRetention never() explicitly and dropping the cloud default to 0.
- The cloud-level concurrency cap caused cross-workload contention. During master rollout, unrelated pipelines couldn't spawn agents: Jenkins was rejecting them because the 30-pod cloud-level cap was held by the ephemeral master pods churning through it. The fix wasn't to add more nodes; it was to understand which limit was binding.
- Default bin pointing at an uncreated template would have broken every unmigrated pipeline. Caught in review: the initial default_bin: medium would have resolved to a pod template that didn't exist yet. Introduced the unmigrated sentinel that maps explicitly to the original 8 GB template, so staying on the original template is the default and migration is opt-in.
- Frontend pipelines blew the bin with a 2 GB SonarQube scanner. Pipelines that build frontend code trigger SonarQube to spawn a Node.js analysis process that can take around 2 GB on its own, enough to OOM a pipeline sized for its non-frontend peak. They are pinned to the original template through the opt-out list, which is exactly the escape hatch the registry was built to provide.
What this is worth
Moving utilization from 8% to ~60% on clusters of this scale represents real, ongoing infrastructure savings, large enough that the dollar figure is worth a conversation rather than a published number.
The takeaway I'd point at instead is structural: the most expensive part of work like this is usually the missing telemetry, not the missing decisions. Pod identity, attribution, percentile analysis, defensible bin design: those are the deliverable. The bins themselves fall out of the data once the data exists.