Jenkins Lesson 35 – Scaling Jenkins | Dataplexa
Section IV · Lesson 35

Scaling Jenkins

A single Jenkins master serving 10 developers is fine. The same master serving 100 developers running 500 builds a day will buckle — slow UI, queued builds, dropped connections. Scaling Jenkins means knowing exactly which bottleneck you're hitting and which tool solves it.

This lesson covers

Vertical vs horizontal scaling → Identifying your bottleneck → Scaling build capacity with agents → Master high availability → Folder and view organisation → The Jenkins Operations Center pattern → When to split into multiple masters

Scaling Jenkins has two completely separate dimensions. The first is build capacity — how many builds can run concurrently. The second is master capacity — how much work the Jenkins master itself can handle before it slows down. Most teams only think about the first dimension until the second one bites them.

The Analogy

Scaling Jenkins is like scaling a restaurant. Adding more cooks (agents) solves the problem of too many orders — but if the front-of-house manager (the master) is overwhelmed taking orders, seating guests, and handling complaints all at once, adding cooks doesn't help. At some point you need a second manager, or you split the restaurant into two locations. Jenkins scaling follows exactly this pattern.

The Two Dimensions of Jenkins Scale

Dimension 1 — Build Capacity

How many builds can run concurrently. The master queues work, agents execute it. Bottleneck appears as: builds waiting in the queue, executor utilisation at 100%, long wait times before builds start.

Solution: add more agents.

Dimension 2 — Master Capacity

How much the master can handle — job scheduling, pipeline execution state, UI requests, API calls, plugin hooks. Bottleneck appears as: slow UI, high CPU on the master, OutOfMemoryError, agent disconnects under load.

Solution: tune the master or split to multiple masters.

Identifying Your Bottleneck

Before throwing hardware at the problem, diagnose which bottleneck you're actually hitting. The symptoms look similar but the fixes are different:

Symptom → Root cause → Fix

Builds queued but agents idle → Label mismatch: agents don't have the right label → Check and fix agent labels
Queue growing, executors at 100% → Not enough agent capacity → Add more agents or executors
UI slow but agents have free capacity → Master CPU/memory overloaded → Tune JVM, clean up old data, reduce plugins
Agents randomly disconnect under load → Master thread pool exhausted → Increase master heap, check thread count
Single master serves 50+ teams → Organisational scale: one master is a risk → Split into multiple masters by team or domain
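A quick way to tell the first two rows apart is the master's own REST API: /queue/api/json lists waiting items and /computer/api/json reports executor counts. The sketch below parses a hard-coded sample response so the arithmetic is visible; against a live master you would run the curl commands shown in the comments (the JENKINS_URL variable and the sample numbers are assumptions, not output from a real instance).

```shell
# Against a live master (hypothetical URL, requires credentials on most setups):
#   curl -s "$JENKINS_URL/computer/api/json?tree=busyExecutors,totalExecutors"
#   curl -s "$JENKINS_URL/queue/api/json?tree=items[id]"
# A sample response stands in for the live call here:
response='{"busyExecutors":9,"totalExecutors":10}'

# Extract the two counters and compute utilisation as a percentage
busy=$(echo "$response" | grep -o '"busyExecutors":[0-9]*' | cut -d: -f2)
total=$(echo "$response" | grep -o '"totalExecutors":[0-9]*' | cut -d: -f2)
util=$(( 100 * busy / total ))

echo "Executor utilisation: ${util}%"
# A growing queue at ~100% utilisation means: add agents.
# A growing queue well below 100% means: check labels first.
```

The same two endpoints are what most Jenkins monitoring dashboards poll under the hood, so checking them by hand is a reliable first diagnostic step.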

Vertical Scaling — Making the Master Stronger

Before splitting into multiple masters, squeeze everything you can from the existing one. The three levers are JVM memory, thread tuning, and data retention.

The scenario:

You're a DevOps engineer at a 60-person company. Your Jenkins master handles 150 builds a day across 40 jobs. The UI has been getting sluggish during peak hours (10–11 AM) and you've seen three OutOfMemoryError entries in jenkins.log this week. You need to tune the JVM before the problem gets worse.

Tools used:

  • JAVA_OPTS — the environment variable that passes JVM options to Jenkins on startup. Set in /etc/default/jenkins on Debian/Ubuntu or /etc/sysconfig/jenkins on RHEL/CentOS; newer packages that run Jenkins as a systemd service read it from a systemd override instead (sudo systemctl edit jenkins, then an Environment="JAVA_OPTS=..." line).
  • -Xms / -Xmx — minimum and maximum JVM heap size. Setting both to the same value prevents the JVM from resizing the heap at runtime, which reduces garbage collection pauses.
  • -XX:+UseG1GC — enables the G1 garbage collector, designed for low-pause-time collection on large heaps. On Java 8 it outperforms the default Parallel collector for interactive server applications like Jenkins; on Java 9 and later G1 is already the default, so setting the flag explicitly simply makes the choice visible.
  • -XX:+HeapDumpOnOutOfMemoryError — writes a heap dump file when the JVM runs out of memory. Essential for diagnosing OOM errors — without it you're guessing at the cause.
  • -Djava.awt.headless=true — tells Jenkins it's running without a display. Required on Linux servers — without it, some plugins crash trying to render graphics in a headless environment.
  • Groovy script console — Jenkins' built-in tool at /script for running Groovy against the live instance. Used here to check current memory usage without restarting.
# /etc/default/jenkins  (Debian/Ubuntu)
# /etc/sysconfig/jenkins (RHEL/CentOS)
# Edit this file then: sudo systemctl restart jenkins

# Before tuning — default settings (often absent or too low)
# JAVA_OPTS="-Djava.awt.headless=true"

# After tuning — production-grade JVM settings
JAVA_OPTS="\
  -Djava.awt.headless=true \
  -Xms4g \
  -Xmx8g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/jenkins-heapdumps/ \
  -XX:+ExplicitGCInvokesConcurrent \
  -Djenkins.install.runSetupWizard=false"

# -------------------------------------------------------
# Check current memory usage from the Groovy script console
# Run this at http://jenkins-master-01:8080/script
# -------------------------------------------------------
# Paste into the Script Console:
def rt = Runtime.getRuntime()
def mb = 1024 * 1024

// intdiv() keeps the output in whole MB; plain "/" in Groovy
// returns a BigDecimal when the division isn't exact
println "=== JVM Memory Status ==="
println "Max heap:   ${rt.maxMemory().intdiv(mb)} MB"
println "Total heap: ${rt.totalMemory().intdiv(mb)} MB"
println "Free heap:  ${rt.freeMemory().intdiv(mb)} MB"
println "Used heap:  ${(rt.totalMemory() - rt.freeMemory()).intdiv(mb)} MB"
println ""
println "Thread count: ${Thread.activeCount()}"

Where to practice: Paste the Groovy snippet directly into http://localhost:8080/script on your local Jenkins to see live memory stats. For JVM tuning, a safe starting point is Xmx = half your total RAM, with Xms set equal to Xmx to avoid heap-resize pauses; for a 16GB server that means -Xms8g -Xmx8g. (The example config above uses the smaller -Xms4g, trading a few early heap-resize pauses for a lighter startup footprint; either choice is reasonable.) Full JVM tuning guide at jenkins.io — Scaling Jenkins.
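The half-of-RAM rule is trivial to script. The sketch below takes the server's RAM as a variable so it runs anywhere; on a real Linux server you could read the value from /proc/meminfo instead, as the comment shows.

```shell
# Derive heap flags from the half-of-RAM rule described above.
# On Linux you could read RAM automatically:
#   total_ram_gb=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
total_ram_gb=16                      # set to your server's RAM

heap_gb=$(( total_ram_gb / 2 ))      # heap = half of RAM, Xms = Xmx
echo "-Xms${heap_gb}g -Xmx${heap_gb}g"
```

Leaving the other half of RAM free matters: the JVM needs headroom beyond the heap for thread stacks, metaspace, and native memory, and the OS needs its file cache.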

# After restarting Jenkins with new JVM settings:
$ sudo systemctl restart jenkins
$ sudo systemctl status jenkins
● jenkins.service - Jenkins Continuous Integration Server
     Active: active (running) since Mon 2024-04-08 09:01:22 UTC; 4s ago

# Groovy script console output:
=== JVM Memory Status ===
Max heap:   8192 MB
Total heap: 4096 MB
Free heap:  3241 MB
Used heap:   855 MB

Thread count: 48

What just happened?

  • Max heap is now 8GB — Jenkins can grow its heap up to 8GB before the JVM throws OutOfMemoryError. The previous limit depends on the JVM: older releases defaulted to as little as 256MB, while modern JVMs default to a quarter of physical RAM (4GB on this 16GB server) — still tight for 150 builds a day.
  • Used heap is only 855MB — at startup, Jenkins hasn't loaded much yet. As builds run and pipeline state accumulates throughout the day, used heap will grow. The monitoring value is the peak usage during the 10–11 AM rush — check this from the script console during that window.
  • 48 active threads — each pipeline step, agent connection, and HTTP request can consume a thread. At 48 threads under light load, this is healthy. If this approaches 200+ during peak hours, Jenkins is under thread pressure and the OutOfMemoryErrors may be related to thread stack space rather than heap.
  • G1GC will reduce UI pauses — the default JVM garbage collector can pause all threads for hundreds of milliseconds while collecting. G1 does most of its work concurrently with application threads and aims to keep stop-the-world pauses under the 200ms target set by -XX:MaxGCPauseMillis (a target, not a guarantee). This is why the Jenkins UI was feeling sluggish — GC pauses were freezing it.
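To confirm the restart actually picked up the new options, inspect the running java process's command line. The sketch below filters a hard-coded sample command line so the grep is visible; on the server itself the ps invocation in the comment does the same thing (the process name and war path are assumptions).

```shell
# On the server (Linux procps):
#   ps -o args= -C java | tr ' ' '\n' | grep -E '^-X'
# A sample command line stands in for the live process here:
args='java -Djava.awt.headless=true -Xms4g -Xmx8g -XX:+UseG1GC -jar /usr/share/java/jenkins.war'

# Split on spaces, keep only the -X heap/GC flags
echo "$args" | tr ' ' '\n' | grep -E '^-X'
```

If the filter prints nothing, Jenkins was restarted without the new JAVA_OPTS — a common cause is editing the wrong file for your distribution (see the JAVA_OPTS note above).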

Horizontal Scaling — Multiple Masters

At a certain scale, one master is no longer appropriate. The reasons are usually organisational and risk-related rather than raw performance, though as the last case below shows, extreme build volume eventually becomes a performance reason too. When does splitting make sense?

🏢 Organisational boundaries

When one team's misbehaving pipeline shouldn't be able to affect another team's deployments. Separate masters mean separate blast radii — a runaway build on the payments master can't impact the frontend master's queue.

🔒 Security and compliance isolation

Production deployment pipelines often need stricter access controls than development pipelines. A dedicated production Jenkins master can have different security policies, stricter RBAC, and a separate audit log — all without affecting developer productivity.

📍 Geographic distribution

Teams in different regions with latency-sensitive builds benefit from a master close to their agents. A US master and an EU master each managing their own regional agent pool is faster and more resilient than routing all traffic through a single global master.

📊 Pure throughput — 500+ builds/day

At very high build volumes, even a well-tuned master accumulates so much pipeline execution state in memory that GC pressure becomes a constant problem. Sharding jobs across multiple masters by team or service domain is the architectural solution.

Organising Jobs at Scale — Folders and Views

Before you split into multiple masters, organise your single master well. A Jenkins dashboard with 200 unorganised jobs is harder to use than one with 200 jobs in a sensible folder structure. The Folders plugin (installed by default in most setups) lets you group jobs hierarchically.

Jenkins — Dashboard (folder structure)
📁 payments-team/ — folder with team-scoped RBAC
├─ checkout-service (Multibranch)
├─ payment-api (Multibranch)
└─ payment-api-deploy (Pipeline)
📁 frontend-team/
├─ web-app (Multibranch)
└─ mobile-app (Multibranch)
📁 platform-team/
├─ jenkins-backup (Pipeline)
└─ agent-health-check (Pipeline)
📁 production-deploys/ — restricted access, release managers only
├─ checkout-service-prod (Pipeline)
└─ payment-api-prod (Pipeline)
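Folders like these can be created through the UI, but at scale it is handy to script them. Below is a hedged sketch using Jenkins' REST createItem endpoint with the Folders plugin's minimal config.xml; the URL, user, and token are placeholders, and the actual network call is left commented out so the snippet only demonstrates the payload.

```shell
# Hypothetical sketch: create a team folder over the Jenkins REST API.
# Assumes the Folders plugin is installed and that JENKINS_URL and
# API_TOKEN are set for a user allowed to create items.
cat > folder-config.xml <<'EOF'
<?xml version='1.1' encoding='UTF-8'?>
<com.cloudbees.hudson.plugins.folder.Folder/>
EOF

# Against a live master (commented out; this is the network call):
# curl -X POST -u "admin:$API_TOKEN" \
#      -H 'Content-Type: application/xml' \
#      --data-binary @folder-config.xml \
#      "$JENKINS_URL/createItem?name=payments-team"

# Sanity-check the payload we just wrote
grep -c 'cloudbees.hudson.plugins.folder.Folder' folder-config.xml
```

Scripting folder creation pairs naturally with the Jenkins-as-Code approach previewed in Lesson 36, where the same definitions live in version control.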

Scaling anti-pattern to avoid

Don't add more agents as a solution to master slowness. Adding agents increases the load on the master — more agent connections, more build state to track, more events to process. If the master is already struggling, more agents make it worse. Fix the master first.

The Scaling Decision Flow

1. Check the build queue and executor utilisation first

If the queue is growing and executors are at 100%, you need more agents. If executors have spare capacity but builds still queue — check labels. If the UI is slow but executors are free — the master is the bottleneck.

2. Tune the master before buying hardware

Increase JVM heap, enable G1GC, clean up old builds and fingerprints (Lesson 33). Most teams get 50–80% performance improvement from tuning alone before needing new hardware.

3. Add agents for build capacity, not master slowness

More agents only help if the queue is the problem. Each new agent connection increases master overhead slightly — so adding agents when the master is already struggling makes things worse.

4. Organise with folders and RBAC before splitting

A well-organised single master with folder-level RBAC can handle many teams cleanly. Multiple masters multiply your operational burden — backups, upgrades, plugin management all need to happen N times.

5. Split masters at organisational or compliance boundaries

When you have a genuine reason to isolate (security, team autonomy, geography, compliance), a separate master is the right call. Size each master for its specific load rather than running one overloaded master for everyone.

Teacher's Note

The most common Jenkins scaling mistake is adding agents when the master is the bottleneck. Measure before you scale. The Prometheus metrics from Lesson 32 tell you exactly which resource is saturated.

Practice Questions

1. Which JVM flag sets the maximum heap size Jenkins is allowed to use?



2. Which garbage collector is recommended for Jenkins on heaps above 4GB because it reduces UI-freezing pause times?



3. The Jenkins UI is slow but executors have spare capacity. A colleague suggests adding more agents. What should you do instead and why?



Quiz

1. Which symptom specifically indicates that you need more build agents rather than master tuning?


2. When is splitting into multiple Jenkins masters the right architectural decision?


3. What does the -XX:+HeapDumpOnOutOfMemoryError JVM flag do and why is it important?


Up Next · Lesson 36

Jenkins as Code

Pipelines in Git, jobs generated by scripts, shared logic in libraries — the three layers that make your entire Jenkins setup reproducible, reviewable, and recoverable.