Jenkins Lesson 43 – Troubleshooting | Dataplexa
Section V · Lesson 43

Troubleshooting Jenkins

Jenkins will break. The difference between a senior engineer and a junior one isn't whether they see failures — it's how fast they diagnose them. This lesson gives you a systematic framework for the most common Jenkins problems so you spend minutes diagnosing, not hours guessing.

This lesson covers

The diagnostic framework → Build failure vs Jenkins failure → Agent disconnection → OutOfMemoryError → Pipeline stuck forever → Plugin conflicts → UI not loading → The 5-minute triage checklist

Most Jenkins problems fall into one of two categories: something wrong with your pipeline code, or something wrong with Jenkins itself. The first step in every diagnosis is determining which category you're in — because the lookup chain is completely different.

The Analogy

Troubleshooting Jenkins is like diagnosing a car that won't start. The first question isn't "what's wrong" — it's "is this an engine problem or a fuel problem?" Engine problem: the car itself is broken. Fuel problem: the car is fine, there's just nothing in it. In Jenkins terms: is the pipeline broken, or is Jenkins broken? Each answer leads to a completely different diagnostic path, and mixing them up wastes hours.

The Diagnostic Framework

Before touching anything, answer these three questions in order. They route you to the right fix 80% of the time:

1. Is this failing for everyone or just this one job?

Everyone: Jenkins infrastructure problem — check the system log, master health, agent status. One job: pipeline problem — read the console output for that specific build.

2. Did it work before? When did it stop?

Was working, now broken: something changed — check recent commits, plugin updates, agent changes, environment variables. Never worked: configuration or code error from the start — focus on the error message, not the history.

3. What does the error message actually say?

Read the entire console output, not just the summary. Jenkins console output is verbose on purpose — the actual error is almost always there, buried under noise. Search for ERROR, Exception, FAILED, and error: in the output.

Problem 1 — Build Failure Diagnosis

The build went red. Here's the lookup chain — go through these in order, stopping when you find the cause:

# Step 1: Read the full console output — scroll to the BOTTOM first
# Jenkins shows the error that caused the failure near the end
# The line just above "Finished: FAILURE" is almost always the root cause

# Step 2: Filter the console output for error keywords
# If the console is 5000 lines, grep saves time
grep -iE "error:|exception|failed|cannot|not found|permission denied" build.log

# Step 3: Check the exit code
# Non-zero exit codes from sh steps cause failures
# A sh step that fails with exit code 1 is a script error
# Exit code 126 = command found but not executable (check permissions)
# Exit code 127 = command not found (check PATH)
# Exit code 128+N = killed by signal N (e.g. 137 = 128+9, SIGKILL — often OOM or a timeout)

# Step 4: Reproduce locally
# Run the failing shell command directly on the agent machine
# SSH to the agent and run: cd /var/jenkins-agent/workspace/job-name && ./the-failing-command.sh
# If it fails locally too: code/environment issue, not Jenkins
# If it passes locally: Jenkins environment issue (missing tool, wrong PATH, missing credential)
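The exit-code rules in Step 3 can be wrapped in a tiny helper for quick triage. This is a sketch using standard POSIX exit-status conventions; `decode_exit` is a hypothetical name for illustration, not a Jenkins tool:

```shell
# Hypothetical helper: translate a failed sh step's exit status into a first hypothesis
decode_exit() {
  code=$1
  if [ "$code" -eq 0 ]; then echo "success"
  elif [ "$code" -eq 126 ]; then echo "found but not executable (check permissions)"
  elif [ "$code" -eq 127 ]; then echo "command not found (check PATH)"
  elif [ "$code" -gt 128 ]; then echo "killed by signal $((code - 128))"
  else echo "script error (exit $code)"
  fi
}

decode_exit 137   # prints "killed by signal 9" — SIGKILL, the classic OOM-killer signature
```

An exit code of 143 decodes the same way to signal 15 (SIGTERM), which usually means a Jenkins timeout or a manual abort rather than the kernel.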

Common build failure patterns and their fixes

command not found: mvn

Maven (or the tool) isn't on the agent's PATH. Fix: use a Docker agent with the tool pre-installed, or configure the tool installation in Manage Jenkins → Tools.

Permission denied: ./gradlew

The script isn't executable. Fix: add sh 'chmod +x gradlew' before the failing step, or commit the file with execute permissions (git update-index --chmod=+x gradlew).

CredentialNotFoundException: No credential found with id 'docker-registry'

The credential ID in the Jenkinsfile doesn't match what's stored in Jenkins. Fix: go to Manage Jenkins → Credentials, find the correct ID, update the Jenkinsfile.

groovy.lang.MissingMethodException: No signature of method

The pipeline step doesn't exist or its arguments are wrong. Either the plugin providing the step isn't installed, the step name is misspelled, or the parameter format changed in a plugin update.
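For the "command not found" pattern, a one-liner on the agent confirms whether the tool is visible at all before you touch the pipeline. The `mvn` name is just an example; substitute whatever tool the failing step needs:

```shell
# Quick agent-side check: is the tool on PATH, and which version resolves?
if command -v mvn >/dev/null 2>&1; then
  mvn -version | head -1
else
  echo "mvn not on PATH — configure it in Manage Jenkins → Tools or use a Docker agent"
fi
```

Running this over SSH on the agent (not on your laptop) matters: the whole class of "works on my machine" failures comes from the agent's PATH differing from yours.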

Problem 2 — Agent Disconnection

An agent goes offline or a build fails with "Agent is not connected" or "Remote call failed". Here's the systematic diagnosis:

Tools used:

  • jenkins.log — Jenkins' system log at /var/log/jenkins/jenkins.log. Agent disconnections always appear here with a reason.
  • agent log — the per-agent log accessible via Manage Jenkins → Nodes → [agent name] → Log. Shows the agent-side view of the connection.
  • ssh -v — verbose SSH connection to the agent machine. Tests whether the master can actually reach the agent independently of Jenkins.

# Step 1: Check the Jenkins system log for disconnection events
sudo grep -i "agent-linux-01\|SlaveComputer\|disconnected\|offline" \
  /var/log/jenkins/jenkins.log | tail -30

# Step 2: Check the agent log in the Jenkins UI
# Manage Jenkins → Nodes → [agent name] → Log
# Look for: IOException, Connection refused, Authentication failed, timeout

# Step 3: Test SSH connectivity from the master to the agent directly
# This confirms whether the network path is the issue
ssh -v -i /var/lib/jenkins/.ssh/agent_rsa jenkins-agent@10.0.1.45

# Step 4: Check disk space on the agent — full disk causes silent failures
df -h /var/jenkins-agent

# Step 5: Check if the Jenkins agent process is still running on the agent machine
ps aux | grep jenkins

# Step 6: Check agent JVM memory — OOM on the agent kills the process
# Look in the agent process output or check for hs_err_pid*.log files
find /tmp -name "hs_err_pid*.log" -mtime -1 2>/dev/null

# jenkins.log grep output:
Apr 08 14:22:11 WARNING SlaveComputer
  agent-linux-01: Connection was broken: java.io.IOException:
  Pipe broken — remote side may have died

Apr 08 14:22:11 WARNING hudson.remoting.Channel
  Failed to read from channel agent-linux-01

Apr 08 14:22:11 SEVERE  SlaveComputer
  agent-linux-01 went offline

# SSH test from master:
ssh -v -i /var/lib/jenkins/.ssh/agent_rsa jenkins-agent@10.0.1.45
debug1: Connecting to 10.0.1.45 port 22
debug1: connect to address 10.0.1.45 port 22: Connection timed out
ssh: connect to host 10.0.1.45 port 22: Connection timed out

# Diagnosis: network connectivity issue — agent is unreachable from master
# Root cause found: the agent's EC2 security group was accidentally modified
# SSH port 22 was removed from the inbound rules
# Fix: restore the security group rule

What just happened?

  • "Pipe broken" in jenkins.log — this is Jenkins' way of saying the network connection to the agent was lost mid-stream. The error is in Jenkins, but the root cause is on the network or the agent machine. Jenkins is reporting the symptom, not the cause.
  • SSH test confirmed the root cause — the 3-second ssh -v test showed the agent was completely unreachable, not just the Jenkins agent process failing. This immediately ruled out a JVM crash, a disk issue, or a code problem — the network was broken.
  • Security group was the culprit — a common cloud environment issue. Someone modified an AWS security group and accidentally removed the SSH inbound rule. ssh -v shows "Connection timed out" which is the TCP-level symptom of a firewall blocking the port. Without the SSH test, this could have taken an hour to diagnose.
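The ssh -v test generalises to a plain TCP check when there are several agents to sweep. A sketch using bash's built-in /dev/tcp (no nc required); the IP is the example address from above and `check_port` is a hypothetical helper name:

```shell
# Sketch: TCP-level reachability check for an agent's SSH port, independent of Jenkins
check_port() {
  # succeed only if a TCP connection to $1:$2 opens within 3 seconds
  if timeout 3 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

check_port 10.0.1.45 22
```

"UNREACHABLE" here means the same thing as ssh's "Connection timed out": a firewall, security group, or routing problem, not a Jenkins problem.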

Problem 3 — OutOfMemoryError

Jenkins starts behaving erratically — slow UI, builds failing mid-run with no obvious error, agents randomly disconnecting — and jenkins.log contains java.lang.OutOfMemoryError. Here's the diagnosis and fix:

# Step 1: Confirm the OOM in jenkins.log
grep -i "OutOfMemoryError\|java.lang.OutOfMemory" /var/log/jenkins/jenkins.log

# Step 2: Check current heap usage from the Groovy script console
# Run at http://jenkins-master-01:8080/script
def rt = Runtime.getRuntime()
def mb = 1024 * 1024
println "Max heap:  ${rt.maxMemory() / mb} MB"
println "Used heap: ${(rt.totalMemory() - rt.freeMemory()) / mb} MB"
println "Free heap: ${rt.freeMemory() / mb} MB"

# Step 3: Check what's consuming heap — build retention is the most common cause
# Too many builds kept in memory
# Run in script console:
Jenkins.instance.getAllItems(hudson.model.Job.class).each { job ->
    def builds = job.builds.size()
    if (builds > 50) {
        println "${job.fullName}: ${builds} builds retained"
    }
}

# Step 4: Immediate relief — trigger garbage collection
# This is a temporary fix — address the root cause
System.gc()

# Step 5: Permanent fix — increase the JVM heap in /etc/default/jenkins
# JAVA_OPTS="-Xms4g -Xmx8g -XX:+UseG1GC ..."
# Then restart Jenkins
sudo systemctl restart jenkins

# jenkins.log:
java.lang.OutOfMemoryError: Java heap space
  at hudson.model.Run.getLog(Run.java:1284)
  at jenkins.model.Jenkins.getBuilds(Jenkins.java:...)

# Heap usage from script console:
Max heap:  2048 MB       ← only 2GB allocated
Used heap: 1998 MB       ← 97% consumed
Free heap:   50 MB       ← almost none left

# Jobs with excessive build retention:
payment-api: 312 builds retained
frontend-test: 287 builds retained
checkout-service: 198 builds retained

# Root cause: builds never cleaned up + heap too small
# Fix 1: Add logRotator to those jobs
# Fix 2: Increase heap to 8GB

# After fix — heap usage 24 hours later:
Max heap:  8192 MB
Used heap:  612 MB
Free heap: 7580 MB

What just happened?

  • Root cause was two things, not one — the heap was too small (2GB) AND build history was never cleaned up (300+ builds retained per job). Either alone would cause problems eventually. Together they caused OOM within weeks. The fix needed both: increase heap and add buildDiscarder(logRotator(numToKeepStr: '20')) to all jobs.
  • The Groovy script console gave the diagnosis in 10 seconds — 97% heap usage at 2GB is immediately actionable. Without this check you'd be guessing at whether the heap was the issue. The script console is the fastest diagnostic tool for Jenkins runtime problems.
  • After fix: 7.5GB free with 8GB max — the system is now healthy with enormous headroom. The combination of proper heap sizing and log rotation means this server is unlikely to hit OOM again at the current build volume.
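When the UI and script console are themselves too slow to use, the same build-retention check can be run from the filesystem. A sketch assuming the default JENKINS_HOME layout, where each job keeps numbered build directories under jobs/<name>/builds/:

```shell
# Filesystem-side equivalent of the script-console query: count retained builds per job
JENKINS_HOME=${JENKINS_HOME:-/var/lib/jenkins}
for job in "$JENKINS_HOME"/jobs/*/; do
  # numbered subdirectories of builds/ are individual build records
  n=$(find "${job}builds" -maxdepth 1 -type d 2>/dev/null | grep -cE '/[0-9]+$' || true)
  if [ "$n" -gt 50 ]; then
    echo "$(basename "$job"): $n builds retained"
  fi
done
```

This only counts on-disk records, which is a proxy for what the master loads into heap; the script console numbers remain the authoritative view.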

Problem 4 — Pipeline Stuck Forever

A build has been running for hours with no new console output. The stage view shows it stuck in one stage. No error. Just silence. Here's the diagnosis:

Common causes of stuck pipelines

input() waiting

A pipeline has an input() gate that nobody approved. Check the pipeline UI — there will be a visible "Proceed / Abort" button. Either approve it or abort the build.

waitUntil() looping

A waitUntil { } block is polling a condition that never becomes true — a service that never starts, a health check URL that never responds. Always wrap with timeout().

sh hanging on stdin

A shell command is waiting for user input that will never come. Common with interactive tools (apt-get without -y, some Docker commands). Add -y or --no-input flags, or check the command's documentation for non-interactive mode.

Lock never released

If using the Lockable Resources plugin, a lock acquired by a previous failed build was never released. Check Manage Jenkins → Lockable Resources and manually release the lock.
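The stdin cause is easy to demonstrate and to guard against: redirect stdin from /dev/null so a stray prompt fails fast instead of hanging forever. A minimal sketch:

```shell
# A command that reads stdin hangs forever under a Jenkins sh step if it ever prompts.
# With stdin redirected from /dev/null it sees EOF and returns immediately instead:
head -1 < /dev/null && echo "no hang — stdin was closed"

# The same guard works for real tools, e.g. `apt-get install -y pkg < /dev/null`:
# -y answers the known prompts, and </dev/null catches any prompt you didn't anticipate.
```

Belt-and-braces: combine the flag, the redirect, and a timeout() wrapper, since each covers a failure mode the others miss.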

// Prevention — always add pipeline-level and stage-level timeouts
// In every Jenkinsfile options block:
options {
    timeout(time: 30, unit: 'MINUTES')  // entire pipeline
}

// For individual long-running steps:
stage('Deploy') {
    steps {
        timeout(time: 5, unit: 'MINUTES') {
            sh './deploy.sh'
        }
    }
}

// For waitUntil — always give it a deadline:
timeout(time: 2, unit: 'MINUTES') {
    waitUntil {
        def response = sh(script: 'curl -s -o /dev/null -w "%{http_code}" https://api/health',
                          returnStdout: true).trim()
        return response == '200'
    }
}

# If already stuck — abort from the CLI:
java -jar jenkins-cli.jar \
  -s http://jenkins-master-01:8080 \
  -auth admin:your-api-token \
  stop-build job-name BUILD_NUMBER

Problem 5 — UI Not Loading or Very Slow

The Jenkins dashboard loads in 15 seconds or not at all. Here's the fast triage:

Symptom → Check → Fix

  • UI slow, all pages → check heap usage in the script console → increase the JVM heap and clean up old builds.
  • UI slow, one page only → that page's job has thousands of builds → add logRotator to the job and delete old builds.
  • HTTP 503 / connection refused → run systemctl status jenkins → Jenkins isn't running; check jenkins.log for a startup error.
  • Blank page / partial load → check the browser console for errors → often a plugin loading failure; check jenkins.log for SEVERE entries.
  • Slow after a plugin update → check jenkins.log for plugin SEVERE errors → disable the updated plugin, then roll back or update to a newer version.
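A single curl probe separates "Jenkins is down" from "Jenkins is slow" before you open anything else. The URL is an example; point it at your own controller:

```shell
# Liveness probe: status code tells you up/down, time_total tells you fast/slow
# (|| true keeps the script going — a non-zero exit here just means the connection failed)
curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" http://localhost:8080/login || true
# 200 or 403 = up and serving; 503 = still starting up; 000 = connection failed entirely
```

A 403 is a healthy response here: it means Jenkins answered and is enforcing authentication, which is exactly what you want from a probe that doesn't log in.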

The 5-Minute Triage Checklist

When something is wrong and you don't know where to start, run through this checklist in order. It covers 90% of Jenkins problems:

# 1. Is Jenkins running?
sudo systemctl status jenkins

# 2. What does the system log say? (last 50 SEVERE/WARNING entries)
sudo grep -E "SEVERE|WARNING" /var/log/jenkins/jenkins.log | tail -50

# 3. How is memory?  (run in the script console at /script)
def rt = Runtime.getRuntime(); def mb = 1024*1024
println "Heap: ${(rt.totalMemory()-rt.freeMemory())/mb}MB / ${rt.maxMemory()/mb}MB"

# 4. How are the agents?
java -jar jenkins-cli.jar -s http://localhost:8080 -auth admin:token \
  groovy = <<'EOF'
Jenkins.instance.nodes.each { node ->
    def c = node.toComputer()
    println "${node.name.padRight(30)} ${c?.isOnline() ? 'ONLINE' : 'OFFLINE'}"
}
EOF

# 5. Is the build queue growing?
java -jar jenkins-cli.jar -s http://localhost:8080 -auth admin:token \
  groovy = <<'EOF'
def q = Jenkins.instance.queue.items.size()
println "Queue depth: ${q}"
if (q > 5) println "WARNING: Build queue backing up"
EOF

# 6. Any recent plugin updates that coincide with the problem start time?
sudo grep -i "plugin\|update\|install" /var/log/jenkins/jenkins.log \
  | grep "$(date +%Y-%m-%d)" | tail -20

# 1. Jenkins status:
● jenkins.service - Jenkins Continuous Integration Server
     Active: active (running)

# 2. Recent SEVERE/WARNING:
WARNING SlaveComputer agent-linux-02: Connection was broken
SEVERE  hudson.plugins.git.GitSCM: java.io.IOException: Cannot run git

# 3. Memory:
Heap: 3841MB / 4096MB        ← 94% used — approaching OOM

# 4. Agent status:
agent-linux-01                ONLINE
agent-linux-02                OFFLINE
agent-linux-03                ONLINE

# 5. Queue depth:
Queue depth: 14
WARNING: Build queue backing up

# 6. Recent plugin updates today:
Apr 08 09:14 Installing: git 5.3.0 (was 4.12.0)

# DIAGNOSIS:
# - git plugin updated this morning to 5.3.0
# - agent-linux-02 went offline (unrelated — SSH issue found earlier)
# - Heap at 94% — needs attention
# - 14 builds queued because agent-linux-02 is offline and heap is limiting throughput
# IMMEDIATE ACTIONS:
# 1. Restore agent-linux-02 (fix SSH/security group)
# 2. Check if git 5.3.0 introduced the "Cannot run git" error
# 3. Increase heap before OOM hits

What just happened?

  • Six commands, complete picture — in under 5 minutes, the checklist revealed three simultaneous problems: agent offline, approaching OOM, and a suspicious git plugin update. Without this systematic approach, you'd likely investigate only the most visible symptom and miss the other two.
  • Plugin update timestamp matched problem start — the git plugin update at 09:14 and the "Cannot run git" SEVERE error appearing shortly after is strong correlation. The next step is to check the git plugin 5.3.0 changelog for breaking changes and compare against the agent's git binary version.
  • Queue depth 14 explained by two causes — agent-linux-02 offline reduced capacity by one-third, and the heap at 94% was causing GC pressure that slowed the master's scheduling. Fixing only one without the other would leave the queue partially resolved. The checklist surfaces both simultaneously.
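The host-level half of the checklist (steps 1, 2, and disk) can be bundled into one copy-paste script for the machine running the master. A sketch assuming the Debian/Ubuntu package's default log path; adjust for your install:

```shell
# Host-level triage in one pass: service state, recent errors, disk headroom
echo "== service =="
systemctl is-active jenkins 2>/dev/null || echo "not running (or no systemd here)"

echo "== recent errors =="
grep -E "SEVERE|WARNING" /var/log/jenkins/jenkins.log 2>/dev/null | tail -5 || true

echo "== disk =="
df -h / | tail -1
```

The script-console and CLI checks (heap, agents, queue) still need the master to be responsive, which is exactly why the host-level checks come first.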

Teacher's Note

The three most useful troubleshooting tools in Jenkins are: the script console (for live runtime data), jenkins.log (for errors and events), and kubectl describe pod or ssh -v for infrastructure problems. Every Jenkins problem is diagnosed with some combination of these three.

Practice Questions

1. What is the first question to ask when something goes wrong with Jenkins — before looking at any logs?



2. An agent shows offline in Jenkins. What command confirms whether the problem is network connectivity versus a Jenkins agent process failure?



3. What pipeline directive prevents a build from getting stuck forever when a waitUntil condition never becomes true?



Quiz

1. A shell step in a pipeline fails with exit code 137. What does this exit code indicate?


2. Jenkins is slow and you suspect an OutOfMemoryError. What is the fastest way to confirm heap exhaustion as the root cause?


3. A pipeline fails with groovy.lang.MissingMethodException: No signature of method slackSend. What does this mean?


Up Next · Lesson 44

Jenkins Anti-Patterns

The habits, shortcuts, and "it works for now" decisions that quietly accumulate until they take down a production Jenkins. Learn to recognise them before they find you.