Jenkins Course
Logging and Monitoring
A Jenkins server you can't observe is a black box you can't trust. This lesson covers the logs that tell you what's happening inside Jenkins, the metrics that predict problems before they become incidents, and how to connect Jenkins to the monitoring stack your team already uses.
This lesson covers
Jenkins system logs → Build log management → The Prometheus Metrics plugin → Key metrics to watch → Grafana dashboards → Log levels and custom loggers → Diagnosing problems from logs
Most teams treat Jenkins like a vending machine — put code in, get builds out, complain when it breaks. Teams that treat Jenkins like a production service — with logs, metrics, dashboards, and alerts — catch problems before users report them and diagnose failures in minutes instead of hours.
The Analogy
Running Jenkins without monitoring is like driving a car with no dashboard. The engine might be overheating, the fuel might be low, the oil pressure might be dropping — but you only find out when the car stops moving. Logs are your warning lights. Metrics are your gauges. A dashboard is your instrument panel. You need all three to drive safely.
Jenkins System Logs — The First Place to Look
Jenkins writes system-level logs to /var/log/jenkins/jenkins.log on Linux. These are different from build console logs — they record Jenkins' own internal activity: plugin loading, agent connections, security events, errors, and warnings. When something is wrong with Jenkins itself (not a pipeline), this is where you start.
SEVERE — Stop immediately
Critical failures that prevent Jenkins from functioning. A plugin that failed to load. An exception that crashed a core service. These always need investigation.
WARNING — Investigate soon
Recoverable problems — a failed agent reconnection, a credential lookup that returned null, a configuration inconsistency. Often the precursor to a SEVERE.
INFO — Normal activity
Expected events — Jenkins started, a plugin loaded, an agent connected. Most of your log will be INFO. Use grep to filter it out when hunting for problems.
FINE / FINER — Debug detail
Verbose detail disabled by default. Enable per-component via Manage Jenkins → System Log → Add new log recorder. Only turn on when actively debugging a specific issue.
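To see how level filtering works in practice, here is a small self-contained sketch. The log excerpt is made up for illustration (it mimics the java.util.logging layout Jenkins uses, but these are not real Jenkins log lines):

```shell
# Hypothetical excerpt in the layout Jenkins' java.util.logging output uses
cat > /tmp/jenkins-excerpt.log <<'EOF'
2024-03-08 22:14:33 INFO    hudson.PluginManager Plugin git loaded
2024-03-08 22:14:43 WARNING hudson.slaves.SlaveComputer agent-linux-02 reconnection failed
2024-03-08 22:14:53 SEVERE  hudson.slaves.SlaveComputer agent-linux-02 went offline
EOF

# Drop the INFO noise and keep only the lines worth investigating
grep -Ev " INFO " /tmp/jenkins-excerpt.log   # shows only the WARNING and SEVERE lines
```

The same `grep -Ev " INFO "` filter works on the real `/var/log/jenkins/jenkins.log` when you want everything except normal activity.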
Reading Jenkins Logs From the Terminal
The scenario:
It's Monday morning. Three developers have reported that builds triggered over the weekend failed with a connection error, but the pipelines themselves look fine. You need to check Jenkins' own logs to see if an agent went offline or if there was a plugin error between Friday night and Monday morning.
Tools used:
- journalctl — the systemd journal reader on Linux. Reads logs from any service managed by systemd, including Jenkins. More powerful than reading the log file directly because it supports filtering by time, severity, and unit.
- grep — filters log output to lines matching a pattern. Essential for finding specific events in a noisy log file.
- tail -f — follows a log file in real time — new lines appear as they're written. Use this when actively watching Jenkins during a problem.
- Jenkins Log Recorder — a built-in Jenkins UI feature at Manage Jenkins → System Log that captures log output per Java package. Used to enable debug logging for specific components without restarting Jenkins.
# View the last 200 lines of the Jenkins system log
sudo tail -200 /var/log/jenkins/jenkins.log
# Follow the log in real time — see new entries as they appear
sudo tail -f /var/log/jenkins/jenkins.log
# Filter for only SEVERE and WARNING entries — ignore INFO noise
sudo grep -E "SEVERE|WARNING" /var/log/jenkins/jenkins.log
# Filter logs from a specific time window (Friday 6 PM to Monday 6 AM)
# Using journalctl — more powerful date filtering than grep
sudo journalctl -u jenkins \
--since "2024-03-08 18:00:00" \
--until "2024-03-11 06:00:00" \
--priority=warning # only WARNING and above (SEVERE/CRITICAL)
# Search for agent disconnection events specifically
sudo grep -i "disconnected\|offline\|lost connection" /var/log/jenkins/jenkins.log \
| tail -50
# Count how many times each type of error appeared
sudo grep -E "SEVERE|WARNING" /var/log/jenkins/jenkins.log \
| awk '{print $4}' \
| sort | uniq -c | sort -rn \
| head -20
Where to practice: Run sudo tail -50 /var/log/jenkins/jenkins.log on your Jenkins server. For Docker, use docker logs jenkins-local 2>&1 | tail -50 — Docker captures stdout/stderr which is where the Jenkins Docker image writes its logs. Full logging documentation at jenkins.io — Monitoring Jenkins.
# journalctl output (Friday 18:00 to Monday 06:00, WARNING+):
Mar 08 22:14:33 jenkins-master-01 jenkins[3821]: WARNING hudson.slaves.SlaveComputer
agent-linux-02 is disconnected — attempting reconnection (attempt 1 of 5)
Mar 08 22:14:43 jenkins-master-01 jenkins[3821]: WARNING hudson.slaves.SlaveComputer
agent-linux-02 reconnection failed — SSH connection refused on 10.0.1.46:22
Mar 08 22:14:53 jenkins-master-01 jenkins[3821]: SEVERE hudson.slaves.SlaveComputer
agent-linux-02 went offline after 5 failed reconnection attempts
Mar 09 00:03:17 jenkins-master-01 jenkins[3821]: WARNING hudson.model.Queue
Build payment-service-build #87 has been waiting in queue for 45 minutes
— no agents with label 'linux' available
Mar 09 00:03:22 jenkins-master-01 jenkins[3821]: WARNING hudson.model.Queue
Build frontend-test #44 has been waiting in queue for 45 minutes
— no agents with label 'linux' available
Mar 11 06:01:44 jenkins-master-01 jenkins[3821]: INFO hudson.slaves.SlaveComputer
agent-linux-02 reconnected successfully
# grep -E count output:
47 WARNING
3 SEVERE
What just happened?
- Root cause found immediately — agent-linux-02 went offline at 22:14 on Friday after 5 failed SSH reconnection attempts. That's when the weekend build failures started. The agent came back online Monday at 06:01 — matching exactly when developers started seeing builds succeed again.
- SlaveComputer in the log — this is the internal Jenkins class name for build agents. "Slave" is Jenkins' legacy term for what's now called an "agent". Knowing these class names helps you filter logs precisely.
- Queue warnings confirmed the impact — builds were not failing mid-run, they were never starting. The queue warning shows jobs waiting 45 minutes with no matching agent available. This is a different problem than a build that starts and fails.
- 50 total warnings over the weekend — 47 WARNINGs and 3 SEVEREs. The SEVEREs are the most actionable — each one represents something that failed completely. Cross-reference their timestamps with the build history to understand the full blast radius.
- journalctl --priority=warning — this filters to WARNING level and above (which includes SEVERE/CRITICAL). On systems using systemd, this is more reliable than grepping the log file because it handles log rotation transparently.
Metrics with the Prometheus Plugin
Prometheus is an open-source monitoring system that scrapes metrics from services at regular intervals. The Prometheus Metrics plugin for Jenkins exposes a /prometheus endpoint that Prometheus can scrape — giving you time-series data on build success rates, queue depths, executor utilisation, and more.
Install it from Manage Jenkins → Plugin Manager → prometheus-plugin. After installing, Jenkins exposes metrics at http://JENKINS_URL/prometheus. No further configuration is required to start serving metrics.
The metrics Jenkins exposes via /prometheus (most important ones)
default_jenkins_builds_duration_milliseconds_summary
How long builds are taking — by job name. Spot which pipeline is getting slower over time.
default_jenkins_builds_success_build_count
Number of successful builds per job. Track success rate trends across the team.
default_jenkins_builds_failed_build_count
Number of failed builds per job. A spike here means something broke in a specific pipeline.
default_jenkins_executor_count
Total executors available across all agents. Drop here means an agent went offline.
default_jenkins_executor_in_use_count
Executors currently running builds. Near 100% utilisation means the queue is growing.
default_jenkins_queue_size_value
Number of builds waiting in the queue right now. A persistently high queue means you need more agents.
default_jenkins_node_count_value
Total agents registered. Drop here means an agent was removed or disconnected.
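These gauges and counters become most useful when combined in PromQL. Two illustrative query sketches (the metric names are the plugin's defaults; the way they are combined here is an example, not the only option):

```promql
# Executor utilisation as a percentage (0-100)
100 * default_jenkins_executor_in_use_count / default_jenkins_executor_count

# Per-job failure rate over the last hour, the same shape used
# in the alerting rule later in this lesson
rate(default_jenkins_builds_failed_build_count[1h])
  /
(rate(default_jenkins_builds_success_build_count[1h])
  + rate(default_jenkins_builds_failed_build_count[1h]))
```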
Connecting Prometheus to Jenkins
The scenario:
Your team already runs Prometheus and Grafana for application monitoring. You need to add Jenkins metrics to the same stack — so build failures and queue backlogs appear on the same dashboard as API error rates and latency. Here's the Prometheus scrape config and the key alerting rules.
New terms in this code:
- scrape_configs — the section of prometheus.yml that tells Prometheus which targets to scrape and how often.
- scrape_interval — how often Prometheus pulls metrics from the target. 30s is sensible for Jenkins — frequent enough for real-time dashboards without overwhelming Jenkins.
- basic_auth — Jenkins' Prometheus endpoint can be protected by authentication. Pass credentials here so Prometheus can authenticate.
- alerting rules — Prometheus expressions that fire alerts when conditions are met. Written in PromQL (Prometheus Query Language). Sent to Alertmanager which routes them to Slack, PagerDuty, etc.
- for: 5m — the alert must be true for this duration before firing. Prevents alerts from firing on momentary spikes.
# prometheus.yml — add this scrape job to your existing Prometheus config
scrape_configs:
# Scrape Jenkins metrics every 30 seconds
- job_name: 'jenkins'
scrape_interval: 30s
metrics_path: /prometheus # the endpoint the plugin exposes
scheme: http # use https in production
# Authentication — create a dedicated Jenkins service account for Prometheus
# Store credentials securely — don't hardcode in production
basic_auth:
username: prometheus-reader
password: your-api-token-here
static_configs:
- targets:
- jenkins-master-01:8080 # Jenkins host and port
labels:
environment: production # add labels for Grafana filtering
team: platform
# jenkins-alerts.yml — Prometheus alerting rules for Jenkins
# Add this to your Prometheus rules directory
groups:
- name: jenkins
rules:
# Alert when a build agent goes offline
# executor_count dropping means available capacity is shrinking
- alert: JenkinsAgentOffline
expr: default_jenkins_executor_count < 4
for: 5m
labels:
severity: warning
annotations:
summary: "Jenkins agent capacity low"
description: "Only {{ $value }} executors available — expected at least 4. An agent may be offline."
# Alert when the build queue has been backing up for more than 10 minutes
# A large queue that persists means builds are being delayed
- alert: JenkinsBuildQueueBacklog
expr: default_jenkins_queue_size_value > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Jenkins build queue backlog"
description: "{{ $value }} builds waiting in queue for over 10 minutes."
# Alert when a job's failure rate spikes above 50% in the last hour
# Catches a pipeline that started breaking without individual build alerts
- alert: JenkinsHighFailureRate
expr: |
rate(default_jenkins_builds_failed_build_count[1h])
/
(rate(default_jenkins_builds_success_build_count[1h]) + rate(default_jenkins_builds_failed_build_count[1h]))
> 0.5
for: 15m
labels:
severity: critical
annotations:
summary: "Jenkins high build failure rate"
description: "More than 50% of builds in the last hour are failing."
Where to practice: Install the Prometheus Metrics plugin from the Jenkins Plugin Manager. Then visit http://localhost:8080/prometheus — you'll see the raw metrics output immediately, no Prometheus installation needed to verify the endpoint works. For a full Prometheus + Grafana stack locally, use the official Docker Compose setup at github.com/vegasbrianc/prometheus. Pre-built Jenkins Grafana dashboards are available at grafana.com/dashboards — search Jenkins.
# Sample output from http://jenkins-master-01:8080/prometheus
# HELP default_jenkins_executor_count Total number of executor slots
# TYPE default_jenkins_executor_count gauge
default_jenkins_executor_count 12.0
# HELP default_jenkins_executor_in_use_count Executors currently running builds
# TYPE default_jenkins_executor_in_use_count gauge
default_jenkins_executor_in_use_count 7.0
# HELP default_jenkins_queue_size_value Number of builds waiting in queue
# TYPE default_jenkins_queue_size_value gauge
default_jenkins_queue_size_value 2.0
# HELP default_jenkins_builds_success_build_count Number of successful builds
# TYPE default_jenkins_builds_success_build_count counter
default_jenkins_builds_success_build_count{jenkins_job="payment-service-build"} 142.0
default_jenkins_builds_success_build_count{jenkins_job="frontend-test"} 89.0
# HELP default_jenkins_builds_failed_build_count Number of failed builds
# TYPE default_jenkins_builds_failed_build_count counter
default_jenkins_builds_failed_build_count{jenkins_job="payment-service-build"} 8.0
default_jenkins_builds_failed_build_count{jenkins_job="frontend-test"} 14.0
# HELP default_jenkins_builds_duration_milliseconds_summary Build duration
# TYPE default_jenkins_builds_duration_milliseconds_summary summary
default_jenkins_builds_duration_milliseconds_summary{jenkins_job="payment-service-build",quantile="0.5"} 94221.0
default_jenkins_builds_duration_milliseconds_summary{jenkins_job="payment-service-build",quantile="0.99"} 187443.0
What just happened?
- 12 executors available, 7 in use — the cluster is at 58% utilisation. Healthy. If in_use_count consistently equalled executor_count, every new build would queue and you'd need more agents.
- Queue depth is 2 — two builds waiting. This is fine. If this number stays above 5 for more than 10 minutes, the JenkinsBuildQueueBacklog alert fires.
- Success/failure ratio per job — payment-service-build has 142 successes and 8 failures — a 94.7% success rate. frontend-test has 89 successes and 14 failures — an 86.4% success rate. The alerting rule would fire if this rate dropped below 50% and stayed there for 15 minutes.
- Quantiles for build duration — the p50 (median) build for payment-service-build takes 94 seconds. The p99 (99th percentile) takes 187 seconds. If the p99 starts climbing over time, something is getting slower in that pipeline — worth investigating before it affects developers significantly.
- The output format is Prometheus exposition format — plain text, one metric per line, with HELP and TYPE annotations. Prometheus reads this format natively. Every scrape, Prometheus stores these values as time-series data points.
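You can sanity-check those percentages straight from the sample values above with a little shell arithmetic (a sketch using the numbers from the scrape output; in practice PromQL does this for you):

```shell
# Counter values for payment-service-build from the sample scrape
success=142; failed=8
rate=$(awk "BEGIN { printf \"%.1f\", 100 * $success / ($success + $failed) }")
echo "payment-service-build success rate: ${rate}%"   # 94.7%

# Gauge values for the executor pool from the sample scrape
in_use=7; total=12
util=$(awk "BEGIN { printf \"%.0f\", 100 * $in_use / $total }")
echo "executor utilisation: ${util}%"                  # 58%
```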
What a Grafana Dashboard Looks Like
Once Prometheus is scraping Jenkins, you can visualise the metrics in Grafana. Here's what a simple Jenkins health dashboard covers and which metric powers each panel:
Jenkins Health Dashboard — Grafana
- Executor Utilisation: 58% (7 of 12 in use) — powered by default_jenkins_executor_in_use_count and default_jenkins_executor_count
- Queue Depth: 2 builds waiting — powered by default_jenkins_queue_size_value
- Success Rate (1h): 91.2% across all jobs — powered by the success and failed build counters
- Build Duration — p50 over 24h — powered by default_jenkins_builds_duration_milliseconds_summary
- Failure Count by Job (24h) — powered by default_jenkins_builds_failed_build_count
The Four Metrics That Actually Matter
Build success rate — the health indicator
Track success rate per job over 7 and 30 days. A gradual decline is harder to spot in day-to-day build results but stands out clearly on a trend graph. Alert when it drops below 70% for any important pipeline.
Build duration p99 — the slowness detector
The median hides outliers. The 99th percentile shows the worst cases. When p99 starts climbing without p50 following, you have an intermittent slow step — often a flaky test or a network call that sometimes times out.
Queue depth over time — the capacity signal
A queue that briefly spikes and clears is fine. A queue that grows steadily from 8 AM to 5 PM every day means your team's build demand has outgrown your agent capacity. Use this trend to justify adding agents before developers start complaining.
Executor utilisation — the sizing guide
Consistently above 80% = you need more agents. Consistently below 20% = you're paying for idle capacity. The right size keeps average utilisation between 50–70% — enough headroom for bursts without wasted spend.
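In Grafana, each of those four signals maps to a short PromQL panel query. These are sketches assuming the metric names shown earlier; the window sizes are starting points, not rules:

```promql
# 1. Success rate per job over 7 days
100 * increase(default_jenkins_builds_success_build_count[7d])
  / (increase(default_jenkins_builds_success_build_count[7d])
   + increase(default_jenkins_builds_failed_build_count[7d]))

# 2. Build duration p99 (the plugin pre-computes quantiles)
default_jenkins_builds_duration_milliseconds_summary{quantile="0.99"}

# 3. Queue depth over time
default_jenkins_queue_size_value

# 4. Executor utilisation
100 * default_jenkins_executor_in_use_count / default_jenkins_executor_count
```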
Teacher's Note
Set up the Prometheus endpoint and build one Grafana panel for queue depth. That single panel will tell you more about your Jenkins health than a year of log reading.
Practice Questions
1. After installing the Prometheus Metrics plugin, at which URL path does Jenkins expose its metrics endpoint?
2. Which Prometheus metric tells you how many builds are currently waiting in the Jenkins build queue?
3. On Linux systems using systemd, which command lets you query Jenkins logs filtered by time range and severity level?
Quiz
1. In the Jenkins system log, which class name appears in entries about build agent connection and disconnection events?
2. Build duration p50 has been stable for weeks but p99 is steadily climbing. What does this pattern indicate?
3. In a Prometheus alerting rule, what does the for: 5m field do?
Up Next · Lesson 33
Performance Tuning
JVM heap, build log retention, workspace cleanup, and the settings that keep Jenkins fast when it's handling 200 builds a day.