Jenkins Course
Logging and Monitoring
A Jenkins server you can't observe is a black box you can't trust. This lesson covers the logs that tell you what's happening inside Jenkins, the metrics that predict problems before they become incidents, and how to connect Jenkins to the monitoring stack your team already uses.
This lesson covers
Jenkins system logs → Build log management → The Prometheus Metrics plugin → Key metrics to watch → Grafana dashboards → Log levels and custom loggers → Diagnosing problems from logs
Most teams treat Jenkins like a vending machine — put code in, get builds out, complain when it breaks. Teams that treat Jenkins like a production service — with logs, metrics, dashboards, and alerts — catch problems before users report them and diagnose failures in minutes instead of hours.
The Analogy
Running Jenkins without monitoring is like driving a car with no dashboard. The engine might be overheating, the fuel might be low, the oil pressure might be dropping — but you only find out when the car stops moving. Logs are your warning lights. Metrics are your gauges. A dashboard is your instrument panel. You need all three to drive safely.
Jenkins System Logs — The First Place to Look
Jenkins writes system-level logs to /var/log/jenkins/jenkins.log on Linux. These are different from build console logs — they record Jenkins' own internal activity: plugin loading, agent connections, security events, errors, and warnings. When something is wrong with Jenkins itself (not a pipeline), this is where you start.
SEVERE — Stop immediately
Critical failures that prevent Jenkins from functioning. A plugin that failed to load. An exception that crashed a core service. These always need investigation.
WARNING — Investigate soon
Recoverable problems — a failed agent reconnection, a credential lookup that returned null, a configuration inconsistency. Often the precursor to a SEVERE.
INFO — Normal activity
Expected events — Jenkins started, a plugin loaded, an agent connected. Most of your log will be INFO. Use grep to filter it out when hunting for problems.
FINE / FINER — Debug detail
Verbose detail disabled by default. Enable per-component via Manage Jenkins → System Log → Add new log recorder. Only turn on when actively debugging a specific issue.
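To see how level filtering works in practice, here is a small self-contained sketch. The log excerpt is made up for illustration (it mimics the java.util.logging layout Jenkins uses, but these are not real Jenkins log lines):

```shell
# Hypothetical excerpt in the layout Jenkins' java.util.logging output uses
cat > /tmp/jenkins-excerpt.log <<'EOF'
2024-03-08 22:14:33 INFO    hudson.PluginManager Plugin git loaded
2024-03-08 22:14:43 WARNING hudson.slaves.SlaveComputer agent-linux-02 reconnection failed
2024-03-08 22:14:53 SEVERE  hudson.slaves.SlaveComputer agent-linux-02 went offline
EOF

# Drop the INFO noise and keep only the lines worth investigating
grep -Ev " INFO " /tmp/jenkins-excerpt.log   # shows only the WARNING and SEVERE lines
```

The same `grep -Ev " INFO "` filter works on the real `/var/log/jenkins/jenkins.log` when you want everything except normal activity.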
Reading Jenkins Logs From the Terminal
The scenario:
It's Monday morning. Three developers have reported that builds triggered over the weekend failed with a connection error, but the pipelines themselves look fine. You need to check Jenkins' own logs to see if an agent went offline or if there was a plugin error between Friday night and Monday morning.
Tools used:
- journalctl — the systemd journal reader on Linux. Reads logs from any service managed by systemd, including Jenkins. More powerful than reading the log file directly because it supports filtering by time, severity, and unit.
- grep — filters log output to lines matching a pattern. Essential for finding specific events in a noisy log file.
- tail -f — follows a log file in real time — new lines appear as they're written. Use this when actively watching Jenkins during a problem.
- Jenkins Log Recorder — a built-in Jenkins UI feature at Manage Jenkins → System Log that captures log output per Java package. Used to enable debug logging for specific components without restarting Jenkins.
# View the last 200 lines of the Jenkins system log
sudo tail -200 /var/log/jenkins/jenkins.log
# Follow the log in real time — see new entries as they appear
sudo tail -f /var/log/jenkins/jenkins.log
# Filter for only SEVERE and WARNING entries — ignore INFO noise
sudo grep -E "SEVERE|WARNING" /var/log/jenkins/jenkins.log
# Filter logs from a specific time window (Friday 6 PM to Monday 6 AM)
# Using journalctl — more powerful date filtering than grep
sudo journalctl -u jenkins \
--since "2024-03-08 18:00:00" \
--until "2024-03-11 06:00:00" \
--priority=warning # only WARNING and above (SEVERE/CRITICAL)
# Search for agent disconnection events specifically
sudo grep -i "disconnected\|offline\|lost connection" /var/log/jenkins/jenkins.log \
| tail -50
# Count how many times each type of error appeared
sudo grep -E "SEVERE|WARNING" /var/log/jenkins/jenkins.log \
| awk '{print $4}' \
| sort | uniq -c | sort -rn \
| head -20
Where to practice: Run sudo tail -50 /var/log/jenkins/jenkins.log on your Jenkins server. For Docker, use docker logs jenkins-local 2>&1 | tail -50 — Docker captures stdout/stderr which is where the Jenkins Docker image writes its logs. Full logging documentation at jenkins.io — Monitoring Jenkins.
# journalctl output (Friday 18:00 to Monday 06:00, WARNING+):
Mar 08 22:14:33 jenkins-master-01 jenkins[3821]: WARNING hudson.slaves.SlaveComputer
agent-linux-02 is disconnected — attempting reconnection (attempt 1 of 5)
Mar 08 22:14:43 jenkins-master-01 jenkins[3821]: WARNING hudson.slaves.SlaveComputer
agent-linux-02 reconnection failed — SSH connection refused on 10.0.1.46:22
Mar 08 22:14:53 jenkins-master-01 jenkins[3821]: SEVERE hudson.slaves.SlaveComputer
agent-linux-02 went offline after 5 failed reconnection attempts
Mar 09 00:03:17 jenkins-master-01 jenkins[3821]: WARNING hudson.model.Queue
Build payment-service-build #87 has been waiting in queue for 45 minutes
— no agents with label 'linux' available
Mar 09 00:03:22 jenkins-master-01 jenkins[3821]: WARNING hudson.model.Queue
Build frontend-test #44 has been waiting in queue for 45 minutes
— no agents with label 'linux' available
Mar 11 06:01:44 jenkins-master-01 jenkins[3821]: INFO hudson.slaves.SlaveComputer
agent-linux-02 reconnected successfully
# grep -E count output:
47 WARNING
3 SEVERE
What just happened?
- Root cause found immediately — agent-linux-02 went offline at 22:14 on Friday after 5 failed SSH reconnection attempts. That's when the weekend build failures started. The agent came back online Monday at 06:01 — matching exactly when developers started seeing builds succeed again.
- SlaveComputer in the log — this is the internal Jenkins class name for build agents. "Slave" is Jenkins' legacy term for what's now called an "agent". Knowing these class names helps you filter logs precisely.
- Queue warnings confirmed the impact — builds were not failing mid-run, they were never starting. The queue warning shows jobs waiting 45 minutes with no matching agent available. This is a different problem than a build that starts and fails.
- 50 total warnings over the weekend — 47 WARNINGs and 3 SEVEREs. The SEVEREs are the most actionable — each one represents something that failed completely. Cross-reference their timestamps with the build history to understand the full blast radius.
- journalctl --priority=warning — this filters to WARNING level and above (which includes SEVERE/CRITICAL). On systems using systemd, this is more reliable than grepping the log file because it handles log rotation transparently.
Metrics with the Prometheus Plugin
Prometheus is an open-source monitoring system that scrapes metrics from services at regular intervals. The Prometheus Metrics plugin for Jenkins exposes a /prometheus endpoint that Prometheus can scrape — giving you time-series data on build success rates, queue depths, executor utilisation, and more.
Install it from Manage Jenkins → Plugin Manager → prometheus-plugin. After installing, Jenkins exposes metrics at http://JENKINS_URL/prometheus. No further configuration is required to start serving metrics.
The metrics Jenkins exposes via /prometheus (most important ones)
default_jenkins_builds_duration_milliseconds_summary
How long builds are taking — by job name. Spot which pipeline is getting slower over time.
default_jenkins_builds_success_build_count
Number of successful builds per job. Track success rate trends across the team.
default_jenkins_builds_failed_build_count
Number of failed builds per job. A spike here means something broke in a specific pipeline.
default_jenkins_executor_count
Total executors available across all agents. Drop here means an agent went offline.
default_jenkins_executor_in_use_count
Executors currently running builds. Near 100% utilisation means the queue is growing.
default_jenkins_queue_size_value
Number of builds waiting in the queue right now. A persistently high queue means you need more agents.
default_jenkins_node_count_value
Total agents registered. Drop here means an agent was removed or disconnected.
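These gauges and counters become most useful when combined in PromQL. Two illustrative query sketches (the metric names are the plugin's defaults; the way they are combined here is an example, not the only option):

```promql
# Executor utilisation as a percentage (0-100)
100 * default_jenkins_executor_in_use_count / default_jenkins_executor_count

# Per-job failure rate over the last hour, the same shape used
# in the alerting rule later in this lesson
rate(default_jenkins_builds_failed_build_count[1h])
  /
(rate(default_jenkins_builds_success_build_count[1h])
  + rate(default_jenkins_builds_failed_build_count[1h]))
```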
Connecting Prometheus to Jenkins
The scenario:
Your team already runs Prometheus and Grafana for application monitoring. You need to add Jenkins metrics to the same stack — so build failures and queue backlogs appear on the same dashboard as API error rates and latency. Here's the Prometheus scrape config and the key alerting rules.
New terms in this code:
- scrape_configs — the section of prometheus.yml that tells Prometheus which targets to scrape and how often.
- scrape_interval — how often Prometheus pulls metrics from the target. 30s is sensible for Jenkins — frequent enough for real-time dashboards without overwhelming Jenkins.
- basic_auth — Jenkins' Prometheus endpoint can be protected by authentication. Pass credentials here so Prometheus can authenticate.
- alerting rules — Prometheus expressions that fire alerts when conditions are met. Written in PromQL (Prometheus Query Language). Sent to Alertmanager which routes them to Slack, PagerDuty, etc.
- for: 5m — the alert must be true for this duration before firing. Prevents alerts from firing on momentary spikes.
# prometheus.yml — add this scrape job to your existing Prometheus config
scrape_configs:
# Scrape Jenkins metrics every 30 seconds
- job_name: 'jenkins'
scrape_interval: 30s
metrics_path: /prometheus # the endpoint the plugin exposes
scheme: http # use https in production
# Authentication — create a dedicated Jenkins service account for Prometheus
# Store credentials securely — don't hardcode in production
basic_auth:
username: prometheus-reader
password: your-api-token-here
static_configs:
- targets:
- jenkins-master-01:8080 # Jenkins host and port
labels:
environment: production # add labels for Grafana filtering
team: platform
# jenkins-alerts.yml — Prometheus alerting rules for Jenkins
# Add this to your Prometheus rules directory
groups:
- name: jenkins
rules:
# Alert when a build agent goes offline
# executor_count dropping means available capacity is shrinking
- alert: JenkinsAgentOffline
expr: default_jenkins_executor_count < 4
for: 5m
labels:
severity: warning
annotations:
summary: "Jenkins agent capacity low"
description: "Only {{ $value }} executors available — expected at least 4. An agent may be offline."
# Alert when the build queue has been backing up for more than 10 minutes
# A large queue that persists means builds are being delayed
- alert: JenkinsBuildQueueBacklog
expr: default_jenkins_queue_size_value > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Jenkins build queue backlog"
description: "{{ $value }} builds waiting in queue for over 10 minutes."
# Alert when a job's failure rate spikes above 50% in the last hour
# Catches a pipeline that started breaking without individual build alerts
- alert: JenkinsHighFailureRate
expr: |
rate(default_jenkins_builds_failed_build_count[1h])
/
(rate(default_jenkins_builds_success_build_count[1h]) + rate(default_jenkins_builds_failed_build_count[1h]))
> 0.5
for: 15m
labels:
severity: critical
annotations:
summary: "Jenkins high build failure rate"
description: "More than 50% of builds in the last hour are failing."
Where to practice: Install the Prometheus Metrics plugin from the Jenkins Plugin Manager. Then visit http://localhost:8080/prometheus — you'll see the raw metrics output immediately, no Prometheus installation needed to verify the endpoint works. For a full Prometheus + Grafana stack locally, use the official Docker Compose setup at github.com/vegasbrianc/prometheus. Pre-built Jenkins Grafana dashboards are available at grafana.com/dashboards — search Jenkins.
# Sample output from http://jenkins-master-01:8080/prometheus
# HELP default_jenkins_executor_count Total number of executor slots
# TYPE default_jenkins_executor_count gauge
default_jenkins_executor_count 12.0
# HELP default_jenkins_executor_in_use_count Executors currently running builds
# TYPE default_jenkins_executor_in_use_count gauge
default_jenkins_executor_in_use_count 7.0
# HELP default_jenkins_queue_size_value Number of builds waiting in queue
# TYPE default_jenkins_queue_size_value gauge
default_jenkins_queue_size_value 2.0
# HELP default_jenkins_builds_success_build_count Number of successful builds
# TYPE default_jenkins_builds_success_build_count counter
default_jenkins_builds_success_build_count{jenkins_job="payment-service-build"} 142.0
default_jenkins_builds_success_build_count{jenkins_job="frontend-test"} 89.0
# HELP default_jenkins_builds_failed_build_count Number of failed builds
# TYPE default_jenkins_builds_failed_build_count counter
default_jenkins_builds_failed_build_count{jenkins_job="payment-service-build"} 8.0
default_jenkins_builds_failed_build_count{jenkins_job="frontend-test"} 14.0
# HELP default_jenkins_builds_duration_milliseconds_summary Build duration
# TYPE default_jenkins_builds_duration_milliseconds_summary summary
default_jenkins_builds_duration_milliseconds_summary{jenkins_job="payment-service-build",quantile="0.5"} 94221.0
default_jenkins_builds_duration_milliseconds_summary{jenkins_job="payment-service-build",quantile="0.99"} 187443.0
What just happened?
- 12 executors available, 7 in use — the cluster is at 58% utilisation. Healthy. If in_use_count consistently equalled executor_count, every new build would queue and you'd need more agents.
- Queue depth is 2 — two builds waiting. This is fine. If this number stays above 5 for more than 10 minutes, the JenkinsBuildQueueBacklog alert fires.
- Success/failure ratio per job — payment-service-build has 142 successes and 8 failures — a 94.7% success rate. frontend-test has 89 successes and 14 failures — an 86.4% success rate. The alerting rule would fire if this rate dropped below 50% and stayed there for 15 minutes.
- Quantiles for build duration — the p50 (median) build for payment-service-build takes 94 seconds. The p99 (99th percentile) takes 187 seconds. If the p99 starts climbing over time, something is getting slower in that pipeline — worth investigating before it affects developers significantly.
- The output format is Prometheus exposition format — plain text, one metric per line, with HELP and TYPE annotations. Prometheus reads this format natively. Every scrape, Prometheus stores these values as time-series data points.
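You can sanity-check those percentages straight from the sample values above with a little shell arithmetic (a sketch using the numbers from the scrape output; in practice PromQL does this for you):

```shell
# Counter values for payment-service-build from the sample scrape
success=142; failed=8
rate=$(awk "BEGIN { printf \"%.1f\", 100 * $success / ($success + $failed) }")
echo "payment-service-build success rate: ${rate}%"   # 94.7%

# Gauge values for the executor pool from the sample scrape
in_use=7; total=12
util=$(awk "BEGIN { printf \"%.0f\", 100 * $in_use / $total }")
echo "executor utilisation: ${util}%"                  # 58%
```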
What a Grafana Dashboard Looks Like
Once Prometheus is scraping Jenkins, you can visualise the metrics in Grafana. Here's what a simple Jenkins health dashboard covers and which metric powers each panel:
Jenkins Health Dashboard — Grafana
- Executor Utilisation: 58% (7 of 12 in use) — powered by default_jenkins_executor_in_use_count and default_jenkins_executor_count
- Queue Depth: 2 builds waiting — powered by default_jenkins_queue_size_value
- Success Rate (1h): 91.2% across all jobs — powered by the success and failed build counters
- Build Duration — p50 over 24h — powered by default_jenkins_builds_duration_milliseconds_summary
- Failure Count by Job (24h) — powered by default_jenkins_builds_failed_build_count
The Four Metrics That Actually Matter
Build success rate — the health indicator
Track success rate per job over 7 and 30 days. A gradual decline is harder to spot in day-to-day build results but stands out clearly on a trend graph. Alert when it drops below 70% for any important pipeline.
Build duration p99 — the slowness detector
The median hides outliers. The 99th percentile shows the worst cases. When p99 starts climbing without p50 following, you have an intermittent slow step — often a flaky test or a network call that sometimes times out.
Queue depth over time — the capacity signal
A queue that briefly spikes and clears is fine. A queue that grows steadily from 8 AM to 5 PM every day means your team's build demand has outgrown your agent capacity. Use this trend to justify adding agents before developers start complaining.
Executor utilisation — the sizing guide
Consistently above 80% = you need more agents. Consistently below 20% = you're paying for idle capacity. The right size keeps average utilisation between 50–70% — enough headroom for bursts without wasted spend.
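In Grafana, each of those four signals maps to a short PromQL panel query. These are sketches assuming the metric names shown earlier; the window sizes are starting points, not rules:

```promql
# 1. Success rate per job over 7 days
100 * increase(default_jenkins_builds_success_build_count[7d])
  / (increase(default_jenkins_builds_success_build_count[7d])
   + increase(default_jenkins_builds_failed_build_count[7d]))

# 2. Build duration p99 (the plugin pre-computes quantiles)
default_jenkins_builds_duration_milliseconds_summary{quantile="0.99"}

# 3. Queue depth over time
default_jenkins_queue_size_value

# 4. Executor utilisation
100 * default_jenkins_executor_in_use_count / default_jenkins_executor_count
```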
Teacher's Note
Set up the Prometheus endpoint and build one Grafana panel for queue depth. That single panel will tell you more about your Jenkins health than a year of log reading.
Practice Questions
1. After installing the Prometheus Metrics plugin, at which URL path does Jenkins expose its metrics endpoint?
2. Which Prometheus metric tells you how many builds are currently waiting in the Jenkins build queue?
3. On Linux systems using systemd, which command lets you query Jenkins logs filtered by time range and severity level?
Quiz
1. In the Jenkins system log, which class name appears in entries about build agent connection and disconnection events?
2. Build duration p50 has been stable for weeks but p99 is steadily climbing. What does this pattern indicate?
3. In a Prometheus alerting rule, what does the for: 5m field do?
Up Next · Lesson 33
Performance Tuning
JVM heap, build log retention, workspace cleanup, and the settings that keep Jenkins fast when it's handling 200 builds a day.