Jenkins Lesson 39 – Jenkins on Kubernetes | Dataplexa
Section IV · Lesson 39

Jenkins on Kubernetes

Lesson 23 introduced Kubernetes pod agents. Lesson 38 showed the Helm deployment. This lesson goes deeper — the production concerns teams face when Jenkins and Kubernetes are their full CI/CD platform: RBAC, pod disruption budgets, multiple pod templates, autoscaling agents, and diagnosing failures.

This lesson covers

Kubernetes RBAC for Jenkins → Production pod templates → Namespace strategy → Pod disruption budgets → Autoscaling agent pools → Diagnosing pod agent failures → The production readiness checklist

Running Jenkins on Kubernetes in a proof-of-concept is easy. Running it reliably in production — where it must survive node drains, pod reschedules, and cluster upgrades without dropping a build — requires understanding a handful of Kubernetes concepts that don't appear in getting-started guides. This lesson covers those production concerns directly.

The Analogy

Running Jenkins on Kubernetes for production is like moving your office into a managed co-working space. The space handles the building, the cleaning, and the power — but you still need to know the rules: which rooms you're allowed in (RBAC), what happens when the building is evacuated for maintenance (pod disruption), and how to get more desks when the team grows (autoscaling). The infrastructure is managed, but you still need to be a good tenant.

Kubernetes RBAC for Jenkins

Jenkins needs a Kubernetes ServiceAccount with specific permissions to create and destroy build agent pods. The principle of least privilege applies — Jenkins should only have the permissions it actually needs, scoped to the namespaces it actually uses.

New terms in this code:

  • ServiceAccount — a Kubernetes identity for a pod. Jenkins runs under a ServiceAccount that has RBAC permissions to create and manage build pods.
  • Role — a set of allowed actions scoped to a single namespace. Jenkins needs create, delete, get, and watch on Pods and Secrets in the namespace where build pods run.
  • RoleBinding — attaches a Role to a ServiceAccount. The RoleBinding is what actually grants the ServiceAccount the permissions defined in the Role.
  • ClusterRole — like a Role but cluster-wide. Only use ClusterRole if Jenkins needs to create pods in multiple namespaces — otherwise namespace-scoped Roles are safer.
  • PodDisruptionBudget (PDB) — a Kubernetes policy that limits how many pods of a given type can be unavailable at once during voluntary disruptions. A PDB on the Jenkins master prevents it from being evicted during node drains and cluster upgrades (it cannot protect against node crashes).
# rbac.yaml — RBAC setup for Jenkins on Kubernetes
# Apply with: kubectl apply -f rbac.yaml

---
# ServiceAccount for the Jenkins master pod
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jenkins
  namespace: jenkins

---
# Role — permissions Jenkins needs to manage build agent pods
# Scoped to the 'jenkins' namespace only — not cluster-wide
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-agent-manager
  namespace: jenkins
rules:
  # Pod management — create, list, watch, delete build agent pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

  # Pod logs — Jenkins streams build output from agent pod logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]

  # Pod exec — Jenkins may need to exec into pods for debug steps
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create", "get"]

  # Secrets — needed if build pods mount Jenkins credentials as Kubernetes secrets
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]

  # ConfigMaps — for passing build configuration to agent pods
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "delete", "get", "list", "update"]

---
# RoleBinding — grant the jenkins ServiceAccount the Role above
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-agent-manager-binding
  namespace: jenkins
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: jenkins-agent-manager
subjects:
  - kind: ServiceAccount
    name: jenkins
    namespace: jenkins

---
# PodDisruptionBudget — prevent the Jenkins master from being evicted
# during cluster node drains or upgrades
# minAvailable: 1 means at least 1 Jenkins pod must be running at all times
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jenkins-pdb
  namespace: jenkins
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: jenkins

Where to practice: Apply this file to your cluster with kubectl apply -f rbac.yaml. Verify the binding with kubectl describe rolebinding jenkins-agent-manager-binding -n jenkins. Test permissions with kubectl auth can-i create pods --as=system:serviceaccount:jenkins:jenkins -n jenkins — it should return "yes". Full RBAC reference at kubernetes.io — RBAC.

$ kubectl apply -f rbac.yaml
serviceaccount/jenkins created
role.rbac.authorization.k8s.io/jenkins-agent-manager created
rolebinding.rbac.authorization.k8s.io/jenkins-agent-manager-binding created
poddisruptionbudget.policy/jenkins-pdb created

$ kubectl auth can-i create pods \
    --as=system:serviceaccount:jenkins:jenkins -n jenkins
yes

$ kubectl auth can-i delete deployments \
    --as=system:serviceaccount:jenkins:jenkins -n jenkins
no

$ kubectl get pdb -n jenkins
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
jenkins-pdb   1               N/A               0                     5s

What just happened?

  • Jenkins can create pods but not delete deployments — the RBAC check confirms the principle of least privilege is working. Jenkins has exactly the permissions it needs for agent management, nothing more. A compromised build script cannot use the Jenkins ServiceAccount to delete production Deployments.
  • PDB shows 0 allowed disruptions — because only one Jenkins pod is running and minAvailable: 1, Kubernetes will refuse to evict that pod during a node drain. If you run two Jenkins replicas (active-passive HA), the PDB would allow one to be disrupted while keeping the other running.
  • Namespace scoping — the RoleBinding is in the jenkins namespace. Jenkins can manage pods in that namespace but has no permissions in production, staging, or any other namespace. This is the correct isolation boundary.

Production Pod Templates

The scenario:

You're a platform engineer at a company running 20 services on EKS. Different services need different build environments — Java services need Maven, frontend services need Node, and some pipelines need to build and push Docker images. Rather than one overloaded pod template, you define three specialised templates with appropriate resource limits. Each Jenkinsfile picks the right one by label.

New terms in this code:

  • activeDeadlineSeconds — the maximum time a build pod is allowed to run. After this duration, Kubernetes kills the pod. Use this as a safety net for runaway builds — always set it higher than your timeout() in the Jenkinsfile.
  • nodeSelector — pins pods to nodes with specific labels. Use this to route build pods to nodes in a dedicated build node group, keeping them off the nodes running production workloads.
  • tolerations — allows pods to run on tainted nodes. If your build node group has a taint like dedicated=build:NoSchedule, build pods need a matching toleration to be scheduled there.
  • workspaceVolume — the volume used for the build workspace shared between containers in a pod. emptyDirWorkspaceVolume uses node-local ephemeral storage (set memory: true for an in-memory tmpfs) and is fast enough for most I/O intensive builds. dynamicPVC provisions a real persistent volume (an EBS volume on EKS) for large builds.
# JCasC kubernetes cloud config — production pod templates
# Each template gives Jenkinsfiles a specific label to request
jenkins:
  clouds:
    - kubernetes:
        name: "eks-build-cluster"
        serverUrl: ""               // empty = in-cluster config
        namespace: "jenkins"
        containerCapStr: "50"       # max concurrent build agent containers

        podTemplates:

          # Template 1: Java / Maven builds
          - name: "java-builder"
            label: "java maven"
            serviceAccount: "jenkins"
            activeDeadlineSeconds: 3600   # 1 hour max — safety net
            nodeSelector: "role=build"    # dedicated build nodes only
            tolerations:
              - key: "dedicated"
                operator: "Equal"
                value: "build"
                effect: "NoSchedule"
            containers:
              - name: "jnlp"
                image: "jenkins/inbound-agent:latest-jdk21"
                resourceRequestCpu: "100m"
                resourceRequestMemory: "256Mi"
              - name: "maven"
                image: "maven:3.9-eclipse-temurin-21"
                command: "cat"
                ttyEnabled: true
                resourceRequestCpu: "1"
                resourceRequestMemory: "2Gi"
                resourceLimitCpu: "2"
                resourceLimitMemory: "4Gi"
            # Node-local emptyDir workspace — memory: true would use tmpfs instead
            workspaceVolume:
              emptyDirWorkspaceVolume:
                memory: false

          # Template 2: Node.js / frontend builds
          - name: "node-builder"
            label: "node frontend"
            serviceAccount: "jenkins"
            activeDeadlineSeconds: 1800
            nodeSelector: "role=build"
            containers:
              - name: "jnlp"
                image: "jenkins/inbound-agent:latest"
                resourceRequestCpu: "100m"
                resourceRequestMemory: "256Mi"
              - name: "node"
                image: "node:20-alpine"
                command: "cat"
                ttyEnabled: true
                resourceRequestCpu: "500m"
                resourceRequestMemory: "1Gi"
                resourceLimitCpu: "1"
                resourceLimitMemory: "2Gi"

          # Template 3: Docker image builds — requires privileged DinD
          - name: "docker-builder"
            label: "docker"
            serviceAccount: "jenkins"
            activeDeadlineSeconds: 3600
            nodeSelector: "role=build"
            containers:
              - name: "jnlp"
                image: "jenkins/inbound-agent:latest"
                resourceRequestCpu: "100m"
                resourceRequestMemory: "256Mi"
              - name: "docker"
                image: "docker:24-dind"
                command: "cat"
                ttyEnabled: true
                privileged: true             # required for Docker-in-Docker
                resourceRequestCpu: "1"
                resourceRequestMemory: "1Gi"
                resourceLimitCpu: "4"
                resourceLimitMemory: "8Gi"   // large limit — image builds are memory-hungry
            # Large workspace PVC for Docker layer cache
            workspaceVolume:
              dynamicPVC:
                storageClassName: "gp3"
                requestsSize: "20Gi"
                accessModes: "ReadWriteOnce"
# When a Java build triggers:
[Pipeline] node (java-builder pod)
Creating pod: java-builder-abc123 in namespace jenkins
  Container jnlp:  jenkins/inbound-agent:latest-jdk21 (100m CPU, 256Mi RAM)
  Container maven: maven:3.9-eclipse-temurin-21      (1 CPU, 2Gi RAM)
Pod scheduled on node: build-node-eu-west-1a

[Pipeline] container (maven)
+ mvn clean test
BUILD SUCCESSFUL — 63 tests, 0 failed

Pod java-builder-abc123 deleted after build completion.

# When a Docker build triggers:
Creating pod: docker-builder-xyz789 in namespace jenkins
  Container jnlp:   jenkins/inbound-agent:latest     (100m CPU, 256Mi RAM)
  Container docker:  docker:24-dind (privileged)      (1 CPU, 1Gi RAM)
PVC docker-builder-xyz789-workspace: 20Gi gp3 provisioned
Pod scheduled on node: build-node-eu-west-1b (dedicated build node)

[Pipeline] container (docker)
+ docker build -t registry.acmecorp.com/checkout-service:88-a1b2c3d .
Successfully built 3e4f5a6b
+ docker push registry.acmecorp.com/checkout-service:88-a1b2c3d

Pod docker-builder-xyz789 deleted. PVC docker-builder-xyz789-workspace deleted.

What just happened?

  • Each pod was scheduled on a dedicated build node — the nodeSelector: role=build kept build pods off nodes running production workloads. Build activity — high CPU, memory, and I/O — doesn't affect production latency.
  • Resource requests ensure predictable scheduling — Kubernetes uses the requests values to decide which node can fit the pod. The Maven container requests 2Gi RAM — Kubernetes guarantees that much is available before scheduling. Builds don't compete with each other for memory on the same node.
  • Docker build got a 20Gi PVC — the dynamic PVC workspace gives Docker layer caching real disk. Docker can reuse layers from previous builds on the same PVC, significantly speeding up repeated image builds. The PVC is deleted when the pod is deleted — no orphaned volumes accumulating costs.
  • Pods deleted immediately after build — every build starts from a clean environment. No workspace contamination between builds, no credential leaks between teams, no disk accumulation on build nodes.
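On the pipeline side, a Jenkinsfile requests a template by one of its labels and runs build steps inside the tool container. A minimal sketch, assuming the java-builder template above is loaded — the stage name and Maven goals are illustrative:

```groovy
// Jenkinsfile — request the java-builder pod template by its 'maven' label
pipeline {
    // The label matches the pod template's label field in the JCasC config
    agent { label 'maven' }

    options {
        // Keep this below the template's activeDeadlineSeconds (3600s),
        // so Jenkins aborts the build cleanly before Kubernetes kills the pod
        timeout(time: 45, unit: 'MINUTES')
    }

    stages {
        stage('Build & Test') {
            steps {
                // Switch from the jnlp container into the maven container
                container('maven') {
                    sh 'mvn -B clean verify'
                }
            }
        }
    }
}
```

The container('maven') step is what moves execution out of the jnlp sidecar and into the Maven container — without it, sh steps run in the agent container, which has no Maven installed.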

Diagnosing Pod Agent Failures

When a Kubernetes build agent fails, the Jenkins console shows a generic error. The real diagnostic information is in Kubernetes, not in Jenkins. Here's the lookup chain every platform engineer needs to know:

# Step 1: Find the failed pod name from the Jenkins console output
# It will look like: java-builder-abc123
# Then query Kubernetes directly

# Step 2: Check the pod status and events
kubectl describe pod java-builder-abc123 -n jenkins

# Key sections to read in the output:
# Events: — shows scheduling failures, image pull errors, OOMKilled
# Status: — shows which containers exited and why
# Conditions: — shows if the pod reached Ready state

# Step 3: Check container logs directly
# Useful when the build output in Jenkins is incomplete
kubectl logs java-builder-abc123 -c maven -n jenkins

# Step 4: Check node pressure if pods are Pending
kubectl describe node build-node-eu-west-1a | grep -A5 "Conditions:"

# Common failure patterns and their causes:
# ---
# OOMKilled: container exceeded its memory limit
#   Fix: increase resourceLimitMemory in the pod template
# ---
# ImagePullBackOff: cannot pull the container image
#   Fix: check image tag exists, check imagePullSecrets, check registry access
# ---
# Pending (Unschedulable): no node has enough resources
#   Fix: scale up the node group, reduce resource requests, or check taints
# ---
# activeDeadlineExceeded: pod ran longer than activeDeadlineSeconds
#   Fix: increase activeDeadlineSeconds or investigate the slow build step
$ kubectl describe pod java-builder-abc123 -n jenkins

Name:         java-builder-abc123
Namespace:    jenkins
Node:         build-node-eu-west-1a/10.0.3.15
Status:       Failed

Containers:
  maven:
    State:     Terminated
    Reason:    OOMKilled
    Exit Code: 137

Conditions:
  Ready: False

Events:
  Warning  OOMKilling  15s  kubelet  Memory limit reached.
  Container maven was OOMKilled.
  Request: 2Gi  Limit: 4Gi  Usage at kill: 4.1Gi

# Diagnosis: the Maven build consumed more than 4Gi memory
# Fix: increase resourceLimitMemory for the maven container
# Or: add -Xmx2g to the Maven build to limit JVM heap

What just happened?

  • OOMKilled, exit code 137 — exit code 137 means the process received SIGKILL (128 + 9). Combined with the OOMKilled reason, it confirms the kernel's OOM killer terminated the container when it hit its 4Gi memory limit. The Jenkins console would show a vague agent disconnection error — kubectl describe pod gives the real reason in seconds.
  • Two fixes depending on root cause — if the build legitimately needs more memory, increase resourceLimitMemory in the pod template. If the build is leaking memory or the JVM heap is unbounded, add -Xmx2g to the Maven command to cap the heap below the container limit. The second fix is better — it prevents the build from consuming more memory than it should even on nodes where the limit is higher.
  • kubectl describe pod is always the first command — before looking at Jenkins logs, checking Slack, or filing a ticket, run kubectl describe pod on the failed pod. 80% of pod agent failures have a clear explanation in the Events section.
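The JVM-heap fix can live in the Jenkinsfile rather than the pod template, keeping the cap next to the build it protects. A sketch assuming the java-builder template's 4Gi container limit — the stage name is illustrative:

```groovy
// Jenkinsfile fragment — cap the Maven JVM heap below the container limit
pipeline {
    agent { label 'maven' }

    environment {
        // 2g heap plus Maven/JVM overhead stays well under the 4Gi limit
        MAVEN_OPTS = '-Xmx2g'
    }

    stages {
        stage('Build') {
            steps {
                container('maven') {
                    // If the heap is exhausted, the JVM now fails with an
                    // OutOfMemoryError in the build log instead of the whole
                    // container being OOMKilled by the kernel
                    sh 'mvn -B clean verify'
                }
            }
        }
    }
}
```

An OutOfMemoryError surfaces in the build log with a readable stack trace; an OOMKilled container just disconnects the agent — which is why capping the heap gives the better failure mode.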

Production Readiness Checklist

  • ServiceAccount with least-privilege Role — not cluster-admin
  • PodDisruptionBudget on the Jenkins master — survives node drains
  • Resource requests and limits set on all pod template containers
  • activeDeadlineSeconds set — runaway builds cannot hang forever
  • Build pods scheduled on dedicated build nodes — not production nodes
  • JENKINS_HOME on a PersistentVolumeClaim — not ephemeral pod storage
  • Multiple pod templates by build type — right-sized resources per workload
  • Prometheus metrics from Lesson 32 — queue, executor utilisation, build success rate

Teacher's Note

kubectl describe pod on the failed agent pod is always the first diagnostic command. It tells you what Jenkins cannot — OOMKilled, ImagePullBackOff, scheduling failure — in plain English in the Events section.

Practice Questions

1. Which Kubernetes resource prevents the Jenkins master pod from being evicted during a node drain or cluster upgrade?



2. A build pod's kubectl describe pod output shows exit code 137 and a container state of Terminated. What does this indicate?



3. Which pod template setting in the Kubernetes plugin limits the maximum time a build pod is allowed to run before Kubernetes kills it?



Quiz

1. You've created a Role with pod management permissions. What additional resource must you create to actually grant those permissions to the Jenkins ServiceAccount?


2. How do the pod templates in this lesson prevent build activity from affecting production workloads on the same cluster?


3. What is the difference between emptyDirWorkspaceVolume and dynamicPVC workspace types in a pod template?


Up Next · Lesson 40

Jenkins and Infrastructure as Code

Terraform, Ansible, and Pulumi in Jenkins pipelines — provisioning cloud infrastructure as part of your CI/CD workflow with the same reliability as application deployments.