Jenkins Lesson 24 – Handling Pipeline Failures | Dataplexa
Section II · Lesson 24

Handling Pipeline Failures

Pipelines fail. That's not a bug in your setup — it's the system working. The question isn't how to prevent failures, it's how to handle them so a broken build never becomes a production incident.

This lesson covers

retry → catchError → try/catch in script blocks → Automatic rollbacks → Timeout handling → unstable vs failure → The failure patterns every production pipeline must handle

Most Jenkinsfile tutorials show you how to make a pipeline work. Very few show you what happens when it doesn't. This lesson is about the second part — the defensive patterns that separate a pipeline that causes incidents from one that catches and contains them.

The Analogy

A well-handled pipeline failure is like a circuit breaker in an electrical system. The moment something goes wrong, the breaker trips — cleanly, safely, with a clear signal of what happened. The rest of the system is protected. You know exactly where to look. Compare this to a pipeline with no failure handling — a fault in one stage silently corrupts the next, and by the time anyone notices, the damage is already done.

Default Failure Behaviour — and Its Limits

By default, when any shell command in a Jenkins pipeline returns a non-zero exit code, the current stage fails and the pipeline stops. That's the right default — you don't want to deploy broken code. But default behaviour isn't always enough:

Default behaviour is not enough when...

  • A flaky integration test fails 1 in 10 runs
  • A deploy fails and needs to be rolled back automatically
  • You want to collect test results even when tests fail
  • A stage fails but you want other parallel stages to keep running
  • A build hangs indefinitely waiting for a network response

The tools Jenkins gives you

  • retry() — retry a step N times on failure
  • catchError() — catch a failure without stopping the pipeline
  • try/catch — full Groovy error handling in script blocks
  • timeout() — abort if a step takes too long
  • unstable() — mark as yellow without failing the build
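Before reaching for any of these, it's worth seeing the default behaviour on its own. This minimal sketch (the stage names and deploy script are illustrative) fails in the first stage, and the second stage never runs:

```groovy
pipeline {
    agent any

    stages {
        stage('Build') {
            steps {
                // Non-zero exit code → this stage fails and the pipeline stops
                sh 'exit 1'
            }
        }
        stage('Deploy') {
            steps {
                // Never reached — Jenkins skips the remaining stages after the failure
                sh './deploy.sh staging'
            }
        }
    }
}
```

Run this against any agent and the build finishes FAILURE after the first stage — exactly the behaviour the tools above let you override selectively.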

retry — Handling Flaky Steps

retry(N) wraps a block of steps and re-executes them up to N times if they fail. Use it for steps that fail intermittently due to network issues, external service timeouts, or race conditions — not for steps that fail because the code is broken.

pipeline {
    agent { label 'linux' }

    stages {

        stage('Integration Tests') {
            steps {
                // retry wraps the block and re-runs it up to 3 times if it fails
                // Use for genuinely flaky tests or external dependencies
                // Do NOT use to hide real test failures — fix those instead
                retry(3) {
                    sh './gradlew integrationTest'
                }
            }
        }

        stage('Deploy') {
            steps {
                // Combine retry with timeout for network-dependent operations
                // This retries the deploy up to 2 times, but each attempt
                // is killed if it takes longer than 5 minutes
                retry(2) {
                    timeout(time: 5, unit: 'MINUTES') {
                        sh './deploy.sh staging'
                    }
                }
            }
        }

    }
}
[Pipeline] { (Integration Tests) }
[Pipeline] retry (3)
[Pipeline] sh
+ ./gradlew integrationTest
Connection refused: staging-db:5432 — retrying (attempt 1 of 3)
[Pipeline] sh
+ ./gradlew integrationTest
Connection refused: staging-db:5432 — retrying (attempt 2 of 3)
[Pipeline] sh
+ ./gradlew integrationTest
BUILD SUCCESSFUL in 1m 12s — 18 integration tests completed, 0 failed
[Pipeline] { (Deploy) }
[Pipeline] retry (2)
[Pipeline] timeout (5 MINUTES)
[Pipeline] sh
+ ./deploy.sh staging
Deploying to staging... done.
[Pipeline] End of Pipeline
Finished: SUCCESS

What just happened?

  • Two retries before success — the integration test failed twice due to a database connection issue, then succeeded on the third attempt. Jenkins logged each attempt automatically — you can see exactly how many retries were consumed without adding any extra echo statements.
  • No failed stage in the UI — because the third attempt succeeded, the stage shows green. The retries are transparent to the stage view. Only if all three attempts failed would the stage turn red.
  • The deploy only ran once — the Deploy stage used retry(2) but only needed one attempt. The timeout never fired because the deploy finished in under 5 minutes.

Warning: retry() is not a fix for broken tests. If a test fails every time, retrying it three times just delays the failure. Use retry only for genuinely transient failures — external HTTP calls, database connections, cloud API rate limits. Identify and fix the root cause of any step that needs more than one retry to pass consistently.

catchError — Continue Despite Failure

catchError catches a failure in a block, records it, but lets the pipeline continue running. The build result is marked as FAILURE or UNSTABLE — whichever you pass as the buildResult argument — but subsequent steps and stages still execute. Use this when you want to collect results from multiple steps even if one fails.

pipeline {
    agent { label 'linux' }

    stages {

        stage('Run All Checks') {
            steps {

                // catchError lets the pipeline continue even if this step fails
                // buildResult: the final build result if this block fails
                // stageResult: what this stage shows in the UI if it fails
                catchError(buildResult: 'FAILURE', stageResult: 'FAILURE') {
                    // Unit tests — if these fail, build should fail
                    sh './gradlew test'
                }

                // This step runs even if the unit tests above failed
                // Because catchError caught the failure and let execution continue
                catchError(buildResult: 'UNSTABLE', stageResult: 'UNSTABLE') {
                    // Static analysis — if this fails, mark build as unstable (yellow)
                    // but don't fail the whole build — just flag it for review
                    sh './gradlew checkstyle pmd'
                }

                // Publish results regardless of what failed above
                // We want the test report even if tests failed
                junit allowEmptyResults: true, testResults: 'build/test-results/**/*.xml'
            }
        }

    }
}
[Pipeline] { (Run All Checks) }
[Pipeline] catchError (buildResult: FAILURE, stageResult: FAILURE)
[Pipeline] sh
+ ./gradlew test
FAIL src/test/java/CheckoutServiceTest.java
  NullPointerException in line 42
Tests: 2 failed, 41 passed
[Pipeline] echo
catchError caught error — build result set to FAILURE, continuing pipeline
[Pipeline] catchError (buildResult: UNSTABLE, stageResult: UNSTABLE)
[Pipeline] sh
+ ./gradlew checkstyle pmd
Checkstyle violations: 3 warnings found
BUILD SUCCESSFUL — static analysis complete
[Pipeline] junit
Recording test results
[Pipeline] End of Pipeline
Finished: FAILURE

What just happened?

  • Unit tests failed but the pipeline kept going — without catchError, the stage would have stopped immediately after the test failure. The static analysis would never have run and the test results would not have been published. catchError caught the failure and let execution continue.
  • Static analysis ran and produced a result — even though tests failed above, the checkstyle step ran and reported 3 warnings. The command still exited zero, so the second catchError never actually fired — and even if it had, a build result can only ever get worse, never better, so the FAILURE from the tests would stick over UNSTABLE.
  • Test results were published — the junit step at the end ran and recorded the test results. The Jenkins test trend graph now has data. This is the whole point — you need the test report most when tests fail.
  • Finished: FAILURE — the overall build failed because the first catchError had buildResult: 'FAILURE'. The pipeline completed all steps but still recorded the correct final outcome.

try/catch — Full Error Control in Script Blocks

For more complex failure handling — especially when you need to react differently to different kinds of failures — use Groovy's try/catch/finally inside a script { } block. This gives you complete programmatic control over what happens when something goes wrong.

The scenario:

You're a DevOps engineer at a payments company. A deploy pipeline pushes a new version to production. If the deploy or the post-deploy health check fails, the pipeline must automatically roll back to the previous version — without any human intervention. Failing to roll back would leave production in a broken state.

New terms in this code:

  • try { } — the block of code to attempt. If anything inside throws an exception or fails, execution jumps to the catch block.
  • catch(Exception e) { } — runs when the try block fails. The variable e holds the exception object. e.getMessage() gives the error message. e.toString() gives the full exception string.
  • finally { } — runs after try/catch regardless of whether there was an exception. Use for cleanup that must always happen — closing connections, deleting temp files, freeing locks.
  • currentBuild.result = 'FAILURE' — explicitly sets the build result from inside a script block. Use this when you've caught an exception manually and want to ensure the build records the right outcome.
  • throw e — re-throws the caught exception. Use this after your rollback or cleanup logic to propagate the failure — otherwise execution continues past the catch block and the stage is treated as successful.
  • unstable(message:) — marks the build as UNSTABLE (yellow) instead of FAILURE (red). Use when the issue is worth flagging but not severe enough to block the team.
pipeline {
    agent { label 'linux' }

    environment {
        APP_NAME     = 'payments-service'
        REGISTRY     = 'registry.acmecorp.com'
        DOCKER_CREDS = credentials('docker-registry-credentials')
    }

    stages {

        stage('Deploy to Production') {
            when { branch 'main' }
            steps {
                script {
                    // Store the current running version BEFORE deploying
                    // If the new deploy fails, we need this to roll back
                    def previousVersion = sh(
                        script: "kubectl get deployment/${APP_NAME} " +
                                "-o jsonpath='{.spec.template.spec.containers[0].image}' " +
                                "--namespace=production",
                        returnStdout: true
                    ).trim()

                    echo "Current version in production: ${previousVersion}"
                    echo "Deploying new version: ${REGISTRY}/${APP_NAME}:${BUILD_NUMBER}"

                    try {
                        // Attempt the deploy
                        sh """
                            kubectl set image deployment/${APP_NAME} \
                              ${APP_NAME}=${REGISTRY}/${APP_NAME}:${BUILD_NUMBER} \
                              --namespace=production
                            kubectl rollout status deployment/${APP_NAME} \
                              --namespace=production --timeout=120s
                        """

                        // Post-deploy health check — confirm the service is responding
                        sh 'curl --fail --retry 3 https://api.acmecorp.com/health'
                        echo "✅ Deploy successful — health check passed"

                    } catch(Exception e) {
                        // Deploy or health check failed — roll back immediately
                        echo "❌ Deploy failed: ${e.getMessage()}"
                        echo "Rolling back to: ${previousVersion}"

                        // Restore the previous image
                        sh """
                            kubectl set image deployment/${APP_NAME} \
                              ${APP_NAME}=${previousVersion} \
                              --namespace=production
                            kubectl rollout status deployment/${APP_NAME} \
                              --namespace=production --timeout=120s
                        """

                        echo "Rollback complete — production restored to ${previousVersion}"

                        // Mark the build as failed and propagate the exception
                        // Without 'throw e' Jenkins would treat the catch as a success
                        currentBuild.result = 'FAILURE'
                        throw e

                    } finally {
                        // Always runs — log the final state of the production deployment
                        // Useful for the audit trail regardless of success or failure
                        sh "kubectl get deployment/${APP_NAME} --namespace=production"
                    }
                }
            }
        }

    }

    post {
        failure {
            slackSend(
                channel: '#incidents',
                color: 'danger',
                message: """
                    🚨 *PRODUCTION DEPLOY FAILED AND ROLLED BACK*
                    Service: *${APP_NAME}*
                    Build: <${env.BUILD_URL}|#${BUILD_NUMBER}>
                    Production has been restored to the previous version.
                """.stripIndent().trim()
            )
        }
        success {
            slackSend(
                channel: '#deployments',
                color: 'good',
                message: "✅ *${APP_NAME}* build #${BUILD_NUMBER} deployed to production successfully"
            )
        }
        always { cleanWs() }
    }

}

Where to practice: Test the rollback logic by intentionally deploying a broken image tag — one that doesn't exist in your registry. kubectl set image itself will succeed, but the rollout will stall as the new pods hit ImagePullBackOff, so kubectl rollout status times out — triggering the catch block and rollback. You'll see the entire rollback flow in the console output. Full error handling reference at jenkins.io — Pipeline Syntax.

Started by GitHub push by dev-omar (branch: main)
[Pipeline] Start of Pipeline
[Pipeline] node (agent-linux-01)
[Pipeline] { (Deploy to Production) }
[Pipeline] script
[Pipeline] sh
+ kubectl get deployment/payments-service -o jsonpath='...' --namespace=production
registry.acmecorp.com/payments-service:59-c4d5e6f
[Pipeline] echo
Current version in production: registry.acmecorp.com/payments-service:59-c4d5e6f
[Pipeline] echo
Deploying new version: registry.acmecorp.com/payments-service:60
[Pipeline] sh
+ kubectl set image deployment/payments-service payments-service=...
deployment.apps/payments-service image updated
+ kubectl rollout status deployment/payments-service --namespace=production --timeout=120s
Waiting for deployment "payments-service" rollout to finish: 0 of 3 updated replicas are available...
error: watch closed before EOF
Error from server: error when watching "payments-service": The connection was reset
[Pipeline] echo (catch block)
❌ Deploy failed: watch closed before EOF
[Pipeline] echo
Rolling back to: registry.acmecorp.com/payments-service:59-c4d5e6f
[Pipeline] sh
+ kubectl set image deployment/payments-service payments-service=...registry.acmecorp.com/payments-service:59-c4d5e6f
deployment.apps/payments-service image updated
+ kubectl rollout status deployment/payments-service --namespace=production --timeout=120s
deployment "payments-service" successfully rolled out
[Pipeline] echo
Rollback complete — production restored to registry.acmecorp.com/payments-service:59-c4d5e6f
[Pipeline] sh (finally)
NAME                READY   UP-TO-DATE   AVAILABLE
payments-service    3/3     3            3
[Pipeline] post (failure)
Slack #incidents: 🚨 PRODUCTION DEPLOY FAILED AND ROLLED BACK
[Pipeline] cleanWs
Finished: FAILURE

What just happened?

  • Previous version captured before deploy — the pipeline read the currently running image tag before touching anything. This is critical for rollback — you must know what you're rolling back to before you break it.
  • Rollout status failed mid-watch — the deploy started but the kubectl watch connection dropped before all replicas came up. This is a real-world transient failure. The exception was caught by the catch block.
  • Automatic rollback executed — the catch block immediately ran kubectl set image with the previous version. Kubernetes rolled back all three replicas to the known-good image. No human had to intervene.
  • finally block ran — after the catch block completed and the exception was re-thrown, the finally block still executed, logging the current deployment state. The audit trail is complete regardless of outcome.
  • throw e propagated the failure — because we re-threw the exception, the stage failed, the build result became FAILURE, and the failure post block fired, sending the #incidents alert. Without throw e, execution would have continued past the catch block as if the deploy had succeeded — a silent rollback with no red build to investigate.
  • Finished: FAILURE with a complete rollback — the build failed as it should. But production was restored. The team was notified immediately. No incident escalation needed.

unstable() — Yellow Builds for Non-Critical Issues

Not every problem deserves a red build. If code coverage drops below threshold, or a non-critical linting rule fails, or an advisory security scan finds low-severity issues — these are worth flagging without blocking the team. unstable() marks the build yellow (UNSTABLE) instead of red (FAILURE).

pipeline {
    agent { label 'linux' }

    stages {

        stage('Test') {
            steps {
                sh './gradlew test jacocoTestReport'
            }
            post {
                always {
                    junit 'build/test-results/**/*.xml'

                    // Check code coverage — warn if below threshold, don't fail
                    script {
                        def coverage = sh(
                            // NOTE: the grep pattern is illustrative — adjust it to
                            // the structure of your coverage report
                            script: "grep -oP '(?<=coverage=\")[0-9]+' build/reports/jacoco/test/jacocoTestReport.xml | head -1",
                            returnStdout: true
                        ).trim().toInteger()

                        if (coverage < 80) {
                            unstable("Code coverage ${coverage}% is below the 80% threshold")
                        }
                    }
                }
            }
        }

        stage('Security Advisory Scan') {
            steps {
                script {
                    // returnStatus: true captures the exit code instead of failing the step
                    def exitCode = sh(
                        script: './gradlew dependencyCheckAnalyze',
                        returnStatus: true
                    )

                    if (exitCode == 1) {
                        // Advisories found — flag them for review, don't block the team
                        unstable('Dependency check found advisories — review the report')
                    } else if (exitCode > 1) {
                        // Critical vulnerabilities found — fail the build
                        error('Dependency check found HIGH severity vulnerabilities — build blocked')
                    }
                }
            }
        }

    }
}
[Pipeline] { (Test) }
[Pipeline] sh
+ ./gradlew test jacocoTestReport
BUILD SUCCESSFUL — 58 tests completed, 0 failed
[Pipeline] junit — recording test results
[Pipeline] script
[Pipeline] sh
+ grep -oP ... jacocoTestReport.xml
73
[Pipeline] unstable
WARNING: Code coverage 73% is below the 80% threshold
Build marked as UNSTABLE
[Pipeline] { (Security Advisory Scan) }
[Pipeline] sh
+ ./gradlew dependencyCheckAnalyze
[INFO] Dependency-Check: 2 advisories found (LOW severity)
exit code: 1
[Pipeline] unstable
WARNING: Dependency check found advisories — review the report
[Pipeline] End of Pipeline
Finished: UNSTABLE

What just happened?

  • All tests passed — but coverage came back at 73%, below the 80% threshold. The unstable() call fired, logging the warning and flipping the build from green to yellow.
  • Security scan found LOW advisories (exit code 1) — the returnStatus: true flag captured the exit code without failing the stage. The if (exitCode == 1) branch ran and called unstable() again, reinforcing the yellow state.
  • Finished: UNSTABLE — the dashboard dot turns yellow. The team sees the warning but nothing is blocked. They can review the report and decide whether to act before the next sprint. If the security scan had found HIGH severity issues (exitCode > 1), error() would have fired and the build would be red — fully blocked.

timeout() — Kill Hanging Builds

A build that hangs indefinitely is worse than a build that fails quickly. It ties up an agent executor, delays other builds, and gives nobody a clear signal that something is wrong. Always wrap external calls and long-running steps with timeout().

pipeline {
    agent { label 'linux' }

    // Pipeline-level timeout — the entire pipeline must finish within 45 minutes
    options {
        timeout(time: 45, unit: 'MINUTES')
    }

    stages {

        stage('Integration Tests') {
            steps {
                // Stage-level timeout — this specific stage has 15 minutes max
                timeout(time: 15, unit: 'MINUTES') {
                    sh './gradlew integrationTest'
                }
            }
        }

        stage('Wait for Service') {
            steps {
                // Step-level timeout — abort if the health check doesn't respond in 2 minutes
                timeout(time: 2, unit: 'MINUTES') {
                    // waitUntil polls until the block returns true or timeout fires
                    waitUntil {
                        def response = sh(
                            script: 'curl -s -o /dev/null -w "%{http_code}" https://staging.acmecorp.com/health',
                            returnStdout: true
                        ).trim()
                        return response == '200'
                    }
                }
            }
        }

    }
}
[Pipeline] { (Integration Tests) }
[Pipeline] timeout (15 MINUTES)
[Pipeline] sh
+ ./gradlew integrationTest
BUILD SUCCESSFUL in 8m 42s — 34 integration tests completed, 0 failed
[Pipeline] { (Wait for Service) }
[Pipeline] timeout (2 MINUTES)
[Pipeline] waitUntil
Attempt 1: curl returned 503 — service not ready yet
Attempt 2: curl returned 503 — service not ready yet
Attempt 3: curl returned 200 — service is healthy
[Pipeline] End of Pipeline
Finished: SUCCESS

What just happened?

  • Integration tests finished in 8m 42s — well within the 15-minute stage timeout. If tests had hung at, say, 14 minutes, Jenkins would have aborted that stage with Timeout exceeded and marked it failed — freeing the executor immediately.
  • waitUntil polled 3 times — the service returned 503 twice while it was starting up. On the third poll it returned 200. waitUntil returned true and the step completed. All of this happened within the 2-minute timeout window.
  • If the 2-minute timeout had fired — Jenkins would have interrupted the waitUntil step and aborted the stage, failing the build. This protects against a deploy that succeeds but leaves the service in a broken state that never becomes healthy.

Failure Handling — Quick Reference

Situation → tool to reach for → result on failure:

  • Step fails intermittently due to network → retry(N) → retries N times, then FAILURE
  • Step fails but pipeline must continue → catchError() → continues, records FAILURE or UNSTABLE
  • Deploy fails — must auto-rollback → try/catch + throw e → rollback executes, then FAILURE
  • Non-critical issue — flag but don't block → unstable(message:) → build turns yellow, team is alerted
  • Step takes too long / hangs → timeout(time, unit) → ABORTED after the timeout duration
  • Cleanup must run regardless of result → post { always { } } → always executes, build result unchanged
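Cleanup deserves a sketch of its own, since this lesson has used post blocks only in passing. A post { always { } } block runs whatever the build result, which is where cleanup belongs. A minimal shape (the build command is illustrative):

```groovy
pipeline {
    agent { label 'linux' }

    stages {
        stage('Build') {
            steps {
                sh './gradlew build'
            }
        }
    }

    post {
        always {
            // Runs on SUCCESS, UNSTABLE, FAILURE, and ABORTED alike —
            // and running it never changes the recorded build result
            cleanWs()
        }
        failure {
            // Runs only when the final result is FAILURE
            echo 'Build failed — check the console output for the failing stage'
        }
    }
}
```

Note the ordering: Jenkins evaluates the post conditions after all stages finish, so on a failed build both blocks run — the failure notification and the always cleanup.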

Teacher's Note

Always capture the previous version before deploying. Always re-throw caught exceptions. These two habits alone prevent most production incidents caused by Jenkins pipelines.

Practice Questions

1. After handling a failure in a catch block (e.g. running a rollback), what must you write to ensure Jenkins records the build as FAILURE rather than SUCCESS?



2. Which Jenkins step marks a build as yellow (UNSTABLE) rather than red (FAILURE) — used for non-critical issues that should be flagged but not block the team?



3. In a try/catch block, which clause always runs regardless of whether the try succeeded or the catch handled a failure?



Quiz

1. When is it appropriate to use retry() on a pipeline step?


2. What is the key difference between catchError and default failure behaviour?


3. What does waitUntil { } do in a Jenkins pipeline?


Up Next · Lesson 25

Pipeline Best Practices

The habits, patterns, and rules that separate a Jenkinsfile that holds up under pressure from one that becomes a maintenance burden six months in.