Ansible Lesson 21 – Error Handling | Dataplexa

Section II · Lesson 21

Error Handling

In this lesson

ignore_errors · failed_when · block / rescue / always · any_errors_fatal · Rollback patterns

Error handling in Ansible is the set of mechanisms that control what happens when a task fails — whether the play stops immediately, continues on other hosts, retries, or triggers a cleanup sequence. By default Ansible halts execution on a host the moment any task fails and reports it at the end of the play recap. That default is safe but inflexible: real production automation needs to distinguish between expected failures, unexpected failures, and catastrophic failures — responding differently to each. Mastering Ansible's error handling transforms fragile scripts into resilient automation that can recover from problems, clean up after itself, and leave your infrastructure in a known state even when things go wrong.

Default Error Behaviour

Before learning how to override Ansible's error behaviour, it is important to understand exactly what the default behaviour is. This prevents surprises and ensures your error handling decisions are deliberate.

Default behaviour 1

A failing task stops further tasks on that host

When a task fails on web01, Ansible marks web01 as failed and skips all remaining tasks for that host in the current play. Other hosts in the play continue executing unaffected.

Default behaviour 2

The failed host is excluded from subsequent plays

If a playbook has multiple plays, any host that failed in Play 1 is automatically removed from the host list for Play 2 and all subsequent plays in the same run.

Default behaviour 3

Handlers do not run for failed hosts

If a host fails mid-play, any handlers that were notified by successful tasks earlier in the play do not run for that host. This can leave services in an inconsistent state — a config was updated but the service was never restarted.
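
One way to avoid the inconsistent state described above is the play-level force_handlers keyword, which tells Ansible to run handlers that were already notified even if a later task fails on that host. The sketch below is a minimal illustration; the webservers group, the nginx.conf.j2 template, and the smoke-test script are assumptions, not part of this lesson's playbooks.

```yaml
- name: Update config and still run notified handlers on failure
  hosts: webservers                 # hypothetical group name
  become: true
  force_handlers: true              # run already-notified handlers even if a later task fails

  tasks:
    - name: Deploy nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

    - name: Run a check that may fail after the config changed
      ansible.builtin.command:
        cmd: /usr/local/bin/smoke-test.sh   # hypothetical script
      changed_when: false

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

The same behaviour can be requested for a single run with the --force-handlers command-line option.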

Default behaviour 4

Any failed host makes the playbook exit non-zero

The play recap lists each failed host, and ansible-playbook exits 0 only if no host failed. CI/CD pipelines that check the exit code will detect a failure and halt the pipeline, which is the correct behaviour for automated deployments.

ignore_errors and failed_when

The simplest error handling tools are task-level attributes that override how Ansible interprets a task's result. Use them sparingly — every ignore_errors: true is a place where a real problem could be silently swallowed.

# ignore_errors: true — continue the play even if this task fails
# Use ONLY when a failure is expected and explicitly handled below
- name: Stop application (may not be running on first deploy)
  ansible.builtin.service:
    name: myapp
    state: stopped
  ignore_errors: true              # app might not exist yet — that is fine

- name: Start fresh application deployment
  ansible.builtin.command:
    cmd: /usr/local/bin/deploy.sh
  # This task runs regardless of whether the stop task above succeeded

---
# failed_when — redefine what "failure" means for a specific task
# Use when the module's default failure detection doesn't match your intent

# pgrep exits 1 when no process matches, which is not a failure in this context
- name: Check whether the app process is running
  ansible.builtin.command:
    cmd: pgrep -f myapp
  register: process_check
  failed_when: process_check.rc > 1    # rc=0: found, rc=1: not found (ok), rc>1: real error
  changed_when: false

# Custom failure condition based on output content
- name: Run database migration
  ansible.builtin.command:
    cmd: python manage.py migrate --no-input
  register: migration_result
  failed_when: >
    migration_result.rc != 0 or
    'ERROR' in migration_result.stderr

The Circuit Breaker Analogy

Ansible's default error behaviour is like a circuit breaker in an electrical panel — when one circuit overloads, it trips and cuts power to protect the rest of the system. ignore_errors is like bypassing the breaker with a piece of wire: it keeps the circuit running, but now a real fault will go undetected and could cause a fire. Use it deliberately and always ensure you are monitoring for the condition you are ignoring — do not just make the warning disappear.
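
To keep an ignored failure visible rather than silent, one option is to register the result and report it in a follow-up task. The sketch below reuses the myapp service from the earlier example; the stop_result variable name is an assumption chosen for illustration.

```yaml
- name: Stop application (may not be installed on first deploy)
  ansible.builtin.service:
    name: myapp
    state: stopped
  register: stop_result             # keep the result so the failure can be inspected
  ignore_errors: true

- name: Report the ignored failure so it stays visible in the run output
  ansible.builtin.debug:
    msg: "Stopping myapp failed and was ignored: {{ stop_result.msg | default('no message returned') }}"
  when: stop_result is failed       # true even though the failure was ignored
```

The play still continues, but the run output now records what was ignored and why.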

block / rescue / always

The block construct is Ansible's equivalent of a try/catch/finally pattern. It groups a set of tasks together and lets you define what happens if any of them fail (rescue) and what always runs regardless (always). This is the right tool for implementing rollback logic, cleanup tasks, and notifications when a deployment fails.

---
- name: Deploy application with rollback on failure
  hosts: appservers
  become: true

  tasks:
    - name: Run deployment with error handling
      block:
        # All tasks in block run sequentially
        - name: Back up current application
          ansible.builtin.copy:
            src: "{{ deploy_dir }}/current/"
            dest: "{{ deploy_dir }}/rollback_backup/"
            remote_src: true

        - name: Deploy new application version
          ansible.builtin.unarchive:
            src: "{{ release_archive }}"
            dest: "{{ deploy_dir }}/releases/{{ version }}"   # unarchive requires this directory to exist already
            remote_src: false

        - name: Update current symlink to new release
          ansible.builtin.file:
            src: "{{ deploy_dir }}/releases/{{ version }}"
            dest: "{{ deploy_dir }}/current"
            state: link
            force: true

        - name: Run database migrations
          ansible.builtin.command:
            cmd: python manage.py migrate --no-input
            chdir: "{{ deploy_dir }}/current"
          register: migration_result

        - name: Restart application service
          ansible.builtin.service:
            name: myapp
            state: restarted

      rescue:
        # Only runs if ANY task in block fails
        - name: Log the deployment failure
          ansible.builtin.debug:
            msg: "Deployment failed! Rolling back to previous version."

        - name: Restore previous release symlink
          ansible.builtin.file:
            src: "{{ deploy_dir }}/rollback_backup/"
            dest: "{{ deploy_dir }}/current"
            state: link
            force: true

        - name: Restart application with previous version
          ansible.builtin.service:
            name: myapp
            state: restarted

        - name: Send failure notification
          ansible.builtin.uri:
            url: "{{ slack_webhook_url }}"
            method: POST
            body_format: json
            body:
              text: "Deployment of {{ version }} FAILED on {{ inventory_hostname }}"

      always:
        # Runs regardless of success or failure
        - name: Clean up temporary deployment files
          ansible.builtin.file:
            path: "{{ deploy_dir }}/tmp_deploy"
            state: absent

        - name: Record deployment outcome
          ansible.builtin.lineinfile:
            path: /var/log/deploy_history.log
            # ansible_failed_task is only defined after a task in block has failed
            line: "{{ ansible_date_time.iso8601 }} - {{ version }} - {{ 'FAILED' if ansible_failed_task is defined else 'SUCCESS' }}"
            create: true

# Scenario: migration task fails mid-deployment

TASK [Back up current application] ********************************************
changed: [appserver01]

TASK [Deploy new application version] *****************************************
changed: [appserver01]

TASK [Update current symlink to new release] **********************************
changed: [appserver01]

TASK [Run database migrations] ************************************************
fatal: [appserver01]: FAILED! => {"rc": 1, "stderr": "django.db.utils.OperationalError: table locked"}

TASK [Log the deployment failure] *********************************************
ok: [appserver01] => {"msg": "Deployment failed! Rolling back to previous version."}

TASK [Restore previous release symlink] ***************************************
changed: [appserver01]

TASK [Restart application with previous version] ******************************
changed: [appserver01]

TASK [Send failure notification] **********************************************
ok: [appserver01]

TASK [Clean up temporary deployment files] ************************************
ok: [appserver01]   <-- always runs

TASK [Record deployment outcome] **********************************************
changed: [appserver01]   <-- always runs

PLAY RECAP ********************************************************************
appserver01   : ok=10  changed=6  unreachable=0  failed=0  skipped=0  rescued=1  ignored=0   <-- ok includes Gathering Facts (not shown)

What just happened?

The migration task failed, which triggered the rescue section — rollback, restart, and Slack notification all ran automatically. The always section then ran cleanup and logging regardless. Notice the play recap shows failed=0 and rescued=1 — the rescue section handled the error cleanly, so the overall play is considered to have completed without unhandled failures. The application was restored to the previous version with zero manual intervention.
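
Because a successful rescue leaves failed=0, the playbook also exits 0, so a CI pipeline would treat the rolled-back deployment as a success. If that is not what you want, one possible approach is to end the rescue section with an explicit fail task. The fragment below is a sketch under that assumption: it shows only the migration step, and the rollback tasks from the example above would sit where the comment indicates.

```yaml
- name: Deploy with rollback, then report the failure
  block:
    - name: Run database migrations
      ansible.builtin.command:
        cmd: python manage.py migrate --no-input
        chdir: "{{ deploy_dir }}/current"

  rescue:
    - name: Show which task failed
      ansible.builtin.debug:
        msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message') }}"

    # Rollback tasks (restore symlink, restart service, notify) go here

    - name: Mark the host as failed once the rollback has finished
      ansible.builtin.fail:
        msg: "Deployment was rolled back on {{ inventory_hostname }}"
```

With the final fail task in place the host is reported as failed and the run exits non-zero, while the rollback has still been performed. An always section, if present, still runs afterwards.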

any_errors_fatal and max_fail_percentage

By default, a failure on one host does not stop the play for other hosts. Sometimes this is exactly what you want — continue on healthy hosts and report the failures at the end. Other times, a single failure should immediately abort the entire play to prevent cascading damage. any_errors_fatal and max_fail_percentage control this threshold.

any_errors_fatal: true

Stop the entire play if ANY host fails: Ansible finishes the failing task on the hosts in the current batch, then aborts the play for all hosts

Zero tolerance — one failure aborts all remaining hosts

Best for database schema changes, security hardening, or any task where a partial application is worse than no change

max_fail_percentage: N

Stop the play when more than N% of hosts have failed

Tolerant — allows some failures before aborting

Best for rolling updates where some failure is acceptable but a majority failure signals a systemic problem

# Stop immediately if ANY host fails (zero tolerance)
- name: Apply critical security patches
  hosts: all
  become: true
  any_errors_fatal: true           # one failure = abort everything

  tasks:
    - name: Apply kernel security patch
      ansible.builtin.package:
        name: linux-image-generic
        state: latest

---
# Allow up to 20% of hosts to fail before aborting (rolling tolerance)
- name: Deploy application update
  hosts: webservers
  become: true
  serial: 5                        # process 5 hosts at a time
  max_fail_percentage: 20          # checked per batch; must be exceeded: 1 of 5 failing (20%) continues, 2 of 5 (40%) aborts

  tasks:
    - name: Deploy new release
      ansible.builtin.unarchive:
        src: release.tar.gz
        dest: /var/www/app
        remote_src: false

Retrying Tasks

Some failures are transient — a service takes a few seconds to become healthy after starting, an API endpoint is temporarily unavailable, or a database is momentarily locked. The retries and until attributes let a task keep trying until a condition is met — without failing permanently on a transient error.

# Wait for a service to become healthy after starting
- name: Wait for application to become available
  ansible.builtin.uri:
    url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
    status_code: 200
  register: health_response
  retries: 10          # retry up to 10 times after the first attempt
  delay: 6             # wait 6 seconds between attempts
  until: health_response.status == 200
  # Total delay: up to 10 * 6 = 60 seconds, plus the time each request takes

# Wait for a port to open after service start
- name: Wait for database port to be accepting connections
  ansible.builtin.wait_for:
    host: "{{ db_host }}"
    port: 5432
    timeout: 60        # wait up to 60 seconds total
    state: started

# Retry a flaky download
- name: Download release archive (retry on network failure)
  ansible.builtin.get_url:
    url: "{{ release_url }}"
    dest: /tmp/release.tar.gz
    checksum: "sha256:{{ release_checksum }}"
  register: download_result
  retries: 3
  delay: 10
  until: download_result is succeeded

Error Handling Decision Guide

Five common scenarios call for different error handling approaches. Use this table to choose the right tool for each situation.

Scenario → Correct tool

A task might fail on first deploy but will always succeed once the system is provisioned → ignore_errors: true. Skip the error and continue. Document why in a comment.

A command exits non-zero even on success (grep, diff, pgrep) → failed_when. Define the real failure condition based on exit code or output content.

A multi-step deployment needs automatic rollback if any step fails → block / rescue / always. Group the deployment steps in block and put rollback in rescue.

A failure on any host should immediately stop the entire fleet update → any_errors_fatal: true. Zero-tolerance abort at the play level.

A service or endpoint takes time to become available after a restart → retries + until + delay. Poll until healthy rather than failing immediately.

Never Use ignore_errors as a Substitute for Proper Error Handling

ignore_errors: true is the Ansible equivalent of catching an exception and doing nothing with it — the failure is real but now invisible. A play that uses ignore_errors on many tasks can report failed=0 in the recap while leaving the system in a broken state. Every use of ignore_errors must be accompanied by a comment explaining exactly why the failure is expected and acceptable. If you cannot write that comment, the right tool is failed_when or block/rescue — not ignore_errors.
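
As an illustration of replacing ignore_errors with an explicit precondition, the sketch below checks whether the service exists before trying to stop it. It assumes a systemd host, where ansible.builtin.service_facts lists units under names such as myapp.service; the key format differs on other init systems.

```yaml
- name: Collect facts about installed services
  ansible.builtin.service_facts:

- name: Stop application only if its unit is installed
  ansible.builtin.service:
    name: myapp
    state: stopped
  # Assumption: systemd naming, so the unit appears as 'myapp.service'
  when: "'myapp.service' in ansible_facts.services"
```

On a first deploy the stop task is skipped instead of failing, and any real failure to stop an installed service still halts the host.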

Key Takeaways

✓ By default a failing task stops that host but continues other hosts. The failed host is excluded from subsequent plays in the same run, and its notified handlers do not run.

✓ block / rescue / always is Ansible's try/catch/finally. Use it for multi-step operations that need automatic rollback, cleanup, or failure notification when something goes wrong mid-deployment.

✓ Use failed_when instead of ignore_errors where possible. It lets you precisely define what constitutes a real failure rather than blindly suppressing all errors from a task.

✓ retries + until handles transient failures gracefully. Poll for a health condition rather than failing immediately when a service takes time to become ready after starting.

✓ Every ignore_errors: true needs a comment explaining why. If you cannot justify the suppression in writing, you need a different error handling approach.

Teacher's Note

Take the Lesson 13 deployment playbook and wrap all its tasks in a block with a rescue that prints a failure message and an always that cleans up any temp files. Deliberately break one task and watch the rescue run. This exercise makes the block/rescue/always structure click faster than any amount of reading.
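
If the Lesson 13 playbook is not to hand, a stand-in like the one below can be used to try the same exercise. It is a minimal sketch that runs against localhost; the task names and the /tmp/practice_deploy path are hypothetical, and the deliberate failure uses the fail module in place of a real broken task.

```yaml
---
- name: Block / rescue / always practice run
  hosts: localhost
  gather_facts: false

  tasks:
    - name: Practise error handling
      block:
        - name: Step that succeeds
          ansible.builtin.debug:
            msg: "Deployment step 1 ok"

        - name: Step that is deliberately broken
          ansible.builtin.fail:
            msg: "Deliberate failure for the exercise"

        - name: Step that is never reached
          ansible.builtin.debug:
            msg: "You should not see this"

      rescue:
        - name: Print a failure message
          ansible.builtin.debug:
            msg: "Rescue ran because '{{ ansible_failed_task.name }}' failed"

      always:
        - name: Clean up temp files
          ansible.builtin.file:
            path: /tmp/practice_deploy    # hypothetical temp path
            state: absent
```

Removing the deliberately broken step and running the play again shows the other path: rescue is skipped and always still runs.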

Practice Questions

1. In a block / rescue / always structure, which section runs only when a task inside block fails?

2. Which play-level attribute stops the entire play immediately if any single host fails — regardless of how many other hosts are still healthy?

3. When using retries to poll for a health condition, which attribute specifies the condition that must be true for the task to stop retrying and succeed?

Up Next · Lesson 22

Tags and Selective Execution

Learn to run only the parts of a playbook you need — tagging tasks, running subsets, skipping sections, and using tags to speed up iterative development and targeted production changes.
