Ansible Course
Error Handling
In this lesson
Error handling in Ansible is the set of mechanisms that control what happens when a task fails: whether the play stops immediately, continues on other hosts, retries, or triggers a cleanup sequence. By default, Ansible stops running tasks on a host the moment any task fails and reports the failure in the play recap at the end of the run. That default is safe but inflexible: real production automation needs to distinguish between expected, unexpected, and catastrophic failures, and respond differently to each. Mastering Ansible's error handling transforms fragile scripts into resilient automation that can recover from problems, clean up after itself, and leave your infrastructure in a known state even when things go wrong.
Default Error Behaviour
Before learning how to override Ansible's error behaviour, it is important to understand exactly what the default behaviour is. This prevents surprises and ensures your error handling decisions are deliberate.
Default behaviour 1
A failing task stops further tasks on that host
When a task fails on web01, Ansible marks web01 as failed and skips all remaining tasks for that host in the current play. Other hosts in the play continue executing unaffected.
Default behaviour 2
The failed host is excluded from subsequent plays
If a playbook has multiple plays, any host that failed in Play 1 is automatically removed from the host list for Play 2 and all subsequent plays in the same run.
Default behaviour 3
Handlers do not run for failed hosts
If a host fails mid-play, any handlers that were notified by successful tasks earlier in the play do not run for that host. This can leave services in an inconsistent state — a config was updated but the service was never restarted.
Default behaviour 4
The playbook exits with a non-zero return code if any host failed
Ansible exits with a non-zero return code when any host fails. CI/CD pipelines that check the exit code will detect this and halt the pipeline — the correct behaviour for automated deployments.
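Default behaviour 3 can be overridden per play. The sketch below assumes an nginx host group; the template name and the check script are hypothetical placeholders, not part of this lesson's playbooks. It uses the play-level force_handlers keyword so that handlers notified by earlier successful tasks still run when a later task fails on the same host.

```yaml
---
- name: Update web server configuration
  hosts: webservers
  become: true
  force_handlers: true  # run already-notified handlers even if a later task fails
  tasks:
    - name: Install updated nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2              # hypothetical template name
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

    - name: Run a post-configuration check that might fail
      ansible.builtin.command:
        cmd: /usr/local/bin/post_config_check.sh  # hypothetical script
      changed_when: false

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

The same effect is available for a single run with the --force-handlers command-line option. Whether restarting a service after a failed play is desirable depends on the situation, so treat this as an option to weigh rather than a default to adopt.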
ignore_errors and failed_when
The simplest error handling tools are task-level attributes that override how Ansible interprets a task's result. Use them sparingly: every ignore_errors: true is a place where a real problem could be silently swallowed.
# ignore_errors: true continues the play even if this task fails
# Use ONLY when a failure is expected and explicitly handled below
- name: Stop application (may not be running on first deploy)
  ansible.builtin.service:
    name: myapp
    state: stopped
  ignore_errors: true  # app might not exist yet, and that is fine

- name: Start fresh application deployment
  ansible.builtin.command:
    cmd: /usr/local/bin/deploy.sh
  # This task runs regardless of whether the stop task above succeeded
---
# failed_when redefines what "failure" means for a specific task
# Use when the module's default failure detection doesn't match your intent
# pgrep exits 1 when no process matches, which is not a failure in this context
- name: Check whether the app process is running
  ansible.builtin.command:
    cmd: pgrep -f myapp
  register: process_check
  failed_when: process_check.rc > 1  # rc=0: found, rc=1: not found (ok), rc>1: real error
  changed_when: false

# Custom failure condition based on output content
- name: Run database migration
  ansible.builtin.command:
    cmd: python manage.py migrate --no-input
  register: migration_result
  failed_when: >
    migration_result.rc != 0 or
    'ERROR' in migration_result.stderr
The Circuit Breaker Analogy
Ansible's default error behaviour is like a circuit breaker in an electrical panel: when one circuit overloads, it trips and cuts power to protect the rest of the system. ignore_errors is like bypassing the breaker with a piece of wire: it keeps the circuit running, but now a real fault will go undetected and could cause a fire. Use it deliberately and always ensure you are monitoring for the condition you are ignoring. Do not just make the warning disappear.
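One way to keep an ignored failure visible is to register the result and report it. This is a minimal sketch reusing the myapp service from the example above; the variable name stop_result is an assumption chosen for illustration.

```yaml
- name: Stop application (may not be running on first deploy)
  ansible.builtin.service:
    name: myapp
    state: stopped
  register: stop_result
  ignore_errors: true  # expected on first deploy; reported by the next task

- name: Report the ignored failure so it stays visible
  ansible.builtin.debug:
    msg: "Stopping myapp failed on {{ inventory_hostname }}: {{ stop_result.msg | default('no message returned') }}"
  when: stop_result is failed
```

The play still continues, but the run output now records that the stop task failed and why, which is the monitoring the analogy asks for.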
block / rescue / always
The block construct is Ansible's equivalent of a try/catch/finally pattern. It groups a set of tasks together and lets you define what happens if any of them fail (rescue) and what always runs regardless (always). This is the right tool for implementing rollback logic, cleanup tasks, and notifications when a deployment fails.
---
- name: Deploy application with rollback on failure
  hosts: appservers
  become: true
  tasks:
    - name: Run deployment with error handling
      block:
        # All tasks in block run sequentially
        - name: Back up current application
          ansible.builtin.copy:
            src: "{{ deploy_dir }}/current/"
            dest: "{{ deploy_dir }}/rollback_backup/"
            remote_src: true

        - name: Deploy new application version
          ansible.builtin.unarchive:
            src: "{{ release_archive }}"
            dest: "{{ deploy_dir }}/releases/{{ version }}"
            remote_src: false

        - name: Update current symlink to new release
          ansible.builtin.file:
            src: "{{ deploy_dir }}/releases/{{ version }}"
            dest: "{{ deploy_dir }}/current"
            state: link
            force: true

        - name: Run database migrations
          ansible.builtin.command:
            cmd: python manage.py migrate --no-input
            chdir: "{{ deploy_dir }}/current"
          register: migration_result

        - name: Restart application service
          ansible.builtin.service:
            name: myapp
            state: restarted

      rescue:
        # Only runs if ANY task in block fails
        - name: Log the deployment failure
          ansible.builtin.debug:
            msg: "Deployment failed! Rolling back to previous version."

        - name: Restore previous release symlink
          ansible.builtin.file:
            src: "{{ deploy_dir }}/rollback_backup/"
            dest: "{{ deploy_dir }}/current"
            state: link
            force: true

        - name: Restart application with previous version
          ansible.builtin.service:
            name: myapp
            state: restarted

        - name: Send failure notification
          ansible.builtin.uri:
            url: "{{ slack_webhook_url }}"
            method: POST
            body_format: json
            body:
              text: "Deployment of {{ version }} FAILED on {{ inventory_hostname }}"

      always:
        # Runs regardless of success or failure
        - name: Clean up temporary deployment files
          ansible.builtin.file:
            path: "{{ deploy_dir }}/tmp_deploy"
            state: absent

        - name: Record deployment outcome
          ansible.builtin.lineinfile:
            path: /var/log/deploy_history.log
            # ansible_failed_task is only defined after a task in a block has failed,
            # so test whether it is defined rather than evaluating it directly
            line: "{{ ansible_date_time.iso8601 }} - {{ version }} - {{ 'FAILED' if ansible_failed_task is defined else 'SUCCESS' }}"
            create: true
# Scenario: migration task fails mid-deployment
TASK [Gathering Facts] ********************************************************
ok: [appserver01]
TASK [Back up current application] ********************************************
changed: [appserver01]
TASK [Deploy new application version] *****************************************
changed: [appserver01]
TASK [Update current symlink to new release] **********************************
changed: [appserver01]
TASK [Run database migrations] ************************************************
fatal: [appserver01]: FAILED! => {"rc": 1, "stderr": "django.db.utils.OperationalError: table locked"}
TASK [Log the deployment failure] *********************************************
ok: [appserver01] => {"msg": "Deployment failed! Rolling back to previous version."}
TASK [Restore previous release symlink] ***************************************
changed: [appserver01]
TASK [Restart application with previous version] ******************************
changed: [appserver01]
TASK [Send failure notification] **********************************************
ok: [appserver01]
TASK [Clean up temporary deployment files] ************************************
ok: [appserver01] <-- always runs
TASK [Record deployment outcome] **********************************************
changed: [appserver01] <-- always runs
PLAY RECAP ********************************************************************
appserver01 : ok=10 changed=6 unreachable=0 failed=0 skipped=0 rescued=1 ignored=0
What just happened?
The migration task failed, which triggered the rescue section: rollback, restart, and Slack notification all ran automatically. The always section then ran cleanup and logging regardless. Notice the play recap shows failed=0 and rescued=1: the rescue section handled the error cleanly, so the overall play is considered to have completed without unhandled failures. The application was restored to the previous version with zero manual intervention.
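Because a successful rescue leaves failed=0, a CI/CD pipeline watching the exit code will see the run as successful. If the pipeline should still stop after a rollback, one option is to end the rescue section with an explicit failure. The sketch below is an assumption about how that could look, reusing the deploy_dir and version variables from the playbook above; the rollback tasks themselves are elided.

```yaml
- name: Deploy with rollback, then report the failure to the caller
  block:
    - name: Run database migrations
      ansible.builtin.command:
        cmd: python manage.py migrate --no-input
        chdir: "{{ deploy_dir }}/current"

  rescue:
    - name: Show which task failed and why
      ansible.builtin.debug:
        msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message returned') }}"

    # Rollback tasks would go here, as in the full playbook above.

    - name: Mark the host as failed once the rollback is done
      ansible.builtin.fail:
        msg: "Deployment of {{ version }} was rolled back on {{ inventory_hostname }}"
```

A task that fails inside rescue leaves the host in a failed state, so ansible-playbook exits non-zero. Any always section attached to the block still runs before the host stops.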
any_errors_fatal and max_fail_percentage
By default, a failure on one host does not stop the play for other hosts. Sometimes this is exactly what you want: continue on healthy hosts and report the failures at the end. Other times, a single failure should immediately abort the entire play to prevent cascading damage. any_errors_fatal and max_fail_percentage control this threshold.
# Stop immediately if ANY host fails (zero tolerance)
- name: Apply critical security patches
  hosts: all
  become: true
  any_errors_fatal: true  # one failure = abort everything
  tasks:
    - name: Apply kernel security patch
      ansible.builtin.package:
        name: linux-image-generic
        state: latest
---
# Allow up to 20% of hosts to fail before aborting (rolling tolerance)
- name: Deploy application update
  hosts: webservers
  become: true
  serial: 5                # process 5 hosts at a time
  max_fail_percentage: 20  # abort if more than 20% fail
  tasks:
    - name: Deploy new release
      ansible.builtin.unarchive:
        src: release.tar.gz
        dest: /var/www/app
        remote_src: false
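The threshold is evaluated per batch and must be exceeded, not merely reached, which is easy to misjudge. The following sketch works through the arithmetic in comments; the canary-style serial list is an assumption added for illustration and is not part of the lesson's playbook.

```yaml
- name: Deploy application update in growing batches
  hosts: webservers
  become: true
  serial:
    - 1   # canary batch: 1 failure out of 1 host is 100%, above 20%, so the play aborts
    - 5   # later batches of 5: 1 failure is 20% (not exceeded, play continues),
          # 2 failures are 40% (exceeded, play aborts)
  max_fail_percentage: 20
  tasks:
    - name: Deploy new release
      ansible.builtin.unarchive:
        src: release.tar.gz
        dest: /var/www/app
        remote_src: false
```

When the serial list is shorter than the number of batches needed, the last batch size is reused for the remaining hosts.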
Retrying Tasks
Some failures are transient: a service takes a few seconds to become healthy after starting, an API endpoint is temporarily unavailable, or a database is momentarily locked. The retries and until attributes let a task keep trying until a condition is met, without failing permanently on a transient error.
# Wait for a service to become healthy after starting
- name: Wait for application to become available
  ansible.builtin.uri:
    url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
    status_code: 200
  register: health_response
  retries: 10  # retry up to 10 times after the first attempt
  delay: 6     # wait 6 seconds between attempts
  until: health_response.status == 200
  # Total delay: up to 10 * 6 = 60 seconds, plus the time each request takes

# Wait for a port to open after service start
- name: Wait for database port to be accepting connections
  ansible.builtin.wait_for:
    host: "{{ db_host }}"
    port: 5432
    timeout: 60  # wait up to 60 seconds total
    state: started

# Retry a flaky command
- name: Download release archive (retry on network failure)
  ansible.builtin.get_url:
    url: "{{ release_url }}"
    dest: /tmp/release.tar.gz
    checksum: "sha256:{{ release_checksum }}"
  register: download_result
  retries: 3
  delay: 10
  until: download_result is succeeded
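The momentarily locked database mentioned above can be handled the same way. This is a sketch under two assumptions: the migration command is safe to run more than once, and the retry count and delay are placeholder values to tune for your environment.

```yaml
- name: Run database migrations, retrying while the database is locked
  ansible.builtin.command:
    cmd: python manage.py migrate --no-input
    chdir: "{{ deploy_dir }}/current"
  register: migration_result
  retries: 5   # placeholder value
  delay: 15    # placeholder value, in seconds
  until: migration_result.rc == 0
```

If the condition is still false after the last retry, the task fails as usual, so it can sit inside a block and trigger rescue like any other task.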
Error Handling Decision Guide
There are five different scenarios that call for different error handling approaches. Use this table to choose the right tool for each situation.
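Scenario → Correct tool
A failure is expected and harmless (stopping a service that may not exist on first deploy) → ignore_errors: true. Skip the error and continue. Document why in a comment.
A non-zero exit code is not a real error (pgrep returning 1 for "not found") → failed_when. Define the real failure condition based on exit code or output content.
A multi-step deployment must roll back if any step fails → block / rescue / always. Group the deployment steps in block and put rollback in rescue.
A single host failure must stop the play on every host (critical security patching) → any_errors_fatal: true. Zero-tolerance abort at the play level.
A service or endpoint needs time to become ready → retries + until + delay. Poll until healthy rather than failing immediately.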
Never Use ignore_errors as a Substitute for Proper Error Handling
ignore_errors: true is the Ansible equivalent of catching an exception and doing nothing with it: the failure is real but now invisible. A play that uses ignore_errors on many tasks can report failed=0 in the recap while leaving the system in a broken state. Every use of ignore_errors must be accompanied by a comment explaining exactly why the failure is expected and acceptable. If you cannot write that comment, the right tool is failed_when or block/rescue, not ignore_errors.
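For the "service may not exist yet" case used earlier in this lesson, the error can often be avoided instead of ignored. The sketch below assumes a systemd host, where ansible.builtin.service_facts lists units under names such as myapp.service; on other init systems the key may differ.

```yaml
- name: Collect facts about the services on this host
  ansible.builtin.service_facts:

- name: Stop application only when its service exists
  ansible.builtin.service:
    name: myapp
    state: stopped
  when: "'myapp.service' in ansible_facts.services"
```

With this approach a failure of the stop task is always a real failure, so nothing needs to be suppressed.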
Key Takeaways
block / rescue / always is Ansible's try/catch/finally: use it for multi-step operations that need automatic rollback, cleanup, or failure notification when something goes wrong mid-deployment.
Prefer failed_when over ignore_errors where possible: it lets you precisely define what constitutes a real failure rather than blindly suppressing all errors from a task.
retries + until handles transient failures gracefully: poll for a health condition rather than failing immediately when a service takes time to become ready after starting.
ignore_errors: true needs a comment explaining why: if you cannot justify the suppression in writing, you need a different error handling approach.
Teacher's Note
Take the Lesson 13 deployment playbook and wrap all its tasks in a block with a rescue that prints a failure message and an always that cleans up any temp files. Deliberately break one task and watch the rescue run. This exercise makes the block/rescue/always structure click faster than any amount of reading.
Practice Questions
1. In a block / rescue / always structure, which section runs only when a task inside block fails?
2. Which play-level attribute stops the entire play immediately if any single host fails, regardless of how many other hosts are still healthy?
3. When using retries to poll for a health condition, which attribute specifies the condition that must be true for the task to stop retrying and succeed?
Quiz
1. A block has three tasks. Task 2 fails. The rescue section runs successfully. Does the always section run?
2. A task runs grep ERROR /var/log/app.log. It exits with code 1 when no errors are found, which Ansible treats as a failure. How do you handle this correctly?
3. A play recap shows failed=0 rescued=1. What does this mean?
Up Next · Lesson 22
Tags and Selective Execution
Learn to run only the parts of a playbook you need — tagging tasks, running subsets, skipping sections, and using tags to speed up iterative development and targeted production changes.