Ansible Lesson 34 – Troubleshooting | Dataplexa
Section III · Lesson 34

Troubleshooting Ansible

In this lesson

Verbosity levels · debug & assert · Step & check mode · Common error patterns · Connection issues

Troubleshooting Ansible means knowing exactly which tool to reach for when a playbook fails — whether the problem is a connection error, a variable resolving to the wrong value, a task reporting success while leaving the system broken, or an error message too opaque to diagnose at face value. Ansible ships with a complete debugging toolkit: verbosity flags, the debug module, assert for self-documenting precondition checks, step mode for interactive task-by-task execution, and --check --diff for non-destructive pre-flight inspection. This lesson maps each tool to the failure scenarios it is best suited to diagnose.

Verbosity Levels

-v is the first tool to reach for when a playbook fails with an opaque error. Each additional v adds a layer of detail — start at -v and increase until you see what you need.

Verbosity flag reference

-v Task return values — stdout, stderr, rc, and module-specific output. Best for understanding why a task reported changed or failed.
-vv Task input arguments — the resolved values of every parameter passed to the module. Use when a variable is not resolving to the expected value.
-vvv SSH connection details — exact host, user, key file, and connection parameters. Essential for diagnosing SSH failures and wrong key files.
-vvvv Connection plugin internals and raw bytes. Rarely needed — if you reach this level without finding the problem, the issue is likely in the network layer or Python environment.
ansible-playbook site.yml -v         # task return values
ansible-playbook site.yml -vv        # + resolved task arguments
ansible-playbook site.yml -vvv       # + SSH connection details
ansible-playbook site.yml -vvvv      # + connection internals

# Combine with --limit to reduce noise
ansible-playbook site.yml -vvv --limit web01.example.com

debug and assert

The debug module prints variable values at runtime without affecting state. The assert module fails the play with a clear custom message when a condition is not met — turning silent assumption violations into loud, diagnosable errors.

- name: Show resolved variable values
  ansible.builtin.debug:
    msg: "Deploying {{ app_version }} to {{ inventory_hostname }} ({{ environment }})"

- name: Inspect full registered result structure
  ansible.builtin.debug:
    var: migration_result       # shows rc, stdout, stderr, etc.

- name: Debug output only in verbose mode
  ansible.builtin.debug:
    var: db_config
    verbosity: 2                # only shown at -vv and above

- name: Assert minimum Ansible version
  ansible.builtin.assert:
    that: "ansible_version.full is version('2.14', '>=')"
    fail_msg: "Ansible 2.14+ required. Found: {{ ansible_version.full }}"
    success_msg: "Ansible version OK: {{ ansible_version.full }}"

- name: Assert required deployment variables are set
  ansible.builtin.assert:
    that:
      - app_version is defined and app_version | length > 0
      - environment in ['staging', 'production']
      - db_host is defined
    fail_msg: >
      Missing or invalid required variables.
      app_version={{ app_version | default('UNDEFINED') }},
      environment={{ environment | default('UNDEFINED') }}
TASK [Assert minimum Ansible version] *****************************************
ok: [localhost] => {
    "msg": "Ansible version OK: 2.16.3"
}

TASK [Assert required deployment variables are set] ***************************
fatal: [localhost]: FAILED! => {
    "assertion": "environment in ['staging', 'production']",
    "evaluated_to": false,
    "msg": "Missing or invalid required variables. app_version=2.4.1, environment=UNDEFINED"
}

PLAY RECAP ********************************************************************
localhost   : ok=1  changed=0  unreachable=0  failed=1

What just happened?

The assert task pinpointed the exact failing condition and printed the resolved variable values. Without assert this would have caused a confusing downstream failure tens of tasks later. Placed in pre_tasks, it fires before any infrastructure change is made — the cheapest possible point to catch a misconfiguration.

The Pre-Flight Checklist Analogy

A pilot does not discover a missing instrument mid-flight. assert tasks in pre_tasks are the pre-flight checklist — they catch problems on the ground before takeoff, when the cost is zero. The same discovery after 40 tasks have partially modified production costs hours of recovery.

Step Mode and Check Mode

--step (interactive)
Pauses before each task — y to run, n to skip, c to continue without pausing
Use when you need to inspect state between specific tasks in a terminal session
--check --diff (dry run)
--check shows what would change without applying; --diff shows exact file content changes
Safe for CI pipelines — produces no changes on managed nodes
ansible-playbook site.yml --step                    # interactive task-by-task
ansible-playbook site.yml --check                   # dry run — no changes applied
ansible-playbook site.yml --check --diff            # dry run + exact file diffs
ansible-playbook site.yml --start-at-task "Deploy Nginx configuration"
ansible-playbook site.yml --tags config --step
# ansible-playbook site.yml --check --diff

TASK [Deploy Nginx configuration] *********************************************
--- before: /etc/nginx/nginx.conf
+++ after: /tmp/ansible-managed-nginx.conf
@@ -3,7 +3,7 @@
 worker_processes 4;

-worker_connections 1024;
+worker_connections 2048;

changed: [web01.example.com]   <-- would change this file (not applied)

What just happened?

--check --diff showed exactly which line in nginx.conf would change — worker_connections from 1024 to 2048. No file was modified. Run this before every production deployment: if the diff contains anything unexpected, stop and investigate before applying.

Common Error Patterns and Fixes

Error → Cause → Fix

"mapping values are not allowed here"

Cause: Jinja2 expression not quoted

YAML's parser tries to interpret {{ variable }} as a flow mapping. Fix: wrap in double quotes — value: "{{ variable }}"
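A minimal reproduction of the error and its fix (the variable and task names are illustrative):

```yaml
# BROKEN: YAML sees the leading "{" as the start of a flow mapping
# and fails with "mapping values are not allowed here":
#   deploy_target: {{ target_host }}

# FIXED: quoting makes the value a plain string for YAML;
# Jinja2 still resolves it at runtime:
- name: Record the deployment target
  ansible.builtin.set_fact:
    deploy_target: "{{ target_host }}"
```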

"UNREACHABLE! ... Connection timed out"

Cause: Network or firewall blocking SSH

Verify with ssh -i keyfile user@host directly. Check firewall rules and security groups. Run with -vvv to see the exact connection attempt.

"Permission denied (publickey)"

Cause: Wrong SSH key or user

Check remote_user and private_key_file in ansible.cfg. Confirm the public key is in ~/.ssh/authorized_keys on the managed node. Run -vvv to see which key Ansible is using.

"The variable ... is undefined"

Cause: Variable not defined in any scope

Add debug: var: hostvars[inventory_hostname] before the failing task. Check spelling — variable names are case-sensitive. Confirm the right group_vars file is loaded for this host.
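Both diagnostics from the fix above, as a sketch (db_host is borrowed from this lesson's earlier examples):

```yaml
# Dump every variable Ansible knows about for this host:
- name: Show all variables in scope
  ansible.builtin.debug:
    var: hostvars[inventory_hostname]

# Or inspect one suspect variable with a safe default, so the
# debug task itself cannot fail on an undefined name:
- name: Show the suspect variable
  ansible.builtin.debug:
    msg: "db_host resolves to: {{ db_host | default('UNDEFINED') }}"
```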

"sudo: a password is required"

Cause: become configured but sudo needs a password

Add NOPASSWD: ALL to the sudoers entry, or pass the sudo password with --ask-become-pass / -K. Check that pipelining = True is not set while requiretty is still active.
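Both fixes, sketched as commands (the sudoers line assumes the remote user is named ansible):

```shell
# Option 1: prompt for the sudo password at run time
ansible-playbook site.yml --ask-become-pass    # short form: -K

# Option 2: grant passwordless sudo on the managed node.
# Add this line via visudo (assumes the remote user is "ansible"):
#   ansible ALL=(ALL) NOPASSWD: ALL
```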

Task reports "ok" but system is still wrong

Cause: Shell/command task not idempotent

ansible.builtin.shell reports changed on every run and fails only on a non-zero exit code, so a command can exit 0 without achieving its goal. Use the right module for the job, or add changed_when and failed_when conditions. Verify managed node state manually with an ad-hoc command.
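A sketch of the changed_when / failed_when fix (the command path and match strings are illustrative):

```yaml
# Without these conditions the shell task would report "changed" on
# every run and fail only on a non-zero exit code.
- name: Apply pending database migrations
  ansible.builtin.shell: /opt/app/bin/migrate --apply
  register: migrate_out
  changed_when: "'applied' in migrate_out.stdout"
  failed_when: migrate_out.rc != 0 or 'ERROR' in migrate_out.stderr
```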

"found character '\t' that cannot start any token"

Cause: Tab characters in YAML

YAML forbids tabs for indentation. Run yamllint playbook.yml to find them. Configure your editor to insert spaces on Tab for YAML files.
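A quick way to confirm the diagnosis without yamllint, assuming a bash-compatible shell (the file path is illustrative):

```shell
# Create a YAML file with a tab on line 2, then locate it:
printf 'key:\n\tvalue: 1\n' > /tmp/tabbed.yml
grep -n "$(printf '\t')" /tmp/tabbed.yml     # reports line 2
# yamllint /tmp/tabbed.yml flags the same line as a syntax error
```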

Diagnosing Connection Issues

# Step 1 — confirm network reachability
ping 192.168.1.10

# Step 2 — confirm SSH works with the same credentials Ansible would use
ssh -i ~/.ssh/id_ed25519 -l ansible 192.168.1.10

# Step 3 — confirm sudo works
ssh -i ~/.ssh/id_ed25519 -l ansible 192.168.1.10 "sudo whoami"

# Step 4 — run Ansible's connectivity test
ansible web01 -m ansible.builtin.ping -vvv

# Step 5 — test with explicit credentials to isolate ansible.cfg issues
ansible web01 -m ansible.builtin.ping \
  -u ansible --private-key ~/.ssh/id_ed25519 -vvv

# Step 6 — confirm which ansible.cfg is active
ansible --version   # shows "config file" path
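To go one step beyond --version, ansible-config can show every setting that deviates from the defaults, along with the file each value came from:

```shell
# Show which ansible.cfg is active:
ansible --version | grep 'config file'

# Dump only the settings that differ from the defaults, with their source:
ansible-config dump --only-changed
```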

The Defensive Playbook Pattern

---
- name: Deployment with defensive checks
  hosts: appservers
  become: true

  pre_tasks:
    - name: Verify Ansible version
      ansible.builtin.assert:
        that: "ansible_version.full is version('2.14', '>=')"
        fail_msg: "Ansible 2.14+ required. Found {{ ansible_version.full }}"

    - name: Verify required variables
      ansible.builtin.assert:
        that:
          - app_version is defined and app_version | length > 0
          - environment in ['staging', 'production']
        fail_msg: "Required variable missing or invalid — check group_vars"

    - name: Verify SSH connectivity
      ansible.builtin.ping:

    - name: Print resolved config (debug mode only)
      ansible.builtin.debug:
        msg:
          version: "{{ app_version }}"
          environment: "{{ environment }}"
          db_host: "{{ db_host }}"
      tags: [never, debug]    # only runs with --tags debug

  roles:
    - role: app_deploy

Never Add ignore_errors to Silence a Problem You Have Not Diagnosed

Adding ignore_errors: true to a failing task to make the play complete hides a real problem behind a false success. The recap shows failed=0 while the system is broken. Always diagnose the root cause first. ignore_errors is for expected failures that are genuinely acceptable — not for silencing errors you do not understand.
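For contrast, a sketch of a legitimate use, where the failure is an expected outcome that a later task handles explicitly (commands and paths are illustrative):

```yaml
- name: Check whether the schema exists (non-zero exit is expected on first run)
  ansible.builtin.command: /opt/app/bin/check-schema
  register: schema_check
  ignore_errors: true          # an expected, handled failure, not a silenced bug

- name: Initialize the schema only when the check reported it missing
  ansible.builtin.command: /opt/app/bin/init-schema
  when: schema_check.rc != 0
```

In many cases failed_when: false on the check task is cleaner still, since the recap then never shows a red failure at all.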

Key Takeaways

Start at -v and increase until you see the problem: -v for return values, -vv for resolved arguments, -vvv for SSH details. Combine with --limit to reduce noise.
Use assert in pre_tasks to catch missing variables and wrong environments before any infrastructure change — the cheapest point to detect misconfiguration.
Always run --check --diff before production — the diff shows exactly which files would change. If anything is unexpected, stop and investigate.
Tag debug tasks [never, debug] — they stay in the playbook permanently but only run with --tags debug. Free diagnostics with zero normal-run overhead.
Diagnose connection failures layer by layer — ping, then SSH manually, then sudo, then ansible -m ping -vvv. Each step narrows the problem to one specific layer.

Teacher's Note

Add pre_tasks with assert checks to your Lesson 13 playbook — Ansible version, required variables, and a ping. Then deliberately break one assertion and confirm it fires before any task runs. That exercise makes pre-flight assertions a habit rather than an afterthought.

Practice Questions

1. A task fails with "Permission denied (publickey)". Which verbosity level shows the exact SSH command and key file Ansible is using?



2. Before applying a template change to production, which two flags show the exact lines that would change in every managed file — without making modifications?



3. In which play section should assert tasks be placed so they run before any roles — catching misconfiguration at the earliest possible point?



Quiz

1. A task fails with "mapping values are not allowed here" on a line containing value: {{ variable }}. What is the cause and fix?


2. A variable is resolving to the wrong value on one host. What is the fastest way to see all variables in scope for that host at the point of failure?


3. A 30-task playbook fails at task 28. You have fixed the issue and want to re-run from task 28 without repeating the first 27 tasks. Which flag achieves this?


Up Next · Lesson 35

Performance Optimization

Learn to dramatically cut playbook run times — SSH pipelining, fact caching, async tasks, callback profiling, and tuning Ansible for inventories of hundreds of hosts.