Ansible Lesson 25 – Playbook Best Practices | Dataplexa
Section II · Lesson 25

Playbook Best Practices

In this lesson

Project structure
Naming conventions
Performance habits
Security in playbooks
Section II recap

Playbook best practices are the habits, conventions, and structural decisions that make the difference between automation that works once and automation that a team can maintain, extend, and trust for years. This lesson consolidates the most important practical lessons from Section II into a definitive reference — covering project structure at scale, naming conventions that make playbooks self-documenting, performance optimisations that matter on large fleets, security practices that prevent credential leaks, and the Section II recap that closes out the playbook fundamentals phase of this course. Nothing in this lesson is new theory — it is applied wisdom from everything you have built in Lessons 11 through 24.

Production Project Structure

A production Ansible project is larger than the practice structures used in Lessons 10 and 13. It needs to support multiple environments, a team of engineers, CI/CD integration, and a growing library of roles. The structure below scales from a two-person startup to a 50-engineer platform team without reorganisation.

myproject/
├── ansible.cfg                    # project-level config — committed to Git
├── site.yml                       # master playbook — provisions the full stack
├── deploy.yml                     # deployment-only playbook (no provisioning)
│
├── inventory/
│   ├── staging/
│   │   ├── hosts.ini              # staging host list
│   │   └── group_vars/
│   │       ├── all.yml            # env: staging, log_level: debug
│   │       ├── webservers.yml     # staging nginx settings
│   │       └── databases.yml     # staging postgresql settings
│   └── production/
│       ├── hosts.ini              # production host list
│       └── group_vars/
│           ├── all.yml            # env: production, log_level: warn
│           ├── webservers.yml     # production nginx settings
│           └── databases.yml     # production postgresql settings
│
├── roles/
│   ├── common/                    # base config for every host
│   ├── nginx/                     # web server role
│   ├── postgresql/                # database role
│   └── app_deploy/                # application deployment role
│
├── playbooks/
│   ├── hardening.yml              # security hardening (run separately)
│   ├── monitoring.yml             # monitoring agent setup
│   └── rotate_secrets.yml        # credential rotation (tagged never)
│
├── files/                         # static files used across roles
├── templates/                     # project-level templates (rarely needed)
│
├── .vault_pass                    # NEVER COMMIT — listed in .gitignore
├── .gitignore                     # must include .vault_pass and *.retry
└── README.md                      # how to run this project — always maintained
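The tree lists site.yml as the master playbook that provisions the full stack. A minimal sketch of what that file might contain is shown here; the group names (webservers, databases) are taken from the group_vars filenames above, and the mapping of roles to groups is an assumption made for illustration.

```yaml
---
# site.yml: hypothetical sketch of the master playbook.
# Role-to-group mapping is assumed; adjust it to match your hosts.ini.

- name: Apply base configuration to every host
  hosts: all
  become: false
  roles:
    - role: common
      become: true     # escalate for the role, not the whole play

- name: Configure web servers
  hosts: webservers
  become: false
  roles:
    - role: nginx
      become: true

- name: Configure database servers
  hosts: databases
  become: false
  roles:
    - role: postgresql
      become: true
```

The same file serves both environments: only the inventory passed with -i changes, for example ansible-playbook -i inventory/staging/hosts.ini site.yml.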

Three non-negotiable project files

ansible.cfg in the project root ensures every engineer uses the same configuration regardless of what is in their home directory. .gitignore must list .vault_pass, *.retry, and any other files that must never be committed. README.md must be current — if it does not describe how to run the project and which inventories exist, the next engineer will spend an hour figuring out what you spent five minutes setting up.

Naming Conventions

Consistent naming is not cosmetic: it is the difference between a playbook that documents itself and one that requires an expert to interpret. These five conventions appear in most high-quality Ansible projects, and ansible-lint can check several of them automatically, including its task naming and variable naming rules.

Convention 1

Task names are imperative verb phrases

Install Nginx web server, Deploy application config, Create deploy user. Every task name reads like an instruction. Never: nginx, package, or done.

Convention 2

Variables are lowercase_with_underscores

nginx_port, db_max_connections, deploy_user. Never camelCase, never UPPER_CASE for user-defined variables. Ansible's built-in facts use ansible_*.

Convention 3

Role variables are prefixed with the role name

nginx_port not port, postgresql_version not version. This prevents name collisions when multiple roles are applied in the same play.

Convention 4

Playbooks are named by function, not by host

deploy.yml, hardening.yml, rotate_secrets.yml. Never web01.yml or server.yml. Playbooks describe actions, not targets.

Convention 5

Template files use the destination filename plus .j2

The template for /etc/nginx/nginx.conf is named nginx.conf.j2. The template for /etc/postgresql/15/main/postgresql.conf is named postgresql.conf.j2. The naming makes it immediately clear what the template renders into, and it makes grep and IDE search useful.
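The conventions are easier to see side by side. The sketch below shows two files from a hypothetical nginx role; the variable values and file ownership settings are illustrative assumptions, not values from this course.

```yaml
---
# roles/nginx/defaults/main.yml (hypothetical values)
# Conventions 2 and 3: lowercase_with_underscores, prefixed with the role name.
nginx_port: 80
nginx_worker_processes: 4

---
# roles/nginx/tasks/main.yml
# Convention 1: every task name is an imperative verb phrase.
- name: Install Nginx web server
  ansible.builtin.package:
    name: nginx
    state: present

# Convention 5: nginx.conf.j2 renders into nginx.conf.
- name: Deploy Nginx main configuration
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: "0644"
```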

Performance Habits

On a 5-host lab, performance rarely matters. On a 200-host production fleet, a 10-second-per-host overhead adds up to about 33 minutes of cumulative host time (200 × 10 s = 2,000 s), and still close to 7 minutes of wall-clock time at the default of 5 forks. Taken together, these habits can cut playbook run times by 30–60% on large inventories with no change to correctness.

Habit 1

Enable SSH pipelining

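Set pipelining = True in ansible.cfg under [ssh_connection]. Without pipelining, Ansible copies each module to a temporary file on the managed node and then executes it in a separate SSH operation. With pipelining, the module is piped straight to the remote interpreter, so each task needs fewer SSH operations and per-task overhead drops noticeably (roughly from ~150ms to ~30ms). When used with become, pipelining requires that requiretty is disabled in sudoers (Defaults !requiretty), which is the default on most modern distributions.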

Habit 2

Disable fact gathering on plays that do not need facts

Set gather_facts: false on any play whose tasks do not reference ansible_* variables. Fact gathering takes 1–3 seconds per host. On a 200-host fleet that is roughly 3–10 minutes of cumulative overhead (200 × 1–3 s, spread across however many forks run in parallel) for plays that never use a single fact.

Habit 3

Use native list passing for package installs

Pass a list directly to the name: parameter of ansible.builtin.package rather than looping over packages. The list installs all packages in a single module invocation, whereas a loop invokes the module once per package. For 10 packages, that is one invocation instead of ten.
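Habits 2 and 3 can be sketched together in one short play. The package names below are placeholders chosen for illustration.

```yaml
---
- name: Install baseline packages on web servers
  hosts: webservers
  become: false
  # Habit 2: no task below reads an ansible_* fact, so skip full fact gathering.
  # (package still detects the package manager on its own when facts are absent.)
  gather_facts: false

  tasks:
    - name: Install baseline packages in one module call
      ansible.builtin.package:
        name:            # Habit 3: pass a list instead of looping
          - git
          - curl
          - unzip
        state: present
      become: true
```

The loop-based equivalent would set name: "{{ item }}" and add loop: with the same three entries, which runs the module three times per host instead of once.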

Habit 4

Increase forks for large fleets

The default fork count of 5 means Ansible processes 5 hosts at a time. Set forks = 20 (or higher) in ansible.cfg for large inventories. The right value depends on your control node's CPU count and network bandwidth — test and tune, but 20 is a safe starting point for most environments.

Habit 5

Cache facts with fact caching

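Enable fact_caching = jsonfile with a reasonable fact_caching_timeout in ansible.cfg, and set gathering = smart so that Ansible skips the setup module for hosts whose cached facts are still fresh. With the default gathering policy, facts are gathered again on every play even when a cache is configured. Cached facts then eliminate the setup round-trip on subsequent runs within the timeout window, which is ideal for CI pipelines that run multiple playbooks against the same hosts in sequence.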

Performance-optimised ansible.cfg

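[defaults]
inventory         = ./inventory/production
remote_user       = ansible
# parallel host count: tune to your infrastructure
forks             = 20
# reuse cached facts instead of re-running setup on every play
gathering         = smart
fact_caching      = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# cache facts for 1 hour
fact_caching_timeout    = 3600
# YAML-formatted task results (ansible-core 2.13+)
callback_result_format = yaml
# per-task timing (timer and profile_tasks ship in the ansible.posix collection)
callbacks_enabled = timer, profile_tasks

[ssh_connection]
# pipe modules to the remote interpreter instead of copying them first
pipelining        = True
# connection multiplexing: reuse one SSH connection across tasks
ssh_args          = -C -o ControlMaster=auto -o ControlPersist=60s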

Security in Playbooks

Security is not a feature you add to automation — it is a property you lose if you are not deliberate about it from the start. These six practices are the minimum baseline for any Ansible project that handles real credentials or runs against production infrastructure.

🔐

Encrypt all secrets with Ansible Vault

Every password, API key, and private key that must be committed to Git must be encrypted with ansible-vault encrypt_string or stored in an encrypted vault file. The vault password itself goes in .vault_pass — which is listed in .gitignore and never committed. Covered in depth in Lesson 28.

🙈

Use no_log: true on every task that handles credentials

Any task whose arguments contain a password, token, or private key must set no_log: true — otherwise the value appears in terminal output, CI logs, and any log aggregation systems connected to your automation pipeline.
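As a minimal sketch, the task below sets a password on a user account without exposing it in output. The variable vault_deploy_password is a hypothetical name and is assumed to come from a vault-encrypted vars file.

```yaml
- name: Create deploy user with a password
  ansible.builtin.user:
    name: deploy
    # vault_deploy_password is a hypothetical vaulted variable
    password: "{{ vault_deploy_password | password_hash('sha512') }}"
    update_password: on_create   # do not rewrite the hash on every run
  become: true
  no_log: true                   # keep the task arguments out of logs
```

With no_log: true the task result is censored in the output, which also hides useful debugging detail, so apply it to the specific tasks that carry secrets rather than to whole plays.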

🔑

Use SSH key authentication — never password authentication

Configure private_key_file in ansible.cfg and disable SSH password authentication on all managed nodes. Password-based SSH is vulnerable to brute force and creates audit trail gaps. Key-based auth is both more secure and more convenient in automation.

Validate config files before writing with validate:

Use the validate parameter on template, copy, and lineinfile tasks that modify critical config files — /etc/sudoers, /etc/ssh/sshd_config, nginx.conf. A syntax error in any of these files can lock you out of the server entirely.
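The two tasks below sketch how validate can guard sshd_config and a sudoers drop-in. The template name, the sudoers rule, and the binary paths are assumptions for illustration; paths in particular vary by distribution.

```yaml
- name: Deploy sshd configuration
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    # %s is replaced with the path of the temporary file being checked
    validate: /usr/sbin/sshd -t -f %s
  become: true

- name: Allow deploy user to restart the application
  ansible.builtin.copy:
    content: "deploy ALL=(root) NOPASSWD: /usr/bin/systemctl restart myapp\n"
    dest: /etc/sudoers.d/deploy
    owner: root
    group: root
    mode: "0440"
    validate: /usr/sbin/visudo -cf %s
  become: true
```

If the validation command exits non-zero, the task fails and the existing file on the managed node is left in place.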

🔒

Apply least-privilege become

Set become: false at the play level and only escalate to become: true at the task level for tasks that genuinely need root. Running an entire play as root when only 3 out of 20 tasks need it violates least-privilege and increases the blast radius of any task error.

📋

Require peer review for playbooks that run against production

Any playbook that modifies production infrastructure should go through a pull request review before it is run. The review forces a second pair of eyes on the --check --diff output and catches logic errors that the author missed. This is a process control, not a technical one — but it prevents more incidents than any technical safeguard.

The Idiomatic Playbook

Most of the practices from Lessons 11 through 24 appear in the following playbook. It is not a complete production playbook. It is a reference showing the major conventions used correctly in context. Read it top to bottom as a checklist of habits.

---
# deploy.yml: application deployment playbook
# Usage: ansible-playbook deploy.yml -i inventory/production/ -e "version=2.4.1"
# Tags:  config, nginx, deploy

- name: Deploy application to web fleet
  hosts: webservers
  become: false          # least-privilege: only escalate per task
  gather_facts: true     # needed: ansible_default_ipv4, ansible_date_time
  serial: "25%"          # rolling update, 25% of hosts at a time
  max_fail_percentage: 0 # abort if any host fails

  vars_files:
    - vars/app.yml       # non-secret app variables
    - vars/secrets.yml   # vault-encrypted credentials

  pre_tasks:
    - name: Verify minimum Ansible version
      ansible.builtin.assert:
        that: "ansible_version.full is version('2.14', '>=')"
        msg: "Ansible 2.14+ required — found {{ ansible_version.full }}"
      tags: always

  roles:
    - role: nginx
      become: true       # escalate for this role only
      tags: [config, nginx]

  tasks:
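    # unarchive requires the destination directory to exist
    - name: Create release directory
      ansible.builtin.file:
        path: "{{ app_deploy_dir }}/releases/{{ version }}"
        state: directory
        mode: "0755"
      become: true
      tags: deploy

    - name: Deploy application release archive
      ansible.builtin.unarchive:
        src: "releases/{{ version }}.tar.gz"
        dest: "{{ app_deploy_dir }}/releases/{{ version }}"
        remote_src: false
      become: true
      tags: deploy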

    - name: Update current symlink to new release
      ansible.builtin.file:
        src: "{{ app_deploy_dir }}/releases/{{ version }}"
        dest: "{{ app_deploy_dir }}/current"
        state: link
        force: true
      become: true
      tags: deploy
      notify: Restart application

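    # Handlers normally run after all tasks; flush them now so the
    # health check below tests the restarted application.
    - name: Restart application before the health check
      ansible.builtin.meta: flush_handlers
      tags: deploy

    - name: Verify application is healthy
      ansible.builtin.uri:
        url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
        status_code: 200
      register: health
      retries: 5
      delay: 6
      until: health.status == 200
      tags: deploy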

  post_tasks:
    - name: Record successful deployment
      ansible.builtin.lineinfile:
        path: /var/log/deploy_history.log
        line: "{{ ansible_date_time.iso8601 }} — {{ version }} — deployed by {{ lookup('env', 'USER') }}"
        create: true
      become: true
      tags: deploy       # record a deployment only when the deploy tasks ran

  handlers:
    # No tags here: handlers run whenever they are notified and ignore tags.
    - name: Restart application
      ansible.builtin.service:
        name: myapp
        state: restarted
      become: true

Section II Recap

You have completed Section II — Ansible Playbooks. This section took you from a blank .yml file to a full multi-role project with handlers, templates, loops, conditionals, tags, and error handling. Here is what each lesson contributed to your toolkit.

Section II — what you now know

Lesson 11 YAML — syntax, scalars, lists, mappings, indentation rules, quoting, multi-line strings
Lesson 12 Playbook anatomy — plays, tasks, attributes, running, play recap, multiple plays
Lesson 13 First playbook — incremental build, run, verify, idempotency test, extend with variables
Lesson 14 Tasks, handlers, variables — task attributes, variable scopes, set_fact, register, debug, changed_when, failed_when
Lesson 15 Facts — gathering, essential fact variables, facts in conditions and templates, custom facts
Lesson 16 Conditionals and loops — when, AND/OR/NOT/IN, loop, loop over mappings, loop_control, combining loops with conditions
Lesson 17 Jinja2 templates — delimiters, template module, filters, if/for control structures, groups and hostvars
Lesson 18 File and package management — file states, copy/fetch, lineinfile/blockinfile, package modules, version pinning
Lesson 19 Service management — service vs systemd, state values, unit file deployment, service_facts, rolling restarts
Lesson 20 User and permission management — user module, SSH key management, sudoers, ACLs
Lesson 21 Error handling — default behaviour, ignore_errors, failed_when, block/rescue/always, any_errors_fatal, retries
Lesson 22 Tags — assigning, --tags/--skip-tags, inheritance, special built-in tags, tag strategy
Lesson 23 Roles introduction — roles vs playbooks, directory structure, creating a role, role dependencies
Lesson 24 Creating roles — full postgresql role, task splitting, multi-role site.yml, environment overrides, testing levels
Lesson 25 Best practices — project structure, naming conventions, performance habits, security in playbooks, the idiomatic playbook

Never Run a New Playbook Directly Against Production Without --check --diff First

This is the single rule that, if you follow it every time without exception, will prevent the most costly class of automation incident. Always run ansible-playbook --check --diff before applying any playbook change to the real fleet, ideally against a production-equivalent environment first. The diff output shows what would change in the files Ansible manages. Check mode is a simulation, though: tasks that use command or shell are skipped by default, and tasks that depend on the result of an earlier change may report inaccurately. If the diff contains anything unexpected, stop. If the diff looks correct, proceed. No exceptions, no matter how small the change appears.

Key Takeaways

Structure your project for a team, not just yourself — separate inventories by environment, role variables prefixed by role name, a maintained README, and a .gitignore that excludes secrets.
Enable SSH pipelining and raise forks above the default of 5 (20 is a reasonable starting point). These two configuration changes alone remove much of the avoidable overhead on medium and large fleets.
Every secret goes through Ansible Vault — no credential ever lives in plaintext in a file that is or could be committed to version control. No exceptions.
Use pre_tasks to assert preconditions — verify Ansible version, required variables, and environment assumptions before any task that could modify infrastructure runs.
The idiomatic playbook uses all the conventions at once — the reference playbook in this lesson is the standard you should aim for. When in doubt, check it.
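The pre_tasks takeaway extends naturally from version checks to required variables. The sketch below assumes the version and app_deploy_dir variables used by the idiomatic playbook; the failure message wording is illustrative.

```yaml
---
- name: Deploy application to web fleet
  hosts: webservers
  become: false
  gather_facts: false

  pre_tasks:
    - name: Verify required deployment variables are set
      ansible.builtin.assert:
        that:
          - version is defined
          - version | length > 0
          - app_deploy_dir is defined
        fail_msg: "Pass -e version=<release> and define app_deploy_dir before deploying"
        quiet: true
      tags: always
```

Because the assertion is tagged always and runs in pre_tasks, a missing variable stops the play before any role or task can modify a host.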

Teacher's Note

Before starting Section III, audit one of your own playbooks against the idiomatic reference in this lesson. Count how many of the conventions it follows and list the ones it does not. Then spend 20 minutes bringing it up to standard. This exercise consolidates everything from Section II more effectively than any amount of re-reading.

Practice Questions

1. Which ansible.cfg setting under [ssh_connection] reduces per-task SSH overhead by piping modules to the remote interpreter instead of copying them to the managed node first?



2. In the idiomatic playbook, which play section runs assertion tasks — such as verifying the Ansible version — before any roles or regular tasks execute?



3. You are writing an Nginx role and need a variable that sets the listening port. Following role naming conventions, what should this variable be called?



Quiz

1. A play has 20 tasks but only 4 of them require root access. What is the most secure way to configure privilege escalation?


2. A playbook runs against 150 hosts but seems to process them slowly, one batch at a time. The control node has 16 CPUs and plenty of bandwidth. What is the most likely cause and fix?


3. A template task deploys /etc/nginx/nginx.conf with validate: "nginx -t -c %s". The template has a syntax error. What happens?


Up Next · Lesson 26 — Section III

Ansible Galaxy

Section III begins now. Discover Ansible Galaxy — the community hub for sharing and downloading roles and collections — and learn how to use, publish, and manage Galaxy content in your own projects.