Ansible Course
Ansible at Scale
In this lesson
Ansible at scale means running automation reliably against hundreds or thousands of hosts — with dynamic inventories that stay current automatically, centralised execution platforms that provide access control and audit trails, pull-mode architectures where hosts retrieve and apply their own configuration, and fleet patterns that make large playbooks predictable and safe. The tools and patterns in this lesson are what separate a team running Ansible on a laptop against a dozen servers from a platform team operating Ansible across an entire enterprise fleet.
The Scale Problem
Running Ansible against 10 servers is straightforward. Running it against 2,000 servers introduces challenges that do not exist at small scale — and each one requires a specific tool or pattern to solve.
Inventory is always stale
Servers are created and destroyed constantly. A static hosts.ini is out of date within hours. Manually maintaining it across thousands of hosts is impossible.
Dynamic inventory plugins
Query your cloud provider, CMDB, or DNS at runtime to build the inventory fresh every run. AWS, GCP, Azure, and VMware all have first-class inventory plugins.
No access control or audit trail
Anyone with the playbook and SSH keys can run anything against production. No record of who ran what, when, or what changed.
AWX / Ansible Tower
Centralised execution platform with role-based access control, job templates, schedules, audit logs, and a web UI. The SSH keys never leave the platform.
Push mode doesn't scale to all hosts
A control node pushing to 5,000 hosts simultaneously hits network and CPU limits. Configuration drift on hosts that are temporarily unreachable goes undetected.
ansible-pull (pull mode)
Each host runs a cron job that pulls the latest playbook from Git and applies it locally. Scales to any fleet size; every host self-corrects drift on its own schedule.
Dynamic Inventory
Dynamic inventory plugins query an external source — AWS EC2, GCP, Azure, VMware, Terraform state — at playbook runtime and return a live host list grouped by tags, regions, instance types, or any attribute the source exposes. The playbook never needs updating when servers change.
# inventory/aws_ec2.yml — AWS EC2 dynamic inventory plugin
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - eu-west-1
# Filter to only running instances
filters:
  instance-state-name: running
# Group hosts by their EC2 tags automatically
keyed_groups:
  - key: tags.Role          # group by Role tag: webserver, database, cache
    prefix: role_
  - key: tags.Environment   # group by Environment tag: production, staging
    prefix: env_
  - key: placement.region   # group by AWS region
    prefix: region_
# Set the connection address (private IP for internal access)
hostnames:
  - private-ip-address
# Compose variables from EC2 metadata
compose:
  ansible_host: private_ip_address
  ec2_instance_type: instance_type
# List the dynamic inventory to verify groups and hosts
ansible-inventory -i inventory/aws_ec2.yml --list
ansible-inventory -i inventory/aws_ec2.yml --graph
# Run against a tag-based group — no hosts.ini needed
ansible-playbook site.yml -i inventory/aws_ec2.yml
# Target all production web servers (Role=webserver AND Environment=production)
ansible-playbook deploy.yml -i inventory/aws_ec2.yml \
--limit "role_webserver:&env_production"
# ansible-inventory -i inventory/aws_ec2.yml --graph
@all:
  |--@role_webserver:
  |  |--10.0.1.14
  |  |--10.0.1.22
  |  |--10.0.1.31
  |--@role_database:
  |  |--10.0.2.10
  |--@env_production:
  |  |--10.0.1.14
  |  |--10.0.1.22
  |  |--10.0.1.31
  |  |--10.0.2.10
  |--@region_us_east_1:
  |  |--10.0.1.14
  |  |--10.0.2.10
  |--@region_eu_west_1:
  |  |--10.0.1.22
  |  |--10.0.1.31
What just happened?
The inventory plugin queried AWS at runtime and returned all running EC2 instances, automatically grouped by their tags and region. No static file was involved — if a new instance with Role=webserver starts up, it appears in role_webserver on the next run without any manual update. The :& operator in --limit applies an AND filter — targeting only hosts that are in both groups simultaneously.
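The same intersection pattern works directly in a play's hosts: line, not just in --limit. A minimal sketch using the dynamic group names from above:

```yaml
# Target the intersection of two dynamic groups inside the play itself
- name: Deploy only to production web servers
  hosts: "role_webserver:&env_production"   # AND — hosts in both groups
  gather_facts: false
  tasks:
    - name: Show which hosts matched
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} is a production web server"
```

The same syntax also supports union (group1:group2) and exclusion (group1:!group2).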
The Live GPS Analogy
Static inventory is a printed paper map — accurate when it was made, increasingly wrong over time. Dynamic inventory is live GPS — it shows where everything actually is right now, updated at the moment you look. At small scale a paper map is fine. At scale, only GPS is reliable.
AWX and Ansible Tower
AWX is the open-source upstream project; Ansible Tower (now Red Hat Ansible Automation Platform) is the enterprise distribution. Both provide a web UI, REST API, and execution engine that runs playbooks centrally — with credentials stored securely, never exposed to individual engineers.
AWX / Tower capabilities
Role-based access control
Teams see and launch only the job templates they are granted — production credentials stay locked inside the platform.
Job templates
A saved combination of playbook, inventory, and credentials that runs the same way every time, launchable from the UI or REST API.
Schedules
Recurring runs — nightly compliance checks, weekly patching — without hand-managed cron entries on a control node.
Audit logs
Every job records who launched it, when, with which parameters, and the full output.
# Trigger an AWX job template via REST API (from a CI/CD pipeline)
curl -X POST \
  https://awx.example.com/api/v2/job_templates/42/launch/ \
  -H "Authorization: Bearer ${AWX_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "extra_vars": {
          "app_version": "2.4.1",
          "environment": "production"
        }
      }'
# Install AWX on Kubernetes (official method)
# See: https://github.com/ansible/awx-operator
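With the awx-operator linked above installed, an AWX deployment is requested by applying an AWX custom resource. A minimal sketch — the name and service_type follow the operator's quickstart example; check the operator's docs for current fields:

```yaml
# awx-demo.yml — minimal AWX custom resource for the awx-operator
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
spec:
  service_type: nodeport   # expose the web UI via a NodePort service
```

Apply it with kubectl apply -f awx-demo.yml and the operator reconciles the AWX pods, database, and service.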
ansible-pull — Pull Mode
In push mode, a central control node connects to managed hosts. In pull mode, each managed host runs ansible-pull on a schedule — pulling the latest playbook from a Git repository and applying it locally to itself. There is no central control node involved at runtime. This architecture scales to any fleet size because each host is responsible for its own configuration.
# Install ansible-pull on each managed host (ansible-pull ships with Ansible)
pip install ansible --break-system-packages   # needed on PEP 668 distros; prefer pipx or the distro package where possible
# Manual pull — clone repo and run local.yml against localhost
ansible-pull \
  -U https://github.com/myorg/ansible-config.git \
  -C main \
  local.yml   # the playbook to apply (targets localhost)
# Run with vault password from a file
ansible-pull \
  -U https://github.com/myorg/ansible-config.git \
  --vault-password-file /etc/ansible/.vault_pass \
  local.yml
Automating pull-mode with a cron role
# roles/ansible_pull/tasks/main.yml
# Deploy this role ONCE via push mode to bootstrap pull mode on each host
---
- name: Install Ansible on the managed host
  ansible.builtin.package:
    name: ansible
    state: present

- name: Install Git for ansible-pull
  ansible.builtin.package:
    name: git
    state: present

- name: Create vault password file
  ansible.builtin.copy:
    content: "{{ vault_ansible_pull_password }}"
    dest: /etc/ansible/.vault_pass
    owner: root
    mode: "0600"
  no_log: true

- name: Schedule ansible-pull via cron (every 30 minutes)
  ansible.builtin.cron:
    name: ansible-pull configuration management
    minute: "*/30"
    job: >
      ansible-pull
      -U https://github.com/myorg/ansible-config.git
      -C main
      --vault-password-file /etc/ansible/.vault_pass
      local.yml
      >> /var/log/ansible-pull.log 2>&1
    user: root
    state: present
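ansible-pull also supports -o / --only-if-changed, which skips the run when the Git repository has no new commits — worth considering when thousands of hosts poll the repo every 30 minutes. A variant of the cron task above (a sketch, reusing the same assumed repo URL):

```yaml
- name: Schedule ansible-pull, applying only when the repo has changed
  ansible.builtin.cron:
    name: ansible-pull configuration management
    minute: "*/30"
    job: >
      ansible-pull -o
      -U https://github.com/myorg/ansible-config.git
      -C main
      --vault-password-file /etc/ansible/.vault_pass
      local.yml
      >> /var/log/ansible-pull.log 2>&1
    user: root
    state: present
```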
The local.yml playbook that each host applies to itself
# local.yml — applied by ansible-pull on each host to itself
---
- name: Apply configuration to this host
  hosts: localhost
  connection: local
  become: true
  vars_files:
    - vars/common.yml
    - vars/secrets.yml   # vault-encrypted
  roles:
    - role: common             # base packages, timezone, NTP
    - role: hardening          # SSH, firewall, sysctl
    - role: monitoring_agent   # Prometheus node exporter
    # Conditional roles — applied based on host facts or variables
    - role: nginx
      when: "'webserver' in ansible_hostname"
    - role: postgresql
      when: "'db' in ansible_hostname"
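Hostname-based conditions like the ones above break silently if naming conventions drift; conditions can also key off gathered facts. A sketch for the roles list (the apt_maintenance role name is hypothetical):

```yaml
    # Select a role from gathered facts rather than the hostname
    - role: apt_maintenance   # hypothetical Debian-only role
      when: ansible_facts['os_family'] == "Debian"
```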
Delegation and Local Action
At scale, some tasks must run on a different host than the one being configured — registering a server with a load balancer, updating a DNS record, or notifying a monitoring system. The delegate_to attribute redirects a single task to run on a specified host while all variable context from the current host remains available.
---
- name: Rolling deployment with load balancer management
  hosts: webservers
  serial: 1
  tasks:
    # This task runs on the load balancer host, not on the web server being updated
    - name: Remove this web server from load balancer
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"   # still the current web server's name
        socket: /var/run/haproxy/admin.sock
      delegate_to: loadbalancer.example.com   # but runs on the LB host

    - name: Deploy new application release
      ansible.builtin.unarchive:
        src: "releases/{{ app_version }}.tar.gz"
        dest: "{{ deploy_dir }}/releases/{{ app_version }}"

    - name: Update current symlink
      ansible.builtin.file:
        src: "{{ deploy_dir }}/releases/{{ app_version }}"
        dest: "{{ deploy_dir }}/current"
        state: link

    - name: Restart application service
      ansible.builtin.service:
        name: myapp
        state: restarted

    - name: Health check — verify before re-adding to LB
      ansible.builtin.uri:
        url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
        status_code: 200
      register: health
      retries: 5
      delay: 6
      until: health.status == 200

    # Run on the LB again — re-add only after health check passes
    - name: Re-add web server to load balancer
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        socket: /var/run/haproxy/admin.sock
      delegate_to: loadbalancer.example.com

    # Run on the control node (localhost) — update external monitoring
    - name: Record deployment in monitoring system
      ansible.builtin.uri:
        url: "https://monitoring.example.com/api/deployments"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          version: "{{ app_version }}"
          timestamp: "{{ ansible_date_time.iso8601 }}"
      delegate_to: localhost
      run_once: false
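The "local action" half of this section's title refers to local_action, the older shorthand for delegate_to: localhost. The last task above could equally be written this way — a sketch:

```yaml
    # local_action is shorthand for delegate_to: localhost
    - name: Record deployment in monitoring system
      local_action:
        module: ansible.builtin.uri
        url: "https://monitoring.example.com/api/deployments"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          version: "{{ app_version }}"
```

Many style guides prefer the explicit delegate_to: localhost form because the target host is visible at a glance.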
Fleet Patterns — run_once, serial, and throttle
Large-fleet playbooks need precise control over execution order, parallelism, and which tasks run once versus on every host. These three attributes handle the most common fleet orchestration requirements.
Fleet control attributes
run_once: true
Executes the task on only the first host in the current batch. Use for tasks that logically apply to the fleet as a whole — database migrations, DNS updates, Slack notifications — not to every individual host.
serial: "25%"
Processes hosts in batches of the specified size or percentage. The entire play runs to completion on the first batch before moving to the next — ensures some hosts remain live during rolling updates.
throttle: 5
Limits a specific task to run on at most N hosts simultaneously — even if forks is higher. Use for tasks that put load on an external resource: API calls, database writes, package mirror downloads.
---
- name: Production rolling update — 200 web servers
  hosts: webservers
  become: true
  serial: "10%"            # update 20 servers at a time (10% of 200)
  max_fail_percentage: 5   # abort if more than 5% of hosts fail in any batch
  tasks:
    - name: Run database migration (once per batch, not once per host)
      ansible.builtin.command:
        cmd: python manage.py migrate --no-input
        chdir: "{{ deploy_dir }}/current"
      run_once: true                                # runs once on the first host in the batch
      delegate_to: "{{ groups['databases'][0] }}"   # actually runs on the DB server

    - name: Pull new Docker image
      community.docker.docker_image:
        name: "registry.example.com/myapp:{{ app_version }}"
        source: pull
      throttle: 5   # only 5 hosts pull simultaneously — protects registry

    - name: Update application container
      community.docker.docker_container:
        name: myapp
        image: "registry.example.com/myapp:{{ app_version }}"
        state: started

    # run_once fires once per *serial batch*, so here it would send one
    # message per batch — gate on the last host of the whole play instead
    - name: Send deployment notification (once for the whole fleet)
      ansible.builtin.uri:
        url: "https://hooks.slack.com/services/{{ vault_slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "Deployed {{ app_version }} to {{ ansible_play_hosts_all | length }} web servers"
      when: inventory_hostname == ansible_play_hosts_all[-1]
      delegate_to: localhost
Dynamic Inventory Requires Cloud Credentials on the Control Node
AWS, GCP, and Azure dynamic inventory plugins authenticate using the same credentials as the AWS/GCP/Azure CLI — environment variables, instance profiles, or credential files. The control node running the playbook must have permission to call the cloud provider's DescribeInstances (or equivalent) API. In CI/CD pipelines this means configuring an IAM role or service account with read-only access to compute resources. Never use a root or full-admin credential for inventory queries — the required permissions are minimal and read-only.
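Besides environment variables and instance profiles, the amazon.aws.aws_ec2 plugin accepts credential options directly in the inventory file — for example, selecting a named CLI profile. A sketch (the profile name is illustrative):

```yaml
# inventory/aws_ec2.yml — select a read-only named profile explicitly
plugin: amazon.aws.aws_ec2
aws_profile: inventory-readonly   # assumed profile in ~/.aws/credentials
regions:
  - us-east-1
```

Keeping the profile read-only on the control node enforces the least-privilege guidance above even if the inventory file leaks.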
Key Takeaways
delegate_to redirects individual tasks to a different host — enabling load balancer drain/restore, DNS updates, and monitoring notifications as part of the same play that deploys the application.
run_once, serial, and throttle together — serial for rolling batch control, run_once for singleton tasks (once per serial batch, so fleet-wide only in an unbatched play), throttle to protect shared resources from overload.
Teacher's Note
If you run in AWS or GCP, replace your hosts.ini with a dynamic inventory plugin file today — even for a small lab fleet. Run ansible-inventory --graph and watch your instances appear grouped by their tags. That single change makes every future scaling step automatic rather than manual.
Practice Questions
1. Which task attribute redirects a single task to run on a different host than the one currently being configured — while keeping all variable context from the current host?
2. A play targets 50 web servers but the database migration task should only run once. Which task attribute restricts execution to the first host in the batch?
3. Which Ansible tool runs on each managed host via cron — pulling the latest playbook from Git and applying it locally, eliminating the need for a central control node at runtime?
Quiz
1. Your fleet runs on AWS and servers are created and terminated constantly. Why is a static hosts.ini file insufficient and what is the solution?
2. A play targeting 100 hosts uses serial: "25%". How does Ansible execute it and why is this useful for deployments?
3. A security team requires that no engineer has direct access to production SSH keys or vault passwords, and that all automation runs are logged. Which tool addresses this?
Up Next · Lesson 37
Ansible in CI/CD Pipelines
Integrate Ansible directly into your CI/CD workflow — GitHub Actions, GitLab CI, pipeline structure, testing stages, environment promotion, and automated deployment gates.