Ansible Lesson 36 – Ansible at Scale | Dataplexa
Section III · Lesson 36

Ansible at Scale

In this lesson

Dynamic inventory · AWX & Ansible Tower · ansible-pull · Fleet patterns · Delegation & local action

Ansible at scale means running automation reliably against hundreds or thousands of hosts — with dynamic inventories that stay current automatically, centralised execution platforms that provide access control and audit trails, pull-mode architectures where hosts retrieve and apply their own configuration, and fleet patterns that make large playbooks predictable and safe. The tools and patterns in this lesson are what separate a team running Ansible on a laptop against a dozen servers from a platform team operating Ansible across an entire enterprise fleet.

The Scale Problem

Running Ansible against 10 servers is straightforward. Running it against 2,000 servers introduces challenges that do not exist at small scale — and each one requires a specific tool or pattern to solve.

Challenge: Inventory is always stale
Servers are created and destroyed constantly. A static hosts.ini is out of date within hours, and maintaining it by hand across thousands of hosts is impossible.

Solution: Dynamic inventory plugins
Query your cloud provider, CMDB, or DNS at runtime to build the inventory fresh every run. AWS, GCP, Azure, and VMware all have first-class inventory plugins.

Challenge: No access control or audit trail
Anyone with the playbook and SSH keys can run anything against production, with no record of who ran what, when, or what changed.

Solution: AWX / Ansible Tower
Centralised execution platform with role-based access control, job templates, schedules, audit logs, and a web UI. The SSH keys never leave the platform.

Challenge: Push mode doesn't scale to large fleets
A control node pushing to 5,000 hosts simultaneously hits network and CPU limits, and configuration drift on temporarily unreachable hosts goes undetected.

Solution: ansible-pull (pull mode)
Each host runs a cron job that pulls the latest playbook from Git and applies it locally. This scales to any fleet size, and every host self-corrects drift on its own schedule.

Dynamic Inventory

Dynamic inventory plugins query an external source — AWS EC2, GCP, Azure, VMware, Terraform state — at playbook runtime and return a live host list grouped by tags, regions, instance types, or any attribute the source exposes. The playbook never needs updating when servers change.

# inventory/aws_ec2.yml — AWS EC2 dynamic inventory plugin
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - eu-west-1

# Filter to only running instances
filters:
  instance-state-name: running

# Group hosts by their EC2 tags automatically
keyed_groups:
  - key: tags.Role          # group by Role tag: webserver, database, cache
    prefix: role_
  - key: tags.Environment   # group by Environment tag: production, staging
    prefix: env_
  - key: placement.region   # group by AWS region
    prefix: region_

# Set the connection address (private IP for internal access)
hostnames:
  - private-ip-address

# Compose variables from EC2 metadata
compose:
  ansible_host: private_ip_address
  ec2_instance_type: instance_type

# List the dynamic inventory to verify groups and hosts
ansible-inventory -i inventory/aws_ec2.yml --list
ansible-inventory -i inventory/aws_ec2.yml --graph

# Run against a tag-based group — no hosts.ini needed
ansible-playbook site.yml -i inventory/aws_ec2.yml

# Target all production web servers (Role=webserver AND Environment=production)
ansible-playbook deploy.yml -i inventory/aws_ec2.yml \
  --limit "role_webserver:&env_production"

# Sample output: ansible-inventory -i inventory/aws_ec2.yml --graph

@all:
  |--@role_webserver:
  |  |--10.0.1.14
  |  |--10.0.1.22
  |  |--10.0.1.31
  |--@role_database:
  |  |--10.0.2.10
  |--@env_production:
  |  |--10.0.1.14
  |  |--10.0.1.22
  |  |--10.0.1.31
  |  |--10.0.2.10
  |--@region_us_east_1:
  |  |--10.0.1.14
  |  |--10.0.2.10
  |--@region_eu_west_1:
  |  |--10.0.1.22
  |  |--10.0.1.31

What just happened?

The inventory plugin queried AWS at runtime and returned all running EC2 instances, automatically grouped by their tags and region. No static file was involved — if a new instance with Role=webserver starts up, it appears in role_webserver on the next run without any manual update. The :& operator in --limit applies an AND filter — targeting only hosts that are in both groups simultaneously.
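The :& semantics can be pictured as plain set intersection. The sketch below (hostnames copied from the sample --graph output above) shows what --limit "role_webserver:&env_production" selects:

```python
# Sketch: Ansible's ":&" limit pattern is set intersection over inventory groups.
# Host lists mirror the sample --graph output above.
role_webserver = {"10.0.1.14", "10.0.1.22", "10.0.1.31"}
env_production = {"10.0.1.14", "10.0.1.22", "10.0.1.31", "10.0.2.10"}

# --limit "role_webserver:&env_production" targets hosts in BOTH groups
targeted = role_webserver & env_production
print(sorted(targeted))   # only the production web servers; the database is excluded
```

The other pattern operators follow the same logic: ":" is union and ":!" is set difference.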

The Live GPS Analogy

Static inventory is a printed paper map — accurate when it was made, increasingly wrong over time. Dynamic inventory is live GPS — it shows where everything actually is right now, updated at the moment you look. At small scale a paper map is fine. At scale, only GPS is reliable.

AWX and Ansible Tower

AWX is the open-source upstream project; Ansible Tower (now Red Hat Ansible Automation Platform) is the enterprise distribution. Both provide a web UI, REST API, and execution engine that runs playbooks centrally — with credentials stored securely, never exposed to individual engineers.

AWX / Tower capabilities

Job Templates · Pre-configured playbook runs — playbook + inventory + credentials + extra vars — that authorised users can launch with one click without needing Git access or SSH keys.
RBAC · Role-based access control — developers can run deploy jobs against staging but not production; only ops engineers can run hardening or infrastructure changes.
Credential vaulting · SSH keys, vault passwords, and cloud credentials are stored encrypted in AWX. Engineers never see the keys — they only launch jobs that use them.
Schedules · Run job templates on a cron-like schedule — nightly compliance checks, weekly patching runs, or hourly drift correction — without any engineer involvement.
Audit log · Every job run is logged — who launched it, when, against which hosts, with what result, and the full task output. Required for compliance in regulated environments.
REST API & webhooks · Trigger job templates from CI/CD pipelines via the REST API. GitHub/GitLab webhooks can launch a deploy job automatically on a push to main.

# Trigger an AWX job template via REST API (from a CI/CD pipeline)
curl -X POST \
  https://awx.example.com/api/v2/job_templates/42/launch/ \
  -H "Authorization: Bearer ${AWX_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "extra_vars": {
      "app_version": "2.4.1",
      "environment": "production"
    }
  }'

# Install AWX on Kubernetes (official method)
# See: https://github.com/ansible/awx-operator
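The same launch call can be driven from a pipeline script. The sketch below only builds the request against the endpoint shown in the curl example; actually sending it (with urllib, requests, or curl) and handling the returned job is left out, and the token and template ID are placeholders:

```python
import json

def build_launch_request(base_url, template_id, token, extra_vars):
    """Build the AWX job-template launch call shown in the curl example above:
    POST {base_url}/api/v2/job_templates/{id}/launch/ with extra_vars as JSON."""
    url = f"{base_url}/api/v2/job_templates/{template_id}/launch/"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"extra_vars": extra_vars})
    return url, headers, body

# Placeholder host, template ID, and token for illustration
url, headers, body = build_launch_request(
    "https://awx.example.com", 42, "TOKEN",
    {"app_version": "2.4.1", "environment": "production"},
)
print(url)
```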

ansible-pull — Pull Mode

In push mode, a central control node connects to managed hosts. In pull mode, each managed host runs ansible-pull on a schedule — pulling the latest playbook from a Git repository and applying it locally to itself. There is no central control node involved at runtime. This architecture scales to any fleet size because each host is responsible for its own configuration.

Push mode
- Control node connects outbound to every host
- Scales to roughly 500–1,000 hosts before hitting limits
- Unreachable hosts miss runs, so drift accumulates

Pull mode (ansible-pull)
- Each host pulls from Git and runs locally
- Scales to unlimited hosts with no central bottleneck
- Hosts self-correct on the next pull after coming back online
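The push-mode ceiling is easy to estimate: the control node works through the fleet forks hosts at a time, so wall-clock time grows with fleet size divided by forks. A back-of-envelope sketch with illustrative numbers:

```python
import math

def push_run_minutes(hosts: int, forks: int, minutes_per_host: float) -> float:
    """Rough push-mode wall time: the control node processes the fleet
    `forks` hosts at a time, so total time grows with hosts / forks."""
    return math.ceil(hosts / forks) * minutes_per_host

# 5,000 hosts, 50 forks, roughly 2 minutes of tasks per host:
print(push_run_minutes(5000, 50, 2))   # 200 minutes for one fleet-wide run
```

Pull mode sidesteps this entirely: every host does its own ~2-minute run in parallel, regardless of fleet size.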
# Install ansible-pull on each managed host
pip install ansible --break-system-packages

# Manual pull — clone repo and run local.yml against localhost
ansible-pull \
  -U https://github.com/myorg/ansible-config.git \
  -C main \
  local.yml                  # the playbook to apply (targets localhost)

# Run with vault password from a file
ansible-pull \
  -U https://github.com/myorg/ansible-config.git \
  --vault-password-file /etc/ansible/.vault_pass \
  local.yml

Automating pull-mode with a cron role

# roles/ansible_pull/tasks/main.yml
# Deploy this role ONCE via push mode to bootstrap pull mode on each host
---
- name: Install Ansible on the managed host
  ansible.builtin.package:
    name: ansible
    state: present

- name: Install Git for ansible-pull
  ansible.builtin.package:
    name: git
    state: present

- name: Create vault password file
  ansible.builtin.copy:
    content: "{{ vault_ansible_pull_password }}"
    dest: /etc/ansible/.vault_pass
    owner: root
    mode: "0600"
  no_log: true

- name: Schedule ansible-pull via cron (every 30 minutes)
  ansible.builtin.cron:
    name: ansible-pull configuration management
    minute: "*/30"
    job: >
      ansible-pull
      -U https://github.com/myorg/ansible-config.git
      -C main
      --vault-password-file /etc/ansible/.vault_pass
      local.yml
      >> /var/log/ansible-pull.log 2>&1
    user: root
    state: present
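One practical caveat: the cron entry above fires on every host at the same minute, which can hammer the Git server at fleet scale. ansible-pull ships a --sleep N option that waits a random interval up to N seconds before pulling, and --only-if-changed skips the run when the repository has not changed. The sketch below models the same splay idea in Python (function name and the 10-minute window are illustrative):

```python
import random

def pull_splay(hostname: str, window_seconds: int = 600) -> int:
    """Stable per-host jitter in [0, window_seconds): seeding the RNG from
    the hostname gives each host the same offset on every run, spreading
    the fleet's pulls across the window instead of hitting Git at once."""
    return random.Random(hostname).randrange(window_seconds)

# Every host gets a stable offset inside the window
offsets = {h: pull_splay(h) for h in ["web-01", "web-02", "db-01"]}
print(offsets)
```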

The local.yml playbook that each host applies to itself

# local.yml — applied by ansible-pull on each host to itself
---
- name: Apply configuration to this host
  hosts: localhost
  connection: local
  become: true

  vars_files:
    - vars/common.yml
    - vars/secrets.yml           # vault-encrypted

  roles:
    - role: common               # base packages, timezone, NTP
    - role: hardening            # SSH, firewall, sysctl
    - role: monitoring_agent     # Prometheus node exporter

    # Conditional roles — applied based on host facts or variables
    - role: nginx
      when: "'webserver' in ansible_hostname"
    - role: postgresql
      when: "'db' in ansible_hostname"

Delegation and Local Action

At scale, some tasks must run on a different host than the one being configured — registering a server with a load balancer, updating a DNS record, or notifying a monitoring system. The delegate_to attribute redirects a single task to run on a specified host while all variable context from the current host remains available.

- name: Rolling deployment with load balancer management
  hosts: webservers
  serial: 1

  tasks:
    # This task runs on the load balancer host, not on the web server being updated
    - name: Remove this web server from load balancer
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"    # still the current web server's name
        socket: /var/run/haproxy/admin.sock
      delegate_to: loadbalancer.example.com  # but runs on the LB host

    - name: Deploy new application release
      ansible.builtin.unarchive:
        src: "releases/{{ app_version }}.tar.gz"
        dest: "{{ deploy_dir }}/releases/{{ app_version }}"

    - name: Update current symlink
      ansible.builtin.file:
        src: "{{ deploy_dir }}/releases/{{ app_version }}"
        dest: "{{ deploy_dir }}/current"
        state: link

    - name: Restart application service
      ansible.builtin.service:
        name: myapp
        state: restarted

    - name: Health check — verify before re-adding to LB
      ansible.builtin.uri:
        url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
        status_code: 200
      register: health
      retries: 5
      delay: 6
      until: health.status == 200

    # Run on the LB again — re-add only after health check passes
    - name: Re-add web server to load balancer
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        socket: /var/run/haproxy/admin.sock
      delegate_to: loadbalancer.example.com

    # Run on the control node (localhost) — update external monitoring
    - name: Record deployment in monitoring system
      ansible.builtin.uri:
        url: "https://monitoring.example.com/api/deployments"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          version: "{{ app_version }}"
          timestamp: "{{ ansible_date_time.iso8601 }}"
      delegate_to: localhost
      run_once: false            # default, shown explicitly: every host records its own deployment

Fleet Patterns — run_once, serial, and throttle

Large-fleet playbooks need precise control over execution order, parallelism, and which tasks run once versus on every host. These three attributes handle the most common fleet orchestration requirements.

Fleet control attributes

run_once: true · Executes the task on only the first host in the current batch. Use for tasks that logically apply to the fleet as a whole — database migrations, DNS updates, Slack notifications — not to every individual host.
serial: "25%" · Processes hosts in batches of the specified size or percentage. The entire play runs to completion on the first batch before moving to the next — ensures some hosts remain live during rolling updates.
throttle: 5 · Limits a specific task to run on at most N hosts simultaneously — even if forks is higher. Use for tasks that put load on an external resource: API calls, database writes, package mirror downloads.

---
- name: Production rolling update — 200 web servers
  hosts: webservers
  become: true
  serial: "10%"             # update 20 servers at a time (10% of 200)
  max_fail_percentage: 5    # abort if more than 5% of hosts fail in any batch

  tasks:
    - name: Run database migration (once per batch, not once per host)
      ansible.builtin.command:
        cmd: python manage.py migrate --no-input
        chdir: "{{ deploy_dir }}/current"
      run_once: true          # runs once on the first host in the batch
      delegate_to: "{{ groups['databases'][0] }}"  # actually runs on the DB server

    - name: Pull new Docker image
      community.docker.docker_image:
        name: "registry.example.com/myapp:{{ app_version }}"
        source: pull
      throttle: 5            # only 5 hosts pull simultaneously — protects registry

    - name: Update application container
      community.docker.docker_container:
        name: myapp
        image: "registry.example.com/myapp:{{ app_version }}"
        state: started

    - name: Send deployment notification (only on the final batch)
      ansible.builtin.uri:
        url: "https://hooks.slack.com/services/{{ vault_slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "Deployed {{ app_version }} to {{ ansible_play_hosts_all | length }} web servers"
      run_once: true            # with serial, run_once fires once per BATCH...
      delegate_to: localhost
      when: ansible_play_hosts_all[-1] in ansible_play_batch  # ...so only send on the last batch
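How a percentage serial carves up a fleet follows a simple rule: each batch is the floor of the percentage times the total, minimum one host, and the final batch takes whatever remains. A sketch of that arithmetic (a simplified model; serial also accepts fixed sizes and lists of batch sizes):

```python
def serial_batches(total_hosts: int, percent: str) -> list[int]:
    """Model of percentage-based `serial`: each batch is floor(pct * total),
    with a minimum of one host, until every host has been processed."""
    pct = float(percent.rstrip("%")) / 100.0
    size = max(1, int(total_hosts * pct))
    batches, remaining = [], total_hosts
    while remaining > 0:
        batches.append(min(size, remaining))
        remaining -= batches[-1]
    return batches

print(serial_batches(200, "10%"))   # ten batches of 20
print(serial_batches(10, "30%"))    # [3, 3, 3, 1]: the last batch takes the remainder
```

Within any batch, a task's actual concurrency is further capped by min(forks, throttle), which is why throttle: 5 protects the registry even with 20 hosts in flight.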

Dynamic Inventory Requires Cloud Credentials on the Control Node

AWS, GCP, and Azure dynamic inventory plugins authenticate using the same credentials as the AWS/GCP/Azure CLI — environment variables, instance profiles, or credential files. The control node running the playbook must have permission to call the cloud provider's DescribeInstances (or equivalent) API. In CI/CD pipelines this means configuring an IAM role or service account with read-only access to compute resources. Never use a root or full-admin credential for inventory queries — the required permissions are minimal and read-only.
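As a concrete example, a minimal read-only IAM policy for the aws_ec2 plugin could look like the sketch below (the exact action list to grant is documented with the plugin; ec2:DescribeInstances is the core call the inventory makes):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeRegions"
      ],
      "Resource": "*"
    }
  ]
}
```

Attach this to the CI/CD runner's instance profile or service account rather than distributing long-lived access keys.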

Key Takeaways

Replace static inventory with dynamic inventory plugins the moment your fleet starts changing regularly. AWS EC2, GCP, and Azure all have first-class plugins — query tags and regions to build groups automatically.
AWX provides RBAC, credential vaulting, audit logs, and scheduled runs — everything a team needs to run Ansible safely in production without giving every engineer SSH keys to production servers.
ansible-pull inverts the connection model — each host applies its own configuration on a cron schedule. Scales to unlimited hosts; eliminates the central control node as a bottleneck and single point of failure.
delegate_to redirects individual tasks to a different host — enabling load balancer drain/restore, DNS updates, and monitoring notifications as part of the same play that deploys the application.
Use run_once, serial, and throttle together — serial for rolling batch control, run_once for fleet-wide singleton tasks, throttle to protect shared resources from overload.

Teacher's Note

If you run in AWS or GCP, replace your hosts.ini with a dynamic inventory plugin file today — even for a small lab fleet. Run ansible-inventory --graph and watch your instances appear grouped by their tags. That single change makes every future scaling step automatic rather than manual.

Practice Questions

1. Which task attribute redirects a single task to run on a different host than the one currently being configured — while keeping all variable context from the current host?



2. A play targets 50 web servers but the database migration task should only run once. Which task attribute restricts execution to the first host in the batch?



3. Which Ansible tool runs on each managed host via cron — pulling the latest playbook from Git and applying it locally, eliminating the need for a central control node at runtime?



Quiz

1. Your fleet runs on AWS and servers are created and terminated constantly. Why is a static hosts.ini file insufficient and what is the solution?


2. A play targeting 100 hosts uses serial: "25%". How does Ansible execute it and why is this useful for deployments?


3. A security team requires that no engineer has direct access to production SSH keys or vault passwords, and that all automation runs are logged. Which tool addresses this?


Up Next · Lesson 37

Ansible in CI/CD Pipelines

Integrate Ansible directly into your CI/CD workflow — GitHub Actions, GitLab CI, pipeline structure, testing stages, environment promotion, and automated deployment gates.