Terraform Course
Performance Optimization
A Terraform plan that takes 20 minutes is a plan that engineers stop running. When the feedback loop is too slow, people make larger changes between plans, accumulate more unreviewed drift, and start reaching for -refresh=false just to get something done. Performance is not a luxury — it directly affects how safely a team can operate infrastructure. This lesson covers every meaningful technique for making Terraform faster.
This lesson covers
Why Terraform slows down → The parallelism model → -parallelism flag → Provider plugin caching → Splitting configurations → -refresh=false use cases → Targeting with -target → Data source optimisation → Reading plan output efficiently
Why Terraform Slows Down
Terraform plans have two distinct phases of work: the refresh phase and the diff phase. The refresh phase queries every managed resource's current state from the cloud provider API. The diff phase compares that state to your configuration. The refresh phase dominates performance because it makes at least one API call per resource; those calls run concurrently, but only up to the parallelism limit, so large configurations queue behind it.
| Slow plan cause | Why it happens | Fix |
|---|---|---|
| Too many resources | 500 resources = 500 API calls minimum | Split into separate root modules |
| Low parallelism | Default 10 concurrent operations — often the bottleneck | Increase -parallelism flag |
| Provider plugin download | Downloading a several-hundred-MB provider binary on every CI run | Plugin cache directory |
| Slow data sources | Data sources that make multiple API calls per read | Consolidate data sources, avoid in loops |
| API rate limiting | AWS throttling requests — retries add latency | Reduce parallelism, split configurations |
The Parallelism Model and -parallelism Flag
Terraform builds a dependency graph and walks it in parallel — resources with no dependencies between them can be created, refreshed, or destroyed simultaneously. The -parallelism flag sets the maximum number of concurrent operations. The default is 10.
New terms:
- dependency graph — the directed acyclic graph Terraform builds from your configuration. Resources with explicit or implicit dependencies form edges. Independent resources are leaves that can be processed in parallel. Cycles in this graph are the "cycle error" from Lesson 35.
- -parallelism=N — sets maximum concurrent operations. Higher values use more API concurrency. Too high and AWS throttles the requests, adding retry delays that slow things down more than the parallelism helps. Optimal value depends on the AWS service and account — typically 20–50 for most configurations.
# Default parallelism — 10 concurrent operations
terraform plan
# Increase parallelism for configurations with many independent resources
terraform plan -parallelism=20
terraform apply -parallelism=20
# Reduce parallelism to avoid AWS rate limiting (ThrottlingException)
# Some AWS services have aggressive rate limits — IAM, Route53, CloudFront
terraform apply -parallelism=5
# Set parallelism via environment variable for consistency
export TF_CLI_ARGS_plan="-parallelism=20"
export TF_CLI_ARGS_apply="-parallelism=20"
# Finding the right parallelism value:
# Start at 20 — if you see ThrottlingException errors in TF_LOG=DEBUG, reduce to 10
# If the plan is clean and fast at 20, try 30
# There is no universal right answer — it depends on your AWS account limits
# Inspect the dependency graph — visualise what can run in parallel
terraform graph | dot -Tsvg > graph.svg # Requires graphviz
# Resources at the same depth in the graph run concurrently
# Long chains of dependencies limit parallelism regardless of the -parallelism flag
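To turn the tuning advice above into an actual measurement, here is a small sketch (the helper name, the candidate values, and the -refresh-only usage are illustrative choices, not a standard tool):

```shell
# Hypothetical helper: time the same command at several -parallelism values.
# The function and the value list are illustrative; adapt to your configuration.
bench_parallelism() {
  for p in 10 20 30; do
    echo "parallelism=$p"
    time "$@" -parallelism="$p" > /dev/null
  done
}

# Usage, in an initialised working directory; a refresh-only plan isolates
# the refresh phase, which is usually the bottleneck:
# bench_parallelism terraform plan -refresh-only -input=false
```

Run it once per candidate value and watch for ThrottlingException in TF_LOG=DEBUG output; the fastest clean run wins.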
Provider Plugin Caching
Every terraform init on a fresh CI/CD runner downloads the provider plugins from the registry — the AWS provider alone is over 400MB. Caching eliminates this download on subsequent runs.
New terms:
- TF_PLUGIN_CACHE_DIR — environment variable pointing to a directory where downloaded provider plugins are cached. When init runs and finds a matching provider version in the cache, it creates a symlink instead of downloading.
- terraform providers mirror — downloads all required providers to a local directory. Used to seed the cache, create an air-gapped mirror, or pre-populate a Docker image with providers already installed.
# Method 1: TF_PLUGIN_CACHE_DIR — reuse downloaded providers across runs
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
mkdir -p "$HOME/.terraform.d/plugin-cache"
terraform init # Downloads provider and caches it in TF_PLUGIN_CACHE_DIR
# On subsequent init runs with the same provider version:
# Terraform finds the cached binary and creates a symlink instead of downloading
# AWS provider (400MB+) goes from 30-60 seconds to under 1 second
# ~/.terraformrc — set cache globally for all Terraform runs on this machine
cat ~/.terraformrc
# plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
# Method 2: Pre-bake providers into a Docker image
# Build a custom Docker image with providers pre-installed
# No download needed during CI/CD runs at all
# Dockerfile for a Terraform CI image
# FROM hashicorp/terraform:1.6.3
# RUN mkdir -p /root/.terraform.d/plugin-cache
# ENV TF_PLUGIN_CACHE_DIR="/root/.terraform.d/plugin-cache"
# COPY providers.tf /tmp/providers/
# WORKDIR /tmp/providers
# RUN terraform init -backend=false # Downloads providers into the image layer
# WORKDIR /
# Provider binaries are now in the image — zero download time in CI
# Method 3: GitHub Actions cache
# Cache the .terraform directory across workflow runs
- name: Cache Terraform providers
  uses: actions/cache@v4
  with:
    path: ~/.terraform.d/plugin-cache
    key: ${{ runner.os }}-terraform-${{ hashFiles('**/.terraform.lock.hcl') }}
    restore-keys: |
      ${{ runner.os }}-terraform-
# When the lock file changes (provider upgrade), cache is invalidated and rebuilt
# When unchanged, providers load from cache — typically 60 seconds saved per run
Splitting Configurations for Faster Feedback
The most impactful performance optimisation is structural — splitting one large configuration into multiple smaller ones. A configuration with 50 resources plans in seconds. A configuration with 500 resources plans in minutes — regardless of any other optimisation.
# Measure your current plan time — establish a baseline
time terraform plan 2>&1 | tail -5
# real 4m12.345s ← too slow if this is a frequent operation
# Profile which resources take longest — use TF_LOG to see timestamps
TF_LOG=DEBUG terraform plan 2>&1 | grep -E "Refreshing state|Reading" | head -30
# Debug log lines are timestamped — long gaps between lines show the slow reads
# Count resources in the current state
terraform state list | wc -l
# If > 100, the configuration is a strong candidate for splitting
# Before splitting (monolith — one state file):
# networking.tf, databases.tf, compute.tf, iam.tf — all in one directory
# Plan: 8 minutes (500 resources × ~10 seconds each ÷ 10 parallelism ≈ 500 seconds)
# After splitting (3 separate root modules):
# foundation/networking/ — 80 resources → Plan: 45 seconds
# platform/databases/ — 60 resources → Plan: 35 seconds
# services/compute/ — 360 resources → Plan: 4 minutes (still candidate for further split)
# The key insight:
# Engineers working on a feature branch only need to plan the services layer
# They never wait for the networking or databases plan
# Total plan time experienced: 4 minutes down from 8 minutes
# And the networking/database plans run much faster when they do change
# Splitting also improves the refresh phase:
# Fewer resources → fewer API calls per plan → faster refresh
# Independent state files → parallel plans across teams → no lock contention
Data Source Optimisation
Data sources make API calls on every plan — even when nothing has changed. Poorly written data sources multiply API calls unnecessarily. Understanding the patterns that create hidden performance problems prevents them.
# Anti-pattern: data source inside for_each — N data source reads for N resources
# This calls the AMI API once per EC2 instance — even if all instances use the same AMI
# BAD — one API call per resource
resource "aws_instance" "workers" {
for_each = var.worker_configs # 20 workers = 20 AMI API calls
ami = data.aws_ami.latest[each.key].id # Data source inside for_each = slow
}
data "aws_ami" "latest" {
for_each = var.worker_configs # Creates 20 data source reads
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
}
# GOOD — one data source read, referenced by all resources
data "aws_ami" "latest" {
most_recent = true # One API call total, regardless of how many instances
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
}
resource "aws_instance" "workers" {
for_each = var.worker_configs
ami = data.aws_ami.latest.id # All workers share the same data source read
}
# Anti-pattern: data source on every plan for rarely-changing data
# If the VPC ID never changes, don't fetch it on every plan
# BAD:
data "aws_vpc" "main" {
filter {
name = "tag:Name"
values = ["prod-vpc"]
}
}
# GOOD for stable cross-module data — use terraform_remote_state or variables instead
variable "vpc_id" {
description = "VPC ID — pass from the networking module output"
type = string
# No API call needed — value passed at apply time
}
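The variable approach above requires wiring at call time; the terraform_remote_state alternative mentioned earlier reads the producer's state directly. A minimal sketch, assuming an S3 backend (the bucket, key, region, and output names are placeholders):

```hcl
# Sketch only — bucket/key/region and the output name are assumptions
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "foundation/networking/terraform.tfstate"
    region = "eu-west-1"
  }
}

# One state-file read per plan, instead of one tag-filtered API lookup
# per consuming resource
locals {
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```

This still costs one read per plan, but it is a single object fetch from the backend rather than a provider-side search, and it scales with the number of state files, not the number of consuming resources.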
-refresh=false — Legitimate Use Cases
Lesson 40 warned against using -refresh=false to hide drift. But it has legitimate performance use cases — specific scenarios where you know the state is accurate and want to skip the API call phase entirely.
# Legitimate uses of -refresh=false
# Use case 1: Testing a configuration change when you KNOW no external drift exists
# You just applied 5 minutes ago. Nothing has changed externally.
# You want to quickly preview a configuration edit.
terraform plan -refresh=false
# The plan skips all API calls — result in under 1 second for most configs
# Appropriate only when you are confident state is accurate
# Use case 2: Applying a saved plan file
# When you already ran terraform plan (which refreshed state) and saved the plan:
terraform plan -out=tfplan # Refreshes, generates plan, saves file
# ... time passes, reviewer approves ...
terraform apply tfplan # Apply the saved plan; no re-refresh is performed
# A saved plan always applies against the state snapshot captured at plan time,
# so skipping the refresh is built in — there is no need to pass -refresh=false
# Use case 3: Debugging configuration logic, not infrastructure state
# Preview how locals and variable expressions evaluate without a full refresh
terraform plan -refresh=false
# Note: -target only accepts resource addresses, not locals — to evaluate a
# single expression interactively, use terraform console
# Use case 4: Speeding up the apply of well-understood changes in CI/CD
# The plan already ran (with refresh) in the plan stage of the pipeline
# The apply stage can safely use -refresh=false if applying a saved plan file
# (The plan artifact already captured the refreshed state)
# NEVER use -refresh=false as a routine way to speed up plans
# The API calls in the refresh phase ARE the drift detection
# Skipping them makes every plan a diff against potentially stale state
Reading Plan Output Efficiently
On large configurations, the plan output itself can be overwhelming — hundreds of lines of resource diffs. Knowing how to read it quickly and filter it to what matters saves time and reduces the risk of missing critical changes.
# Save plan output for analysis
terraform plan -no-color 2>&1 | tee plan-output.txt
# Quick summary — just the final count line
grep "^Plan:" plan-output.txt
# Plan: 12 to add, 3 to change, 1 to destroy.
# Find all resources being destroyed — the dangerous ones
grep "will be destroyed" plan-output.txt
grep -E "^ +-" plan-output.txt | head -20
# Find all resources being replaced (destroy + create)
grep "# .* must be replaced" plan-output.txt
grep -A2 "must be replaced" plan-output.txt
# Find all in-place modifications
grep "# .* will be updated in-place" plan-output.txt
# Show only the lines with the change symbols
grep -E "^ +[+~-]" plan-output.txt | head -40
# JSON output — machine-readable for CI/CD tooling
terraform show -json tfplan > plan.json
# Extract just the resource changes from JSON
cat plan.json | jq '.resource_changes[] | {address, change: .change.actions}'
# Extract destroys from JSON
cat plan.json | jq '.resource_changes[] | select(.change.actions[] == "delete") | .address'
# Count changes by action type
cat plan.json | jq '.resource_changes[].change.actions[]' | sort | uniq -c
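One practical use of this filtering is a CI gate that blocks auto-apply when a plan destroys anything. A minimal sketch using the text output (the function name and file name are illustrative, not a standard tool):

```shell
# Hypothetical CI gate: refuse auto-apply when a plan destroys resources.
# Expects a file produced by: terraform plan -no-color | tee plan-output.txt
check_plan() {
  # Count "will be destroyed" lines; missing file or no match counts as zero
  destroys=$(grep -c "will be destroyed" "$1" 2>/dev/null || true)
  if [ "${destroys:-0}" -gt 0 ]; then
    echo "Plan destroys $destroys resource(s): manual approval required"
    return 1
  fi
  echo "No destroys detected: safe to auto-apply"
}
check_plan plan-output.txt
```

The same gate can be built on the JSON output with jq by counting resource_changes whose actions include "delete"; the text version shown here avoids a jq dependency on the runner.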
Common Performance Mistakes
Using -target regularly to work around slow plans
terraform plan -target=aws_ecs_service.payments plans only that resource and the resources it depends on — ignoring everything else. This is fast, but it hides the rest of the graph: unrelated resources, and anything that depends on the targeted service, are silently excluded from the plan. Teams that use -target regularly to avoid slow plans accumulate hidden drift and dependency inconsistencies. The correct fix for slow plans is splitting the configuration or enabling caching — not bypassing the dependency graph.
Setting -parallelism too high and causing API throttling
Increasing parallelism to 50 or 100 on a large configuration can trigger AWS ThrottlingException errors — especially for IAM, Route53, and CloudFront APIs which have strict rate limits. Terraform retries throttled requests with exponential backoff — each retry adds 2-30 seconds of wait. A plan with 200 throttled requests and 5-second average retry time adds 16 minutes. Start at 20, watch for throttling in debug logs, and reduce if you see retries.
Not caching provider plugins in CI/CD
Every GitHub Actions run on a fresh Ubuntu runner downloads the AWS provider from scratch — typically 400MB in 30-60 seconds per run. A team running 50 Terraform pipeline runs per day wastes 25-50 minutes of CI time just on provider downloads. The GitHub Actions cache or a pre-baked Docker image eliminates this entirely. It is the easiest performance win in CI/CD and takes under 10 lines of configuration to implement.
Performance optimisation priority order
1. Split the configuration — by far the highest impact. Reduces resource count per state file, eliminates lock contention, enables parallel team workflows.
2. Cache provider plugins — eliminates 30-60 seconds of download per CI run with 10 lines of configuration.
3. Increase -parallelism to 20 — safe default improvement for most configurations.
4. Fix data source anti-patterns — consolidate data sources read in loops.
5. Pre-bake providers into Docker images — for CI/CD environments where the cache cannot persist.
These five steps, applied in order, typically reduce a 20-minute plan to under 3 minutes.
Practice Questions
1. Which environment variable points Terraform to a directory where downloaded provider plugins are cached between runs?
2. What is the default value of Terraform's -parallelism flag?
3. When reviewing a large plan output file, which grep pattern finds the most critical changes — resources being destroyed and recreated?
Quiz
1. A Terraform plan takes 20 minutes for a 500-resource configuration. What is the highest-impact optimisation?
2. Why is it a performance anti-pattern to use a data source inside a for_each loop?
3. Why can setting -parallelism too high make Terraform slower rather than faster?
Up Next · Lesson 44
Terraform Anti-Patterns
Performance optimised. Lesson 44 is a focused catalogue of the most damaging Terraform anti-patterns seen in real production codebases — the patterns that start as shortcuts, survive code review, and eventually cause incidents. Each one is explained with the exact failure mode and the correct replacement.