Terraform Lesson 36 – Terraform at Scale | Dataplexa
Section IV · Lesson 36

Terraform at Scale

A single Terraform configuration works fine for one team managing one environment. Add ten teams, four environments, three cloud providers, and a hundred microservices — and a single monolithic configuration becomes a liability. This lesson covers how mature organisations structure Terraform at scale: how to split state, how to compose root modules, and the patterns that let dozens of engineers apply Terraform simultaneously without conflict.

This lesson covers

Why monolithic configurations fail at scale → State isolation strategies → Root module composition → terraform_remote_state → The terragrunt pattern → Workspaces — correct and incorrect uses → Team workflows at scale

Why Monolithic Configurations Fail

A configuration that manages everything — networking, databases, compute, IAM, DNS — in one state file has serious problems at scale that are not obvious until you hit them.

Problem Impact
Single state lock Only one engineer can apply at a time — all others are blocked
Blast radius A misconfigured plan can destroy production databases while deploying an IAM role
Slow plans Terraform refreshes every resource on every plan — 500 resources means 500 API calls before you see a single change
Access control You cannot give a developer permission to deploy their app without also giving them permission to modify VPCs and databases
Team ownership No clear ownership — every team modifies the same files, merge conflicts are constant

State Isolation Strategies

The primary scaling technique is splitting the monolith into many smaller state files — each responsible for one layer, one component, or one environment. Each state file has its own lock, its own blast radius, and can be owned by a separate team.

New terms:

  • root module — a directory of .tf files with its own backend and state file. A large organisation may have dozens of root modules — one per component, team, or layer. Not to be confused with a reusable module: a root module is the entry point you run terraform plan and apply against; it may call reusable modules, but is never called by one.
  • state isolation — the practice of using separate state files for different components or environments so that a lock or error in one does not affect others.
  • layer separation — a common pattern where infrastructure is split into layers by rate of change: foundation (VPCs, DNS zones — changes rarely), platform (databases, caches, queues — changes occasionally), application (compute, deployments — changes frequently).
# Recommended directory structure for a scaled Terraform organisation
# Each directory is a separate root module with its own state file

infrastructure/
├── foundation/                    # Layer 1 — changes rarely, owned by platform team
│   ├── networking/                # VPCs, subnets, peering, Transit Gateway
│   │   ├── versions.tf            # Backend: s3://state/foundation/networking/terraform.tfstate
│   │   ├── main.tf
│   │   └── outputs.tf
│   ├── dns/                       # Route53 hosted zones
│   │   └── ...                    # Backend: s3://state/foundation/dns/terraform.tfstate
│   └── security/                  # Security groups, WAF, GuardDuty
│       └── ...                    # Backend: s3://state/foundation/security/terraform.tfstate
│
├── platform/                      # Layer 2 — changes occasionally, owned by platform team
│   ├── databases/                 # RDS, ElastiCache, DynamoDB tables
│   │   └── ...                    # Backend: s3://state/platform/databases/terraform.tfstate
│   └── queues/                    # SQS, SNS, EventBridge
│       └── ...
│
├── services/                      # Layer 3 — changes frequently, owned by app teams
│   ├── payments/                  # Payment service infrastructure
│   │   └── ...                    # Backend: s3://state/services/payments/terraform.tfstate
│   └── auth/                      # Auth service infrastructure
│       └── ...                    # Backend: s3://state/services/auth/terraform.tfstate
│
└── environments/                  # Per-environment overrides
    ├── dev/
    ├── staging/
    └── prod/

# Key rules:
# 1. Lower layers have no dependencies on higher layers
# 2. Higher layers depend on lower layers via terraform_remote_state or data sources
# 3. Each state file is owned by a single team
# 4. State key naming convention: LAYER/COMPONENT/terraform.tfstate — add an
#    environment segment when one bucket serves several environments
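Rule 3 has a concrete enforcement mechanism: because each team's state lives under its own key prefix, an IAM policy can grant a team write access to its own state and nothing else. A minimal sketch, with illustrative bucket and prefix names:

```hcl
# Hypothetical IAM policy: the payments team's CI role may read/write ONLY
# its own state key. Bucket and prefix names are illustrative.
data "aws_iam_policy_document" "payments_state_access" {
  statement {
    sid       = "PaymentsStateReadWrite"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::acme-terraform-state/services/payments/*"]
  }

  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::acme-terraform-state"]
  }
}
```

With a monolithic state file this scoping is impossible: every engineer who can apply anything can write to everything.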

terraform_remote_state — Reading Outputs Across State Files

When a higher layer needs the outputs of a lower layer — a service needs the VPC ID from networking — the terraform_remote_state data source reads those outputs directly from the other state file. This is the primary cross-state dependency mechanism.

New terms:

  • terraform_remote_state — a data source that reads the outputs from another root module's state file. Does not read resource attributes — only outputs explicitly declared in the other module's outputs.tf. This is why well-documented outputs are critical at scale.
  • backend configuration in remote_state — must point to the exact same backend type and location as the source state file. If the source uses S3, the data source must also specify S3 with the matching bucket, key, and region.
# In services/payments/versions.tf — configure backend for THIS state file
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"
    key    = "services/payments/terraform.tfstate"
    region = "us-east-1"
  }
}

# Read outputs from the networking state file (a different root module)
data "terraform_remote_state" "networking" {
  backend = "s3"  # Must match the networking module's backend type

  config = {
    bucket = "acme-terraform-state"
    key    = "foundation/networking/terraform.tfstate"  # The OTHER state file
    region = "us-east-1"
  }
}

# Read outputs from the databases state file
data "terraform_remote_state" "databases" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "platform/databases/terraform.tfstate"
    region = "us-east-1"
  }
}

# Use the remote outputs — syntax: data.terraform_remote_state.NAME.outputs.OUTPUT_NAME
resource "aws_ecs_service" "payments" {
  name    = "payments-service"
  cluster = aws_ecs_cluster.main.id

  network_configuration {
    # VPC and subnets from the networking module — not hardcoded
    subnets         = data.terraform_remote_state.networking.outputs.private_subnet_ids
    security_groups = [aws_security_group.payments.id]
  }
}

resource "aws_ssm_parameter" "db_endpoint" {
  name  = "/payments/database/endpoint"
  type  = "String"
  # Database endpoint from the platform/databases module
  value = data.terraform_remote_state.databases.outputs.payments_db_endpoint
}

# The networking module's outputs.tf must declare these outputs explicitly:
# output "private_subnet_ids" { value = [for s in aws_subnet.private : s.id] }
# output "vpc_id"             { value = aws_vpc.main.id }
# Without these outputs, terraform_remote_state cannot read them
$ terraform plan  # Run from services/payments/

data.terraform_remote_state.networking: Reading...
data.terraform_remote_state.networking: Read complete
  [outputs: vpc_id, private_subnet_ids, public_subnet_ids, ...]

data.terraform_remote_state.databases: Reading...
data.terraform_remote_state.databases: Read complete
  [outputs: payments_db_endpoint, payments_db_port, ...]

  + aws_ecs_service.payments
      + network_configuration:
          + subnets = [
              "subnet-0abc123",   # From networking state file
              "subnet-0def456",   # Not hardcoded — read from remote state
            ]

Plan: 2 to add, 0 to change, 0 to destroy.

# The networking team can update subnet IDs (e.g. add AZs) without touching
# the payments configuration — the remote_state data source picks up the change
# on the next payments plan automatically

What just happened?

  • The payments service read networking outputs without any coupling to the networking code. The payments team has no access to the networking root module's .tf files. They consume only the outputs — a published interface. The networking team can refactor their implementation entirely as long as the output names and types stay the same.
  • Cross-state dependencies are read-only at plan time. terraform_remote_state reads the state file but never modifies it. The networking state file is completely safe — the payments plan cannot lock, corrupt, or modify it.
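The "published interface" framing is worth making concrete. Below is a sketch of what the networking module's outputs.tf might look like; the description fields document the contract for downstream teams, and the for expressions assume the subnets were created with for_each:

```hcl
# foundation/networking/outputs.tf — the published interface other teams consume
output "vpc_id" {
  description = "ID of the main VPC, consumed by all service modules"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "Private subnet IDs for service workloads (one per AZ)"
  value       = [for s in aws_subnet.private : s.id]
}

output "public_subnet_ids" {
  description = "Public subnet IDs for internet-facing load balancers"
  value       = [for s in aws_subnet.public : s.id]
}
```

Anything not declared here is invisible to terraform_remote_state readers, which is exactly the encapsulation you want.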

Workspaces — Correct and Incorrect Uses

Workspaces allow multiple state files within a single Terraform configuration. They are widely misunderstood and frequently misused — leading to configurations that look like they support environments but actually share far too much to be safe.

# Workspaces — what they are
terraform workspace new staging    # Create a new workspace
terraform workspace select prod    # Switch to prod workspace
terraform workspace list           # List all workspaces
terraform workspace show           # Show current workspace name

# What workspaces actually do:
# Each workspace gets its own state file at a different S3 key:
# default:  s3://bucket/key/terraform.tfstate
# staging:  s3://bucket/env:/staging/key/terraform.tfstate
# prod:     s3://bucket/env:/prod/key/terraform.tfstate

# Referencing the workspace name in configuration
locals {
  # Use workspace name to drive environment-specific values
  env_config = {
    dev = {
      instance_type = "t3.micro"
      db_class      = "db.t3.micro"
      min_capacity  = 1
    }
    staging = {
      instance_type = "t3.small"
      db_class      = "db.t3.small"
      min_capacity  = 2
    }
    prod = {
      instance_type = "t3.large"
      db_class      = "db.r6g.large"
      min_capacity  = 3
    }
  }

  config = local.env_config[terraform.workspace]  # terraform.workspace = current name;
                                                  # errors in any workspace not listed
                                                  # above (including default)
}

resource "aws_instance" "app" {
  instance_type = local.config.instance_type  # Different per workspace
}

# ── When workspaces ARE appropriate ──────────────────────────────────────────
# - Small teams managing truly identical environments that differ only in size/count
# - Feature branch environments — temporary, short-lived, parallel to main workspace
# - The configuration is simple and all environments use the same AWS account

# ── When workspaces are NOT appropriate ──────────────────────────────────────
# - Different AWS accounts per environment (dev account vs prod account)
#   → all workspaces share one provider configuration — switching workspaces
#     does not switch accounts
# - Environments with fundamentally different architectures
#   → conditional complexity explodes: if prod needs a NAT gateway but dev does not,
#     the configuration becomes unreadable
# - Large teams where different people manage different environments
#   → no clear ownership — anyone on the team can switch workspaces and apply to prod

The workspace vs separate directories debate

HashiCorp's own guidance says workspaces are not the recommended way to manage multiple environments in most organisations. Separate directories with separate state files give you separate provider configurations (different AWS accounts), clear ownership, and no risk of accidentally running in the wrong workspace. The only check against applying to prod when you meant dev is the workspace name — a single terraform workspace select prod away. Separate directories backed by separate CI/CD pipelines are the safer default for production organisations.
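What "separate provider configurations" buys you can be shown in a few lines. Each environment directory is its own root module with its own backend key and its own assume-role target (account IDs and role names here are illustrative):

```hcl
# environments/dev/main.tf — dev root module, pinned to the dev account
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"
    key    = "environments/dev/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-dev"  # illustrative dev account
  }
}

# environments/prod/main.tf has the identical structure, but with
#   key      = "environments/prod/terraform.tfstate"
#   role_arn = "arn:aws:iam::999999999999:role/terraform-prod"
# No workspace switch can cross this boundary: the account is pinned in code.
```

Running from the wrong directory is far harder than running in the wrong workspace, and each directory can be wired to its own CI/CD pipeline with its own approval gates.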

The Terragrunt Pattern

Terragrunt is a thin wrapper around Terraform that solves the DRY (Don't Repeat Yourself) problem that appears when managing many root modules. Without it, every root module duplicates the same backend configuration, provider version, and common variable values.

New terms:

  • Terragrunt — an open-source wrapper tool by Gruntwork. Adds DRY configuration for backends, remote state dependencies, and run-all commands that orchestrate multiple modules. Stores configuration in terragrunt.hcl files alongside the Terraform root modules.
  • run-all — a Terragrunt command that applies all modules in a directory tree in dependency order. terragrunt run-all apply applies foundation first, then platform, then services — automatically in the correct order.
# Root terragrunt.hcl — shared config inherited by all child modules
# Located at the root of the infrastructure/ directory

locals {
  account_id  = get_aws_account_id()   # Terragrunt built-in function
  region      = "us-east-1"
  environment = basename(dirname(path_relative_to_include()))  # dev/staging/prod from path
}

# Generate backend configuration dynamically — no more copy-pasting backend blocks
generate "backend" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"

  # Reconstructed typical backend block; lock table name is illustrative
  contents = <<EOF
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "${local.region}"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
EOF
}
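The other half of the pattern is the child configuration. Each root module carries a small terragrunt.hcl that includes the shared root config and declares its upstream dependencies; those dependency edges are how run-all computes the apply order. A sketch, with paths matching the directory layout above:

```hcl
# services/payments/terragrunt.hcl — child config
include "root" {
  path = find_in_parent_folders()  # Inherit the generated backend and shared locals
}

# Declares the dependency edge run-all uses for ordering, and exposes
# the networking module's outputs for use as inputs here
dependency "networking" {
  config_path = "../../foundation/networking"
}

inputs = {
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids
}
```

terragrunt run-all apply walks this graph and applies foundation/networking before services/payments without anyone ordering the runs by hand.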

Team Workflows at Scale

# Practical team workflow patterns for large organisations

# Pattern 1: Component-scoped CI/CD pipelines
# Each root module has its own pipeline — triggered by changes to its directory
# .github/workflows/payments.yml triggers on: paths: ['infrastructure/services/payments/**']
# The platform team's pipeline triggers on: paths: ['infrastructure/platform/**']
# No coordination needed — each team deploys independently

# Pattern 2: PR-based plan review
# Every pull request runs terraform plan and posts the output as a PR comment
# The reviewer approves the specific infrastructure change before it is applied
# Prevents surprises — the reviewer sees the exact infrastructure change before it is applied

# Pattern 3: Locked production — manual approval required
# GitHub Actions example:
# terraform plan → save plan file → request review → on approval → terraform apply planfile
# Using a saved plan file ensures that exactly what was reviewed is what gets applied
# Even if code changes between plan and apply, the plan file is immutable

# Saving and applying a plan file
terraform plan -out=tfplan          # Save plan to file
terraform show -json tfplan         # Inspect the saved plan (human readable)
terraform apply tfplan              # Apply exactly this plan — no new plan run

# Pattern 4: -target for emergency fixes
# -target limits plan/apply to specific resources — use sparingly
terraform plan -target=aws_lb.main  # Only plan the load balancer
terraform apply -target=aws_lb.main # Only apply the load balancer

# CAUTION: -target can leave state in an inconsistent condition
# Dependencies may not be updated — use only for genuine emergencies
# Always run a full plan without -target afterwards to check for drift

Common Scale Mistakes

Using workspaces for multi-account deployments

Workspaces change the state key but not the provider configuration. If your dev and prod environments live in separate AWS accounts, switching workspaces does not change which account Terraform targets: it targets whatever account the current credentials resolve to. An engineer who assumes that selecting the prod workspace implies the prod account, and then applies, has made a dangerous mistake. Use separate provider configurations for multi-account setups — either separate directories or aliased providers.
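For the aliased-provider alternative, the target account is pinned per provider block rather than per directory. A sketch with illustrative account IDs:

```hcl
provider "aws" {
  alias  = "dev"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform"  # illustrative dev account
  }
}

provider "aws" {
  alias  = "prod"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::999999999999:role/terraform"  # illustrative prod account
  }
}

# Every resource names its account explicitly; no ambient workspace decides it
resource "aws_s3_bucket" "artifacts" {
  provider = aws.prod
  bucket   = "acme-prod-artifacts"
}
```

The choice of account becomes a reviewable line of code instead of invisible CLI state.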

Coupling layers via terraform_remote_state on mutable resources

If the networking module changes its outputs — renames them, changes their type — every module that reads those outputs via terraform_remote_state breaks simultaneously. Treat cross-state outputs as a published API. Change them with the same care as a module version change: add new outputs, deprecate old ones, communicate with consumers before removing them.
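An additive migration keeps consumers working while the interface evolves. A sketch of renaming an output without breaking existing terraform_remote_state readers:

```hcl
# New canonical output
output "private_subnet_ids" {
  description = "Private subnet IDs (one per AZ)"
  value       = [for s in aws_subnet.private : s.id]
}

# Old name kept temporarily so existing terraform_remote_state readers keep working
output "subnet_ids" {
  description = "DEPRECATED: use private_subnet_ids; removed once all consumers migrate"
  value       = [for s in aws_subnet.private : s.id]
}
```

Remove the deprecated output only after confirming with every consuming team — their next plan will fail the moment it disappears.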

Applying -target regularly in CI/CD

-target was designed for emergency use — not as a workflow pattern. When used regularly it means the state file no longer reflects the full configuration, dependency tracking is incomplete, and subsequent full applies often show unexpected diffs. If you find yourself reaching for -target routinely, the root cause is usually a configuration that is too large and needs to be split into separate root modules.

Practice Questions

1. Which data source reads the outputs from a separate root module's state file?



2. What Terraform expression returns the name of the current workspace?



3. What two-step command sequence ensures that exactly the plan that was reviewed is what gets applied in a production approval workflow?



Quiz

1. Why are Terraform workspaces not suitable for managing separate AWS accounts per environment?


2. What is the key constraint of terraform_remote_state that makes well-documented outputs critical?


3. What is the recommended principle for deciding how to split a monolithic Terraform configuration into separate root modules?


Up Next · Lesson 37

Terraform in CI/CD Pipelines

Architecture established. Lesson 37 goes deep into CI/CD integration — GitHub Actions pipeline anatomy, secure credential injection, plan-on-PR and apply-on-merge patterns, saved plan files for approval gates, and the OIDC-based keyless authentication patterns that eliminate stored credentials entirely.