Terraform Lesson 40 – Drift Detection | Dataplexa

Section IV · Lesson 40

Drift Detection

Drift is the gap between what Terraform thinks your infrastructure looks like and what it actually looks like. It is one of the most common sources of production incidents — a security group rule changed manually during an incident and never reverted, a database instance type modified in the console by a well-meaning engineer, a tag removed by an automated tool. This lesson covers how drift happens, how to detect it systematically, and how to respond decisively.

This lesson covers

What drift is and how it happens → terraform plan as a drift detector → terraform refresh → Scheduled drift detection pipelines → Reading drift output → The four drift response strategies → Preventing drift with lifecycle rules → Documenting accepted deviations

What Drift Is and How It Happens

Drift occurs when real infrastructure diverges from the state Terraform recorded. The state file represents what Terraform last knew about a resource. When something changes outside of Terraform — a manual console edit, an AWS service updating a default, an incident response change — the state file is no longer accurate.

Drift cause	Example	Risk level
Manual console change	Security group rule added during incident, never reverted	High — security posture degraded silently
AWS default update	AWS changes a default setting on an existing resource	Medium — may cause config conflicts on next apply
Another tool modifying resources	AWS Config auto-remediation adding a tag Terraform does not track	Low — usually cosmetic unless Terraform fights back
Deletion outside Terraform	Resource deleted in the console or CLI while state still shows it as existing	High: next apply recreates it, but anything depending on it is broken until then
State file out of sync	State migrated incorrectly, resource address moved	Medium — Terraform may try to recreate existing resources

The Analogy

Drift is like the difference between a building's official blueprints and the building as it actually stands after years of modifications. Someone added a door here, removed a wall there — none of it recorded on the blueprints. If a contractor arrives with the blueprints and starts "fixing" the building to match them, they will seal up doors people are actively using. Drift detection is the regular inspection that keeps the blueprints and the building in sync before a contractor arrives.

terraform plan as a Drift Detector

terraform plan is the primary drift detection tool. By default it first queries the cloud provider API for the current settings of every managed resource and refreshes an in-memory copy of the state, then compares your configuration against that refreshed state. Any difference appears in the plan output, whether it comes from an unapplied configuration change or from a change made outside Terraform.

New terms:

terraform refresh: updates the state file to match the current real-world state, without making any changes to infrastructure. It is the "sync" operation that tells Terraform what is actually out there, but it writes to state with no preview. Deprecated since Terraform 0.15.4 in favour of terraform apply -refresh-only.
terraform apply -refresh-only: the modern replacement for terraform refresh. Shows what would be updated in state and asks for confirmation. Safer than the old refresh because it shows a preview before modifying state.
-refresh=false: flag on plan or apply that skips the real-world state refresh. Terraform uses only the recorded state file without querying the provider. Faster, but it cannot show drift. Use only in specific scenarios where you know state is accurate and want to skip API calls.

# Running plan as a drift check — no changes intended, just look for drift
terraform plan -detailed-exitcode

# Exit codes:
# 0 = success, no changes (no drift)
# 1 = error
# 2 = success, changes detected (drift found OR actual configuration change needed)
# Use -detailed-exitcode in scripts to distinguish "clean" from "drifted"

# Example: security group drift detection
# Original configuration: sg allows port 443 from 0.0.0.0/0
# Someone added port 22 from 0.0.0.0/0 manually during an incident

# terraform plan output:
# ~ resource "aws_security_group" "web" {
#     ~ ingress = [
#         - {
#             - cidr_blocks = ["0.0.0.0/0"]
#             - from_port   = 22
#             - protocol    = "tcp"
#             - to_port     = 22
#           },
#         # (existing rule unchanged)
#       ]
# }
# Plan: 0 to add, 1 to change, 0 to destroy.
# The ~ symbol marks an in-place update; the - lines show that applying will REMOVE the manually added SSH rule
# Applying this plan fixes the drift, but the engineer needs to know it happened

# Apply -refresh-only: update state to match reality WITHOUT changing infrastructure
# Use this when you ACCEPT the manual change and want to record it in state
terraform apply -refresh-only

# This is the first half of the "accept the drift" response:
# After refresh-only, the state file records the manually added SSH rule
# The configuration still does not declare that rule, so the next terraform plan
# will still propose removing it
# To keep the change, also add the rule to your .tf files; only then does the plan come back clean
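
The exit codes listed above can be wrapped in a small helper so that a script branches on a label rather than a raw number. This is a minimal sketch, not a definitive implementation: the function names classify_plan_exit and check_drift and the file name plan-output.txt are illustrative assumptions, and it assumes terraform init has already been run in the working directory.

```shell
# classify_plan_exit: hypothetical helper that maps the exit status of
# `terraform plan -detailed-exitcode` to a short label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;    # plan succeeded, nothing to change
    2) echo "changes" ;;  # plan succeeded, drift or pending config changes
    *) echo "error" ;;    # 1 or anything unexpected: the plan itself failed
  esac
}

# check_drift: hypothetical helper that runs a plan in the current directory,
# saves the full output for later review, and prints the label.
# The `|| _code=$?` form captures the status without aborting under `set -e`.
check_drift() {
  _code=0
  terraform plan -detailed-exitcode -input=false -no-color \
    >plan-output.txt 2>&1 || _code=$?
  classify_plan_exit "$_code"
}
```

A caller might use it as: status="$(check_drift)" and then alert only when status is "changes". The label is deliberately "changes" rather than "drift", because exit code 2 alone does not say where the difference came from.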

Scheduled Drift Detection Pipelines

Manual drift detection only works when someone remembers to run it. A scheduled pipeline runs drift checks automatically, reports findings, and notifies the team — without applying any changes. This is the operational practice that keeps infrastructure honest.

# .github/workflows/drift-detection.yml
# Runs every 6 hours — detects drift and reports it without applying

name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:         # Allow manual trigger

permissions:
  id-token: write
  contents: read
  issues: write              # Create GitHub issues for drift findings

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Run drift detection on all critical root modules in parallel
        module:
          - { name: networking,    dir: infrastructure/foundation/networking }
          - { name: databases,     dir: infrastructure/platform/databases }
          - { name: payments,      dir: infrastructure/services/payments }
      fail-fast: false       # Check all modules even if one has drift

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.DRIFT_DETECTION_ROLE_ARN }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.3"
          # Disable the wrapper script so the shell sees Terraform's raw exit code
          terraform_wrapper: false

      - name: Terraform Init
        working-directory: ${{ matrix.module.dir }}
        run: terraform init -no-color

      - name: Terraform Plan — Drift Check
        id: plan
        working-directory: ${{ matrix.module.dir }}
        # -detailed-exitcode: exit 2 means changes detected (drift)
        # We capture the output and exit code separately
        run: |
          terraform plan -detailed-exitcode -no-color 2>&1 | tee plan-output.txt
          echo "exit_code=${PIPESTATUS[0]}" >> $GITHUB_OUTPUT
        continue-on-error: true

      - name: Report Drift
        if: steps.plan.outputs.exit_code == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const planOutput = require('fs').readFileSync('${{ matrix.module.dir }}/plan-output.txt', 'utf8');
            const title = `Drift detected: ${{ matrix.module.name }} (${new Date().toISOString().split('T')[0]})`;

            // Check if a drift issue already exists for this module
            const issues = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              labels: ['terraform-drift', '${{ matrix.module.name }}'],
              state: 'open'
            });

            if (issues.data.length === 0) {
              // Create a new issue only if one doesn't already exist
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: title,
                body: `## Drift Detected in ${{ matrix.module.name }}\n\n` +
                      `**Detected at:** ${new Date().toISOString()}\n\n` +
                      `**Plan output:**\n\`\`\`\n${planOutput.slice(0, 3000)}\n\`\`\`\n\n` +
                      `**Action required:** Review the drift and choose a response strategy.`,
                labels: ['terraform-drift', '${{ matrix.module.name }}', 'needs-review']
              });
            }

      - name: Notify Slack on Drift
        if: steps.plan.outputs.exit_code == '2'
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK_URL }} \
            -H 'Content-type: application/json' \
            --data '{
              "text": ":warning: Drift detected in *${{ matrix.module.name }}*. Review the GitHub issue for details.",
              "channel": "#infrastructure-alerts"
            }'

      # Exit 0 = no drift, exit 2 = drift found, exit 1 = error
      # We only fail the job on error (exit 1) — drift is a notification, not a blocker
      - name: Fail on Error
        if: steps.plan.outputs.exit_code == '1'
        run: exit 1

# Scheduled run: 06:00 UTC

matrix: networking  → terraform plan -detailed-exitcode
  No changes. Your infrastructure matches the configuration.
  Exit code: 0 — No drift detected ✓

matrix: databases   → terraform plan -detailed-exitcode
  No changes. Your infrastructure matches the configuration.
  Exit code: 0 — No drift detected ✓

matrix: payments    → terraform plan -detailed-exitcode
  ~ aws_security_group.payments_internal
      ~ ingress = [
          - {
              - from_port   = 22
              - to_port     = 22
              - cidr_blocks = ["10.0.0.0/8"]
            }
        ]
  Plan: 0 to add, 1 to change, 0 to destroy.
  Exit code: 2 — DRIFT DETECTED

GitHub Issue created: "Drift detected: payments (2024-01-15)"
Slack notification sent: #infrastructure-alerts

# Drift detection job: PASSED (drift is a notification, not a pipeline failure)
# Manual review required for payments module drift

What just happened?

Drift detection ran without making any changes. The pipeline ran terraform plan — not terraform apply. It detected real drift in the payments security group but left it alone. The job succeeded with a notification rather than failing. Drift is information — not an emergency requiring immediate automated remediation.
A GitHub issue was created only once per active drift. The script checks for existing open issues before creating a new one — so the team gets one issue per drift finding, not a new issue on every 6-hour run. The issue stays open until a human closes it after resolving the drift.
-detailed-exitcode enables scripting on drift. Exit code 2 means "changes detected", which covers both drift and configuration changes that have been merged but not yet applied. The pipeline cannot tell the two apart on its own: exit 2 reliably indicates drift only for a module whose configuration has already been fully applied. Teams build on this to create smarter detection.
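
One way to separate drift from unapplied configuration changes is to run the plan in refresh-only mode, which compares recorded state with real infrastructure and leaves the configuration out of the comparison. The sketch below is hedged: drift_only_check and drift-output.txt are hypothetical names, and it assumes that -detailed-exitcode returns 2 for a refresh-only plan that finds changes, which is worth confirming on the Terraform version you run.

```shell
# drift_only_check: hypothetical helper. A refresh-only plan reports only
# differences between the state file and the real infrastructure, so pending
# .tf changes do not show up here.
# ASSUMPTION: exit status 2 signals "state would be updated" in this mode.
drift_only_check() {
  _code=0
  terraform plan -refresh-only -detailed-exitcode -input=false -no-color \
    >drift-output.txt 2>&1 || _code=$?
  case "$_code" in
    0) echo "no-drift" ;;  # state already matches reality
    2) echo "drift" ;;     # something changed outside Terraform
    *) echo "error" ;;     # the plan itself failed
  esac
}
```

A refresh-only plan only sees changes relative to the recorded state. If someone has already run terraform apply -refresh-only, the drift is in state and this check comes back clean, so it complements the standard plan rather than replacing it.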

The Four Drift Response Strategies

When drift is detected, the response depends on what drifted, why it drifted, and whether the drift represents a valid change or a violation. There are four strategies — the choice between them depends on context.

# Strategy 1: REVERT — apply Terraform to restore desired state
# Use when: the drift was accidental, unintended, or a security violation
# How: run terraform apply (the standard plan shows the correction)

# Example: SSH port 22 opened from the internet during panic — revert it
terraform plan   # Shows that applying will remove the rogue rule
terraform apply  # Restores the security group to the desired state
# Git commit: "revert: remove emergency SSH rule from payments SG"

# ─────────────────────────────────────────────────────────────────────────────

# Strategy 2: CODIFY — update .tf files to match reality, then refresh state
# Use when: the manual change was intentional and should be permanent
# How: update the .tf file, then run terraform apply to reconcile

# Example: database instance type upgraded manually for a performance incident
# The manual change is valid — update the config to match
# In main.tf: instance_class = "db.r6g.xlarge"  (was db.r6g.large)
terraform plan   # Should propose no infrastructure changes once the config matches reality
# If the plan only notes changes made outside of Terraform, record them in state:
terraform apply -refresh-only  # Update state to match current real-world state
# Then commit the .tf change so the codified version is in Git

# ─────────────────────────────────────────────────────────────────────────────

# Strategy 3: ACCEPT with documentation — acknowledge the deviation
# Use when: the drift is intentional, permanent, and managed outside Terraform
# How: use ignore_changes lifecycle to tell Terraform to stop tracking the attribute

resource "aws_db_instance" "main" {
  # ... other required arguments omitted for brevity
  instance_class = "db.r6g.large"  # Initial value

  lifecycle {
    ignore_changes = [
      instance_class,   # Resized by operations outside Terraform; ignore later changes
      engine_version,   # Minor version upgrades applied by AWS automatically
    ]
  }
}

# After adding ignore_changes:
# terraform plan will no longer show instance_class changes as drift
# Document WHY in a comment — otherwise future engineers will be confused

# ─────────────────────────────────────────────────────────────────────────────

# Strategy 4: IMPORT — bring the drifted resource back under Terraform management
# Use when: a resource was manually created outside Terraform and should be managed
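# How: write the resource block first, then terraform import, then verify plan shows no changes

# terraform import requires a matching resource block to already exist in the .tf files
terraform import aws_security_group.payments_v2 sg-0abc123def456789
terraform plan  # Should show no changes once the resource block matches the imported resource
# If the plan shows differences, adjust the resource block until it comes back clean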

Preventing Drift with lifecycle Rules

Some drift is unavoidable: AWS auto-patching engines, RDS minor version upgrades, Auto Scaling adjusting the desired capacity of a group. For these, ignore_changes prevents Terraform from fighting systems that legitimately modify the resource.

# ignore_changes — tell Terraform to stop tracking specific attributes
# Use for attributes that are legitimately managed outside Terraform

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  lifecycle {
    ignore_changes = [
      ami,            # AMI updates managed by AWS Systems Manager Patch Manager
      user_data,      # User data may be updated by bootstrap automation
      tags["LastUpdated"],  # Specific tag key updated by a config management tool
    ]
  }
}

resource "aws_rds_cluster" "main" {
  engine_version = "15.3"

  lifecycle {
    ignore_changes = [
      engine_version,       # AWS auto-applies minor patches: 15.3 → 15.4 etc.
      availability_zones,   # AWS may rebalance AZs — ignore automatic rebalancing
    ]
  }
}

# CAUTION: ignored attributes are still used when the resource is created,
# but are never updated afterwards
# If the initial resource is created with a wrong value for an ignored attribute,
# Terraform will not correct it on subsequent applies
# Test with a fresh resource before adding ignore_changes to existing ones

# CAUTION: Do not use ignore_changes = all
# ignore_changes = all tells Terraform to ignore every attribute change
# After creation, Terraform will never detect any drift on this resource
# The resource is effectively unmanaged — defeats the purpose of Terraform entirely
# Only use specific attribute names in ignore_changes, never the wildcard all
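
The rule against ignore_changes = all can be checked mechanically, for example as a step in CI. The following is a rough sketch under stated assumptions: find_ignore_all is a hypothetical helper name, and the pattern is a plain text match, so it will also flag a commented-out line and will miss unusual formatting.

```shell
# find_ignore_all: hypothetical helper that lists every line in *.tf files
# under the given directory (default: current directory) that sets
# ignore_changes to the keyword all. Exits non-zero when nothing matches,
# which is grep's normal behaviour.
find_ignore_all() {
  grep -rnE --include='*.tf' \
    'ignore_changes[[:space:]]*=[[:space:]]*all([^[:alnum:]_]|$)' "${1:-.}"
}
```

In a pipeline, a step could fail the build when find_ignore_all infrastructure/ prints anything, which forces a reviewer to justify each use instead of letting it slip in unnoticed.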

Common Drift Detection Mistakes

Using terraform apply -refresh-only to accept security drift

terraform apply -refresh-only updates the state file to match reality. If the drift is a rogue security group rule or an over-permissive IAM policy, refresh-only writes the violation into state without fixing it. For attributes the configuration does not set explicitly, later plans may stop reporting the change altogether; for attributes the configuration does set, the violation stays live until someone runs a full apply. Always assess what drifted before choosing refresh-only: it is the right response for intentional operational changes, not for security violations.

Using -refresh=false to make the plan "clean"

Running terraform plan -refresh=false skips the real-world state query — the plan only compares configuration to the recorded state file. It will show no drift even if massive drift exists. Some teams use this to speed up plans or silence drift noise. The result is a false sense of security — the plan looks clean but the infrastructure may have diverged significantly.

Auto-applying drift remediation in production

A drift detection pipeline that automatically applies when drift is found risks reverting manual changes made during active incidents. If an engineer added an emergency firewall rule to restore connectivity and the drift pipeline fires 6 hours later and removes it, production goes down again. Drift detection should notify, not auto-remediate in production. The apply decision must be human-made.

Drift is information — respond proportionally

Not all drift is equal. A tag missing from a database is informational. A security group with port 22 open to the internet is urgent. A database instance type that was manually scaled up during a traffic incident is intentional and should be codified. Build a drift response runbook for your team: what types of drift require immediate revert, what can be codified and committed, what should be accepted with ignore_changes, and what triggers a security incident response. Without this framework, teams either panic over cosmetic drift or ignore genuine security violations.

Practice Questions

1. Which command runs a plan and returns exit code 2 when changes are detected — enabling scripts to distinguish "no drift" from "drift found"?

2. Which command updates the Terraform state file to match current real-world infrastructure without making any changes to that infrastructure?

3. Which Terraform lifecycle rule prevents Terraform from detecting drift on attributes managed by external systems like AWS Auto Scaling?

Up Next · Lesson 41

Upgrade Strategies

Drift managed. Lesson 41 covers how to safely upgrade Terraform itself, how to upgrade providers across major versions, the .terraform.lock.hcl file and how it protects you, and a step-by-step upgrade runbook for moving from Terraform 1.x to future versions without breaking production infrastructure.
