Terraform Lesson 35 – Terraform Troubleshooting | Dataplexa

Section IV · Lesson 35

Terraform Troubleshooting

Every Terraform engineer hits errors they have never seen before. The difference between an engineer who panics and one who resolves them quickly is a systematic diagnostic approach — reading the full error message, understanding what Terraform was trying to do, and knowing which tool to reach for. This lesson gives you that approach and walks through the most common errors in production Terraform.

This lesson covers

Reading Terraform error messages → TF_LOG debug logging → Common init errors → Common plan errors → Common apply errors → State corruption and recovery → Dependency and cycle errors → Provider authentication failures → terraform console for debugging

The Diagnostic Approach

When Terraform reports an error, many engineers read the last line and immediately start searching. That is the wrong order. The last line is usually a consequence of the actual error, which appears earlier. Read the full error output from top to bottom: the root cause is almost always in the first error block, and everything after it is fallout.

The Diagnostic Order

When something breaks: 1. Read the full error output — do not skim. 2. Identify which operation failed — init, plan, or apply. 3. Find the first error block — this is usually the root cause. 4. Check which resource or file is mentioned. 5. If unclear, enable TF_LOG=DEBUG and re-run. Only then search for the error text.
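Steps 1 and 3 can be partly mechanised when the output is long. The sketch below assumes the command output has been saved to a file (the name plan.log is hypothetical) and that each error block starts with a line containing "Error:", as in the messages shown in this lesson. The helper name first_error_block is illustrative and not a Terraform feature.

```shell
# first_error_block FILE
# Print the first error block from saved Terraform output: everything from the
# first line containing "Error:" up to (not including) the next such line.
# Assumption: each error block begins with a line containing "Error:".
first_error_block() {
  awk '
    /Error:/ { if (seen) exit; seen = 1 }   # a second "Error:" line ends the block
    seen     { print }                      # print lines once the first block has started
  ' "$1"
}

# Hypothetical usage (requires terraform; -no-color keeps escape codes out of the log):
#   terraform plan -no-color > plan.log 2>&1
#   first_error_block plan.log
```

This only narrows the view to the likely root cause. The full log is still worth keeping, because later blocks sometimes name the resource the first block omits.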

TF_LOG — Debug Logging

Terraform's debug logging reveals exactly what it is doing internally — which API calls it is making, what responses it receives, how it is resolving dependencies. It is noisy but invaluable for errors that give no clear indication of what went wrong.

New terms:

TF_LOG — environment variable that sets the log level. Values in ascending verbosity order: ERROR, WARN, INFO, DEBUG, TRACE. DEBUG shows API calls and responses. TRACE shows everything including internal graph walks — very verbose.
TF_LOG_PATH — redirects log output to a file instead of stderr. Essential for CI/CD where terminal output is mixed — log to file, parse separately.
TF_LOG_PROVIDER — sets log level for provider plugins only. Use when you want provider API call detail without core Terraform noise.

# Enable debug logging — set before the terraform command

# Basic: show all log levels including DEBUG
export TF_LOG=DEBUG
terraform plan

# Log to file: essential for capturing CI/CD output (TF_LOG must also be set, or nothing is logged)
export TF_LOG=DEBUG
export TF_LOG_PATH=/tmp/terraform-debug.log
terraform apply

# Provider-only logging — isolate provider API calls
export TF_LOG_PROVIDER=DEBUG
terraform plan

# Trace-level — maximum verbosity, internal graph and walk detail
export TF_LOG=TRACE
terraform plan 2>&1 | head -200  # Pipe to head — TRACE output is enormous

# Disable after debugging
unset TF_LOG
unset TF_LOG_PATH
unset TF_LOG_PROVIDER

# Search the log for specific operations
grep "Request\|Response\|Error" /tmp/terraform-debug.log | head -50
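Exported variables stay set until they are unset, so it is easy to leave debug logging on by accident. One option, sketched below, is to set the variables for a single command only. The wrapper name with_tf_debug and the log path are hypothetical; TF_LOG and TF_LOG_PATH are the variables described above.

```shell
# with_tf_debug LOG_FILE COMMAND [ARGS...]
# Run one command with TF_LOG=DEBUG and TF_LOG_PATH set only for that command.
# env(1) passes the variables to the child process without exporting them in
# the calling shell, so nothing needs to be unset afterwards.
with_tf_debug() {
  log_file="$1"
  shift
  env TF_LOG=DEBUG TF_LOG_PATH="$log_file" "$@"
}

# Hypothetical usage (requires terraform):
#   with_tf_debug /tmp/terraform-debug.log terraform plan
```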

Common terraform init Errors

# ── ERROR 1: Provider not found ──────────────────────────────────────────────

# Error message:
# Error: Failed to query available provider packages
# Could not retrieve the list of available versions for provider hashicorp/aws:
# could not connect to registry.terraform.io

# Causes:
# - No internet connectivity from the machine running terraform init
# - Corporate proxy blocking registry.terraform.io
# - Air-gapped environment without a mirror configured

# Fix: Configure a provider mirror for air-gapped environments
# On a machine with internet access, download the providers the configuration needs
terraform providers mirror /path/to/local/mirror
# Copy that directory to the mirror path on the air-gapped machine

# In the CLI configuration file (~/.terraformrc, or terraform.rc on Windows),
# not in terraform.tf: provider_installation is a top-level CLI setting
provider_installation {
  filesystem_mirror {
    path    = "/usr/share/terraform/providers"
    include = ["registry.terraform.io/*/*"]
  }
  direct {
    exclude = ["registry.terraform.io/*/*"]  # Block direct downloads
  }
}

# ── ERROR 2: Module not installed ────────────────────────────────────────────

# Error message:
# Error: Module not installed
# Module "vpc" is not yet installed. Run "terraform init" to install all modules.

# Cause: Added a new module block and ran terraform plan without terraform init
# Fix: Always run terraform init after adding any new module source
terraform init

# ── ERROR 3: Backend configuration changed ───────────────────────────────────

# Error message:
# Error: Backend configuration changed
# A change in the backend configuration has been detected, which may require
# migrating or reconfiguring the backend. Please run "terraform init".

# Cause: Changed the backend block (different S3 bucket, key, or region)
# Fix: Re-run terraform init with one of these flags, depending on intent
terraform init -migrate-state   # Copy existing state to the new backend
terraform init -reconfigure     # Use the new backend as-is and ignore the previous backend's state

Common terraform plan Errors

# ── ERROR 1: Unsupported argument ────────────────────────────────────────────

# Error message:
# Error: Unsupported argument
#   on main.tf line 14, in resource "aws_s3_bucket" "app":
#   14:   acl = "private"
# An argument named "acl" is not expected here.

# Cause: AWS provider v4+ split bucket configuration into separate resources
# The acl argument was removed from aws_s3_bucket in provider v4

# Fix: Use the dedicated sub-resource
resource "aws_s3_bucket_acl" "app" {
  bucket = aws_s3_bucket.app.id
  acl    = "private"
}

# ── ERROR 2: Invalid reference ───────────────────────────────────────────────

# Error message:
# Error: Reference to undeclared resource
#   on main.tf line 22, in resource "aws_instance" "web":
#   22:   subnet_id = aws_subnet.public.id
# A managed resource "aws_subnet" "public" has not been declared.

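# Cause: Typo in the resource name; the resource is actually declared as aws_subnet.public_a
# Fix: Check the exact name declared in the configuration:
grep -rn 'resource "aws_subnet"' --include='*.tf' .
terraform state list | grep subnet  # Only lists resources that have already been applied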

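# ── ERROR 3: Inconsistent conditional result types ────────────────────────────

# Error message (exact wording varies by Terraform version):
# Error: Inconsistent conditional result types
# The true and false result expressions must have consistent types.
# The 'true' value is tuple, but the 'false' value is string.

# Cause: The two branches have types Terraform cannot unify,
# for example a list in one branch and a string in the other
# Wrong:
#   allowed_cidrs = var.restrict ? ["10.0.0.0/8"] : ""
# Fix: return the same type from both branches
#   allowed_cidrs = var.restrict ? ["10.0.0.0/8"] : []

# A null branch is not an error: null converts to the other branch's type
variable "kms_key" {
  type    = string
  default = null
}
resource "aws_s3_bucket_server_side_encryption_configuration" "app" {
  bucket = aws_s3_bucket.app.id
  rule {
    apply_server_side_encryption_by_default {
      kms_master_key_id = var.kms_key  # null leaves the argument unset; no conditional needed
      sse_algorithm     = var.kms_key != null ? "aws:kms" : "AES256"  # Both are strings
    }
  }
}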

# ── ERROR 4: Cycle error — circular dependency ───────────────────────────────

# Error message:
# Error: Cycle: aws_security_group.web, aws_security_group.app
# Terraform detected a configuration cycle in the following resources.

# Cause: SG A references SG B as source, SG B references SG A as source
# Fix: Remove the inline rules that reference the other group and declare
# them as separate aws_security_group_rule resources to break the cycle
resource "aws_security_group_rule" "web_to_app" {
  type                     = "egress"
  security_group_id        = aws_security_group.web.id
  source_security_group_id = aws_security_group.app.id  # For egress, this is the destination group
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
}
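For the undeclared-resource error (ERROR 2 above), the authoritative list of names is the configuration itself, not the state. A minimal sketch, assuming resource blocks are written in the usual single-line `resource "TYPE" "NAME" {` form. The function name declared_resources is hypothetical, and it scans only *.tf files, not JSON-syntax configuration.

```shell
# declared_resources [DIR]
# Print TYPE.NAME for every resource block header found in *.tf files under DIR
# (default: current directory), sorted for easy comparison with an error message.
declared_resources() {
  find "${1:-.}" -type f -name '*.tf' -exec \
    sed -nE 's/^[[:space:]]*resource[[:space:]]+"([^"]+)"[[:space:]]+"([^"]+)".*/\1.\2/p' {} + |
    LC_ALL=C sort
}

# Hypothetical usage:
#   declared_resources . | grep aws_subnet
```

Comparing this list with the address in the error message usually makes a typo such as public versus public_a obvious.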

Common terraform apply Errors

# ── ERROR 1: Resource already exists ─────────────────────────────────────────

# Error message:
# Error: creating S3 Bucket (my-app-bucket): BucketAlreadyOwnedByYou:
# Your previous request to create the named bucket succeeded and you own it.

# Cause: The resource exists in AWS but not in Terraform state
# This happens when someone created the resource manually, or state was lost
# Fix: Import the existing resource into state
terraform import aws_s3_bucket.app my-app-bucket  # Import by AWS resource ID

# ── ERROR 2: Error acquiring the state lock ──────────────────────────────────

# Error message:
# Error: Error acquiring the state lock
# Error message: ConditionalCheckFailedException: The conditional request failed
# Lock Info: ID=abc-123, Path=prod/app/terraform.tfstate, Operation=OperationTypePlan
# Created=2024-01-15 09:23:11 UTC, Info=alice@acme.com

# Cause: Another Terraform process holds the lock (another engineer running apply)
# OR: A previous apply crashed and left the lock unreleased

# Fix: Wait for the other process. If the process is definitely dead:
terraform force-unlock LOCK_ID  # Use the Lock ID from the error message
# DANGER: Only run force-unlock if you are certain no other process is running
# Releasing a live lock can corrupt state

# ── ERROR 3: Error modifying resource — update not allowed ───────────────────

# Error message:
# Error: updating RDS DB Instance (prod-db): InvalidParameterCombination:
# Cannot upgrade postgres from 14.5 to 15.3 directly.

# Cause: The upgrade path is invalid; in this example it must go 14.5 → 14.latest → 15.3
# Fix: Stage the upgrade in two separate applies
# First apply: engine_version = "14.9"   (upgrade to a later minor version)
# Second apply: engine_version = "15.3"  (major version upgrade; also requires
#               allow_major_version_upgrade = true on the aws_db_instance)

# ── ERROR 4: Provider authentication failure ─────────────────────────────────

# Error message:
# Error: configuring Terraform AWS Provider:
# error validating provider credentials: error calling sts:GetCallerIdentity:
# operation error STS: GetCallerIdentity, https response error StatusCode: 403

# Cause: Invalid or expired credentials, wrong profile, or no credentials at all
# Diagnose:
aws sts get-caller-identity  # Test if AWS CLI can authenticate independently
aws configure list           # Show which credential source is active
echo $AWS_ACCESS_KEY_ID      # Check if env vars are set

# Fix: Refresh credentials, re-export environment variables, or re-authenticate

State Corruption and Recovery

State corruption is rare, but it is the most stressful Terraform scenario. S3 versioning exists for exactly this situation: with versioning enabled on the state bucket, every state write creates a new object version, and any previous version can be restored.

New terms:

terraform state list — lists every resource address tracked in the state file. First command to run when diagnosing state issues.
terraform state show — shows all attributes of a specific resource in state. Use to verify what Terraform thinks the current state of a resource is.
terraform state rm — removes a resource from state without destroying it in the cloud. Use when you want Terraform to stop managing a resource without deleting it.
terraform state mv — moves a resource address in state. Used when renaming a resource in configuration — prevents destroy+recreate.

# State inspection commands — diagnose before touching anything

# List all resources in state
terraform state list

# Show all attributes of a specific resource
terraform state show aws_db_instance.main
terraform state show 'module.vpc.aws_subnet.public["10.0.1.0/24"]'  # Module resources

# ── Renaming a resource — prevent destroy+recreate ────────────────────────────

# Before rename in configuration:
# resource "aws_instance" "web_server" { ... }  (old name)
# After rename in configuration:
# resource "aws_instance" "web" { ... }          (new name)

# Without state mv: Terraform will destroy web_server and create web (downtime!)
# With state mv: Terraform moves the state record — no destroy, no recreate
terraform state mv aws_instance.web_server aws_instance.web

# Same for module resource addresses
terraform state mv 'module.old_name.aws_vpc.this' 'module.new_name.aws_vpc.this'

# ── Remove a resource from state without destroying it ────────────────────────

# Use case: you want to stop managing an RDS instance with Terraform
# without deleting the database
terraform state rm aws_db_instance.main
# Resource is removed from state — Terraform no longer tracks it
# The actual RDS instance continues running in AWS — unaffected

# ── Recover from state corruption using S3 versioning ────────────────────────

# List versions of the state file in S3
aws s3api list-object-versions \
  --bucket acme-terraform-state \
  --prefix prod/app/terraform.tfstate \
  --query 'Versions[*].{VersionId:VersionId,LastModified:LastModified}' \
  --output table

# Download a previous state version
aws s3api get-object \
  --bucket acme-terraform-state \
  --key prod/app/terraform.tfstate \
  --version-id YOUR_VERSION_ID \
  terraform.tfstate.backup

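# Restore: overwrite the current state with the backup
aws s3 cp terraform.tfstate.backup \
  s3://acme-terraform-state/prod/app/terraform.tfstate

# Then run terraform plan to verify state matches reality
# If the backend uses a DynamoDB lock table, Terraform may report a checksum
# mismatch after a manual overwrite; the stored digest must then be updated
terraform plan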
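Before overwriting the current state, it helps to confirm that the downloaded version is the one you expect. The state file is JSON with top-level serial and lineage fields; the sketch below reads them with sed, assuming Terraform's usual pretty-printed layout with one field per line. The function name state_field is hypothetical, and a real JSON parser would be more robust.

```shell
# state_field FILE FIELD
# Print the first value of a field such as serial or lineage from a state file.
# Assumption: the file is pretty-printed with one "field": value pair per line,
# and the top-level fields appear before any nested field of the same name.
state_field() {
  sed -nE 's/^[[:space:]]*"'"$2"'":[[:space:]]*"?([^",]+)"?,?[[:space:]]*$/\1/p' "$1" |
    head -n 1
}

# Hypothetical usage:
#   state_field terraform.tfstate.backup serial
#   state_field terraform.tfstate.backup lineage
```

A restored version should carry the same lineage as the state it replaces and a lower serial. A different lineage suggests the file belongs to another configuration.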

terraform console — Interactive Debugging

terraform console opens an interactive REPL where you can evaluate any Terraform expression against the current state and configuration. It is the fastest way to debug complex expressions, function calls, and variable interpolations without running a full plan.

# Start the interactive console
terraform console

# Test expressions interactively — no plan needed

# Check a variable value
> var.environment
"dev"

# Test a function
> upper("hello")
"HELLO"

# Test cidr calculations
> cidrsubnet("10.0.0.0/16", 8, 1)
"10.0.1.0/24"

# Debug a complex local expression
> local.common_tags
{
  "Environment" = "dev"
  "ManagedBy"   = "Terraform"
  "Project"     = "acme"
}

# Test a for expression before using it in a resource
> [for s in ["10.0.1.0/24", "10.0.2.0/24"] : cidrhost(s, 1)]
[
  "10.0.1.1",
  "10.0.2.1",
]

# Check a resource attribute from state
> aws_vpc.main.id
"vpc-0abc123def456789"

# Test a conditional
> var.environment == "prod" ? "t3.large" : "t3.micro"
"t3.micro"

# Debug a jsondecode
> jsondecode("{\"username\":\"admin\",\"password\":\"secret\"}")
{
  "password" = "secret"
  "username" = "admin"
}

# Exit
> exit

$ terraform console

# Console reads current state and variables
# Useful for debugging complex expressions before committing to a plan

> cidrsubnet("10.0.0.0/16", 8, 0)
"10.0.0.0/24"

> cidrsubnet("10.0.0.0/16", 8, 1)
"10.0.1.0/24"

# Great for debugging for_each keys
> toset(["10.0.1.0/24", "10.0.2.0/24"])
toset([
  "10.0.1.0/24",
  "10.0.2.0/24",
])

# Check resource attributes without running terraform state show
> aws_instance.web.public_ip
"54.211.89.132"

# Test merge() before using in a module
> merge({Name = "web"}, {Environment = "dev", ManagedBy = "Terraform"})
{
  "Environment" = "dev"
  "ManagedBy"   = "Terraform"
  "Name"        = "web"
}

What just happened?

terraform console evaluates expressions against real state. Unlike a scratchpad, the console reads the actual current state file and all variable values. aws_instance.web.public_ip returns the actual IP of the running instance. This is what makes it useful for debugging — you are working with real data.
Test complex expressions before they go into production configurations. A for expression, a cidrsubnet() call, a merge() — all can be validated interactively in seconds before they are committed to a resource block and run against real infrastructure.
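The console can also be used non-interactively: when an expression is piped to it on standard input, it evaluates the expression, prints the result, and exits. A minimal sketch of a wrapper for that pattern. The name tf_eval is hypothetical, and the command must be run from an initialised working directory.

```shell
# tf_eval EXPRESSION
# Evaluate a single expression with terraform console and print the result.
tf_eval() {
  printf '%s\n' "$1" | terraform console
}

# Hypothetical usage (requires terraform and an initialised working directory):
#   tf_eval 'cidrsubnet("10.0.0.0/16", 8, 1)'
#   tf_eval 'var.environment'
```

This form is convenient in scripts and CI jobs, where an interactive prompt is not available.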

Common Troubleshooting Mistakes

Running terraform force-unlock without verifying no other process holds the lock

terraform force-unlock is a last resort. If you force-unlock a state that another active Terraform process holds, both processes write to state simultaneously. The result is state corruption — resources appearing in the wrong state, duplicate entries, or a completely invalid JSON state file. Only force-unlock when you are certain the locking process is dead — verified in CI/CD logs, not assumed.
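Copying the lock ID by hand invites typos. A minimal sketch that pulls the ID out of saved error output. It accepts both the condensed ID=... form shown in this lesson and a separate "ID:" line; the second form is an assumption about how your Terraform version prints the Lock Info block. The function name lock_id_from_log is hypothetical. It only extracts the ID and deliberately does not run force-unlock.

```shell
# lock_id_from_log FILE
# Print the first lock ID found in saved Terraform error output.
# "ID" must not be preceded by a letter, digit, or underscore, so fields such
# as "RequestID:" are ignored.
lock_id_from_log() {
  sed -nE 's/(^|.*[^[:alnum:]_])ID[=:][[:space:]]*([^,[:space:]]+).*/\2/p' "$1" |
    head -n 1
}

# Hypothetical usage (only after confirming the locking process is dead):
#   lock_id_from_log apply.log
```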

Manually editing the state file

The state file is plain JSON, but it carries a serial number, a lineage ID, and cross-references between resources that must stay consistent. Hand-editing it in a text editor is an easy way to corrupt it. Use terraform state mv, terraform state rm, and terraform import, which keep the file consistent. If you must edit state directly, fetch it with terraform state pull, make the minimum change with a JSON editor, increment the serial so the result counts as newer, and upload it with terraform state push. Where possible, restore from a backup instead.

Destroying and recreating instead of using state mv when renaming resources

When a resource is renamed in the configuration, Terraform plans to destroy the old resource and create a new one, even if the underlying infrastructure is identical. For a database or a load balancer, this means downtime. Always run terraform state mv OLD_ADDRESS NEW_ADDRESS after a rename and before the next apply. The state mv tells Terraform the resource is the same physical infrastructure under a new logical name. On Terraform 1.1 and later, a moved block in the configuration achieves the same result.

The troubleshooting toolkit — in order of reach

When Terraform errors: 1. Read the full output — especially the first error block. 2. Run terraform validate — catches syntax and reference errors before any API calls. 3. Run terraform console to test suspicious expressions against real state. 4. Set TF_LOG=DEBUG and re-run — provider API call detail reveals what AWS/Azure/GCP rejected and why. 5. Check terraform state list and terraform state show to understand what Terraform thinks exists. 6. Check S3 bucket versioning for state recovery. In that order. Each step costs seconds and the answer is usually found by step 3.

Practice Questions

1. You rename aws_instance.web_server to aws_instance.web in your configuration. What command prevents Terraform from destroying and recreating the instance?

2. Which environment variable enables debug-level logging to see Terraform's API calls and responses?

3. Which command opens an interactive REPL where you can test Terraform expressions and functions against current state without running a plan?


Up Next · Lesson 36

Terraform at Scale

Errors diagnosed. Lesson 36 covers how to scale Terraform to large organisations — workspaces vs separate state files, root module composition patterns, remote state data sources, and the team workflows that prevent state conflicts when dozens of engineers are applying simultaneously.
