Terraform Course
State Locking and Consistency
You have seen locking prevent a concurrent apply. But what happens when a process crashes mid-apply with the lock held? What is actually inside a DynamoDB lock entry? How does Terraform guarantee state consistency even when the network drops between operations? This lesson answers all of it — with real recovery procedures you will use in production.
This lesson covers
How the DynamoDB lock entry works → What a crash mid-apply leaves behind → Safe lock recovery procedure → State consistency guarantees → Diagnosing and fixing a corrupted lock → Locking in CI/CD pipelines
How State Locking Works End to End
Every time Terraform is about to modify state — during apply, during state commands, during certain imports — it acquires a lock first. The locking mechanism depends on the backend. For S3 with DynamoDB, the sequence is precise and worth understanding completely.
The four-step lock lifecycle — acquire, read state, apply changes, release lock
The critical detail is step 1. Terraform writes the lock entry to DynamoDB with a condition expression — attribute_not_exists(LockID) — so the put only succeeds if no entry with the same LockID already exists. If another process holds the lock, DynamoDB rejects the write with a ConditionalCheckFailedException. The existence check and the write are a single atomic operation at the database level — no race condition is possible. Either you get the lock or you do not.
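The conditional-write semantics can be sketched in a few lines of Python. This is a simulation of DynamoDB's behaviour, not the real SDK; LockTable and put_if_absent are illustrative names.

```python
# Sketch: simulates the atomic "put if absent" semantics Terraform relies on.
# In real DynamoDB this is PutItem with ConditionExpression
# "attribute_not_exists(LockID)" — check and write happen as one atomic step.

class ConditionalCheckFailed(Exception):
    """Raised when the item already exists — the lock is held."""

class LockTable:
    def __init__(self):
        self._items = {}  # LockID -> lock info

    def put_if_absent(self, lock_id, info):
        # The existence check and the write are one indivisible operation here,
        # just as the conditional put is indivisible inside DynamoDB.
        if lock_id in self._items:
            raise ConditionalCheckFailed(f"lock already held by {self._items[lock_id]}")
        self._items[lock_id] = info

    def delete(self, lock_id):
        self._items.pop(lock_id, None)

table = LockTable()
table.put_if_absent("bucket/lesson-17/terraform.tfstate", {"Who": "engineer-a"})
try:
    table.put_if_absent("bucket/lesson-17/terraform.tfstate", {"Who": "engineer-b"})
except ConditionalCheckFailed:
    print("second acquire rejected")  # only one process can ever hold the lock
```

Because the check-and-write is a single operation, there is no window in which two processes can both observe "no lock" and both write one.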
Step 4 is where crashes cause problems. If the process is killed between step 3 and step 4, the lock entry remains in DynamoDB indefinitely. Terraform has no heartbeat mechanism — it cannot tell whether the lock holder is still running or has died. The stale lock must be cleared manually.
Inside the DynamoDB Lock Entry
When Terraform acquires a lock, it writes a JSON document to the DynamoDB table. Understanding every field in this document is what allows you to diagnose a stuck lock accurately — confirming it is genuinely stale before force-unlocking.
Set up a project with a remote backend and run an apply in one terminal while inspecting the DynamoDB table in another:
mkdir terraform-lesson-17
cd terraform-lesson-17
touch versions.tf main.tf
Add this to versions.tf:
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    # random is used by random_id in main.tf — declare it explicitly
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6"
    }
  }

  # Remote backend — locks acquired and released here
  backend "s3" {
    bucket         = "acme-terraform-state-123456789012"
    key            = "lesson-17/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock" # This is where lock entries are written
  }
}

provider "aws" {
  region = "us-east-1"
}
Add this to main.tf:
# Simple resource to trigger a real apply and lock acquisition
resource "aws_s3_bucket" "lesson17" {
  bucket = "terraform-lesson-17-lock-demo-${random_id.suffix.hex}"

  tags = {
    Name      = "lesson17-lock-demo"
    ManagedBy = "Terraform"
  }
}

resource "random_id" "suffix" {
  byte_length = 4 # Generates an 8-character hex suffix for a unique bucket name
}
Run init, then start an apply and immediately inspect the lock in a second terminal:
# Terminal 1 — run the apply
terraform init
terraform apply
# Terminal 2 — inspect the lock entry while Terminal 1 is running
# Query the DynamoDB table directly using the AWS CLI
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate"}}'
# The lock entry in DynamoDB while terraform apply is running.
# The Info value is stored as a single-line escaped JSON string; it is
# wrapped and indented here for readability.
{
  "Item": {
    "LockID": {
      "S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate"
    },
    "Info": {
      "S": "{
        \"ID\": \"f47ac10b-58cc-4372-a567-0e02b2c3d479\",
        \"Operation\": \"OperationTypeApply\",
        \"Info\": \"\",
        \"Who\": \"engineer@MacBook-Pro.local\",
        \"Version\": \"1.6.0\",
        \"Created\": \"2024-01-15T14:22:33.456789Z\",
        \"Path\": \"acme-terraform-state-123456789012/lesson-17/terraform.tfstate\"
      }"
    }
  }
}
# After terraform apply completes — the entry is gone
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate"}}'
# Prints nothing — no matching item exists, so the lock was released
What just happened?
- The LockID is the full S3 path to the state file. The partition key in DynamoDB is bucket-name/key-path — the complete path to the state file. This means the same DynamoDB table can serve as the lock table for every project in your organisation: each project's lock entry has a unique LockID based on its state file path.
- The Info field contains everything needed to diagnose a stuck lock. The JSON in the Info field reveals who holds the lock (machine and username), which operation type is running (Apply, Plan, StatePush), which Terraform version, and — most critically — when the lock was created. The Created timestamp is what you compare against the current time to determine whether a lock is stale.
- After apply completes, the entry is deleted. A successful apply and a failed apply both trigger lock release: the DynamoDB DeleteItem call sits in a deferred cleanup function that runs even if the apply errors partway through. The only scenarios where it does not run are a hard kill (SIGKILL) or a complete machine failure.
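The deferred-release behaviour in the last bullet can be sketched with a try/finally block, a rough Python analogue of the deferred cleanup described above. All names here are illustrative:

```python
# Sketch: why a failed apply still releases the lock, but a SIGKILL does not.
# Release lives in a finally block, so it runs on success and on ordinary
# errors — but no user code runs at all after a SIGKILL, so the lock leaks.

held_locks = set()

def acquire(lock_id):
    held_locks.add(lock_id)

def release(lock_id):
    held_locks.discard(lock_id)

def apply(lock_id, should_fail):
    acquire(lock_id)
    try:
        if should_fail:
            raise RuntimeError("provider error mid-apply")
        return "applied"
    finally:
        release(lock_id)  # deferred cleanup — runs on success AND on error

apply("demo", should_fail=False)
assert "demo" not in held_locks  # success path released the lock

try:
    apply("demo", should_fail=True)
except RuntimeError:
    pass
assert "demo" not in held_locks  # error path released it too
```

A SIGKILL would terminate the process before the finally block runs, which is exactly the scenario that leaves a stale entry in DynamoDB.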
Simulating a Crash Mid-Apply
The most dangerous scenario in state management is a process crash between writing new state to S3 and releasing the lock in DynamoDB. Here is exactly what that looks like and what it leaves behind.
# Simulate a crash — start apply and kill the process mid-run
# Terminal 1:
terraform apply
# While "Creating..." is shown — kill the process hard. A single Ctrl+C sends
# SIGINT, which Terraform handles gracefully and usually releases the lock,
# so press Ctrl+C twice (forced quit) or run: kill -9 $(pgrep terraform)
# Now check what state the system is in:
# 1. Check if the lock is still held
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate"}}'
# 2. Check what state was written before the crash
terraform state list
# 3. Attempt a new plan — this will fail because the lock is held
terraform plan
# Terraform apply force-killed mid-run
$ terraform apply
Plan: 2 to add, 0 to change, 0 to destroy.
Enter a value: yes
random_id.suffix: Creating...
random_id.suffix: Creation complete [id=a3f2b1c4]
aws_s3_bucket.lesson17: Creating...
^C^C # Process force-quit here — bucket creation was in progress, release never ran
# Lock entry is still in DynamoDB — was not released
$ aws dynamodb get-item --table-name terraform-state-lock \
--key '{"LockID": {"S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate"}}'
{
"Item": {
"LockID": { "S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate" },
"Info": { "S": "{\"ID\":\"f47ac10b-...\",\"Created\":\"2024-01-15T14:22:33Z\",...}" }
}
}
# State may be partially written — random_id created before the crash
$ terraform state list
random_id.suffix # This was created before the crash
# aws_s3_bucket.lesson17 may or may not be in state — depends on timing
# Planning fails because the lock is still held
$ terraform plan
╷
│ Error: Error acquiring the state lock
│
│ Lock Info:
│ ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
│ Path: acme-terraform-state-123456789012/lesson-17/terraform.tfstate
│ Who: engineer@MacBook-Pro.local
│ Version: 1.6.0
│ Created: 2024-01-15 14:22:33 UTC # Check this — is it recent or old?
╵
What just happened?
- The random_id was created and state was partially written before the crash. Terraform writes state incrementally — after each resource is created, state is updated. The crash happened after the random_id succeeded but while the S3 bucket was being created. The state file in S3 may contain the random_id entry and a partial or missing bucket entry.
- The lock entry survived because the release step never ran. Terraform defers lock release to the end of the operation. A Ctrl+C sends SIGINT which Terraform catches and handles gracefully — usually releasing the lock. A SIGKILL or a machine failure does not give Terraform the chance to clean up. The result is a stale lock entry in DynamoDB.
- The Created timestamp is how you determine if the lock is stale. A lock created 30 seconds ago might belong to a running process. A lock created 6 hours ago in a pipeline that completed 5 hours ago is definitely stale. Compare the timestamp against your knowledge of recent operations before force-unlocking.
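The staleness comparison can be sketched as a small helper. The one-hour threshold is an assumption; choose one comfortably longer than your slowest apply:

```python
# Sketch: decide whether a lock's Created timestamp is old enough to be stale.
# STALE_AFTER is an illustrative threshold, not a Terraform default.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=1)

def is_probably_stale(created_iso, now=None):
    # DynamoDB stores Created as an ISO 8601 UTC timestamp ending in "Z"
    created = datetime.fromisoformat(created_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now - created > STALE_AFTER

now = datetime(2024, 1, 15, 20, 30, tzinfo=timezone.utc)
print(is_probably_stale("2024-01-15T14:22:33Z", now))  # True — about 6 hours old
print(is_probably_stale("2024-01-15T20:29:50Z", now))  # False — seconds old
```

Even when the helper says "stale", still confirm no process is running before force-unlocking — the timestamp is evidence, not proof.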
Safe Lock Recovery Procedure
Finding a stale lock is an event, not an emergency. Follow this procedure every time — in order, without skipping steps. Rushing force-unlock without diagnosis is how you corrupt state.
New terms:
- terraform force-unlock LOCK_ID — removes a specific lock entry from the DynamoDB table. Requires the Lock ID — the UUID shown in the error message. Terraform asks for confirmation unless -force is passed. This is a destructive operation — only use it when you are certain the holding process is dead.
- aws dynamodb delete-item — the AWS CLI equivalent of force-unlock. Directly removes the lock entry from DynamoDB. Use this only as a last resort if terraform force-unlock itself fails. The command is identical in effect but bypasses Terraform's confirmation prompt entirely.
- terraform apply -refresh-only — after clearing a stale lock, run this to sync the state file with reality. If the crash happened mid-apply, some resources may exist in AWS but not in state, or vice versa. Refresh-only detects these discrepancies without making further changes.
# ── SAFE LOCK RECOVERY PROCEDURE ────────────────────────────────────────────
# Step 1 — Confirm no Terraform process is currently running
# Check all machines that might have started the apply
# Check your CI/CD pipeline — is any job still in progress?
pgrep terraform # Should return nothing if no local process is running
# Step 2 — Read the lock entry and note the Created timestamp
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "acme-terraform-state-123456789012/lesson-17/terraform.tfstate"}}'
# Compare Created timestamp to current time — is it old enough to be stale?
# Step 3 — Backup the current state before doing anything else
# State pull downloads the remote state as JSON — save it as a timestamped backup
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json
# Step 4 — Force unlock using the Lock ID from the error message
# The UUID from the error: ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
terraform force-unlock f47ac10b-58cc-4372-a567-0e02b2c3d479
# Step 5 — Sync state with reality after the interrupted apply
# This detects resources created before the crash that may not be in state
terraform apply -refresh-only
# Step 6 — Review the refresh-only output and accept it if correct
# Then run a normal plan to see what still needs to be applied
terraform plan
# Step 7 — Apply any remaining changes
terraform apply
# Step 3 — state backup saved
$ terraform state pull > state-backup-20240115-143022.json
# File written: state-backup-20240115-143022.json
# Step 4 — force unlock
$ terraform force-unlock f47ac10b-58cc-4372-a567-0e02b2c3d479
Do you really want to force-unlock the state?
Terraform will remove the lock on the remote state.
This will allow local Terraform commands to modify this state, even though it
may be still be in use. Only 'yes' will be accepted to confirm.
Enter a value: yes
Terraform state has been successfully unlocked!
# Step 5 — refresh-only to sync state with reality
$ terraform apply -refresh-only
Terraform detected the following changes made outside of Terraform:
# aws_s3_bucket.lesson17 has been created
+ resource "aws_s3_bucket" "lesson17" {
+ bucket = "terraform-lesson-17-lock-demo-a3f2b1c4"
}
# The bucket was created before the crash — now Terraform knows about it
Would you like to update the Terraform state to reflect these detected changes?
Enter a value: yes
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
# Step 6 — verify clean state
$ terraform plan
No changes. Your infrastructure matches the configuration.
What just happened?
- terraform force-unlock asked for confirmation. Even force-unlock has a safety prompt — you must type "yes" explicitly. This prevents accidental force-unlocks from scripts or muscle memory. The -force flag skips this prompt entirely — use it only in automation where the stale lock has already been verified.
- refresh-only detected the bucket that was created before the crash. The bucket existed in AWS but was not in the state file because the crash happened before state was written. The refresh-only plan showed it as a resource discovered outside Terraform. After accepting, the state file was updated to include the bucket — no duplicate creation, no orphaned resource.
- The final plan showed zero changes. After refresh-only, state matches reality matches configuration. The interrupted apply effectively completed — just with a manual recovery step. Both the random_id and the bucket are now properly tracked in state.
State Consistency Guarantees
Beyond locking, Terraform provides consistency through the serial number system introduced in Lesson 14. Understanding how serial numbers protect against stale reads and concurrent writes is important for any team running Terraform in CI/CD with multiple pipelines.
New terms:
- serial number — an integer in the state file that increments by 1 on every write. Terraform reads the remote serial before writing. If the remote serial is higher than the serial it read earlier — meaning another process wrote state between the read and the write — Terraform refuses the write with a serial conflict error.
- serial conflict — when Terraform tries to write state with a serial that is lower than the current remote serial. This means another process has written state more recently than the current process read it. The error prevents overwriting newer state with stale state.
- lineage mismatch — when the lineage UUID in a local state file does not match the remote state file's lineage. This means the local and remote files belong to different state histories — either the wrong state file was used or state was corrupted. Terraform refuses to use the remote state and exits with an error.
# What a serial conflict error looks like
# This happens when two pipelines run apply concurrently and both read state
# Pipeline A reads state (serial: 5), Pipeline B reads state (serial: 5)
# Pipeline A writes (serial: 5 -> 6)
# Pipeline B tries to write (serial: 5 -> 6) — but remote is now 6, not 5
# Pipeline B gets this error:
# Failed to save state: Failed to upload state: failed to upload state:
# state serial 5 does not match expected 6
# What you see if you force-push an old state file
$ terraform state push state-backup-20240115-143022.json
Failed to upload state: state serial 5 does not match the expected serial 7.
The "serial" number in the state is incorrect. This is caused by attempting to
use a state file that is outdated or that was generated out of order.
If you are sure you want to write this state, use the command:
terraform state push -force state-backup-20240115-143022.json
# -force bypasses the serial check — use only when you are certain the backup
# is the correct state to restore to, not just the most recent backup you have
What just happened?
- The serial conflict prevented stale state from overwriting newer state. Pipeline B's write was rejected because it was based on serial 5 — but the remote state had already advanced to serial 6 from Pipeline A's write. Without this check, Pipeline B's apply would have silently overwritten Pipeline A's changes, orphaning the resources Pipeline A created.
- state push -force bypasses the serial check entirely. This is the nuclear option — it overwrites remote state regardless of serial. The only legitimate use is deliberate recovery: you have confirmed the backup is the correct state to restore, and you understand that any changes made after the backup was taken will be lost. Always confirm this with your team before using -force on state push.
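The serial check can be sketched as follows. RemoteState and SerialConflict are illustrative names; the real check lives inside Terraform's backend code:

```python
# Sketch: the serial check that rejects Pipeline B's stale write.
# A write is only accepted if it is based on the current remote serial.

class SerialConflict(Exception):
    pass

class RemoteState:
    def __init__(self, serial=5):
        self.serial = serial

    def read(self):
        return self.serial

    def write(self, based_on_serial, force=False):
        # force=True models "state push -force" — it skips the check entirely
        if not force and based_on_serial != self.serial:
            raise SerialConflict(
                f"state serial {based_on_serial} does not match expected {self.serial}")
        self.serial = based_on_serial + 1

remote = RemoteState(serial=5)
a = remote.read()   # Pipeline A reads serial 5
b = remote.read()   # Pipeline B reads serial 5
remote.write(a)     # A writes: serial 5 -> 6
try:
    remote.write(b)  # B's write is based on 5, but remote is now 6 — rejected
except SerialConflict as e:
    print(e)
```

The rejected write is the whole point: without it, B's stale state would silently replace A's newer state.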
Locking in CI/CD — The Correct Pattern
CI/CD pipelines introduce a specific locking problem. Multiple PR builds run in parallel. Each build might run terraform plan. Only one of them can hold the lock at a time. If plans are taking minutes, parallel PR builds queue behind each other unnecessarily.
The solution is to use different state keys per PR and only use the production state key for applies on the main branch. Here is the pattern:
# CI/CD locking pattern — separate state per PR for plans, shared state for applies
# For pull request builds — use a PR-specific state key
# This means parallel PR plans never contend for the same lock
terraform init \
-backend-config="key=lesson-17/pr-${PULL_REQUEST_NUMBER}/terraform.tfstate" \
-reconfigure
# Plan runs against PR-specific state — no lock contention with other PRs
terraform plan -out=tfplan
# After PR merges to main — apply uses the real production state key
# Only one apply can run at a time — the lock prevents concurrent production applies
terraform init \
-backend-config="key=lesson-17/prod/terraform.tfstate" \
-reconfigure
# Re-plan against production state: a plan file created against PR state
# cannot be applied to a different state file
terraform plan -out=tfplan -input=false
terraform apply -input=false tfplan
# Clean up PR state after merge — it is no longer needed
aws s3 rm \
s3://acme-terraform-state-123456789012/lesson-17/pr-${PULL_REQUEST_NUMBER}/terraform.tfstate
What just happened?
- PR builds use ephemeral state — no lock contention between parallel builds. PR #42 plans against lesson-17/pr-42/terraform.tfstate; PR #43 plans against lesson-17/pr-43/terraform.tfstate. Both can run simultaneously with no lock conflict because they use different state files. Each PR gets its own isolated plan environment.
- Only the merge-triggered apply touches the real production state. The production state key is only used when code merges — a serialised event. Even if multiple merges happen in rapid succession, each apply queues behind the previous one's lock. This is the correct behaviour — production applies should never run concurrently.
- PR state files are cleaned up after merge. The aws s3 rm command removes the ephemeral PR state file from S3. Left uncleaned, these accumulate — one per PR, potentially hundreds over months. Add cleanup as a pipeline step that runs after a successful merge.
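The key-selection logic behind this pattern can be sketched as a small helper, assuming your CI exposes the PR number and branch name (names and layout here are illustrative):

```python
# Sketch: derive the S3 backend state key from the build context.
# "pr_number" and "branch" model CI variables such as PULL_REQUEST_NUMBER.

def state_key(project, branch, pr_number=None):
    if pr_number is not None:
        # Ephemeral, PR-specific state — parallel plans never share a lock
        return f"{project}/pr-{pr_number}/terraform.tfstate"
    if branch == "main":
        # Shared production state — applies serialise behind its lock
        return f"{project}/prod/terraform.tfstate"
    raise ValueError("non-main branches without a PR get no state key")

print(state_key("lesson-17", "feature/x", pr_number=42))
# lesson-17/pr-42/terraform.tfstate
print(state_key("lesson-17", "main"))
# lesson-17/prod/terraform.tfstate
```

The helper would feed the -backend-config="key=..." argument shown above, keeping the plan/apply key choice in one tested place instead of scattered across pipeline YAML.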
Common Mistakes
Force-unlocking without confirming the holding process is dead
Force-unlock removes the lock entry regardless of whether the process that acquired it is still running. If you force-unlock a lock held by an active apply — perhaps a slow pipeline job that appears stuck but is actually still running — you allow a second apply to run concurrently. Two concurrent applies writing to the same state file will corrupt it. Always verify the holding process is dead before force-unlocking.
Skipping state backup before recovery operations
Before any state recovery operation — force-unlock, state push, state rm, state mv — run terraform state pull > backup-$(date +%Y%m%d).json. If the recovery goes wrong, you have a restore point. State backups take three seconds. Not having one when you need it takes hours.
Using -lock=false to avoid lock contention in CI
When parallel CI builds contend for the same lock, the tempting fix is -lock=false. This disables locking entirely — allowing concurrent applies to run simultaneously against the same state file. State corruption follows eventually. The correct fix is the PR-specific state key pattern above, or serialising apply jobs in your pipeline configuration.
The seven-step recovery checklist — memorise this
1. Confirm no Terraform process is running. 2. Read the lock entry and note the Created timestamp. 3. Backup current state with terraform state pull. 4. Force-unlock with the Lock ID from the error. 5. Run terraform apply -refresh-only to sync state. 6. Run terraform plan to verify clean state. 7. Apply any remaining changes. Do not skip steps 1, 2, or 3 — those three are the difference between a safe recovery and a corrupted state file.
Practice Questions
1. When inspecting a DynamoDB lock entry to determine if it is stale, which field tells you when the lock was acquired?
2. Before any state recovery operation, which command should you run first to save a backup of the current remote state?
3. Terraform uses a ________ number that increments on every state write to detect and reject attempts to overwrite newer state with older state.
Quiz
1. How does DynamoDB prevent two Terraform processes from both believing they acquired the lock simultaneously?
2. After force-unlocking a stale lock from a crashed apply, what should you do before running another apply?
3. Multiple PR builds run parallel terraform plan jobs and they queue behind each other waiting for the lock. What is the correct fix?
Up Next · Lesson 18
Terraform State Commands
You have used state mv, state rm, state list, and state show. Lesson 18 covers all state commands in depth — including terraform import to bring existing infrastructure under Terraform management without destroying it.