Terraform Lesson 13 – Data Sources | Dataplexa
Section II · Lesson 13

Data Sources

Every Terraform configuration exists inside a larger world of infrastructure it did not create. Existing VPCs, pre-built AMIs, secrets stored by other teams, DNS zones, SSL certificates — your configuration needs to reference these things without owning them. Data sources are how Terraform reads existing infrastructure without managing it.

This lesson covers

How data sources work → Querying existing AWS infrastructure → Dynamic AMI lookup → Reading secrets from AWS Secrets Manager → The pattern that replaces every hardcoded ID

How Data Sources Work

A data source is a read-only query to a provider API. You describe what you are looking for — a VPC with a specific tag, the latest Amazon Linux AMI, a secret stored in AWS Secrets Manager — and Terraform fetches the current data during the plan phase. The result is available to any resource or output in your configuration.

Data sources have two properties that make them fundamentally different from resources. They never create, modify, or destroy anything. And they run before any resource is created — their results are available the moment planning begins.

The Analogy

A data source is like a lookup in a company directory. You are not creating a new employee — you are finding an existing one by name or department to get their contact details. The directory does not change. You read from it and use the result. That is exactly what a Terraform data source does — it queries existing infrastructure and returns attributes you can use in your own resources.

Aspect                   Resource               Data Source
Keyword                  resource               data
Creates infrastructure   Yes                    No — read only
Stored in state          Yes — full lifecycle   Cached — refreshed on plan
Reference syntax         aws_vpc.main.id        data.aws_vpc.main.id
When it runs             During apply           During plan — before resources
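
The reference syntax in the last two rows is worth seeing side by side. A minimal sketch, with a hypothetical VPC ID standing in for one owned by another team:

```hcl
# A resource we own — full lifecycle, created during apply
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# A data source we only read — resolved during plan
data "aws_vpc" "shared" {
  id = "vpc-0123456789abcdef0" # hypothetical ID of a VPC owned elsewhere
}

# Resources are referenced as <type>.<name>; data sources add the data. prefix
output "owned_vpc_id" {
  value = aws_vpc.main.id
}

output "shared_vpc_cidr" {
  value = data.aws_vpc.shared.cidr_block
}
```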

Setting Up

Create a fresh project. This lesson queries several live AWS data sources — an existing VPC, the latest Amazon Linux 2 AMI, the region's availability zones, and a secret from Secrets Manager — so every example runs against real AWS.

mkdir terraform-lesson-13
cd terraform-lesson-13
touch versions.tf variables.tf data.tf locals.tf main.tf outputs.tf .gitignore

Add this to versions.tf:

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region

  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
    }
  }
}

Add this to variables.tf:

variable "region" {
  description = "AWS region for all resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "dev"
}

variable "vpc_name_tag" {
  description = "Name tag of the existing VPC to deploy into — must exist before this config runs"
  type        = string
  default     = "vpc-dev"  # The VPC created by the networking project in Lesson 12
}

Run terraform init and continue building out data.tf below.

Querying Existing AWS Infrastructure

The most common use of data sources is looking up infrastructure that already exists — a VPC created by another team, subnets tagged by environment, a security group managed separately. You reference these by their attributes — tags, IDs, names — and Terraform fetches the current values from AWS.

We are writing data.tf — a dedicated file for all data source declarations. Keeping data sources separate from resources makes it immediately obvious what your configuration reads versus what it owns.

New terms:

  • data block — declares a data source query. Syntax is data "provider_type" "local_name" { ... }. The provider type identifies what to query. The local name is how you reference the result. Arguments inside the block filter the query — like WHERE clauses in SQL.
  • filter block — used inside AWS data sources to narrow results by resource attributes. The name argument is the AWS filter key — the same keys used in the AWS CLI --filters parameter. The values argument is a list of values to match. Multiple filters are ANDed — all must match.
  • aws_vpc data source — queries an existing VPC by its attributes. Can filter by ID, tags, CIDR block, or state. Returns every attribute the VPC has — id, cidr_block, owner_id, enable_dns_hostnames, and more.
  • aws_subnets data source — returns a list of subnet IDs matching the filters. Note the plural — aws_subnets returns multiple results as a list. aws_subnet (singular) expects exactly one result and fails if multiple subnets match.
  • aws_availability_zones data source — returns the list of availability zones available in the current region. Using this data source instead of hardcoding AZ names makes your configuration portable across regions — every region has different AZ names.

Add this to data.tf:

# ── EXISTING VPC ─────────────────────────────────────────────────────────────

# Look up an existing VPC by its Name tag
# This VPC was created by the networking project — we do not own it, we reference it
data "aws_vpc" "existing" {
  # filter narrows the query — like a WHERE clause
  # tag:Name is the AWS filter key for the Name tag
  filter {
    name   = "tag:Name"
    values = [var.vpc_name_tag]  # Match the VPC whose Name tag equals this variable
  }
}

# ── SUBNETS IN THE EXISTING VPC ───────────────────────────────────────────────

# Look up all public subnets in the existing VPC
# aws_subnets (plural) returns a list of IDs matching all filters
data "aws_subnets" "public" {
  # Filter 1 — only subnets in the VPC we found above
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]  # Reference the data source result
  }

  # Filter 2 — only subnets tagged as public tier
  # Both filters must match — they are ANDed together
  filter {
    name   = "tag:Tier"
    values = ["public"]
  }
}

# ── AVAILABILITY ZONES ────────────────────────────────────────────────────────

# Returns all available AZs in the current region
# state = "available" excludes AZs that are temporarily unavailable or being deprecated
data "aws_availability_zones" "available" {
  state = "available"
}

# ── CURRENT AWS ACCOUNT ───────────────────────────────────────────────────────

# Returns information about the AWS account Terraform is running as
# Useful for building ARNs and for confirming you are deploying to the right account
data "aws_caller_identity" "current" {}

# ── CURRENT AWS REGION ────────────────────────────────────────────────────────

# Returns the current region — useful when you need the region name in a string
# but do not want to repeat the variable reference everywhere
data "aws_region" "current" {}
Run a plan to watch the data sources resolve:

$ terraform plan

data.aws_caller_identity.current: Reading...
data.aws_region.current: Reading...
data.aws_availability_zones.available: Reading...
data.aws_caller_identity.current: Read complete after 0s [id=123456789012]
data.aws_region.current: Read complete after 0s [id=us-east-1]
data.aws_availability_zones.available: Read complete after 1s [id=us-east-1]
data.aws_vpc.existing: Reading...
data.aws_vpc.existing: Read complete after 1s [id=vpc-0abc123def]
data.aws_subnets.public: Reading...
data.aws_subnets.public: Read complete after 1s [id=us-east-1]

No changes. Your infrastructure matches the configuration.

What just happened?

  • All five data sources ran during plan — before any resources. Every data.*: Reading... line appears at the top of the plan output, before Terraform evaluates any resource blocks. Data source results are available for the entire planning phase.
  • aws_subnets filtered by two criteria simultaneously. The first filter matched only subnets in the VPC returned by data.aws_vpc.existing — note the reference data.aws_vpc.existing.id. Data sources can reference other data sources. The second filter restricted results to subnets tagged Tier = "public". Both must match.
  • aws_caller_identity and aws_region need no arguments. They return facts about the current execution context — who is running Terraform and in which region. Both complete almost instantly: aws_caller_identity is a single STS GetCallerIdentity call, and aws_region simply reports the region the provider was configured with, without any API call at all.
  • "No changes" — data sources do not create infrastructure. Even after reading five real AWS resources, the plan shows zero changes. Data sources are observers — they never appear in the add/change/destroy counts.
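
Those context data sources become useful the moment you need region- or account-qualified values. A sketch of two common patterns — the log-group path here is purely illustrative:

```hcl
locals {
  # First two AZ names — portable across regions, no hardcoded "us-east-1a"
  first_two_azs = slice(data.aws_availability_zones.available.names, 0, 2)

  # Build an ARN from account and region facts instead of hardcoding them
  app_log_group_arn = "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/app/${var.environment}"
}
```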

Dynamic AMI Lookup — Never Hardcode an AMI ID Again

Every lesson so far has used ami-0c55b159cbfafe1f0 — a hardcoded Amazon Linux 2 AMI ID valid only in us-east-1. This breaks the moment someone deploys to a different region. It also goes stale — Amazon releases new AMIs with security patches and the old ID keeps pointing at an outdated image.

The aws_ami data source queries the AMI catalogue and returns the latest matching image. Your configuration always gets the current AMI for the region it is deploying into — no hardcoded IDs, no stale images.

New terms:

  • aws_ami data source — queries the EC2 AMI catalogue. Filters by owner, name pattern, architecture, virtualisation type, and more. Returns the AMI matching all filters — if multiple match, most_recent = true returns the newest one.
  • most_recent = true — when multiple AMIs match the filters, return the one with the most recent creation date. Always use this when looking up AMIs by name pattern — AWS regularly releases new versions and you want the current one.
  • owners — a list of AWS account IDs or aliases that published the AMI. ["amazon"] restricts results to AMIs published by Amazon — preventing accidentally using a third-party AMI with the same name pattern. For custom AMIs built by your organisation, use your own account ID.
  • name filter with wildcard — AMI names follow a pattern like amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2. The * wildcard matches any version in the name — amzn2-ami-hvm-*-x86_64-gp2 matches all Amazon Linux 2 HVM AMIs for x86_64 with gp2 storage.

Add this to data.tf:

# ── DYNAMIC AMI LOOKUP ────────────────────────────────────────────────────────

# Look up the latest Amazon Linux 2 AMI for the current region
# This replaces every hardcoded ami-* ID in the configuration
data "aws_ami" "amazon_linux_2" {
  most_recent = true        # When multiple AMIs match, return the newest
  owners      = ["amazon"]  # Only trust AMIs published by Amazon

  # Filter by the AMI name pattern for Amazon Linux 2 HVM x86_64 with gp2 storage
  # The * wildcard matches the version number that changes with each release
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  # Only return machine images — not kernels, ramdisks, or snapshots
  filter {
    name   = "image-type"
    values = ["machine"]
  }

  # HVM virtualisation — required for all modern instance types
  # paravirtual is legacy and not supported by most current instance families
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Look up the latest Ubuntu 22.04 LTS AMI — Canonical's account ID is 099720109477
data "aws_ami" "ubuntu_22" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical — the publisher of Ubuntu AMIs on AWS

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
Inspect the resolved values in the Terraform console:

$ terraform console

# The AMI data source resolved to current IDs — never hardcode these
> data.aws_ami.amazon_linux_2.id
"ami-0c55b159cbfafe1f0"

> data.aws_ami.amazon_linux_2.name
"amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2"

> data.aws_ami.amazon_linux_2.creation_date
"2023-11-17T00:49:18.000Z"

# Ubuntu AMI resolved to its current ID in us-east-1
> data.aws_ami.ubuntu_22.id
"ami-0fc5d935ebf8bc3bc"

> data.aws_ami.ubuntu_22.name
"ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20231128"

# Deploy to eu-west-1 and you get different IDs — same AMI, different region
# terraform plan -var="region=eu-west-1" would resolve different IDs automatically

What just happened?

  • The AMI ID resolved dynamically for the current region. In us-east-1 it returned ami-0c55b159cbfafe1f0 — the same ID we have been hardcoding all course. But run this in eu-west-1 and you get a completely different ID that correctly points to Amazon Linux 2 in that region. The configuration is now region-portable.
  • The name and creation_date attributes confirm which AMI was selected. Always inspect these after first running the data source — confirm it resolved to what you expected. The name shows the exact version string and the creation date confirms it is recent. If the name does not match your expectation, your filter pattern needs adjustment.
  • Ubuntu AMIs are owned by Canonical — account ID 099720109477. This account ID is Canonical's official AWS account and is the same in every region. Never use Ubuntu AMIs published by any other account — they could be modified or malicious. For any AMI that is not Amazon's own, verify the owner account ID against the publisher's official documentation.
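
One way to put both lookups to work is a variable-driven OS switch. A hypothetical sketch — the os variable is not part of this lesson's variables.tf:

```hcl
variable "os" {
  description = "Which OS image to use — amazon-linux-2 or ubuntu"
  type        = string
  default     = "amazon-linux-2"
}

locals {
  # Resolve the AMI from whichever lookup matches the requested OS
  selected_ami = var.os == "ubuntu" ? data.aws_ami.ubuntu_22.id : data.aws_ami.amazon_linux_2.id
}
```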

Reading Secrets from AWS Secrets Manager

Database passwords, API keys, and third-party credentials should never live in terraform.tfvars or environment variables. They belong in a secrets manager — AWS Secrets Manager stores them encrypted and controls who can read them via IAM policies. Terraform reads them at plan time using a data source.

This is the professional pattern: secrets live in Secrets Manager, Terraform reads them via data source, resources receive the secret values — and no secret ever touches a file that could be committed to Git.

New terms:

  • aws_secretsmanager_secret data source — looks up an existing secret in AWS Secrets Manager by name or ARN. Returns metadata about the secret — its ARN, description, tags — but not the secret value itself.
  • aws_secretsmanager_secret_version data source — reads the actual value of a secret version. The secret_id argument accepts either the secret name or its ARN. The secret_string attribute contains the plaintext secret value.
  • jsondecode() function — parses a JSON string and returns a Terraform object. Secrets Manager stores credentials as JSON strings — {"username":"admin","password":"secret"}. Using jsondecode() parses the JSON so you can access individual fields with dot notation.
  • sensitive value propagation — the AWS provider marks secret_string as sensitive, and any local or expression derived from it inherits that sensitivity automatically, so derived values are never printed to the terminal. When a derived field is safe to display, such as a username, unwrap it explicitly with the nonsensitive() function.
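
jsondecode() is easy to try in isolation. A sketch using a hypothetical JSON literal, not the real secret:

```hcl
locals {
  # A stand-in JSON string — in the real config this comes from secret_string
  example      = jsondecode("{\"username\":\"admin\",\"password\":\"secret\"}")
  example_user = local.example.username # "admin"
}
```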

First, create the secret in AWS (run this once — not in Terraform, just in the AWS CLI):

# Create the secret in Secrets Manager — run this once from your terminal
# This is setup, not Terraform code — you would normally do this in a bootstrap process
aws secretsmanager create-secret \
  --name "acme/dev/db-credentials" \
  --description "Database credentials for the dev environment" \
  --secret-string '{"username":"dbadmin","password":"SuperSecretP@ssw0rd123!"}'

Now add this to data.tf:

# ── SECRETS MANAGER ──────────────────────────────────────────────────────────

# Step 1 — look up the secret metadata by name
# This returns the ARN and description but NOT the value
data "aws_secretsmanager_secret" "db_credentials" {
  name = "acme/${var.environment}/db-credentials"  # Path-based naming by environment
}

# Step 2 — read the current version of the secret value
# secret_string contains the plaintext JSON — always marked sensitive by Terraform
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = data.aws_secretsmanager_secret.db_credentials.id  # Reference step 1
}

Now add this to locals.tf — parse the JSON secret into usable fields:

# Parse the JSON secret string into individual fields
# jsondecode() converts '{"username":"dbadmin","password":"..."}' into a Terraform object
# The decoded object inherits sensitivity from secret_string — nothing here is
# printed to the terminal unless explicitly unwrapped with nonsensitive()
locals {
  db_credentials = jsondecode(
    data.aws_secretsmanager_secret_version.db_credentials.secret_string
  )

  # Individual fields accessed with dot notation after decoding
  db_username = nonsensitive(local.db_credentials.username)  # "dbadmin" — safe to display
  db_password = local.db_credentials.password                # The actual password — never printed
}
Run a plan to fetch and use the secret:

$ terraform plan

data.aws_secretsmanager_secret.db_credentials: Reading...
data.aws_secretsmanager_secret.db_credentials: Read complete after 0s
data.aws_secretsmanager_secret_version.db_credentials: Reading...
data.aws_secretsmanager_secret_version.db_credentials: Read complete after 0s

  # aws_db_instance.main will be created
  + resource "aws_db_instance" "main" {
      + username = "dbadmin"
      + password = (sensitive value)  # Secret — never printed
      + ...
    }

# In terraform console — sensitive values are protected
$ terraform console
> local.db_username
"dbadmin"

> local.db_password
(sensitive value)

What just happened?

  • The secret was fetched from Secrets Manager and parsed in one step. The two-stage lookup — first the secret metadata, then the version — is how Secrets Manager is designed. The metadata data source returns the stable ARN. The version data source uses that ARN to get the current value. If the secret is rotated in Secrets Manager, the next terraform plan automatically picks up the new value.
  • jsondecode() turned the JSON string into an addressable object. The raw secret_string is {"username":"dbadmin","password":"..."} — a JSON string. After jsondecode(), it becomes a Terraform object where local.db_credentials.username and local.db_credentials.password are valid expressions. This pattern works for any structured secret regardless of how many fields it contains.
  • local.db_password printed as (sensitive value) in the console. The secret_string attribute on a Secrets Manager data source is automatically marked sensitive by the AWS provider, and any local that references it inherits that sensitivity — so the password is redacted in plan output, apply output, and the console. One caveat: sensitive marking only redacts terminal and log output. The value is still stored in plaintext in the state file, which is why state must be encrypted at rest and access-controlled.

Putting It All Together — main.tf

Every data source from this lesson used in a real main.tf. Zero hardcoded IDs. The AMI is looked up dynamically. The VPC and subnets come from the existing networking layer. The database password comes from Secrets Manager. This is what production Terraform looks like.

# Security group — lives in the existing VPC from the data source
resource "aws_security_group" "app" {
  name        = "app-sg-${var.environment}"
  description = "Application security group"
  vpc_id      = data.aws_vpc.existing.id  # VPC ID from data source — not hardcoded

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "All outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 instance — AMI and subnet from data sources, never hardcoded
resource "aws_instance" "app" {
  # Dynamic AMI — always the latest Amazon Linux 2 for the current region
  ami = data.aws_ami.amazon_linux_2.id

  instance_type = "t2.micro"

  # First subnet from the list returned by aws_subnets data source
  subnet_id = data.aws_subnets.public.ids[0]

  vpc_security_group_ids = [aws_security_group.app.id]

  tags = {
    Name      = "app-${var.environment}"
    AmiName   = data.aws_ami.amazon_linux_2.name  # Track which AMI version is running
    AccountId = data.aws_caller_identity.current.account_id  # Tag with AWS account ID
  }

  lifecycle {
    create_before_destroy = true

    # The AMI ID will change every time AWS releases a new version
    # ignore_changes prevents Terraform from replacing the instance on every plan
    # Instances are patched in-place — not replaced on every AMI update
    ignore_changes = [ami]
  }
}

# RDS database — password comes from Secrets Manager via data source
resource "aws_db_instance" "main" {
  identifier        = "appdb-${var.environment}"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.t3.micro"
  allocated_storage = 20

  db_name  = "appdb"
  username = local.db_username  # Parsed from Secrets Manager JSON
  password = local.db_password  # Sensitive — never printed in any output

  # Public subnets are used here for simplicity — in production put RDS in private subnets
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.app.id]

  skip_final_snapshot = true  # For dev — always false in production

  tags = {
    Name = "appdb-${var.environment}"
  }
}

# Subnet group for RDS — requires at least two subnets in different AZs
resource "aws_db_subnet_group" "main" {
  name        = "appdb-subnet-group-${var.environment}"
  description = "Subnet group for RDS in ${var.environment}"

  # Pass all public subnet IDs from the data source
  subnet_ids = data.aws_subnets.public.ids

  tags = {
    Name = "appdb-subnet-group-${var.environment}"
  }
}
Run the full plan:

$ terraform plan

data.aws_caller_identity.current: Reading... [complete after 0s]
data.aws_region.current: Reading... [complete after 0s]
data.aws_availability_zones.available: Reading... [complete after 1s]
data.aws_ami.amazon_linux_2: Reading... [complete after 1s]
data.aws_ami.ubuntu_22: Reading... [complete after 1s]
data.aws_vpc.existing: Reading... [complete after 1s]
data.aws_subnets.public: Reading... [complete after 1s]
data.aws_secretsmanager_secret.db_credentials: Reading... [complete after 0s]
data.aws_secretsmanager_secret_version.db_credentials: Reading... [complete after 0s]

Terraform will perform the following actions:

  # aws_security_group.app will be created
  + resource "aws_security_group" "app" {
      + name   = "app-sg-dev"
      + vpc_id = "vpc-0abc123def"    # From data source — not hardcoded
    }

  # aws_instance.app will be created
  + resource "aws_instance" "app" {
      + ami           = "ami-0c55b159cbfafe1f0"   # Resolved from data source
      + instance_type = "t2.micro"
      + subnet_id     = "subnet-0aaa111bbb"        # From data source
      + tags = {
          + "AccountId" = "123456789012"
          + "AmiName"   = "amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2"
          + "Name"      = "app-dev"
        }
    }

  # aws_db_instance.main will be created
  + resource "aws_db_instance" "main" {
      + identifier = "appdb-dev"
      + username   = "dbadmin"
      + password   = (sensitive value)   # Redacted — came from Secrets Manager
      + engine     = "postgres"
    }

  # aws_db_subnet_group.main will be created
  + resource "aws_db_subnet_group" "main" {
      + subnet_ids = [
          "subnet-0aaa111bbb",
          "subnet-0ccc333ddd",
        ]
    }

Plan: 4 to add, 0 to change, 0 to destroy.

What just happened?

  • Zero hardcoded IDs anywhere in main.tf. The VPC ID, subnet IDs, AMI ID, account ID, and database password all came from data sources. Change the region variable and the AMI ID resolves to the correct Amazon Linux 2 for the new region automatically. Change the environment variable and the Secrets Manager path resolves to the correct credentials for that environment.
  • The AmiName tag tracks which AMI version is running. data.aws_ami.amazon_linux_2.name resolves to the full version string — amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2. This tag on the instance means you can query your AWS account later and immediately know which AMI version every instance is running — useful for security audits and patch compliance reporting.
  • The database password showed as (sensitive value) throughout. From Secrets Manager through jsondecode() through the local value through the aws_db_instance resource — sensitivity propagated the entire chain. The password was never printed at any point, including in the plan output where it shows as (sensitive value).
  • aws_db_subnet_group received all public subnet IDs as a list. data.aws_subnets.public.ids is a list of all matching subnet IDs. The subnet group accepted the entire list — both subnet-0aaa111bbb and subnet-0ccc333ddd. RDS requires subnets in at least two availability zones and the data source returned both.
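
The outputs.tf created during setup has stayed empty so far. A minimal sketch of outputs that surface the resolved values — the output names are illustrative:

```hcl
output "vpc_id" {
  description = "ID of the existing VPC we deployed into"
  value       = data.aws_vpc.existing.id
}

output "resolved_ami" {
  description = "AMI ID and name resolved for this region"
  value = {
    id   = data.aws_ami.amazon_linux_2.id
    name = data.aws_ami.amazon_linux_2.name
  }
}

output "account_id" {
  description = "AWS account Terraform deployed into"
  value       = data.aws_caller_identity.current.account_id
}
```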

Common Mistakes

Using aws_subnet (singular) when multiple subnets match the filter

aws_subnet expects exactly one result. If your filter matches two subnets, the plan fails with an error along the lines of "multiple EC2 Subnets matched; use additional constraints to reduce matches to a single EC2 Subnet". Use aws_subnets (plural) when you want a list of all matching subnets, and aws_subnet (singular) only when your filter is specific enough to return exactly one result.
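
When you genuinely need a single subnet, tighten the filters until only one can match. A sketch that pins the lookup to one availability zone — this assumes the networking layer creates exactly one public subnet per AZ:

```hcl
# Singular lookup — safe only because the AZ filter guarantees a single match
data "aws_subnet" "public_a" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]
  }

  filter {
    name   = "availability-zone"
    values = [data.aws_availability_zones.available.names[0]] # pin to a single AZ
  }

  filter {
    name   = "tag:Tier"
    values = ["public"]
  }
}
```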

Not setting most_recent = true on AMI data sources

Without most_recent = true, if multiple AMIs match your filters Terraform throws an error rather than picking one. And even if only one matches today, AWS releases new AMIs regularly — your filter will eventually match multiple results and break your plan. Always set most_recent = true when looking up AMIs by name pattern.

Not adding ignore_changes = [ami] when using dynamic AMI lookup

Every time AWS releases a new Amazon Linux 2 AMI, the data source resolves to a new ID. On the next terraform plan, Terraform sees that the running instance uses the old AMI and plans a -/+ replace. If you patch instances in-place rather than rebuilding them, this causes unintended replacements. Add ignore_changes = [ami] to the lifecycle block unless your process intentionally rebuilds instances on every AMI update.

Keep data sources in data.tf

Separating data sources into their own data.tf file is not a Terraform requirement but it is a convention worth adopting. When you open a project, a data.tf file tells you immediately what the configuration reads from the outside world — what it depends on but does not own. This distinction between "what we read" and "what we own" is one of the most important things to make explicit in any non-trivial Terraform project.

Practice Questions

1. You declare a data source: data "aws_ami" "amazon_linux_2" {...}. What is the correct expression to reference its id attribute in a resource block?



2. When multiple AMIs match your aws_ami filter, which argument tells Terraform to return the newest one instead of throwing an error?



3. A Secrets Manager secret contains JSON: {"username":"admin","password":"secret"}. Which Terraform function converts that string into an object so you can access .username and .password?



Quiz

1. When does Terraform execute data source queries?


2. What is the difference between aws_subnet and aws_subnets?


3. You use aws_ami to dynamically look up the latest AMI. On the next plan after AWS releases a new AMI, Terraform shows -/+ replace on all your instances. How do you prevent this?


Up Next · Lesson 14

Terraform State Explained

You have used state in every lesson. Lesson 14 opens it up — what is actually inside the state file, how Terraform uses it to detect drift, and what happens when state and reality diverge.