Terraform Course
Data Sources
Every Terraform configuration exists inside a larger world of infrastructure it did not create. Existing VPCs, pre-built AMIs, secrets stored by other teams, DNS zones, SSL certificates — your configuration needs to reference these things without owning them. Data sources are how Terraform reads existing infrastructure without managing it.
This lesson covers
How data sources work → Querying existing AWS infrastructure → Dynamic AMI lookup → Reading secrets from AWS Secrets Manager → The pattern that replaces every hardcoded ID
How Data Sources Work
A data source is a read-only query to a provider API. You describe what you are looking for — a VPC with a specific tag, the latest Amazon Linux AMI, a secret stored in AWS Secrets Manager — and Terraform fetches the current data during the plan phase. The result is available to any resource or output in your configuration.
Data sources have two properties that make them fundamentally different from resources. They never create, modify, or destroy anything. And they run before any resource is created — their results are available the moment planning begins.
The Analogy
A data source is like a lookup in a company directory. You are not creating a new employee — you are finding an existing one by name or department to get their contact details. The directory does not change. You read from it and use the result. That is exactly what a Terraform data source does — it queries existing infrastructure and returns attributes you can use in your own resources.
| Aspect | Resource | Data Source |
|---|---|---|
| Keyword | `resource` | `data` |
| Creates infrastructure | Yes | No — read only |
| Stored in state | Yes — full lifecycle | Cached — refreshed on plan |
| Reference syntax | `aws_vpc.main.id` | `data.aws_vpc.main.id` |
| When it runs | During apply | During plan — before resources |
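The reference-syntax row matters in practice: a resource and a data source can share a type and name yet remain distinct addresses, distinguished only by the `data.` prefix. A minimal sketch (the `main` and `app` names here are illustrative, not part of this lesson's files):

```hcl
# A data source query — reads an existing VPC, creates nothing
data "aws_vpc" "main" {
  default = true # Look up the account's default VPC
}

# A resource — Terraform owns this subnet's full lifecycle
resource "aws_subnet" "app" {
  # Data source attributes are referenced with the data. prefix
  vpc_id     = data.aws_vpc.main.id
  cidr_block = "172.31.200.0/24" # Illustrative — must fall inside the VPC's CIDR
}
```

Drop the `data.` prefix and Terraform looks for a resource named `aws_vpc.main` instead — a common source of "resource not found" errors.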
Setting Up
Create a fresh project. This lesson queries several real AWS data sources — an existing VPC, the latest Amazon Linux 2 AMI, availability zones, and a secret from Secrets Manager. Every example in this lesson runs against real AWS.
mkdir terraform-lesson-13
cd terraform-lesson-13
touch versions.tf variables.tf data.tf locals.tf main.tf outputs.tf .gitignore
Add this to versions.tf:
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.region
default_tags {
tags = {
ManagedBy = "Terraform"
Environment = var.environment
}
}
}
Add this to variables.tf:
variable "region" {
description = "AWS region for all resources"
type = string
default = "us-east-1"
}
variable "environment" {
description = "Deployment environment"
type = string
default = "dev"
}
variable "vpc_name_tag" {
description = "Name tag of the existing VPC to deploy into — must exist before this config runs"
type = string
default = "vpc-dev" # The VPC created by the networking project in Lesson 12
}
Run terraform init and continue building out data.tf below.
Querying Existing AWS Infrastructure
The most common use of data sources is looking up infrastructure that already exists — a VPC created by another team, subnets tagged by environment, a security group managed separately. You reference these by their attributes — tags, IDs, names — and Terraform fetches the current values from AWS.
We are writing data.tf — a dedicated file for all data source declarations. Keeping data sources separate from resources makes it immediately obvious what your configuration reads versus what it owns.
New terms:
- data block — declares a data source query. Syntax is `data "provider_type" "local_name" { ... }`. The provider type identifies what to query. The local name is how you reference the result. Arguments inside the block filter the query — like WHERE clauses in SQL.
- filter block — used inside AWS data sources to narrow results by resource attributes. The `name` argument is the AWS filter key — the same keys used in the AWS CLI `--filters` parameter. The `values` argument is a list of values to match. Multiple filters are ANDed — all must match.
- aws_vpc data source — queries an existing VPC by its attributes. Can filter by ID, tags, CIDR block, or state. Returns every attribute the VPC has — id, cidr_block, owner_id, enable_dns_hostnames, and more.
- aws_subnets data source — returns a list of subnet IDs matching the filters. Note the plural — `aws_subnets` returns multiple results as a list. `aws_subnet` (singular) expects exactly one result and fails if multiple subnets match.
- aws_availability_zones data source — returns the list of availability zones available in the current region. Using this data source instead of hardcoding AZ names makes your configuration portable across regions — every region has different AZ names.
Add this to data.tf:
# ── EXISTING VPC ─────────────────────────────────────────────────────────────
# Look up an existing VPC by its Name tag
# This VPC was created by the networking project — we do not own it, we reference it
data "aws_vpc" "existing" {
# filter narrows the query — like a WHERE clause
# tag:Name is the AWS filter key for the Name tag
filter {
name = "tag:Name"
values = [var.vpc_name_tag] # Match the VPC whose Name tag equals this variable
}
}
# ── SUBNETS IN THE EXISTING VPC ───────────────────────────────────────────────
# Look up all public subnets in the existing VPC
# aws_subnets (plural) returns a list of IDs matching all filters
data "aws_subnets" "public" {
# Filter 1 — only subnets in the VPC we found above
filter {
name = "vpc-id"
values = [data.aws_vpc.existing.id] # Reference the data source result
}
# Filter 2 — only subnets tagged as public tier
# Both filters must match — they are ANDed together
filter {
name = "tag:Tier"
values = ["public"]
}
}
# ── AVAILABILITY ZONES ────────────────────────────────────────────────────────
# Returns all available AZs in the current region
# state = "available" excludes AZs that are temporarily unavailable or being deprecated
data "aws_availability_zones" "available" {
state = "available"
}
# ── CURRENT AWS ACCOUNT ───────────────────────────────────────────────────────
# Returns information about the AWS account Terraform is running as
# Useful for building ARNs and for confirming you are deploying to the right account
data "aws_caller_identity" "current" {}
# ── CURRENT AWS REGION ────────────────────────────────────────────────────────
# Returns the current region — useful when you need the region name in a string
# but do not want to repeat the variable reference everywhere
data "aws_region" "current" {}
$ terraform plan

data.aws_caller_identity.current: Reading...
data.aws_region.current: Reading...
data.aws_availability_zones.available: Reading...
data.aws_caller_identity.current: Read complete after 0s [id=123456789012]
data.aws_region.current: Read complete after 0s [id=us-east-1]
data.aws_availability_zones.available: Read complete after 1s [id=us-east-1]
data.aws_vpc.existing: Reading...
data.aws_vpc.existing: Read complete after 1s [id=vpc-0abc123def]
data.aws_subnets.public: Reading...
data.aws_subnets.public: Read complete after 1s [id=us-east-1]

No changes. Your infrastructure matches the configuration.
What just happened?
- All five data sources ran during plan — before any resources. Every `data.*: Reading...` line appears at the top of the plan output, before Terraform evaluates any resource blocks. Data source results are available for the entire planning phase.
- aws_subnets filtered by two criteria simultaneously. The first filter matched only subnets in the VPC returned by `data.aws_vpc.existing` — note the reference `data.aws_vpc.existing.id`. Data sources can reference other data sources. The second filter restricted results to subnets tagged `Tier = "public"`. Both must match.
- aws_caller_identity and aws_region need no arguments. They return facts about the current execution context — who is running Terraform and in which region. Both complete in under 1 second because they are simple STS and metadata API calls with no filtering logic.
- "No changes" — data sources do not create infrastructure. Even after reading five real AWS resources, the plan shows zero changes. Data sources are observers — they never appear in the add/change/destroy counts.
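The context data sources pay off when building ARNs or iterating over zones. A hedged sketch of how the results might be used — the locals below are illustrative, not part of this lesson's files:

```hcl
locals {
  # First two AZ names in the current region, e.g. us-east-1a and us-east-1b
  # slice() is portable — no region's AZ names are hardcoded
  subnet_azs = slice(data.aws_availability_zones.available.names, 0, 2)

  # Build an ARN without hardcoding the account ID or region
  # The log group path here is a made-up example
  log_group_arn = "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/app/${var.environment}"
}
```

Any configuration that interpolates `account_id` and the region like this deploys unchanged to another account or region — the ARNs simply resolve differently.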
Dynamic AMI Lookup — Never Hardcode an AMI ID Again
Every lesson so far has used ami-0c55b159cbfafe1f0 — a hardcoded Amazon Linux 2 AMI ID valid only in us-east-1. This breaks the moment someone deploys to a different region. It also goes stale — Amazon releases new AMIs with security patches and the old ID keeps pointing at an outdated image.
The aws_ami data source queries the AMI catalogue and returns the latest matching image. Your configuration always gets the current AMI for the region it is deploying into — no hardcoded IDs, no stale images.
New terms:
- aws_ami data source — queries the EC2 AMI catalogue. Filters by owner, name pattern, architecture, virtualisation type, and more. Returns the AMI matching all filters — if multiple match, `most_recent = true` returns the newest one.
- most_recent = true — when multiple AMIs match the filters, return the one with the most recent creation date. Always use this when looking up AMIs by name pattern — AWS regularly releases new versions and you want the current one.
- owners — a list of AWS account IDs or aliases that published the AMI. `["amazon"]` restricts results to AMIs published by Amazon — preventing accidentally using a third-party AMI with the same name pattern. For custom AMIs built by your organisation, use your own account ID.
- name filter with wildcard — AMI names follow a pattern like `amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2`. The `*` wildcard matches any version in the name — `amzn2-ami-hvm-*-x86_64-gp2` matches all Amazon Linux 2 HVM AMIs for x86_64 with gp2 storage.
Add this to data.tf:
# ── DYNAMIC AMI LOOKUP ────────────────────────────────────────────────────────
# Look up the latest Amazon Linux 2 AMI for the current region
# This replaces every hardcoded ami-* ID in the configuration
data "aws_ami" "amazon_linux_2" {
most_recent = true # When multiple AMIs match, return the newest
owners = ["amazon"] # Only trust AMIs published by Amazon
# Filter by the AMI name pattern for Amazon Linux 2 HVM x86_64 with gp2 storage
# The * wildcard matches the version number that changes with each release
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
# Only return machine images — not kernels, ramdisks, or snapshots
filter {
name = "image-type"
values = ["machine"]
}
# HVM virtualisation — required for all modern instance types
# paravirtual is legacy and not supported by most current instance families
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
# Look up the latest Ubuntu 22.04 LTS AMI — Canonical's account ID is 099720109477
data "aws_ami" "ubuntu_22" {
most_recent = true
owners = ["099720109477"] # Canonical — the publisher of Ubuntu AMIs on AWS
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
$ terraform console

# The AMI data source resolved to current IDs — never hardcode these
> data.aws_ami.amazon_linux_2.id
"ami-0c55b159cbfafe1f0"
> data.aws_ami.amazon_linux_2.name
"amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2"
> data.aws_ami.amazon_linux_2.creation_date
"2023-11-17T00:49:18.000Z"

# Ubuntu AMI resolved to its current ID in us-east-1
> data.aws_ami.ubuntu_22.id
"ami-0fc5d935ebf8bc3bc"
> data.aws_ami.ubuntu_22.name
"ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20231128"

# Deploy to eu-west-1 and you get different IDs — same AMI, different region
# terraform plan -var="region=eu-west-1" would resolve different IDs automatically
What just happened?
- The AMI ID resolved dynamically for the current region. In `us-east-1` it returned `ami-0c55b159cbfafe1f0` — the same ID we have been hardcoding all course. But run this in `eu-west-1` and you get a completely different ID that correctly points to Amazon Linux 2 in that region. The configuration is now region-portable.
- The name and creation_date attributes confirm which AMI was selected. Always inspect these after first running the data source — confirm it resolved to what you expected. The name shows the exact version string and the creation date confirms it is recent. If the name does not match your expectation, your filter pattern needs adjustment.
- Ubuntu AMIs are owned by Canonical — account ID 099720109477. This account ID is Canonical's official AWS account and is the same in every region. Never use Ubuntu AMIs published by any other account — they could be modified or malicious. For any AMI that is not Amazon's own, verify the owner account ID against the publisher's official documentation.
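The same pattern adapts to other architectures. A sketch for Graviton (arm64) instance families — the name pattern follows Amazon's published naming convention, but as always, verify the resolved `name` attribute in `terraform console` before relying on it:

```hcl
# Latest Amazon Linux 2 AMI for arm64 — Graviton instance families (t4g, m7g, ...)
data "aws_ami" "amazon_linux_2_arm" {
  most_recent = true
  owners      = ["amazon"] # Only trust AMIs published by Amazon

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-arm64-gp2"] # arm64 in place of x86_64
  }

  # Belt and braces — also filter on the architecture attribute itself
  filter {
    name   = "architecture"
    values = ["arm64"]
  }
}
```

Pairing an x86_64 AMI with an arm64 instance type (or vice versa) fails only at apply time, so filtering on `architecture` catches mismatched name patterns early.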
Reading Secrets from AWS Secrets Manager
Database passwords, API keys, and third-party credentials should never live in terraform.tfvars or environment variables. They belong in a secrets manager — AWS Secrets Manager stores them encrypted and controls who can read them via IAM policies. Terraform reads them at plan time using a data source.
This is the professional pattern: secrets live in Secrets Manager, Terraform reads them via data source, resources receive the secret values — and no secret ever touches a file that could be committed to Git.
New terms:
- aws_secretsmanager_secret data source — looks up an existing secret in AWS Secrets Manager by name or ARN. Returns metadata about the secret — its ARN, description, tags — but not the secret value itself.
- aws_secretsmanager_secret_version data source — reads the actual value of a secret version. The `secret_id` argument accepts either the secret name or its ARN. The `secret_string` attribute contains the plaintext secret value.
- jsondecode() function — parses a JSON string and returns a Terraform object. Secrets Manager stores credentials as JSON strings — `{"username":"admin","password":"secret"}`. Using `jsondecode()` parses the JSON so you can access individual fields with dot notation.
- sensitive values in locals — Terraform marks `secret_string` as sensitive, and that sensitivity propagates automatically to any local that references it, so the values are redacted in terminal output. Locals do not take a `sensitive` argument the way outputs do — if you ever need to mark a value sensitive explicitly, wrap it in the `sensitive()` function.
First, create the secret in AWS (run this once — not in Terraform, just in the AWS CLI):
# Create the secret in Secrets Manager — run this once from your terminal
# This is setup, not Terraform code — you would normally do this in a bootstrap process
aws secretsmanager create-secret \
--name "acme/dev/db-credentials" \
--description "Database credentials for the dev environment" \
--secret-string '{"username":"dbadmin","password":"SuperSecretP@ssw0rd123!"}'
Now add this to data.tf:
# ── SECRETS MANAGER ──────────────────────────────────────────────────────────
# Step 1 — look up the secret metadata by name
# This returns the ARN and description but NOT the value
data "aws_secretsmanager_secret" "db_credentials" {
name = "acme/${var.environment}/db-credentials" # Path-based naming by environment
}
# Step 2 — read the current version of the secret value
# secret_string contains the plaintext JSON — always marked sensitive by Terraform
data "aws_secretsmanager_secret_version" "db_credentials" {
secret_id = data.aws_secretsmanager_secret.db_credentials.id # Reference step 1
}
Now add this to locals.tf — parse the JSON secret into usable fields:
# Parse the JSON secret string into individual fields
# jsondecode() converts '{"username":"dbadmin","password":"..."}' into a Terraform object
# Sensitivity propagates automatically from secret_string — these values never appear in terminal output
locals {
db_credentials = jsondecode(
data.aws_secretsmanager_secret_version.db_credentials.secret_string
)
# Individual fields accessed with dot notation after decoding
db_username = local.db_credentials.username # "dbadmin"
db_password = local.db_credentials.password # The actual password — never printed
}
$ terraform plan
data.aws_secretsmanager_secret.db_credentials: Reading...
data.aws_secretsmanager_secret.db_credentials: Read complete after 0s
data.aws_secretsmanager_secret_version.db_credentials: Reading...
data.aws_secretsmanager_secret_version.db_credentials: Read complete after 0s
# aws_db_instance.main will be created
+ resource "aws_db_instance" "main" {
+ username = "dbadmin"
+ password = (sensitive value) # Secret — never printed
+ ...
}
# In terraform console — sensitive values are protected
$ terraform console
> local.db_username
"dbadmin"
> local.db_password
(sensitive value)

What just happened?
- The secret was fetched from Secrets Manager and parsed in one step. The two-stage lookup — first the secret metadata, then the version — is how Secrets Manager is designed. The metadata data source returns the stable ARN. The version data source uses that ARN to get the current value. If the secret is rotated in Secrets Manager, the next `terraform plan` automatically picks up the new value.
- jsondecode() turned the JSON string into an addressable object. The raw `secret_string` is `{"username":"dbadmin","password":"..."}` — a JSON string. After `jsondecode()`, it becomes a Terraform object where `local.db_credentials.username` and `local.db_credentials.password` are valid expressions. This pattern works for any structured secret regardless of how many fields it contains.
- local.db_password printed as (sensitive value) in the console. The `secret_string` attribute on a Secrets Manager data source is automatically marked sensitive by the AWS provider. Any local that references it inherits that sensitivity, so the password is redacted in plan output and in `terraform console`. One caveat: the plaintext value is still written to the Terraform state file, so the state itself must be stored encrypted and access-controlled.
Putting It All Together — main.tf
Now we put the data sources to work in a real main.tf. Zero hardcoded IDs. The AMI is looked up dynamically. The VPC and subnets come from the existing networking layer. The database password comes from Secrets Manager. This is what production Terraform looks like.
# Security group — lives in the existing VPC from the data source
resource "aws_security_group" "app" {
name = "app-sg-${var.environment}"
description = "Application security group"
vpc_id = data.aws_vpc.existing.id # VPC ID from data source — not hardcoded
ingress {
description = "HTTP"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
description = "All outbound"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# EC2 instance — AMI and subnet from data sources, never hardcoded
resource "aws_instance" "app" {
# Dynamic AMI — always the latest Amazon Linux 2 for the current region
ami = data.aws_ami.amazon_linux_2.id
instance_type = "t2.micro"
# First subnet from the list returned by aws_subnets data source
subnet_id = data.aws_subnets.public.ids[0]
vpc_security_group_ids = [aws_security_group.app.id]
tags = {
Name = "app-${var.environment}"
AmiName = data.aws_ami.amazon_linux_2.name # Track which AMI version is running
AccountId = data.aws_caller_identity.current.account_id # Tag with AWS account ID
}
lifecycle {
create_before_destroy = true
# The AMI ID will change every time AWS releases a new version
# ignore_changes prevents Terraform from replacing the instance on every plan
# Instances are patched in-place — not replaced on every AMI update
ignore_changes = [ami]
}
}
# RDS database — password comes from Secrets Manager via data source
resource "aws_db_instance" "main" {
identifier = "appdb-${var.environment}"
engine = "postgres"
engine_version = "15.3"
instance_class = "db.t3.micro"
allocated_storage = 20
db_name = "appdb"
username = local.db_username # Parsed from Secrets Manager JSON
password = local.db_password # Sensitive — never printed in any output
  # Public subnets keep this dev example simple — in production use private subnets
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.app.id]
skip_final_snapshot = true # For dev — always false in production
tags = {
Name = "appdb-${var.environment}"
}
}
# Subnet group for RDS — requires at least two subnets in different AZs
resource "aws_db_subnet_group" "main" {
name = "appdb-subnet-group-${var.environment}"
description = "Subnet group for RDS in ${var.environment}"
# Pass all public subnet IDs from the data source
subnet_ids = data.aws_subnets.public.ids
tags = {
Name = "appdb-subnet-group-${var.environment}"
}
}
$ terraform plan
data.aws_caller_identity.current: Reading... [complete after 0s]
data.aws_region.current: Reading... [complete after 0s]
data.aws_availability_zones.available: Reading... [complete after 1s]
data.aws_ami.amazon_linux_2: Reading... [complete after 1s]
data.aws_ami.ubuntu_22: Reading... [complete after 1s]
data.aws_vpc.existing: Reading... [complete after 1s]
data.aws_subnets.public: Reading... [complete after 1s]
data.aws_secretsmanager_secret.db_credentials: Reading... [complete after 0s]
data.aws_secretsmanager_secret_version.db_credentials: Reading... [complete after 0s]
Terraform will perform the following actions:
# aws_security_group.app will be created
+ resource "aws_security_group" "app" {
+ name = "app-sg-dev"
+ vpc_id = "vpc-0abc123def" # From data source — not hardcoded
}
# aws_instance.app will be created
+ resource "aws_instance" "app" {
+ ami = "ami-0c55b159cbfafe1f0" # Resolved from data source
+ instance_type = "t2.micro"
+ subnet_id = "subnet-0aaa111bbb" # From data source
+ tags = {
+ "AccountId" = "123456789012"
+ "AmiName" = "amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2"
+ "Name" = "app-dev"
}
}
# aws_db_instance.main will be created
+ resource "aws_db_instance" "main" {
+ identifier = "appdb-dev"
+ username = "dbadmin"
+ password = (sensitive value) # Redacted — came from Secrets Manager
+ engine = "postgres"
}
# aws_db_subnet_group.main will be created
+ resource "aws_db_subnet_group" "main" {
+ subnet_ids = [
"subnet-0aaa111bbb",
"subnet-0ccc333ddd",
]
}
Plan: 4 to add, 0 to change, 0 to destroy.

What just happened?
- Zero hardcoded IDs anywhere in main.tf. The VPC ID, subnet IDs, AMI ID, account ID, and database password all came from data sources. Change the region variable and the AMI ID resolves to the correct Amazon Linux 2 for the new region automatically. Change the environment variable and the Secrets Manager path resolves to the correct credentials for that environment.
- The AmiName tag tracks which AMI version is running. `data.aws_ami.amazon_linux_2.name` resolves to the full version string — `amzn2-ami-hvm-2.0.20231116.0-x86_64-gp2`. This tag on the instance means you can query your AWS account later and immediately know which AMI version every instance is running — useful for security audits and patch compliance reporting.
- The database password showed as (sensitive value) throughout. From Secrets Manager through jsondecode() through the local value through the aws_db_instance resource — sensitivity propagated the entire chain. The password was never printed at any point, including in the plan output where it shows as `(sensitive value)`.
- aws_db_subnet_group received all public subnet IDs as a list. `data.aws_subnets.public.ids` is a list of all matching subnet IDs. The subnet group accepted the entire list — both `subnet-0aaa111bbb` and `subnet-0ccc333ddd`. RDS requires subnets in at least two availability zones and the data source returned both.
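The outputs.tf file created during setup is still empty. A sketch of outputs that expose the looked-up values — note that any output derived from a sensitive value must set `sensitive = true`, or Terraform refuses to plan:

```hcl
# outputs.tf — expose data source results for operators and downstream configs
output "ami_id" {
  description = "Resolved Amazon Linux 2 AMI ID for the current region"
  value       = data.aws_ami.amazon_linux_2.id
}

output "vpc_id" {
  description = "ID of the existing VPC this configuration deploys into"
  value       = data.aws_vpc.existing.id
}

output "db_username" {
  description = "Database username read from Secrets Manager"
  value       = local.db_username
  sensitive   = true # Required — the value derives from a sensitive secret_string
}
```

After apply, `terraform output ami_id` prints the resolved ID, while `terraform output db_username` shows `(sensitive value)` unless you pass `-raw` deliberately.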
Common Mistakes
Using aws_subnet (singular) when multiple subnets match the filter
aws_subnet expects exactly one result. If your filter matches two subnets, Terraform throws an error: "multiple results matched the given filter". Use aws_subnets (plural) when you want a list of all matching subnets, and aws_subnet only when your filter is specific enough to return exactly one result.
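When you genuinely want a single subnet, narrow the query until exactly one can match. A hedged sketch, assuming the networking project tagged exactly one public subnet per availability zone:

```hcl
# Singular lookup — errors unless the query matches exactly one subnet
data "aws_subnet" "public_a" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]
  }

  filter {
    name   = "tag:Tier"
    values = ["public"]
  }

  # This argument makes the result unique — one public subnet per AZ
  availability_zone = data.aws_availability_zones.available.names[0]
}
```

If the uniqueness assumption does not hold in your VPC, prefer `aws_subnets` and index into `.ids` instead.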
Not setting most_recent = true on AMI data sources
Without most_recent = true, if multiple AMIs match your filters Terraform throws an error rather than picking one. And even if only one matches today, AWS releases new AMIs regularly — your filter will eventually match multiple results and break your plan. Always set most_recent = true when looking up AMIs by name pattern.
Not adding ignore_changes = [ami] when using dynamic AMI lookup
Every time AWS releases a new Amazon Linux 2 AMI, the data source resolves to a new ID. On the next terraform plan, Terraform sees that the running instance uses the old AMI and plans a -/+ replace. If you patch instances in-place rather than rebuilding them, this causes unintended replacements. Add ignore_changes = [ami] to the lifecycle block unless your process intentionally rebuilds instances on every AMI update.
Keep data sources in data.tf
Separating data sources into their own data.tf file is not a Terraform requirement but it is a convention worth adopting. When you open a project, a data.tf file tells you immediately what the configuration reads from the outside world — what it depends on but does not own. This distinction between "what we read" and "what we own" is one of the most important things to make explicit in any non-trivial Terraform project.
Practice Questions
1. You declare a data source: data "aws_ami" "amazon_linux_2" {...}. What is the correct expression to reference its id attribute in a resource block?
2. When multiple AMIs match your aws_ami filter, which argument tells Terraform to return the newest one instead of throwing an error?
3. A Secrets Manager secret contains JSON: {"username":"admin","password":"secret"}. Which Terraform function converts that string into an object so you can access .username and .password?
Quiz
1. When does Terraform execute data source queries?
2. What is the difference between aws_subnet and aws_subnets?
3. You use aws_ami to dynamically look up the latest AMI. On the next plan after AWS releases a new AMI, Terraform shows -/+ replace on all your instances. How do you prevent this?
Up Next · Lesson 14
Terraform State Explained
You have used state in every lesson. Lesson 14 opens it up — what is actually inside the state file, how Terraform uses it to detect drift, and what happens when state and reality diverge.