NoSQL
DynamoDB Introduction
Prime Day 2023. Amazon processes over 375 million items in 48 hours. At peak, that's millions of cart reads and writes per second, across every country, every device, and every payment method simultaneously. The database underneath much of it is DynamoDB. Its design grew out of Dynamo, the internal key-value store Amazon described publicly in 2007; DynamoDB itself became a public AWS service in 2012 and has been a backbone of Amazon's e-commerce platform ever since. This lesson teaches you how it works and how to use it.
What Makes DynamoDB Different
DynamoDB is a fully managed, serverless key-value and document database. "Fully managed" means AWS handles everything — servers, patching, replication, scaling. You never SSH into a DynamoDB node. You never set up replication. You just use it.
Single-digit ms
Designed for consistent single-digit millisecond latency at any scale. 10 users or 10 million: similar response times for key-based reads and writes.
Unlimited scale
No maximum table size. AWS automatically partitions your data as it grows.
Zero administration
No servers to manage. No index rebuilds. No vacuum. No connection pools.
The Core Concept — Tables, Items, and Attributes
DynamoDB uses its own terminology. Before writing a single line of code, these three concepts must be clear:
| DynamoDB Term | SQL Equivalent | What It Actually Is |
|---|---|---|
| Table | Table | A collection of items. No fixed schema — items can have different attributes. |
| Item | Row | A collection of attributes. Like a JSON object. Max 400KB per item. |
| Attribute | Column | A name-value pair. Can be string, number, binary, boolean, list, map, or set. |
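| Partition Key | Primary Key (or its first part) | Mandatory. DynamoDB hashes this to decide which partition stores the item. Unique on its own only when the table has no sort key; otherwise the partition key and sort key together must be unique. |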
| Sort Key | Composite PK (second part) | Optional. When combined with partition key, enables range queries within a partition. |
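To make "DynamoDB hashes this" concrete, here is a toy sketch of hash partitioning. It is not DynamoDB's actual hash function or partition layout, which are internal to AWS; `NUM_PARTITIONS` and `partition_for` are hypothetical names used only for illustration.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical; real partition counts are managed by AWS


def partition_for(partition_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition number using a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to the partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions


for key in ["u_441", "u_882", "u_107"]:
    print(key, "-> partition", partition_for(key))
```

The point of the sketch: the same key always lands on the same partition, so a lookup by partition key never has to search the others.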
Partition Keys and Sort Keys — The Design That Changes Everything
DynamoDB's performance entirely depends on your key design. This is the most important concept in the entire lesson. Get it right and queries are instant. Get it wrong and you'll scan entire tables.
The two key patterns:
Simple Primary Key (Partition Key only)
PK: user_id (String)
{ user_id: "u_441", name: "Priya" }
{ user_id: "u_882", name: "Carlos" }
Fetch: GetItem(user_id="u_441")
Result: instant — direct hash lookup
Use when you always fetch by one unique ID. One item per partition key value.
Composite Primary Key (Partition Key + Sort Key)
PK: user_id (String)
SK: order_date (String)
{ user_id:"u_441", order_date:"2024-01-15", total:£49 }
{ user_id:"u_441", order_date:"2024-02-03", total:£120 }
Fetch: Query(user_id="u_441", order_date BETWEEN...)
Result: all orders for u_441 in date range
Use when you need to query a set of related items by range. Multiple items share the same partition key.
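The two patterns above can be sketched with a small in-memory model. This is only a mental model, not how DynamoDB is implemented: `ToyTable` and its methods are hypothetical, and a plain dict stands in for the hashed partitions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTable:
    """In-memory stand-in: partition key -> items kept sorted by sort key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # pk -> sorted list of (sk, item)

    def put_item(self, pk, sk, item):
        rows = self._partitions[pk]
        keys = [k for k, _ in rows]
        i = bisect_left(keys, sk)
        if i < len(rows) and rows[i][0] == sk:
            rows[i] = (sk, item)        # same full key: overwrite the item
        else:
            rows.insert(i, (sk, item))  # keep the partition sorted by sort key

    def get_item(self, pk, sk):
        rows = self._partitions.get(pk, [])
        keys = [k for k, _ in rows]
        i = bisect_left(keys, sk)
        if i < len(rows) and rows[i][0] == sk:
            return rows[i][1]
        return None

    def query_between(self, pk, low, high):
        """Return items in one partition whose sort key is in [low, high]."""
        rows = self._partitions.get(pk, [])
        keys = [k for k, _ in rows]
        start, end = bisect_left(keys, low), bisect_right(keys, high)
        return [item for _, item in rows[start:end]]


orders = ToyTable()
orders.put_item("u_441", "2024-01-15", {"total": 49})
orders.put_item("u_441", "2024-02-03", {"total": 120})
orders.put_item("u_882", "2024-01-20", {"total": 75})

print(orders.query_between("u_441", "2024-01-01", "2024-01-31"))  # [{'total': 49}]
```

Note that `query_between` only ever touches one partition. Finding "all orders in January across every user" has no efficient path in this model, which mirrors why key design matters so much.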
The golden rule of DynamoDB key design:
Design your keys around your most frequent query pattern. Unlike SQL where you can query any column, DynamoDB is fast only when you query by partition key. Everything else requires a scan or a secondary index.
Creating a Table and Writing Items
The scenario: You're building the order management system for an e-commerce platform. Orders need to be retrieved by customer — all orders for a specific user, filtered by date range. The composite key pattern is the right design:
import boto3
from datetime import datetime
from decimal import Decimal  # needed for the numeric values written below
# Connect to DynamoDB
dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
# Create the orders table
table = dynamodb.create_table(
    TableName='orders',
    KeySchema=[
        {'AttributeName': 'user_id', 'KeyType': 'HASH'},     # partition key
        {'AttributeName': 'order_date', 'KeyType': 'RANGE'}  # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'user_id', 'AttributeType': 'S'},  # S = String
        {'AttributeName': 'order_date', 'AttributeType': 'S'}
    ],
    BillingMode='PAY_PER_REQUEST'  # on-demand pricing: no capacity planning
)
KeyType: 'HASH' vs 'RANGE'
AWS uses HASH for partition key and RANGE for sort key — legacy naming from DynamoDB's origins. Just remember: HASH = partition key, RANGE = sort key. Only these two attributes need to be in AttributeDefinitions — all other item attributes are schema-free and don't need to be declared here.
BillingMode: 'PAY_PER_REQUEST'
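On-demand mode: you pay per read/write request rather than for reserved capacity. There is no capacity planning and the table adapts to traffic automatically, although very sudden spikes or a single hot partition can still be throttled. For most applications this is the simpler and often the more cost-effective choice. The alternative is PROVISIONED, where you specify read/write capacity units upfront; it is usually cheaper at high, sustained, predictable throughput.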
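As a rough sketch of how the two billing modes differ in the `create_table` call, the helper below builds the keyword arguments for either mode. `orders_table_spec` is a hypothetical helper, and the capacity numbers are placeholders rather than recommendations.

```python
def orders_table_spec(billing_mode="PAY_PER_REQUEST", read_units=5, write_units=5):
    """Build create_table keyword arguments for the orders table."""
    spec = {
        "TableName": "orders",
        "KeySchema": [
            {"AttributeName": "user_id", "KeyType": "HASH"},      # partition key
            {"AttributeName": "order_date", "KeyType": "RANGE"},  # sort key
        ],
        "AttributeDefinitions": [
            {"AttributeName": "user_id", "AttributeType": "S"},
            {"AttributeName": "order_date", "AttributeType": "S"},
        ],
        "BillingMode": billing_mode,
    }
    if billing_mode == "PROVISIONED":
        # Provisioned mode needs explicit capacity; on-demand must not set it.
        spec["ProvisionedThroughput"] = {
            "ReadCapacityUnits": read_units,    # placeholder value
            "WriteCapacityUnits": write_units,  # placeholder value
        }
    return spec


print(orders_table_spec("PROVISIONED", read_units=10, write_units=5)["ProvisionedThroughput"])
```

With a boto3 resource in hand, the result could be passed as `dynamodb.create_table(**orders_table_spec("PROVISIONED", 10, 5))`.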
table = dynamodb.Table('orders')
table.wait_until_exists()  # table creation is asynchronous; wait until it is ACTIVE before writing
# Write an order — note: all attributes beyond keys are schema-free
table.put_item(Item={
    'user_id': 'u_441',           # partition key (required)
    'order_date': '2024-01-15',   # sort key (required)
    'order_id': 'ord_8821',       # additional attributes can have any shape
    'status': 'delivered',
    'total': Decimal('149.99'),   # use Decimal for numbers in DynamoDB
    'items': [                    # lists and nested objects are supported
        {'sku': 'TEE-M-BLUE', 'qty': 2, 'price': Decimal('29.99')},
        {'sku': 'HAT-L-RED', 'qty': 1, 'price': Decimal('19.99')}
    ]
})
Decimal('149.99')
The boto3 resource API rejects Python float values and expects Decimal instead, because binary floats cannot represent most decimal fractions exactly. Add from decimal import Decimal to your imports and use Decimal for every monetary or fractional value you store in DynamoDB.
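A short sketch of why this matters, plus one possible way to convert existing float data before writing it. `to_dynamo_numbers` is a hypothetical helper, not part of boto3; it goes through `str()` so the digits you wrote are the digits that get stored.

```python
from decimal import Decimal


def to_dynamo_numbers(value):
    """Recursively replace floats with Decimal built from their string form."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamo_numbers(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_dynamo_numbers(v) for v in value]
    return value


# Building a Decimal directly from a float carries the binary rounding error along.
print(Decimal(29.99) == Decimal("29.99"))       # False
print(Decimal(str(29.99)) == Decimal("29.99"))  # True

line_item = {"sku": "TEE-M-BLUE", "qty": 2, "price": 29.99}
print(to_dynamo_numbers(line_item))
```

Where possible, it is simpler to create values as `Decimal('29.99')` from the start and avoid floats entirely.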
items: [{"sku": ..., "qty": ..., "price": ...}]
Lists and nested maps are native DynamoDB types. The entire order — including line items — is one item in the table. No separate order_items table. No JOIN needed to get the full order. One get_item call returns everything.
Reading Data — GetItem vs Query vs Scan
DynamoDB has three ways to read data. Knowing which one to use — and which one to avoid — is critical for both performance and cost.
GetItem — Fetch one exact item
O(1): always use this if you can. Provide the full primary key (the partition key, plus the sort key if the table has one). Returns at most one item. DynamoDB hashes the partition key to go directly to the storage partition, with no scanning. Latency is typically single-digit milliseconds at any table size.
Query — Fetch items in one partition
O(log N): use for range access. Provide the partition key plus a condition on the sort key. Returns multiple items from one partition, sorted by the sort key. Efficient because it reads only the relevant partition. Use for "all orders by user X between date A and B."
Scan — Read every item in the table
O(N): avoid in production. Reads every single item in the table and applies a filter afterwards. Expensive: you're billed for every item read even if it's filtered out. Slow on large tables. Only acceptable for one-time data exports or tables under a few thousand items.
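The cost gap between the three operations can be estimated with simple arithmetic. The sketch below assumes the standard read-unit rule (one unit per 4 KB read with strong consistency, half a unit with eventual consistency) and hypothetical 1 KB items; real requests round per call, so treat the results as approximations.

```python
import math

RCU_BLOCK_BYTES = 4 * 1024  # one read unit covers up to 4 KB of data read


def read_units(bytes_read: int, strongly_consistent: bool = False) -> float:
    """Approximate read capacity units consumed for a given amount of data read."""
    blocks = max(1, math.ceil(bytes_read / RCU_BLOCK_BYTES))
    return blocks * (1.0 if strongly_consistent else 0.5)


ITEM_BYTES = 1024  # hypothetical average item size

print(read_units(ITEM_BYTES))              # GetItem on one item: 0.5
print(read_units(4 * ITEM_BYTES))          # Query returning 4 items: 0.5
print(read_units(1_000_000 * ITEM_BYTES))  # Scan over 1,000,000 items: 125000.0
```

The Scan figure is the same whether the filter keeps every item or none of them, because billing follows the data read rather than the data returned.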
The scenario: Retrieve a specific order by user and date (GetItem), then get all orders for a user in January (Query):
# GetItem — fetch one exact order
response = table.get_item(
    Key={
        'user_id': 'u_441',          # partition key
        'order_date': '2024-01-15'   # sort key
    }
)
order = response.get('Item')  # None if no item has this key
if order is not None:
    print(f"Order total: £{order['total']}")
Order total: £149.99
-- Consumed capacity: 0.5 RCU (read capacity units)
-- Latency: 2.1ms
-- Items scanned: 1 (direct hash lookup, no scanning)
response.get('Item') — returns None if the item doesn't exist, rather than raising an exception. Always use .get('Item') rather than response['Item'] to avoid KeyError on missing items.
from boto3.dynamodb.conditions import Key
# Query — all orders for user u_441 in January 2024
response = table.query(
    KeyConditionExpression=
        Key('user_id').eq('u_441') &                           # partition key: exact match
        Key('order_date').between('2024-01-01', '2024-01-31')  # sort key: range condition
)
orders = response['Items']
print(f"Orders in January: {len(orders)}")
Orders in January: 4
-- Response Items:
[
{ user_id: "u_441", order_date: "2024-01-03", total: Decimal("89.99") },
{ user_id: "u_441", order_date: "2024-01-15", total: Decimal("149.99") },
{ user_id: "u_441", order_date: "2024-01-22", total: Decimal("34.50") },
{ user_id: "u_441", order_date: "2024-01-29", total: Decimal("220.00") }
]
-- Consumed capacity: 2 RCU (read only u_441's partition)
-- Latency: 3.8ms
-- Items scanned: 4 (only items matching the partition key)
Key('order_date').between('2024-01-01', '2024-01-31')
Sort key conditions: eq, lt, lte, gt, gte, between, begins_with. The between is inclusive on both ends. Sort keys work because items within a partition are physically stored in sort key order — range queries are sequential reads.
Key('user_id').eq('u_441') & Key('order_date').between(...)
The & operator combines key conditions. The partition key condition must always be an exact match (eq). Only the sort key condition can be a range. You cannot range-query on the partition key alone — that would require scanning all partitions.
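One detail behind the range condition above: `order_date` is a String attribute, so DynamoDB compares it as text, not as a date. The sketch below uses plain Python string comparison to show why ISO 8601 dates (YYYY-MM-DD) work as sort keys while day-first formats do not. The sample values and the `in_range` helper are made up for illustration.

```python
iso_dates = ["2024-02-03", "2023-12-30", "2024-01-15"]
uk_dates = ["03/02/2024", "30/12/2023", "15/01/2024"]  # the same three days, day-first


def in_range(values, low, high):
    """Mimic an inclusive BETWEEN on string sort keys."""
    return sorted(v for v in values if low <= v <= high)


print(sorted(iso_dates))  # ['2023-12-30', '2024-01-15', '2024-02-03'] (chronological)
print(sorted(uk_dates))   # ['03/02/2024', '15/01/2024', '30/12/2023'] (not chronological)

print(in_range(iso_dates, "2024-01-01", "2024-01-31"))  # ['2024-01-15']
```

If you need time-of-day precision, a full ISO 8601 timestamp in a single fixed timezone sorts chronologically in the same way.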
Updating Items — Granular Field Updates
The scenario: An order's status changes to "shipped". You need to update just the status field and add a tracking number — without overwriting the entire item:
# Update specific attributes — leave everything else untouched
table.update_item(
    Key={
        'user_id': 'u_441',
        'order_date': '2024-01-15'
    },
    UpdateExpression='SET #s = :status, tracking = :tracking, updated_at = :ts',
    ExpressionAttributeNames={
        '#s': 'status'  # 'status' is a reserved word in DynamoDB
    },
    ExpressionAttributeValues={
        ':status': 'shipped',
        ':tracking': 'DHL-9921-EU',
        ':ts': datetime.now().isoformat()
    }
)
UpdateExpression='SET #s = :status, tracking = :tracking'
The SET action adds or replaces specific attributes. Only the named attributes change — everything else in the item (items list, total, order_id) stays exactly as it was. No full-item replacement. DynamoDB also supports REMOVE (delete an attribute), ADD (increment a number or add to a set), and DELETE (remove set elements).
ExpressionAttributeNames: {'#s': 'status'}
DynamoDB has reserved words — status, name, type, count and many others. If your attribute name is a reserved word, you must use an expression attribute name (prefixed with #) as an alias. This is a common gotcha that causes confusing validation errors.
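One way to sidestep the reserved-word problem is to alias every attribute name, reserved or not. The helper below is a hypothetical sketch (it is not part of boto3) that builds the three `update_item` arguments for a SET update from a plain dict.

```python
def build_set_update(changes: dict) -> dict:
    """Build update_item arguments for a SET update, aliasing every attribute name."""
    if not changes:
        raise ValueError("changes must contain at least one attribute")
    names, values, clauses = {}, {}, []
    for i, (attr, value) in enumerate(changes.items()):
        name_alias, value_alias = f"#n{i}", f":v{i}"
        names[name_alias] = attr      # e.g. '#n0' -> 'status'
        values[value_alias] = value   # e.g. ':v0' -> 'shipped'
        clauses.append(f"{name_alias} = {value_alias}")
    return {
        "UpdateExpression": "SET " + ", ".join(clauses),
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


args = build_set_update({"status": "shipped", "tracking": "DHL-9921-EU"})
print(args["UpdateExpression"])  # SET #n0 = :v0, #n1 = :v1
```

The result could then be passed along with the key, for example `table.update_item(Key={'user_id': 'u_441', 'order_date': '2024-01-15'}, **args)`.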
Global Secondary Indexes — Querying Non-Key Attributes
DynamoDB only lets you query efficiently by the table's primary key. But what if you need to find all orders with status "shipped"? Status is not the partition key — so a table query would require a full scan. The solution is a Global Secondary Index (GSI) — a separate copy of the data organised by a different key.
Think of a GSI as a second table with a different primary key, automatically kept in sync by DynamoDB:
Main table (orders)
PK: user_id
SK: order_date
Fast: "orders by user"
Slow: "orders by status"
GSI (status-date-index)
PK: status
SK: order_date
Fast: "orders by status"
Fast: "shipped in Jan"
# Query the GSI — find all shipped orders in January
response = table.query(
    IndexName='status-date-index',   # name of the GSI
    KeyConditionExpression=
        Key('status').eq('shipped') &   # GSI partition key
        Key('order_date').between('2024-01-01', '2024-01-31')
)
print(f"Shipped orders in January: {len(response['Items'])}")
Shipped orders in January: 847
-- DynamoDB queried the GSI partition for "shipped"
-- Then range-filtered by order_date within that partition
-- No full table scan; efficient even at millions of orders
-- GSI is automatically kept in sync with the main table
IndexName='status-date-index' — tells DynamoDB to use the GSI instead of the main table. The query syntax is identical to a regular table query — you just point it at a different index. DynamoDB bills GSI reads separately from main table reads.
Important GSI limitation: GSIs support only eventually consistent reads. There's a brief lag (typically milliseconds to seconds) between writing to the main table and the GSI reflecting the change. For strongly consistent reads, you must read from the main table using its own primary key.
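The query above assumes an index named status-date-index already exists. A minimal sketch of how it might be added to the existing orders table is shown below: the helpers build the arguments for the low-level client's `update_table` call. The helper names are hypothetical, and projecting ALL attributes is an assumption; a narrower projection reduces storage and write cost.

```python
def status_date_index_spec() -> dict:
    """Definition of a GSI keyed by status (partition key) and order_date (sort key)."""
    return {
        "IndexName": "status-date-index",
        "KeySchema": [
            {"AttributeName": "status", "KeyType": "HASH"},
            {"AttributeName": "order_date", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},  # assumption: copy every attribute
    }


def add_index_request(table_name: str = "orders") -> dict:
    """Build update_table arguments that create the GSI on an on-demand table."""
    return {
        "TableName": table_name,
        # Every attribute used as an index key must be declared with its type.
        "AttributeDefinitions": [
            {"AttributeName": "status", "AttributeType": "S"},
            {"AttributeName": "order_date", "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [{"Create": status_date_index_spec()}],
    }


print(add_index_request()["GlobalSecondaryIndexUpdates"][0]["Create"]["IndexName"])
```

With a client such as `boto3.client('dynamodb', region_name='eu-west-1')`, this could be applied as `client.update_table(**add_index_request())`. DynamoDB then backfills the index from existing items, so the index is not queryable immediately on a large table.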
DynamoDB vs Redis — When to Use Which
| Criteria | Redis | DynamoDB |
|---|---|---|
| Data persistence | Optional — primarily in-memory | Always — durable by design |
| Data size limit | Limited by RAM (GBs) | Unlimited (petabytes) |
| Latency | Sub-ms (RAM access) | Single-digit ms (SSD) |
| Data structures | Rich: sorted sets, lists, pub/sub | Key-value + document only |
| Management | Self-managed or Redis Cloud | Fully managed by AWS |
| Best for | Cache, sessions, real-time counters | Durable application data at scale |
Teacher's Note
The biggest DynamoDB mistake I see is treating it like a SQL database with a weird syntax. DynamoDB rewards a completely different mental model: you are pre-computing your query results at write time by choosing the right keys. If you find yourself writing a Scan or adding a GSI for every new query, that's a signal your key design doesn't match your access patterns. The best time to think about DynamoDB key design is before you write the first item — not after you have 100 million of them.
Practice Questions — You're the Engineer
Scenario:
user_id as the only key. Their code reads every item in the table and filters by created_at in the application. They notice it gets slower as the user base grows and AWS bills are increasing unexpectedly. Which DynamoDB operation are they using, and why is it problematic?
Scenario:
user_id as partition key and order_date as sort key. Your operations team now needs to query all orders with status "pending" across all users to process them in bulk. You cannot change the primary key of an existing table. What DynamoDB feature lets you query by status efficiently without a Scan?
Quiz — DynamoDB Design Decisions
Scenario:
message_id, conversation_id, sender_id, body, and timestamp. What is the correct key design?
Scenario:
'price': 29.99 using a Python float in a DynamoDB put_item call. After fetching the item back, the price shows as 29.989999999999. Your finance team is reporting rounding errors in invoices. What is the fix?
Up Next · Lesson 14
Document Databases
A deep dive into how document stores work under the hood — BSON storage, indexing strategies, embedding vs referencing, and the aggregation pipeline that makes complex queries possible without joins.