NoSQL Lesson 39 – Cloud-Native NoSQL | Dataplexa
Enterprise & Cloud · Lesson 39

Cloud-Native NoSQL

Cloud-native is not just a hosting decision. It is a design philosophy — building systems that assume the cloud's elasticity, pay-per-use pricing, and managed services as first principles rather than bolting them on afterward. For NoSQL, cloud-native means databases that scale to zero when idle, burst to millions of operations per second under load, replicate globally with a config change, and never require a 3am pager alert because a node ran out of disk space. This lesson is about what that looks like in practice.

What Cloud-Native NoSQL Actually Means

The term gets overused. Here is a practical definition: a database is cloud-native when it is built around cloud primitives rather than adapted from on-premise ones. The characteristics that matter in production:

Scale to Zero
No traffic, no cost. A dev environment database costs nothing at midnight. A production database scales down between traffic valleys and up during spikes — automatically, with no manual intervention.
🌍
Global Distribution
Data replicated to multiple regions with a configuration change. Users in Tokyo get low-latency reads from the Tokyo replica. A region outage fails over automatically — no runbook required.
🔌
API-First Management
Every configuration change — capacity, backups, replication, access control — is an API call or Infrastructure-as-Code declaration. No SSH. No config files. No manual steps that cannot be reproduced.
💳
Consumption Pricing
Pay for what you use — reads, writes, storage, data transfer. No upfront capacity reservation. Early-stage products pay pennies. Hyperscale products pay proportionally to value delivered.
🔁
Built-in Resilience
Multi-AZ replication, automatic failover, and point-in-time recovery are defaults, not add-ons. The service SLA covers availability — not the engineer on call at 3am.
🏗️
Infrastructure as Code
Tables, indexes, capacity, and access policies declared in Terraform, CDK, or CloudFormation. Every environment — dev, staging, production — is reproducible from the same definition.

Serverless Patterns — DynamoDB with AWS Lambda

The most cloud-native NoSQL architecture pairs DynamoDB with Lambda. Both scale to zero, both scale to any load, both charge per operation. Together they form a completely serverless data path — no servers to manage, no capacity to plan, no idle infrastructure paying rent at 4am on a Sunday when no one is using your application.

The scenario: You are building a product recommendation API for an e-commerce platform. The traffic pattern is deeply spiky — near zero overnight, thousands of requests per second during business hours, and unpredictable surges during flash sales. You are building it serverless end-to-end: API Gateway → Lambda → DynamoDB. You need the Lambda function to read user preferences and record recommendation click events idempotently, handling the cold start penalty gracefully.

Python — Lambda function with DynamoDB connection reuse
import boto3
import json
import os

# Initialise outside the handler — reused across warm invocations
# Cold start: this runs once and the connection is kept alive in the container
dynamodb = boto3.resource("dynamodb", region_name=os.environ["AWS_REGION"])
recommendations_table = dynamodb.Table(os.environ["RECOMMENDATIONS_TABLE"])
events_table          = dynamodb.Table(os.environ["EVENTS_TABLE"])

def handler(event, context):
    user_id = event["pathParameters"]["userId"]

    # API Gateway proxy integration delivers the body as a JSON string
    body       = json.loads(event["body"]) if event.get("body") else {}
    product_id = body.get("productId")

    # Read user preferences — consistent read for freshness
    prefs = recommendations_table.get_item(
        Key={"userId": user_id},
        ConsistentRead=True
    ).get("Item", {})

    # Write click event — conditional write to prevent duplicates
    # ConditionExpression: only write if an item with this SK doesn't exist yet
    if product_id:
        events_table.put_item(
            Item={
                "userId":    user_id,
                "SK":        context.aws_request_id,  # unique per invocation
                "eventType": "recommendation_click",
                "productId": product_id
            },
            ConditionExpression="attribute_not_exists(SK)"
        )

    return {
        "statusCode": 200,
        "body": json.dumps({
            "userId":          user_id,
            "preferences":     prefs.get("categories", []),
            "recommendedNext": prefs.get("nextProducts", [])
        })
    }
// Cold start (first invocation after idle period)
Init duration:    142ms  (boto3 + DynamoDB table resource initialised)
Handler duration: 8.1ms
Total:            150ms

// Warm invocations (container reused)
Handler duration: 3.2ms  (no re-initialisation — connection reused)
Handler duration: 2.9ms
Handler duration: 3.1ms

// Concurrent scale-out (flash sale spike — 5,000 req/sec)
Lambda concurrency: 500 containers
DynamoDB: auto-scaled to 5,000 read capacity units + 5,000 write capacity units
Zero throttling  ✓  |  Zero server management  ✓
boto3.resource() outside the handler

Lambda reuses the execution environment (container) between invocations as long as traffic continues. Code outside the handler function runs only during a cold start — the container initialisation. By placing the DynamoDB resource outside the handler, you pay the 142ms initialisation cost once and then get sub-5ms DynamoDB access on every subsequent warm invocation. Moving it inside the handler would reinitialise boto3 on every single call — a 10–50ms tax on 100% of your requests.

ConditionExpression="attribute_not_exists(SK)"

Lambda's event sources generally provide at-least-once delivery — if a function times out or a network error causes the caller to retry, the same event can be delivered twice. Without the condition, you write duplicate click events. With attribute_not_exists(SK), DynamoDB rejects the write if the item already exists — the second delivery is a no-op. This makes the function idempotent: safe to call multiple times with the same input and always produce the same result.

context.aws_request_id as the SK

AWS Lambda injects a unique aws_request_id into every invocation's context object. Using it as the sort key guarantees uniqueness — no two invocations share the same request ID. When Lambda itself retries a failed asynchronous invocation, the same aws_request_id is reused, so the attribute_not_exists condition catches the duplicate and prevents the double write. Note that a retry by an external caller — such as an API Gateway client resending a request — is a brand-new invocation with a new request ID; for end-to-end idempotency across client retries, derive the sort key from a client-supplied idempotency key instead.

DynamoDB Streams — Event-Driven Architecture

Every write to a DynamoDB table can trigger a downstream reaction — a Lambda function, a search index update, a cache invalidation, an audit log entry. DynamoDB Streams captures every insert, update, and delete as an ordered log of change records, and Lambda consumes that stream in real time. This is the cloud-native equivalent of the Outbox pattern from Lesson 33 — except DynamoDB handles the reliability of the stream for you.

The scenario: Your platform stores product inventory in DynamoDB. When stock reaches zero, the customer-facing product page must immediately show "Out of Stock" — served from Elasticsearch for full-text search. Instead of updating Elasticsearch synchronously in the write path (which couples two systems and doubles your failure surface), you are using DynamoDB Streams to propagate inventory changes to Elasticsearch asynchronously and reliably.

Python — Lambda stream processor for DynamoDB → Elasticsearch sync
import os

from elasticsearch import Elasticsearch

# Initialised outside handler — reused across warm invocations
es = Elasticsearch([os.environ["ES_ENDPOINT"]])

def handler(event, context):
    """Triggered by DynamoDB Streams — processes a batch of change records."""
    for record in event["Records"]:

        event_name = record["eventName"]   # INSERT | MODIFY | REMOVE
        new_image  = record["dynamodb"].get("NewImage", {})
        old_image  = record["dynamodb"].get("OldImage", {})

        product_id = new_image.get("productId", {}).get("S") or \
                     old_image.get("productId", {}).get("S")

        if event_name in ("INSERT", "MODIFY"):
            # Upsert into Elasticsearch — update if exists, create if not
            es.index(
                index="products",
                id=product_id,
                body={
                    "productId": product_id,
                    "name":      new_image.get("name", {}).get("S"),
                    "stock":     int(new_image.get("stock", {}).get("N", 0)),
                    "inStock":   int(new_image.get("stock", {}).get("N", 0)) > 0
                }
            )

        elif event_name == "REMOVE":
            # Product deleted from DynamoDB — remove from search index too
            es.delete(index="products", id=product_id, ignore=[404])
DynamoDB Streams event received — 3 records in batch:

Record 1: MODIFY  productId=prod_001
  OldImage: { stock: 1 }
  NewImage: { stock: 0 }
  → Elasticsearch upsert: { inStock: false }  ✓
  → Product page now shows "Out of Stock" within ~200ms

Record 2: INSERT  productId=prod_892
  NewImage: { name: "Trail Runners", stock: 50 }
  → Elasticsearch index created  ✓

Record 3: REMOVE  productId=prod_017
  → Elasticsearch document deleted  ✓

Stream lag: 180ms  (DynamoDB write → Elasticsearch update)
Zero coupling between inventory write path and search sync  ✓
eventName: INSERT | MODIFY | REMOVE

DynamoDB Streams delivers every change as one of three event types. NewImage contains the item state after the change; OldImage contains it before. For INSERT, only NewImage exists. For REMOVE, only OldImage exists. For MODIFY, both are present — letting you compute exactly what changed. You configure which images to include when you enable Streams: NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES, or KEYS_ONLY.
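Computing exactly what changed in a MODIFY record is a few lines of dictionary comparison over the two images. A sketch — changed_fields is an illustrative helper, not part of the Streams API:

```python
def changed_fields(record):
    """Return {attribute: (old_value, new_value)} for a stream record.

    Operates on the raw DynamoDB-typed images, e.g. {"stock": {"N": "0"}}.
    Attributes present in only one image appear with None on the other side.
    """
    old = record["dynamodb"].get("OldImage", {})
    new = record["dynamodb"].get("NewImage", {})
    return {
        key: (old.get(key), new.get(key))
        for key in set(old) | set(new)
        if old.get(key) != new.get(key)
    }


# Example: the stock change from the output panel above
record = {
    "eventName": "MODIFY",
    "dynamodb": {
        "OldImage": {"productId": {"S": "prod_001"}, "stock": {"N": "1"}},
        "NewImage": {"productId": {"S": "prod_001"}, "stock": {"N": "0"}},
    },
}
# changed_fields(record) → {"stock": ({"N": "1"}, {"N": "0"})}
```

This only works for MODIFY records on streams configured with NEW_AND_OLD_IMAGES — with NEW_IMAGE or KEYS_ONLY there is nothing to diff against.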

ignore=[404] on delete

If the product was never indexed in Elasticsearch — because it was created and deleted before the stream processor caught up, or because indexing failed earlier — the delete would throw a 404. Passing ignore=[404] silences this error and treats the delete as a no-op. The end state is the same: the document does not exist in Elasticsearch. Always design stream processors to be idempotent and tolerant of partial state.

180ms stream lag — acceptable for this use case

DynamoDB Streams delivers records near-real-time but not synchronously — there is typically 100–300ms of propagation delay. For inventory search, 200ms is completely acceptable. For a banking ledger where both systems must be in sync before a transaction is confirmed, this async pattern is not appropriate — you would need a synchronous dual-write with a transaction, not a stream.

Multi-Region Active-Active — DynamoDB Global Tables

DynamoDB Global Tables is one of the most powerful cloud-native database features available anywhere. You add regions to a table with an API call and DynamoDB automatically replicates every write to every replica region — typically within one second. Every region is both readable and writable. A user in Singapore writes to the Singapore replica; that write propagates to Frankfurt and Virginia within a second. No custom replication code, no cross-region middleware, no read-only secondaries.

The scenario: Your SaaS platform has just signed customers in Europe and Singapore. Users in Frankfurt are experiencing 180ms read latency because all requests hit your us-east-1 DynamoDB table. You want read latency under 10ms globally and write availability that survives a full regional outage. You are enabling Global Tables to add eu-west-1 (Ireland) and ap-southeast-1 (Singapore) as active replicas.

AWS CLI — enabling DynamoDB Global Tables across 3 regions
# Step 1: add replica regions to the existing table
# DynamoDB replicates all existing data and all future writes automatically
aws dynamodb update-table \
  --table-name UserProfiles \
  --replica-updates '[
    {"Create": {"RegionName": "eu-west-1"}},
    {"Create": {"RegionName": "ap-southeast-1"}}
  ]'

# Step 2: in application code — connect to the nearest region
# Each SDK client points to the local region replica
import boto3, os

# Reads from the region-local replica — low latency
# Writes go to the local replica and propagate globally automatically
dynamodb = boto3.resource(
    "dynamodb",
    region_name=os.environ["AWS_REGION"]   # set per deployment region
)

table = dynamodb.Table("UserProfiles")

# Write in Singapore — propagates to Ireland + Virginia within ~1s
table.put_item(Item={
    "userId":   "user_sg_8821",
    "name":     "Wei Lin",
    "region":   "ap-southeast-1",
    "lastSeen": "2025-03-10T14:22:01Z"
})
// Replica creation initiated
eu-west-1       status: CREATING → ACTIVE  (12 minutes, existing data sync)
ap-southeast-1  status: CREATING → ACTIVE  (14 minutes)

// Read latency after Global Tables enabled
Frankfurt user → eu-west-1 replica:      6.2ms  (was 180ms)  ✓
Singapore user → ap-southeast-1 replica: 4.8ms  (was 210ms)  ✓
Virginia user  → us-east-1 replica:      3.1ms               ✓

// Write in Singapore → propagated to:
  ap-southeast-1: immediate
  eu-west-1:      +820ms  (cross-region replication)
  us-east-1:      +910ms

// us-east-1 regional outage simulation
  → eu-west-1 and ap-southeast-1 continue serving reads AND writes
  → zero application changes required
  → automatic failover  ✓
Every region is writable — conflict resolution

When two users in different regions update the same item before replication completes, DynamoDB resolves the conflict with last-writer-wins based on wall-clock timestamps. The write with the later timestamp wins — the other write is silently discarded. For most user profile and session data this is acceptable. For financial data where every write must be preserved, last-writer-wins is not appropriate — use a single-region write anchor with multi-region read replicas instead.
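A single-region write anchor usually pairs with an optimistic version check, so a conflicting update is rejected loudly instead of being discarded. A minimal sketch — the helper, the accountId/balance/version attribute names, and the schema are illustrative assumptions, not part of the lesson's tables:

```python
def debit_with_version_check(table, account_id, amount, expected_version):
    """Conditional update against the anchor region's table.

    If another writer bumped the version first, DynamoDB raises
    ConditionalCheckFailedException — the caller re-reads and retries,
    and no write is ever silently lost.
    """
    return table.update_item(
        Key={"accountId": account_id},
        UpdateExpression="SET balance = balance - :amt, version = version + :one",
        ConditionExpression="version = :expected",
        ExpressionAttributeValues={
            ":amt": amount,
            ":one": 1,
            ":expected": expected_version,
        },
    )
```

The trade-off versus Global Tables active-active: every write pays cross-region latency to the anchor, in exchange for no write ever disappearing.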

Replication lag ~900ms — what does this mean?

A write in Singapore takes about 900ms to appear in Virginia. If a user in Singapore updates their profile and then a process in Virginia reads it 500ms later, it will read the old value. This is eventual consistency across regions. Design your application to tolerate this: use region-local reads for user-facing operations, and avoid cross-region read-after-write patterns for data the user just modified.

Zero application code changes for failover

When us-east-1 goes down, the application in eu-west-1 and ap-southeast-1 continues using its local DynamoDB replica — it never knew about the Virginia outage. There is no DNS failover to configure, no connection string to change, no circuit breaker to trip. The regional isolation is built into the architecture. This is the defining advantage of active-active global distribution over traditional primary-secondary replication.

Infrastructure as Code — Provisioning with Terraform

Cloud-native means every piece of infrastructure is reproducible from code. A DynamoDB table created by clicking through the AWS console is a liability — no one knows exactly how it was configured six months later, it cannot be replicated to staging or a new AWS account without manual effort, and there is no change history. Declaring it in Terraform means every configuration decision is version-controlled, reviewable, and deployable to any environment in minutes.

The scenario: Your team has been managing the production DynamoDB table manually through the console. After an incident where someone accidentally changed the billing mode and the table started throttling, your engineering manager mandates that all infrastructure must be in Terraform before the next quarter. You are writing the Terraform definition for your UserEvents table, including the GSI, stream, and auto-scaling policy.

Terraform — DynamoDB table with GSI, streams, and auto-scaling
resource "aws_dynamodb_table" "user_events" {
  name         = "UserEvents-${var.environment}"
  billing_mode = "PROVISIONED"
  hash_key     = "userId"    # partition key
  range_key    = "SK"        # sort key

  read_capacity  = 100
  write_capacity = 50

  attribute {
    name = "userId"
    type = "S"
  }
  attribute {
    name = "SK"
    type = "S"
  }
  attribute {
    name = "eventType"    # needed for GSI — must be declared here
    type = "S"
  }

  # GSI for querying by eventType across all users
  global_secondary_index {
    name            = "EventTypeIndex"
    hash_key        = "eventType"
    range_key       = "SK"
    projection_type = "ALL"
    read_capacity   = 50
    write_capacity  = 25
  }

  # Enable DynamoDB Streams — captures NEW and OLD images
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  # Point-in-time recovery — 35-day restore window
  point_in_time_recovery { enabled = true }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
$ terraform plan
Plan: 1 to add, 0 to change, 0 to destroy.

  + aws_dynamodb_table.user_events
      name:               "UserEvents-production"
      billing_mode:       "PROVISIONED"
      hash_key:           "userId"
      range_key:          "SK"
      read_capacity:      100
      write_capacity:     50
      stream_enabled:     true
      stream_view_type:   "NEW_AND_OLD_IMAGES"
      point_in_time_recovery.enabled: true
      global_secondary_index:  EventTypeIndex (eventType → SK)

$ terraform apply
aws_dynamodb_table.user_events: Creating...
aws_dynamodb_table.user_events: Creation complete after 14s  ✓

// Same definition deploys identically to dev, staging, production
// Change history in git — who changed what, when, and why  ✓
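The scenario also calls for an auto-scaling policy, which the fixed read_capacity/write_capacity above does not provide on its own. In Terraform that is expressed with the AWS provider's Application Auto Scaling resources — a sketch for read capacity only, where the min/max and target utilisation values are illustrative assumptions, not prescribed settings:

```hcl
resource "aws_appautoscaling_target" "user_events_read" {
  min_capacity       = 100
  max_capacity       = 500
  resource_id        = "table/${aws_dynamodb_table.user_events.name}"
  scalable_dimension = "dynamodb:table:ReadCapacityUnits"
  service_namespace  = "dynamodb"
}

resource "aws_appautoscaling_policy" "user_events_read" {
  name               = "UserEventsReadAutoScaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.user_events_read.resource_id
  scalable_dimension = aws_appautoscaling_target.user_events_read.scalable_dimension
  service_namespace  = aws_appautoscaling_target.user_events_read.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBReadCapacityUtilization"
    }
    target_value = 70.0   # scale to keep consumed capacity near 70% of provisioned
  }
}
```

A matching target/policy pair is needed for write capacity, and separate pairs for the GSI (scalable_dimension "dynamodb:index:ReadCapacityUnits" with the index in the resource_id).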
name = "UserEvents-${var.environment}"

Parameterising the table name with the environment variable lets the same Terraform module deploy identical infrastructure to dev, staging, and production — with different names so they do not collide. terraform apply -var="environment=staging" creates UserEvents-staging. This is the foundation of environment parity: every environment is created from the same definition, eliminating "works in dev, fails in production" configuration drift.

attribute block for GSI fields

Terraform requires you to declare every attribute used in a key (primary or GSI) in an attribute block — even if those attributes are only used in the GSI. You do not declare all your item attributes here, only the key attributes. If you reference eventType in a GSI but forget the attribute block, Terraform will error with a validation failure before creating anything.

point_in_time_recovery: enabled = true

One line in Terraform activates a 35-day continuous backup window. Without it, your only recovery option after an accidental mass delete is the most recent on-demand backup — potentially hours old. With it, you can restore to any second in the past 35 days. In a code review, this line is visible, discussable, and mergeable. A forgotten checkbox in the AWS console is invisible until the incident.
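For completeness, a sketch of what the restore looks like when the incident does happen — the table names and timestamp here are hypothetical:

```shell
# Restore to a specific second within the 35-day window.
# PITR always restores into a NEW table — DynamoDB never restores in place.
aws dynamodb restore-table-to-point-in-time \
  --source-table-name UserEvents-production \
  --target-table-name UserEvents-production-restored \
  --restore-date-time 2025-03-10T14:00:00Z
```

After validating the restored table, you repoint the application (or copy the affected items back) — the restore itself never touches the damaged table.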

Teacher's Note

Cloud-native NoSQL is not about using the newest services — it is about eliminating the operational work that does not differentiate your product. Nobody becomes a better engineer because they spent Saturday patching a Cassandra cluster. The engineers who move fastest are the ones who spend zero time on undifferentiated infrastructure and 100% of their time on the problems only their company can solve. Managed services, Infrastructure as Code, and event-driven patterns are how you get there. Use them deliberately, not as cargo cult — understand the trade-offs, design for eventual consistency where it is acceptable, and build the simplest architecture that meets your actual requirements.

Practice Questions — You're the Engineer

Scenario:

Your AWS Lambda function initialises a boto3.resource("dynamodb") client inside the handler function on every invocation. The function is called 10,000 times per minute. Your X-Ray traces show 40ms of initialisation overhead on every single invocation — totalling 400 seconds of wasted compute per minute. A senior engineer tells you the fix is a single-line code change that moves one statement to a different location in the file. Where should the boto3.resource() initialisation be placed so it runs only once per container lifecycle rather than once per invocation?


Scenario:

Your application uses DynamoDB Global Tables across us-east-1 and eu-west-1. A user in New York and a user in London both update the same shared document at 14:22:01.000 UTC — New York sets the status field to "approved" and London sets it to "rejected" in the same millisecond. Both writes succeed with a 200 response. Half a second later, your monitoring shows the document has status: "approved" in both regions. The London write was silently discarded. What conflict resolution strategy does DynamoDB Global Tables use when two replicas receive conflicting writes simultaneously?


Scenario:

You are building a Lambda function that processes DynamoDB Streams to maintain an audit log of all changes to a Contracts table. Each audit log entry must record both the value of every field before the change and the value after the change — so auditors can see exactly what was modified. You are configuring the stream_view_type in Terraform. Which stream view type gives your Lambda function access to both the before and after state of every modified item?


Quiz — Cloud-Native NoSQL in Production

Scenario:

Your Lambda function processes order placement events from an SQS queue and writes each order to DynamoDB with put_item(). During a network degradation event, Lambda times out while waiting for DynamoDB to confirm the write. SQS marks the message as unprocessed and redelivers it. Your Lambda runs again, successfully writes the order — but now the same order appears twice in DynamoDB. A colleague suggests adding ConditionExpression="attribute_not_exists(orderId)" to the put_item() call. Why is this condition necessary, and what problem does it solve?

Scenario:

Your team previously updated Elasticsearch synchronously inside the inventory write path: write to DynamoDB, then immediately update Elasticsearch, then return success to the caller. During an Elasticsearch cluster upgrade, the sync call started timing out — causing every inventory write to fail with a 500 error, even though DynamoDB was perfectly healthy. You replaced the synchronous call with a DynamoDB Streams trigger. How does the Streams architecture prevent an Elasticsearch outage from affecting inventory writes?

Scenario:

Your fintech company is evaluating DynamoDB Global Tables for a multi-region payment ledger. The requirements state that every payment write must be durably recorded — no write can ever be silently discarded. Two engineers run a test: they write a debit of $500 from a New York instance and a credit of $200 from a London instance to the same account item within 50ms of each other. Both writes receive 200 OK responses. When they query the item 2 seconds later, only the credit of $200 is reflected. The $500 debit has vanished with no error anywhere in the logs. Why did this happen, and why does it disqualify Global Tables for this use case?

Up Next · Lesson 40

Mini Project

Build a complete cloud-native data platform from scratch — applying everything from schema design to Global Tables, monitoring to Terraform, in a single end-to-end project.