DataBase Design Lesson 24 – Surrogate vs Natural Keys | Dataplexa
Normalization · Lesson 24

Surrogate vs Natural Keys

Master the critical decision between artificial database keys and real-world identifiers that affects every table you design.

The Two Paths

Every table needs a primary key. That's database law. But which kind? Natural keys use real-world data that already exists - like email addresses or ISBN numbers. Surrogate keys are artificial numbers the database creates just for identification.

Think about your own ID documents. Your Social Security number? That's a surrogate key - created purely for identification. Your fingerprint? Natural key - it existed before any database did.

Surrogate Keys

Artificial numbers created by the database. Auto-increment integers or UUIDs. Never change, never have meaning outside the database.

Natural Keys

Real-world identifiers with business meaning. Email addresses, ISBN numbers, license plates. Users understand them instantly.

BookVault Key Decision

BookVault faces this choice on every table. Look at the customers table. Should email be the primary key? Or create an artificial customer_id?

Natural Key Approach
-- Using email as primary key
CREATE TABLE customers (
    email VARCHAR(255) PRIMARY KEY,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    city VARCHAR(100),
    created_at TIMESTAMP
);
Surrogate Key Approach
-- Using auto-increment surrogate key
CREATE TABLE customers (
    customer_id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) UNIQUE,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    city VARCHAR(100),
    created_at TIMESTAMP
);

Notice the difference? Natural key makes email the primary identifier. Surrogate key creates an artificial number and makes email a unique constraint instead.

When Natural Keys Break

Natural keys seem logical. Email addresses are unique, right? Until they aren't. Sarah changes her email from sarah.johnson@gmail.com to sarah.smith@newcompany.com after marriage and job change.

With a natural key, that's a primary key update. Every foreign key reference breaks. The orders table, order_items table, customer reviews - they all reference the old email. You need cascading updates across the entire database.

The Cascade Problem

When sarah.johnson@gmail.com becomes sarah.smith@newcompany.com, you must update every foreign key reference in orders, reviews, addresses, and payment methods. One customer change triggers database-wide updates.

With surrogate keys? Sarah remains customer_id = 42 forever. Email updates are simple column changes with no cascading effects.

Real-World Evidence

Data Insight

Amazon processes 1.6 million package deliveries daily. Every package has a surrogate tracking number, not the customer's address as the primary key. Address changes don't break the shipping system.

Stripe handles millions of payment transactions. Each customer gets a surrogate ID like cus_9s6XKzkNRiz8i3. Email addresses change frequently, but payment history stays linked to the unchanging surrogate key.

Netflix assigns surrogate IDs to every movie and show. The title "The Office" exists in multiple countries with different content. Natural keys would create conflicts - surrogate keys keep everything separate.

The Performance Factor

Integer surrogate keys are faster. Much faster. A 4-byte integer compares quicker than a 50-character email string. Foreign key joins run faster. Index sizes shrink dramatically.

-- BookVault orders referencing customers
-- Surrogate key join (fast)
SELECT c.first_name, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = 42;

-- Natural key join (slower)  
SELECT c.first_name, o.total_amount
FROM customers c
JOIN orders o ON c.email = o.customer_email
WHERE c.email = 'sarah.johnson@gmail.com';
Sarah Johnson | $127.50
Sarah Johnson | $89.99
Sarah Johnson | $234.75

What just happened?

The integer comparison customer_id = 42 processes faster than string comparison email = 'sarah.johnson@gmail.com'. Try this: Run EXPLAIN on both queries to see the performance difference in your database.

When Natural Keys Win

Natural keys aren't always wrong. Sometimes they're perfect. ISBN numbers for books never change - that's their entire purpose. Country codes (US, UK, FR) stay stable for decades.

-- BookVault books with ISBN as natural key
CREATE TABLE books (
    isbn VARCHAR(13) PRIMARY KEY,
    title VARCHAR(255),
    author_id INT,
    category_id INT,
    price DECIMAL(10,2),
    stock_qty INT,
    published_year INT,
    FOREIGN KEY (author_id) REFERENCES authors(author_id),
    FOREIGN KEY (category_id) REFERENCES categories(category_id)
);
Table created successfully.

ISBN works as a natural key because it's designed to be permanent and globally unique. Publishers can't change ISBN numbers after publication - books would lose their identity in the supply chain.

Natural keys work when the data is designed for identification: ISBN, Social Security numbers, product SKUs, postal codes. If it was created to be a unique identifier, it probably works as a natural key.

The Hybrid Approach

Most successful databases use both. Surrogate keys for primary keys, natural keys for business logic. BookVault follows this pattern everywhere.

Primary Keys

customer_id, book_id, order_id, author_id - all surrogate integers for performance and stability

Business Keys

email, isbn, sku, order_number - unique constraints for business rules and user interface

-- BookVault hybrid approach
CREATE TABLE customers (
    customer_id INT AUTO_INCREMENT PRIMARY KEY,  -- Surrogate
    email VARCHAR(255) UNIQUE NOT NULL,          -- Natural (business)
    first_name VARCHAR(100),
    last_name VARCHAR(100)
);

CREATE TABLE orders (
    order_id INT AUTO_INCREMENT PRIMARY KEY,     -- Surrogate
    order_number VARCHAR(20) UNIQUE NOT NULL,    -- Natural (business)
    customer_id INT,                             -- References surrogate
    order_date DATE,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
Tables created successfully.

What just happened?

We created stable surrogate primary keys for database relationships, plus meaningful natural keys for business operations. Users see order_number "ORD-2024-001", but foreign keys reference the integer order_id. Try this: Query by order_number but join by order_id for the best of both approaches.

Making the Right Choice

The decision comes down to stability and control. Do you control the data completely? Can it change? Will users need to modify it?

1

Can the data change?

2

Is it designed for identification?

3

How often will you join on it?

4

Choose surrogate for changing data, natural for permanent identifiers

Honestly, most production databases lean heavily toward surrogate keys. They're safer, faster, and easier to maintain. The hybrid approach gives you the stability of surrogates with the meaning of natural keys.

Common Mistake: All Natural Keys

New developers often use email as primary key everywhere because "it's more meaningful." But when users change emails, or you need to support OAuth login with multiple email addresses, the database design crumbles. Start with surrogate keys and add natural unique constraints.

The key insight? Primary keys are for the database. Business identifiers are for humans. Design for both, but never confuse their purposes. Your database will thank you when it's processing millions of records and users are constantly updating their information.

Quiz

1. BookVault needs to store customer data. Customers can change their email addresses, and you expect millions of orders referencing customers. What's the best approach?


2. BookVault stores book information including ISBN numbers. ISBN numbers never change after publication and are globally unique. Should ISBN be the primary key?


3. You used email as the primary key in BookVault's customers table. Customer Sarah Johnson gets married, changes jobs, and updates her email from sarah.johnson@gmail.com to sarah.smith@newcompany.com. What happens?


Up Next

Handling NULLs in Design

Master the trickiest aspect of database design - when to allow missing data and when to enforce values that prevent data integrity disasters.