MS SQL Lesson 7 – Databases & File Structure | Dataplexa

Databases & File Structure

When you created the DataplexaStore database in Lesson 5, SQL Server silently created two files on your disk — a data file and a log file. Every database you will ever work with in SQL Server has this same physical foundation. Understanding how these files are organised, how SQL Server reads and writes to them, how they grow, and how separating them across drives can transform performance is not just background knowledge — it is the practical skill that separates a developer who can get data in and out from one who understands why things go wrong and how to fix them. This lesson covers the physical file structure of SQL Server databases from the ground up, using DataplexaStore as the working example throughout.

Every Database is Two Files

A SQL Server database is always backed by at least two physical files. The data file (extension .mdf for the primary file, .ndf for secondary files) stores all the actual data — every table row, every index, every stored procedure definition. The transaction log file (extension .ldf) records every change made to the database before the change is applied, enabling crash recovery and point-in-time restore. These two files have completely different I/O patterns. The data file has mixed random I/O — reads jump to wherever the needed page lives, writes scatter updates across many locations. The transaction log is purely sequential — log records are always appended to the end of the log in strict order. This difference matters enormously for hardware: placing data and log files on the same physical disk means a random-I/O workload and a sequential-I/O workload compete for the same drive head, increasing latency for both. Separating them onto different drives — or using a high-speed NVMe SSD for the log — is one of the most impactful infrastructure decisions you can make for a busy production SQL Server.

-- View the physical files belonging to the DataplexaStore database
-- This shows location on disk, current size, and growth settings

USE DataplexaStore;

SELECT
    name                            AS logical_name,     -- the name SQL Server uses internally
    type_desc                       AS file_type,        -- ROWS (data) or LOG
    physical_name                   AS path_on_disk,     -- actual file path on the server
    size * 8 / 1024                AS current_size_mb,  -- current allocated size in MB
    FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024
                                    AS used_mb,          -- space actually containing data
    CASE
        WHEN growth = 0 THEN 'Fixed size — no auto-growth'
        WHEN is_percent_growth = 1
             THEN CAST(growth AS VARCHAR(10)) + '%'       -- growth holds a percentage
        ELSE CAST(growth * 8 / 1024 AS VARCHAR(10)) + ' MB' -- growth holds a page count
    END                             AS auto_growth
FROM sys.database_files;
logical_name         file_type path_on_disk                                       current_size_mb used_mb auto_growth
──────────────────── ───────── ────────────────────────────────────────────────── ─────────────── ─────── ───────────
DataplexaStore       ROWS      C:\Program Files\MSSQL\DATA\DataplexaStore.mdf     8               1       8 MB
DataplexaStore_log   LOG       C:\Program Files\MSSQL\DATA\DataplexaStore_log.ldf 8               1       10%
  • .mdf is the primary data file — every database has exactly one, and it is where SQL Server stores the first page of every allocation and the boot page that describes the database
  • .ldf is the transaction log — its growth setting of 10 here means 10% growth per event, which is a poor default on large databases; production logs should be pre-sized and use a fixed MB growth increment instead
  • FILEPROPERTY with SpaceUsed returns how much of the allocated file is actually occupied by data — the difference between current_size_mb and used_mb is pre-allocated empty space ready for new rows without triggering a growth event
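
The percentage-based log growth flagged above can be corrected in place. The statement below is a sketch of that fix, using the logical file name from the query output; the 64 MB values are illustrative — size the log to cover your actual workload between backups.

```sql
-- Pre-size the log once and switch it to a fixed-MB growth increment
-- (logical file name taken from the sys.database_files output above)
ALTER DATABASE DataplexaStore
MODIFY FILE (
    NAME       = DataplexaStore_log,
    SIZE       = 64MB,        -- pre-size so normal activity never triggers growth
    FILEGROWTH = 64MB         -- fixed increment replaces the percentage default
);
```

Re-running the sys.database_files query afterwards should show the new size and a fixed-MB auto_growth value for the log file.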

Data Files — Pages, Extents, and Filegroups

Inside every data file, SQL Server organises space in a strict hierarchy. The smallest unit is the page — 8 KB of data, metadata, and a row offset array. Every table row, every index entry, every allocation map lives on a page. Eight consecutive pages make an extent — 64 KB. Extents are either uniform (all eight pages belong to one object — used once a table grows past eight pages) or mixed (pages from different small objects share one extent — used for objects with fewer than eight pages to avoid wasting space). Tables and indexes are grouped into filegroups, which are named collections of one or more data files. Every database starts with a filegroup called PRIMARY. You can create additional filegroups pointing to files on different drives and place specific tables on specific filegroups, which gives fine-grained control over which data lives on which physical disk. This is a powerful technique for large databases: put a hot, frequently queried table on a fast NVMe drive in its own filegroup while archival data sits on cheaper spinning disks in a separate filegroup.

-- Inspect filegroups and the files assigned to each one in DataplexaStore

SELECT
    fg.name                         AS filegroup_name,
    fg.type_desc                    AS filegroup_type,   -- ROWS_FILEGROUP or MEMORY_OPTIMIZED_DATA
    fg.is_default                   AS is_default_fg,    -- new objects land here unless specified
    df.name                         AS file_name,
    df.physical_name                AS path_on_disk,
    df.size * 8 / 1024             AS size_mb
FROM sys.filegroups fg
JOIN sys.database_files df
    ON fg.data_space_id = df.data_space_id
ORDER BY fg.name;
filegroup_name filegroup_type is_default_fg file_name path_on_disk size_mb
────────────── ──────────────── ───────────── ────────────── ───────────────────────────────────────── ───────
PRIMARY ROWS_FILEGROUP 1 DataplexaStore C:\...\DATA\DataplexaStore.mdf 8
  • is_default_fg = 1 means PRIMARY is the default filegroup — any CREATE TABLE or CREATE INDEX that does not specify ON filegroup_name will place its data here
  • Adding a second filegroup for a large archive table is as simple as ALTER DATABASE DataplexaStore ADD FILEGROUP ArchiveFG followed by ALTER DATABASE DataplexaStore ADD FILE (...) TO FILEGROUP ArchiveFG
  • In a healthy production database you will typically see at least two filegroups — PRIMARY for system objects and a separate one for user data, making it straightforward to move data to different storage tiers later
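
The archive-filegroup pattern mentioned in the bullets looks like this written out in full. The drive path E:\SQLArchive and the table name OrderItemsArchive are illustrative assumptions, not part of the DataplexaStore schema:

```sql
-- 1. Declare the new filegroup, then give it a physical file on the slower drive
ALTER DATABASE DataplexaStore ADD FILEGROUP ArchiveFG;

ALTER DATABASE DataplexaStore
ADD FILE (
    NAME       = 'DataplexaStore_Archive',
    FILENAME   = 'E:\SQLArchive\DataplexaStore_Archive.ndf',  -- hypothetical archive drive
    SIZE       = 512MB,
    FILEGROWTH = 256MB
) TO FILEGROUP ArchiveFG;

-- 2. Any table created ON that filegroup stores its pages in the new file
CREATE TABLE OrderItemsArchive (
    order_item_id INT       NOT NULL PRIMARY KEY,
    order_id      INT       NOT NULL,
    archived_at   DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
) ON ArchiveFG;
```

From this point on, every page allocated for OrderItemsArchive lands in the .ndf file on the archive drive, while the rest of the database stays on PRIMARY.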

How Rows Fit on Pages

Knowing how rows map to pages explains several behaviours that would otherwise seem mysterious. Each 8 KB page holds a 96-byte page header, a row offset array at the end (2 bytes per row), and the actual row data in between. SQL Server never splits a single row across two pages — if a row is too wide to fit in the remaining space on a page, it moves to the next page, leaving the unused space behind. This leftover space is called internal fragmentation, and it is why a table with very wide or variable-length rows uses disk and memory less efficiently than a table with compact, fixed-width rows. When a new row must be inserted onto a full page and the table has an index, SQL Server splits the page — a page split — moving roughly half the rows to a new page to make room. Page splits are an I/O-expensive operation and are the main reason why choosing a good clustered index key matters so much. The queries below reveal exactly how many pages the DataplexaStore tables currently occupy and whether fragmentation is building up.
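
You can watch this row-to-page mapping directly. The %%physloc%% marker and sys.fn_PhysLocFormatter are undocumented but long-standing diagnostic features that print each row's physical file:page:slot address — treat the exact values as illustrative, and the user_id column as an assumption about the Users table:

```sql
-- Show the physical address (file:page:slot) of every row in Users
-- sys.fn_PhysLocFormatter is undocumented — use it for inspection only, never in application code
SELECT
    sys.fn_PhysLocFormatter(%%physloc%%) AS file_page_slot,
    user_id
FROM Users;
```

Rows that share the same page number in the output are physically stored together in the same 8 KB page — on the small sample dataset you will typically see every row of a table on one page.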

-- Check page usage and fragmentation for all tables in DataplexaStore
-- avg_fragmentation_in_percent above 30% is the threshold for an index rebuild

SELECT
    OBJECT_NAME(i.object_id)        AS table_name,
    i.name                          AS index_name,
    i.type_desc                     AS index_type,      -- CLUSTERED or NONCLUSTERED
    s.page_count,                   -- how many 8 KB pages this index occupies
    s.record_count,                 -- number of rows (records) in the index
    CAST(s.avg_fragmentation_in_percent AS DECIMAL(5,1))
                                    AS fragmentation_pct, -- 0-30% = reorganise, >30% = rebuild
    CAST(s.avg_page_space_used_in_percent AS DECIMAL(5,1))
                                    AS page_fullness_pct  -- how full each page is on average
FROM sys.indexes i
CROSS APPLY sys.dm_db_index_physical_stats(
    DB_ID(), i.object_id, i.index_id, NULL, 'DETAILED') s
WHERE i.object_id > 100             -- skip system tables
  AND s.page_count > 0              -- skip empty indexes
ORDER BY s.avg_fragmentation_in_percent DESC;
table_name index_name index_type page_count record_count fragmentation_pct page_fullness_pct
─────────── ────────────────────────── ──────────── ────────── ──────────── ───────────────── ─────────────────
OrderItems PK_OrderItems CLUSTERED 1 10 0.0 68.4
Orders PK_Orders CLUSTERED 1 5 0.0 72.1
Reviews PK_Reviews CLUSTERED 1 5 0.0 65.8
Products PK_Products CLUSTERED 1 10 0.0 74.2
Users PK_Users CLUSTERED 1 10 0.0 71.5
  • All tables currently fit on a single page each — fragmentation is 0% because there has been no page splitting yet; this changes as data grows through the course
  • page_fullness_pct around 70% is healthy — 100% means every page is full, leaving no room for inserts without splitting; very low values mean wasted space and poor Buffer Pool efficiency
  • The threshold rule is practical: fragmentation below 10% — do nothing; 10–30% — ALTER INDEX ... REORGANIZE (online, low-impact); above 30% — ALTER INDEX ... REBUILD (rebuilds from scratch, brief lock on Standard Edition)
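
Applied to the thresholds above, the two maintenance statements look like this (the index and table names are taken from the query output):

```sql
-- 10–30% fragmentation: reorganise in place — always online, can be interrupted safely
ALTER INDEX PK_Orders ON Orders REORGANIZE;

-- Above 30%: rebuild the index from scratch
-- On Enterprise Edition, add WITH (ONLINE = ON) to avoid the brief lock
-- that an offline rebuild takes on Standard Edition
ALTER INDEX PK_Orders ON Orders REBUILD;
```

REORGANIZE only shuffles existing pages into logical order, so it never needs more log or disk space than the pages it touches; REBUILD creates a fresh copy of the index, which is why it removes fragmentation completely but costs more I/O and log.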

The Transaction Log File in Depth

While the data file stores the current state of your data, the transaction log stores the history of every change. It is a circular structure divided into segments called Virtual Log Files (VLFs). SQL Server writes new log records into the active VLF, moving forward through the circle. A VLF can only be reused — and its space reclaimed — once all the transactions it contains have been committed and a log backup has been taken (in Full recovery model) or a checkpoint has occurred (in Simple recovery model). This circular reuse mechanism means a healthy log stays small. A log that grows unexpectedly is almost always caused by one of two things: either log backups are not running (Full recovery model), or there is a long-running open transaction holding the log open because its records cannot be marked for reuse until the transaction completes or rolls back. The query below shows the reason the log cannot shrink right now.

-- Diagnose why the transaction log cannot be truncated (reused)
-- log_reuse_wait_desc tells you exactly what is blocking log truncation

SELECT
    name                            AS database_name,
    recovery_model_desc,            -- SIMPLE, FULL, or BULK_LOGGED
    log_reuse_wait_desc,            -- what is preventing log space from being reclaimed
    log_size_mb   = (
        SELECT SUM(size) * 8.0 / 1024
        FROM sys.master_files
        WHERE database_id = d.database_id AND type = 1),
    log_used_pct  = (
        SELECT log_used_space_kb * 100.0 / log_size_kb
        FROM sys.dm_db_log_space_usage)
FROM sys.databases d
WHERE name = 'DataplexaStore';
database_name recovery_model_desc log_reuse_wait_desc log_size_mb log_used_pct
─────────────── ─────────────────── ─────────────────── ─────────── ────────────
DataplexaStore SIMPLE NOTHING 8.00 12.50
  • log_reuse_wait_desc = NOTHING is the healthy state — it means SQL Server can freely reuse any inactive VLF and the log will not grow unexpectedly
  • LOG_BACKUP in this column means the database is in Full recovery model and no log backup has run recently — log space cannot be reclaimed until a backup clears it
  • ACTIVE_TRANSACTION means a transaction started but has not committed or rolled back — if left open for hours it will cause the log to grow until it fills the drive, which will halt all writes to the database
  • SIMPLE recovery model (appropriate for development and test) automatically reclaims log space at each checkpoint; FULL recovery model (required for production point-in-time restore) requires regular log backups to do the same
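
The VLF segmentation itself is visible through sys.dm_db_log_info, available from SQL Server 2016 SP2 onward. A log that grew in many tiny increments accumulates thousands of VLFs, which slows crash recovery and restores; a few dozen is typical for a healthy, pre-sized log:

```sql
-- Count the Virtual Log Files inside the DataplexaStore log
-- A very high vlf_count (thousands) indicates the log grew in many small increments
SELECT
    COUNT(*)                                          AS vlf_count,
    SUM(CASE WHEN vlf_active = 1 THEN 1 ELSE 0 END)   AS active_vlfs
FROM sys.dm_db_log_info(DB_ID('DataplexaStore'));
```

If vlf_count is excessive, the usual remedy is to shrink the log once and then re-grow it in a few large fixed-MB steps, which rebuilds it from a small number of evenly sized VLFs.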

Creating a Database with Proper File Settings

The CREATE DATABASE statement you used in Lesson 5 accepted all defaults — reasonable for learning, but in production every setting matters. The initial file size determines how much disk is pre-allocated before the first auto-growth event. The auto-growth increment determines how much disk is claimed each time the file fills up — too small and SQL Server grows the file dozens of times under load, each growth event briefly pausing all writes while the disk is extended; too large and you waste disk and delay I/O during the growth. The file location matters because the default DATA folder often sits on the same drive as Windows, which is both slow and dangerous. The query below shows how to create a properly configured database from scratch — the same pattern used by DBAs in production environments.

-- Create a database with explicit, production-appropriate file settings
-- This is what a properly configured database creation looks like

CREATE DATABASE DataplexaStore_Prod
ON PRIMARY (
    NAME        = 'DataplexaStore_Data',         -- logical name used internally by SQL Server
    FILENAME    = 'C:\SQLData\DataplexaStore.mdf', -- dedicated data drive (separate from log)
    SIZE        = 256MB,                           -- pre-allocate 256 MB to avoid early growth events
    MAXSIZE     = UNLIMITED,                       -- let it grow as needed
    FILEGROWTH  = 128MB                            -- grow in fixed 128 MB chunks (never use %)
)
LOG ON (
    NAME        = 'DataplexaStore_Log',            -- logical name for the log file
    FILENAME    = 'D:\SQLLogs\DataplexaStore.ldf', -- dedicated log drive — sequential I/O only
    SIZE        = 64MB,                            -- pre-size the log to cover normal activity
    MAXSIZE     = 2GB,                             -- cap prevents runaway log from filling the drive
    FILEGROWTH  = 64MB                             -- fixed growth on the log is critical
);

-- Verify the result
SELECT name, type_desc, physical_name, size * 8 / 1024 AS size_mb, growth
FROM sys.master_files
WHERE database_id = DB_ID('DataplexaStore_Prod');
name                 type_desc physical_name                  size_mb growth
──────────────────── ───────── ────────────────────────────── ─────── ──────
DataplexaStore_Data  ROWS      C:\SQLData\DataplexaStore.mdf  256     16384
DataplexaStore_Log   LOG       D:\SQLLogs\DataplexaStore.ldf  64      8192
  • growth is reported in 8 KB pages — 16384 pages × 8 KB = 128 MB for the data file and 8192 pages × 8 KB = 64 MB for the log, confirming the fixed-MB auto-growth settings were applied correctly
  • Separating data and log onto different drives (C:\SQLData and D:\SQLLogs) eliminates the I/O contention between random data reads/writes and sequential log appends — one of the most impactful hardware configurations for a production SQL Server
  • MAXSIZE = 2GB on the log acts as a safety cap — without it, a runaway open transaction or missing log backup job can grow the log until it fills the entire drive and halts all writes

Shrinking, Moving, and Monitoring Files

Once you understand how files grow, you need to know how to manage them when they grow unexpectedly or need to move. Shrinking a data file (DBCC SHRINKFILE) reclaims unused space but causes severe fragmentation, because SQL Server moves pages from the end of the file backwards to fill gaps — immediately after a shrink, every index in the database is fragmented and performance degrades noticeably. The correct approach is to investigate why the file grew (missing log backup, an unexpectedly large table, a missing partitioning strategy) and fix the root cause rather than shrinking repeatedly. Moving a file to a different drive can be done by detaching the database, physically moving the files, and reattaching — but the safer method for a user database is ALTER DATABASE ... MODIFY FILE to update the recorded path, followed by taking the database offline, moving the file, and bringing it back online. A full restart of the SQL Server service is only required when moving system databases such as master or tempdb. The query below is the first thing to check when a file is growing faster than expected.
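
A sketch of the ALTER DATABASE ... MODIFY FILE sequence for a user database follows, assuming a new log drive mounted at E:\SQLLogs — note that taking the database offline is sufficient here; no service restart is involved:

```sql
-- 1. Point the metadata at the new location (takes effect when the database next comes online)
ALTER DATABASE DataplexaStore
MODIFY FILE (NAME = DataplexaStore_log,
             FILENAME = 'E:\SQLLogs\DataplexaStore_log.ldf');

-- 2. Take the database offline so the file handle is released
--    ROLLBACK IMMEDIATE disconnects active sessions instead of waiting for them
ALTER DATABASE DataplexaStore SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- 3. Physically copy the .ldf file to E:\SQLLogs at the OS level, then bring the database back:
ALTER DATABASE DataplexaStore SET ONLINE;
```

If the file is missing or unreadable at the new path, the SET ONLINE step fails and the error message names the path SQL Server expected — which is why step 1 should always be verified against sys.master_files before going offline.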

-- Find the largest tables in DataplexaStore — the usual suspects for unexpected growth

SELECT
    OBJECT_NAME(i.object_id)        AS table_name,
    SUM(a.total_pages) * 8.0 / 1024 AS total_space_mb,   -- allocated space including indexes
    SUM(a.used_pages)  * 8.0 / 1024 AS used_space_mb,    -- space actually containing rows
    SUM(a.data_pages)  * 8.0 / 1024 AS data_space_mb,    -- space used by the data itself (no indexes)
    SUM(CASE WHEN i.index_id IN (0, 1)
             THEN p.rows ELSE 0 END) AS row_count        -- count rows once (heap or clustered index only)
FROM sys.indexes i
JOIN sys.partitions p
    ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1   -- only user tables, no system tables
GROUP BY i.object_id
ORDER BY total_space_mb DESC;
table_name total_space_mb used_space_mb data_space_mb row_count
─────────── ────────────── ───────────── ───────────── ─────────
OrderItems 0.06 0.06 0.06 10
Orders 0.06 0.06 0.06 5
Reviews 0.06 0.06 0.06 5
Products 0.06 0.06 0.06 10
Users 0.06 0.06 0.06 10
  • At the current sample dataset size all tables are tiny — this query becomes genuinely useful once the dataset grows to thousands of rows and you need to identify which table is consuming the most space
  • total_space_mb minus data_space_mb gives index overhead — if index space is dramatically larger than data space, you may have more indexes than the query workload justifies
  • Running this query weekly on production and comparing to the previous week immediately shows which tables are growing fastest, enabling proactive capacity planning before a disk fills up

Summary Table

Concept              Detail                                  Key Point
──────────────────── ─────────────────────────────────────── ─────────────────────────────────────────────────────────────────────
.mdf file            Primary data file                       Stores all table rows and indexes — random I/O pattern
.ldf file            Transaction log file                    Sequential writes only — separate drive eliminates I/O contention
Page                 8 KB unit of storage                    Smallest unit SQL Server reads and writes — logical_reads counts these
Extent               8 pages = 64 KB                         Uniform (one object) or mixed (small objects share)
Filegroup            Named collection of data files          Place hot tables on fast drives by assigning them to a filegroup
VLF                  Virtual Log File — segment of the .ldf  Reused circularly — blocked by open transactions or missing log backups
log_reuse_wait_desc  Column in sys.databases                 NOTHING = healthy; LOG_BACKUP or ACTIVE_TRANSACTION = investigate

Practice Questions

Practice 1. Why is it recommended to place the data file and transaction log file on separate physical drives?



Practice 2. What is a page split and what causes it?



Practice 3. What does log_reuse_wait_desc = LOG_BACKUP mean and how do you fix it?



Practice 4. Why should you avoid using a percentage-based auto-growth setting on the transaction log file?



Practice 5. What is a filegroup and give one real-world scenario where creating a second filegroup is beneficial?



Quiz

Quiz 1. What is the file extension of a SQL Server primary data file?






Quiz 2. How many bytes make up a single SQL Server page?






Quiz 3. What is the default filegroup name in every new SQL Server database?






Quiz 4. What fragmentation percentage threshold triggers a full index REBUILD rather than a REORGANIZE?






Quiz 5. What structure does SQL Server use to divide the transaction log file internally?






Next up - Tables, Rows & Columns - Everything about how SQL Server defines structure, enforces constraints, and keeps every row consistent.