MS SQL Server
Databases & File Structure
When you created the DataplexaStore database in Lesson 5, SQL Server silently created two files on your disk — a data file and a log file. Every database you will ever work with in SQL Server has this same physical foundation. Understanding how these files are organised, how SQL Server reads and writes to them, how they grow, and how separating them across drives can transform performance is not just background knowledge — it is the practical skill that separates a developer who can get data in and out from one who understands why things go wrong and how to fix them. This lesson covers the physical file structure of SQL Server databases from the ground up, using DataplexaStore as the working example throughout.
Every Database is Two Files
A SQL Server database is always backed by at least two physical files. The data file (extension .mdf for the primary file, .ndf for secondary files) stores all the actual data — every table row, every index, every stored procedure definition. The transaction log file (extension .ldf) records every change made to the database before the change is applied, enabling crash recovery and point-in-time restore. These two files have completely different I/O patterns. The data file has mixed random I/O — reads jump to wherever the needed page lives, writes scatter updates across many locations. The transaction log is purely sequential — log records are always appended to the end of the log in strict order. This difference matters enormously for hardware: placing data and log files on the same physical disk means a random-I/O workload and a sequential-I/O workload compete for the same drive head, increasing latency for both. Separating them onto different drives — or using a high-speed NVMe SSD for the log — is one of the most impactful infrastructure decisions you can make for a busy production SQL Server.
```sql
-- View the physical files belonging to the DataplexaStore database
-- This shows location on disk, current size, and growth settings
USE DataplexaStore;

SELECT
    name AS logical_name,                       -- the name SQL Server uses internally
    type_desc AS file_type,                     -- ROWS (data) or LOG
    physical_name AS path_on_disk,              -- actual file path on the server
    size * 8 / 1024 AS current_size_mb,         -- current allocated size in MB
    FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024
        AS used_mb,                             -- space actually containing data
    CASE growth
        WHEN 0 THEN 'Fixed size — no auto-growth'
        ELSE CAST(growth AS VARCHAR) + ' (pages or %)'
    END AS auto_growth
FROM sys.database_files;
```

```
logical_name        file_type  path_on_disk                                        current_size_mb  used_mb  auto_growth
DataplexaStore      ROWS       C:\Program Files\MSSQL\DATA\DataplexaStore.mdf      8                1        1024 (pages or %)
DataplexaStore_log  LOG        C:\Program Files\MSSQL\DATA\DataplexaStore_log.ldf  8                1        10 (pages or %)
```
- `.mdf` is the primary data file — every database has exactly one, and it is where SQL Server stores the first page of every allocation and the boot page that describes the database
- `.ldf` is the transaction log — its growth setting of 10 here means 10% growth per event, which is a poor default on large databases; production logs should be pre-sized and use a fixed MB growth increment instead
- `FILEPROPERTY` with `'SpaceUsed'` returns how much of the allocated file is actually occupied by data — the difference between `current_size_mb` and `used_mb` is pre-allocated empty space ready for new rows without triggering a growth event
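The contrast between the two I/O patterns can be observed directly on a live instance. The sketch below uses the `sys.dm_io_virtual_file_stats` DMV; the exact figures depend entirely on the workload the server has seen since its last restart, so expect different numbers on your machine.

```sql
-- Cumulative read/write activity per file since the instance started.
-- A busy log file typically shows far more writes than reads, while the
-- data file shows a mix of both; the io_stall columns reveal which file
-- is waiting on the disk.
SELECT
    mf.name AS logical_name,
    mf.type_desc AS file_type,      -- ROWS or LOG
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms,           -- total ms spent waiting on reads
    vfs.io_stall_write_ms           -- total ms spent waiting on writes
FROM sys.dm_io_virtual_file_stats(DB_ID('DataplexaStore'), NULL) vfs
JOIN sys.master_files mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id;
```

If `io_stall_write_ms` on the log file is high relative to its write count, the log drive is the bottleneck, which is exactly the situation that moving the `.ldf` to a dedicated fast drive resolves.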
Data Files — Pages, Extents, and Filegroups
Inside every data file, SQL Server organises space in a strict hierarchy. The smallest unit is the page — 8 KB of data, metadata, and a row offset array. Every table row, every index entry, every allocation map lives on a page. Eight consecutive pages make an extent — 64 KB. Extents are either uniform (all eight pages belong to one object — used once a table grows past eight pages) or mixed (pages from different small objects share one extent — used for objects with fewer than eight pages to avoid wasting space). Tables and indexes are grouped into filegroups, which are named collections of one or more data files. Every database starts with a filegroup called PRIMARY. You can create additional filegroups pointing to files on different drives and place specific tables on specific filegroups, which gives fine-grained control over which data lives on which physical disk. This is a powerful technique for large databases: put a hot, frequently queried table on a fast NVMe drive in its own filegroup while archival data sits on cheaper spinning disks in a separate filegroup.
```sql
-- Inspect filegroups and the files assigned to each one in DataplexaStore
SELECT
    fg.name AS filegroup_name,
    fg.type_desc AS filegroup_type,   -- ROWS_FILEGROUP or MEMORY_OPTIMIZED_DATA
    fg.is_default AS is_default_fg,   -- new objects land here unless specified
    df.name AS file_name,
    df.physical_name AS path_on_disk,
    df.size * 8 / 1024 AS size_mb
FROM sys.filegroups fg
JOIN sys.database_files df
    ON fg.data_space_id = df.data_space_id
ORDER BY fg.name;
```

```
filegroup_name  filegroup_type  is_default_fg  file_name       path_on_disk                    size_mb
PRIMARY         ROWS_FILEGROUP  1              DataplexaStore  C:\...\DATA\DataplexaStore.mdf  8
```
- `is_default_fg = 1` means PRIMARY is the default filegroup — any `CREATE TABLE` or `CREATE INDEX` that does not specify `ON filegroup_name` will place its data here
- Adding a second filegroup for a large archive table is as simple as `ALTER DATABASE DataplexaStore ADD FILEGROUP ArchiveFG` followed by `ALTER DATABASE DataplexaStore ADD FILE (...) TO FILEGROUP ArchiveFG`
- In a healthy production database you will typically see at least two filegroups — PRIMARY for system objects and a separate one for user data, making it straightforward to move data to different storage tiers later
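Spelled out in full, the filegroup workflow is three statements: create the filegroup, attach a file to it, and place a table on it. The file path, sizes, and the `OrdersArchive` table below are illustrative assumptions, not objects that exist in DataplexaStore; adapt them to your environment.

```sql
-- 1. Create the new filegroup (a named container, no disk space yet)
ALTER DATABASE DataplexaStore ADD FILEGROUP ArchiveFG;

-- 2. Attach a physical file to it (path and sizes are example values,
--    ideally on a different, cheaper drive than PRIMARY)
ALTER DATABASE DataplexaStore ADD FILE (
    NAME = 'DataplexaStore_Archive',
    FILENAME = 'E:\SQLArchive\DataplexaStore_Archive.ndf',
    SIZE = 512MB,
    FILEGROWTH = 256MB
) TO FILEGROUP [ArchiveFG];

-- 3. Create a table directly on the new filegroup (hypothetical table)
CREATE TABLE OrdersArchive (
    order_id    INT NOT NULL PRIMARY KEY,
    archived_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
) ON [ArchiveFG];
```

From this point on, every row and index page of `OrdersArchive` is allocated from the `.ndf` file on the archive drive, while the hot tables stay in PRIMARY.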
How Rows Fit on Pages
Knowing how rows map to pages explains several behaviours that would otherwise seem mysterious. Each 8 KB page holds a 96-byte page header, a row offset array at the end (2 bytes per row), and the actual row data in between. SQL Server never splits a single row across two pages — if a row is too wide to fit in the remaining space on a page, it moves to the next page, leaving the unused space behind. This leftover space is called internal fragmentation, and it is why a table with very wide or variable-length rows uses disk and memory less efficiently than a table with compact, fixed-width rows. When a new row must be inserted onto a full page and the table has an index, SQL Server splits the page — a page split — moving roughly half the rows to a new page to make room. Page splits are an I/O-expensive operation and are the main reason why choosing a good clustered index key matters so much. The queries below reveal exactly how many pages the DataplexaStore tables currently occupy and whether fragmentation is building up.
```sql
-- Check page usage and fragmentation for all tables in DataplexaStore
-- avg_fragmentation_in_percent above 30% is the threshold for an index rebuild
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    i.type_desc AS index_type,        -- CLUSTERED or NONCLUSTERED
    s.page_count,                     -- how many 8 KB pages this index occupies
    s.record_count,                   -- number of rows (records) in the index
    CAST(s.avg_fragmentation_in_percent AS DECIMAL(5,1))
        AS fragmentation_pct,         -- 0-30% = reorganise, >30% = rebuild
    CAST(s.avg_page_space_used_in_percent AS DECIMAL(5,1))
        AS page_fullness_pct          -- how full each page is on average
FROM sys.indexes i
CROSS APPLY sys.dm_db_index_physical_stats(
    DB_ID(), i.object_id, i.index_id, NULL, 'DETAILED') s
WHERE i.object_id > 100               -- skip system tables
  AND s.page_count > 0                -- skip empty indexes
ORDER BY s.avg_fragmentation_in_percent DESC;
```

```
table_name  index_name     index_type  page_count  record_count  fragmentation_pct  page_fullness_pct
OrderItems  PK_OrderItems  CLUSTERED   1           10            0.0                68.4
Orders      PK_Orders      CLUSTERED   1           5             0.0                72.1
Reviews     PK_Reviews     CLUSTERED   1           5             0.0                65.8
Products    PK_Products    CLUSTERED   1           10            0.0                74.2
Users       PK_Users       CLUSTERED   1           10            0.0                71.5
```
- All tables currently fit on a single page each — fragmentation is 0% because there has been no page splitting yet; this changes as data grows through the course
- page_fullness_pct around 70% is healthy — 100% means every page is full, leaving no room for inserts without splitting; very low values mean wasted space and poor Buffer Pool efficiency
- The threshold rule is practical: fragmentation below 10% — do nothing; 10–30% — `ALTER INDEX ... REORGANIZE` (online, low-impact); above 30% — `ALTER INDEX ... REBUILD` (rebuilds from scratch, brief lock on Standard Edition)
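The thresholds translate into maintenance statements like these, shown here against `PK_OrderItems` as the example index. The `FILLFACTOR = 90` value is an illustrative starting point, not a universal rule: it leaves 10% free space on each leaf page so that new inserts can land without an immediate page split, and the right percentage depends on how random your insert pattern is.

```sql
-- Fragmentation 10-30%: reorganise in place, online and low-impact
ALTER INDEX PK_OrderItems ON OrderItems REORGANIZE;

-- Fragmentation above 30%: rebuild from scratch,
-- leaving 10% free space per leaf page to absorb future inserts
ALTER INDEX PK_OrderItems ON OrderItems
REBUILD WITH (FILLFACTOR = 90);

-- REORGANIZE does not update statistics (REBUILD does), so refresh
-- them manually afterwards
UPDATE STATISTICS OrderItems PK_OrderItems;
```

A table whose clustered key is ever-increasing (like an `IDENTITY` column) rarely needs a low fill factor, because new rows always land on the last page; a table keyed on something random, such as a GUID, benefits most from the extra headroom.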
The Transaction Log File in Depth
While the data file stores the current state of your data, the transaction log stores the history of every change. It is a circular structure divided into segments called Virtual Log Files (VLFs). SQL Server writes new log records into the active VLF, moving forward through the circle. A VLF can only be reused — and its space reclaimed — once all the transactions it contains have been committed and a log backup has been taken (in Full recovery model) or a checkpoint has occurred (in Simple recovery model). This circular reuse mechanism means a healthy log stays small. A log that grows unexpectedly is almost always caused by one of two things: either log backups are not running (Full recovery model), or there is a long-running open transaction holding the log open because its records cannot be marked for reuse until the transaction completes or rolls back. The query below shows the reason the log cannot shrink right now.
```sql
-- Diagnose why the transaction log cannot be truncated (reused)
-- log_reuse_wait_desc tells you exactly what is blocking log truncation
SELECT
    name AS database_name,
    recovery_model_desc,        -- SIMPLE, FULL, or BULK_LOGGED
    log_reuse_wait_desc,        -- what is preventing log space from being reclaimed
    log_size_mb = (
        SELECT SUM(size) * 8.0 / 1024
        FROM sys.master_files
        WHERE database_id = d.database_id AND type = 1),
    log_used_pct = (
        SELECT used_log_space_in_percent
        FROM sys.dm_db_log_space_usage)   -- reports on the current database context
FROM sys.databases d
WHERE name = 'DataplexaStore';
```

```
database_name   recovery_model_desc  log_reuse_wait_desc  log_size_mb  log_used_pct
DataplexaStore  SIMPLE               NOTHING              8.00         12.50
```
- `log_reuse_wait_desc = NOTHING` is the healthy state — it means SQL Server can freely reuse any inactive VLF and the log will not grow unexpectedly
- `LOG_BACKUP` in this column means the database is in Full recovery model and no log backup has run recently — log space cannot be reclaimed until a backup clears it
- `ACTIVE_TRANSACTION` means a transaction started but has not committed or rolled back — if left open for hours it will cause the log to grow until it fills the drive, which will halt all writes to the database
- SIMPLE recovery model (appropriate for development and test) automatically reclaims log space at each checkpoint; FULL recovery model (required for production point-in-time restore) requires regular log backups to do the same
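On SQL Server 2016 SP2 and later you can also count the VLFs directly with `sys.dm_db_log_info`. A log that grew in many small 10% increments ends up with hundreds of tiny VLFs, which slows crash recovery and log backups; a well pre-sized log has a few dozen. The `BACKUP LOG` path below is an example value, not a path that exists in the lesson environment.

```sql
-- Count VLFs and see how many are currently active (not yet reusable)
SELECT
    COUNT(*) AS total_vlfs,
    SUM(CASE WHEN vlf_active = 1 THEN 1 ELSE 0 END) AS active_vlfs,
    CAST(AVG(vlf_size_mb) AS DECIMAL(10,2)) AS avg_vlf_size_mb
FROM sys.dm_db_log_info(DB_ID('DataplexaStore'));

-- If log_reuse_wait_desc shows LOG_BACKUP, taking a log backup
-- clears it (the backup path here is an example value)
BACKUP LOG DataplexaStore
    TO DISK = 'D:\Backups\DataplexaStore_log.trn';
```

If `active_vlfs` stays high even after a log backup, look for the long-running open transaction described above rather than taking more backups.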
Creating a Database with Proper File Settings
The CREATE DATABASE statement you used in Lesson 5 accepted all defaults — reasonable for learning, but in production every setting matters. The initial file size determines how much disk is pre-allocated before the first auto-growth event. The auto-growth increment determines how much disk is claimed each time the file fills up — too small and SQL Server grows the file dozens of times under load, each growth event briefly pausing all writes while the disk is extended; too large and you waste disk and delay I/O during the growth. The file location matters because the default DATA folder often sits on the same drive as Windows, which is both slow and dangerous. The statement below shows how to create a properly configured database from scratch — the same pattern used by DBAs in production environments.
```sql
-- Create a database with explicit, production-appropriate file settings
-- This is what a properly configured database creation looks like
CREATE DATABASE DataplexaStore_Prod
ON PRIMARY (
    NAME = 'DataplexaStore_Data',                -- logical name used internally by SQL Server
    FILENAME = 'C:\SQLData\DataplexaStore.mdf',  -- dedicated data drive (separate from log)
    SIZE = 256MB,          -- pre-allocate 256 MB to avoid early growth events
    MAXSIZE = UNLIMITED,   -- let it grow as needed
    FILEGROWTH = 128MB     -- grow in fixed 128 MB chunks (never use %)
)
LOG ON (
    NAME = 'DataplexaStore_Log',                 -- logical name for the log file
    FILENAME = 'D:\SQLLogs\DataplexaStore.ldf',  -- dedicated log drive — sequential I/O only
    SIZE = 64MB,           -- pre-size the log to cover normal activity
    MAXSIZE = 2GB,         -- cap prevents runaway log from filling the drive
    FILEGROWTH = 64MB      -- fixed growth on the log is critical
);

-- Verify the result
SELECT name, type_desc, physical_name, size * 8 / 1024 AS size_mb, growth
FROM sys.master_files
WHERE database_id = DB_ID('DataplexaStore_Prod');
```

```
name                 type_desc  physical_name                  size_mb  growth
DataplexaStore_Data  ROWS       C:\SQLData\DataplexaStore.mdf  256      16384
DataplexaStore_Log   LOG        D:\SQLLogs\DataplexaStore.ldf  64       8192
```
- growth = 16384 pages corresponds to 128 MB (16384 × 8 KB) — confirming the fixed-MB auto-growth setting was applied correctly
- Separating data and log onto different drives (`C:\SQLData` and `D:\SQLLogs`) eliminates the I/O contention between random data reads/writes and sequential log appends — one of the most impactful hardware configurations for a production SQL Server
- `MAXSIZE = 2GB` on the log acts as a safety cap — without it, a runaway open transaction or missing log backup job can grow the log until it fills the entire drive and halts all writes
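An existing database created with the defaults can be brought in line with these settings without recreating it. `MODIFY FILE` changes to size and growth take effect immediately. The logical file names below are the ones shown by the `sys.database_files` query earlier in this lesson; confirm them on your own system first, as they can differ.

```sql
-- Switch the log from 10% growth to a fixed 64 MB increment
-- and cap its maximum size
ALTER DATABASE DataplexaStore
MODIFY FILE (NAME = 'DataplexaStore_log', FILEGROWTH = 64MB, MAXSIZE = 2GB);

-- Pre-grow the data file so routine inserts never trigger
-- a growth event under load
ALTER DATABASE DataplexaStore
MODIFY FILE (NAME = 'DataplexaStore', SIZE = 256MB);
```

Note that `SIZE` in `MODIFY FILE` can only grow a file, never shrink it; shrinking goes through `DBCC SHRINKFILE`, with the caveats discussed in the next section.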
Shrinking, Moving, and Monitoring Files
Once you understand how files grow, you need to know how to manage them when they grow unexpectedly or need to move. Shrinking a data file (DBCC SHRINKFILE) reclaims unused space but causes severe fragmentation because SQL Server has to move pages from the end of the file to fill gaps — immediately after shrinking, every index in the database is fragmented and performance degrades noticeably. The correct approach is to investigate why the file grew (missing log backup, unexpectedly large table, missing partitioning strategy) and fix the root cause rather than repeatedly shrinking. Moving a file to a different drive requires detaching the database, physically moving the files, and reattaching — or using the ALTER DATABASE ... MODIFY FILE approach, followed by taking the database offline while the files are copied (a full service restart is only needed for system databases), which is the safer method for production. The query below is the first thing to check when a file is growing faster than expected.
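For a user database, the MODIFY FILE move is a three-step process: repoint the catalog entry, take the database offline so the file handle is released, copy the file with the operating system, and bring it back online. This is a sketch; the target path is an example value, and step 3's file copy happens outside SQL Server entirely.

```sql
-- 1. Repoint the catalog at the new location; nothing moves on disk yet,
--    and the database keeps running on the old file until it goes offline
ALTER DATABASE DataplexaStore
MODIFY FILE (NAME = 'DataplexaStore_log',
             FILENAME = 'D:\SQLLogs\DataplexaStore_log.ldf');

-- 2. Take the database offline, rolling back any open transactions
ALTER DATABASE DataplexaStore SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- 3. Copy the .ldf to D:\SQLLogs with the OS (File Explorer, robocopy),
--    then bring the database back online against the new path
ALTER DATABASE DataplexaStore SET ONLINE;
```

If the file is missing or unreadable at the new path, SET ONLINE fails and the database stays offline, so verify the copy before deleting the original file.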
```sql
-- Find the largest tables in DataplexaStore — the usual suspects for unexpected growth
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    CAST(SUM(a.total_pages) * 8.0 / 1024 AS DECIMAL(10,2))
        AS total_space_mb,        -- allocated space including indexes
    CAST(SUM(a.used_pages) * 8.0 / 1024 AS DECIMAL(10,2))
        AS used_space_mb,         -- space actually containing rows
    CAST(SUM(a.data_pages) * 8.0 / 1024 AS DECIMAL(10,2))
        AS data_space_mb,         -- space used by the data itself (no indexes)
    SUM(CASE WHEN i.index_id IN (0, 1) THEN p.rows ELSE 0 END)
        AS row_count              -- rows from the heap/clustered index only,
                                  -- so nonclustered indexes are not double-counted
FROM sys.indexes i
JOIN sys.partitions p
    ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1  -- only user tables, no system tables
GROUP BY i.object_id
ORDER BY total_space_mb DESC;
```

```
table_name  total_space_mb  used_space_mb  data_space_mb  row_count
OrderItems  0.06            0.06           0.06           10
Orders      0.06            0.06           0.06           5
Reviews     0.06            0.06           0.06           5
Products    0.06            0.06           0.06           10
Users       0.06            0.06           0.06           10
```
- At the current sample dataset size all tables are tiny — this query becomes genuinely useful once the dataset grows to thousands of rows and you need to identify which table is consuming the most space
- total_space_mb minus data_space_mb gives index overhead — if index space is dramatically larger than data space, you may have more indexes than the query workload justifies
- Running this query weekly on production and comparing to the previous week immediately shows which tables are growing fastest, enabling proactive capacity planning before a disk fills up
Summary Table
| Concept | Detail | Key Point |
|---|---|---|
| .mdf file | Primary data file | Stores all table rows and indexes — random I/O pattern |
| .ldf file | Transaction log file | Sequential writes only — separate drive eliminates I/O contention |
| Page | 8 KB unit of storage | Smallest unit SQL Server reads and writes — logical_reads counts these |
| Extent | 8 pages = 64 KB | Uniform (one object) or mixed (small objects share) |
| Filegroup | Named collection of data files | Place hot tables on fast drives by assigning them to a specific filegroup |
| VLF | Virtual Log File — segment of the .ldf | Reused circularly — blocked by open transactions or missing log backups |
| log_reuse_wait_desc | DMV column in sys.databases | NOTHING = healthy; LOG_BACKUP or ACTIVE_TRANSACTION = investigate |
Practice Questions
Practice 1. Why is it recommended to place the data file and transaction log file on separate physical drives?
Practice 2. What is a page split and what causes it?
Practice 3. What does log_reuse_wait_desc = LOG_BACKUP mean and how do you fix it?
Practice 4. Why should you avoid using a percentage-based auto-growth setting on the transaction log file?
Practice 5. What is a filegroup and give one real-world scenario where creating a second filegroup is beneficial?
Quiz
Quiz 1. What is the file extension of a SQL Server primary data file?
Quiz 2. How many bytes make up a single SQL Server page?
Quiz 3. What is the default filegroup name in every new SQL Server database?
Quiz 4. What fragmentation percentage threshold triggers a full index REBUILD rather than a REORGANIZE?
Quiz 5. What structure does SQL Server use to divide the transaction log file internally?
Next up - Tables, Rows & Columns - Everything about how SQL Server defines structure, enforces constraints, and keeps every row consistent.