Topic 09

Storage Engines

When you save something in a database, where does it actually go? How does the database survive a power outage without losing your data? This topic covers what happens under the hood: the journal that makes writes safe, the memory cache that makes reads fast, and the two main ways databases store data on disk.

InnoDB (MySQL)PostgreSQL heap RocksDB (LSM)WiredTiger (MongoDB)MyISAM

At a Glance

Core Concepts

Pages & Buffer Pool

Think of a database like a book. The database does not read one word at a time — it reads a whole page at a time. Each page is typically 8KB or 16KB and contains many rows. The buffer pool is the database's working desk: recently used pages sit in RAM so they can be read instantly. When a page is modified but not yet saved to disk, it is called a dirty page. The database writes dirty pages to disk in the background.

Write-Ahead Log (WAL)

Imagine a chef who writes every step in a notebook before doing anything in the kitchen. If the kitchen catches fire mid-recipe, you can replay the notebook and pick up exactly where things went wrong. The WAL is that notebook. Every change is written to the journal before it touches the actual data. A transaction is considered committed the moment the journal entry is safely on disk — the data page itself can be updated later at a convenient time.

LSM Trees

Traditional databases update data in place on disk, which involves a lot of jumping around (random I/O). LSM trees take a different approach: every write lands in memory first (in a structure called the memtable), and once memory fills up, the batch is written to disk sequentially — like appending to a log file. Sequential writes are much faster than random ones. The tradeoff is that reads may need to check multiple sorted files, and background work called compaction periodically merges those files to keep reads manageable.

Checkpoints & Recovery

Think of a checkpoint like a save point in a video game. Periodically, the database takes all the dirty pages from memory and writes them to disk, then marks the current position in the journal. If the machine crashes later, recovery only needs to replay the journal from the last checkpoint — not from the very beginning. More frequent checkpoints mean faster recovery after a crash, but slightly more disk activity while the database is running.

Page Layout

How Data Pages Work

The page (also called a block) is the fundamental unit of I/O in a database engine — typically 4KB, 8KB, or 16KB. Everything the database reads from or writes to disk is in whole pages. Understanding page layout explains why certain operations are expensive and why fragmentation occurs.

1

Page structure

A database page is like a physical sheet of paper in a filing folder. At the top is a header with bookkeeping information — the page number, how much free space is left, and a sequence number (called the LSN) that the journal uses to track which changes have been applied. The middle holds the actual row data. A small directory at the bottom keeps track of where each row starts within the page. When the page fills up, a new page is allocated — like adding a new sheet to the folder.

2

Fill factor

If you pack a page completely full, the next INSERT has nowhere to go and the database must split the page in two — an expensive operation. Fill factor is the setting that controls how full a page is allowed to get before the database considers it "full." InnoDB leaves about 7% empty by default. PostgreSQL lets you configure this per table. For tables that receive lots of updates, leaving extra free space (setting fillfactor to 70-80%) means updates can fit on the same page rather than spilling to a new one.

3

HOT updates (PostgreSQL)

In PostgreSQL, every UPDATE creates a new version of the row rather than changing it in place. Normally this also requires updating every index that points to that row — which is expensive if you have many indexes. A HOT update (Heap Only Tuple) is a special shortcut: if the new version fits on the same page and none of the indexed columns changed, PostgreSQL can skip updating the indexes entirely and just leave a pointer inside the page from the old row to the new one. This is much cheaper for tables with frequent small updates on non-indexed columns.

4

Page fragmentation

Imagine a notebook where you keep erasing entries. Over time, the notebook has lots of half-empty pages with gaps scattered throughout. Every time you read a page, you bring back a full 8KB from disk — but if half of it is empty space, you wasted that read. This is fragmentation. PostgreSQL's VACUUM FULL and InnoDB's OPTIMIZE TABLE physically rewrite the entire table from scratch, packing rows together tightly. The downside is that this operation locks the table and takes a while — so it is not something you do casually on a large production table.

Storage Model Comparison

B+Tree vs LSM Tree

PropertyB+Tree (InnoDB/PostgreSQL)LSM Tree (RocksDB/Cassandra)
Write pathRandom I/O to existing pages (in-place)Sequential append to memtable, then SSTable
Read pathO(log n) — single B+tree traversalCheck memtable + L0..Ln SSTables (bloom filter skips)
Write amplificationLow — write page onceHigh — data written multiple times during compaction
Read amplificationLow — one path from root to leafHigher — may need to check multiple levels
Space amplificationLow (with VACUUM/OPTIMIZE)Higher — multiple versions exist until compaction
Delete handlingMark dead, VACUUM laterTombstone record, compaction removes it
Best workloadMixed read/write, random lookups, range scansWrite-heavy: logs, events, IoT telemetry, append-only
Used byPostgreSQL, MySQL InnoDB, SQLite, OracleRocksDB, Cassandra, HBase, LevelDB, TiKV
LSM Internals

LSM Tree and Compaction Strategies

The LSM tree trades read performance for dramatically higher write throughput by converting random writes into sequential I/O. Understanding compaction is essential for tuning LSM-backed databases like Cassandra and RocksDB.

1

Memtable and immutable memtable

All incoming writes land in memory first, in a sorted structure called the memtable. Think of it as an in-progress draft document. Once it reaches a size limit (typically 64MB to 256MB), it is "sealed" — no more writes go to it. A fresh memtable takes over for new writes, while the sealed one (now immutable) is flushed to disk in the background. A WAL journal backs the in-memory data so nothing is lost if the machine crashes before the flush completes.

2

SSTables and bloom filters

When the memtable is flushed to disk, it becomes a sorted, immutable file called an SSTable (Sorted String Table). Over time many SSTables accumulate, and looking up a key might require checking several of them. To speed this up, each SSTable carries a bloom filter — a small lookup structure that can quickly answer "is this key definitely NOT in this file?" If the bloom filter says no, the whole file is skipped. This reduces a lookup that might scan dozens of files down to checking just one or two.

3

Leveled compaction (RocksDB default)

SSTables are organized into levels. Level 0 holds newly flushed files and may have overlapping key ranges. Compaction periodically merges L0 files into Level 1, then Level 1 into Level 2, and so on — each level being about 10x larger than the previous. Within Levels 1 and up, key ranges never overlap, which makes lookups fast (at most one file per level to check). The tradeoff is that data is physically rewritten multiple times as it moves through levels. This is called write amplification, and it is the main cost of the LSM approach.

4

Size-tiered compaction (Cassandra default)

Instead of organizing files by level, size-tiered compaction groups SSTables that are similar in size and merges them together. This avoids the high write amplification of leveled compaction — data is not rewritten as many times. The tradeoff is that multiple files may cover the same key range, so reads may need to check more files. This strategy works best for write-heavy workloads where you mostly append and rarely look up individual records. Cassandra also offers a Time-Window Compaction Strategy (TWCS) which is ideal for time-series data that naturally expires after a fixed period.

How It Works

InnoDB Write Path

1

Write to Redo Log (WAL)

Before any data page on disk is touched, the change is first recorded in the redo log (the journal). Once that journal entry is safely written to disk, the transaction is considered committed. This is the key insight: writing to a sequential journal file is fast; writing to a random location in a data file is slow. The journal entry is the durable record of what happened.

2

Modify Page in Buffer Pool

The relevant data page is loaded from disk into the buffer pool (if it is not already there). The change is applied to the in-memory copy of the page. The page is now "dirty" — its content in memory does not match what is on disk. It will be written to disk later, in the background.

3

Undo Log Entry

Alongside the redo log, InnoDB writes an undo record — a "before photo" of what the row looked like before the change. If the transaction is rolled back, InnoDB uses the undo record to put the row back to its original state. This also supports MVCC: other transactions can read the old version of the row from the undo log even while your transaction is mid-flight.

4

Background Flush (Checkpoint)

Dirty pages sit in memory until a background process writes them to disk. Periodically, the database performs a checkpoint: it flushes all dirty pages and records exactly where in the journal it has caught up to. This is the save-game moment. If the database crashes after a checkpoint, it only needs to replay the journal from that point forward — not from the beginning of time.

5

Crash Recovery: ARIES

After a crash, the database runs through three phases. First, the Analysis phase scans the journal from the last checkpoint to figure out which transactions were in-flight and which pages were dirty. Second, the Redo phase replays every change in the journal — both committed and uncommitted — to restore the database to the exact state it was in at the moment of the crash. Third, the Undo phase rolls back any transactions that had not yet committed, using the undo logs. This three-phase process is called ARIES recovery.

SQL — InnoDB Configuration
Buffer pool: most important InnoDB tuning parameter
-- Set to 70-80% of available RAM on dedicated DB servers
SET GLOBAL innodb_buffer_pool_size = 8G;

WAL sync mode: 1 = safest (fsync on commit)
0 = fastest (no fsync, risky), 2 = flush but no fsync
SET GLOBAL innodb_flush_log_at_trx_commit = 1PostgreSQL: WAL configuration
wal_level = logical enables logical replication
checkpoint_completion_target = 0.9 spreads checkpoint I/O

View dirty page stats (InnoDB)
SELECT * FROM information_schema.INNODB_BUFFER_POOL_STATS\G
Interactive Demo

Page/Block Storage Visualizer

INSERT fills free blocks, DELETE marks blocks dashed (fragmented). VACUUM compacts the table. Watch the fill factor gauge change.

8 KB Heap Page Page Header lsn | page_id | checksum | free_space_start | free_space_end Slot Array (grows down) [slot1: offset=180, len=42] [slot2: offset=140, len=38] [slot3: deleted] ... Free Space (slot array grows down, tuples grow up) Tuple Data (grows up) t_xmin | t_xmax | t_ctid | null_bitmap | attrs... WAL Buffer write-ahead log record XID | page | old+new images Buffer Pool dirty page cache evict via clock/LRU Disk (fsync) durable on COMMIT

InnoDB and PostgreSQL use fixed-size heap pages (8 or 16 KB). Each tuple includes transaction metadata (xmin/xmax) enabling MVCC visibility rules without a separate undo log in Postgres. Writes go to WAL first — data pages are flushed lazily, often in batches.

Storage Page Layout

Blue = used page, red dashed = deleted (dead tuple), dark = free. VACUUM reclaims dead tuples.

B+Tree (InnoDB/PostgreSQL)

  • Reads: O(log n) — excellent
  • Writes: random I/O to existing pages
  • Good for mixed read/write workloads
  • VACUUM needed for dead tuple cleanup
  • Row-level compression possible

LSM Tree (RocksDB/Cassandra)

  • Writes: sequential append — very fast
  • Reads: may check multiple levels
  • Bloom filters reduce false reads
  • Compaction happens asynchronously
  • Best for write-heavy time-series/log data
Anti-patterns

Setting innodb_flush_log_at_trx_commit = 0

This disables WAL flushing on commit — MySQL uses OS buffers only. Up to 1 second of committed data can be lost on crash. Never use this for financial or critical data. Mode 2 (flush but no fsync) is a safer compromise.

Not running VACUUM (PostgreSQL)

PostgreSQL MVCC never overwrites rows — dead tuples accumulate. Without VACUUM, tables bloat (4× is common), queries slow down, and eventually transaction ID wraparound (every ~2B transactions) causes catastrophic downtime. Autovacuum must be tuned and monitored.

Undersizing the buffer pool

If the buffer pool is too small, hot pages are constantly evicted and re-read from disk. 70–80% of RAM on a dedicated database server is the standard starting point. Monitor innodb_buffer_pool_reads (disk reads) vs innodb_buffer_pool_read_requests (cache hits).

Quiz
Question 1 of 3
Write-Ahead Logging (WAL) ensures durability by:
AMaking all writes faster by buffering them
BWriting log records to durable storage BEFORE applying changes to data pages
CWriting all changes directly to data pages synchronously
DEliminating the need for periodic checkpoints
Question 2 of 3
In InnoDB, the clustered index means:
ARow data is stored in a separate file from the index
BRow data is stored directly in the primary key B+tree leaf pages
CRows are locked at the page level, not row level
DRows are stored in a hash table sorted by primary key
Question 3 of 5
LSM trees are optimized for:
ARead-heavy random lookup workloads
BWrite-heavy workloads where sequential write throughput matters
CDirect random access without memory buffering
DComplex multi-table JOIN operations
Question 4 of 5
What is the purpose of a bloom filter in an LSM-tree storage engine?
ACompressing SSTable data to reduce disk usage
BSorting keys in an SSTable for binary search
CQuickly determining if a key is definitely not in an SSTable, avoiding unnecessary I/O
DBuffering writes before flushing to SSTable
Question 5 of 5
In PostgreSQL, a HOT (Heap Only Tuple) update avoids which expensive operation?
AWriting a WAL record for the update
BUpdating index entries — the index is not touched for HOT updates
CWriting the new row version to the heap
DMVCC versioning of the row
Interview Q&A
How does crash recovery work in InnoDB (ARIES algorithm)?
InnoDB uses the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) algorithm: (1) Analysis passscan the WAL from the last checkpoint to determine which pages were dirty (incomplete flushes) and which transactions were in-progress at the time of crash. (2) Redo passreplay ALL WAL records from the last checkpoint, both committed and uncommitted. This restores the database to the exact state at crash time. (3) Undo passusing undo logs, rollback any transactions that were in-progress (uncommitted) at crash time. ARIES's key innovation: "steal + no-force" — dirty pages CAN be flushed before commit (steal), and committed transactions DON'T need pages flushed before commit (no-force) — WAL provides durability.
What is VACUUM in PostgreSQL and why is it necessary?
PostgreSQL's MVCC never overwrites or deletes rows in place — instead, UPDATE inserts a new row version and marks the old xmax as the current transaction ID. DELETE just marks xmax. These "dead tuples" remain on disk until VACUUM reclaims them. VACUUM does: (1) Reclaim dead tuple storage for reuse. (2) Update the Free Space Map (FSM) so INSERTs can find free space. (3) Update the Visibility Map (VM) — marks pages with all-visible tuples, allowing index-only scans. (4) Prevent transaction ID wraparound (TXID wraparound vacuum freezes old rows). AUTOVACUUM triggers based on autovacuum_vacuum_scale_factor (default 20% dead tuples). On high-write tables, tune to trigger more aggressively.
What is the difference between InnoDB and MyISAM storage engines?
InnoDB: full ACID transactions, row-level locking, MVCC, foreign keys, crash recovery via WAL, clustered primary key index. MyISAM: no transactions, table-level locking only, no FK support, no crash recovery, full-text search (pre-InnoDB 5.6). MyISAM was MySQL's default before 5.5; InnoDB has been the default since. Reasons to still use MyISAM: slightly faster full table reads on read-only tables, simpler storage (no transaction overhead). For any production workload with writes or needing transactions, always use InnoDB. MyISAM is effectively deprecated for most use cases.
How does an LSM tree work, and what is compaction?
LSM (Log-Structured Merge) tree: writes go to an in-memory memtable (sorted skip list or red-black tree). When memtable fills, it's flushed to disk as an immutable SSTable (Sorted String Table). Over time, many SSTables accumulate. Compaction merges SSTables: sort-merge overlapping key ranges, discard outdated (overwritten/deleted) entries, produce a new sorted SSTable. Levels: L0 = recently flushed (unsorted across files), L1+ = sorted, compacted. Bloom filters on each SSTable let you skip files that definitely don't contain a key. Read amplification: may need to check multiple levels. Write amplification: data is written multiple times during compaction. Tuning: level-based vs size-tiered compaction strategies (different tradeoffs).
What is the buffer pool and how should you tune it?
The buffer pool (InnoDB) or shared_buffers (PostgreSQL) is an in-memory cache of database pages. All reads check the cache first. All writes modify pages in cache first — WAL ensures durability for the unflushed pages. Sizing: for InnoDB, set innodb_buffer_pool_size to 70–80% of RAM on a dedicated DB server. For PostgreSQL, set shared_buffers to ~25% of RAM — PostgreSQL also benefits from the OS page cache via effective_cache_size. Monitor hit ratio: SELECT blks_hit / (blks_hit + blks_read) from pg_stat_database — below 99% is a red flag for read-heavy workloads. InnoDB's LRU has "young" (recently accessed) and "old" (newly inserted) sub-lists to prevent bulk scans from evicting hot data.
What is write amplification and why does it matter for SSD lifetime?
Write amplification is the ratio of actual bytes written to disk to the logical bytes written by the application. In a B+tree, a 100-byte update may write an entire 8KB page (80x amplification on small updates). In an LSM tree, compaction rewrites the same data multiple times across levels — a byte written to L1 is rewritten when merged to L2, and again to L3. Write amplification matters for SSDs because NAND flash cells have finite write endurance (typically 1,000–10,000 Program/Erase cycles). High write amplification — from WAL writes, B+tree page writes, and compaction — accelerates flash wear. In production: WAL alone on a busy PostgreSQL server can write 100GB/day; on a 1TB SSD with 3,000 P/E cycles, that's ~30GB total — the SSD would last ~8 years from WAL alone. High-write databases need enterprise SSDs with high endurance ratings or NVMe with over-provisioning.
How does RocksDB differ from LevelDB?
RocksDB was forked from Google's LevelDB (2013, Facebook) with production workloads in mind. Key differences: (1) Concurrency: RocksDB supports multiple concurrent memtable writes and parallel compaction; LevelDB has a single global lock. (2) Compaction options: RocksDB adds Universal and FIFO compaction strategies in addition to Leveled; LevelDB only has Leveled. (3) Column families: RocksDB organizes data into separate column families (each with its own LSM tree), enabling different compaction and compression settings per logical table. LevelDB has a single key space. (4) Production hardening: backups, SST file ingestion, rate limiters, statistics. RocksDB is used as the storage layer for: CockroachDB, TiKV, MyRocks (MySQL), PebbleDB (CockroachDB's RocksDB replacement), Cassandra plugins. LevelDB is primarily used in Bitcoin Core and simpler embedded use cases.
Further Reading
Previous
Concurrency Control