When you save something in a database, where does it actually go? How does the database survive a power outage without losing your data? This topic covers what happens under the hood: the journal that makes writes safe, the memory cache that makes reads fast, and the two main ways databases store data on disk.
Think of a database like a book. The database does not read one word at a time — it reads a whole page at a time. Each page is typically 8KB or 16KB and contains many rows. The buffer pool is the database's working desk: recently used pages sit in RAM so they can be read instantly. When a page is modified but not yet saved to disk, it is called a dirty page. The database writes dirty pages to disk in the background.
Imagine a chef who writes every step in a notebook before doing anything in the kitchen. If the kitchen catches fire mid-recipe, you can replay the notebook and pick up exactly where things went wrong. The WAL is that notebook. Every change is written to the journal before it touches the actual data. A transaction is considered committed the moment the journal entry is safely on disk — the data page itself can be updated later at a convenient time.
Traditional databases update data in place on disk, which involves a lot of jumping around (random I/O). LSM trees take a different approach: every write lands in memory first (in a structure called the memtable), and once memory fills up, the batch is written to disk sequentially — like appending to a log file. Sequential writes are much faster than random ones. The tradeoff is that reads may need to check multiple sorted files, and background work called compaction periodically merges those files to keep reads manageable.
Think of a checkpoint like a save point in a video game. Periodically, the database takes all the dirty pages from memory and writes them to disk, then marks the current position in the journal. If the machine crashes later, recovery only needs to replay the journal from the last checkpoint — not from the very beginning. More frequent checkpoints mean faster recovery after a crash, but slightly more disk activity while the database is running.
The page (also called a block) is the fundamental unit of I/O in a database engine — typically 4KB, 8KB, or 16KB. Everything the database reads from or writes to disk is in whole pages. Understanding page layout explains why certain operations are expensive and why fragmentation occurs.
A database page is like a physical sheet of paper in a filing folder. At the top is a header with bookkeeping information — the page number, how much free space is left, and a sequence number (called the LSN) that the journal uses to track which changes have been applied. The middle holds the actual row data. A small directory at the bottom keeps track of where each row starts within the page. When the page fills up, a new page is allocated — like adding a new sheet to the folder.
If you pack a page completely full, the next INSERT has nowhere to go and the database must split the page in two — an expensive operation. Fill factor is the setting that controls how full a page is allowed to get before the database considers it "full." InnoDB leaves about 7% empty by default. PostgreSQL lets you configure this per table. For tables that receive lots of updates, leaving extra free space (setting fillfactor to 70-80%) means updates can fit on the same page rather than spilling to a new one.
In PostgreSQL, every UPDATE creates a new version of the row rather than changing it in place. Normally this also requires updating every index that points to that row — which is expensive if you have many indexes. A HOT update (Heap Only Tuple) is a special shortcut: if the new version fits on the same page and none of the indexed columns changed, PostgreSQL can skip updating the indexes entirely and just leave a pointer inside the page from the old row to the new one. This is much cheaper for tables with frequent small updates on non-indexed columns.
Imagine a notebook where you keep erasing entries. Over time, the notebook has lots of half-empty pages with gaps scattered throughout. Every time you read a page, you bring back a full 8KB from disk — but if half of it is empty space, you wasted that read. This is fragmentation. PostgreSQL's VACUUM FULL and InnoDB's OPTIMIZE TABLE physically rewrite the entire table from scratch, packing rows together tightly. The downside is that this operation locks the table and takes a while — so it is not something you do casually on a large production table.
| Property | B+Tree (InnoDB/PostgreSQL) | LSM Tree (RocksDB/Cassandra) |
|---|---|---|
| Write path | Random I/O to existing pages (in-place) | Sequential append to memtable, then SSTable |
| Read path | O(log n) — single B+tree traversal | Check memtable + L0..Ln SSTables (bloom filter skips) |
| Write amplification | Low — write page once | High — data written multiple times during compaction |
| Read amplification | Low — one path from root to leaf | Higher — may need to check multiple levels |
| Space amplification | Low (with VACUUM/OPTIMIZE) | Higher — multiple versions exist until compaction |
| Delete handling | Mark dead, VACUUM later | Tombstone record, compaction removes it |
| Best workload | Mixed read/write, random lookups, range scans | Write-heavy: logs, events, IoT telemetry, append-only |
| Used by | PostgreSQL, MySQL InnoDB, SQLite, Oracle | RocksDB, Cassandra, HBase, LevelDB, TiKV |
The LSM tree trades read performance for dramatically higher write throughput by converting random writes into sequential I/O. Understanding compaction is essential for tuning LSM-backed databases like Cassandra and RocksDB.
All incoming writes land in memory first, in a sorted structure called the memtable. Think of it as an in-progress draft document. Once it reaches a size limit (typically 64MB to 256MB), it is "sealed" — no more writes go to it. A fresh memtable takes over for new writes, while the sealed one (now immutable) is flushed to disk in the background. A WAL journal backs the in-memory data so nothing is lost if the machine crashes before the flush completes.
When the memtable is flushed to disk, it becomes a sorted, immutable file called an SSTable (Sorted String Table). Over time many SSTables accumulate, and looking up a key might require checking several of them. To speed this up, each SSTable carries a bloom filter — a small lookup structure that can quickly answer "is this key definitely NOT in this file?" If the bloom filter says no, the whole file is skipped. This reduces a lookup that might scan dozens of files down to checking just one or two.
SSTables are organized into levels. Level 0 holds newly flushed files and may have overlapping key ranges. Compaction periodically merges L0 files into Level 1, then Level 1 into Level 2, and so on — each level being about 10x larger than the previous. Within Levels 1 and up, key ranges never overlap, which makes lookups fast (at most one file per level to check). The tradeoff is that data is physically rewritten multiple times as it moves through levels. This is called write amplification, and it is the main cost of the LSM approach.
Instead of organizing files by level, size-tiered compaction groups SSTables that are similar in size and merges them together. This avoids the high write amplification of leveled compaction — data is not rewritten as many times. The tradeoff is that multiple files may cover the same key range, so reads may need to check more files. This strategy works best for write-heavy workloads where you mostly append and rarely look up individual records. Cassandra also offers a Time-Window Compaction Strategy (TWCS) which is ideal for time-series data that naturally expires after a fixed period.
Before any data page on disk is touched, the change is first recorded in the redo log (the journal). Once that journal entry is safely written to disk, the transaction is considered committed. This is the key insight: writing to a sequential journal file is fast; writing to a random location in a data file is slow. The journal entry is the durable record of what happened.
The relevant data page is loaded from disk into the buffer pool (if it is not already there). The change is applied to the in-memory copy of the page. The page is now "dirty" — its content in memory does not match what is on disk. It will be written to disk later, in the background.
Alongside the redo log, InnoDB writes an undo record — a "before photo" of what the row looked like before the change. If the transaction is rolled back, InnoDB uses the undo record to put the row back to its original state. This also supports MVCC: other transactions can read the old version of the row from the undo log even while your transaction is mid-flight.
Dirty pages sit in memory until a background process writes them to disk. Periodically, the database performs a checkpoint: it flushes all dirty pages and records exactly where in the journal it has caught up to. This is the save-game moment. If the database crashes after a checkpoint, it only needs to replay the journal from that point forward — not from the beginning of time.
After a crash, the database runs through three phases. First, the Analysis phase scans the journal from the last checkpoint to figure out which transactions were in-flight and which pages were dirty. Second, the Redo phase replays every change in the journal — both committed and uncommitted — to restore the database to the exact state it was in at the moment of the crash. Third, the Undo phase rolls back any transactions that had not yet committed, using the undo logs. This three-phase process is called ARIES recovery.
Buffer pool: most important InnoDB tuning parameter -- Set to 70-80% of available RAM on dedicated DB servers SET GLOBAL innodb_buffer_pool_size = 8G; WAL sync mode: 1 = safest (fsync on commit) 0 = fastest (no fsync, risky), 2 = flush but no fsync SET GLOBAL innodb_flush_log_at_trx_commit = 1PostgreSQL: WAL configuration wal_level = logical enables logical replication checkpoint_completion_target = 0.9 spreads checkpoint I/O View dirty page stats (InnoDB) SELECT * FROM information_schema.INNODB_BUFFER_POOL_STATS\G
INSERT fills free blocks, DELETE marks blocks dashed (fragmented). VACUUM compacts the table. Watch the fill factor gauge change.
InnoDB and PostgreSQL use fixed-size heap pages (8 or 16 KB). Each tuple includes transaction metadata (xmin/xmax) enabling MVCC visibility rules without a separate undo log in Postgres. Writes go to WAL first — data pages are flushed lazily, often in batches.
This disables WAL flushing on commit — MySQL uses OS buffers only. Up to 1 second of committed data can be lost on crash. Never use this for financial or critical data. Mode 2 (flush but no fsync) is a safer compromise.
PostgreSQL MVCC never overwrites rows — dead tuples accumulate. Without VACUUM, tables bloat (4× is common), queries slow down, and eventually transaction ID wraparound (every ~2B transactions) causes catastrophic downtime. Autovacuum must be tuned and monitored.
If the buffer pool is too small, hot pages are constantly evicted and re-read from disk. 70–80% of RAM on a dedicated database server is the standard starting point. Monitor innodb_buffer_pool_reads (disk reads) vs innodb_buffer_pool_read_requests (cache hits).
autovacuum_vacuum_scale_factor (default 20% dead tuples). On high-write tables, tune to trigger more aggressively.innodb_buffer_pool_size to 70–80% of RAM on a dedicated DB server. For PostgreSQL, set shared_buffers to ~25% of RAM — PostgreSQL also benefits from the OS page cache via effective_cache_size. Monitor hit ratio: SELECT blks_hit / (blks_hit + blks_read) from pg_stat_database — below 99% is a red flag for read-heavy workloads. InnoDB's LRU has "young" (recently accessed) and "old" (newly inserted) sub-lists to prevent bulk scans from evicting hot data.