A single computer can only store so much data and handle so many requests. When one machine is not enough, databases spread their data across many machines. That sounds simple, but it creates hard problems: how do all those machines agree on what the latest data is? What happens when the network between them fails? This topic explains how modern databases solve those problems.
The CAP theorem (Brewer, 2000) states that a distributed data store can guarantee at most two of three properties simultaneously. In practice, network partitions happen — so the real choice is between Consistency and Availability during a partition.
Every read returns the most recent write. If you update a value on one server, any other server in the cluster must immediately reflect that update. To achieve this, servers must synchronize before responding — which adds latency. The strongest form, where reads always reflect writes in real time, is called linearizability. Examples: etcd, ZooKeeper, Spanner.
Every request gets a response — the system never says "I am unavailable." Even if some servers are unreachable, the remaining ones keep serving requests. The trade-off is that those responses might be based on slightly stale data that has not yet been updated on that particular server. Examples: Cassandra, DynamoDB, and CouchDB prioritize staying available over being perfectly up to date.
A network partition is when some servers in the cluster can no longer communicate with others — a cable gets cut, a router fails, a data center loses connectivity. Partition tolerance means the system keeps running despite this. Since network failures are inevitable at scale, partition tolerance is not a choice you can opt out of. Every real distributed system must be partition tolerant. The genuine choice is: when a partition happens, do you preserve Consistency or Availability?
CAP only tells you what happens during a network failure. But most of the time, nothing is broken — and even then, trade-offs exist. The PACELC model adds the everyday dimension: when there is no partition (E = "Else"), you still choose between Latency (fast but possibly stale) and Consistency (accurate but slower). Confirming a write on 3 servers before responding is correct but adds network round-trips. Confirming on 1 and updating the others in the background is fast but risks data loss if that one server crashes. Every distributed database lives somewhere on this spectrum.
| Database | CAP Classification | During Partition | Without Partition | PACELC |
|---|---|---|---|---|
| Spanner / CockroachDB | CP | Rejects writes to minority shards | Linearizable reads (higher latency) | PC/EL (low latency traded for consistency) |
| Cassandra (quorum) | AP | Accepts writes to any live node | Tunable consistency (ONE to ALL) | PA/EL |
| DynamoDB (eventual) | AP | Serves stale reads | Low latency reads | PA/EL |
| PostgreSQL (sync standby) | CP | Blocks until quorum available | Strong consistency, higher write latency | PC/EC |
| MySQL (async replication) | AP | Primary keeps serving | Stale reads from replicas possible | PA/EL |
| etcd / ZooKeeper | CP | Refuses requests without quorum | Linearizable, low read latency | PC/EC |
Raft is an algorithm for getting a group of servers to agree on a shared log of changes. Think of it like a group of people trying to keep a shared notebook in sync. One person (the leader) writes all new entries. Everyone else copies them. A new entry is considered official only once more than half of the group has written it down. If the leader drops out, the group holds a quick election and picks someone else. Raft was designed to be easier to understand than its predecessor, Paxos.
Imagine a library that has grown so large it cannot fit in one building. You split it across multiple buildings — "A-M" in one, "N-Z" in the other. Sharding does the same for databases: it splits rows across multiple machines based on a shard key. You need a strategy for how to split: Range sharding keeps rows in alphabetical or numerical order (good for range queries, but new rows might pile up at the end). Hash sharding scrambles the assignment (even distribution, but range queries become expensive).
Replication means keeping copies of the same data on multiple machines so that if one fails, the others can take over. The key question is timing. Synchronous replication waits for the copies to confirm they received the data before telling the client "done" — no data is ever lost, but every write takes longer. Asynchronous replication tells the client "done" immediately and updates the copies in the background — faster writes, but if the primary crashes before the copies catch up, the recent writes are gone.
When a single transaction needs to update data on two or more separate machines, those machines need to agree: either all commit or none do. Two-Phase Commit (2PC) coordinates this. Phase 1: the coordinator asks everyone "are you ready to commit?" Each machine does the work, holds the locks, and says "yes" or "no." Phase 2: if everyone said yes, the coordinator sends the final commit signal. The known flaw: if the coordinator crashes after all the "yes" votes but before sending the final signal, every participant is stuck indefinitely — holding locks with no way to proceed.
Raft breaks consensus into three sub-problems: leader election, log replication, and safety. Its key design goal: understandability — one correct implementation path, not a proof of equivalence to Paxos.
The leader keeps followers informed by sending periodic heartbeat messages — essentially "I am still alive" pings. Each follower has a countdown timer. Every time a heartbeat arrives, the timer resets. If the timer runs out before the next heartbeat, the follower concludes the leader is gone and starts an election. Timers are randomized (150–300ms) so that two followers do not start elections at the exact same moment, which would split the votes and delay the outcome.
When a follower's timer runs out, it becomes a candidate and asks the other servers to vote for it. A server will vote "yes" only if two conditions are met: it has not already voted in this election round, and the candidate's log is at least as complete as its own. This second condition is crucial — it prevents a server with outdated data from becoming leader and potentially overwriting committed changes with old ones.
If the candidate collects votes from more than half the cluster, it wins and immediately starts sending heartbeats to assert its leadership. With 3 servers, you need 2 votes. With 5, you need 3. With 7, you need 4. Larger clusters can survive more failures, but also require more ACKs before a write is considered committed — which adds latency. This is why most Raft clusters run with 3 or 5 nodes, not dozens.
When a client sends a write to the leader, the leader appends it to its own log and then sends the entry to all followers. Once a majority of servers have written the entry to their logs and confirmed it, the entry is considered committed. The leader applies the change and responds to the client. Followers apply committed entries in the same order. The guarantee: once any server considers entry number N to be committed, no server will ever record a different entry at position N.
Raft's core safety rule is that a newly elected leader always has all previously committed entries in its log. This is guaranteed by the voting rule: if a server has a more up-to-date log than the candidate, it refuses to vote for that candidate. So any candidate that wins a majority must already have all committed entries — because the majority of voters that confirmed those entries will refuse to elect anyone who is missing them.
Two-Phase Commit (2PC) enables atomic distributed transactions — all-or-nothing across multiple database shards. It is the workhorse protocol for cross-shard ACID, despite its well-known blocking limitation.
The coordinator (the server managing the transaction) sends a PREPARE message to every machine involved. Each machine does the actual work — runs the query, acquires locks, and writes a safety record to its journal so the work can be completed later — and then responds YES if it is ready to commit, or NO if something went wrong. After responding YES, each machine is in a waiting state: it has done everything except make the final commitment, and it is holding all its locks.
Once all participants respond YES, the coordinator writes a COMMIT record to its own journal and sends COMMIT to everyone. Each machine commits its piece of the transaction, releases its locks, and confirms. If even one participant said NO, or if a timeout occurs with no response, the coordinator sends ABORT instead. Every machine rolls back and releases locks. The transaction is canceled as if it never started.
Here is the fundamental flaw of 2PC. Imagine the coordinator crashes after collecting all the YES votes but before it can send the COMMIT signal. Every participant is stuck: they have done the work, they are holding their locks, and they cannot decide on their own whether to commit or roll back — they need the coordinator's final word. Until the coordinator recovers (which could take minutes), those locks are held and no other transaction can touch that data. This is called the blocking problem, and it is an inherent limitation of the two-phase design.
Three-Phase Commit (3PC) inserts an extra step to break the blocking scenario, but it adds more network round trips and makes assumptions about timing that are hard to guarantee in practice. Real production systems like Spanner and CockroachDB take a different approach: they run the coordinator itself on a Raft group. If the coordinator crashes, the Raft group elects a new leader that can read the coordinator's log and resume exactly where it left off. This effectively eliminates the blocking window, at the cost of more complex failure recovery logic.
Sharding splits data horizontally across multiple independent database instances. The shard key determines which node holds each row. Shard key selection is the most consequential design decision in a sharded architecture.
| Strategy | How it works | Pros | Cons | Used by |
|---|---|---|---|---|
| Range sharding | Partition by key ranges: shard 1 holds keys 0–999, shard 2 holds 1000–1999 | Efficient range queries; easy routing | Hot spots on sequential keys (all new writes go to last shard) | CockroachDB, TiDB, HBase |
| Hash sharding | shard = hash(key) mod N | Uniform load distribution; no hot spots | Range queries require scatter-gather to all shards | MongoDB, Vitess |
| Consistent hashing | Keys and nodes placed on a ring; key goes to next node clockwise | Adding/removing nodes moves only k/n keys; minimal resharding | Still no range queries; virtual nodes needed for uniform load | Cassandra, DynamoDB, Amazon S3 |
| Directory sharding | Separate lookup service maps each key to a shard | Arbitrary key placement; easy manual rebalancing | Lookup service is a bottleneck and SPOF if not replicated | Flickr (historical), some SaaS multi-tenant systems |
Basic consistent hashing places each machine at one point on a ring. When you remove a machine, all its data moves to the single neighbor clockwise — which can overwhelm that one machine. Virtual nodes (vnodes) fix this: instead of one position per machine, each physical machine gets many positions spread around the ring (Cassandra uses 150 by default). Now when a machine is removed, its data spreads evenly across many machines rather than piling onto one. Adding a machine works the same way: it claims small slices from many neighbors, spreading the impact.
Picking the right shard key is the most important design decision in a sharded system. A good shard key has enough distinct values that data spreads evenly — user_id, device_id, or tenant_id are typical good choices. It should rarely change (if the shard key changes, the row has to physically move to a different machine). Most queries should include the shard key, so the database can go directly to one machine rather than asking all of them. The worst choice is an auto-incrementing number — all new rows have the highest value, so all new writes pile onto the same shard. Use random UUIDs or time-bucketed IDs (like ULID) instead.
If a query cannot be answered from one machine, the database has to send the query to every machine (scatter), collect all their partial results, and merge them into a single answer (gather). This scatter-gather approach is much more expensive — if you have 10 shards, the query now involves 10 machines instead of 1. Cross-shard transactions are even heavier since they require 2PC coordination. The rule of thumb: design your data model so the queries you run most often touch only one shard. Related data that is frequently read together should share the same shard key.
| Topology | Write Path | RPO | RTO | Read Scale | Complexity |
|---|---|---|---|---|---|
| Single-leader (async) | Primary only; async to replicas | > 0 (replica lag) | Manual failover: minutes | Yes — from replicas | Low |
| Single-leader (sync) | Primary + wait for N replicas | 0 | Auto failover: seconds | Limited — sync replica is busy | Medium |
| Multi-leader | Any node; async conflict resolution | > 0 | Automatic | Yes — all leaders | High (conflict resolution) |
| Leaderless (quorum) | W nodes (W + R > N for strong) | Tunable | Automatic | Yes — quorum reads | High (sloppy quorum, hinted handoff) |
| Raft/Paxos group | Leader only; sync to majority | 0 | Auto election: <1s | Limited unless stale reads allowed | Medium (well-understood) |
-- Consistent hashing ring: adding a node moves only k/n keys -- Physical node → vnode positions on a 0..2^32 ring -- Each key maps to the nearest vnode clockwise -- CockroachDB: automatic range-based sharding -- Each range = Raft group with 3–5 replicas -- Ranges auto-split when they exceed 512MB -- Co-location: pin related rows together with zone configs ALTER TABLE orders CONFIGURE ZONE USING num_replicas = 5, constraints = '[+region=us-east1]'; -- Vitess (MySQL horizontal sharding) ALTER VSCHEMA ON orders ADD VINDEX hash_customer (customer_id) USING hash; -- All orders for a customer route to the same shard -- Cross-customer queries: scatter-gather across all shards -- PostgreSQL read replica with application-level routing -- Primary: writes; replica lag typically 10ms–2s -- Read replicas: dashboards, reports, analytics -- Use pg_is_in_recovery() to check if you're on a replica SELECT pg_is_in_recovery(); -- true on replica, false on primary -- 2PC in pseudocode -- Phase 1: PREPARE -- coordinator → shard_A: "PREPARE txn_id" -- coordinator → shard_B: "PREPARE txn_id" -- shard_A → coordinator: "YES" (work done, locks held, log written) -- shard_B → coordinator: "YES" -- Phase 2: COMMIT -- coordinator → shard_A: "COMMIT txn_id" -- coordinator → shard_B: "COMMIT txn_id" -- shard_A: commits, releases locks, responds OK -- shard_B: commits, releases locks, responds OK
Send a write to the leader and watch log replication to followers. Kill the leader to trigger an election — watch the election timer and the new leader emerge from the remaining nodes.
Raft elects one leader per term. The leader appends entries to its log then sends AppendEntries RPCs to followers. A write is committed once a majority (quorum) ACKs. If the leader crashes, followers whose election timer fires first request votes — whoever gets majority becomes the new leader.
CDC streams database changes (inserts, updates, deletes) to downstream consumers — analytics databases, search indexes, caches, and event streams. It works by reading the database's replication log (WAL in PostgreSQL, binlog in MySQL) rather than polling tables.
Instead of repeatedly querying the database to check what changed, CDC (Change Data Capture) reads directly from the database's internal journal — the WAL in PostgreSQL, or the binlog in MySQL. Every time a row is inserted, updated, or deleted, the journal records it. A CDC connector (like Debezium, AWS DMS, or Google Datastream) reads that journal and converts each change into a structured event containing the table name, the type of change, and the before and after values. Those events are published to a message queue like Kafka for other systems to consume.
CDC is how many real systems stay in sync across multiple data stores. When a product is updated in your database, CDC can automatically update your Elasticsearch search index without you writing any extra code to do so. When a row changes, CDC can tell Redis to invalidate the cached version. It can write every change to an immutable audit log. It can publish domain events to other microservices — replacing the fragile pattern of writing to the database and also publishing to a message queue at the same time (which could get out of sync if either step fails).
CDC connectors guarantee that every change event is delivered at least once. After a restart, the connector picks up from its last saved position and may re-send events it already delivered before the crash. This means the systems receiving those events must be able to handle duplicates gracefully: applying the same event twice should produce the same result as applying it once. This property is called idempotency. In practice, this means using patterns like INSERT … ON CONFLICT DO NOTHING, or conditional updates that only apply if the version number matches what you expect.
| Property | Paxos (Multi-Paxos) | Raft |
|---|---|---|
| Design goal | Mathematical correctness proof | Understandability — one correct implementation path |
| Leader | Distinguished proposer; can be complex | Explicit, durable leader state; one leader per term |
| Log management | Gaps allowed in log; complex to implement | No gaps — log always contiguous |
| Leader election | Prepare phase is reused for election | Separate RequestVote phase |
| Implementation complexity | High — many valid implementations, easy to get subtly wrong | Lower — Raft paper includes a full reference implementation |
| Used by | Spanner (Paxos), Chubby, Zookeeper (ZAB, Paxos variant) | CockroachDB, etcd, TiKV, CockroachDB |
| Performance (WAN) | Similar — both need majority quorum | Similar — one round-trip per commit with stable leader |
Sequential keys route all writes to the shard holding the latest range — a "write hot spot." One shard gets 100% of write traffic while others sit idle. Use random UUIDs (UUID v4), time-prefixed IDs (ULID/KSUID with random suffix), or hash(user_id) to distribute writes uniformly. In CockroachDB, use UUID primary keys instead of SERIAL.
2PC is expensive: 2 network round-trips, multiple synchronous log flushes, and blocking on coordinator failure. Design your data model so related entities share the same shard key (all rows for a user on the same shard, all rows for a tenant on the same shard). Cross-shard 2PC is unavoidable for some operations — minimize it, don't eliminate it at the cost of your data model's correctness.
A 2-node cluster provides no fault tolerance — one node failure loses quorum. Always use odd numbers: 3 nodes (tolerate 1 failure), 5 nodes (tolerate 2 failures). Do not add more than 7 nodes to a single Raft group — write latency grows proportionally because the majority quorum is larger (4 ACKs for a 7-node group). For larger clusters, use multiple Raft groups (shards), each with 3–5 replicas.
Async replication means read replicas have lag — typically milliseconds but can be seconds during high write load or network issues. Reads from replicas may return stale data. Application code must either: (1) always read from the primary for consistency-critical paths, (2) route to replica only for explicitly eventually-consistent reads (dashboards, analytics), or (3) implement "read-your-writes" routing (a session always reads from the same replica it just wrote to).
NTP-synchronized clocks on different nodes can drift by 100ms or more, and can go backwards. Using wall-clock timestamps to order distributed events gives wrong results: a write on Node A at time T and a write on Node B at time T+1ms may actually be concurrent, or Node B's write may causally precede Node A's. Use logical clocks (Lamport timestamps), vector clocks, or database-provided timestamps (Spanner's TrueTime, CockroachDB's hybrid logical clocks).