Topic 11

Distributed Databases

A single computer can only store so much data and handle so many requests. When one machine is not enough, databases spread their data across many machines. That sounds simple, but it creates hard problems: how do all those machines agree on what the latest data is? What happens when the network between them fails? This topic explains how modern databases solve those problems.

Spanner (Google)CockroachDBYugabyteDB VitessTiDB

At a Glance

Raft consensus — a protocol for making a group of servers agree on a sequence of changes. One server is elected leader and handles all writes. The others follow. A change is only considered committed once a majority of servers have acknowledged it. If the leader dies, the remaining servers hold an election and pick a new one.
CAP theorem — when the network between servers breaks, you face an unavoidable choice: either stop answering requests until the network is fixed (prioritizing Consistency), or keep answering requests even if some data might be stale (prioritizing Availability). No distributed system can avoid this trade-off.
Sharding — splitting a huge table across multiple machines, each holding a slice of the data. Range sharding groups nearby keys together (great for range queries, but all new inserts might pile onto one machine). Hash sharding spreads data evenly but makes range queries expensive.
2PC (Two-Phase Commit) — when a transaction spans multiple machines, this protocol coordinates them. First, every machine says "I'm ready to commit." Then, if all agree, the coordinator tells everyone to commit. The flaw: if the coordinator crashes between those two steps, the other machines are stuck waiting — holding locks indefinitely.
PACELC model — even when nothing is broken, distributed systems still face a trade-off between responding fast (possibly with slightly stale data) or responding slowly (but guaranteed fresh). PACELC extends CAP to cover this everyday trade-off, not just the failure case.

CAP Theorem

Consistency, Availability, and Partition Tolerance

The CAP theorem (Brewer, 2000) states that a distributed data store can guarantee at most two of three properties simultaneously. In practice, network partitions happen — so the real choice is between Consistency and Availability during a partition.

Consistency (C)

Every read returns the most recent write. If you update a value on one server, any other server in the cluster must immediately reflect that update. To achieve this, servers must synchronize before responding — which adds latency. The strongest form, where reads always reflect writes in real time, is called linearizability. Examples: etcd, ZooKeeper, Spanner.

Availability (A)

Every request gets a response — the system never says "I am unavailable." Even if some servers are unreachable, the remaining ones keep serving requests. The trade-off is that those responses might be based on slightly stale data that has not yet been updated on that particular server. Examples: Cassandra, DynamoDB, and CouchDB prioritize staying available over being perfectly up to date.

Partition Tolerance (P)

A network partition is when some servers in the cluster can no longer communicate with others — a cable gets cut, a router fails, a data center loses connectivity. Partition tolerance means the system keeps running despite this. Since network failures are inevitable at scale, partition tolerance is not a choice you can opt out of. Every real distributed system must be partition tolerant. The genuine choice is: when a partition happens, do you preserve Consistency or Availability?

PACELC Extension

CAP only tells you what happens during a network failure. But most of the time, nothing is broken — and even then, trade-offs exist. The PACELC model adds the everyday dimension: when there is no partition (E = "Else"), you still choose between Latency (fast but possibly stale) and Consistency (accurate but slower). Confirming a write on 3 servers before responding is correct but adds network round-trips. Confirming on 1 and updating the others in the background is fast but risks data loss if that one server crashes. Every distributed database lives somewhere on this spectrum.

Database	CAP Classification	During Partition	Without Partition	PACELC
Spanner / CockroachDB	CP	Rejects writes to minority shards	Linearizable reads (higher latency)	PC/EL (low latency traded for consistency)
Cassandra (quorum)	AP	Accepts writes to any live node	Tunable consistency (ONE to ALL)	PA/EL
DynamoDB (eventual)	AP	Serves stale reads	Low latency reads	PA/EL
PostgreSQL (sync standby)	CP	Blocks until quorum available	Strong consistency, higher write latency	PC/EC
MySQL (async replication)	AP	Primary keeps serving	Stale reads from replicas possible	PA/EL
etcd / ZooKeeper	CP	Refuses requests without quorum	Linearizable, low read latency	PC/EC

Core Concepts

Raft Consensus

Raft is an algorithm for getting a group of servers to agree on a shared log of changes. Think of it like a group of people trying to keep a shared notebook in sync. One person (the leader) writes all new entries. Everyone else copies them. A new entry is considered official only once more than half of the group has written it down. If the leader drops out, the group holds a quick election and picks someone else. Raft was designed to be easier to understand than its predecessor, Paxos.

Sharding

Imagine a library that has grown so large it cannot fit in one building. You split it across multiple buildings — "A-M" in one, "N-Z" in the other. Sharding does the same for databases: it splits rows across multiple machines based on a shard key. You need a strategy for how to split: Range sharding keeps rows in alphabetical or numerical order (good for range queries, but new rows might pile up at the end). Hash sharding scrambles the assignment (even distribution, but range queries become expensive).

Replication

Replication means keeping copies of the same data on multiple machines so that if one fails, the others can take over. The key question is timing. Synchronous replication waits for the copies to confirm they received the data before telling the client "done" — no data is ever lost, but every write takes longer. Asynchronous replication tells the client "done" immediately and updates the copies in the background — faster writes, but if the primary crashes before the copies catch up, the recent writes are gone.

Two-Phase Commit

When a single transaction needs to update data on two or more separate machines, those machines need to agree: either all commit or none do. Two-Phase Commit (2PC) coordinates this. Phase 1: the coordinator asks everyone "are you ready to commit?" Each machine does the work, holds the locks, and says "yes" or "no." Phase 2: if everyone said yes, the coordinator sends the final commit signal. The known flaw: if the coordinator crashes after all the "yes" votes but before sending the final signal, every participant is stuck indefinitely — holding locks with no way to proceed.

Raft Protocol

Raft Leader Election and Log Replication

Raft breaks consensus into three sub-problems: leader election, log replication, and safety. Its key design goal: understandability — one correct implementation path, not a proof of equivalence to Paxos.

Heartbeats and election timeouts

The leader keeps followers informed by sending periodic heartbeat messages — essentially "I am still alive" pings. Each follower has a countdown timer. Every time a heartbeat arrives, the timer resets. If the timer runs out before the next heartbeat, the follower concludes the leader is gone and starts an election. Timers are randomized (150–300ms) so that two followers do not start elections at the exact same moment, which would split the votes and delay the outcome.

RequestVote — becoming a candidate

When a follower's timer runs out, it becomes a candidate and asks the other servers to vote for it. A server will vote "yes" only if two conditions are met: it has not already voted in this election round, and the candidate's log is at least as complete as its own. This second condition is crucial — it prevents a server with outdated data from becoming leader and potentially overwriting committed changes with old ones.

Majority quorum and victory

If the candidate collects votes from more than half the cluster, it wins and immediately starts sending heartbeats to assert its leadership. With 3 servers, you need 2 votes. With 5, you need 3. With 7, you need 4. Larger clusters can survive more failures, but also require more ACKs before a write is considered committed — which adds latency. This is why most Raft clusters run with 3 or 5 nodes, not dozens.

Log replication

When a client sends a write to the leader, the leader appends it to its own log and then sends the entry to all followers. Once a majority of servers have written the entry to their logs and confirmed it, the entry is considered committed. The leader applies the change and responds to the client. Followers apply committed entries in the same order. The guarantee: once any server considers entry number N to be committed, no server will ever record a different entry at position N.

Leader completeness

Raft's core safety rule is that a newly elected leader always has all previously committed entries in its log. This is guaranteed by the voting rule: if a server has a more up-to-date log than the candidate, it refuses to vote for that candidate. So any candidate that wins a majority must already have all committed entries — because the majority of voters that confirmed those entries will refuse to elect anyone who is missing them.

2PC Protocol

Two-Phase Commit in Detail

Two-Phase Commit (2PC) enables atomic distributed transactions — all-or-nothing across multiple database shards. It is the workhorse protocol for cross-shard ACID, despite its well-known blocking limitation.

Phase 1: PREPARE

The coordinator (the server managing the transaction) sends a PREPARE message to every machine involved. Each machine does the actual work — runs the query, acquires locks, and writes a safety record to its journal so the work can be completed later — and then responds YES if it is ready to commit, or NO if something went wrong. After responding YES, each machine is in a waiting state: it has done everything except make the final commitment, and it is holding all its locks.

Phase 2: COMMIT or ABORT

Once all participants respond YES, the coordinator writes a COMMIT record to its own journal and sends COMMIT to everyone. Each machine commits its piece of the transaction, releases its locks, and confirms. If even one participant said NO, or if a timeout occurs with no response, the coordinator sends ABORT instead. Every machine rolls back and releases locks. The transaction is canceled as if it never started.

The blocking problem

Here is the fundamental flaw of 2PC. Imagine the coordinator crashes after collecting all the YES votes but before it can send the COMMIT signal. Every participant is stuck: they have done the work, they are holding their locks, and they cannot decide on their own whether to commit or roll back — they need the coordinator's final word. Until the coordinator recovers (which could take minutes), those locks are held and no other transaction can touch that data. This is called the blocking problem, and it is an inherent limitation of the two-phase design.

Overcoming the blocking problem

Three-Phase Commit (3PC) inserts an extra step to break the blocking scenario, but it adds more network round trips and makes assumptions about timing that are hard to guarantee in practice. Real production systems like Spanner and CockroachDB take a different approach: they run the coordinator itself on a Raft group. If the coordinator crashes, the Raft group elects a new leader that can read the coordinator's log and resume exactly where it left off. This effectively eliminates the blocking window, at the cost of more complex failure recovery logic.

Sharding

Sharding Strategies and Consistent Hashing

Sharding splits data horizontally across multiple independent database instances. The shard key determines which node holds each row. Shard key selection is the most consequential design decision in a sharded architecture.

Strategy	How it works	Pros	Cons	Used by
Range sharding	Partition by key ranges: shard 1 holds keys 0–999, shard 2 holds 1000–1999	Efficient range queries; easy routing	Hot spots on sequential keys (all new writes go to last shard)	CockroachDB, TiDB, HBase
Hash sharding	shard = hash(key) mod N	Uniform load distribution; no hot spots	Range queries require scatter-gather to all shards	MongoDB, Vitess
Consistent hashing	Keys and nodes placed on a ring; key goes to next node clockwise	Adding/removing nodes moves only k/n keys; minimal resharding	Still no range queries; virtual nodes needed for uniform load	Cassandra, DynamoDB, Amazon S3
Directory sharding	Separate lookup service maps each key to a shard	Arbitrary key placement; easy manual rebalancing	Lookup service is a bottleneck and SPOF if not replicated	Flickr (historical), some SaaS multi-tenant systems

Consistent hashing — virtual nodes

Basic consistent hashing places each machine at one point on a ring. When you remove a machine, all its data moves to the single neighbor clockwise — which can overwhelm that one machine. Virtual nodes (vnodes) fix this: instead of one position per machine, each physical machine gets many positions spread around the ring (Cassandra uses 150 by default). Now when a machine is removed, its data spreads evenly across many machines rather than piling onto one. Adding a machine works the same way: it claims small slices from many neighbors, spreading the impact.

Shard key selection criteria

Picking the right shard key is the most important design decision in a sharded system. A good shard key has enough distinct values that data spreads evenly — user_id, device_id, or tenant_id are typical good choices. It should rarely change (if the shard key changes, the row has to physically move to a different machine). Most queries should include the shard key, so the database can go directly to one machine rather than asking all of them. The worst choice is an auto-incrementing number — all new rows have the highest value, so all new writes pile onto the same shard. Use random UUIDs or time-bucketed IDs (like ULID) instead.

Cross-shard queries

If a query cannot be answered from one machine, the database has to send the query to every machine (scatter), collect all their partial results, and merge them into a single answer (gather). This scatter-gather approach is much more expensive — if you have 10 shards, the query now involves 10 machines instead of 1. Cross-shard transactions are even heavier since they require 2PC coordination. The rule of thumb: design your data model so the queries you run most often touch only one shard. Related data that is frequently read together should share the same shard key.

Replication

Replication Topologies

Topology	Write Path	RPO	RTO	Read Scale	Complexity
Single-leader (async)	Primary only; async to replicas	> 0 (replica lag)	Manual failover: minutes	Yes — from replicas	Low
Single-leader (sync)	Primary + wait for N replicas	0	Auto failover: seconds	Limited — sync replica is busy	Medium
Multi-leader	Any node; async conflict resolution	> 0	Automatic	Yes — all leaders	High (conflict resolution)
Leaderless (quorum)	W nodes (W + R > N for strong)	Tunable	Automatic	Yes — quorum reads	High (sloppy quorum, hinted handoff)
Raft/Paxos group	Leader only; sync to majority	0	Auto election: <1s	Limited unless stale reads allowed	Medium (well-understood)

Distributed — Sharding and Consistent Hashing Concepts

-- Consistent hashing ring: adding a node moves only k/n keys
-- Physical node → vnode positions on a 0..2^32 ring
-- Each key maps to the nearest vnode clockwise

-- CockroachDB: automatic range-based sharding
-- Each range = Raft group with 3–5 replicas
-- Ranges auto-split when they exceed 512MB
-- Co-location: pin related rows together with zone configs
ALTER TABLE orders CONFIGURE ZONE USING
  num_replicas = 5,
  constraints = '[+region=us-east1]';

-- Vitess (MySQL horizontal sharding)
ALTER VSCHEMA ON orders ADD VINDEX hash_customer (customer_id)
  USING hash;
-- All orders for a customer route to the same shard
-- Cross-customer queries: scatter-gather across all shards

-- PostgreSQL read replica with application-level routing
-- Primary: writes; replica lag typically 10ms–2s
-- Read replicas: dashboards, reports, analytics
-- Use pg_is_in_recovery() to check if you're on a replica
SELECT pg_is_in_recovery(); -- true on replica, false on primary

-- 2PC in pseudocode
-- Phase 1: PREPARE
--   coordinator → shard_A: "PREPARE txn_id"
--   coordinator → shard_B: "PREPARE txn_id"
--   shard_A → coordinator: "YES" (work done, locks held, log written)
--   shard_B → coordinator: "YES"
-- Phase 2: COMMIT
--   coordinator → shard_A: "COMMIT txn_id"
--   coordinator → shard_B: "COMMIT txn_id"
--   shard_A: commits, releases locks, responds OK
--   shard_B: commits, releases locks, responds OK

Interactive Demo

Raft Consensus Simulator

Send a write to the leader and watch log replication to followers. Kill the leader to trigger an election — watch the election timer and the new leader emerge from the remaining nodes.

Raft elects one leader per term. The leader appends entries to its log then sends AppendEntries RPCs to followers. A write is committed once a majority (quorum) ACKs. If the leader crashes, followers whose election timer fires first request votes — whoever gets majority becomes the new leader.

Raft Consensus (3-node cluster)

Leader = gold border. Green packets = AppendEntries RPCs. Election timer bar animates on leader failure. Log entries shown inside each node.

Range Sharding

Keys sorted — great for range queries within a shard
Hot spot risk on sequential keys (auto-increment PKs)
Used by CockroachDB, TiDB, HBase
Easy rebalancing by splitting a range
Uneven load if key distribution is skewed

Hash Sharding

Uniform load distribution — no hot spots
Range queries require scatter-gather (expensive)
Used by Cassandra, DynamoDB, Vitess, MongoDB
Consistent hashing minimizes key movement on resize
Resharding is operationally complex

Data Propagation

Change Data Capture (CDC)

CDC streams database changes (inserts, updates, deletes) to downstream consumers — analytics databases, search indexes, caches, and event streams. It works by reading the database's replication log (WAL in PostgreSQL, binlog in MySQL) rather than polling tables.

How CDC works

Instead of repeatedly querying the database to check what changed, CDC (Change Data Capture) reads directly from the database's internal journal — the WAL in PostgreSQL, or the binlog in MySQL. Every time a row is inserted, updated, or deleted, the journal records it. A CDC connector (like Debezium, AWS DMS, or Google Datastream) reads that journal and converts each change into a structured event containing the table name, the type of change, and the before and after values. Those events are published to a message queue like Kafka for other systems to consume.

Use cases

CDC is how many real systems stay in sync across multiple data stores. When a product is updated in your database, CDC can automatically update your Elasticsearch search index without you writing any extra code to do so. When a row changes, CDC can tell Redis to invalidate the cached version. It can write every change to an immutable audit log. It can publish domain events to other microservices — replacing the fragile pattern of writing to the database and also publishing to a message queue at the same time (which could get out of sync if either step fails).

At-least-once semantics

CDC connectors guarantee that every change event is delivered at least once. After a restart, the connector picks up from its last saved position and may re-send events it already delivered before the crash. This means the systems receiving those events must be able to handle duplicates gracefully: applying the same event twice should produce the same result as applying it once. This property is called idempotency. In practice, this means using patterns like INSERT … ON CONFLICT DO NOTHING, or conditional updates that only apply if the version number matches what you expect.

Consensus Comparison

Paxos vs Raft

Property	Paxos (Multi-Paxos)	Raft
Design goal	Mathematical correctness proof	Understandability — one correct implementation path
Leader	Distinguished proposer; can be complex	Explicit, durable leader state; one leader per term
Log management	Gaps allowed in log; complex to implement	No gaps — log always contiguous
Leader election	Prepare phase is reused for election	Separate RequestVote phase
Implementation complexity	High — many valid implementations, easy to get subtly wrong	Lower — Raft paper includes a full reference implementation
Used by	Spanner (Paxos), Chubby, Zookeeper (ZAB, Paxos variant)	CockroachDB, etcd, TiKV, CockroachDB
Performance (WAN)	Similar — both need majority quorum	Similar — one round-trip per commit with stable leader

Anti-patterns

Auto-increment PKs as shard keys

Sequential keys route all writes to the shard holding the latest range — a "write hot spot." One shard gets 100% of write traffic while others sit idle. Use random UUIDs (UUID v4), time-prefixed IDs (ULID/KSUID with random suffix), or hash(user_id) to distribute writes uniformly. In CockroachDB, use UUID primary keys instead of SERIAL.

Cross-shard transactions for every operation

2PC is expensive: 2 network round-trips, multiple synchronous log flushes, and blocking on coordinator failure. Design your data model so related entities share the same shard key (all rows for a user on the same shard, all rows for a tenant on the same shard). Cross-shard 2PC is unavoidable for some operations — minimize it, don't eliminate it at the cost of your data model's correctness.

Under-sized Raft groups

A 2-node cluster provides no fault tolerance — one node failure loses quorum. Always use odd numbers: 3 nodes (tolerate 1 failure), 5 nodes (tolerate 2 failures). Do not add more than 7 nodes to a single Raft group — write latency grows proportionally because the majority quorum is larger (4 ACKs for a 7-node group). For larger clusters, use multiple Raft groups (shards), each with 3–5 replicas.

Assuming replica reads are strongly consistent

Async replication means read replicas have lag — typically milliseconds but can be seconds during high write load or network issues. Reads from replicas may return stale data. Application code must either: (1) always read from the primary for consistency-critical paths, (2) route to replica only for explicitly eventually-consistent reads (dashboards, analytics), or (3) implement "read-your-writes" routing (a session always reads from the same replica it just wrote to).

Using wall-clock timestamps for distributed ordering

NTP-synchronized clocks on different nodes can drift by 100ms or more, and can go backwards. Using wall-clock timestamps to order distributed events gives wrong results: a write on Node A at time T and a write on Node B at time T+1ms may actually be concurrent, or Node B's write may causally precede Node A's. Use logical clocks (Lamport timestamps), vector clocks, or database-provided timestamps (Spanner's TrueTime, CockroachDB's hybrid logical clocks).

Quiz

Question 1 of 5

In Raft, a new leader is elected when:

AAny node explicitly requests a leadership change

BA follower's election timeout expires without receiving a leader heartbeat

C50%+1 nodes simultaneously agree to elect a new leader

DThe designated master fails over to a standby

Question 2 of 5

Two-Phase Commit (2PC) can block indefinitely when:

AAll participant nodes respond promptly to PREPARE

BThe coordinator fails after PREPARE but before sending COMMIT/ABORT

CThe network is fast and all nodes are healthy

DNo locks are held by any participant

Question 3 of 5

Consistent hashing reduces resharding cost because:

AAll keys are rehashed when the node count changes

BOnly k/n keys need remapping when adding or removing a node

CThe system is limited to exactly 3 nodes

DData is stored in B-trees internally, avoiding rehashing

Question 4 of 5

According to CAP theorem, during a network partition a distributed system must choose between:

ASpeed and safety

BConsistency and Availability (Partition Tolerance is mandatory)

CConsistency and Partition Tolerance

DAvailability and Partition Tolerance

Question 5 of 5

Why is using auto-increment IDs as a shard key a problem?

AThey cause key collisions across shards

BSequential keys route all writes to the shard holding the highest range (write hot spot)

CAuto-increment IDs cannot be used as primary keys in sharded systems

DThey prevent proper replication across nodes

Interview Q&A

How does the Raft consensus algorithm ensure linearizability?▾

Raft ensures linearizability through: (1) Leader completeness: a leader always has all committed log entries — votes are only granted to candidates whose log is at least as up-to-date as the voter's. (2) Majority commit: an entry is committed only after being replicated to a majority of nodes — any future majority quorum will include at least one node with the entry. (3) State machine safety: all nodes apply log entries in the same order — state machine determinism guarantees identical state. (4) Read linearizability: reads from the leader must verify leadership before serving (via a read-index round-trip or lease renewal) — a partitioned stale leader must not serve reads. CockroachDB uses leader leases (with expiry time) to serve reads without a heartbeat round-trip while still guaranteeing freshness.

What is the difference between synchronous and asynchronous replication?▾

Synchronous: before ACKing a write to the client, wait for at least one (or more) replicas to confirm they have durably written the data. Zero RPO (no committed data lost on primary failure). Tradeoff: write latency includes replica round-trip — if the replica is 50ms away, every write is at least 50ms. Used by Raft/Paxos (majority quorum), PostgreSQL synchronous_standby_names. Asynchronous: ACK the write immediately after writing to the primary. Replicas catch up in background. Lower write latency but RPO > 0 — failover to an async replica may lose recently committed transactions. MySQL binlog replication and PostgreSQL streaming replication are async by default. Rule: use synchronous for primary data stores where data loss is unacceptable; async for read replicas where slight staleness is explicitly tolerable.

How do you choose a shard key? What makes a bad shard key?▾

Good shard key criteria: (1) High cardinality: enough distinct values to distribute data evenly across shards. (2) Low update rate: changing a shard key requires moving the row — high update shard keys create constant data movement. (3) Query alignment: most queries filter on the shard key — avoid cross-shard scatter-gather. (4) No sequential hot spots: monotonically increasing keys (auto-increment IDs, timestamps) put all writes on the latest shard. Bad shard keys: auto-increment ID (sequential), status/enum (low cardinality, imbalanced), created_at alone (time-based hot spot). Good shard keys: user_id (high cardinality, user-centric queries), tenant_id (multi-tenant SaaS), hash(user_id) (if user_id is sequential).

How does Google Spanner achieve global ACID transactions?▾

Spanner combines four innovations: (1) TrueTime API: GPS + atomic clocks in every datacenter. TrueTime returns a time interval [earliest, latest] with bounded uncertainty (~7ms). Commit timestamps are chosen after the TrueTime interval's latest bound — ensuring external consistency: if Txn2 starts after Txn1 commits in real time, Txn2's commit timestamp is strictly greater. (2) Paxos groups: each data shard is a Paxos replication group for strong per-shard consistency. (3) 2PC over Paxos: cross-shard distributed transactions use 2PC, with each shard's Paxos group providing fault-tolerant coordinator state — eliminating the 2PC blocking window. (4) MVCC: multi-version storage enables consistent snapshots across all shards globally. Result: RPO=0, RTO ~seconds for regional failures, with true external consistency at global scale.

What is Change Data Capture and how does it differ from polling?▾

CDC reads the database's replication log (WAL in PostgreSQL, binlog in MySQL) to stream every committed change as a structured event. It differs from polling (SELECT * WHERE updated_at > last_check) in: (1) Completeness: CDC captures DELETEs; polling cannot detect what was deleted without keeping tombstones. (2) Low latency: CDC emits events within milliseconds; polling adds lag equal to the poll interval. (3) No DB load from polling: CDC reads a log; polling runs queries that must be indexed. (4) Ordering: CDC events arrive in commit order; polling results are unordered unless sorted. Tools: Debezium (open source, connects to PostgreSQL/MySQL/SQL Server), AWS DMS, Google Datastream, Confluent Platform. Use cases: search index sync, cache invalidation, audit log, event-driven microservice communication (outbox pattern). Limitation: CDC is at-least-once — consumers must be idempotent.

Explain the CAP theorem and what it means for real-world database design.▾

CAP (Consistency, Availability, Partition Tolerance) states that during a network partition, a distributed system can provide at most two of the three. Partition tolerance is not optional — network partitions happen in any real distributed system. So the real choice is C vs A during a partition. CP systems (etcd, Spanner, CockroachDB): refuse to serve requests to the minority side of a partition — no stale reads, but some nodes are unavailable. AP systems (Cassandra, DynamoDB, Riak): continue serving all nodes during a partition, accepting that different nodes may return different values — stale reads possible. Practical implication: most OLTP databases choose CP (strong consistency). Cassandra lets you tune per-operation: read ONE (AP) or read QUORUM (CP-like). The PACELC extension adds: even without partitions, there is a latency vs consistency tradeoff. All systems live somewhere on this spectrum — no free lunch.

What is Vitess and when would you use it over a native distributed database?▾

Vitess is a horizontal sharding and connection pooling layer for MySQL — originally built at YouTube, now used by GitHub, Slack, Square. It sits in front of multiple MySQL instances and handles: query routing (rewrite SQL for the correct shard), connection pooling (reduce MySQL connections from thousands to hundreds), cross-shard queries (scatter-gather), online schema changes (blocking-free ALTER TABLE), and automatic failover. Use Vitess when: your MySQL database is hitting write limits (disk, CPU, connections), you want to stay on MySQL for operational familiarity and existing tooling, and you don't need global consistency across shards. Prefer CockroachDB/YugabyteDB when: you're greenfield, need strong ACID across shards, want a simpler operational model (no Vitess orchestration layer), or need geographic distribution with zone-aware reads.

How does multi-leader replication work and what problems does it create?▾

In multi-leader replication, multiple nodes can accept writes independently and replicate to each other asynchronously. Used in: multi-datacenter setups (Galera Cluster for MySQL, PostgreSQL BDR), collaborative editing (Google Docs), offline-capable apps (CouchDB). Benefit: writes are local to the datacenter — low write latency. Problem: write conflicts. If two leaders concurrently update the same row, both updates are "correct" locally but the replicated state is inconsistent. Conflict resolution strategies: Last Write Wins (LWW) — use timestamp to pick one, discarding the other. Silently loses data. Merge — attempt to union the changes (works for counters and sets, not arbitrary fields). Application-defined resolution — propagate conflict to application to resolve. CRDTs (Conflict-Free Replicated Data Types) — mathematically designed data structures where any merge is valid and commutative (counters, sets, maps). Multi-leader is best used when conflicts are rare and any conflict resolution strategy is acceptable — not for inventory, financial, or booking systems where correctness is critical.