Topic 01

Introduction to DBMS

Imagine storing all your app's data in a spreadsheet. It works fine until two people try to edit it at the same time, the file gets corrupted, or you need to answer complex questions about the data. A Database Management System (DBMS) exists to solve exactly those problems.

MySQLPostgreSQLSQLite Oracle DBMariaDBIBM Db2

At a Glance

DBMS vs. flat files — a spreadsheet breaks when two people edit it at once; a DBMS handles that automatically, plus enforces data rules and recovers from crashes.
Three-schema architecture — think of it like a building: visitors see the lobby (external views), architects see the floor plan (logical schema), and engineers see the pipes (physical storage). Change the pipes without redrawing the floor plan.
ACID — four promises a database makes. Think of a bank transfer: the money leaves one account AND enters the other (Atomicity), rules are never broken (Consistency), two people don't interfere (Isolation), and once it's done it stays done even if power dies (Durability).
Query pipeline — your SQL is like a recipe: first it's read and understood (Parse), then the best cooking method is chosen (Optimize), then the food is actually made (Execute).
Write-Ahead Log (WAL) — before changing any data, the database writes a note in a journal. If the power goes out, it re-reads the journal and picks up where it left off.

Core Concepts

Data Independence

Imagine your app reads data from a filing cabinet. If you reorganize the cabinet (add dividers, switch folders), your app should not need to be rewritten. That is data independence. Physical independence means you can change how data is stored on disk without touching your queries. Logical independence means you can restructure your tables without breaking existing code.

ACID Properties

Think about transferring $100 between bank accounts. Atomicity: if the debit succeeds but the credit fails, the whole thing is cancelled — money never disappears into thin air. Consistency: the database only allows valid states, so your balance can never go negative if you set that rule. Isolation: two people booking the same last seat see each other's changes correctly. Durability: once the bank says "transfer complete," it stays complete even if the server crashes a second later.

Three-Schema Architecture

Think of a hospital. Nurses see patient records (their external view). The hospital's records department sees all tables and relationships (the conceptual schema). The IT department worries about hard drives and backups (the internal schema). Each group sees only what they need. If the IT department upgrades the servers, the nurses' screens do not change. That separation is the three-schema architecture.

Concurrency Control

Imagine two people editing the same Google Doc at the same time. Databases face a harder version of this problem: two people buying the last ticket, or two bank transfers hitting the same account. The DBMS coordinates these automatically. One approach is locking (like putting a "do not disturb" sign on a row — called Two-Phase Locking or 2PL). Another is keeping multiple versions of data so readers never wait on writers (called MVCC). You do not need to implement this yourself — the database handles it.

How It Works

DBMS Architecture

Query Parser

When you type a SQL query, the first thing the database does is read it like a teacher reading an essay — checking for spelling mistakes (syntax errors), confirming the table names exist, and verifying you have permission to access that data. If anything is wrong here, you get an error message and nothing else happens yet.

Query Rewriter

Before the database decides how to run your query, it cleans it up. If you queried a view (a saved named query), it replaces the view name with the actual query behind it. If there are security rules (for example, "this user can only see their own rows"), those get baked in here. You write one thing; the database quietly rewrites it into a complete, safe version before proceeding.

Query Optimizer

This is the brain of the operation. The optimizer looks at your query and thinks about all the ways it could be answered. Should it scan every row? Use an index to jump straight to the right rows? Which table should it read first? It estimates how much work each approach would take, using statistics about how many rows are in each table and how the data is distributed, then picks the fastest plan. A good plan can be 1000x faster than a bad one for the same query.

Execution Engine

The execution works like an assembly line. The plan is a chain of steps — filter rows, join tables, sort results. Each step asks the step below it "give me the next row," processes it, then passes it up. This pull-based design is called the Iterator model (sometimes called the Volcano model). The engine fetches the actual data from a RAM cache called the buffer pool, which is much faster than reading from disk every time.

Transaction Manager

Every query runs inside a transaction — a unit of work that either fully completes or fully undoes. The Transaction Manager keeps track of what is in-progress. Before any data is changed, it writes a note to a journal (the Write-Ahead Log, or WAL) saying what is about to happen. It also manages locks so that two queries do not corrupt the same row at the same time. When you run COMMIT, the transaction is sealed; ROLLBACK throws it all away.

Recovery Manager

When the database crashes (power outage, server restart), the Recovery Manager takes over. It works like reading a journal after a disaster: first it figures out where things stood at the time of the crash, then it replays all the completed work from the journal (called the WAL) to restore it, and finally it cancels any work that was still in progress. This process is called ARIES recovery. The key insight is that because every change was logged before it happened, nothing is ever truly lost.

SQL

One SQL statement, six internal system interactions
SELECT e.name, d.dept_name
FROM   employees e
JOIN   departments d ON e.dept_id = d.dept_id
WHERE  e.salary > 80000
ORDER BY e.name;

Step 1: Parser validates table names and column types
Step 2: Optimizer chooses IndexScan on salary vs SeqScan
Step 3: Execution engine fetches pages via Buffer Manager
Step 4: Lock Manager grants shared locks on scanned rows
Step 5: Join algorithm executes (hash join or nested loop)
Step 6: Sort node materializes result, applies ORDER BY

ACID in action: a bank transfer
BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; atomic: both succeed or both rollback

DBMS Families

Choosing the Right Engine

Relational (RDBMS)

SQL query language
ACID transactions
Schema enforced by DDL
Normalized data model
PostgreSQL, MySQL, Oracle, SQL Server

Document Store

JSON/BSON documents
Flexible, schema-optional
Denormalized by design
Tunable consistency
MongoDB, CouchDB, Firestore

Key-Value Store

O(1) get/set operations
Often in-memory
Extremely low latency
No query language
Redis, DynamoDB, etcd

Wide-Column (Column-Family)

Sparse rows, column families
Write-optimized (LSM tree)
Horizontal scalability
Tunable consistency (AP by default)
Cassandra, HBase, Bigtable

Write-Ahead Logging

Durability via WAL

Log Before Disk

Before the database changes any data, it first writes a description of that change to a journal on disk (called the Write-Ahead Log, or WAL). Think of it like a chef writing down a recipe step before actually cooking it. If the kitchen catches fire mid-recipe, you still know exactly what was being done and can restart from the right point. The log always stays one step ahead of the actual data — that is the "write-ahead" guarantee.

In-Memory Modification

After logging, the database makes the change in memory (in the buffer pool — its RAM workspace). The updated data does not go to disk right away. This is intentional: writing to RAM is thousands of times faster than writing to disk. If the server crashes at this point, no data is lost — the WAL journal already has a record of every change, so recovery can reconstruct the correct state.

Commit Flushes Log

When you run COMMIT, the database forces its journal entries to physical storage before telling you "success." This is the moment durability is guaranteed. The actual data pages in RAM can still be updated lazily in the background — but the log is already safe. In PostgreSQL, setting synchronous_commit = off skips this final safety step, making writes faster but risking the loss of a few recent commits if the server crashes.

Checkpoint and Recovery

Periodically the database takes a "checkpoint" — it flushes all in-memory changes to disk and marks a safe starting point in the journal. Think of it like saving your progress in a game. After a crash, recovery does not replay the entire journal from the beginning — just from the last checkpoint. The ARIES recovery process then replays completed work (redo), and cancels anything that was still in-progress when the crash happened (undo). More frequent checkpoints mean faster recovery.

Buffer Pool: RAM Cache Between Your Query and Disk

The buffer pool is a fixed-size pool of RAM pages managed by the DBMS. When a disk page is needed, the DBMS checks the buffer pool first (buffer hit — fast) or fetches from disk (buffer miss — slow, 0.1ms vs 100ns). InnoDB buffer pool is typically set to 70-80% of server RAM. PostgreSQL's shared_buffers is set to 25% of RAM, relying on the OS page cache for the rest. A buffer pool hit rate above 95% is normal for OLTP workloads — below 90% indicates insufficient RAM or a missing index forcing full scans.

Architecture Diagram

Three-Schema Architecture

Every DBMS provides three layers of abstraction. Queries flow from the external view through the logical schema down to physical storage — and results travel back up.

Logical independence: change the conceptual schema without rewriting external queries. Physical independence: reorganize storage without touching the conceptual layer.

Interactive Demo

ER Diagram Explorer

Click any entity to highlight its relationships and attributes. The ER diagram models a university registration system.

Entity-Relationship Diagram

Click entity to explore

Click STUDENT, COURSE, or ENROLLMENT to highlight connected elements

Anti-patterns

Using flat files instead of a DBMS

CSV files have no atomicity, no query language, no concurrent access control, and no recovery. Fine for scripts; catastrophic for production data with multiple writers or concurrent readers.

Ignoring normalization from day one

Denormalized schemas feel easy initially but create update anomalies, data redundancy, and painful migrations as the application grows. Start normalized; denormalize deliberately when access patterns demand it.

Treating the DBMS as a dumb storage layer

Not using constraints (FK, NOT NULL, CHECK) shifts validation to application code — the worst of both worlds. Constraints in the DB enforce invariants across all callers simultaneously.

Not monitoring buffer pool hit rate

A dropping hit rate is the first warning sign of memory pressure or new table scans. Check pg_stat_bgwriter in PostgreSQL or SHOW ENGINE INNODB STATUS in MySQL regularly.

Skipping EXPLAIN ANALYZE before deploying queries

A query that runs in 10ms on your dev database with 1,000 rows can take 30 seconds on production with 10 million rows. Always verify the execution plan against production-scale data volumes before deploying.

Quiz

Check Your Understanding

Question 1 of 5

What is the primary advantage of a DBMS over a file system?

AFaster file I/O in all cases

BData independence, integrity, concurrency, and recovery

CSimpler code for all use cases

DBuilt-in transactions in all storage solutions

Question 2 of 5

Which DBMS component generates multiple execution strategies and picks the cheapest?

ABuffer Manager

BQuery Optimizer

CTransaction Manager

DRecovery Manager

Question 3 of 5

In the three-schema architecture, the "conceptual schema" represents:

AThe physical storage layout on disk

BThe logical structure of all data — tables, relationships, constraints

CUser-specific views and permissions

DThe query execution plan at runtime

Question 4 of 5

In Write-Ahead Logging (WAL), when is the log record guaranteed to be durable?

AWhen the dirty data page is flushed to disk

BWhen transaction locks are released

CAt COMMIT time — the WAL record is fsynced to durable storage

DAt the next checkpoint when dirty pages are flushed

Question 5 of 5

Which isolation property guarantees that a committed transaction's changes survive a server crash?

AAtomicity

BConsistency

CIsolation

DDurability

Interview Q&A

Common Interview Questions

What is the difference between a DBMS and a file system?▾

A file system provides raw byte storage organized in files and directories — no query language, no constraints, no transactions, no concurrent access control. A DBMS adds: a data model (relational, document, etc.), a declarative query language (SQL), integrity constraints (FK, NOT NULL), ACID transactions, concurrent multi-user access with locking/MVCC, and crash recovery via WAL. Use a file system for blobs and configuration; use a DBMS for structured, transactional data.

Explain the three-schema architecture.▾

Three levels: (1) External schema — user views, each showing a subset of data tailored to a role (HR sees salaries, developers see user IDs). (2) Conceptual schema — the global logical description of all tables, relationships, and constraints, independent of storage. (3) Internal schema — physical storage details: file organization, index structures, page sizes, compression. Mappings between levels enable independence: change disk layout without altering the logical schema, and vice versa.

What are the four ACID properties? Give a real-world example of each.▾

Atomicity: A bank transfer deducts $100 from Account A and credits Account B — if the credit fails, the debit is rolled back. No partial state. Consistency: A foreign key constraint ensures you can never have an order referencing a non-existent customer. The DB moves from one valid state to another. Isolation: Two users booking the last seat on a flight don't both succeed — the second sees the updated state after the first commits. Durability: After a COMMIT returns, the transaction survives a power failure. WAL ensures committed data is on durable storage.

What is data independence and why does it matter?▾

Data independence means application code doesn't need to change when storage or schema changes. Physical independence: you can switch from a heap file to a clustered B+tree index without modifying any SQL. The optimizer adapts. Logical independence: you can add a column or split a table (providing a view) without breaking existing queries. This matters because storage decisions and schema evolution happen constantly; coupling them to application code creates a maintenance disaster.

What is the role of a DBA in a production system?▾

A Database Administrator is responsible for: schema design and evolution (DDL migrations), performance tuning (index selection, query optimization, buffer pool sizing), backup and recovery planning (RPO/RTO targets, WAL archiving, PITR), security (user roles, column-level encryption, audit logging), capacity planning (storage growth, replication lag), and HA setup (primary/replica failover). In cloud-native environments the DBA role increasingly overlaps with SRE and data engineering.

How does MVCC differ from lock-based concurrency control?▾

Lock-based (2PL): readers block writers, writers block readers. High contention on hot rows. MVCC (Multi-Version Concurrency Control): each transaction sees a consistent snapshot of the database at its start time. Readers never block writers; writers never block readers. The DB keeps multiple versions of each row (xmin/xmax in PostgreSQL). Old versions are garbage-collected by VACUUM. PostgreSQL and Oracle use MVCC. MySQL InnoDB uses MVCC for reads but 2PL for writes. The tradeoff: MVCC needs periodic cleanup (VACUUM) and uses more storage for row versions.

What is a checkpoint in database recovery?▾

A checkpoint is a point at which all dirty pages in the buffer pool are flushed to disk and a checkpoint record is written to the WAL. On crash recovery, the DBMS only needs to replay WAL records after the last checkpoint — reducing recovery time. Without checkpoints, recovery would need to replay the entire WAL from the beginning. In PostgreSQL, checkpoint_completion_target and checkpoint_timeout control frequency. Frequent checkpoints = faster recovery but more I/O. Infrequent checkpoints = slower recovery but less write amplification.

When would you choose a NoSQL database over PostgreSQL?▾

Choose NoSQL when: (1) Schema flexibility is critical — document stores handle evolving, heterogeneous data structures without migrations. (2) Extreme scale with simple access patterns — Cassandra handles millions of writes per second by sacrificing joins and ACID. (3) Low-latency key lookups — Redis at sub-millisecond latency for caching, sessions, leaderboards. (4) Graph relationships — Neo4j handles multi-hop traversals more efficiently than SQL joins. PostgreSQL with JSONB handles most document-store use cases; start with PostgreSQL and switch NoSQL when you hit a specific limitation.