Topic 06

Database Indexing

Without an index, a database query scans every single row in a table — like searching a phone book by reading every entry. An index is like the alphabetical tabs in that phone book: it lets the database jump directly to the right place. This topic explains how indexes work, when to use them, and why using the wrong one can make things worse.

PostgreSQL GiST/GIN/BRINMySQL InnoDB B+tree Redis Sorted SetElasticsearchMongoDB

At a Glance

Index Types

Choosing the Right Index

Not all indexes are B+trees. Choosing the right structure for the access pattern is the difference between a query that uses the index and one that silently falls back to a full scan.

B+Tree (default)

The most common index type. Imagine a very tall, well-organized filing cabinet that always stays balanced — no matter how many files you add or remove. Looking up a specific value is fast because you just navigate the tree top-to-bottom (typically 3–4 hops for millions of rows). Range queries (like "all orders between January and March") are also fast because the leaf level is sorted and linked together. Works for almost everything: equality checks, ranges, sorting, and prefix text searches like LIKE 'john%'.

Hash

A hash index works like a locker room: you hand over a key (your value), it instantly gives you the locker number (the row). Single lookups are very fast — but only for exact equality matches. You cannot ask "which lockers are between 50 and 100" — there is no ordering. Use hash indexes only when you need exact equality lookups on things like tokens or UUIDs and never need to search a range.

GIN (Inverted)

GIN (Generalized Inverted Index) is built for searching inside complex data like JSON, arrays, or text documents. Think of it like the index at the back of a book — it maps individual words or elements to which pages (rows) they appear in. If you store product metadata as JSON and want to find all products where color is "red," a GIN index can scan that in milliseconds. It takes longer to build and update than a B+tree, but it handles things a B+tree simply cannot.

BRIN (Block Range)

BRIN (Block Range Index) is an extremely small index — just a few kilobytes even for tables with billions of rows. It works by recording the minimum and maximum value for each physical section of the table on disk. If you query for a date range, it can quickly rule out sections of the table that fall outside your range without reading them. This only works if the column is naturally ordered on disk — like an auto-incrementing ID or timestamps in an append-only log table. It is useless for randomly ordered data like UUIDs.

B+Tree Internals

B+Tree Mechanics

The B+tree is the workhorse of database indexing. Unlike a binary search tree, each node in a B+tree can have dozens to hundreds of children (the "fan-out"), which keeps the height short even for billion-row tables. A typical B+tree on a modern database has a height of 3–4 for 100 million rows.

1

Node structure and fan-out

Think of the B+tree as a building directory. Each floor (level) has multiple offices (nodes), and each office has multiple doors pointing to the next floor down. The number of doors per office is called the fan-out — in a database, it is typically 100 to 500. Because each level branches that many ways, the tree stays shallow even for enormous tables: a tree with a fan-out of 500 can index 125 million rows in just three levels. This is why looking up a row takes only 3–4 disk reads regardless of table size.

2

Point lookup (exact match)

To find a specific value — like looking up a user by ID — the database starts at the top of the tree and follows pointers down toward the matching leaf. At each level, it checks which direction to go (left for smaller, right for larger). By the time it reaches the bottom (the leaf level), it has the exact row location. The whole journey is only 3–4 steps for most tables. The top levels of the tree are usually cached in memory, so often only the final step reads from disk.

3

Range scan

For a range query — like "all orders placed in January" — the database first finds the leaf where January starts (a normal point lookup), then walks forward through the sorted leaf level reading the next entry, and the next, until the range is exhausted. The leaf nodes are linked together like a chain, so this is very efficient. The amount of work is proportional to how many rows match, not the total size of the table. This is why range queries are fast with a B+tree but impossible with a hash index.

4

Insert and rebalancing

When a new row is inserted, the database finds the right leaf position in the tree and adds the new key there. If the leaf is full, it splits in two: the middle key moves up to the parent level, and the entries are divided between the two new leaves. If the parent is also full, the split ripples upward. In practice, splits are uncommon because each node has reserved some empty space (typically 10%) to absorb new entries without immediately splitting. Deletes leave small gaps, and the tree is not immediately reorganized.

5

Index fill factor and bloat

Over time, lots of deletes can leave index pages half-empty — a problem called index bloat. A half-empty page still takes the same amount of time to read as a full one, so bloat means you are doing extra work for no benefit. To clean this up, you can rebuild the index from scratch (REINDEX in PostgreSQL, OPTIMIZE TABLE in MySQL). On a busy production table this matters — a table with high delete rates (like a job queue or audit log) should have its index bloat monitored and periodically cleaned up.

Clustered Index

Clustered vs. Non-Clustered

The single most important structural distinction in indexing: whether the index IS the table, or whether it is a separate data structure that points to the table.

PropertyClustered IndexNon-Clustered Index
Row storageLeaf pages ARE the data pagesLeaf pages store (key, PK pointer)
Count per tableExactly oneUnlimited (within reason)
Range scans on index keyExtremely fast — sequential pagesFast on index, then random for heap fetch
Secondary lookup costNone — data is at the leafExtra hop to clustered index (key lookup)
Insert orderMust maintain sorted order — random PKs cause page splitsIndex maintains order; heap is unordered
MySQL InnoDBAlways the PRIMARY KEYAll other indexes (store PK, not row pointer)
PostgreSQLHeap is unordered; CLUSTER command physically reorders onceAll indexes point to heap tuple (ctid)

PostgreSQL note: PostgreSQL does not have "clustered indexes" in the InnoDB sense — all indexes point to the heap. The CLUSTER command physically reorders heap rows by an index at a single point in time, but new writes are not clustered. InnoDB's clustered index is a fundamental design choice that affects all secondary index lookups.

Composite Indexes

Column Order in Composite Indexes

A composite index on (A, B, C) sorts first by A, then B within each A value, then C within each (A,B) group. This structure has direct implications for which query predicates the index can satisfy.

1

Leftmost prefix rule

An index on (dept, hire_date, salary) is sorted first by department, then by hire date within each department, then by salary within each hire date. You can only use it efficiently if you are filtering by the columns from the left. So filtering by dept works. Filtering by dept AND hire_date works. But filtering by hire_date alone does not — the index is not sorted by hire_date globally, only within each department. Think of it like a phone book sorted by last name, then first name: you can find all "Smiths" quickly, but you cannot find all "Johns" without reading the whole thing.

2

Equality columns before range columns

When designing a composite index, put columns used in exact-match filters before columns used in range filters. For example: if you always filter by dept = 'Engineering' AND hire_date between two dates, put dept first. The exact match on dept narrows the search to one section of the index, then the range scan on hire_date within that section is small. If you reversed it (hire_date first), the range would span all hire dates across all departments — a much bigger search.

3

ORDER BY and GROUP BY with indexes

If your query sorts by the same columns that are in an index (in the same order), the database can read the index in sorted order and skip the sorting step entirely. For example, if you have an index on (dept, hire_date) and your query says ORDER BY dept, hire_date — the results come out already sorted as the database traverses the index. This can be a big win because sorting is expensive when it requires loading a lot of rows into memory.

4

Index skip scan

Modern databases have a trick: if the first column of a composite index only has a few distinct values (like a status column with "pending," "active," "closed"), they can scan the index multiple times — once per value — to answer a query that only filters on the second column. This "skip scan" avoids a full table scan when a full composite match is not available. It works best when the first column has very few distinct values; if there are thousands of distinct values, the trick becomes more expensive than just scanning the whole table.

Covering Index

Covering Indexes and INCLUDE Columns

A covering index is one that contains all the columns a query needs — the query can be answered entirely from the index without touching the table. This eliminates the most expensive part of many queries: the random I/O lookup back to the heap.

1

Traditional covering vs. INCLUDE

To make a covering index, you add extra columns to the index so the query never has to touch the actual table. One way is to put those columns directly in the index key — but that changes the sort order and can add overhead to every write. PostgreSQL offers a cleaner option: INCLUDE. Columns in INCLUDE are stored in the leaf pages (so a query can read them without going back to the table) but they do not affect the sort order of the index. Think of them as passengers in the index — they are there when needed but do not drive.

2

Index-only scan

When a covering index exists, the database can answer the entire query from the index without ever reading the original table rows. In EXPLAIN output, you will see "Index Only Scan" — this is the fastest possible outcome. The database does need to do a quick side check to confirm that all the rows in the index are visible to your transaction (because of MVCC), but running VACUUM regularly keeps that check cheap. An index-only scan is often 10x faster than even a normal index scan.

3

When to use INCLUDE

Use INCLUDE when a query always filters on certain columns AND always reads a few extra columns. For example, if your dashboard always queries orders by user and date, and always needs to display the status and total, add status and total with INCLUDE. This makes the index covering for that specific query. The index key stays clean (just user_id and created_at), but the leaf nodes carry status and total along so no table lookup is needed.

Partial & Expression Indexes

Targeted Index Strategies

1

Partial indexes

A partial index is an index on a subset of rows — only rows where a condition is true. For example, if a jobs table has millions of completed jobs but only a few thousand pending ones, you can create an index that only covers the pending ones. This index is tiny, always selective, and very fast to scan. Any query that filters on pending jobs can use it. The key insight: you index only what you actually query frequently, not the whole table.

2

Expression (functional) indexes

Sometimes you need to search by a transformed version of a column — like searching email addresses case-insensitively. If you store emails in mixed case and always query with LOWER(email) = '...', a regular index on email will not help — the index stores the original case, but your search uses the lowercase version. An expression index solves this: instead of indexing the raw column, you index the result of LOWER(email). Then queries that use LOWER(email) in their filter can use that index.

3

Unique partial indexes

You can combine partial indexes with a uniqueness constraint. For example, if you soft-delete users (keeping old records with a deleted_at timestamp), you want to enforce that no two active users share the same email — but it is fine if a deleted user's email is reused later. A unique partial index on email WHERE deleted_at IS NULL enforces uniqueness only among active rows. A regular UNIQUE constraint would reject the reuse entirely, even for deleted accounts.

Reading EXPLAIN

Understanding EXPLAIN ANALYZE

EXPLAIN shows the planner's estimated execution plan. EXPLAIN ANALYZE actually runs the query and shows real row counts and times. The gap between estimated and actual rows reveals stale statistics or planning mistakes.

SQL — Index Creation and EXPLAIN
-- Composite index: equality column first, range column second
CREATE INDEX idx_emp_dept_hire ON employees (dept, hire_date);

-- Covering index with INCLUDE (PG 11+)
-- For: SELECT salary WHERE dept='Eng' ORDER BY hire_date
CREATE INDEX idx_emp_covering ON employees (dept, hire_date) INCLUDE (salary);

-- Partial index: only index active, high-priority jobs
CREATE INDEX idx_pending_jobs ON jobs (priority DESC, created_at)
  WHERE status = 'pending';

-- Expression index for case-insensitive email lookup
CREATE INDEX idx_email_lower ON users (LOWER(email));

-- Hash index: equality-only, smaller than B+tree
CREATE INDEX idx_token_hash ON sessions USING HASH (token);

-- BRIN index: tiny, good for time-series tables
CREATE INDEX idx_events_brin ON events USING BRIN (created_at);

-- GIN for JSONB containment
CREATE INDEX idx_meta_gin ON products USING GIN (metadata);
-- Query: SELECT * FROM products WHERE metadata @> '{"color":"red"}'

-- EXPLAIN ANALYZE to verify index usage
EXPLAIN (FORMAT TEXT, ANALYZE, BUFFERS)
SELECT salary FROM employees
WHERE dept = 'Engineering'
  AND hire_date > '2022-01-01'
ORDER BY hire_date;
-- Good output: "Index Only Scan using idx_emp_covering"
-- Bad output:  "Seq Scan on employees" (index not used)
-- Also check: rows (estimated) vs. rows (actual) — if far apart, run ANALYZE

-- Find unused indexes (waste write performance)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
Interactive Demo

B+Tree Traversal

Enter a search key and watch the B+tree traversal animate — root to internal node to leaf. Toggle the index off to see the full table scan comparison.

Root [ 40 | 80 ] Internal [ 20 | 30 ] Internal [ 50 | 65 ] Internal [ 85 | 95 ] Leaf [10|15|18] Leaf [20|25|30] Leaf [40|50|60] Leaf [65|70|75] Leaf [80|85|95|99]

B+tree search is O(log N) — at most 3-4 page reads for a billion-row table. All values live in leaves, linked in sorted order for efficient range scans. Internal nodes only store routing keys.

B+Tree Index Search

Purple = traversal path. Green border = leaf nodes. Linked leaves enable range scan without re-traversing the tree.
Comparison

Index Type Reference

TypeEqualityRangeORDER BYPrefix LIKEArray / JSONBSize
B+TreeYesYesYesYesNoMedium
HashYes (O(1))NoNoNoNoSmall
GINYesNoNoNoYes (containment)Large
GiSTYesSpatialNoNoGeometricMedium
BRINApproxApproxNoNoNoTiny
Full-text (tsvector)YesNoBy rankYesNoLarge
Anti-patterns

Indexing low-selectivity columns

An index on a boolean column with 50/50 distribution is rarely used — the planner will choose a full scan when more than ~5–10% of rows match. Partial indexes on the rare value are effective: CREATE INDEX WHERE is_deleted = true indexes only the small minority of deleted rows, giving high selectivity for queries that filter on deleted items.

Functions on indexed columns in WHERE

WHERE LOWER(email) = 'alice@example.com' — the B+tree index on email is NOT used because the index stores the raw value, not LOWER(email). Fix: create an expression index CREATE INDEX ON users (LOWER(email)), or store the pre-lowercased value in a column and index that.

Over-indexing write-heavy tables

Each index on a table requires a B+tree update on every INSERT, UPDATE (to indexed columns), and DELETE. A table with 15 indexes can have 5–10x slower write throughput. Profile writes under load before adding indexes. Remove unused indexes using pg_stat_user_indexes.idx_scan = 0.

Wrong composite column order

Index (status, created_at) serves queries filtering by status. Index (created_at, status) serves range queries on created_at. Reversing the order for your actual query pattern means the optimizer falls back to a full scan. Rule: equality predicates come first, range predicates last, ORDER BY columns last if they align.

Using LIKE '%prefix' (leading wildcard)

WHERE name LIKE '%smith' cannot use a B+tree index because the leading wildcard means the search key is not bounded from the left — the index sorts by the leftmost character. Only LIKE 'smith%' (trailing wildcard) is B+tree-compatible. Full-text or trigram indexes (pg_trgm in PostgreSQL) handle leading wildcard searches efficiently.

Quiz
Question 1 of 5
In a B+tree index, where are actual data pointers (row locations) stored?
ARoot node only
BInternal nodes
CLeaf nodes
DRoot and leaf nodes equally
Question 2 of 5
A covering index means:
AThe index covers all tables in the schema
BAll columns needed by a query exist in the index — no table lookup required
CThe index has 100% fill factor
DThe index only covers primary key columns
Question 3 of 5
An index on (dept, salary). Which query will use this index efficiently?
AWHERE salary > 80000
BWHERE dept = 'Engineering' AND salary > 80000
CWHERE salary LIKE '9%'
DORDER BY salary DESC
Question 4 of 5
Which index type is best for a JSONB containment query like WHERE metadata @> '{"color":"red"}'?
AB+tree
BHash
CGIN
DBRIN
Question 5 of 5
Why does WHERE LOWER(email) = 'alice@example.com' NOT use a B+tree index on the email column?
AEmail columns cannot be indexed
BThe index stores raw values; LOWER() transforms the search key so the index key ordering no longer matches
CLOWER() is not supported in SQL WHERE clauses
DB+tree indexes are case-sensitive and cannot handle lowercase comparisons
Interview Q&A
What is the difference between a clustered and non-clustered index?
A clustered index determines the physical order of rows on disk — the B+tree leaf pages ARE the data pages. In InnoDB, the primary key is always the clustered index. There can be only one per table. Accessing a row by PK requires no extra lookup — the data is right at the leaf. A non-clustered index is a separate B+tree whose leaf pages store (indexed column, PK) pairs. To get row data, you first look up the index entry (getting the PK), then do a second B+tree traversal of the clustered index to get the row. This second lookup is a "key lookup" or "bookmark lookup." Covering indexes avoid this second hop by including all needed columns in the index. PostgreSQL does not have clustered indexes in this sense — all indexes are heap pointers (ctids).
When should you use a composite index versus separate single-column indexes?
Use a composite index when queries frequently filter on multiple columns together (e.g., always filtering by dept AND hire_date). The composite index supports the combined predicate in a single B+tree traversal and can sort the result. Separate indexes on each column require the optimizer to use "index merge" — scanning both indexes and intersecting the result sets — which is often slower than a full table scan. Use separate indexes when: columns are queried independently in different queries with no consistent pairing; or cardinality of one column is so high that a single-column index gives excellent selectivity alone. The key question: do your queries always combine these columns, or sometimes use them independently?
How does the query planner decide whether to use an index or do a full table scan?
The planner uses a cost model: it estimates the cost of each candidate plan and picks the cheapest. Cost is in abstract units combining CPU cost and I/O cost. For an index scan: cost = (index height traversals) + (number of matching rows × random_page_cost). For a seq scan: cost = (total table pages × seq_page_cost). Key factors: (1) Selectivity — if the condition matches 40% of rows, random I/O for each row is more expensive than a sequential scan. Typical crossover is 5–15% of rows. (2) Correlation — if a column is physically ordered by the index (e.g., auto-increment ID), random I/O cost is lower because pages are read sequentially. PostgreSQL tracks correlation in pg_stats.correlation. (3) Available memory — if the result fits in work_mem, a sort + scan may beat an index. Use EXPLAIN to verify: always look at estimated vs. actual rows — if they diverge, statistics are stale (run ANALYZE).
Explain partial indexes with a practical example.
A partial index stores only rows matching a WHERE predicate. Example: an orders table with 100M rows, where 99% have status = 'completed' and 1% have status = 'pending'. An index on status has terrible selectivity (two values, 50/50 for pending/completed). A partial index: CREATE INDEX idx_pending ON orders (created_at) WHERE status = 'pending' — now the index is tiny (1M rows), highly selective, and fast for the most time-critical query (finding pending orders). The query must include WHERE status = 'pending' for the planner to consider this index. Other partial index patterns: index only non-null values, index only undeleted rows, index only the hot partition of a time-series table.
What is index bloat and how do you fix it?
Index bloat occurs when deleted or updated rows leave "dead" entries in the B+tree that are never compacted. A page with 50% dead entries still costs a full I/O to read. PostgreSQL's VACUUM marks dead tuples and makes space reusable, but does not compact pages or remove empty pages from the B+tree (VACUUM only re-links free space within existing pages). To actually shrink the index: run REINDEX TABLE tablename (locks the table) or REINDEX CONCURRENTLY INDEX indexname (no lock, PG 12+). MySQL's OPTIMIZE TABLE rebuilds clustered and secondary indexes. Detect bloat: in PostgreSQL use pgstatindex() — if avg_leaf_density is below 50%, bloat is significant. Monitor tables with high DELETE rates (log tables, soft-delete tables, job queues) especially.
What are GiST and GIN indexes? When would you choose each?
GiST (Generalized Search Tree) is a framework for building balanced indexes for complex data types. It supports geometric operations (PostGIS point/polygon intersection), range types (int4range, tsrange overlap), full-text search (tsvector). GiST supports ordering operators and nearest-neighbor queries (ORDER BY point <-> target LIMIT 10). GIN (Generalized Inverted Index) builds an inverted index — maps element values to the set of rows containing them. Ideal for JSONB containment (@>), array overlap (&&), and full-text search. GIN queries faster than GiST for containment. GIN is larger and slower to update (writes are deferred to a "pending list" and merged in background). Choose GIN for text search and JSONB; GiST for geometric/spatial queries, range types, and nearest-neighbor.
How do you find and remove unused indexes?
PostgreSQL tracks index usage in pg_stat_user_indexes.idx_scan — the cumulative count of times this index was used for a scan since last statistics reset. Indexes with idx_scan = 0 after representative workload are candidates for removal. Also check: pg_relation_size(indexrelid) to know how much storage unused indexes consume; and pg_stat_user_tables.n_tup_ins/upd/del to understand the write overhead each unused index is adding. Caution before dropping: (1) statistics reset after PostgreSQL restart, so a short observation window may miss infrequent queries; (2) some indexes enforce uniqueness and are vital even with idx_scan = 0; (3) partial indexes for maintenance jobs may scan rarely but are still needed. Safe removal: rename the index first (ALTER INDEX name RENAME TO name_to_drop), wait 1–2 weeks, then drop if no complaints.
What is a BRIN index and when is it appropriate?
BRIN (Block Range Index) stores the minimum and maximum value of a column for each range of disk blocks (default 128 pages per range). The index is tiny — just a few kilobytes even for billion-row tables — because it only stores summaries. For a range query, BRIN eliminates block ranges whose min/max cannot contain the search value, then scans the remaining blocks. This is only useful when column values are correlated with physical disk order — e.g., auto-increment IDs (always increasing, so old rows are on old blocks), or append-only timestamp columns in a time-series table. For randomly distributed values (UUID PKs, hashed values), BRIN provides no filtering benefit — every block range's min/max will overlap every query range. Do not use BRIN where a B+tree would be used; use it for large append-only tables where you need very fast writes and can tolerate sequential-scan-like reads bounded by block ranges.
Further Reading
Previous
Transactions & ACID