Without an index, a database query scans every single row in a table — like searching a phone book by reading every entry. An index is like the alphabetical tabs in that phone book: it lets the database jump directly to the right place. This topic explains how indexes work, when to use them, and why using the wrong one can make things worse.
Not all indexes are B+trees. Choosing the right structure for the access pattern is the difference between a query that uses the index and one that silently falls back to a full scan.
The most common index type. Imagine a very tall, well-organized filing cabinet that always stays balanced — no matter how many files you add or remove. Looking up a specific value is fast because you just navigate the tree top-to-bottom (typically 3–4 hops for millions of rows). Range queries (like "all orders between January and March") are also fast because the leaf level is sorted and linked together. Works for almost everything: equality checks, ranges, sorting, and prefix text searches like LIKE 'john%'.
A hash index works like a locker room: you hand over a key (your value), it instantly gives you the locker number (the row). Single lookups are very fast — but only for exact equality matches. You cannot ask "which lockers are between 50 and 100" — there is no ordering. Use hash indexes only when you need exact equality lookups on things like tokens or UUIDs and never need to search a range.
GIN (Generalized Inverted Index) is built for searching inside complex data like JSON, arrays, or text documents. Think of it like the index at the back of a book — it maps individual words or elements to which pages (rows) they appear in. If you store product metadata as JSON and want to find all products where color is "red," a GIN index can scan that in milliseconds. It takes longer to build and update than a B+tree, but it handles things a B+tree simply cannot.
BRIN (Block Range Index) is an extremely small index — just a few kilobytes even for tables with billions of rows. It works by recording the minimum and maximum value for each physical section of the table on disk. If you query for a date range, it can quickly rule out sections of the table that fall outside your range without reading them. This only works if the column is naturally ordered on disk — like an auto-incrementing ID or timestamps in an append-only log table. It is useless for randomly ordered data like UUIDs.
The B+tree is the workhorse of database indexing. Unlike a binary search tree, each node in a B+tree can have dozens to hundreds of children (the "fan-out"), which keeps the height short even for billion-row tables. A typical B+tree on a modern database has a height of 3–4 for 100 million rows.
Think of the B+tree as a building directory. Each floor (level) has multiple offices (nodes), and each office has multiple doors pointing to the next floor down. The number of doors per office is called the fan-out — in a database, it is typically 100 to 500. Because each level branches that many ways, the tree stays shallow even for enormous tables: a tree with a fan-out of 500 can index 125 million rows in just three levels. This is why looking up a row takes only 3–4 disk reads regardless of table size.
To find a specific value — like looking up a user by ID — the database starts at the top of the tree and follows pointers down toward the matching leaf. At each level, it checks which direction to go (left for smaller, right for larger). By the time it reaches the bottom (the leaf level), it has the exact row location. The whole journey is only 3–4 steps for most tables. The top levels of the tree are usually cached in memory, so often only the final step reads from disk.
For a range query — like "all orders placed in January" — the database first finds the leaf where January starts (a normal point lookup), then walks forward through the sorted leaf level reading the next entry, and the next, until the range is exhausted. The leaf nodes are linked together like a chain, so this is very efficient. The amount of work is proportional to how many rows match, not the total size of the table. This is why range queries are fast with a B+tree but impossible with a hash index.
When a new row is inserted, the database finds the right leaf position in the tree and adds the new key there. If the leaf is full, it splits in two: the middle key moves up to the parent level, and the entries are divided between the two new leaves. If the parent is also full, the split ripples upward. In practice, splits are uncommon because each node has reserved some empty space (typically 10%) to absorb new entries without immediately splitting. Deletes leave small gaps, and the tree is not immediately reorganized.
Over time, lots of deletes can leave index pages half-empty — a problem called index bloat. A half-empty page still takes the same amount of time to read as a full one, so bloat means you are doing extra work for no benefit. To clean this up, you can rebuild the index from scratch (REINDEX in PostgreSQL, OPTIMIZE TABLE in MySQL). On a busy production table this matters — a table with high delete rates (like a job queue or audit log) should have its index bloat monitored and periodically cleaned up.
The single most important structural distinction in indexing: whether the index IS the table, or whether it is a separate data structure that points to the table.
| Property | Clustered Index | Non-Clustered Index |
|---|---|---|
| Row storage | Leaf pages ARE the data pages | Leaf pages store (key, PK pointer) |
| Count per table | Exactly one | Unlimited (within reason) |
| Range scans on index key | Extremely fast — sequential pages | Fast on index, then random for heap fetch |
| Secondary lookup cost | None — data is at the leaf | Extra hop to clustered index (key lookup) |
| Insert order | Must maintain sorted order — random PKs cause page splits | Index maintains order; heap is unordered |
| MySQL InnoDB | Always the PRIMARY KEY | All other indexes (store PK, not row pointer) |
| PostgreSQL | Heap is unordered; CLUSTER command physically reorders once | All indexes point to heap tuple (ctid) |
PostgreSQL note: PostgreSQL does not have "clustered indexes" in the InnoDB sense — all indexes point to the heap. The CLUSTER command physically reorders heap rows by an index at a single point in time, but new writes are not clustered. InnoDB's clustered index is a fundamental design choice that affects all secondary index lookups.
A composite index on (A, B, C) sorts first by A, then B within each A value, then C within each (A,B) group. This structure has direct implications for which query predicates the index can satisfy.
An index on (dept, hire_date, salary) is sorted first by department, then by hire date within each department, then by salary within each hire date. You can only use it efficiently if you are filtering by the columns from the left. So filtering by dept works. Filtering by dept AND hire_date works. But filtering by hire_date alone does not — the index is not sorted by hire_date globally, only within each department. Think of it like a phone book sorted by last name, then first name: you can find all "Smiths" quickly, but you cannot find all "Johns" without reading the whole thing.
When designing a composite index, put columns used in exact-match filters before columns used in range filters. For example: if you always filter by dept = 'Engineering' AND hire_date between two dates, put dept first. The exact match on dept narrows the search to one section of the index, then the range scan on hire_date within that section is small. If you reversed it (hire_date first), the range would span all hire dates across all departments — a much bigger search.
If your query sorts by the same columns that are in an index (in the same order), the database can read the index in sorted order and skip the sorting step entirely. For example, if you have an index on (dept, hire_date) and your query says ORDER BY dept, hire_date — the results come out already sorted as the database traverses the index. This can be a big win because sorting is expensive when it requires loading a lot of rows into memory.
Modern databases have a trick: if the first column of a composite index only has a few distinct values (like a status column with "pending," "active," "closed"), they can scan the index multiple times — once per value — to answer a query that only filters on the second column. This "skip scan" avoids a full table scan when a full composite match is not available. It works best when the first column has very few distinct values; if there are thousands of distinct values, the trick becomes more expensive than just scanning the whole table.
A covering index is one that contains all the columns a query needs — the query can be answered entirely from the index without touching the table. This eliminates the most expensive part of many queries: the random I/O lookup back to the heap.
To make a covering index, you add extra columns to the index so the query never has to touch the actual table. One way is to put those columns directly in the index key — but that changes the sort order and can add overhead to every write. PostgreSQL offers a cleaner option: INCLUDE. Columns in INCLUDE are stored in the leaf pages (so a query can read them without going back to the table) but they do not affect the sort order of the index. Think of them as passengers in the index — they are there when needed but do not drive.
When a covering index exists, the database can answer the entire query from the index without ever reading the original table rows. In EXPLAIN output, you will see "Index Only Scan" — this is the fastest possible outcome. The database does need to do a quick side check to confirm that all the rows in the index are visible to your transaction (because of MVCC), but running VACUUM regularly keeps that check cheap. An index-only scan is often 10x faster than even a normal index scan.
Use INCLUDE when a query always filters on certain columns AND always reads a few extra columns. For example, if your dashboard always queries orders by user and date, and always needs to display the status and total, add status and total with INCLUDE. This makes the index covering for that specific query. The index key stays clean (just user_id and created_at), but the leaf nodes carry status and total along so no table lookup is needed.
A partial index is an index on a subset of rows — only rows where a condition is true. For example, if a jobs table has millions of completed jobs but only a few thousand pending ones, you can create an index that only covers the pending ones. This index is tiny, always selective, and very fast to scan. Any query that filters on pending jobs can use it. The key insight: you index only what you actually query frequently, not the whole table.
Sometimes you need to search by a transformed version of a column — like searching email addresses case-insensitively. If you store emails in mixed case and always query with LOWER(email) = '...', a regular index on email will not help — the index stores the original case, but your search uses the lowercase version. An expression index solves this: instead of indexing the raw column, you index the result of LOWER(email). Then queries that use LOWER(email) in their filter can use that index.
You can combine partial indexes with a uniqueness constraint. For example, if you soft-delete users (keeping old records with a deleted_at timestamp), you want to enforce that no two active users share the same email — but it is fine if a deleted user's email is reused later. A unique partial index on email WHERE deleted_at IS NULL enforces uniqueness only among active rows. A regular UNIQUE constraint would reject the reuse entirely, even for deleted accounts.
EXPLAIN shows the planner's estimated execution plan. EXPLAIN ANALYZE actually runs the query and shows real row counts and times. The gap between estimated and actual rows reveals stale statistics or planning mistakes.
-- Composite index: equality column first, range column second CREATE INDEX idx_emp_dept_hire ON employees (dept, hire_date); -- Covering index with INCLUDE (PG 11+) -- For: SELECT salary WHERE dept='Eng' ORDER BY hire_date CREATE INDEX idx_emp_covering ON employees (dept, hire_date) INCLUDE (salary); -- Partial index: only index active, high-priority jobs CREATE INDEX idx_pending_jobs ON jobs (priority DESC, created_at) WHERE status = 'pending'; -- Expression index for case-insensitive email lookup CREATE INDEX idx_email_lower ON users (LOWER(email)); -- Hash index: equality-only, smaller than B+tree CREATE INDEX idx_token_hash ON sessions USING HASH (token); -- BRIN index: tiny, good for time-series tables CREATE INDEX idx_events_brin ON events USING BRIN (created_at); -- GIN for JSONB containment CREATE INDEX idx_meta_gin ON products USING GIN (metadata); -- Query: SELECT * FROM products WHERE metadata @> '{"color":"red"}' -- EXPLAIN ANALYZE to verify index usage EXPLAIN (FORMAT TEXT, ANALYZE, BUFFERS) SELECT salary FROM employees WHERE dept = 'Engineering' AND hire_date > '2022-01-01' ORDER BY hire_date; -- Good output: "Index Only Scan using idx_emp_covering" -- Bad output: "Seq Scan on employees" (index not used) -- Also check: rows (estimated) vs. rows (actual) — if far apart, run ANALYZE -- Find unused indexes (waste write performance) SELECT schemaname, tablename, indexname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0 ORDER BY pg_relation_size(indexrelid) DESC;
Enter a search key and watch the B+tree traversal animate — root to internal node to leaf. Toggle the index off to see the full table scan comparison.
B+tree search is O(log N) — at most 3-4 page reads for a billion-row table. All values live in leaves, linked in sorted order for efficient range scans. Internal nodes only store routing keys.
| Type | Equality | Range | ORDER BY | Prefix LIKE | Array / JSONB | Size |
|---|---|---|---|---|---|---|
| B+Tree | Yes | Yes | Yes | Yes | No | Medium |
| Hash | Yes (O(1)) | No | No | No | No | Small |
| GIN | Yes | No | No | No | Yes (containment) | Large |
| GiST | Yes | Spatial | No | No | Geometric | Medium |
| BRIN | Approx | Approx | No | No | No | Tiny |
| Full-text (tsvector) | Yes | No | By rank | Yes | No | Large |
An index on a boolean column with 50/50 distribution is rarely used — the planner will choose a full scan when more than ~5–10% of rows match. Partial indexes on the rare value are effective: CREATE INDEX WHERE is_deleted = true indexes only the small minority of deleted rows, giving high selectivity for queries that filter on deleted items.
WHERE LOWER(email) = 'alice@example.com' — the B+tree index on email is NOT used because the index stores the raw value, not LOWER(email). Fix: create an expression index CREATE INDEX ON users (LOWER(email)), or store the pre-lowercased value in a column and index that.
Each index on a table requires a B+tree update on every INSERT, UPDATE (to indexed columns), and DELETE. A table with 15 indexes can have 5–10x slower write throughput. Profile writes under load before adding indexes. Remove unused indexes using pg_stat_user_indexes.idx_scan = 0.
Index (status, created_at) serves queries filtering by status. Index (created_at, status) serves range queries on created_at. Reversing the order for your actual query pattern means the optimizer falls back to a full scan. Rule: equality predicates come first, range predicates last, ORDER BY columns last if they align.
WHERE name LIKE '%smith' cannot use a B+tree index because the leading wildcard means the search key is not bounded from the left — the index sorts by the leftmost character. Only LIKE 'smith%' (trailing wildcard) is B+tree-compatible. Full-text or trigram indexes (pg_trgm in PostgreSQL) handle leading wildcard searches efficiently.
pg_stats.correlation. (3) Available memory — if the result fits in work_mem, a sort + scan may beat an index. Use EXPLAIN to verify: always look at estimated vs. actual rows — if they diverge, statistics are stale (run ANALYZE).CREATE INDEX idx_pending ON orders (created_at) WHERE status = 'pending' — now the index is tiny (1M rows), highly selective, and fast for the most time-critical query (finding pending orders). The query must include WHERE status = 'pending' for the planner to consider this index. Other partial index patterns: index only non-null values, index only undeleted rows, index only the hot partition of a time-series table.REINDEX TABLE tablename (locks the table) or REINDEX CONCURRENTLY INDEX indexname (no lock, PG 12+). MySQL's OPTIMIZE TABLE rebuilds clustered and secondary indexes. Detect bloat: in PostgreSQL use pgstatindex() — if avg_leaf_density is below 50%, bloat is significant. Monitor tables with high DELETE rates (log tables, soft-delete tables, job queues) especially.ORDER BY point <-> target LIMIT 10). GIN (Generalized Inverted Index) builds an inverted index — maps element values to the set of rows containing them. Ideal for JSONB containment (@>), array overlap (&&), and full-text search. GIN queries faster than GiST for containment. GIN is larger and slower to update (writes are deferred to a "pending list" and merged in background). Choose GIN for text search and JSONB; GiST for geometric/spatial queries, range types, and nearest-neighbor.pg_stat_user_indexes.idx_scan — the cumulative count of times this index was used for a scan since last statistics reset. Indexes with idx_scan = 0 after representative workload are candidates for removal. Also check: pg_relation_size(indexrelid) to know how much storage unused indexes consume; and pg_stat_user_tables.n_tup_ins/upd/del to understand the write overhead each unused index is adding. Caution before dropping: (1) statistics reset after PostgreSQL restart, so a short observation window may miss infrequent queries; (2) some indexes enforce uniqueness and are vital even with idx_scan = 0; (3) partial indexes for maintenance jobs may scan rarely but are still needed. Safe removal: rename the index first (ALTER INDEX name RENAME TO name_to_drop), wait 1–2 weeks, then drop if no complaints.