When you type a SQL query, the database does not just run it blindly. It reads the query, validates it, thinks about several ways it could be answered, picks the fastest approach, and then executes it. This topic pulls back the curtain on that process — from the moment you press Enter to the moment rows come back.
Think of the query optimizer as a travel planner. You say "get me from A to B." It considers multiple routes (execution plans), estimates how long each would take based on current conditions (table statistics — like traffic data), and picks the fastest route. A good optimizer can make the difference between a query that takes 10 seconds and one that takes 10 milliseconds. It uses statistics about how many rows are in each table and how data is distributed to make its estimates.
When two tables need to be joined, the database has three main strategies. Nested Loop: for each row on one side, look up the matching row on the other — simple, but slow for large tables unless there is an index. Hash Join: build a lookup table from the smaller side in memory, then quickly find matches — great for large joins. Merge Join: if both sides are already sorted in the same order, walk through them in sync — very efficient when the sort is "free" from an existing index.
The optimizer needs to guess how many rows a filter will return — for example, how many employees earn over $80,000. This guess is called a cardinality estimate. To make good guesses, the database maintains statistics: rough summaries of how data is distributed in each column (like a histogram). When statistics are stale (out of date after large inserts or deletes), the guesses are wrong, the optimizer picks a bad plan, and queries get slow. Running ANALYZE updates the statistics and usually fixes this.
A regular view is just a saved query — every time you query it, the database runs the full computation. A materialized view is a view whose results are pre-computed and stored on disk like a table. Querying it is instant. The trade-off: the stored results can become stale when the underlying tables change. You need to refresh the materialized view periodically. Best for expensive reports or dashboards where slightly outdated data is acceptable and the query itself takes seconds to compute.
The database reads your SQL text like a compiler reads source code. It checks for syntax errors, confirms that the table names and column names actually exist, verifies the data types make sense, and checks that you have permission to access those tables. If anything is wrong at this stage, you get an error before any data is even looked at.
Before planning, the database silently rewrites your query to make it complete and safe. If you referenced a view (a saved query with a name), it replaces the view name with the actual query behind it. If there are row-level security rules (for example, "this user can only see their own data"), those get injected here. The final query may look quite different internally from what you wrote — but the result is always the same.
This is where the database decides how to answer your query. It generates multiple candidate plans and estimates the cost of each one — counting disk reads, CPU work, and memory. It picks the cheapest plan. For queries that join many tables, the number of possible plans explodes exponentially, so the optimizer uses smart search strategies (like dynamic programming) to find a good plan quickly without trying every possibility.
The execution works like an assembly line. The plan is a tree of operations — scan a table, filter rows, join, sort, aggregate. Each step asks the step below it "give me the next row," processes it, and passes it up. This pull-based approach is called the Iterator model (sometimes called the Volcano model). It is memory-efficient because only a few rows are in-flight at any given moment. Modern analytical databases process larger batches at once for better speed.
The simplest join strategy: for each row on one side, go look up the matching rows on the other side. If you are looking up a customer's orders, this is like reading each customer's row, then going to find their orders. Without an index on the orders table, you scan the entire orders table for every customer — very slow. With an index, each lookup is fast. The optimizer uses this when one side is small and the other side has an index on the join column.
A smarter approach for large tables without indexes. In the first step (Build), the database reads the smaller table and builds a hash table in memory — like creating a quick-lookup dictionary. In the second step (Probe), it reads the larger table row by row and looks each row up in that dictionary. This is much faster than nested loop for large data because each lookup is nearly instant. The catch: the hash table (dictionary) needs to fit in memory. If it is too large, the database has to spill some of it to disk, which is slower.
If both tables are already sorted by the join column (because of an index), this strategy is very efficient. Imagine merging two sorted decks of cards — you just look at the top of each deck, match pairs, and advance. You walk through both tables simultaneously in order, picking up matching pairs. If the tables are not already sorted, the database has to sort them first, which adds cost. But if an index provides the sorted order for free, this join type needs minimal extra work and produces results in sorted order — potentially eliminating a later ORDER BY step.
You do not need to tell the database which join algorithm to use — the optimizer decides based on the situation. As a rough guide: Nested Loop with an index is best when one side is small (a few hundred rows) and the other has an index. Hash Join is best for large tables without useful indexes. Sort-Merge is best when both sides are already sorted or when sorted output is needed downstream. If the optimizer makes a bad choice (usually because its statistics are outdated), refresh them with ANALYZE and the plan usually improves.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT d.dept_name, COUNT(*), AVG(e.salary) FROM employees e JOIN departments d ON e.dept_id = d.dept_id WHERE e.hire_date > '2020-01-01' GROUP BY d.dept_name; -- Annotated output: -- HashAggregate (cost=245.20..247.70 rows=200 width=44) -- (actual time=12.4..12.6 rows=15 loops=1) -- -> Hash Join (cost=42.00..220.00 rows=1040 width=28) -- (actual time=1.2..9.8 rows=1040 loops=1) -- Hash Cond: (e.dept_id = d.dept_id) -- -> Seq Scan on employees (cost=0.00..160.00 rows=1040) -- (actual rows=1040) -- Filter: (hire_date > '2020-01-01') -- Rows Removed by Filter: 8960 -- -> Hash (cost=22.00..22.00 rows=200) -- Buckets: 1024 Batches: 1 Memory Usage: 18kB -- -> Seq Scan on departments (rows=200) -- -- Key signals: -- estimated 1040 = actual 1040: good estimate, plan is correct -- Rows Removed by Filter: 8960 of 10000 = 89.6% selectivity -- Memory Usage: 18kB — well within work_mem, no spill Force stats refresh if estimates are wrong ANALYZE employees; ANALYZE departments; Extended statistics for correlated columns CREATE STATISTICS emp_dept_stats ON dept_id, hire_date FROM employees;
Pick a query preset. Watch data packets flow upward through the plan tree — from table scans through joins to the final result.
The planner builds a tree where rows flow upward — leaves produce tuples, parent operators consume and transform them. The optimizer chooses operators (hash vs. merge join, seq vs. index scan) by minimizing estimated cost using statistics from pg_statistic.
Predicate pushdown is an optimization where the database moves your WHERE filter as early as possible in the execution plan — ideally right when reading the table. The idea is to throw away irrelevant rows as soon as possible, so expensive operations like joins and aggregations have less data to process. The database applies this automatically — you just write a good WHERE clause.
Imagine joining 1 million employees with 200 departments, then filtering for only the Engineering department. Without pushdown, the join produces 200 million row combinations and then throws most away. With pushdown, the database first filters down to the ~50,000 Engineering employees, then joins only those 50,000 with the 200 departments. The join is 20 times smaller. Same final result — but the work is dramatically reduced.
In large-scale analytics tools (like Spark, BigQuery, or Snowflake), predicate pushdown becomes even more powerful. If your data is partitioned by date (meaning each month's data is in a separate file), and you query only January, the database only reads January's file — it skips the rest entirely. This can reduce the amount of data read from gigabytes to megabytes. Designing your data partitioning around your most common filters is one of the highest-leverage optimizations in large-scale analytics.
If ANALYZE has not run after a large data load, the optimizer uses outdated row count estimates. A table that grew from 100K to 10M rows without ANALYZE may get Nested Loop joins instead of Hash joins — orders of magnitude slower. Run ANALYZE after bulk loads.
WHERE user_id = '12345' when user_id is BIGINT forces a cast on every row during a sequential scan. The optimizer cannot use an index because it does not know the cast will preserve order. Match parameter types to column types exactly.
EXPLAIN (without ANALYZE) shows only estimated costs. EXPLAIN ANALYZE shows actual row counts and time. A query with estimated 100 rows but actual 100,000 rows has bad statistics and will have a bad plan in production.
Hash joins and sorts that exceed work_mem spill to disk, often causing a 10-100x slowdown. Check for "Batches: N" in EXPLAIN output where N > 1 — this means disk spill. Increase work_mem for the session on expensive queries: SET work_mem = '256MB'.
GetNext() interface. The top-level operator (Result) calls GetNext() on its child, which calls GetNext() on its child, and so on. Data flows upward one tuple at a time. Benefits: memory-efficient (one tuple in flight per operator), composable, simple to implement. Drawback: poor CPU cache utilization due to virtual function calls per tuple. Modern databases (DuckDB, Snowflake) use vectorized execution: process batches of 1024 rows using SIMD, dramatically improving cache efficiency and throughput for OLAP workloads.USE INDEX in MySQL), join order, or join type. They should be a last resort — used only when: (1) you have confirmed with EXPLAIN ANALYZE that the optimizer chose a bad plan, (2) you have refreshed statistics and it still makes the wrong choice, (3) you understand exactly why the optimizer is wrong. Hints are brittle: they survive schema changes silently (the hint may force a plan that becomes even worse after a new index is added). Prefer fixing statistics, adding or removing indexes, or rewriting the query. PostgreSQL does not support hints natively (use the pg_hint_plan extension); MySQL does.REFRESH MATERIALIZED VIEW CONCURRENTLY in PostgreSQL (does not lock reads), scheduled job, or on-demand. Trade-off: write amplification (every base table update may trigger a refresh) vs read speed. Alternatives: Redis/Memcached cache, denormalized summary tables updated by triggers, or incremental view maintenance (supported in some OLAP systems).total_cost = seq_page_cost x pages_read + random_page_cost x random_reads + cpu_tuple_cost x rows + cpu_operator_cost x operations. Key parameters: seq_page_cost = 1.0 (baseline), random_page_cost = 4.0 (SSDs: lower to 1.1-2.0), cpu_tuple_cost = 0.01. If random_page_cost is too high for your SSD storage, the optimizer will prefer SeqScans over IndexScans even when the index is faster. Tune these constants to match your hardware: SET random_page_cost = 1.1 for NVMe. The planner then uses table statistics (pg_statistic) to estimate selectivity at each node.