Topic 07

Query Processing

When you type a SQL query, the database does not just run it blindly. It reads the query, validates it, thinks about several ways it could be answered, picks the fastest approach, and then executes it. This topic pulls back the curtain on that process — from the moment you press Enter to the moment rows come back.

PostgreSQLMySQLSQL Server BigQueryPresto

At a Glance

Core Concepts

Query Optimizer

Think of the query optimizer as a travel planner. You say "get me from A to B." It considers multiple routes (execution plans), estimates how long each would take based on current conditions (table statistics — like traffic data), and picks the fastest route. A good optimizer can make the difference between a query that takes 10 seconds and one that takes 10 milliseconds. It uses statistics about how many rows are in each table and how data is distributed to make its estimates.

Join Algorithms

When two tables need to be joined, the database has three main strategies. Nested Loop: for each row on one side, look up the matching row on the other — simple, but slow for large tables unless there is an index. Hash Join: build a lookup table from the smaller side in memory, then quickly find matches — great for large joins. Merge Join: if both sides are already sorted in the same order, walk through them in sync — very efficient when the sort is "free" from an existing index.

Statistics and Cardinality

The optimizer needs to guess how many rows a filter will return — for example, how many employees earn over $80,000. This guess is called a cardinality estimate. To make good guesses, the database maintains statistics: rough summaries of how data is distributed in each column (like a histogram). When statistics are stale (out of date after large inserts or deletes), the guesses are wrong, the optimizer picks a bad plan, and queries get slow. Running ANALYZE updates the statistics and usually fixes this.

Materialized Views

A regular view is just a saved query — every time you query it, the database runs the full computation. A materialized view is a view whose results are pre-computed and stored on disk like a table. Querying it is instant. The trade-off: the stored results can become stale when the underlying tables change. You need to refresh the materialized view periodically. Best for expensive reports or dashboards where slightly outdated data is acceptable and the query itself takes seconds to compute.

How It Works

Query Lifecycle

1

Parse

The database reads your SQL text like a compiler reads source code. It checks for syntax errors, confirms that the table names and column names actually exist, verifies the data types make sense, and checks that you have permission to access those tables. If anything is wrong at this stage, you get an error before any data is even looked at.

2

Rewrite (Rule System)

Before planning, the database silently rewrites your query to make it complete and safe. If you referenced a view (a saved query with a name), it replaces the view name with the actual query behind it. If there are row-level security rules (for example, "this user can only see their own data"), those get injected here. The final query may look quite different internally from what you wrote — but the result is always the same.

3

Plan (Optimize)

This is where the database decides how to answer your query. It generates multiple candidate plans and estimates the cost of each one — counting disk reads, CPU work, and memory. It picks the cheapest plan. For queries that join many tables, the number of possible plans explodes exponentially, so the optimizer uses smart search strategies (like dynamic programming) to find a good plan quickly without trying every possibility.

4

Execute (Iterator Model)

The execution works like an assembly line. The plan is a tree of operations — scan a table, filter rows, join, sort, aggregate. Each step asks the step below it "give me the next row," processes it, and passes it up. This pull-based approach is called the Iterator model (sometimes called the Volcano model). It is memory-efficient because only a few rows are in-flight at any given moment. Modern analytical databases process larger batches at once for better speed.

Join Algorithm Deep Dive

Choosing the Right Algorithm

1

Nested Loop Join

The simplest join strategy: for each row on one side, go look up the matching rows on the other side. If you are looking up a customer's orders, this is like reading each customer's row, then going to find their orders. Without an index on the orders table, you scan the entire orders table for every customer — very slow. With an index, each lookup is fast. The optimizer uses this when one side is small and the other side has an index on the join column.

2

Hash Join

A smarter approach for large tables without indexes. In the first step (Build), the database reads the smaller table and builds a hash table in memory — like creating a quick-lookup dictionary. In the second step (Probe), it reads the larger table row by row and looks each row up in that dictionary. This is much faster than nested loop for large data because each lookup is nearly instant. The catch: the hash table (dictionary) needs to fit in memory. If it is too large, the database has to spill some of it to disk, which is slower.

3

Sort-Merge Join

If both tables are already sorted by the join column (because of an index), this strategy is very efficient. Imagine merging two sorted decks of cards — you just look at the top of each deck, match pairs, and advance. You walk through both tables simultaneously in order, picking up matching pairs. If the tables are not already sorted, the database has to sort them first, which adds cost. But if an index provides the sorted order for free, this join type needs minimal extra work and produces results in sorted order — potentially eliminating a later ORDER BY step.

4

Optimizer Choice Heuristics

You do not need to tell the database which join algorithm to use — the optimizer decides based on the situation. As a rough guide: Nested Loop with an index is best when one side is small (a few hundred rows) and the other has an index. Hash Join is best for large tables without useful indexes. Sort-Merge is best when both sides are already sorted or when sorted output is needed downstream. If the optimizer makes a bad choice (usually because its statistics are outdated), refresh them with ANALYZE and the plan usually improves.

Hash Join

  • Build hash table from smaller relation
  • Probe with larger relation rows
  • O(n+m) time, O(min(n,m)) space
  • Best when one side fits in memory
  • Unsorted output
  • Spills to disk if memory exceeded

Merge Join

  • Both inputs must be sorted on join key
  • Walk both sorted lists in lockstep
  • O(n log n + m log m) if sorting needed
  • Best with sorted inputs or needed sort
  • Produces sorted output — eliminates ORDER BY
  • Can exploit existing index order

Nested Loop Join

  • For each outer row, scan inner table
  • O(n x m) without index, O(n log m) with index
  • Best when outer is tiny and inner has index
  • No memory requirement
  • Bad for large unsorted tables
  • Default when optimizer has no better option

Vectorized Execution

  • Process batches of 1024 rows at once
  • SIMD instructions operate on whole vectors
  • Better CPU cache utilization than Volcano
  • DuckDB, Snowflake, ClickHouse, Velox
  • 10-100x throughput vs tuple-at-a-time
  • Best for analytical (OLAP) workloads
PostgreSQL EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT d.dept_name, COUNT(*), AVG(e.salary)
FROM  employees e JOIN departments d ON e.dept_id = d.dept_id
WHERE e.hire_date > '2020-01-01'
GROUP BY d.dept_name;

-- Annotated output:
-- HashAggregate  (cost=245.20..247.70 rows=200 width=44)
--                (actual time=12.4..12.6 rows=15 loops=1)
--   ->  Hash Join  (cost=42.00..220.00 rows=1040 width=28)
--                  (actual time=1.2..9.8 rows=1040 loops=1)
--         Hash Cond: (e.dept_id = d.dept_id)
--         ->  Seq Scan on employees  (cost=0.00..160.00 rows=1040)
--                                    (actual rows=1040)
--               Filter: (hire_date > '2020-01-01')
--               Rows Removed by Filter: 8960
--         ->  Hash  (cost=22.00..22.00 rows=200)
--               Buckets: 1024  Batches: 1  Memory Usage: 18kB
--               ->  Seq Scan on departments (rows=200)
--
-- Key signals:
-- estimated 1040 = actual 1040: good estimate, plan is correct
-- Rows Removed by Filter: 8960 of 10000 = 89.6% selectivity
-- Memory Usage: 18kB — well within work_mem, no spill

Force stats refresh if estimates are wrong
ANALYZE employees;
ANALYZE departments;

Extended statistics for correlated columns
CREATE STATISTICS emp_dept_stats ON dept_id, hire_date
FROM employees;
Interactive Demo

Query Execution Plan Visualizer

Pick a query preset. Watch data packets flow upward through the plan tree — from table scans through joins to the final result.

Result Set returned to client Aggregate SUM / COUNT / GROUP BY Hash Join build hash table on smaller rel Seq Scan employees (full) Index Scan departments (dept_id) cost: 1200 cost: 980 cost: 42

The planner builds a tree where rows flow upward — leaves produce tuples, parent operators consume and transform them. The optimizer chooses operators (hash vs. merge join, seq vs. index scan) by minimizing estimated cost using statistics from pg_statistic.

Execution Plan Tree

Colored particles flow from leaf nodes (scans) upward through operators to Result. Cost shown on each node.
Predicate Pushdown

Filter Early, Join Less

1

What It Is

Predicate pushdown is an optimization where the database moves your WHERE filter as early as possible in the execution plan — ideally right when reading the table. The idea is to throw away irrelevant rows as soon as possible, so expensive operations like joins and aggregations have less data to process. The database applies this automatically — you just write a good WHERE clause.

2

Example: 20x Smaller Join

Imagine joining 1 million employees with 200 departments, then filtering for only the Engineering department. Without pushdown, the join produces 200 million row combinations and then throws most away. With pushdown, the database first filters down to the ~50,000 Engineering employees, then joins only those 50,000 with the 200 departments. The join is 20 times smaller. Same final result — but the work is dramatically reduced.

3

Distributed Systems Impact

In large-scale analytics tools (like Spark, BigQuery, or Snowflake), predicate pushdown becomes even more powerful. If your data is partitioned by date (meaning each month's data is in a separate file), and you query only January, the database only reads January's file — it skips the rest entirely. This can reduce the amount of data read from gigabytes to megabytes. Designing your data partitioning around your most common filters is one of the highest-leverage optimizations in large-scale analytics.

Anti-patterns

Stale statistics causing bad plans

If ANALYZE has not run after a large data load, the optimizer uses outdated row count estimates. A table that grew from 100K to 10M rows without ANALYZE may get Nested Loop joins instead of Hash joins — orders of magnitude slower. Run ANALYZE after bulk loads.

Implicit type casts defeating index usage

WHERE user_id = '12345' when user_id is BIGINT forces a cast on every row during a sequential scan. The optimizer cannot use an index because it does not know the cast will preserve order. Match parameter types to column types exactly.

Not using EXPLAIN ANALYZE before deploying

EXPLAIN (without ANALYZE) shows only estimated costs. EXPLAIN ANALYZE shows actual row counts and time. A query with estimated 100 rows but actual 100,000 rows has bad statistics and will have a bad plan in production.

Setting work_mem too low

Hash joins and sorts that exceed work_mem spill to disk, often causing a 10-100x slowdown. Check for "Batches: N" in EXPLAIN output where N > 1 — this means disk spill. Increase work_mem for the session on expensive queries: SET work_mem = '256MB'.

Quiz
Question 1 of 5
Hash join is preferred over nested loop join when:
ABoth tables are sorted on the join key
BNeither table fits in memory
COne side fits in memory and tables are large and unsorted
DAn index is available on the join column
Question 2 of 5
The query optimizer uses statistics to:
AGuarantee query correctness
BEstimate cardinality and choose the lowest-cost plan
CLock the relevant resources before execution
DCache query results for future identical queries
Question 3 of 5
A materialized view differs from a regular view in that it:
AIs recomputed on every query
BStores precomputed query results on disk
CCannot be indexed
DOnly works with GROUP BY aggregations
Question 4 of 5
In the Volcano/Iterator execution model, how do rows flow through the query plan?
AFrom the root node downward to leaf scans
BAll rows are loaded into memory, then processed
CEach operator pulls one row at a time from its children — bottom up
DEach operator pushes rows to its parent node
Question 5 of 5
What does predicate pushdown accomplish in a query plan?
ARewrites WHERE clauses into HAVING clauses
BMoves filters as early in the plan as possible, reducing rows entering expensive join nodes
CAutomatically creates indexes on filtered columns
DConverts joins to correlated subqueries for performance
Interview Q&A
What is the Volcano/Iterator model of query execution?
The Volcano model implements query execution as a tree of operators, each implementing a GetNext() interface. The top-level operator (Result) calls GetNext() on its child, which calls GetNext() on its child, and so on. Data flows upward one tuple at a time. Benefits: memory-efficient (one tuple in flight per operator), composable, simple to implement. Drawback: poor CPU cache utilization due to virtual function calls per tuple. Modern databases (DuckDB, Snowflake) use vectorized execution: process batches of 1024 rows using SIMD, dramatically improving cache efficiency and throughput for OLAP workloads.
How does the query optimizer handle join ordering for 3+ tables?
With n tables, there are n!/2 possible left-deep join orderings. For n=4 that is 12; for n=10 that is 1.8M. The optimizer uses dynamic programming (as in System R and PostgreSQL): start with single-table costs, then for each pair, triple, etc. find the cheapest plan using previously computed sub-plans. PostgreSQL uses this up to 8 tables, then switches to a genetic algorithm (GEQO) for larger queries. Heuristics: push filters early to reduce cardinality fast, join on equality predicates before inequality predicates, prefer smaller-to-larger ordering.
What is predicate pushdown and why is it important?
Predicate pushdown moves WHERE conditions as early as possible in the query plan — ideally into the table scan. Instead of joining 1M employees x 200 departments then filtering for dept='Engineering', the optimizer pushes the filter into the employees scan first — reducing input to the join from 1M to ~50K rows. The join is 20x smaller. In distributed systems (Spark, BigQuery), predicate pushdown to the storage layer (Parquet column pruning, partition elimination) can avoid reading irrelevant files entirely — reducing I/O by orders of magnitude.
What is a query plan hint and when should you use one?
Plan hints let you override the optimizer's choice: force an index (USE INDEX in MySQL), join order, or join type. They should be a last resort — used only when: (1) you have confirmed with EXPLAIN ANALYZE that the optimizer chose a bad plan, (2) you have refreshed statistics and it still makes the wrong choice, (3) you understand exactly why the optimizer is wrong. Hints are brittle: they survive schema changes silently (the hint may force a plan that becomes even worse after a new index is added). Prefer fixing statistics, adding or removing indexes, or rewriting the query. PostgreSQL does not support hints natively (use the pg_hint_plan extension); MySQL does.
When should you use a materialized view?
Use a materialized view when: (1) A complex query (multi-table join + aggregation) is run frequently and takes seconds. (2) The underlying data changes infrequently enough that stale reads are acceptable. (3) You need to index the result of a query (not possible with regular views). Refresh strategies: REFRESH MATERIALIZED VIEW CONCURRENTLY in PostgreSQL (does not lock reads), scheduled job, or on-demand. Trade-off: write amplification (every base table update may trigger a refresh) vs read speed. Alternatives: Redis/Memcached cache, denormalized summary tables updated by triggers, or incremental view maintenance (supported in some OLAP systems).
What is vectorized execution and how does it differ from Volcano?
Volcano processes one tuple at a time through the operator tree — high function call overhead per row. Vectorized execution (columnar) processes a batch (typically 1024 rows) per GetNext() call. Benefits: (1) CPU SIMD instructions can operate on arrays of values in parallel. (2) Better cache locality — a column of 1024 integers fits in L1/L2 cache. (3) Branch prediction improves for tight loops over arrays. (4) Operator overhead amortized over the batch. Result: 10-100x throughput improvement for scan-heavy OLAP workloads. DuckDB, ClickHouse, Snowflake, and Velox all use vectorized execution. For OLTP workloads with small result sets, the difference is smaller since few rows are processed per query.
What causes a plan regression and how do you diagnose one?
A plan regression is when the optimizer switches to a worse execution plan, often after: (1) A new index is added (optimizer may now choose it even when the old plan was better). (2) Data distribution changed significantly (stale statistics cause wrong cardinality estimates). (3) A PostgreSQL or MySQL version upgrade (optimizer behavior changed). (4) Parameter changes (work_mem, enable_hashjoin). Diagnosis: compare EXPLAIN ANALYZE before and after — look for plan node changes (SeqScan vs IndexScan, HashJoin vs NestedLoop) and check estimated vs actual row counts. Fix: run ANALYZE, create extended statistics for correlated columns, or use plan hints as a last resort.
How does the cost model in PostgreSQL work?
PostgreSQL's cost model estimates: total_cost = seq_page_cost x pages_read + random_page_cost x random_reads + cpu_tuple_cost x rows + cpu_operator_cost x operations. Key parameters: seq_page_cost = 1.0 (baseline), random_page_cost = 4.0 (SSDs: lower to 1.1-2.0), cpu_tuple_cost = 0.01. If random_page_cost is too high for your SSD storage, the optimizer will prefer SeqScans over IndexScans even when the index is faster. Tune these constants to match your hardware: SET random_page_cost = 1.1 for NVMe. The planner then uses table statistics (pg_statistic) to estimate selectivity at each node.
Further Reading
Previous
Indexing