SQL is the language you use to ask questions of a database. Think of it like talking to a very precise librarian who can fetch any combination of records, sort them, group them, or summarize them — as long as you phrase your question correctly. This topic covers how to write those questions well.
A SQL query is not run in the order you write it. The database first figures out which table to read (FROM), then filters rows (WHERE), then groups and counts (GROUP BY), then filters groups (HAVING), then selects the columns you want (SELECT), then sorts (ORDER BY), then limits the output (LIMIT). This is the logical execution order — and it explains why you cannot use a column alias from SELECT inside a WHERE clause. The WHERE runs before SELECT, so the alias does not exist yet.
Imagine you want each employee's salary AND the average salary for their department in the same row. GROUP BY cannot do this — it collapses all employees into one row per department. Window functions solve this: they calculate across a group of rows but keep every individual row intact. Think of PARTITION BY as "draw a box around each department," and ORDER BY as "rank people within that box." The result: every employee's row now also includes a department average, a rank, or a running total.
A CTE (Common Table Expression) is a way to give a subquery a name and define it at the top of your query, before the main SELECT. Think of it like assigning a variable in code — instead of burying a complex subquery inside your SELECT, you write it once at the top with a clear name and then use that name in the query below. This makes complex queries much easier to read. CTEs also unlock recursive queries, which can walk through tree structures like an org chart or a set of folder paths.
When a query is slow, you need to see what the database is actually doing. EXPLAIN shows the plan the database intends to follow — like the blueprint before the building. EXPLAIN ANALYZE actually runs the query and shows what really happened: how many rows were read, how long each step took, and whether the database's guesses about row counts were accurate. This is the most important tool for diagnosing why a query is slow.
The very first thing the database does is figure out which tables to read and stitch them together. All the joins happen here, creating one big combined table in memory. This is why you can reference tables from a JOIN in your WHERE clause — they already exist. But a column alias you define in SELECT does not exist yet — SELECT has not happened yet.
Now the database goes through every row and throws out ones that do not match your filter condition. The sooner you filter, the less work the later steps have to do. This is the most impactful place to speed up a query — a good WHERE clause with a good index can reduce 10 million rows to 100 before any grouping happens. Note: you cannot filter on counts or averages here — that is what HAVING is for.
When you want totals, counts, or averages — this is the step where rows get bucketed into groups. All employees in the same department become one group. Then COUNT(*), SUM(salary), AVG(salary) and similar functions calculate one result per group. In SELECT, every column must either be part of the grouping or be inside an aggregate — otherwise the database would not know which row's value to show.
WHERE filters individual rows. HAVING filters groups — it runs after GROUP BY, so it can see the results of aggregations. For example, "show me only departments where the average salary is above $70,000" — that average does not exist until after GROUP BY, so it belongs in HAVING, not WHERE.
Only now does the database pick which columns to include in the result. Window functions also run at this stage — they can see all the grouped data and add things like "rank within group" or "running total" to each row without collapsing them into fewer rows. This is why window functions are so powerful: they add context to each row without losing the row.
Finally, the result is sorted and trimmed to the requested size. LIMIT without ORDER BY is dangerous — the database picks rows in whatever order is convenient, which can change between runs. Always pair LIMIT with ORDER BY if you care about which rows you get. And if you are paginating through many pages, avoid skipping thousands of rows — that is slow. Instead, remember the last ID you saw and filter from there next time.
Running salary total per department, rank within each dept SELECT emp_id, name, dept, salary, SUM(salary) OVER (PARTITION BY dept ORDER BY hire_date) AS running_total, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank, LAG(salary,1) OVER (PARTITION BY dept ORDER BY hire_date) AS prev_salary, salary - AVG(salary) OVER (PARTITION BY dept) AS diff_from_avg FROM employees; CTE + recursive (org chart — depth-first traversal) WITH RECURSIVE org AS ( SELECT emp_id, name, manager_id, 0 AS depth FROM employees WHERE manager_id IS NULL anchor: root UNION ALL SELECT e.emp_id, e.name, e.manager_id, o.depth + 1 FROM employees e JOIN org o ON e.manager_id = o.emp_id recursive step ) SELECT repeat(' ', depth) || name AS org_tree, depth FROM org ORDER BY depth, name; EXISTS: customers who made a high-value order (correlated) SELECT * FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.id references outer query: correlated AND o.total > 1000 ); Keyset pagination: O(log n) at any page depth SELECT id, name, created_at FROM orders WHERE id > :last_seen_id index seek ORDER BY id LIMIT 20;
The EXPLAIN output is a tree structure. The deepest, most-indented lines are the first steps — usually reading from a table or using an index. The top line is the last step — returning the final result. Execution flows from the bottom up. When looking for where time is being wasted, start at the bottom and look for steps that read many more rows than expected, or that have a long execution time.
EXPLAIN ANALYZE shows two numbers for each step: how many rows the database guessed it would process (estimated), and how many it actually processed. If the estimate is 100 but the actual is 95,000, the database was wildly wrong — and it probably made bad decisions because of that. The fix is to update the statistics the database uses to make estimates, by running ANALYZE on the table. Fresh statistics help the optimizer make better plans.
A "Seq Scan" means the database is reading every single row in the table from start to finish — like reading an entire book to find one sentence. On a small table this is fine. On a table with millions of rows, this is slow. If you see a Seq Scan on a large table, it usually means there is no index for the column in your WHERE clause, or you are using a function on the column that prevents the index from being used. Adding an appropriate index typically fixes this.
The database has a few strategies for joining two tables. A Hash Join is good when both tables are large — it builds an in-memory hash table and looks things up quickly. A Nested Loop join is efficient when one side is small and there is an index on the join column. If you see a Nested Loop join between two large tables with no index, that is usually the source of a slow query. Adding an index on the join column is the fix.
Every SQL statement passes through five stages before returning rows. The optimizer consults the statistics catalog — row counts, histograms, index metadata — to estimate the cheapest plan.
Fetches all columns including BLOBs and unused fields. Breaks when columns are added or removed. The optimizer can use covering indexes only if it knows which columns you need. Always list explicit columns.
Fetching N rows then running one query per row in application code. Replace with a single JOIN or a batch IN clause. N+1 is almost always the top performance issue in ORMs — add ORM query logging to detect it.
OFFSET N scans and discards N rows every time. For page 1000 with 20 rows/page, you discard 20,000 rows. Use keyset pagination: WHERE id > last_seen_id ORDER BY id LIMIT 20.
WHERE user_id = '123' when user_id is INT causes a cast on every row, disabling index usage. Match parameter types to column types exactly — use bind parameters with the correct types in your ORM or driver.
WHERE UPPER(email) = 'FOO@EXAMPLE.COM' prevents index usage on the email column. The index stores original values. Use expression indexes (CREATE INDEX ON users (UPPER(email))) or store data in the correct case to begin with.
COUNT(*) or AVG(salary). A common mistake: WHERE COUNT(*) > 5 fails because WHERE runs before aggregation. The correct form is HAVING COUNT(*) > 5. Performance tip: push filters into WHERE whenever possible so fewer rows are grouped.AGG() OVER (PARTITION BY col ORDER BY col ROWS/RANGE frame). Use window functions when you need both individual row data AND aggregate data in the same result — e.g., each employee's salary alongside the department average. GROUP BY collapses to one row per group; window functions keep all rows. Common functions: ROW_NUMBER (unique sequential), RANK (gaps on ties), DENSE_RANK (no gaps), LAG/LEAD (prev/next row value), SUM/AVG OVER (running totals).WITH name AS (...) before the main query. Key differences from subqueries: (1) CTEs can be referenced multiple times in the same query without re-execution (in most DBMS). (2) CTEs support recursion (WITH RECURSIVE) for hierarchical data — impossible with regular subqueries. (3) CTEs improve readability for complex queries. (4) In PostgreSQL, CTEs are optimization fences (cannot be pushed into or merged with the outer query) — this can hurt performance. Use WITH ... AS MATERIALIZED/NOT MATERIALIZED to control this behavior in PG 12+.WHERE parent_id IN (...) query for children. (3) Eager loading in ORMs (e.g., includes in ActiveRecord, joinedload in SQLAlchemy). (4) For very large ID lists, a JOIN is usually faster than a big IN clause. Always verify with EXPLAIN ANALYZE and ORM debug logging.WHERE id > :last_id ORDER BY id LIMIT 20. The index seeks directly to last_id — O(log n) cost regardless of page depth. Limitation: you cannot jump to an arbitrary page number — only prev/next. For most feeds and lists this is fine. For UIs needing "go to page X" use hybrid approaches or accept OFFSET's cost for the rare arbitrary-page case.WHERE YEAR(order_date) = 2024 — rewrite as WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'. (2) Leading wildcard — WHERE name LIKE '%smith' forces a full scan unless a GIN/trigram index exists. (3) Arithmetic — WHERE price * 1.1 > 100 — rewrite as WHERE price > 90.9. The rule: keep indexed columns bare (unmodified) on the left side of the predicate.NULL = NULL is UNKNOWN — use IS NULL / IS NOT NULL. (2) NOT IN with a list containing NULL returns no rows (UNKNOWN propagates). (3) COUNT(*) counts all rows; COUNT(col) ignores NULLs. (4) COALESCE(val, default) is the standard NULL replacement. Always be explicit about NULL handling — it is the most common source of subtle bugs in SQL.SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees) sub WHERE rnk = 2. (2) Subquery: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees). (3) OFFSET: SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1. The DENSE_RANK approach is most general and handles ties correctly — multiple employees with the same salary all rank the same. The subquery approach is intuitive but requires a nested scan.