Topic 03

SQL Fundamentals

SQL is the language you use to ask questions of a database. Think of it like talking to a very precise librarian who can fetch any combination of records, sort them, group them, or summarize them — as long as you phrase your question correctly. This topic covers how to write those questions well.

PostgreSQLMySQLSQLite BigQuerySnowflakeDuckDB

Core Concepts

SELECT Pipeline

A SQL query is not run in the order you write it. The database first figures out which table to read (FROM), then filters rows (WHERE), then groups and counts (GROUP BY), then filters groups (HAVING), then selects the columns you want (SELECT), then sorts (ORDER BY), then limits the output (LIMIT). This is the logical execution order — and it explains why you cannot use a column alias from SELECT inside a WHERE clause. The WHERE runs before SELECT, so the alias does not exist yet.

Window Functions

Imagine you want each employee's salary AND the average salary for their department in the same row. GROUP BY cannot do this — it collapses all employees into one row per department. Window functions solve this: they calculate across a group of rows but keep every individual row intact. Think of PARTITION BY as "draw a box around each department," and ORDER BY as "rank people within that box." The result: every employee's row now also includes a department average, a rank, or a running total.

CTEs (WITH clause)

A CTE (Common Table Expression) is a way to give a subquery a name and define it at the top of your query, before the main SELECT. Think of it like assigning a variable in code — instead of burying a complex subquery inside your SELECT, you write it once at the top with a clear name and then use that name in the query below. This makes complex queries much easier to read. CTEs also unlock recursive queries, which can walk through tree structures like an org chart or a set of folder paths.

EXPLAIN / EXPLAIN ANALYZE

When a query is slow, you need to see what the database is actually doing. EXPLAIN shows the plan the database intends to follow — like the blueprint before the building. EXPLAIN ANALYZE actually runs the query and shows what really happened: how many rows were read, how long each step took, and whether the database's guesses about row counts were accurate. This is the most important tool for diagnosing why a query is slow.

How It Works

Logical Query Execution Order

FROM + JOIN

The very first thing the database does is figure out which tables to read and stitch them together. All the joins happen here, creating one big combined table in memory. This is why you can reference tables from a JOIN in your WHERE clause — they already exist. But a column alias you define in SELECT does not exist yet — SELECT has not happened yet.

WHERE

Now the database goes through every row and throws out ones that do not match your filter condition. The sooner you filter, the less work the later steps have to do. This is the most impactful place to speed up a query — a good WHERE clause with a good index can reduce 10 million rows to 100 before any grouping happens. Note: you cannot filter on counts or averages here — that is what HAVING is for.

GROUP BY + Aggregates

When you want totals, counts, or averages — this is the step where rows get bucketed into groups. All employees in the same department become one group. Then COUNT(*), SUM(salary), AVG(salary) and similar functions calculate one result per group. In SELECT, every column must either be part of the grouping or be inside an aggregate — otherwise the database would not know which row's value to show.

HAVING

WHERE filters individual rows. HAVING filters groups — it runs after GROUP BY, so it can see the results of aggregations. For example, "show me only departments where the average salary is above $70,000" — that average does not exist until after GROUP BY, so it belongs in HAVING, not WHERE.

SELECT + Window Functions

Only now does the database pick which columns to include in the result. Window functions also run at this stage — they can see all the grouped data and add things like "rank within group" or "running total" to each row without collapsing them into fewer rows. This is why window functions are so powerful: they add context to each row without losing the row.

ORDER BY + LIMIT

Finally, the result is sorted and trimmed to the requested size. LIMIT without ORDER BY is dangerous — the database picks rows in whatever order is convenient, which can change between runs. Always pair LIMIT with ORDER BY if you care about which rows you get. And if you are paginating through many pages, avoid skipping thousands of rows — that is slow. Instead, remember the last ID you saw and filter from there next time.

JOIN Types

INNER JOIN

Only matching rows from both tables
Drops non-matching rows silently
Equivalent to intersection on join key
Most common join type in OLTP queries

LEFT JOIN

All rows from left, matching from right
NULLs for right cols when no match
Use for optional relationships
More common than RIGHT JOIN

FULL OUTER JOIN

All rows from both tables
NULLs where no match on either side
Good for reconciliation queries
Expensive — full scan of both sides

SELF JOIN

Table joined to itself via aliases
Classic use: hierarchies (employee/manager)
Also for finding pairs within the same table
Same as any join — one table used twice

Subqueries vs CTEs

When to Use Which

Correlated Subquery

References outer query columns
Re-executes once per outer row — O(n)
Use: EXISTS checks, scalar lookups per row
Optimizer may rewrite as JOIN automatically

Uncorrelated Subquery

Self-contained, runs once
Result cached and reused
Use: filtering against a derived set
Example: WHERE id IN (SELECT id FROM ...)

CTE (WITH clause)

Named, readable, defined before main query
Can be referenced multiple times
Supports WITH RECURSIVE for trees and graphs
PostgreSQL: optimization fence (may inhibit pushdown)

Derived Table (inline view)

Subquery in FROM clause
Not named — one-time use
Optimizer can often merge with outer query
Use: simple one-off transformations

SQL — Window Functions and CTEs

Running salary total per department, rank within each dept
SELECT
  emp_id,
  name,
  dept,
  salary,
  SUM(salary)   OVER (PARTITION BY dept ORDER BY hire_date) AS running_total,
  RANK()        OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank,
  LAG(salary,1) OVER (PARTITION BY dept ORDER BY hire_date) AS prev_salary,
  salary - AVG(salary) OVER (PARTITION BY dept)            AS diff_from_avg
FROM employees;

CTE + recursive (org chart — depth-first traversal)
WITH RECURSIVE org AS (
  SELECT emp_id, name, manager_id, 0 AS depth
  FROM   employees WHERE manager_id IS NULL          anchor: root
  UNION ALL
  SELECT e.emp_id, e.name, e.manager_id, o.depth + 1
  FROM   employees e
  JOIN   org o ON e.manager_id = o.emp_id           recursive step
)
SELECT repeat('  ', depth) || name AS org_tree, depth
FROM  org ORDER BY depth, name;

EXISTS: customers who made a high-value order (correlated)
SELECT * FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id   references outer query: correlated
    AND o.total > 1000
);

Keyset pagination: O(log n) at any page depth
SELECT id, name, created_at
FROM   orders
WHERE  id > :last_seen_id        index seek
ORDER BY id
LIMIT  20;

At a Glance

SQL has three jobs: DDL defines the shape of your tables (CREATE, ALTER). DML reads and writes data (SELECT, INSERT, UPDATE, DELETE). DCL controls who can do what (GRANT, REVOKE).
Window functions (ROW_NUMBER, RANK, LAG) let you rank or compare rows within a group without losing the individual rows. GROUP BY collapses rows; window functions keep them.
EXPLAIN ANALYZE shows you exactly what the database did to answer your query. If a query is slow, run this first — it shows where the time went and whether an index was used.
CTEs (WITH clause) let you name a subquery and reuse it. WITH RECURSIVE lets you walk tree structures like org charts or category hierarchies.
Avoid OFFSET for large tables. Skipping 10,000 rows to show "page 500" makes the database read all those rows just to throw them away. Instead, track the last ID you saw and filter from there.

Reading EXPLAIN Output

Diagnosing Slow Queries

Read Bottom-Up

The EXPLAIN output is a tree structure. The deepest, most-indented lines are the first steps — usually reading from a table or using an index. The top line is the last step — returning the final result. Execution flows from the bottom up. When looking for where time is being wasted, start at the bottom and look for steps that read many more rows than expected, or that have a long execution time.

Compare Estimated vs Actual Rows

EXPLAIN ANALYZE shows two numbers for each step: how many rows the database guessed it would process (estimated), and how many it actually processed. If the estimate is 100 but the actual is 95,000, the database was wildly wrong — and it probably made bad decisions because of that. The fix is to update the statistics the database uses to make estimates, by running ANALYZE on the table. Fresh statistics help the optimizer make better plans.

Spot Seq Scans on Large Tables

A "Seq Scan" means the database is reading every single row in the table from start to finish — like reading an entire book to find one sentence. On a small table this is fine. On a table with millions of rows, this is slow. If you see a Seq Scan on a large table, it usually means there is no index for the column in your WHERE clause, or you are using a function on the column that prevents the index from being used. Adding an appropriate index typically fixes this.

Check Join Types

The database has a few strategies for joining two tables. A Hash Join is good when both tables are large — it builds an in-memory hash table and looks things up quickly. A Nested Loop join is efficient when one side is small and there is an index on the join column. If you see a Nested Loop join between two large tables with no index, that is usually the source of a slow query. Adding an index on the join column is the fix.

Every SQL statement passes through five stages before returning rows. The optimizer consults the statistics catalog — row counts, histograms, index metadata — to estimate the cheapest plan.

Live SQL Filter

Rows animate one-by-one as the engine scans. Green = matches predicate.

Anti-patterns

SELECT * in production queries

Fetches all columns including BLOBs and unused fields. Breaks when columns are added or removed. The optimizer can use covering indexes only if it knows which columns you need. Always list explicit columns.

The N+1 query problem

Fetching N rows then running one query per row in application code. Replace with a single JOIN or a batch IN clause. N+1 is almost always the top performance issue in ORMs — add ORM query logging to detect it.

Using OFFSET for pagination on large tables

OFFSET N scans and discards N rows every time. For page 1000 with 20 rows/page, you discard 20,000 rows. Use keyset pagination: WHERE id > last_seen_id ORDER BY id LIMIT 20.

Implicit type conversions in WHERE

WHERE user_id = '123' when user_id is INT causes a cast on every row, disabling index usage. Match parameter types to column types exactly — use bind parameters with the correct types in your ORM or driver.

Functions on indexed columns in WHERE

WHERE UPPER(email) = 'FOO@EXAMPLE.COM' prevents index usage on the email column. The index stores original values. Use expression indexes (CREATE INDEX ON users (UPPER(email))) or store data in the correct case to begin with.

Quiz

Question 1 of 5

Which SQL clause filters results AFTER GROUP BY (on aggregated values)?

AWHERE

BGROUP BY

CHAVING

DORDER BY

Question 2 of 5

A correlated subquery differs from a regular subquery in that it:

AReferences a column from the outer query, re-executing per outer row

BRuns once and its result is cached

CIs always faster than an equivalent JOIN

DCannot return multiple rows

Question 3 of 5

EXPLAIN ANALYZE in PostgreSQL differs from EXPLAIN in that it:

AOnly shows estimated costs, not actual execution

BActually executes the query and shows real vs estimated statistics

CRuns the query but discards result rows to avoid side effects

DForces the optimizer to choose a different plan

Question 4 of 5

What is the difference between RANK() and DENSE_RANK() window functions?

ARANK() starts from 0; DENSE_RANK() starts from 1

BRANK() leaves gaps after ties; DENSE_RANK() does not leave gaps

CDENSE_RANK() cannot use PARTITION BY

DRANK() handles ties; DENSE_RANK() does not

Question 5 of 5

Which statement about WITH RECURSIVE CTEs is correct?

AWITH RECURSIVE is only useful for aggregations

BWITH RECURSIVE runs indefinitely unless you add LIMIT

CIt has an anchor query and a recursive step; terminates when recursive step returns no rows

DWITH RECURSIVE is not supported by standard SQL databases

Interview Q&A

What is the difference between WHERE and HAVING?▾

WHERE filters rows before they are grouped. It can reference only raw column values, not aggregate functions. HAVING filters groups after GROUP BY aggregation. It can reference aggregate expressions like COUNT(*) or AVG(salary). A common mistake: WHERE COUNT(*) > 5 fails because WHERE runs before aggregation. The correct form is HAVING COUNT(*) > 5. Performance tip: push filters into WHERE whenever possible so fewer rows are grouped.

Explain window functions and when you would use them instead of GROUP BY.▾

Window functions compute a value across a set of rows related to the current row, without collapsing them. Syntax: AGG() OVER (PARTITION BY col ORDER BY col ROWS/RANGE frame). Use window functions when you need both individual row data AND aggregate data in the same result — e.g., each employee's salary alongside the department average. GROUP BY collapses to one row per group; window functions keep all rows. Common functions: ROW_NUMBER (unique sequential), RANK (gaps on ties), DENSE_RANK (no gaps), LAG/LEAD (prev/next row value), SUM/AVG OVER (running totals).

What is a CTE and how does it differ from a subquery?▾

A CTE (Common Table Expression) is a named temporary result set defined with WITH name AS (...) before the main query. Key differences from subqueries: (1) CTEs can be referenced multiple times in the same query without re-execution (in most DBMS). (2) CTEs support recursion (WITH RECURSIVE) for hierarchical data — impossible with regular subqueries. (3) CTEs improve readability for complex queries. (4) In PostgreSQL, CTEs are optimization fences (cannot be pushed into or merged with the outer query) — this can hurt performance. Use WITH ... AS MATERIALIZED/NOT MATERIALIZED to control this behavior in PG 12+.

How do you solve the N+1 query problem?▾

N+1 occurs when you fetch N rows and then execute one query per row to fetch related data (1 + N queries total). Solutions: (1) JOIN — fetch parent and children in one query. (2) IN clause — fetch all parents, collect IDs, do one WHERE parent_id IN (...) query for children. (3) Eager loading in ORMs (e.g., includes in ActiveRecord, joinedload in SQLAlchemy). (4) For very large ID lists, a JOIN is usually faster than a big IN clause. Always verify with EXPLAIN ANALYZE and ORM debug logging.

What is keyset pagination and why is it better than OFFSET?▾

OFFSET n skips n rows, which requires scanning and discarding them. On page 1000 of 20 rows/page you throw away 20,000 rows — O(n) cost that grows with page depth. Keyset pagination uses the last-seen value: WHERE id > :last_id ORDER BY id LIMIT 20. The index seeks directly to last_id — O(log n) cost regardless of page depth. Limitation: you cannot jump to an arbitrary page number — only prev/next. For most feeds and lists this is fine. For UIs needing "go to page X" use hybrid approaches or accept OFFSET's cost for the rare arbitrary-page case.

What is a non-sargable predicate?▾

SARG = Search ARGument able. A sargable predicate allows the optimizer to use a B+tree index to seek directly to matching rows. Non-sargable predicates force a full scan. Common non-sargable patterns: (1) Function on column — WHERE YEAR(order_date) = 2024 — rewrite as WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'. (2) Leading wildcard — WHERE name LIKE '%smith' forces a full scan unless a GIN/trigram index exists. (3) Arithmetic — WHERE price * 1.1 > 100 — rewrite as WHERE price > 90.9. The rule: keep indexed columns bare (unmodified) on the left side of the predicate.

Explain SQL's three-valued logic and how NULL affects queries.▾

SQL uses three-valued logic: TRUE, FALSE, and UNKNOWN. Any comparison with NULL returns UNKNOWN, not TRUE or FALSE. UNKNOWN in a WHERE clause means the row is filtered out. Key consequences: (1) NULL = NULL is UNKNOWN — use IS NULL / IS NOT NULL. (2) NOT IN with a list containing NULL returns no rows (UNKNOWN propagates). (3) COUNT(*) counts all rows; COUNT(col) ignores NULLs. (4) COALESCE(val, default) is the standard NULL replacement. Always be explicit about NULL handling — it is the most common source of subtle bugs in SQL.

How do you write a query to find the second-highest salary?▾

Multiple approaches: (1) DENSE_RANK window function (best): SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees) sub WHERE rnk = 2. (2) Subquery: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees). (3) OFFSET: SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1. The DENSE_RANK approach is most general and handles ties correctly — multiple employees with the same salary all rank the same. The subquery approach is intuitive but requires a nested scan.