Topic 02

The Relational Model

Every database you have ever used — Gmail, Instagram, your bank — stores data in tables. This topic explains the rules behind those tables: how rows and columns are defined, how tables link to each other, and the mathematical operations that make SQL work.

PostgreSQLMySQLOracle SQL ServerSQLite

Core Concepts

Relation (Table)

Think of a relation as a spreadsheet tab. Each row is one record — in database language, a row is called a tuple. Each column is one piece of data about that record — a column is called an attribute. All rows in the same table share the same columns. There is no "first row" or "last row" — order is not guaranteed, just like a deck of cards you have shuffled.

Keys Hierarchy

Every row needs a way to be uniquely identified — like a student ID number. Any column (or combination of columns) that can identify a row is called a superkey. If you strip that down to the minimum needed (no extra columns), it becomes a candidate key. The one you officially designate as the row's identifier is the primary key. A foreign key is a column that points to the primary key of a different table — it is how tables link to each other.

Referential Integrity

If every employee must belong to a department, the database enforces that you cannot delete a department that still has employees. This enforcement is called referential integrity. When the parent row is deleted, you decide: should child rows be deleted too (CASCADE)? Should their link be cleared (SET NULL)? Or should the deletion be blocked until the children are gone first (RESTRICT)? Each option encodes a real business decision.

Relational Algebra

Before SQL existed, mathematicians described database operations using symbols. Pick only certain rows: sigma. Pick only certain columns: pi. Combine two tables: join. Combine two result sets: union. Everything you write in SQL compiles down to these six basic operations. You do not need to memorize the symbols — but understanding them explains why SQL works the way it does.

Key Taxonomy

From Superkey to Foreign Key

Superkey

Any set of columns uniquely identifying a row
{user_id}, {email}, {user_id,email} are all superkeys
May have redundant columns
Every candidate key is a superkey

Candidate Key

A minimal superkey — no subset also identifies rows
{user_id} and {email} are candidate keys
{user_id, email} is NOT one (user_id alone suffices)
A table may have multiple candidate keys

Primary Key

One chosen candidate key per table
Cannot be NULL (implicit NOT NULL)
Physically creates a clustered B+tree in InnoDB
Other candidate keys become UNIQUE constraints

Foreign Key

Attribute(s) referencing a PK in another table
Value must be NULL or match a referenced PK
Enforces referential integrity across tables
ON DELETE: CASCADE / SET NULL / RESTRICT

Integrity Constraints

Enforcing Correct Data

Domain Constraint

Every column has a declared type and set of legal values — that is its domain. For example, an age column should not allow -5 or the word "banana." Domain constraints let you declare rules like "age must be a non-negative integer" or "status can only be 'active' or 'inactive'." The database rejects any insert or update that violates these rules before the data is ever saved.

Entity Integrity

Every row must be uniquely identifiable — you cannot have a row with no identity. This means the primary key column can never be empty (NULL) and can never be duplicated. In practice, many tables use a generated ID number (like BIGSERIAL or UUID) as the primary key rather than something like an email address — email addresses can change, be shared, or be left blank, which would break this rule.

Referential Integrity

When one table has a column pointing to another table's primary key (a foreign key), the database checks that the reference is valid. You cannot have an order row pointing to a customer that does not exist. When you delete the parent (the customer), you choose what happens to the children (the orders): delete them too (CASCADE), blank out the link (SET NULL), or refuse the deletion until you manually remove the children first (RESTRICT).

User-Defined Constraints

Sometimes you need business rules that go beyond basic types and keys. For example, "end date must be after start date," or "each user can only have one review per product." These are user-defined constraints. You can declare them directly in the table definition, and the database will enforce them automatically — no application code needed. For more complex rules (like checking data across multiple tables at once), you can use triggers.

How It Works

Relational Algebra — Six Core Operations

sigma — Selection (WHERE)

This operation picks certain rows and throws away the rest. The symbol sigma (looks like a capital E rotated) represents "filter rows where this condition is true." In SQL, this is your WHERE clause. For example, "give me only employees earning over $80,000." The result has the same columns as before — you have only reduced the number of rows.

pi — Projection (SELECT cols)

This operation keeps only certain columns and hides the rest. The symbol pi represents "project columns." In SQL, this is the list of columns after SELECT — for example, "I only want the name and salary columns, not the full employee record." In strict math, projection also removes duplicate rows. In SQL, duplicates are kept by default for speed — add DISTINCT if you want to remove them.

Join — Combining Relations

A join takes two tables and stitches them together based on a matching column. Think of it like looking up a person's department name by matching the department ID on the employee row with the department ID on the department row. In SQL, you use JOIN ... ON to say exactly which columns should match. Always use an explicit ON condition in real code — letting the database guess (natural join) is risky and error-prone.

Set Operations (UNION, INTERSECT, EXCEPT)

Sometimes you want to combine two query results rather than two tables. UNION stacks them together (like combining two lists). INTERSECT gives you only rows that appear in both lists. EXCEPT gives you rows from the first list that do not appear in the second. Both queries must return the same number of columns. UNION removes duplicates by default; use UNION ALL to keep them (which is faster when you know there are no duplicates).

Relational Algebra and SQL

Relational algebra: pi(name,dept_name)(sigma(salary>80000)(employees JOIN departments))
SELECT   e.name, d.dept_name
FROM     employees e
JOIN     departments d ON e.dept_id = d.dept_id
WHERE    e.salary > 80000;

Key constraints in DDL
CREATE TABLE orders (
  order_id    BIGSERIAL    PRIMARY KEY,
  customer_id INT         NOT NULL
                           REFERENCES customers(customer_id)
                           ON DELETE RESTRICT,
  status      VARCHAR(20) NOT NULL
                           CHECK (status IN ('pending','paid','shipped')),
  total       NUMERIC(10,2) CHECK (total > 0),
  created_at  TIMESTAMPTZ  DEFAULT now()
);

UNION: customers who ordered OR registered this year
SELECT customer_id FROM orders
UNION
SELECT id FROM customers WHERE created_at >= '2024-01-01';

EXCEPT: customers who registered but never ordered
SELECT id FROM customers
EXCEPT
SELECT DISTINCT customer_id FROM orders;

At a Glance

Relation = table. Tuple = row. Attribute = column. Every value in every cell must be a single, indivisible piece of data — no lists inside a cell.
Primary key — the unique ID for every row. Foreign key — a column that points to the primary key of another table. This is how you link tables together.
INNER JOIN — only rows that match in both tables. LEFT JOIN — all rows from the left table, even when there is no match on the right (missing data shows as NULL).
Relational algebra — think of it as the math version of SQL. Pick rows (sigma), pick columns (pi), link tables (join). SQL is just a friendlier way to write this.
The key rules you learn here (candidate keys, how data depends on other data) are the foundation of normalization — the next topic.

JOIN Types Reference

INNER JOIN

Returns only matching rows from both sides
Non-matching rows are silently dropped
Most common join type in OLTP
Use when you need data from both tables

LEFT JOIN

All rows from left, matching from right
NULLs for right side when no match
Common for optional relationships
Use: orders with or without invoices

FULL OUTER JOIN

All rows from both sides
NULLs where no match on either side
Rare in practice, powerful for reconciliation
Use: finding records missing from either table

CROSS JOIN

Cartesian product: every row times every row
n * m rows — very expensive at scale
Useful for generating test data or combinations
Missing ON clause in INNER = accidental CROSS

A foreign key in employees references the primary key in departments. The JOIN operation walks that reference to stitch the two relations together into a flat result set.

Relational Join Demo

Matching FK values (dept_id) are highlighted. Result table shows below.

Closure Property: Why SQL Subqueries Work

Every relational algebra operation takes one or more relations as input and produces a relation as output — this is closure. It enables arbitrary composition: the output of a SELECT is itself a relation that can be used as input to another SELECT. This is why subqueries, CTEs, and derived tables all work naturally in SQL. Every query result is a "virtual table" with the same interface as a stored table — that is why you can write SELECT * FROM (SELECT ...) AS sub.

Anti-patterns

Choosing the wrong join type

Using INNER JOIN when you need LEFT JOIN silently drops rows. Always verify expected result cardinality with a COUNT(*) sanity check. A missing row in the result is often worse than an error because it is invisible.

Missing referential integrity constraints

Storing FK values without declaring FOREIGN KEY lets orphaned rows accumulate — child rows pointing to deleted parents. This silent data corruption is expensive to find and fix later, especially across millions of rows.

Using NATURAL JOIN in production

NATURAL JOIN matches on ALL common column names. Adding a new column to a table can silently change join semantics. Always use explicit ON conditions in production code.

NULL in primary key columns

NULLs in PK columns violate entity integrity. NULL means "unknown" — two NULLs are not equal in SQL's three-valued logic. Always make PKs NOT NULL. Use surrogate keys (BIGSERIAL, UUID) for stability.

Quiz

Question 1 of 5

A relation in the relational model is formally defined as:

AA foreign key linking two tables

BA set of tuples sharing the same attribute schema

CAn index on a primary key column

DA SQL SELECT query result

Question 2 of 5

Which constraint ensures that a foreign key value either matches a primary key in the referenced table or is NULL?

APRIMARY KEY

BNOT NULL

CFOREIGN KEY (referential integrity)

DUNIQUE

Question 3 of 5

The relational algebra pi (projection) operation does what?

AFilters rows based on a predicate

BJoins two relations on a common attribute

CKeeps only specified columns, removes duplicates

DComputes aggregate functions like COUNT and SUM

Question 4 of 5

What distinguishes a candidate key from a superkey?

AA candidate key must be a single column

BA candidate key is a minimal superkey — no subset also uniquely identifies rows

CA candidate key cannot contain NULL values

DA candidate key can only exist once per table

Question 5 of 5

ON DELETE CASCADE on a foreign key means:

APrevent deletion of the parent row if child rows exist

BSet the FK column to NULL when the parent is deleted

CAutomatically delete child rows when the parent row is deleted

DSet the FK to a default value when the parent is deleted

Interview Q&A

What is the difference between a superkey, candidate key, and primary key?▾

A superkey is any set of attributes that uniquely identifies a tuple. A candidate key is a minimal superkey — no proper subset is also a superkey. A primary key is the one candidate key chosen by the designer as the principal identifier. Example: in a users table, {user_id}, {email}, and {user_id, email} are all superkeys. {user_id} and {email} are candidate keys (both minimal). The designer picks {user_id} as PK and adds UNIQUE on email.

Explain referential integrity and ON DELETE options.▾

Referential integrity means every non-null FK value must exist as a PK in the referenced table. When the parent row is deleted: CASCADE deletes all child rows (useful for orders and items). SET NULL sets FK to NULL (useful for optional relationships). RESTRICT / NO ACTION rejects the delete if child rows exist (safest default). SET DEFAULT sets to a default value. Choose based on the business rule: is a child row meaningful without its parent?

What is the difference between INNER JOIN and CROSS JOIN?▾

A CROSS JOIN is the Cartesian product — every row in table A combined with every row in table B. If A has 1,000 rows and B has 1,000 rows, the result has 1,000,000 rows. No ON clause. An INNER JOIN is a cross join plus a filter: only tuples where the join predicate is true survive. CROSS JOINs are rarely intentional in production — a missing ON clause in an INNER JOIN produces a CROSS JOIN accidentally. Useful for generating test data or date/product combinations.

What does the relational algebra closure property mean?▾

Closure means every relational algebra operation takes one or more relations as input and produces a relation as output. This enables composition: you can nest operations arbitrarily, using the output of one as input to the next. SQL exploits this: subqueries, CTEs, and derived tables all rely on closure. Every SQL query produces a "virtual table" that can be treated as a relation for further querying — that is why SELECT * FROM (SELECT ...) AS sub works.

Why does SQL use DISTINCT? Does not the relational model already eliminate duplicates?▾

Strictly, yes — a relation is a set, so no duplicate tuples are allowed. But SQL is based on bags (multisets), not sets, for performance: eliminating duplicates requires sorting or hashing, which is expensive. So SELECT without DISTINCT can return duplicates. SELECT DISTINCT adds the deduplication step. This is a deliberate SQL deviation from pure relational theory — the tradeoff is performance over theoretical purity. For aggregations, duplicates are often intentional (you want to COUNT all rows, not just unique ones).

What is entity integrity and why is NULL problematic in PKs?▾

Entity integrity requires primary key columns be non-null and unique. NULL means "unknown value" — two NULLs are not considered equal (NULL != NULL in SQL's three-valued logic). If a PK column is NULL, the row cannot be uniquely identified. The DBMS rejects NULL in PK columns (NOT NULL is implicit). In practice, use surrogate keys (BIGSERIAL, UUID v7) rather than natural keys that might be null or change over time such as email addresses.

How is a LEFT JOIN different from a subquery with NOT EXISTS?▾

Both express "left side with optional right side" but serve different patterns. LEFT JOIN retrieves data from both tables — you can SELECT columns from both. Common pattern: "give me all orders and their payments if they exist." NOT EXISTS finds rows in one table with no match in another: "give me customers who have never ordered" — you test for absence, not select from the second table. Performance: LEFT JOIN + WHERE right.col IS NULL is equivalent to NOT EXISTS but the optimizer may choose different plans. NOT EXISTS often optimizes better in PostgreSQL; NOT IN can behave unexpectedly if the subquery returns NULLs.

What is a self-join and when would you use one?▾

A self-join joins a table to itself using aliases. Classic use: hierarchical or adjacency list structures. Example: SELECT e.name, m.name AS manager FROM employees e JOIN employees m ON e.manager_id = m.id. This retrieves each employee alongside their manager's name from the same table. Other uses: finding pairs of employees in the same department, comparing rows within the same table, or finding records that share a value. The DBMS treats the two aliases as two logical relations — same physical table, two logical names.