Topic 04

Database Normalization

Imagine storing every order as a single row with the customer's name, city, and every product they ordered — all jammed together. What happens when the customer moves cities? You have to update every single row. Normalization is the process of designing your tables so that each piece of information lives in exactly one place, making updates safe and data consistent.

MySQLPostgreSQLOracle SQL Server

At a Glance

1NF — every cell holds one value, not a list. No "apple, banana, cherry" in a single column. No columns like phone1, phone2, phone3.
2NF — every non-key column depends on the entire primary key. If your primary key is two columns combined, a non-key column should not be determined by just one of them — that information belongs in its own table.
3NF — non-key columns do not determine each other. If zip code determines city, city does not belong in the orders table — it belongs in a zip_codes table. Store each fact where it is directly known.
BCNF — a stricter version of 3NF that handles a rare edge case when a table has multiple overlapping ways to uniquely identify rows. In practice, reaching 3NF covers almost all real designs.
Denormalization is not wrong — it is a deliberate choice you make after measuring. Always start normalized, then add redundancy only where you can prove it solves a real performance problem.

Motivation

Update Anomalies — Why Redundancy Is Dangerous

Redundancy in a relational schema is not just wasteful storage — it creates three types of update anomalies that silently corrupt data. Every normal form exists to eliminate one of these anomaly types.

Insert anomaly

Imagine a table that stores order rows with the customer's name and city baked in. If you want to add a new customer to the system before they have placed any order, you cannot — there is nowhere to put them. The customer information is tied to an order row, so a customer without an order simply cannot exist. This is an insert anomaly: you are forced to create unrelated data just to record the thing you actually want.

Update anomaly

If a customer's city is stored in every order row, and that customer moves to a new city, you have to update hundreds of rows at once. If you update 99 out of 100 rows and miss one, your database now has contradicting information — the same customer living in two different cities at the same time. This is an update anomaly, and it is a silent data corruption problem. You may not even notice it until much later.

Delete anomaly

If the only record of a customer's address is embedded in order rows, then deleting the customer's last order also deletes their address — even though you only meant to delete the order. Information about the real world disappears as a side effect of an unrelated deletion. This is a delete anomaly: two pieces of information that should be independent are so tangled together that you cannot remove one without accidentally removing the other.

Foundation

Functional Dependencies

A functional dependency (FD) A→B says: for any two rows in the table with the same value of A, they must have the same value of B. A uniquely determines B. FDs are the mathematical foundation for all normal forms. Normalization is the process of decomposing a relation so that each FD is "enforced" in exactly one table.

Full FD

A functional dependency (FD) just means "knowing this column tells you that column." For example, knowing a student ID tells you their name. A full FD is one where you need all the pieces of a composite key — you need both the order ID and the product ID together to know the quantity ordered. Neither alone is enough.

Partial FD

A partial FD is a problem: part of your composite primary key determines a non-key column by itself, without needing the other part. For example, if your primary key is (order_id, product_id) but customer_name is determined by order_id alone — not by the combination — then customer_name does not belong in this table. It depends on only half the key. This violates 2NF and causes the customer name to be repeated in every row for the same order.

Transitive FD

A transitive FD is an indirect chain: A determines B, and B determines C, so A indirectly determines C through B. For example, an order has a zip code, and a zip code determines the city. So the order indirectly determines the city — but via the zip code, not directly. The city does not belong in the orders table; it belongs in a zip_codes table. Storing it in orders creates a 3NF violation.

Armstrong's Axioms

Armstrong's Axioms are a set of logical rules for reasoning about which columns determine which other columns. You do not need to memorize the formal symbols, but the ideas are intuitive. Reflexivity: a column trivially determines itself. Augmentation: if A determines B, then A combined with C also determines B combined with C. Transitivity: if A determines B and B determines C, then A determines C. These rules are used to figure out the full set of things any given set of columns can determine — helpful when checking whether a design is fully normalized.

1NF

First Normal Form — Atomic Values

A table is in 1NF if every column contains only atomic (single, indivisible) values and every row is unique. Violations: storing a list in a cell ("Apple, Banana"), repeating column groups (product1, product2, product3), or nested tables.

Violation (Not 1NF)	Fix (1NF)
tags = "sql,nosql,postgres" (list in one cell)	One row per tag: (post_id, tag) pairs in a separate table
item1, item2, item3 columns (repeating groups)	One item per row in an order_items table
address = "123 Main St, Springfield, IL" (compound)	street, city, state, zip as separate columns
phone_numbers = ["555-1234", "555-5678"] (array)	Separate user_phones table with (user_id, phone)

2NF

Second Normal Form — No Partial Dependencies

A table in 1NF is also in 2NF if every non-key attribute depends on the FULL primary key — not just part of it. This only matters when the primary key is composite.

Identify the violation

Table: order_products(order_id, product_id, quantity, product_name, unit_price). PK is (order_id, product_id). But product_name and unit_price depend only on product_id — not on the full PK. That is a partial dependency: part of the composite key determines non-key attributes.

Decompose

Extract the partially dependent attributes into a new table with the partial key as PK. Keep only the attributes that depend on the full key in the original table.

Result

order_products(order_id, product_id, quantity) — PK is full (order_id, product_id), and quantity genuinely needs both. products(product_id, product_name, unit_price) — product_id determines name and price independently. Now product_name changes in one place only.

3NF

Third Normal Form — No Transitive Dependencies

A table in 2NF is also in 3NF if no non-key attribute depends on another non-key attribute. Transitivity chain: PK → A → B means B is transitively dependent on the PK through A. Extract A and B into a separate table.

Identify the chain

Table: orders(order_id, customer_id, customer_name, customer_zip, city, state). order_id → customer_zip (direct). customer_zip → city (transitive via zip!). customer_zip → state (transitive). City and state depend on zip, not directly on order_id.

Decompose at the transitive link

Create zip_codes(zip, city, state) where zip is the PK. Remove city and state from orders. Keep only zip in the orders table as a FK reference.

Result

orders(order_id, customer_id, customer_zip). zip_codes(zip, city, state). When a city name changes (e.g., a city renaming), update exactly one row in zip_codes — not thousands of order rows. The update anomaly is eliminated.

BCNF

Boyce-Codd Normal Form

BCNF is stricter than 3NF: for every non-trivial FD X→Y, X must be a superkey. 3NF allows Y to be a prime attribute (part of a candidate key); BCNF does not. Most tables in 3NF are also in BCNF — violations are rare but occur when a table has multiple overlapping candidate keys.

Classic BCNF violation example

Relation R(student, course, teacher) where: {student, course} is a candidate key; {student, teacher} is also a candidate key (each teacher teaches only one course, each student-teacher pair is unique). FD: teacher → course. But teacher is NOT a superkey — it only determines course, not the full tuple. This violates BCNF. In 3NF, this is allowed because course is a prime attribute.

BCNF decomposition

Split into teacher_course(teacher, course) (with teacher as PK — the determinant) and student_teacher(student, teacher). Both are in BCNF. Tradeoff: the FD {student, course} → teacher is no longer enforceable in a single table — it must be re-derived via a join. BCNF may sacrifice dependency preservation, but never lossless-join decomposition.

3NF vs BCNF tradeoff

3NF always preserves all functional dependencies — useful when you need to enforce FDs via key constraints. BCNF eliminates all redundancy but may require application logic or triggers to enforce some FDs. In practice, target 3NF for OLTP schemas; BCNF only when specific anomalies are observed.

SQL — Before and After 3NF Normalization

-- BEFORE: denormalized order table (multiple anomaly types)
-- order_data(order_id, cust_id, cust_name, cust_zip, city, state, product, price, qty)
-- FDs: order_id → cust_id, cust_name, cust_zip
--      cust_zip → city, state   (transitive via zip)
--      product → price          (partial dep on product alone)
--      (order_id, product) → qty (full dep on composite key)

-- AFTER: normalized to 3NF
CREATE TABLE zip_codes (
  zip    CHAR(5) PRIMARY KEY,
  city   VARCHAR(100) NOT NULL,
  state  CHAR(2) NOT NULL
);

CREATE TABLE customers (
  customer_id  SERIAL PRIMARY KEY,
  name         VARCHAR(100) NOT NULL,
  zip          CHAR(5) REFERENCES zip_codes(zip)
);

CREATE TABLE products (
  product_id   SERIAL PRIMARY KEY,
  name         VARCHAR(200) NOT NULL,
  unit_price   NUMERIC(10,2) NOT NULL
);

CREATE TABLE orders (
  order_id     SERIAL PRIMARY KEY,
  customer_id  INT NOT NULL REFERENCES customers(customer_id),
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
  order_id     INT REFERENCES orders(order_id),
  product_id   INT REFERENCES products(product_id),
  quantity     INT NOT NULL CHECK (quantity > 0),
  PRIMARY KEY (order_id, product_id)
);

-- Now city update: one row in zip_codes (was: thousands of order rows)
UPDATE zip_codes SET city = 'New Springfield' WHERE zip = '62701';
-- Can add a customer before they place any order (no insert anomaly)
INSERT INTO customers (name, zip) VALUES ('Alice', '62701');
-- Deleting an order doesn't lose customer data (no delete anomaly)
DELETE FROM orders WHERE order_id = 42;

OLAP vs OLTP

Star Schema and Denormalization for Analytics

Normalization is ideal for OLTP (Online Transaction Processing) — frequent, small updates where redundancy causes anomalies. OLAP (Online Analytical Processing) workloads — bulk reads, aggregations, complex filters — benefit from controlled denormalization to reduce join counts.

Property	Normalized (3NF) — OLTP	Star Schema — OLAP	Snowflake Schema — OLAP
Redundancy	None — each fact stored once	Dimension tables denormalized	Dimension tables normalized
Join count	Many (4–10 per report query)	Few (fact + dimension tables)	More than star (dimension hierarchies)
Update anomalies	None	Possible in dimensions	Reduced vs star
Query speed	Slower (more joins)	Fast (fewer joins)	Between star and normalized
Storage size	Smallest	Larger (dimension data repeated)	Between
Typical use	Transactional apps, RDBMS core	Snowflake, BigQuery, Redshift fact tables	Enterprise data warehouses

Star schema structure

One central fact table (measurements: sales_amount, quantity_sold) with foreign keys to multiple dimension tables (date, product, customer, store). Dimension tables are intentionally denormalized — they store all attributes of the dimension, even if those attributes are themselves functionally related. Example: dim_product stores (product_id, name, category, subcategory, brand, manufacturer) even though subcategory → category is a transitive dependency. Fewer joins means BI tools respond faster.

Slowly Changing Dimensions (SCD)

When dimension data changes (customer changes address, product changes category), you have a choice: Type 1 — overwrite (lose history), Type 2 — add a new row with valid_from/valid_to timestamps (full history), Type 3 — add a new column for previous value (limited history). SCD Type 2 is most common in enterprise DW — every historical sale records the product's category as it was at the time of sale.

Interactive Demo

Normalization Animator

Start with a denormalized table with 1NF violations (red cells). Step through normalization. Each step shows the table decomposition animated on the canvas.

Each normal form eliminates a specific class of data anomaly. 1NF removes multi-valued cells; 2NF removes partial dependency on a composite key; 3NF removes transitive dependencies; BCNF handles edge cases where non-trivial determinants are not superkeys.

Normalization Steps

Steps: D → 1NF → 2NF → 3NF

Red values = 1NF violations

Denormalization

When to Denormalize Deliberately

Denormalization is the deliberate introduction of redundancy for a performance reason. It is not a failure of design — it is a conscious trade-off. The prerequisite is always a measured performance problem, not an assumption.

Cached aggregates

Store a pre-computed count or sum on the parent row. Example: users.order_count updated by a trigger or application code on every order insert/delete. Queries that need order_count avoid a COUNT(*) aggregate across all orders. Maintained via triggers or background jobs; document the denormalization contract clearly.

Materialized views

A materialized view stores the result of a query on disk. Queries hit the materialized view instead of the base tables — no join overhead. PostgreSQL's REFRESH MATERIALIZED VIEW CONCURRENTLY updates it without blocking readers. Use for expensive aggregations and reports that run frequently but whose data changes infrequently (dashboards, leaderboards, reporting tables).

Embedding for NoSQL

In MongoDB/Cassandra, related data is embedded in a single document/row rather than normalized across collections. This trades update overhead for read speed — the entire object is retrieved in one I/O. Use embedding when the embedded data is always read with the parent, rarely updated independently, and bounded in size. Use references (normalization equivalent) when the embedded data has its own identity and lifecycle.

Anti-patterns

Comma-separated values in a single column

tags = "sql,database,postgres" violates 1NF. You cannot index individual values, enforce FK constraints, or query a specific tag without string parsing. Use a separate tags table with (entity_id, tag) rows. If using PostgreSQL, an array column with a GIN index is the pragmatic middle ground.

Denormalizing without profiling first

The JOIN between two indexed tables is usually fast (milliseconds). The update anomaly from storing customer city in every order row is always costly (potentially thousands of rows per update, inconsistency risk). Prove the JOIN is actually the bottleneck with EXPLAIN ANALYZE before adding redundancy.

Over-normalizing OLAP schemas

Pushing an analytical schema to 3NF or beyond means dozens of JOINs per reporting query. BI tools and analysts struggle with complex join paths. Star schemas intentionally stop at controlled denormalization — dimension tables are not normalized. Know your workload before choosing the schema pattern.

Encoding multiple facts in one column

status = "paid_shipped" encodes two facts (payment status and shipping status) in one column. When payment or shipping logic changes independently, you must parse and rewrite the encoding. Use two separate boolean or enum columns — one for payment_status, one for shipping_status.

Quiz

Question 1 of 5

First Normal Form (1NF) requires that all attribute values are:

AFree of partial key dependencies

BBased on a single-column primary key

CAtomic — each cell contains a single, indivisible value

DEqual in count to the primary key attributes

Question 2 of 5

Which normal form eliminates transitive functional dependencies?

A1NF

B2NF

C3NF

DBCNF

Question 3 of 5

A functional dependency A → B means:

AB determines A

BEach A value uniquely determines the corresponding B value

CA and B always contain identical values in every row

DB is always the primary key of its relation

Question 4 of 5

Table: (order_id, product_id, quantity, product_name). PK is (order_id, product_id). What normal form violation exists?

A1NF — non-atomic values

B2NF — partial dependency (product_name depends only on product_id)

C3NF — transitive dependency

DBCNF — determinant is not a superkey

Question 5 of 5

In a star schema (used in data warehousing), dimension tables are intentionally:

ANormalized to 4NF

BWithout primary keys

CDenormalized to reduce join count in analytical queries

DSplit per normal form violation into hundreds of tables

Interview Q&A

What are the three types of update anomalies in a denormalized table?▾

Insert anomaly: you cannot add a fact without adding an unrelated fact. In a table that stores order + customer data in the same row, you cannot add a new customer until they place an order — the customer's existence is coupled to the order existence. Update anomaly: the same fact stored in multiple rows must be updated in all of them. Change a customer's city and you must update every order row for that customer. Miss one and the database becomes inconsistent. Delete anomaly: deleting a row destroys unrelated information. Delete the last order for a customer and you lose the customer's address — a fact that has nothing to do with orders. Normalization solves all three by storing each fact in exactly one place.

Explain the difference between 3NF and BCNF with an example.▾

3NF allows a non-key attribute to determine a prime attribute (part of a candidate key) — as long as the determinant is not a non-prime attribute. BCNF does not — every determinant must be a superkey, period. Example: R(student, course, teacher) where each teacher teaches one course. Candidate keys: {student, course} and {student, teacher}. FD: teacher → course. Teacher is not a superkey (teacher alone doesn't uniquely identify a row). This violates BCNF. But 3NF allows it because course is a prime attribute (part of the first CK). BCNF decomposition: split into (teacher, course) and (student, teacher). Tradeoff: the FD {student, course} → teacher is no longer enforceable via a single table constraint — you lose dependency preservation. 3NF always preserves dependencies; BCNF may not.

When would you intentionally denormalize and how do you manage the risks?▾

Denormalize when: (1) EXPLAIN ANALYZE proves the JOIN is the actual bottleneck, not a hypothetical one. (2) Read-heavy OLAP workloads where update anomalies are rare and tolerable. (3) Caching computed aggregates — e.g., storing order_count on the users table updated by triggers or background jobs. Management strategies: use database triggers to keep denormalized columns in sync; use materialized views with scheduled refreshes; document the denormalization explicitly in the schema DDL (comments); write integration tests that verify the denormalized data matches the normalized source. Never denormalize based on intuition — the JOIN is almost always fast enough for OLTP. Denormalization is a last resort, not a starting point.

What is Armstrong's Axioms and how are they used in practice?▾

Armstrong's Axioms are a sound and complete set of inference rules for deriving all FDs from a given set F: (1) Reflexivity: if B ⊆ A then A → B (trivial FD). (2) Augmentation: if A → B then AC → BC. (3) Transitivity: if A → B and B → C then A → C. Derived rules: Union (A→B and A→C ⇒ A→BC), Decomposition, Pseudo-transitivity. In practice, they are used to compute attribute closure F+ — given a set of attributes X, what is the set of all attributes that X functionally determines? The closure X+ is used to: determine if X is a candidate key (if X+ = all attributes); identify all candidate keys; verify whether a proposed decomposition is in 3NF or BCNF.

What is a lossless-join decomposition and why does it matter?▾

A decomposition of R into R1 and R2 is lossless-join if natural joining R1 and R2 always gives back exactly R — no spurious (extra phantom) tuples, no missing tuples. If a decomposition is lossy, you cannot reconstruct the original relation — information is permanently destroyed. The lossless-join condition: the intersection of R1 and R2 attributes must be a superkey of either R1 or R2. All standard normal form decompositions (1NF through BCNF) must be lossless-join — that is the fundamental requirement. A related but separate property is dependency preservation — all FDs from the original relation can be enforced in at least one decomposed relation without needing a join. 3NF decomposition is always lossless-join AND dependency-preserving. BCNF is always lossless-join but may sacrifice dependency preservation.

What is a star schema and how does it differ from a normalized OLTP schema?▾

A star schema has one central fact table (sales, events, orders) surrounded by dimension tables (date, product, customer, geography). Dimension tables are intentionally denormalized — all attributes of a dimension are stored in one wide table, even if they have transitive dependencies internally (e.g., subcategory → category → department stored in one dim_product table). This reduces join count: a query needs to join the fact table with only a few dimension tables, not dozens of normalized tables. An OLTP normalized schema has many small tables with clean FDs — optimized for INSERT/UPDATE/DELETE correctness and avoiding anomalies. OLAP schemas optimize for read query performance over update correctness. Key insight: the choice of schema is driven by the workload, not by a universal "best practice."

What is 4NF and when does it apply?▾

4NF eliminates multi-valued dependencies (MVDs). An MVD A ↡ B means: for each value of A, the set of B values is independent of the set of C values (all other attributes). Example: R(employee, skill, language) where employee has a set of skills and a set of languages independently. If an employee has skills {Python, SQL} and languages {English, French}, we must store all 4 combinations — (emp, Python, English), (emp, Python, French), etc. This is a multi-valued dependency, not an FD. 4NF: decompose into (employee, skill) and (employee, language) — now each fact is independent. 4NF is rare in practice — most schema design doesn't encounter true MVDs — but it occasionally appears in many-to-many junction tables with multiple independent dimensions.