Imagine storing every order as a single row with the customer's name, city, and every product they ordered — all jammed together. What happens when the customer moves cities? You have to update every single row. Normalization is the process of designing your tables so that each piece of information lives in exactly one place, making updates safe and data consistent.
Redundancy in a relational schema is not just wasteful storage — it creates three types of update anomalies that silently corrupt data. Every normal form exists to eliminate one of these anomaly types.
Imagine a table that stores order rows with the customer's name and city baked in. If you want to add a new customer to the system before they have placed any order, you cannot — there is nowhere to put them. The customer information is tied to an order row, so a customer without an order simply cannot exist. This is an insert anomaly: you are forced to create unrelated data just to record the thing you actually want.
If a customer's city is stored in every order row, and that customer moves to a new city, you have to update hundreds of rows at once. If you update 99 out of 100 rows and miss one, your database now has contradicting information — the same customer living in two different cities at the same time. This is an update anomaly, and it is a silent data corruption problem. You may not even notice it until much later.
If the only record of a customer's address is embedded in order rows, then deleting the customer's last order also deletes their address — even though you only meant to delete the order. Information about the real world disappears as a side effect of an unrelated deletion. This is a delete anomaly: two pieces of information that should be independent are so tangled together that you cannot remove one without accidentally removing the other.
A functional dependency (FD) A→B says: for any two rows in the table with the same value of A, they must have the same value of B. A uniquely determines B. FDs are the mathematical foundation for all normal forms. Normalization is the process of decomposing a relation so that each FD is "enforced" in exactly one table.
A functional dependency (FD) just means "knowing this column tells you that column." For example, knowing a student ID tells you their name. A full FD is one where you need all the pieces of a composite key — you need both the order ID and the product ID together to know the quantity ordered. Neither alone is enough.
A partial FD is a problem: part of your composite primary key determines a non-key column by itself, without needing the other part. For example, if your primary key is (order_id, product_id) but customer_name is determined by order_id alone — not by the combination — then customer_name does not belong in this table. It depends on only half the key. This violates 2NF and causes the customer name to be repeated in every row for the same order.
A transitive FD is an indirect chain: A determines B, and B determines C, so A indirectly determines C through B. For example, an order has a zip code, and a zip code determines the city. So the order indirectly determines the city — but via the zip code, not directly. The city does not belong in the orders table; it belongs in a zip_codes table. Storing it in orders creates a 3NF violation.
Armstrong's Axioms are a set of logical rules for reasoning about which columns determine which other columns. You do not need to memorize the formal symbols, but the ideas are intuitive. Reflexivity: a column trivially determines itself. Augmentation: if A determines B, then A combined with C also determines B combined with C. Transitivity: if A determines B and B determines C, then A determines C. These rules are used to figure out the full set of things any given set of columns can determine — helpful when checking whether a design is fully normalized.
A table is in 1NF if every column contains only atomic (single, indivisible) values and every row is unique. Violations: storing a list in a cell ("Apple, Banana"), repeating column groups (product1, product2, product3), or nested tables.
| Violation (Not 1NF) | Fix (1NF) |
|---|---|
| tags = "sql,nosql,postgres" (list in one cell) | One row per tag: (post_id, tag) pairs in a separate table |
| item1, item2, item3 columns (repeating groups) | One item per row in an order_items table |
| address = "123 Main St, Springfield, IL" (compound) | street, city, state, zip as separate columns |
| phone_numbers = ["555-1234", "555-5678"] (array) | Separate user_phones table with (user_id, phone) |
A table in 1NF is also in 2NF if every non-key attribute depends on the FULL primary key — not just part of it. This only matters when the primary key is composite.
Table: order_products(order_id, product_id, quantity, product_name, unit_price). PK is (order_id, product_id). But product_name and unit_price depend only on product_id — not on the full PK. That is a partial dependency: part of the composite key determines non-key attributes.
Extract the partially dependent attributes into a new table with the partial key as PK. Keep only the attributes that depend on the full key in the original table.
order_products(order_id, product_id, quantity) — PK is full (order_id, product_id), and quantity genuinely needs both. products(product_id, product_name, unit_price) — product_id determines name and price independently. Now product_name changes in one place only.
A table in 2NF is also in 3NF if no non-key attribute depends on another non-key attribute. Transitivity chain: PK → A → B means B is transitively dependent on the PK through A. Extract A and B into a separate table.
Table: orders(order_id, customer_id, customer_name, customer_zip, city, state). order_id → customer_zip (direct). customer_zip → city (transitive via zip!). customer_zip → state (transitive). City and state depend on zip, not directly on order_id.
Create zip_codes(zip, city, state) where zip is the PK. Remove city and state from orders. Keep only zip in the orders table as a FK reference.
orders(order_id, customer_id, customer_zip). zip_codes(zip, city, state). When a city name changes (e.g., a city renaming), update exactly one row in zip_codes — not thousands of order rows. The update anomaly is eliminated.
BCNF is stricter than 3NF: for every non-trivial FD X→Y, X must be a superkey. 3NF allows Y to be a prime attribute (part of a candidate key); BCNF does not. Most tables in 3NF are also in BCNF — violations are rare but occur when a table has multiple overlapping candidate keys.
Relation R(student, course, teacher) where: {student, course} is a candidate key; {student, teacher} is also a candidate key (each teacher teaches only one course, each student-teacher pair is unique). FD: teacher → course. But teacher is NOT a superkey — it only determines course, not the full tuple. This violates BCNF. In 3NF, this is allowed because course is a prime attribute.
Split into teacher_course(teacher, course) (with teacher as PK — the determinant) and student_teacher(student, teacher). Both are in BCNF. Tradeoff: the FD {student, course} → teacher is no longer enforceable in a single table — it must be re-derived via a join. BCNF may sacrifice dependency preservation, but never lossless-join decomposition.
3NF always preserves all functional dependencies — useful when you need to enforce FDs via key constraints. BCNF eliminates all redundancy but may require application logic or triggers to enforce some FDs. In practice, target 3NF for OLTP schemas; BCNF only when specific anomalies are observed.
-- BEFORE: denormalized order table (multiple anomaly types) -- order_data(order_id, cust_id, cust_name, cust_zip, city, state, product, price, qty) -- FDs: order_id → cust_id, cust_name, cust_zip -- cust_zip → city, state (transitive via zip) -- product → price (partial dep on product alone) -- (order_id, product) → qty (full dep on composite key) -- AFTER: normalized to 3NF CREATE TABLE zip_codes ( zip CHAR(5) PRIMARY KEY, city VARCHAR(100) NOT NULL, state CHAR(2) NOT NULL ); CREATE TABLE customers ( customer_id SERIAL PRIMARY KEY, name VARCHAR(100) NOT NULL, zip CHAR(5) REFERENCES zip_codes(zip) ); CREATE TABLE products ( product_id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL, unit_price NUMERIC(10,2) NOT NULL ); CREATE TABLE orders ( order_id SERIAL PRIMARY KEY, customer_id INT NOT NULL REFERENCES customers(customer_id), created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); CREATE TABLE order_items ( order_id INT REFERENCES orders(order_id), product_id INT REFERENCES products(product_id), quantity INT NOT NULL CHECK (quantity > 0), PRIMARY KEY (order_id, product_id) ); -- Now city update: one row in zip_codes (was: thousands of order rows) UPDATE zip_codes SET city = 'New Springfield' WHERE zip = '62701'; -- Can add a customer before they place any order (no insert anomaly) INSERT INTO customers (name, zip) VALUES ('Alice', '62701'); -- Deleting an order doesn't lose customer data (no delete anomaly) DELETE FROM orders WHERE order_id = 42;
Normalization is ideal for OLTP (Online Transaction Processing) — frequent, small updates where redundancy causes anomalies. OLAP (Online Analytical Processing) workloads — bulk reads, aggregations, complex filters — benefit from controlled denormalization to reduce join counts.
| Property | Normalized (3NF) — OLTP | Star Schema — OLAP | Snowflake Schema — OLAP |
|---|---|---|---|
| Redundancy | None — each fact stored once | Dimension tables denormalized | Dimension tables normalized |
| Join count | Many (4–10 per report query) | Few (fact + dimension tables) | More than star (dimension hierarchies) |
| Update anomalies | None | Possible in dimensions | Reduced vs star |
| Query speed | Slower (more joins) | Fast (fewer joins) | Between star and normalized |
| Storage size | Smallest | Larger (dimension data repeated) | Between |
| Typical use | Transactional apps, RDBMS core | Snowflake, BigQuery, Redshift fact tables | Enterprise data warehouses |
One central fact table (measurements: sales_amount, quantity_sold) with foreign keys to multiple dimension tables (date, product, customer, store). Dimension tables are intentionally denormalized — they store all attributes of the dimension, even if those attributes are themselves functionally related. Example: dim_product stores (product_id, name, category, subcategory, brand, manufacturer) even though subcategory → category is a transitive dependency. Fewer joins means BI tools respond faster.
When dimension data changes (customer changes address, product changes category), you have a choice: Type 1 — overwrite (lose history), Type 2 — add a new row with valid_from/valid_to timestamps (full history), Type 3 — add a new column for previous value (limited history). SCD Type 2 is most common in enterprise DW — every historical sale records the product's category as it was at the time of sale.
Start with a denormalized table with 1NF violations (red cells). Step through normalization. Each step shows the table decomposition animated on the canvas.
Each normal form eliminates a specific class of data anomaly. 1NF removes multi-valued cells; 2NF removes partial dependency on a composite key; 3NF removes transitive dependencies; BCNF handles edge cases where non-trivial determinants are not superkeys.
Denormalization is the deliberate introduction of redundancy for a performance reason. It is not a failure of design — it is a conscious trade-off. The prerequisite is always a measured performance problem, not an assumption.
Store a pre-computed count or sum on the parent row. Example: users.order_count updated by a trigger or application code on every order insert/delete. Queries that need order_count avoid a COUNT(*) aggregate across all orders. Maintained via triggers or background jobs; document the denormalization contract clearly.
A materialized view stores the result of a query on disk. Queries hit the materialized view instead of the base tables — no join overhead. PostgreSQL's REFRESH MATERIALIZED VIEW CONCURRENTLY updates it without blocking readers. Use for expensive aggregations and reports that run frequently but whose data changes infrequently (dashboards, leaderboards, reporting tables).
In MongoDB/Cassandra, related data is embedded in a single document/row rather than normalized across collections. This trades update overhead for read speed — the entire object is retrieved in one I/O. Use embedding when the embedded data is always read with the parent, rarely updated independently, and bounded in size. Use references (normalization equivalent) when the embedded data has its own identity and lifecycle.
tags = "sql,database,postgres" violates 1NF. You cannot index individual values, enforce FK constraints, or query a specific tag without string parsing. Use a separate tags table with (entity_id, tag) rows. If using PostgreSQL, an array column with a GIN index is the pragmatic middle ground.
The JOIN between two indexed tables is usually fast (milliseconds). The update anomaly from storing customer city in every order row is always costly (potentially thousands of rows per update, inconsistency risk). Prove the JOIN is actually the bottleneck with EXPLAIN ANALYZE before adding redundancy.
Pushing an analytical schema to 3NF or beyond means dozens of JOINs per reporting query. BI tools and analysts struggle with complex join paths. Star schemas intentionally stop at controlled denormalization — dimension tables are not normalized. Know your workload before choosing the schema pattern.
status = "paid_shipped" encodes two facts (payment status and shipping status) in one column. When payment or shipping logic changes independently, you must parse and rewrite the encoding. Use two separate boolean or enum columns — one for payment_status, one for shipping_status.