Python · SQL · Web Dev · Java · AI/ML tracks launching soon — your one platform for all of IT
SQL — Module 26Intermediate
Normalisation
Design relational schemas that eliminate redundancy, prevent update anomalies, and stay consistent — 1NF through 3NF explained with real FreshCart examples
Before relational databases existed, data was often stored in flat files — one big table with every piece of information crammed into it. This feels natural: put everything in one place, easy to find. But as soon as the data needs to be updated, the problems begin.
Imagine FreshCart stored all order information in a single flat table like this:
order_id
customer_name
customer_email
customer_city
product_name
category
brand
qty
unit_price
store_city
store_manager
1001
Aisha Khan
aisha@gmail.com
Seattle
Amul Butter
Dairy
Amul
2
56.00
Seattle
Suresh Rao
1001
Aisha Khan
aisha@gmail.com
Seattle
Tata Salt
Staples
Tata
1
22.00
Seattle
Suresh Rao
1002
Rahul Sharma
rahul@gmail.com
New York
Amul Butter
Dairy
Amul
3
56.00
New York
Olivia Brown
1003
Aisha Khan
aisha@gmail.com
Seattle
Maggi Noodles
Staples
Nestle
5
15.00
Seattle
Suresh Rao
This table has four serious problems — and understanding them is the entire motivation for normalisation.
Update anomaly
Aisha Khan's email appears in three rows. If she changes her email, you must update all three rows. Miss one and your database has inconsistent data — two different "correct" emails for the same customer.
Insertion anomaly
You cannot add a new product to the catalogue without also having an order for it. The product row requires an order_id because the table is built around orders.
Deletion anomaly
If order 1002 is deleted, you lose all information about the New York store and Olivia Brown — because that was the only row containing those store details.
Redundancy
Amul Butter's name, category, and brand are repeated in every order that contains it. With millions of orders, this is millions of repeated values wasting storage and causing inconsistency risk.
Normalisation is the process of restructuring a relational database to eliminate these anomalies. It organises columns and tables to ensure that data dependencies make sense — that each piece of data is stored exactly once, in the right place, and can be updated in exactly one location.
// Part 02
The Normal Forms — A Progressive Hierarchy
Normalisation is defined as a series of normal forms — each a progressively stricter set of rules. A table that satisfies the rules of a given normal form is said to be "in" that normal form. Each normal form builds on the previous one: a table in 3NF is also in 2NF and 1NF.
1NFFirst Normal Form — Atomic values, no repeating groups
Rule
Every column contains atomic (indivisible) values. No column contains multiple values or repeating groups. There are no duplicate rows.
Key test
Test: Can any cell be split into multiple values? If yes → not 1NF
2NFSecond Normal Form — No partial dependencies
Rule
The table is in 1NF AND every non-key column is fully functionally dependent on the entire primary key — not just part of it. Only relevant for composite primary keys.
Key test
Test: Does any non-key column depend on only PART of a composite PK? If yes → not 2NF
3NFThird Normal Form — No transitive dependencies
Rule
The table is in 2NF AND no non-key column depends on another non-key column. Every non-key column depends directly on the primary key, and nothing else.
Key test
Test: Does any non-key column determine another non-key column? If yes → not 3NF
BCNFBoyce-Codd Normal Form — Stricter 3NF
Rule
A stricter version of 3NF: for every functional dependency A → B, A must be a superkey (a key that uniquely identifies the entire row). BCNF resolves edge cases where 3NF is insufficient.
Key test
Test: Does any non-superkey column determine any column? If yes → not BCNF
In practice, the target for most production databases is 3NF. It eliminates the major anomalies while keeping the schema practical. BCNF, 4NF, and 5NF address increasingly rare edge cases and are rarely required outside academic settings or highly specialised applications.
// Part 03
First Normal Form (1NF) — Atomic Values
A table is in 1NF when every column contains a single, atomic (indivisible) value — no lists, no sets, no repeating groups. Each row represents exactly one fact, and every fact is in exactly one cell.
Violations of 1NF
1NF violations — common patterns
-- Violation 1: Multiple values in one cell (comma-separated list)
-- products table with a 'tags' column
product_id | product_name | tags
1 | Amul Butter | 'dairy, refrigerated, fat'
-- PROBLEM: 'dairy, refrigerated, fat' is three values in one cell
-- Cannot query: WHERE tags = 'dairy' -- this won't work
-- Violation 2: Repeating column groups
-- orders table with multiple product columns
order_id | product_1 | qty_1 | product_2 | qty_2 | product_3 | qty_3
1001 | Amul Butter | 2 | Tata Salt | 1 | NULL | NULL
-- PROBLEM: How do you handle an order with 10 products?
-- Must add more columns. Queries are complex. NULLs everywhere.
-- Violation 3: Non-atomic composite value
customer_id | full_address
1 | '204 MG Road, Koramangala, Seattle 560001'
-- PROBLEM: Cannot query by city alone without string parsing
Fixing 1NF violations
Fixing 1NF — decompose into atomic values
-- Fix 1: Move multi-valued attribute to a separate table
-- Instead of tags column in products:
CREATE TABLE product_tags (
product_id INTEGER REFERENCES products(product_id),
tag VARCHAR(50) NOT NULL,
PRIMARY KEY (product_id, tag)
);
-- One row per tag per product — fully atomic
-- Fix 2: Move repeating groups to a separate table
-- Instead of product_1, qty_1, product_2, qty_2 columns in orders:
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL,
PRIMARY KEY (order_id, product_id)
);
-- One row per product per order — handles any number of products
-- Fix 3: Split composite address into atomic columns
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
street_address VARCHAR(200),
locality VARCHAR(100),
city VARCHAR(100),
state VARCHAR(100),
zip_code VARCHAR(10)
);
-- Each address component queryable independently
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
// Part 04
Second Normal Form (2NF) — No Partial Dependencies
A table is in 2NF when it is in 1NF AND every non-key column depends on the entire primary key — not just part of it. This only matters when the primary key is composite (made of multiple columns). If the PK is a single column, 1NF automatically implies 2NF.
Understanding functional dependency
Column B is functionally dependent on column A if knowing the value of A determines the value of B. Knowing a customer_id determines the customer's email (one customer_id → one email). Knowing a product_id determines the product's name. Knowing an order_id determines the order date.
A 2NF violation — partial dependency
2NF violation — order_items with extra columns
-- Imagine order_items stored product details alongside order details:
CREATE TABLE order_items_bad (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
unit_price DECIMAL(10,2),
-- These columns violate 2NF:
product_name VARCHAR(200), -- depends only on product_id, not on order_id
category VARCHAR(100), -- depends only on product_id
brand VARCHAR(100), -- depends only on product_id
PRIMARY KEY (order_id, product_id)
);
-- Composite PK: (order_id, product_id)
-- Dependency analysis:
-- quantity → depends on (order_id, product_id) BOTH ✓
-- unit_price → depends on (order_id, product_id) BOTH ✓ (price at time of order)
-- product_name → depends only on product_id ✗ PARTIAL DEPENDENCY
-- category → depends only on product_id ✗ PARTIAL DEPENDENCY
-- brand → depends only on product_id ✗ PARTIAL DEPENDENCY
-- Problems this causes:
-- If Amul Butter's name changes: must update EVERY order_item row
-- If a product is ordered 50,000 times: product_name stored 50,000 times
Fixing the 2NF violation
2NF fix — move partial dependencies to their own table
-- BEFORE (2NF violation):
order_items(order_id, product_id, quantity, unit_price, product_name, category, brand)
-- AFTER (2NF compliant — split into two tables):
-- Table 1: Facts about the order line item (depend on full PK)
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(order_id),
product_id INTEGER NOT NULL REFERENCES products(product_id),
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL, -- price AT TIME of order (valid here)
line_total DECIMAL(10,2) NOT NULL,
PRIMARY KEY (order_id, product_id)
);
-- Table 2: Facts about the product (depend only on product_id)
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(200) NOT NULL,
category VARCHAR(100) NOT NULL,
brand VARCHAR(100),
unit_price DECIMAL(10,2) NOT NULL -- current price (may differ from order price)
);
-- Now product_name is stored ONCE in products
-- order_items joins to products to get the name when needed
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
// Part 05
Third Normal Form (3NF) — No Transitive Dependencies
A table is in 3NF when it is in 2NF AND no non-key column determines another non-key column. Every non-key column must depend directly on the primary key — not on another non-key column via a chain (transitively).
Understanding transitive dependency
Column C is transitively dependent on the primary key A if A → B and B → C. The primary key determines B, and B determines C — but C does not directly depend on A. This is the chain that 3NF breaks.
A 3NF violation — transitive dependency
3NF violation — transitive dependency example
-- Imagine customers stored with city and state together:
CREATE TABLE customers_bad (
customer_id INTEGER PRIMARY KEY,
first_name VARCHAR(100),
email VARCHAR(255),
zip_code VARCHAR(10),
city VARCHAR(100), -- determined by zip_code (transitive!)
state VARCHAR(100) -- determined by zip_code (transitive!)
);
-- Dependency chain:
-- customer_id → zip_code (direct: each customer has one zip_code)
-- zip_code → city (a zip_code determines a city)
-- zip_code → state (a zip_code determines a state)
-- Therefore: customer_id → zip_code → city (TRANSITIVE)
-- Problems:
-- If Seattle's zip_code 560001 is reassigned to another city:
-- Must update EVERY customer row with that zip_code
-- Storage: city/state repeated for every customer with the same zip_code
-- Inconsistency risk: two customers with same zip_code can have different cities
-- Another classic 3NF violation:
CREATE TABLE employees_bad (
employee_id INTEGER PRIMARY KEY,
first_name VARCHAR(100),
department VARCHAR(100),
dept_head VARCHAR(100) -- determined by department, not employee_id!
);
-- employee_id → department → dept_head (TRANSITIVE)
Fixing the 3NF violation
3NF fix — extract the transitive dependency
-- BEFORE (3NF violation):
customers(customer_id, first_name, email, zip_code, city, state)
-- AFTER (3NF compliant):
-- Table 1: Customer facts that depend directly on customer_id
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
email VARCHAR(255) NOT NULL UNIQUE,
zip_code VARCHAR(10)
-- city and state removed — they belong in a zip_codes table
);
-- Table 2: Zip Code facts (city and state depend on zip_code, not customer)
CREATE TABLE zip_codes (
zip_code VARCHAR(10) PRIMARY KEY,
city VARCHAR(100) NOT NULL,
state VARCHAR(100) NOT NULL
);
-- employees fix:
CREATE TABLE departments (
department VARCHAR(100) PRIMARY KEY,
dept_head VARCHAR(100) NOT NULL
);
CREATE TABLE employees (
employee_id INTEGER PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
department VARCHAR(100) REFERENCES departments(department)
-- dept_head removed — it depends on department, not employee_id
);
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
In FreshCart's employees table, store_id is stored in employees and the store's city and manager_name are in the stores table. This is correct 3NF — if we stored store_city in employees, it would be a transitive dependency: employee_id → store_id → city.
// Part 06
FreshCart — Full Normalisation Walkthrough
Let us walk through the original denormalised FreshCart flat table from Part 01 and normalise it step by step to 3NF.
Starting point — the flat table (unnormalised)
Unnormalised flat table
-- Single flat table with everything mixed together
orders_flat(
order_id,
customer_name, -- should be in customers
customer_email, -- should be in customers
customer_city, -- should be in customers
product_name, -- should be in products
category, -- should be in products (depends on product)
brand, -- should be in products
quantity, -- belongs here (depends on order + product)
unit_price, -- belongs here (price at time of order)
store_city, -- should be in stores
store_manager -- should be in stores
)
Step 1 — Apply 1NF
Separate repeating groups. Each order can have multiple products — this is a repeating group. Split into orders (one row per order) and order_items (one row per product per order).
After 1NF — two tables
orders(order_id, customer_name, customer_email, customer_city,
store_city, store_manager, order_date)
-- One row per order ✓
order_items(order_id, product_name, category, brand, quantity, unit_price)
-- One row per product per order ✓
-- PK: (order_id, product_name) — composite
Step 2 — Apply 2NF
In order_items, the composite PK is (order_id, product_name). Check: which columns depend on the FULL PK vs only PART of it?
After 2NF — partial dependencies removed
-- order_items partial dependencies:
-- quantity → depends on (order_id, product_name) ✓ stays
-- unit_price → depends on (order_id, product_name) ✓ stays (price at time of order)
-- product_name → it IS part of the PK ✓
-- category → depends only on product_name ✗ move to products table
-- brand → depends only on product_name ✗ move to products table
-- After 2NF:
order_items(order_id, product_id, quantity, unit_price)
-- product_name, category, brand moved to their own table:
products(product_id, product_name, category, brand)
Step 3 — Apply 3NF
Check all tables for transitive dependencies — non-key columns that determine other non-key columns.
After 3NF — transitive dependencies removed
-- In orders table:
-- order_id → customer_name (direct) — but customer info should be its own entity
-- order_id → customer_email (depends on customer, not on order)
-- order_id → customer_city (depends on customer)
-- order_id → store_city (depends on store, not on order)
-- order_id → store_manager (depends on store)
-- Transitive chains:
-- order_id → customer_id → customer_email (3NF violation)
-- order_id → store_id → store_manager (3NF violation)
-- After 3NF:
customers(customer_id, customer_name, email, city)
stores(store_id, city, manager_name, monthly_target)
orders(order_id, customer_id, store_id, order_date, total_amount)
products(product_id, product_name, category, brand, unit_price)
order_items(order_id, product_id, quantity, unit_price, line_total)
-- This is exactly FreshCart's schema!
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
The JOIN query recreates the flat view on demand. The difference: each piece of data is stored exactly once. Updating a customer's city requires updating one row in customers — not scanning millions of order rows. Adding a new product does not require an order. Deleting an order does not lose product or store data.
// Part 07
Denormalisation — When to Break the Rules
Normalisation eliminates redundancy and prevents anomalies. But it has a cost: every query that needs data from multiple tables must perform JOINs. JOINs are powerful but not free — they require the database to match rows across tables, which takes time proportional to the sizes of the tables involved.
In read-heavy analytical workloads — reporting, dashboards, business intelligence — JOINs across heavily normalised schemas can become the bottleneck. This is where denormalisation is intentionally applied: trading storage efficiency and update simplicity for query speed.
When denormalisation is appropriate
Read-heavy analytics
A reporting table queried millions of times per day that is only updated nightly. The update cost is small, the read benefit is large.
Data warehouses
Star schema and snowflake schema in data warehouses intentionally denormalise for analytical query performance.
Pre-computed aggregates
Storing total_amount in orders instead of computing SUM(order_items.line_total) on every order query.
High-read lookup data
Storing city and state on the customer row even though they "belong" in a locations table — if city is filtered on 90% of queries, the JOIN cost adds up.
FreshCart's deliberate denormalisation
Deliberate denormalisation in FreshCart
-- orders.total_amount is denormalised
-- It could always be computed as: SELECT SUM(line_total) FROM order_items WHERE order_id = X
-- But storing it directly in orders saves that JOIN+SUM for every order lookup
-- This is a valid denormalisation because:
-- 1. total_amount is queried far more often than it is computed
-- 2. The application maintains consistency: it always updates total_amount when order_items change
-- 3. The performance benefit is significant (no JOIN for the most common query)
-- The trade-off: if an order_items line_total is corrected,
-- total_amount in orders must also be manually updated
-- Application code must maintain this invariant
-- order_items.unit_price is also denormalised:
-- The current product price is in products.unit_price
-- The price AT TIME OF ORDER is stored in order_items.unit_price
-- This is correct — it is a slowly changing dimension (not a 3NF violation)
🎯 Pro Tip
The rule of thumb for denormalisation: normalise first, then selectively denormalise based on measured performance data. Never denormalise speculatively — "this might be slow" is not a reason. Measure the query, confirm the JOIN is the bottleneck, then denormalise the specific data that eliminates the bottleneck. Document every denormalisation decision and how the application maintains data consistency.
// Part 08
Functional Dependencies — The Theory Behind Normalisation
Understanding normalisation deeply requires understanding functional dependencies — the formal mathematical concept that normal forms are built on.
Definition
A functional dependency A → B (read: "A determines B") means that for any two rows with the same value of A, those rows will always have the same value of B. Knowing A is enough to know B.
Functional dependencies in FreshCart
-- These functional dependencies exist in FreshCart's data:
customer_id → first_name -- one customer_id has one first_name
customer_id → email -- one customer has one email
customer_id → city -- one customer has one city
product_id → product_name -- one product has one name
product_id → category -- one product belongs to one category
product_id → unit_price -- one product has one current price
store_id → city -- one store is in one city
store_id → manager_name -- one store has one manager
order_id → customer_id -- one order belongs to one customer
order_id → order_date -- one order has one date
order_id → total_amount -- one order has one total
-- Composite key dependency:
(order_id, product_id) → quantity -- one line item has one quantity
(order_id, product_id) → unit_price -- one line item has one price at order time
-- Transitive dependencies (these would violate 3NF if in the wrong table):
customer_id → city (would violate 3NF if city were stored in orders)
store_id → manager_name (would violate 3NF if stored in orders)
Using functional dependencies to test normal forms
Systematic normalisation test
-- For any table, to check its normal form:
-- 1NF test: Is every column atomic?
-- "tags VARCHAR containing 'dairy,fresh,organic'" → FAIL 1NF
-- 2NF test (only for composite PKs): Does every non-key column
-- depend on the FULL primary key?
-- In order_items(order_id, product_id, quantity, product_name):
-- product_name depends only on product_id (partial) → FAIL 2NF
-- 3NF test: Does any non-key column determine another non-key column?
-- In customers(customer_id, email, city, state):
-- If city → state (a city is always in one state) → CHECK 3NF
-- Actually: in India, a city can span state borders (rare)
-- And a customer might be in a different state than their city suggests
-- So city does NOT reliably determine state → PASS 3NF (in this context)
-- Context matters: functional dependencies are facts about your DOMAIN
-- not about SQL syntax. You must understand the business rules.
// Part 09
Practical Normalisation — Step-by-Step Design
Here is the practical workflow for designing a normalised schema from scratch, applied to a new FreshCart feature: a supplier management system.
Requirements from the product team
FreshCart needs to track: which supplier provides which products, the supplier's contact details, the contract price (may differ from the selling price), the contract start and end dates, and whether the supplier is preferred for that product.
supplier_id → name, contact_person, phone, email, address. product_id → product_name, category (already in products table). (supplier_id, product_id) → contract_price, start_date, end_date, is_preferred (depends on the combination of both).
Step 3
Design tables based on functional dependencies
Each set of attributes with the same determinant becomes a table. The determinant becomes the primary key.
Normalised supplier schema — 3NF
-- Suppliers table: attributes that depend on supplier_id
CREATE TABLE suppliers (
supplier_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
supplier_name VARCHAR(200) NOT NULL,
contact_person VARCHAR(200) NOT NULL,
phone VARCHAR(20) NOT NULL,
email VARCHAR(255) NOT NULL UNIQUE,
city VARCHAR(100),
state VARCHAR(100),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Supply contracts: attributes that depend on (supplier_id, product_id)
CREATE TABLE supply_contracts (
supplier_id INTEGER NOT NULL REFERENCES suppliers(supplier_id),
product_id INTEGER NOT NULL REFERENCES products(product_id),
contract_price DECIMAL(10,2) NOT NULL CHECK (contract_price > 0),
start_date DATE NOT NULL,
end_date DATE,
is_preferred BOOLEAN NOT NULL DEFAULT false,
PRIMARY KEY (supplier_id, product_id),
-- A product can only have one preferred supplier
CONSTRAINT uq_preferred_supplier
UNIQUE (product_id, is_preferred)
-- Note: in practice this constraint is complex to implement
-- as UNIQUE allows multiple FALSE values but only one TRUE
);
-- This is 3NF because:
-- All supplier attributes depend directly on supplier_id
-- All contract attributes depend on (supplier_id, product_id)
-- No transitive dependencies
-- supplier city/state depend on supplier_id (the PK) directly
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
// Part 10
What This Looks Like at Work
You are reviewing a database design at a Seattle health-tech startup. A junior engineer has designed a single flat table for their telemedicine platform's appointment system. You need to normalise it.
Update anomaly: if Dr. Priya Sharma changes her phone, every appointment row must be updated. Deletion anomaly: if all appointments with a clinic are cancelled, the clinic's data is lost. Insertion anomaly: cannot add a doctor to the system without scheduling an appointment.
10:00 AM
You apply normalisation
Identify entities and functional dependencies, then design the 3NF schema.
Normalised 3NF design
-- Entity: Patient (attributes depend on patient_id)
CREATE TABLE patients (
patient_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
full_name VARCHAR(200) NOT NULL,
email VARCHAR(255) NOT NULL UNIQUE,
phone VARCHAR(20) NOT NULL,
city VARCHAR(100)
);
-- Entity: Doctor (attributes depend on doctor_id)
CREATE TABLE doctors (
doctor_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
full_name VARCHAR(200) NOT NULL,
phone VARCHAR(20) NOT NULL,
email VARCHAR(255) NOT NULL UNIQUE,
specialisation VARCHAR(200) NOT NULL,
consultation_fee DECIMAL(8, 2) NOT NULL CHECK (consultation_fee > 0)
);
-- Entity: Clinic (attributes depend on clinic_id)
-- (Removed from appointment — transitive dependency)
CREATE TABLE clinics (
clinic_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
clinic_name VARCHAR(200) NOT NULL,
address VARCHAR(500) NOT NULL,
city VARCHAR(100) NOT NULL
);
-- Association: which doctors work at which clinics
CREATE TABLE doctor_clinics (
doctor_id INTEGER REFERENCES doctors(doctor_id),
clinic_id INTEGER REFERENCES clinics(clinic_id),
PRIMARY KEY (doctor_id, clinic_id)
);
-- Entity: Appointment (depends on appointment_id)
CREATE TABLE appointments (
appointment_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
doctor_id INTEGER NOT NULL REFERENCES doctors(doctor_id),
clinic_id INTEGER NOT NULL REFERENCES clinics(clinic_id),
appointment_date DATE NOT NULL,
appointment_time TIME NOT NULL,
-- consultation_fee stored here: price at booking time (may differ from doctor's current fee)
consultation_fee DECIMAL(8,2) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'Scheduled'
CHECK (status IN ('Scheduled','Completed','Cancelled','No-show')),
notes TEXT
);
11:00 AM
Review and approval
You walk through the normalised design with the team. The CTO asks: "Why store consultation_fee in appointments if it is already in doctors?" You explain: the fee at booking time may differ from the doctor's current fee — storing it in appointments is a deliberate denormalisation that preserves historical accuracy (like unit_price in order_items). This is accepted and the design is approved.
🎯 Pro Tip
The question "why is this stored here if it's already in the other table?" is one of the most common and most useful questions in schema review. Sometimes the answer is "it should not be here — move it" (a normalisation fix). Sometimes the answer is "it needs to capture the value at a specific point in time, which may differ from the current value" (a valid denormalisation). Learning to distinguish between these two cases is a core database design skill.
// Part 11
Interview Prep — 5 Questions With Complete Answers
Q: What is normalisation and what problems does it solve?
Normalisation is the process of organising a relational database schema to reduce redundancy and eliminate data anomalies. It works by ensuring that each piece of data is stored in exactly one place, and that every column in a table depends on the primary key — not on other non-key columns or on only part of a composite key.
The three anomalies normalisation prevents: Update anomaly — when the same data exists in multiple places, updating it requires finding and changing every occurrence. Missing one creates inconsistency. Insertion anomaly — when data about one entity is mixed with data about another, you cannot insert one without the other. Deletion anomaly — when data about two different entities is stored together, deleting information about one accidentally destroys information about the other.
A concrete example: storing customer email in an orders table alongside order details means the email must be updated in every order row if it changes. A normalised design stores the email once in a customers table and references it from orders via customer_id. Changing the email requires one update to one row — not thousands of updates across all order rows. This is the fundamental value proposition of normalisation: store each fact exactly once, update it in exactly one place, and never risk inconsistency from partial updates.
Q: Explain the difference between 1NF, 2NF, and 3NF with examples.
First Normal Form (1NF) requires that every column contains a single, atomic (indivisible) value and that there are no repeating groups. A table violates 1NF if a cell contains multiple values (a comma-separated tags list), if the schema uses repeating column groups (product_1, product_2, product_3), or if there are duplicate rows. The fix is always to move multi-valued attributes to a separate table with a FK relationship.
Second Normal Form (2NF) applies only to tables with composite primary keys. It requires that every non-key column depends on the entire composite key — not just part of it. A partial dependency exists when a non-key column can be determined by only one column of the composite key. Example: in order_items(order_id, product_id, quantity, product_name), product_name depends only on product_id, not on the full (order_id, product_id) key — a 2NF violation. Fix: move product_name to a products table keyed by product_id alone.
Third Normal Form (3NF) requires that no non-key column determines another non-key column. A transitive dependency exists when column A (the PK) determines column B (non-key), and column B determines column C (another non-key). Example: in employees(employee_id, department, dept_head), employee_id → department → dept_head creates a transitive chain. dept_head depends on department, not directly on employee_id. Fix: move (department, dept_head) to a departments table — each department is stored once with its head. The key test: can you update one non-key column without touching the PK? If yes, it should be in a separate table.
Q: What is a functional dependency and how does it relate to normalisation?
A functional dependency A → B (A determines B) means that for any two rows with the same value of A, those rows will always have the same value of B. If you know A, you can uniquely determine B. Functional dependencies are facts about the domain — they reflect business rules about how data relates, not SQL syntax.
Functional dependencies are the mathematical foundation of normal forms. 1NF is about atomicity (no multi-valued attributes). 2NF is about eliminating partial functional dependencies on composite keys. 3NF is about eliminating transitive functional dependencies. BCNF generalises further: for every functional dependency A → B, A must be a superkey. Each normal form is defined in terms of which functional dependencies are or are not allowed.
Practically, to normalise a schema, you identify all functional dependencies in your domain, group attributes by their determinants (the left-hand side of dependencies), and make each determinant a primary key with its dependent attributes as columns. Attributes with the same determinant belong in the same table. Attributes with different determinants belong in different tables — connected by foreign keys. The resulting schema, built from correctly identified functional dependencies, will automatically be in 3NF.
Q: When is it acceptable to denormalise a schema and what are the risks?
Denormalisation is acceptable when the read performance benefit outweighs the write complexity cost — specifically in read-heavy analytical workloads where the same JOIN computation is performed millions of times per day, and the data being joined changes infrequently. Data warehouses intentionally use denormalised star schemas because analytical queries over billions of rows need to avoid expensive multi-table JOINs. Reporting tables populated nightly from normalised OLTP tables are another common case — the report runs against a flat denormalised table for speed.
Common legitimate denormalisations: storing computed aggregates (total_amount in orders instead of always computing SUM(order_items.line_total)), storing denormalised display values (customer_name on an events table to avoid a JOIN for every event display), and storing point-in-time values that must not change when the source changes (unit_price in order_items).
The risks: data inconsistency when the denormalised copy falls out of sync with the source. If orders.total_amount is stored but an order_items record is corrected, total_amount must be manually updated too — the database no longer enforces consistency. Application code must maintain the invariant. This requires careful testing, triggers, or event-driven synchronisation. The rule: normalise first, measure performance, denormalise only where you have data showing the JOIN is the bottleneck, and document every denormalisation decision with how consistency is maintained.
Q: How does FreshCart's schema demonstrate normalisation principles?
FreshCart's six-table schema is a clean example of 3NF design. Each entity — customers, products, stores, employees, orders, order_items — has its own table where all columns depend directly on that table's primary key. Customer information (name, email, city) is stored once in customers and referenced by customer_id wherever needed. Product information (name, category, brand) is stored once in products. Store information is in stores. No non-key column in any table determines another non-key column.
The relationships are enforced by foreign keys: orders.customer_id references customers, orders.store_id references stores, order_items.order_id references orders, order_items.product_id references products. This structure means changing a customer's city requires one UPDATE to one row in customers — not thousands of updates across all order rows. Deleting a store does not lose product data. Adding a new product does not require an order to exist first.
FreshCart has two deliberate denormalisations worth noting: orders.total_amount stores the sum of order_items.line_total — this is a computed aggregate stored for read performance, maintained by application code. order_items.unit_price stores the price at the time of the order, which may differ from the current products.unit_price — this is a point-in-time capture, not redundancy. Both are valid and intentional, following the principle of "normalise first, denormalise only where justified by real performance data."
// Part 12
Design Mistakes — And How to Recognise Them
Data quality issue: two orders show different cities for the same customer — which is correct?
Cause: Customer city is stored in the orders table (or any child table) rather than in a dedicated customers table. When the customer moved cities and only some orders were updated, different rows ended up with different city values for the same customer. This is a classic update anomaly caused by a 3NF violation — city transitively depends on customer_id through the order, not directly.
Fix: Move city to the customers table where it belongs. Each customer has one city, stored once. UPDATE requires changing one row. All orders automatically reflect the current city via the JOIN. Run a data cleanup: SELECT customer_id, COUNT(DISTINCT city) FROM orders GROUP BY customer_id HAVING COUNT(DISTINCT city) > 1 — find all customers with inconsistent cities. Pick the canonical value and update customers.city. Then remove city from the orders table.
Cannot add a new product to the catalogue — INSERT into products fails because there is no order to attach it to
Cause: Product information is stored in the orders or order_items table rather than in a standalone products table. This is an insertion anomaly — the schema treats products as a property of orders rather than as an independent entity. You cannot record the existence of a product until an order is placed for it.
Fix: Create a standalone products table with its own primary key. Every product in the catalogue gets a row in products regardless of whether it has ever been ordered. order_items then references products via product_id. This is the correct 2NF/3NF design: product facts (name, category, price) belong in products; order-specific facts (quantity, price at order time) belong in order_items.
Deleting a store record also loses all information about the products that were sold in that store
Cause: Product information is stored in the same table as store information or order information, rather than in its own products table. When the store record is deleted, the only rows containing that product's details are gone — a deletion anomaly. This happens when the schema does not separate entities into their own tables.
Fix: Ensure each entity (product, store, customer, order) has its own table. Products exist independently of stores and orders. The relationship between stores and orders is captured via orders.store_id (a FK). Deleting a store (assuming RESTRICT) would be prevented if orders reference it — protecting the historical record. The product catalogue exists independently and is unaffected by store deletions.
Report shows incorrect totals — order_items were corrected but orders.total_amount still shows the old value
Cause: A deliberate denormalisation (storing total_amount in orders alongside the detailed amounts in order_items) has gone out of sync. A correction was made to order_items without updating the aggregate in orders. This is the inherent risk of any denormalisation — the application must maintain consistency between the redundant copies.
Fix: After any correction to order_items, recalculate and update orders.total_amount: UPDATE orders SET total_amount = (SELECT SUM(line_total) FROM order_items WHERE order_id = orders.order_id) WHERE order_id IN (corrected_order_ids). Going forward, either use a trigger to automatically update total_amount when order_items change, or add an application-level invariant that any code path modifying order_items must also recalculate total_amount. Document the denormalisation clearly so future developers know both tables must be updated together.
Try It Yourself
A FreshCart analyst built this single flat table for tracking marketing campaigns: campaigns_flat(campaign_id, campaign_name, start_date, end_date, store_id, store_city, store_manager, product_id, product_name, product_category, discount_pct, target_units). Identify: (1) Which normal form does this violate and why? (2) What are the anomalies? (3) Design a normalised 3NF schema with correct tables and FKs. You do not need to write CREATE TABLE — just describe the tables and their columns.
🎯 Key Takeaways
✓Normalisation eliminates three data anomalies: update anomaly (same data in multiple places), insertion anomaly (cannot add one entity without another), and deletion anomaly (deleting one entity accidentally removes another).
✓1NF: every column is atomic (single value). No lists in cells, no repeating column groups. Fix: move multi-valued attributes to a separate table.
✓2NF: every non-key column depends on the ENTIRE composite primary key. Only relevant for composite PKs. Fix: move partially dependent columns to a table keyed by the partial key.
✓3NF: no non-key column determines another non-key column. No transitive dependencies. Fix: extract the transitive chain into its own table with the intermediate column as the PK.
✓Functional dependency A → B means knowing A uniquely determines B. Normalisation groups attributes by their determinant — each determinant becomes a table's primary key.
✓The target for production databases is 3NF. BCNF, 4NF, and 5NF address increasingly rare edge cases and are rarely needed outside academic or highly specialised contexts.
✓Denormalisation is intentionally breaking normalisation rules for read performance. Valid for analytics, reporting tables, and pre-computed aggregates. Always: normalise first, measure, then denormalise only where justified.
✓Storing a value at its point in time (unit_price in order_items, consultation_fee in appointments) is not a 3NF violation — it is a deliberate historical capture. The value captures what was true at that moment, not what is true now.
✓Every denormalisation creates an application-level invariant: the code must keep all redundant copies in sync. Document every denormalisation and how consistency is maintained.
✓FreshCart's six-table schema is 3NF: each entity has its own table, all columns depend directly on the PK, no transitive dependencies, two deliberate denormalisations (total_amount, order unit_price) with clear business justifications.
What comes next
In Module 27, you learn aggregate functions — COUNT, SUM, AVG, MIN, MAX — the tools that turn raw rows into business metrics. This is where SQL becomes the language of analytics.