Normalization — 1NF to 5NF
The complete science of designing databases that don't betray you — eliminating every class of anomaly, through every normal form, with complete worked examples and real-world context.
What Normalization Is — And the Pain It Was Built to Eliminate
Normalization is the process of structuring a relational database schema to reduce data redundancy and eliminate data anomalies. It was developed by Edgar Codd alongside the relational model itself — he recognised immediately that having a good data model (tables and relations) was not sufficient. You also needed rules for how to design good tables. Without those rules, engineers would produce relational schemas that technically satisfied the relational model but were deeply flawed in practice.
Codd's initial paper (1970) introduced First Normal Form. He then published Second and Third Normal Forms in 1971. Raymond Boyce and Codd together published BCNF in 1974. Fourth Normal Form was formalised by Ronald Fagin in 1977. Fifth Normal Form (also called Project-Join Normal Form) was published by Fagin in 1979. The progression spans nearly a decade of research — each new normal form discovered by finding a class of anomaly that the previous form failed to prevent.
Before we study the normal forms, we must understand deeply what they are protecting us from. The enemy of normalization is data anomalies — situations where the database can be updated in ways that leave it in an inconsistent state. There are three classes of anomalies, and they arise from a single root cause: storing facts about multiple independent things in the same table.
The Problem Table — Before We Start Fixing Anything
We will use one concrete, realistic table throughout this entire module — watching it transform step by step through every normal form. The table is called STUDENT_COURSE_TEACHER. It is the kind of table a beginner might design when asked to track which students are enrolled in which courses taught by which teachers.
| student_id | student_name | student_email | course_id | course_name | teacher_id | teacher_name | teacher_phone | teacher_dept | grade |
|---|---|---|---|---|---|---|---|---|---|
| S001 | Rahul Sharma | rahul@uni.in | CS301 | Database Systems | T01 | Prof. Kumar | 98765-43210 | Computer Science | A |
| S001 | Rahul Sharma | rahul@uni.in | CS302 | Algorithms | T02 | Prof. Singh | 87654-32109 | Computer Science | B+ |
| S002 | Priya Reddy | priya@uni.in | CS301 | Database Systems | T01 | Prof. Kumar | 98765-43210 | Computer Science | A+ |
| S002 | Priya Reddy | priya@uni.in | CS401 | Machine Learning | T03 | Prof. Rao | 76543-21098 | AI Department | B |
| S003 | Arjun Nair | arjun@uni.in | CS301 | Database Systems | T01 | Prof. Kumar | 98765-43210 | Computer Science | B+ |
| S003 | Arjun Nair | arjun@uni.in | CS302 | Algorithms | T02 | Prof. Singh | 87654-32109 | Computer Science | A |
| S004 | Kavya Krishnan | kavya@uni.in | CS401 | Machine Learning | T03 | Prof. Rao | 76543-21098 | AI Department | A+ |
Look at this table carefully. Prof. Kumar's phone number appears in THREE rows. Rahul Sharma's email appears in TWO rows. The course name "Database Systems" appears in THREE rows. All of this repeated data is redundancy — and redundancy is the breeding ground for anomalies.
The Three Anomalies — A Complete Dissection
A relation is in First Normal Form if and only if every attribute in every tuple contains exactly one atomic value — a single, indivisible value from the attribute's domain. No attribute may contain a set of values, a list, an array, a nested relation, or any repeating group. Additionally, all rows must be unique (the relation must have a primary key).
What Violates 1NF — Every Case
Our STUDENT_COURSE_TEACHER table actually satisfies 1NF as written — each cell contains exactly one value. But there are many real-world tables that violate 1NF. Understanding every violation type is essential because 1NF violations appear in every real codebase.
Storing multiple values in a single cell — typically as comma-separated strings, pipe-delimited lists, or serialised arrays. This is the most common 1NF violation in real code, especially when developers coming from NoSQL backgrounds start working with relational databases.
| student_id | name | phone_numbers | courses_enrolled |
|---|---|---|---|
| S001 | Rahul Sharma | 98765-43210, 87654-32109 | CS301, CS302 |
| S002 | Priya Reddy | 76543-21098 | CS301, CS401 |
| S003 | Arjun Nair | 65432-10987, 54321-09876, 43210-98765 | CS302 |
-- PROBLEM 1: Cannot search by individual phone number
SELECT * FROM students WHERE phone_numbers = '87654-32109';
-- Returns ZERO rows! Because phone_numbers = '98765-43210, 87654-32109' ≠ '87654-32109'
-- Workaround attempt (terrible):
SELECT * FROM students WHERE phone_numbers LIKE '%87654-32109%';
-- Works but: cannot use an index (full table scan), slow at scale, fragile
-- PROBLEM 2: Cannot count unique phone numbers across students
SELECT COUNT(DISTINCT phone_numbers) FROM students;
-- Counts unique strings ('98765-43210, 87654-32109') not unique phone numbers
-- PROBLEM 3: Cannot add a phone type (mobile/home/work) per number
-- The schema has no way to record that 98765-43210 is mobile and 87654-32109 is home
-- PROBLEM 4: Insertion anomaly — adding a phone number requires reading the field,
-- parsing it, appending the new number, then rewriting the whole string
UPDATE students
SET phone_numbers = phone_numbers || ', 22222-11111'
WHERE student_id = 'S001';
-- Race condition: two concurrent updates both read '98765-43210, 87654-32109'
-- and both write different values, one overwrites the other
-- THE 1NF SOLUTION: separate table
CREATE TABLE student_phones (
student_id VARCHAR(10) NOT NULL,
phone_number VARCHAR(20) NOT NULL,
phone_type VARCHAR(10) DEFAULT 'mobile',
PRIMARY KEY (student_id, phone_number),
FOREIGN KEY (student_id) REFERENCES students(student_id) ON DELETE CASCADE
);
-- Now: full indexing, proper queries, no race conditions, type-per-number possiblePre-relational databases used repeating groups — a way of storing multiple related values in sequentially numbered columns. While this stores one value per cell, it violates 1NF because it encodes a collection within the tuple structure itself.
| student_id | name | skill1 | level1 | skill2 | level2 | skill3 | level3 |
|---|---|---|---|---|---|---|---|
| S001 | Rahul | Python | Expert | SQL | Intermediate | Java | Beginner |
| S002 | Priya | Machine Learning | Expert | Python | Expert | ||
| S003 | Arjun | React | Intermediate |
The problems are obvious: what if a student has 4 skills? 10 skills? You'd need to ALTER TABLE to add more columns — or silently truncate the data. Querying all skills for a student requires checking all column pairs. Searching for students with a specific skill requires checking every skill column separately.
-- 1NF COMPLIANT: one row per skill
CREATE TABLE students (student_id VARCHAR(10) PRIMARY KEY, name VARCHAR(100) NOT NULL);
CREATE TABLE student_skills (
student_id VARCHAR(10) NOT NULL,
skill_name VARCHAR(100) NOT NULL,
level VARCHAR(20) CHECK (level IN ('beginner', 'intermediate', 'expert')),
PRIMARY KEY (student_id, skill_name),
FOREIGN KEY (student_id) REFERENCES students(student_id) ON DELETE CASCADE
);
INSERT INTO student_skills VALUES
('S001', 'Python', 'expert'),
('S001', 'SQL', 'intermediate'),
('S001', 'Java', 'beginner'),
('S002', 'Machine Learning', 'expert'),
('S002', 'Python', 'expert'),
('S003', 'React', 'intermediate');
-- Now: unlimited skills per student, full indexing, clean queries
SELECT * FROM student_skills WHERE skill_name = 'Python' AND level = 'expert';Storing a value that appears atomic but actually encodes multiple pieces of information that the application needs to decompose. This is the hardest 1NF violation to detect because the cell contains "one string" — but that string is actually two or more distinct data points.
-- VIOLATION: address stored as one string but city/pincode are queried separately
CREATE TABLE customers (
customer_id VARCHAR(10) PRIMARY KEY,
address TEXT -- "123 MG Road, Bengaluru, Karnataka, 560001"
);
-- App does: address.split(',')[1] to get city — violates 1NF
-- Cannot index on city, state, or pincode separately
-- VIOLATION: full_name stored when first/last are used independently
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
full_name VARCHAR(200) -- "Rahul Kumar Sharma"
);
-- App does: full_name.split(' ')[0] to get first name for salutations
-- Cannot sort by last name, cannot search by first name efficiently
-- VIOLATION: encoded composite meaning
CREATE TABLE products (
product_id VARCHAR(20) PRIMARY KEY
-- product_id = 'EL-SONY-TV-55-4K' encodes category + brand + type + size + quality
-- App parses product_id to extract category, brand etc.
-- This is NOT 1NF compliant — the value is composite
);
-- 1NF COMPLIANT VERSIONS:
CREATE TABLE customers (
customer_id VARCHAR(10) PRIMARY KEY,
street VARCHAR(200),
city VARCHAR(100), -- separately stored, separately indexable
state VARCHAR(50),
pincode CHAR(6)
);
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL, -- separate columns
last_name VARCHAR(50) NOT NULL,
-- full_name is derived: first_name || ' ' || last_name (computed in queries)
);
CREATE TABLE products (
product_id SERIAL PRIMARY KEY, -- meaningless surrogate key
product_code VARCHAR(50) UNIQUE NOT NULL, -- human-readable, but not used as PK
category VARCHAR(50), -- extracted from composite code, stored separately
brand VARCHAR(50),
product_type VARCHAR(50),
screen_size INT,
resolution VARCHAR(10)
);Our Working Table — Already in 1NF
Our STUDENT_COURSE_TEACHER table already satisfies 1NF: every cell has one atomic value and all rows are unique (because the combination of student_id + course_id uniquely identifies each row). So why does it still have terrible anomalies? Because 1NF is merely the starting point — it only ensures atomic values. It says nothing about how attributes relate to each other within the table. The anomalies we identified come from higher-level structural problems that 2NF and 3NF address.
A relation is in Second Normal Form if and only if: (1) it is in 1NF, AND (2) every non-prime attribute is fully functionally dependent on the entire primary key. A non-prime attribute is any attribute that is not part of any candidate key. Full functional dependency means no non-prime attribute is dependent on any proper subset (partial subset) of the primary key.
2NF only applies to relations with composite primary keys.A relation with a single-attribute primary key is automatically in 2NF (partial dependency is impossible when the key is one attribute — there is no proper subset of a single-element set other than the empty set).
Understanding Partial Dependency — The Core Concept
A partial dependency exists when a non-prime attribute is functionally determined by only part of the composite primary key — not the entire key. The attribute "partially depends" on the key because knowing only some key attributes is sufficient to determine its value.
In our STUDENT_COURSE_TEACHER table, the primary key is (student_id, course_id). Let us test every non-prime attribute to see if it depends on the full composite key or just part of it.
// RELATION: STUDENT_COURSE_TEACHER
// PRIMARY KEY: (student_id, course_id)
// NON-PRIME ATTRIBUTES: student_name, student_email, course_name, teacher_id,
// teacher_name, teacher_phone, teacher_dept, grade
// ANALYSIS: For each non-prime attribute, what determines it?
// student_name:
// Does student_name depend on student_id alone? YES
// (Knowing student_id = 'S001' tells us name = 'Rahul Sharma' — we don't need course_id)
// → PARTIAL DEPENDENCY: student_name → student_id (only part of PK)
// student_email:
// Does student_email depend on student_id alone? YES
// (Knowing student_id tells us the email — course doesn't matter)
// → PARTIAL DEPENDENCY: student_email → student_id
// course_name:
// Does course_name depend on course_id alone? YES
// (Knowing course_id = 'CS301' tells us name = 'Database Systems' — student irrelevant)
// → PARTIAL DEPENDENCY: course_name → course_id
// teacher_id:
// Does teacher_id depend on course_id alone? YES
// (CS301 is always taught by T01 regardless of which student)
// → PARTIAL DEPENDENCY: teacher_id → course_id
// teacher_name:
// Does teacher_name depend on course_id alone? YES (via teacher_id)
// → PARTIAL DEPENDENCY: teacher_name → course_id
// teacher_phone:
// Does teacher_phone depend on course_id alone? YES (via teacher_id)
// → PARTIAL DEPENDENCY
// teacher_dept:
// Does teacher_dept depend on course_id alone? YES (via teacher_id)
// → PARTIAL DEPENDENCY
// grade:
// Does grade depend on student_id alone? NO — a student has different grades per course
// Does grade depend on course_id alone? NO — a course has different grades per student
// Does grade depend on (student_id, course_id) TOGETHER? YES
// → FULL DEPENDENCY: grade → (student_id, course_id)
// grade is the ONLY 2NF-compliant attribute in this relation!
// CONCLUSION: Almost every attribute has a partial dependency.
// This table is severely 2NF-violating.Decomposing to 2NF — The Systematic Process
The fix for partial dependencies is always the same: remove the partially dependent attributes from the original relation and create a new relation containing those attributes along with the part of the key that determines them. The original relation retains only the fully dependent attributes.
We identify three distinct determinants among the partial dependencies: student_id determines student attributes, course_id determines course and teacher attributes, and (student_id, course_id) together determine grade. We create one table per determinant.
| student_id (PK) | student_name | student_email |
|---|---|---|
| S001 | Rahul Sharma | rahul@uni.in |
| S002 | Priya Reddy | priya@uni.in |
| S003 | Arjun Nair | arjun@uni.in |
| S004 | Kavya Krishnan | kavya@uni.in |
| course_id (PK) | course_name | teacher_id | teacher_name | teacher_phone | teacher_dept |
|---|---|---|---|---|---|
| CS301 | Database Systems | T01 | Prof. Kumar | 98765-43210 | Computer Science |
| CS302 | Algorithms | T02 | Prof. Singh | 87654-32109 | Computer Science |
| CS401 | Machine Learning | T03 | Prof. Rao | 76543-21098 | AI Department |
| student_id (PK, FK) | course_id (PK, FK) | grade |
|---|---|---|
| S001 | CS301 | A |
| S001 | CS302 | B+ |
| S002 | CS301 | A+ |
| S002 | CS401 | B |
| S003 | CS301 | B+ |
| S003 | CS302 | A |
| S004 | CS401 | A+ |
CREATE TABLE students (
student_id VARCHAR(10) PRIMARY KEY,
student_name VARCHAR(100) NOT NULL,
student_email VARCHAR(150) UNIQUE NOT NULL
);
CREATE TABLE courses (
course_id VARCHAR(10) PRIMARY KEY,
course_name VARCHAR(200) NOT NULL,
teacher_id VARCHAR(10) NOT NULL,
teacher_name VARCHAR(100) NOT NULL,
teacher_phone VARCHAR(20),
teacher_dept VARCHAR(100)
);
CREATE TABLE enrollments (
student_id VARCHAR(10) NOT NULL,
course_id VARCHAR(10) NOT NULL,
grade CHAR(2),
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES students(student_id) ON DELETE CASCADE,
FOREIGN KEY (course_id) REFERENCES courses(course_id) ON DELETE RESTRICT
);Verifying the 2NF Schema Fixes the Anomalies
Prof. Mehta can now be inserted into the COURSES table (or a TEACHERS table in 3NF) independently, without any student enrollment existing. A new student can be added to STUDENTS without any course enrollment.
Prof. Kumar's phone number now appears in exactly ONE row in the COURSES table. Update one row → done. No partial update risk.
Deleting Kavya's enrollment from ENROLLMENTS no longer affects COURSES. Prof. Rao's course (CS401) still exists in the COURSES table. Teacher data is independent.
Teacher_name, teacher_phone, teacher_dept appear in the COURSES table alongside course information. A teacher teaching multiple courses still has their data duplicated. This is a transitive dependency — fixed by 3NF.
A relation is in Third Normal Form if and only if: (1) it is in 2NF, AND (2) for every non-trivial functional dependency X → A, either X is a superkey OR A is a prime attribute (part of some candidate key).
In simpler terms: no non-prime attribute should be transitively dependent on the primary key. Every non-prime attribute must depend directly and only on the primary key — not on another non-prime attribute.
Understanding Transitive Dependency — The 3NF Enemy
A transitive dependency exists when a non-prime attribute A is functionally determined by another non-prime attribute B (which is itself determined by the primary key). The chain looks like: PK → B → A. A is transitively dependent on PK through B.
The problem: B and A now have a separate relationship that is being stored inside the table whose primary purpose is something else. When B's value changes in the real world (Prof. Kumar changes departments), we face the same update anomaly problem — we must find every row where B = "Prof. Kumar" and update A = department in each one.
// COURSES table after 2NF: (course_id, course_name, teacher_id, teacher_name, teacher_phone, teacher_dept)
// PRIMARY KEY: course_id
// NON-PRIME ATTRIBUTES: course_name, teacher_id, teacher_name, teacher_phone, teacher_dept
// DEPENDENCY ANALYSIS:
// course_id → course_name: DIRECT dependency on PK ✓
// course_id → teacher_id: DIRECT dependency on PK ✓
// (each course is taught by one teacher — teacher_id depends directly on course_id)
// But now look at teacher_name, teacher_phone, teacher_dept:
// teacher_id → teacher_name (knowing the teacher ID tells us their name)
// teacher_id → teacher_phone (knowing the teacher ID tells us their phone)
// teacher_id → teacher_dept (knowing the teacher ID tells us their department)
// So the full chain is:
// course_id → teacher_id → teacher_name ← TRANSITIVE DEPENDENCY
// course_id → teacher_id → teacher_phone ← TRANSITIVE DEPENDENCY
// course_id → teacher_id → teacher_dept ← TRANSITIVE DEPENDENCY
// teacher_name, teacher_phone, teacher_dept depend on course_id TRANSITIVELY through teacher_id.
// They are facts about the TEACHER, not facts about the COURSE.
// They have no business being in the COURSES table.
// EVIDENCE OF THE PROBLEM:
// Prof. Kumar teaches CS301. Prof. Kumar changes departments from CS to AI.
// We must update the COURSES table: UPDATE courses SET teacher_dept = 'AI' WHERE teacher_id = 'T01'
// If Prof. Kumar also teaches CS303 (another course), we must update that row too.
// If we miss any row → update anomaly. Same problem as before 2NF.Decomposing to 3NF — Extracting the Transitive Dependencies
The fix is the same pattern as 2NF: extract the transitively dependent attributes along with their determinant into a new table. The original table retains the determinant (teacher_id as a foreign key) but loses the transitively dependent attributes.
| teacher_id (PK) | teacher_name | teacher_phone | teacher_dept |
|---|---|---|---|
| T01 | Prof. Kumar | 98765-43210 | Computer Science |
| T02 | Prof. Singh | 87654-32109 | Computer Science |
| T03 | Prof. Rao | 76543-21098 | AI Department |
| course_id (PK) | course_name | teacher_id (FK) |
|---|---|---|
| CS301 | Database Systems | T01 |
| CS302 | Algorithms | T02 |
| CS401 | Machine Learning | T03 |
The Complete 3NF Schema — Four Clean Tables
-- All four tables. Every non-prime attribute depends directly on its table's primary key.
CREATE TABLE students (
student_id VARCHAR(10) PRIMARY KEY,
student_name VARCHAR(100) NOT NULL,
student_email VARCHAR(150) UNIQUE NOT NULL
-- student_name and student_email depend DIRECTLY on student_id ✓
);
CREATE TABLE teachers (
teacher_id VARCHAR(10) PRIMARY KEY,
teacher_name VARCHAR(100) NOT NULL,
teacher_phone VARCHAR(20) UNIQUE,
teacher_dept VARCHAR(100) NOT NULL
-- All attributes depend DIRECTLY on teacher_id ✓
);
CREATE TABLE courses (
course_id VARCHAR(10) PRIMARY KEY,
course_name VARCHAR(200) NOT NULL,
teacher_id VARCHAR(10) NOT NULL,
-- course_name depends directly on course_id ✓
-- teacher_id depends directly on course_id ✓ (course directly determines teacher)
FOREIGN KEY (teacher_id) REFERENCES teachers(teacher_id)
);
CREATE TABLE enrollments (
student_id VARCHAR(10) NOT NULL,
course_id VARCHAR(10) NOT NULL,
grade CHAR(2),
-- grade depends on the FULL composite key (student_id, course_id) ✓
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES students(student_id) ON DELETE CASCADE,
FOREIGN KEY (course_id) REFERENCES courses(course_id) ON DELETE RESTRICT
);
-- VERIFICATION: All anomalies eliminated?
-- Insert anomaly: YES — Prof. Mehta can be added to TEACHERS without any course
-- Update anomaly: YES — Prof. Kumar's phone is in ONE row in TEACHERS
-- Delete anomaly: YES — Deleting Kavya's enrollment doesn't touch TEACHERS or COURSESTransitive Dependencies in Single-PK Tables — The Hidden Case
The most dangerous 3NF violations are in single-primary-key tables because students often forget that 3NF applies there too. A table with a single-attribute PK is automatically in 2NF — but can easily violate 3NF if its non-prime attributes transitively depend on each other.
-- ORDERS table with a single PK order_id — but 3NF is still violated!
CREATE TABLE orders_bad (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
customer_name VARCHAR(100), -- depends on customer_id, NOT order_id directly!
customer_email VARCHAR(150), -- depends on customer_id, NOT order_id directly!
customer_city VARCHAR(100), -- depends on customer_id, NOT order_id directly!
order_date DATE,
total DECIMAL(10,2)
);
-- TRANSITIVE DEPENDENCIES:
-- order_id → customer_id (direct)
-- customer_id → customer_name, customer_email, customer_city (transitive through customer_id)
-- So: order_id → customer_id → customer_name (transitive → 3NF VIOLATION)
-- EVIDENCE OF PROBLEM:
-- Customer Rahul changes email: must update EVERY ORDER row for Rahul
-- Rahul has placed 100 orders → 100 rows must be updated → update anomaly
-- 3NF SOLUTION: extract customer data to separate table
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(150) UNIQUE NOT NULL,
customer_city VARCHAR(100)
);
CREATE TABLE orders_good (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL, -- FK (direct dependency on PK ✓)
order_date DATE NOT NULL,
total DECIMAL(10,2) NOT NULL CHECK (total >= 0),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- Now: customer_name appears ONCE in customers. Update once, consistent everywhere.A relation is in Boyce-Codd Normal Form if and only if for every non-trivial functional dependency X → Y in the relation, X is a superkey. No exceptions.
3NF allows one exception: if Y is a prime attribute (part of some candidate key), X does not need to be a superkey. BCNF removes this exception entirely. In BCNF, the only non-trivial dependencies allowed are those where the left side is a superkey.
Why 3NF Is Not Always Enough — The BCNF Case
For most practical schemas, 3NF and BCNF are equivalent — if a relation is in 3NF, it is also in BCNF. The difference only manifests when a relation has multiple overlapping candidate keys. This is the specific scenario where a relation can be in 3NF but not in BCNF.
The classic example: a university allows multiple teachers to teach the same course, but each student can only be taught a particular subject by one teacher (they don't have two different teachers for the same subject simultaneously). Also, each teacher teaches only one subject.
Business rules:
Each student studies each subject with exactly one teacher
A teacher teaches only one subject
A subject can be taught by multiple teachers
| student_id | teacher_id | subject |
|---|---|---|
| S001 | T01 | Physics |
| S001 | T02 | Chemistry |
| S002 | T01 | Physics |
| S002 | T03 | Physics |
| S003 | T02 | Chemistry |
| S003 | T01 | Physics |
Let us use a cleaner version where the rules are consistently satisfied:
| student_id | teacher_id | subject |
|---|---|---|
| S001 | T01 | Physics |
| S001 | T02 | Chemistry |
| S002 | T01 | Physics |
| S002 | T02 | Chemistry |
| S003 | T03 | Physics |
| S003 | T02 | Chemistry |
Finding Candidate Keys and FDs
// FUNCTIONAL DEPENDENCIES:
// FD1: (student_id, subject) → teacher_id
// "A student studies each subject with exactly one teacher"
// Knowing student + subject → we know which teacher
// FD2: teacher_id → subject
// "Each teacher teaches only one subject"
// Knowing teacher_id → we know which subject
// CANDIDATE KEYS:
// (student_id, teacher_id):
// Closure: {student_id, teacher_id}
// Apply teacher_id → subject: {student_id, teacher_id, subject}
// = all attributes → superkey
// Is it minimal? Remove student_id: {teacher_id} → {teacher_id, subject} ≠ all ✓
// Remove teacher_id: {student_id} → {student_id} ≠ all ✓
// → CANDIDATE KEY #1: {student_id, teacher_id}
// (student_id, subject):
// Closure: {student_id, subject}
// Apply (student_id, subject) → teacher_id: {student_id, subject, teacher_id}
// = all attributes → superkey
// Is it minimal? Remove student_id: {subject} → cannot derive teacher_id for a student ✓
// Remove subject: {student_id} → cannot derive teacher_id ✓
// → CANDIDATE KEY #2: {student_id, subject}
// TWO CANDIDATE KEYS: {student_id, teacher_id} and {student_id, subject}
// Prime attributes: student_id, teacher_id, subject — ALL attributes are prime!
// 3NF CHECK:
// For FD1: (student_id, subject) → teacher_id
// Is {student_id, subject} a superkey? YES (it's a candidate key)
// → Satisfies 3NF ✓
// For FD2: teacher_id → subject
// Is {teacher_id} a superkey? NO (teacher_id alone cannot determine student_id)
// Is subject a prime attribute? YES (subject is part of candidate key {student_id, subject})
// → 3NF exception applies: Y is prime → Satisfies 3NF ✓
// CONCLUSION: Relation IS in 3NF
// BCNF CHECK:
// For FD2: teacher_id → subject
// Is {teacher_id} a superkey? NO!
// BCNF requires every determinant to be a superkey — NO EXCEPTIONS
// → BCNF VIOLATED!The relation is in 3NF but NOT in BCNF. The FD teacher_id → subjectviolates BCNF because teacher_id is not a superkey. However, 3NF allows it because subject is a prime attribute.
The Anomaly That Remains in 3NF — But Not After BCNF
Despite being in 3NF, this relation still has an update anomaly: if teacher T01 changes from teaching Physics to teaching Mathematics, we must update every row where teacher_id = 'T01'. If T01 teaches 500 students, we update 500 rows. Miss one → inconsistency.
BCNF Decomposition
We decompose by removing the violating FD (teacher_id → subject) into its own relation:
| teacher_id (PK) | subject |
|---|---|
| T01 | Physics |
| T02 | Chemistry |
| T03 | Physics |
| student_id (PK) | teacher_id (PK, FK) |
|---|---|
| S001 | T01 |
| S001 | T02 |
| S002 | T01 |
| S002 | T02 |
| S003 | T03 |
| S003 | T02 |
The Critical BCNF Trade-off — Dependency Preservation
There is a significant cost to BCNF decomposition that 3NF decomposition does not have: BCNF decomposition may not preserve all functional dependencies. After decomposing to BCNF, some FDs may no longer be directly enforceable within a single table — enforcing them requires a JOIN.
-- After BCNF decomposition, we have:
-- TEACHER_SUBJECT(teacher_id PK, subject)
-- STUDENT_TEACHER(student_id PK, teacher_id FK)
-- The original FD: (student_id, subject) → teacher_id
-- This FD CANNOT be directly enforced in either BCNF table alone.
-- To verify it, we must JOIN:
SELECT st.student_id, ts.subject, st.teacher_id
FROM student_teacher st
JOIN teacher_subject ts ON st.teacher_id = ts.teacher_id;
-- To enforce "each student has at most one teacher per subject":
-- We need a UNIQUE constraint on (student_id, subject) — but these columns
-- are now in DIFFERENT tables! We cannot add a UNIQUE constraint that spans tables.
-- This means the BCNF schema can allow invalid data:
INSERT INTO student_teacher VALUES ('S001', 'T01'); -- T01 teaches Physics
INSERT INTO student_teacher VALUES ('S001', 'T03'); -- T03 ALSO teaches Physics!
-- Now student S001 has TWO Physics teachers — violates the original business rule
-- But the BCNF schema cannot prevent this without application-level enforcement
-- THE CHOICE:
-- 3NF: All dependencies preserved. Some redundancy possible. Anomaly-free enough for most uses.
-- BCNF: All dependencies enforced by superkeys. May lose some dependency enforcement in decomposition.
-- IN PRACTICE:
-- Most schemas in production aim for 3NF.
-- BCNF is preferred when the dependency that causes the violation is truly independent
-- and the lost FD can be enforced at the application layer or with triggers.For production database design, the standard recommendation is: aim for 3NF, and move to BCNF only when the anomalies remaining in 3NF are causing real production problems. BCNF is the theoretically stronger form, but its inability to always preserve functional dependencies means you may need triggers or application code to enforce constraints that were automatic in 3NF. The cost of that complexity must be weighed against the benefit of eliminating the remaining anomaly.
For exams and interviews: know the BCNF definition precisely, be able to identify when 3NF and BCNF differ, and be able to explain the trade-off between them.
A relation is in Fourth Normal Form if and only if it is in BCNF AND contains no non-trivial multi-valued dependencies unless the determinant is a superkey.
Multi-valued Dependencies — A New Category of Problem
BCNF handles functional dependencies — where one set of attributes determines one value. But some relations have a different kind of redundancy that functional dependencies cannot capture: multi-valued dependencies.
A multi-valued dependency X ↠ Y exists in a relation R(X, Y, Z) when: the set of Y values associated with a given X value is independent of the Z values associated with that same X value. The Y values and Z values are two independent sets of facts about X, and forcing them to coexist in one table causes spurious tuple multiplication.
A classic example: a course can have multiple textbooks AND multiple teachers. These are two independent facts about a course — the set of textbooks doesn't depend on which teachers teach it, and vice versa.
| course_id | teacher_id | textbook |
|---|---|---|
| CS301 | T01 | Database System Concepts |
| CS301 | T01 | Fundamentals of Database Systems |
| CS301 | T02 | Database System Concepts |
| CS301 | T02 | Fundamentals of Database Systems |
Notice the problem: every combination of teacher and textbook for CS301 must appear as a separate row. If CS301 gets a third teacher, we must add 2 new rows (one per textbook). If CS301 gets a third textbook, we must add 2 new rows (one per teacher). The rows multiply because two independent sets of facts are being stored together.
// MULTI-VALUED DEPENDENCIES in COURSE_TEACHER_TEXTBOOK:
// course_id ↠ teacher_id (the set of teachers for a course is independent of textbooks)
// course_id ↠ textbook (the set of textbooks for a course is independent of teachers)
// 4NF VIOLATION:
// course_id ↠ teacher_id: Is course_id a superkey? NO (course_id doesn't uniquely identify rows)
// course_id ↠ textbook: Is course_id a superkey? NO
// → 4NF VIOLATED
// THE ANOMALIES:
// Insert: Cannot add a new textbook for CS301 with ONE row — must add one row per teacher
// Delete: If T01 is removed from CS301, we must delete rows carefully to not lose textbook info
// Update: These same spurious combinations must be maintained consistently
// 4NF DECOMPOSITION: Split into two separate tables, one per independent multi-valued fact
// COURSE_TEACHERS: course_id ↠ teacher_id
// COURSE_TEXTBOOKS: course_id ↠ textbook
-- AFTER 4NF DECOMPOSITION:
CREATE TABLE course_teachers (
course_id VARCHAR(10) NOT NULL,
teacher_id VARCHAR(10) NOT NULL,
PRIMARY KEY (course_id, teacher_id),
FOREIGN KEY (course_id) REFERENCES courses(course_id),
FOREIGN KEY (teacher_id) REFERENCES teachers(teacher_id)
);
CREATE TABLE course_textbooks (
course_id VARCHAR(10) NOT NULL,
textbook VARCHAR(200) NOT NULL,
PRIMARY KEY (course_id, textbook),
FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
-- COURSE_TEACHERS: COURSE_TEXTBOOKS:
-- CS301 | T01 CS301 | Database System Concepts
-- CS301 | T02 CS301 | Fundamentals of Database Systems
-- NOW: adding teacher T03 to CS301 = 1 row in COURSE_TEACHERS
-- adding a new textbook = 1 row in COURSE_TEXTBOOKS
-- No spurious multiplication. Independent facts stored independently.A relation is in Fifth Normal Form (also called Project-Join Normal Form, PJNF) if and only if it is in 4NF and every join dependency in the relation is implied by the candidate keys. A join dependency exists when a relation can be losslessly decomposed into three or more projections that can be rejoined to reconstruct the original relation.
5NF addresses the rarest and most subtle form of redundancy — join dependencies that cannot be explained by either functional or multi-valued dependencies. The classic example involves a three-way relationship where a specific combination of values is meaningful only when all three are present together.
// Scenario: A SUPPLIER can supply certain PARTs, a SUPPLIER works on certain PROJECTs,
// and certain PARTs are used in certain PROJECTs.
// BUSINESS RULE: A supplier supplies a part to a project IF AND ONLY IF:
// (a) The supplier can supply that part, AND
// (b) The supplier works on that project, AND
// (c) That part is used in that project
// This is a cyclic join dependency — not expressible as an FD or MVD.
-- SUPPLIER_PART_PROJECT table (contains all valid three-way combinations):
-- supplier | part | project
-- S1 | P1 | J1
-- S1 | P1 | J2
-- S1 | P2 | J1
-- S2 | P1 | J1
-- If we decompose into three binary relations:
-- SUPPLIER_PART: {(S1,P1), (S1,P2), (S2,P1)}
-- SUPPLIER_PROJECT: {(S1,J1), (S1,J2), (S2,J1)}
-- PART_PROJECT: {(P1,J1), (P1,J2), (P2,J1)}
-- To reconstruct, we JOIN all three:
SELECT sp.supplier, sp.part, sj.project
FROM supplier_part sp
JOIN supplier_project sj ON sp.supplier = sj.supplier
JOIN part_project pp ON sp.part = pp.part AND sj.project = pp.project;
-- This JOIN produces EXACTLY the original tuples — no spurious tuples.
-- If any binary pair relationship is invalid (spurious), the JOIN removes it
-- because the third pair acts as a filter.
-- IN SQL (5NF-compliant schema):
CREATE TABLE supplier_parts (
supplier_id VARCHAR(10) NOT NULL,
part_id VARCHAR(10) NOT NULL,
PRIMARY KEY (supplier_id, part_id)
);
CREATE TABLE supplier_projects (
supplier_id VARCHAR(10) NOT NULL,
project_id VARCHAR(10) NOT NULL,
PRIMARY KEY (supplier_id, project_id)
);
CREATE TABLE part_projects (
part_id VARCHAR(10) NOT NULL,
project_id VARCHAR(10) NOT NULL,
PRIMARY KEY (part_id, project_id)
);
-- ADVANTAGE: Adding that S1 can now supply P3 to J1 requires
-- checking all three binary relations — the three-way constraint is
-- enforced by the intersection of independent binary facts.5NF is rarely encountered in practical application database design. It appears in academic literature, in GATE exam questions, and in certain specific domains (supply chain management, logistics, complex scheduling systems where three-way cyclic constraints are present). For most production schemas, achieving BCNF or 3NF is the practical target.
Lossless Decomposition — The Non-Negotiable Requirement
Every decomposition during normalisation must satisfy one absolute requirement: it must be lossless (also called lossless-join). A decomposition is lossless if the original relation can be reconstructed exactly by joining the decomposed relations — no spurious tuples added, no original tuples lost.
A lossy decomposition is a catastrophe — it permanently destroys information. You can never recover the original data from the decomposed tables. Every normalisation step you perform must preserve the ability to reconstruct the original data by joining.
// ORIGINAL RELATION R(A, B, C) with tuples:
// A | B | C
// 1 | x | p
// 1 | y | q
// 2 | x | r
// DECOMPOSITION 1: R1(A, B) and R2(B, C)
// R1: R2:
// A | B B | C
// 1 | x x | p
// 1 | y y | q
// 2 | x x | r
// RECONSTRUCT via natural join R1 ⋈ R2 (join on B):
// A | B | C
// 1 | x | p ← original ✓
// 1 | x | r ← SPURIOUS! This tuple was NOT in the original relation!
// 1 | y | q ← original ✓
// 2 | x | p ← SPURIOUS! This tuple was NOT in the original relation!
// 2 | x | r ← original ✓
// → LOSSY DECOMPOSITION — spurious tuples introduced. This decomposition DESTROYS INFORMATION.
// DECOMPOSITION 2: R1(A, B) and R2(A, C)
// R1: R2:
// A | B A | C
// 1 | x 1 | p
// 1 | y 1 | q
// 2 | x 2 | r
// RECONSTRUCT via natural join R1 ⋈ R2 (join on A):
// A | B | C
// 1 | x | p ← original ✓
// 1 | x | q ← SPURIOUS!
// 1 | y | p ← SPURIOUS!
// 1 | y | q ← original ✓
// 2 | x | r ← original ✓
// → ALSO LOSSY. Joining on A introduces spurious combinations.
// THE LOSSLESS DECOMPOSITION RULE (Heath's Theorem):
// A decomposition of R(X, Y, Z) into R1(X, Y) and R2(X, Z) is lossless if and only if:
// X → Y (X functionally determines Y), OR
// X → Z (X functionally determines Z)
// In other words: the attributes in the JOIN column set must be a superkey of at least one
// of the decomposed relations.
// APPLYING THE RULE:
// Decompose COURSES(course_id, course_name, teacher_id, teacher_name, teacher_phone, teacher_dept)
// into COURSES(course_id, course_name, teacher_id) and TEACHERS(teacher_id, teacher_name, teacher_phone, teacher_dept)
// JOIN attribute: teacher_id
// Is teacher_id a superkey of TEACHERS? YES (teacher_id is the PK of TEACHERS)
// → LOSSLESS DECOMPOSITION ✓
-- VERIFICATION:
SELECT c.course_id, c.course_name, c.teacher_id, t.teacher_name, t.teacher_phone, t.teacher_dept
FROM courses c JOIN teachers t ON c.teacher_id = t.teacher_id;
-- This JOIN exactly reconstructs the original COURSES table — no spurious tuples.Dependency Preservation — The Other Requirement
The second desirable property of normalisation decompositions is dependency preservation: every functional dependency in the original relation should be enforceable in at least one of the decomposed relations without requiring a join.
Why does this matter? Because FDs that span multiple tables cannot be enforced by simple constraints — they require either triggers or application-level checks, both of which are error-prone and performance-impacting. 3NF decomposition always preserves dependencies. BCNF decomposition may not.
// ORIGINAL RELATION: R(student_id, teacher_id, subject)
// FUNCTIONAL DEPENDENCIES:
// FD1: (student_id, subject) → teacher_id
// FD2: teacher_id → subject
// BCNF DECOMPOSITION produces:
// R1(teacher_id, subject) — teacher_id is PK, enforces FD2
// R2(student_id, teacher_id) — (student_id, teacher_id) is PK
// CHECK DEPENDENCY PRESERVATION:
// FD1: (student_id, subject) → teacher_id
// student_id is in R2. subject is in R1. teacher_id is in both.
// FD1 CANNOT be checked within R1 alone (no student_id there)
// FD1 CANNOT be checked within R2 alone (no subject there)
// Checking FD1 requires JOIN of R1 and R2 → NOT PRESERVED in decomposition
// → FD1 is a lost dependency
// CONSEQUENCE:
-- Can we insert this into R2 without violating FD1?
INSERT INTO r2_student_teacher VALUES ('S001', 'T01'); -- T01 teaches Physics
INSERT INTO r2_student_teacher VALUES ('S001', 'T03'); -- T03 also teaches Physics!
-- Without FD1 enforced in any single table, S001 now has TWO Physics teachers
-- The BCNF schema cannot prevent this — it requires application-level enforcement
// CONTRAST WITH 3NF DECOMPOSITION (from our main example):
// Original: COURSES(course_id, course_name, teacher_id, teacher_name, teacher_phone, teacher_dept)
// 3NF decomposition:
// COURSES(course_id, course_name, teacher_id) — course_id → teacher_id preserved here ✓
// TEACHERS(teacher_id, teacher_name, teacher_phone, teacher_dept) — teacher_id → * preserved ✓
// ALL FDs are preserved in the 3NF decomposition.Denormalization — When Breaking Normalization Rules Is the Right Decision
Normalization produces schemas with minimal redundancy and zero anomalies — which are ideal for data integrity. But normalised schemas have a performance cost: retrieving complete information about an entity often requires JOINs across multiple tables. At scale — millions to billions of rows — those JOINs can become performance bottlenecks.
Denormalization is the deliberate, controlled introduction of redundancy to improve read performance. It is a design choice, not a design mistake. The engineer who denormalises knows exactly what normalization rule they are breaking, why they are breaking it, and what mechanism will keep the redundant data consistent. Denormalization without a consistency strategy is just bad design.
When Denormalization Is Justified
When a JOIN query runs millions of times per day but the underlying data changes rarely, materialising the JOIN result as a pre-computed column eliminates the runtime JOIN cost. The cost of keeping the denormalised column in sync (via triggers or periodic updates) is far less than the cost of 10 million JOINs per day.
Sometimes you need to capture what the data looked like at a specific point in time — and normalisation cannot provide this because the referenced table changes. Denormalisation into the fact table preserves the historical snapshot.
Analytical databases (data warehouses) are read-heavy and write-infrequent. Complex analytical queries that aggregate across many dimensions perform dramatically better on denormalised star/snowflake schemas than on fully normalised OLTP schemas. This is why data warehouse design deliberately violates 3NF.
An API endpoint that renders a list of 100 orders, each needing the customer name, faces an N+1 problem: 1 query for orders + 100 queries for customer names. Denormalising customer_name into the orders table converts this to 1 query.
Denormalization Techniques and Consistency Strategies
-- TECHNIQUE 1: Storing a derived/redundant column with a trigger to maintain it
-- Denormalised: orders.customer_name (redundant — also in customers.name)
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
-- Populate existing rows:
UPDATE orders o SET customer_name = (SELECT name FROM customers WHERE customer_id = o.customer_id);
-- Trigger to keep in sync:
CREATE OR REPLACE FUNCTION sync_order_customer_name()
RETURNS TRIGGER AS $$
BEGIN
-- When a customer's name changes, update all their orders
UPDATE orders
SET customer_name = NEW.name
WHERE customer_id = NEW.customer_id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER after_customer_name_update
AFTER UPDATE OF name ON customers
FOR EACH ROW
WHEN (OLD.name IS DISTINCT FROM NEW.name)
EXECUTE FUNCTION sync_order_customer_name();
-- TRADE-OFF: UPDATE to customers.name now triggers UPDATE on orders table.
-- If a customer has 10,000 orders, updating their name touches 10,001 rows.
-- Acceptable if customer name changes are rare; unacceptable if frequent.
-- TECHNIQUE 2: Historical snapshot — intentional denormalisation (no sync needed)
-- order_items.unit_price is denormalised from products.price
-- This is NOT an error — it's intentional historical accuracy
CREATE TABLE order_items (
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL, -- snapshot of price at time of order
-- unit_price is intentionally redundant with products.current_price
-- It must NOT be updated when products.price changes — that would corrupt history
PRIMARY KEY (order_id, product_id)
);
-- Documentation comment: unit_price is a historical snapshot. Do NOT sync with products.price.
-- TECHNIQUE 3: Materialised view — database-managed denormalisation
-- PostgreSQL maintains a materialised view — a pre-computed, cached result set
CREATE MATERIALIZED VIEW order_summary AS
SELECT
o.order_id,
o.order_date,
c.name AS customer_name, -- denormalised from customers
c.city AS customer_city,
COUNT(oi.product_id) AS item_count,
SUM(oi.quantity * oi.unit_price) AS total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_date, c.name, c.city;
-- Create index on the materialised view for fast lookups:
CREATE INDEX idx_order_summary_customer ON order_summary(customer_name);
-- Refresh strategy: choose based on data freshness requirements
REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
-- CONCURRENTLY: refresh without locking reads (requires unique index on the view)
-- Schedule via pg_cron or application-level job:
-- SELECT cron.schedule('refresh-order-summary', '*/15 * * * *', 'REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary');
-- Query the materialised view (no JOIN at query time):
SELECT * FROM order_summary WHERE customer_city = 'Bengaluru' ORDER BY total DESC;
-- Fast: reads pre-computed data, uses index, no runtime JOINSwiggy's fully normalised schema for order data has: orders table, order_items table, products table, restaurants table, customers table — 5 JOIN operations to render one order card. At 50 million orders per day, the order history page is queried constantly.
The solution: a pre-computed order_history_view materialised view (or a separate order_display table) that stores order_id, restaurant_name, item_names (as JSON array), total, date, and status — all in one row. The history page reads ONE row per order. The normalised tables are still the source of truth — the denormalised view is refreshed every few minutes. Read performance: O(1). Consistency: eventually consistent (acceptable for order history display).
All Normal Forms — Side by Side
| Normal Form | Prerequisite | Additional Requirement | Eliminates | Dependency Type |
|---|---|---|---|---|
| 1NF | — | All attribute values are atomic; no repeating groups; rows are unique | Multi-valued cells, repeating groups | Structural constraint |
| 2NF | 1NF | Every non-prime attribute is fully dependent on the entire composite PK | Partial dependencies | Functional dependency (partial) |
| 3NF | 2NF | No non-prime attribute is transitively dependent on the PK | Transitive dependencies | Functional dependency (transitive) |
| BCNF | 3NF | For every non-trivial FD X→Y, X must be a superkey | Remaining FD anomalies in overlapping-key relations | Functional dependency (all) |
| 4NF | BCNF | No non-trivial multi-valued dependency X↠Y unless X is a superkey | Independent multi-valued attribute duplication | Multi-valued dependency |
| 5NF | 4NF | Every join dependency is implied by the candidate keys | Cyclic join dependency redundancy | Join dependency |
The Quick Test — How to Identify Which Normal Form Is Violated
// STEP 1: Is every cell atomic? No multi-valued cells, no repeating groups?
// NO → Not even 1NF. Fix: split into separate tables or separate columns.
// STEP 2: Is the primary key composite?
// NO → Automatically in 2NF (partial dependency impossible with single-attribute PK)
// Proceed to Step 3.
// YES → For each non-prime attribute A, ask:
// "Does A depend on the FULL primary key, or just PART of it?"
// If PART → 2NF violation. Fix: move A and its partial key to a new table.
// STEP 3: For each non-prime attribute A, ask:
// "What directly determines A?"
// If the answer is "the primary key directly" → OK
// If the answer is "another non-prime attribute B (which is determined by PK)"
// → Transitive dependency → 3NF violation.
// Fix: move B and its dependents to a new table with B as PK.
// STEP 4 (BCNF check): For every non-trivial FD X → Y in the relation:
// Is X a superkey?
// NO → BCNF violation (even if it passed 3NF because Y is a prime attribute).
// Fix: decompose, accepting potential dependency-preservation loss.
// STEP 5 (4NF check): Are there independent multi-valued attributes of the same key?
// e.g., course_id ↠ teachers AND course_id ↠ textbooks (independent sets)?
// YES → 4NF violation. Fix: separate tables for each multi-valued set.
// STEP 6 (5NF check — rarely needed):
// Can the relation be decomposed into 3+ projections that perfectly rejoin?
// And is this decomposition NOT implied by FDs or MVDs?
// YES → 5NF violation. Very rare in practice.
// PRACTICAL STOPPING POINT:
// For OLTP databases: aim for 3NF as minimum, BCNF where practical.
// For data warehouses: intentional denormalisation (star/snowflake schema) is standard.
// For exam/interview: understand all 6 normal forms precisely.The Schema Review — Catching Normalisation Violations in Code Review
Normalization knowledge shows up most powerfully in code reviews. When a junior engineer submits a migration adding new tables, a senior engineer immediately scans for the patterns we have studied. Here is a realistic review showing how this knowledge applies directly.
-- Migration: create training assignments table
CREATE TABLE training_assignments (
assignment_id SERIAL PRIMARY KEY,
employee_id INT NOT NULL,
employee_name VARCHAR(100), -- VIOLATION: 3NF transitive
employee_department VARCHAR(100), -- VIOLATION: 3NF transitive
employee_location VARCHAR(100), -- VIOLATION: 3NF transitive
course_id INT NOT NULL,
course_name VARCHAR(200), -- VIOLATION: 3NF transitive
course_duration_hrs INT, -- VIOLATION: 3NF transitive
vendor_id INT,
vendor_name VARCHAR(100), -- VIOLATION: 3NF transitive
vendor_contact VARCHAR(200), -- VIOLATION: 3NF transitive (multi-valued?)
assigned_date DATE NOT NULL DEFAULT CURRENT_DATE,
completion_date DATE,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
score INT,
skills_covered TEXT -- VIOLATION: 1NF (comma-separated values)
-- "Python,SQL,Data Engineering"
);Senior engineer's review comments:
These depend on employee_id, not on assignment_id. They are facts about the employee, not about the assignment. If Rahul changes departments, we must update every assignment row. Move them to the employees table (they may already be there). This table should only have employee_id as FK.
These depend on course_id, not assignment_id. Facts about the course belong in a courses table. Store only course_id as FK here.
These depend on vendor_id. Create a vendors table. Store only vendor_id FK here. Also: vendor_contact as TEXT is suspicious — if it contains multiple contact methods, this is also a 1NF violation. Separate table for vendor_contacts.
Comma-separated values in a TEXT column is a classic 1NF violation. If we ever need to query "find all employees who completed training covering Python" this query is impossible to do efficiently. Create a training_assignment_skills(assignment_id FK, skill VARCHAR) table with PK (assignment_id, skill).
assignment_id (PK), employee_id (FK), course_id (FK), vendor_id (FK, nullable), assigned_date, completion_date, status, score. That is it. Everything else belongs in the referenced tables.
-- Assuming employees, courses, and vendors tables already exist or are created separately
-- Correct training_assignments table: only assignment-specific facts
CREATE TABLE training_assignments (
assignment_id SERIAL PRIMARY KEY,
employee_id INT NOT NULL,
course_id INT NOT NULL,
vendor_id INT, -- nullable: some courses are internal (no vendor)
assigned_date DATE NOT NULL DEFAULT CURRENT_DATE,
completion_date DATE,
status VARCHAR(20) NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'in_progress', 'completed', 'failed', 'cancelled')),
score INT CHECK (score BETWEEN 0 AND 100),
FOREIGN KEY (employee_id) REFERENCES employees(employee_id) ON DELETE RESTRICT,
FOREIGN KEY (course_id) REFERENCES courses(course_id) ON DELETE RESTRICT,
FOREIGN KEY (vendor_id) REFERENCES vendors(vendor_id) ON DELETE SET NULL
);
-- Separate table for skills covered (1NF compliant)
CREATE TABLE assignment_skills (
assignment_id INT NOT NULL,
skill VARCHAR(100) NOT NULL,
PRIMARY KEY (assignment_id, skill),
FOREIGN KEY (assignment_id) REFERENCES training_assignments(assignment_id) ON DELETE CASCADE
);
-- Now queries are clean:
-- "Find employees who completed Python training in the last 6 months"
SELECT DISTINCT e.name, e.department
FROM employees e
JOIN training_assignments ta ON e.employee_id = ta.employee_id
JOIN assignment_skills ask ON ta.assignment_id = ask.assignment_id
WHERE ask.skill = 'Python'
AND ta.status = 'completed'
AND ta.completion_date >= CURRENT_DATE - INTERVAL '6 months';
-- "If training vendor changes contact info: update ONE row in vendors table.
-- Zero training_assignment rows need updating." ← the update anomaly is goneEvery Normalisation Interview Trap — Questions That Fail 90% of Candidates
🎯 Key Takeaways
- ✓Normalization eliminates data anomalies — insert (cannot add a fact without adding an unrelated fact), update (changing one real-world fact requires changing multiple rows), and delete (deleting one fact destroys another). All three arise from storing facts about multiple independent things in the same table.
- ✓1NF: every cell contains exactly one atomic value. No comma-separated lists, no repeating groups, no composite values used as a single atom. Rows must be unique. 1NF is the floor — tables can be in 1NF and still have severe anomalies.
- ✓2NF eliminates partial dependencies: every non-prime attribute must depend on the ENTIRE composite primary key, not just part of it. Only relevant for tables with composite primary keys. Fix: extract partially dependent attributes along with their partial key into a new table.
- ✓3NF eliminates transitive dependencies: no non-prime attribute should be determined by another non-prime attribute. The chain PK → B → A is a transitive dependency — A must be extracted along with B into a separate table with B as primary key. Applies to ALL tables including single-PK tables.
- ✓BCNF is stricter than 3NF: for every non-trivial FD X→Y, X must be a superkey — no exceptions. 3NF allows one exception (Y is a prime attribute). BCNF may not preserve all functional dependencies after decomposition. Most production schemas target 3NF; BCNF is applied selectively.
- ✓4NF eliminates multi-valued dependencies: when two independent sets of multi-valued facts exist about the same key (course_id ↠ teachers AND course_id ↠ textbooks), they cause tuple multiplication. Fix: one table per independent multi-valued set.
- ✓5NF (Project-Join Normal Form) eliminates join dependencies not implied by candidate keys. Rare in practice. Addresses cyclic three-way constraints where a combination of facts is only valid when all three components satisfy their pairwise binary relationships.
- ✓Every decomposition must be lossless: the original relation must be perfectly reconstructible by joining the decomposed relations. Guaranteed if the join attribute is a key (superkey) of at least one of the decomposed relations (Heath's Theorem).
- ✓Dependency preservation is desirable but not always achievable: 3NF decomposition always preserves dependencies. BCNF decomposition may not — some FDs may become unenforceable within a single table and require triggers or application-level enforcement.
- ✓Denormalization is deliberate, controlled introduction of redundancy for performance. Justified for high read-to-write ratios, historical snapshots, analytical workloads, and eliminating expensive JOIN patterns. Always requires a consistency mechanism (trigger, materialized view, or application logic) to prevent the anomalies that normalization prevents.
Discussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.