Data Models
The complete evolution of how humanity has organised data — from paper ledgers to distributed graph databases — and why every architectural decision in database engineering traces back to the choice of data model.
What Is a Data Model — The Precise Definition
Most students hear "data model" and think it means a database schema — the list of tables and columns. That is a specific instance of a data model, but not the concept itself. A data model is something far more fundamental.
A data model is a formal framework — a collection of concepts, rules, and constraints — that precisely describes three things:
When a database researcher says "the relational model," they are not talking about tables and SQL specifically. They are talking about the mathematical framework — the abstract structure of relations (sets of tuples), the algebra of operations over those relations, and the integrity rules that govern them. SQL is just one implementation. Tables on a screen are just one visualisation. The model is the theory underneath.
Why Choosing the Right Data Model is the Most Consequential Decision in System Design
The data model is chosen once — at the beginning of a system's life. After that, changing it is one of the most painful and expensive engineering operations that exists. Migrating from a relational model to a document model for a system with 100 million existing records is a multi-month project involving data transformation pipelines, application rewrites, data validation procedures, and careful rollout strategies. Companies have gone bankrupt executing bad data model migrations.
The data model choice determines: how fast queries run at scale, how easy it is to add new features, how much application code is required to maintain data integrity, what kind of engineers you need to hire, and what your database infrastructure costs will be. A wrong choice compounds over years. The most senior engineers at any company are the ones who chose their data models well at the beginning.
In the 1960s, every major organisation that adopted computers made a data model choice — and most of them got it wrong. The US banking system, airline reservation systems, and government agencies built massive hierarchical and network database systems that became increasingly expensive to maintain, modify, and query. When the relational model arrived in 1970, it was not just a technical improvement — it was a rescue. Edgar Codd's paper was titled "A Relational Model of Data for Large Shared Data Banks" — and it directly addressed the pain of organisations already drowning in the complexity of their existing data models. Understanding history is understanding why things are designed the way they are.
The Hierarchical Model
The hierarchical model was the first commercially significant data model, introduced by IBM in 1966 as part of the Information Management System (IMS) — built specifically to manage the bill of materials for the Apollo spacecraft programme. NASA needed to track hundreds of thousands of parts, each composed of sub-parts, each composed of further sub-parts — a naturally tree-shaped problem. The hierarchical model was a perfect fit for that specific problem. The trouble began when organisations tried to use it for everything else.
Structure — The Tree
In the hierarchical model, data is organised as a collection of trees. Each tree has a single root node at the top. Every node in the tree (except the root) has exactly one parent. A parent can have many children. Children can have their own children. This creates a rigid, top-down structure where relationships are physically embedded — a child record is stored directly beneath its parent record on disk.
University (root)
│
├── Department: Computer Science
│ ├── Course: DBMS (CS301)
│ │ ├── Student: Rahul Sharma (enrolled)
│ │ ├── Student: Priya Reddy (enrolled)
│ │ └── Student: Arjun Nair (enrolled)
│ │
│ └── Course: Algorithms (CS302)
│ ├── Student: Rahul Sharma (enrolled)
│ └── Student: Kiran Patel (enrolled)
│
└── Department: Mathematics
└── Course: Linear Algebra (MA201)
└── Student: Priya Reddy (enrolled)
// Notice: Rahul appears TWICE — once under CS301, once under CS302
// Priya appears TWICE — once under CS301, once under MA201
// This duplication is not a bug — it is a fundamental limitation of the modelThe critical observation in that diagram: Rahul Sharma appears twice because he is enrolled in two courses. In the hierarchical model, a child can only have one parent — but a student in reality can be enrolled in many courses. The model cannot represent this naturally. The solution was to duplicate the student record under each course they attend. This duplication is not accidental — it is a structural consequence of the one-parent constraint. And it brings back every problem we listed in Module 01: redundancy, inconsistency, and update anomalies.
How Navigation Worked — The Pointer Model
In IMS, data was not queried using a declarative language like SQL. Programmers wrote procedural code that navigated the tree structure manually — moving down from root to child to grandchild, following physical pointers stored in the records themselves. A query like "find all students enrolled in Computer Science courses" required code that:
// To find all CS students in IMS, a programmer would write:
GU DEPARTMENT (DEPT-NAME = 'Computer Science')
// GU = Get Unique — navigate to this node
GNP COURSE
// GNP = Get Next within Parent — get first course under CS dept
WHILE status = ok:
GNP STUDENT
// Navigate to each student under this course
WHILE status = ok:
PRINT student.name
GNP STUDENT
GNP COURSE // Move to next course
// This is not a query — it is navigation code.
// The programmer must know the physical tree structure to write this.
// Change the tree structure → rewrite every program that navigates it.
// This is the data dependency problem from Module 01 at its worst.Every DML operation in a hierarchical database required the programmer to know the exact path through the tree. There was no way to say "give me all students named Rahul" without knowing which branches of the tree might contain students named Rahul and navigating to each one explicitly. The lack of a declarative query language was a profound limitation.
Strengths — Where It Still Wins
If you always access data top-down along the tree — "get all orders for this customer" — the hierarchical model is extremely fast because the child records are physically adjacent to the parent on disk. One disk read retrieves the parent and all its children together.
For data that genuinely is a tree — file systems, XML documents, organisational charts, bill of materials — the hierarchical model maps directly to the natural structure without any awkward translation.
Inserting a new child record simply appends it beneath the parent. No index maintenance, no join tables, no foreign key checks across multiple tables. For high-volume insert workloads with a natural hierarchy, this is efficient.
Because the data structure is fixed and navigation paths are predetermined, query execution time is predictable — no query optimiser needed, no statistics to maintain. A given operation always takes the same amount of time.
Weaknesses — Why It Was Replaced
A student enrolled in multiple courses, a product in multiple categories, an employee in multiple projects — any real-world many-to-many relationship requires data duplication. The model has no concept of a record having two parents.
Every many-to-many relationship forces copies of records under multiple parents. Update one copy but not others and the database is inconsistent. There is no mechanism in the model to prevent or detect this.
DL/I (Data Language One, IBM's query language for IMS) is entirely navigational and procedural. Queries require detailed knowledge of the physical tree structure. Ad-hoc questions from business users are effectively impossible.
Adding a new level to the hierarchy, or reorganising the tree structure, requires rewriting every program that navigates the affected branches. The data-program dependency problem (Module 01) is at its most severe here.
Queries that span across tree branches — "find all students who are enrolled in both CS and Math courses" — require navigating multiple separate trees and manually correlating results in application code.
Deleting a parent node deletes all its children. Deleting a course deletes all enrollment records. Deleting a department deletes all its courses and all student enrollments. Data is coupled through the physical tree.
IBM IMS is still in active production use at thousands of banks, insurance companies, airline systems, and government agencies worldwide. According to IBM's own figures, IMS processes more than 50 billion transactions per day — more than any other database system. The Japanese banking system, several US federal agencies, and most major airlines run core transaction processing on IMS databases built in the 1970s.
Why hasn't it been replaced? Because migrating decades of hierarchical data to a relational model requires understanding every implicit structural relationship baked into the tree — many of which are undocumented and known only to engineers who retired or died. The risk and cost of migration exceeds the pain of maintaining the existing system. This is a critical lesson: data model decisions outlive every other technical decision in a system's lifetime.
The Network Model
In 1969, a committee called CODASYL (Conference on Data Systems Languages — the same group that standardised COBOL) published a specification for a network database standard. The network model was a direct response to the hierarchical model's most glaring weakness: the inability to represent many-to-many relationships without data duplication.
The key innovation of the network model was simple but profound: it removed the one-parent constraint. In the network model, a record can have multiple parent records and multiple child records simultaneously. Instead of a tree, data is organised as a directed graph — nodes connected by named relationships called sets.
Structure — Sets and Owners
The fundamental concept in the network model is the set — a named, directed relationship between two record types. A set has an owner record type and a member record type. One owner can have many members (one-to-many). Crucially, a record type can be a member in multiple different sets simultaneously — it can have multiple owners of different types.
// The many-to-many problem is now solvable:
STUDENT record: Rahul Sharma (S001)
│
├── ENROLLED_IN set ──────▶ COURSE: DBMS (CS301)
│ │
│ └── TAUGHT_BY set ──▶ TEACHER: Prof. Kumar
│
└── ENROLLED_IN set ──────▶ COURSE: Algorithms (CS302)
│
└── TAUGHT_BY set ──▶ TEACHER: Prof. Singh
// Rahul appears ONCE. He is a member of ENROLLED_IN set twice — once per course.
// No duplication. The student record has two logical parents through the same set.
// Cross-structure query: "find all students taught by Prof. Kumar"
// Navigate: TEACHER(Prof. Kumar) ← TAUGHT_BY ← COURSE ← ENROLLED_IN ← STUDENTS
// Still navigational — but now both directions are possible without duplication.How Navigation Worked — Currency Indicators
The network model used a concept called currency indicators — essentially pointers that kept track of the "current" record in each set and record type as the program navigated the database. A program would call operations like FIND, GET, STORE, MODIFY, ERASE, CONNECT, and DISCONNECT to navigate and manipulate the graph. The programmer was responsible for maintaining correct navigation state throughout every operation.
// "Find all courses taken by student S001 and their teachers"
MOVE 'S001' TO STUDENT-ID
FIND ANY STUDENT USING STUDENT-ID
// Currency: current STUDENT = Rahul Sharma
FIND FIRST COURSE WITHIN ENROLLED_IN
// Currency: current COURSE = DBMS (CS301)
WHILE DB-STATUS = 0:
GET CURRENT COURSE // Retrieve course data
FIND OWNER WITHIN TAUGHT_BY // Navigate to teacher
GET CURRENT TEACHER // Retrieve teacher data
PRINT COURSE-NAME, TEACHER-NAME
FIND NEXT COURSE WITHIN ENROLLED_IN // Next course for this student
// The programmer must manage currency indicators manually.
// If you navigate off-course (pun intended), currency is lost and
// you must restart the navigation from a known anchor point.
// Writing correct CODASYL programs required deep expertise in the
// exact structure of every set in the database schema.Why the Network Model Failed to Dominate
The network model solved the many-to-many relationship problem that crippled hierarchical databases. But it solved it at an enormous cost in complexity. The CODASYL specification itself was hundreds of pages long. Writing correct network database programs required intimate knowledge of the physical storage structure, careful management of currency indicators, and sophisticated error handling for every navigation operation.
More critically, the network model shared the hierarchical model's fatal flaw: it was navigational and procedural, not declarative. There was still no equivalent of SQL — no way to say "give me all students who scored above 80 in courses taught by professors in the Computer Science department" without writing a multi-page navigation program. Every ad-hoc business question required custom development work.
When Edgar Codd published his relational model paper in 1970, he explicitly criticised both the hierarchical and network models for their navigational nature. His key insight — which we examine thoroughly in the next section — was that users should describe what data they want, not how to navigate to it. That insight made both models obsolete for general-purpose data management.
The network model is not dead — it evolved. Modern graph databases (Neo4j, Amazon Neptune) are the direct intellectual descendants of the CODASYL network model. The fundamental insight — that some data is inherently graph-shaped and should be stored as nodes and edges rather than forced into tables — was correct. What changed is the query language. Cypher (Neo4j's query language) is declarative: "MATCH (s:Student)-[:ENROLLED_IN]->(c:Course) RETURN s, c" — describe the pattern you want, not the navigation path to find it. The network model's structure survived; its procedural interface did not.
The Relational Model — Why It Changed Everything
On June 26, 1970, IBM researcher Edgar F. Codd published a 12-page paper in the Communications of the ACM titled "A Relational Model of Data for Large Shared Data Banks." It is, without exaggeration, one of the most influential scientific papers in the history of computing. Codd won the Turing Award for it in 1981 — the highest honour in computer science.
Codd's central insight was deceptively simple: the hierarchical and network models required users to navigate physical data structures. This meant users had to understand how data was stored to retrieve it. This was backwards. The right abstraction, Codd argued, was to let users describe the logical properties of the data they wanted — and have the system figure out how to retrieve it. This is the principle of data independence applied to queries.
The Mathematical Foundation — Set Theory and Predicate Logic
Unlike the hierarchical and network models, which were engineering constructs without formal mathematical foundations, Codd built the relational model on two branches of mathematics that had been rigorously developed for decades: set theory (specifically, the mathematics of relations — mappings between sets) and first-order predicate logic(the formal language of logical conditions and quantification).
This mathematical foundation is not academic decoration — it has profound practical consequences. Because the relational model is mathematically defined, every operation has provably correct semantics. You can formally prove that a query returns correct results, that a decomposition preserves information, that an integrity constraint prevents a specific class of errors. No other data model has ever achieved this level of formal rigour, and it is a primary reason the relational model has remained the foundation of database engineering for over 50 years.
The Three Pillars of the Relational Model
A relation is a mathematical set of tuples — each tuple being an ordered list of domain values. In practical terms, a relation is a table: rows are tuples, columns are attributes, and each column has a domain (the set of valid values). The critical mathematical property: a relation is a SET — which means no two tuples can be identical (no duplicate rows), and tuples have no inherent order (the order of rows in a table is meaningless). These constraints come from set theory, not from engineering convenience.
relation R = {(C001, Rahul, Bengaluru), (C002, Priya, Hyderabad), (C003, Arjun, Mumbai)}
// This is a mathematical set of 3-tuples.
// No duplicate tuples (set property).
// Unordered (set property — sequences are not sets).
// Each position has a domain: customer_id ∈ strings, name ∈ strings, city ∈ city_namesCodd defined eight fundamental operations over relations — Selection (σ), Projection (π), Cartesian Product (×), Union (∪), Difference (−), Rename (ρ), Join (⋈), and Division (÷). These operations are closed — every operation takes one or more relations as input and produces a relation as output. This closure property means operations can be composed arbitrarily: the output of one operation can be the input of another. SQL is a declarative language that translates user queries into sequences of these algebraic operations.
// "Find names of customers from Bengaluru with orders above ₹500"
// In relational algebra:
π_name(σ_city='Bengaluru' AND amount>500 (customers ⋈ orders))
// Translation: JOIN customers and orders, SELECT rows matching conditions,
// PROJECT down to just the name column.
// The DBMS optimiser decides the best physical execution order for this.Codd defined integrity constraints that must hold over all relations at all times — not just when convenient. Domain constraints (values must be from the defined domain), Key constraints (no two tuples can have identical primary key values), and Referential Integrity (a foreign key value must either be null or reference an existing primary key value in the referenced relation). These constraints are enforced by the DBMS automatically — applications cannot violate them even if they try.
-- Domain constraint: age must be a non-negative integer
CHECK (age >= 0 AND age <= 150)
-- Key constraint: no two customers can have the same customer_id
PRIMARY KEY (customer_id) -- implicitly enforces uniqueness + not null
-- Referential integrity: every order must reference an existing customer
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
-- The DBMS rejects any INSERT into orders where customer_id doesn't existThe Physical-Logical Separation — Codd's Deepest Insight
Perhaps the most elegant feature of the relational model is the complete separation between the logical view of data and its physical storage. In a relational database, you describe data in terms of relations and attributes — logical concepts. The DBMS is completely free to store this data in any physical form it chooses: B+ trees, heap files, hash tables, sorted arrays, compressed columns — anything.
This means a DBA can completely reorganise the physical storage — rebuild indexes, change file organisation, move data between storage tiers — without any application knowing or caring. Applications always see the same logical view: tables with rows and columns. This is physical data independence in its purest form, and it is a direct mathematical consequence of the relational model's structure.
Why Relational Databases Handle Concurrency Better Than Any Previous Model
The hierarchical and network models used physical pointers to link records. When two transactions tried to modify related records simultaneously, they would both be modifying pointer chains — an operation that requires careful coordination and is fundamentally difficult to parallelise safely.
The relational model represents relationships through shared values — a foreign key in one table matching a primary key in another. Values, unlike pointers, can be compared and locked at the tuple level without affecting other tuples. This makes the relational model inherently more amenable to concurrent access control — which is why ACID transaction semantics were much easier to implement and reason about in relational systems than in hierarchical or network systems.
SQL — The Declaration of Independence from Navigation
SQL (originally called SEQUEL — Structured English Query Language, developed at IBM by Donald Chamberlin and Raymond Boyce in 1974) is the declarative interface to the relational model. It is the language in which users specify what data they want, leaving the how to retrieve it entirely to the DBMS.
-- NAVIGATIONAL (Network Model style):
// "Find all students enrolled in DBMS course taught by a CS department professor"
FIND ANY COURSE WHERE COURSE-NAME = 'DBMS'
FIND FIRST STUDENT WITHIN ENROLLED_IN
WHILE DB-STATUS = 0:
FIND OWNER WITHIN TEACHES
GET CURRENT PROFESSOR
IF PROFESSOR.DEPT = 'Computer Science':
FIND CURRENT STUDENT
GET CURRENT STUDENT
PRINT STUDENT.NAME
FIND NEXT STUDENT WITHIN ENROLLED_IN
// The programmer navigates. The programmer manages state. The programmer knows the structure.
-- DECLARATIVE (Relational Model / SQL):
SELECT s.name
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
JOIN courses c ON e.course_id = c.course_id
JOIN professors p ON c.professor_id = p.professor_id
WHERE c.course_name = 'DBMS'
AND p.department = 'Computer Science';
-- The programmer describes WHAT they want.
-- The DBMS query optimiser decides HOW to retrieve it.
-- Add an index → same SQL runs faster. Change storage → same SQL works.
-- A business analyst who doesn't know pointer structures can write this.The Relational Model's Limitations — Honest Assessment
The relational model is not without genuine limitations. Understanding them is essential for knowing when to reach for alternative models — and for not making the mistake of forcing every problem into a relational box.
Modern applications are built with object-oriented programming languages. Objects have inheritance, polymorphism, nested structures, and methods. Relational tables are flat. Mapping objects to tables and back (ORM — Object-Relational Mapping) is complex, error-prone, and introduces performance overhead. Hibernate, SQLAlchemy, and Prisma exist entirely because of this mismatch.
The relational model requires a predefined schema. Every row in a table must have the same columns. Adding a new column requires an ALTER TABLE statement that locks the table (depending on the DBMS) and may require migrating all existing data. For applications where the data structure is still evolving rapidly, this rigidity is a significant friction.
The relational model assumes that all related data is accessible in a single system (for join operations to work). When you need to distribute a relational database across multiple servers (sharding), you either lose the ability to join across shards, or you introduce enormous coordination complexity. This is why social networks, global content platforms, and internet-scale applications often reach for non-relational databases.
Storing and retrieving a deeply nested data structure (a product with categories, subcategories, variants, images, reviews, and attributes) in a relational model requires many tables and many JOIN operations. For read-heavy access patterns where the entire nested structure is always needed at once, this is inefficient compared to storing the nested structure in a document database.
The Entity-Relationship Model — Designing Before Building
In 1976, Peter Chen published "The Entity-Relationship Model — Toward a Unified View of Data" — a paper that provided what the relational model itself lacked: a design methodology. The relational model told you how to store data once you had a schema. The ER model told you how to design the schema in the first place.
The ER model is conceptual — it operates at a higher level of abstraction than the relational model. It models the real world directly: the things that exist (entities), their properties (attributes), and how they relate to each other (relationships). It uses a visual notation (ER diagrams) that non-technical stakeholders can understand and validate before a single table is created.
This is why the ER model and the relational model are complementary rather than competing: you use the ER model to design your database schema, then convert the ER diagram to relational tables using a defined set of mapping rules. This is the standard database design methodology used in industry and academia for nearly 50 years.
The ER Model's Place in the Database Design Process
The Object-Oriented Model
The 1980s brought the rise of object-oriented programming — Smalltalk, C++, and later Java changed how software was written. Objects — with their encapsulated state (attributes), behaviour (methods), inheritance hierarchies, and polymorphism — became the dominant programming paradigm. And immediately, programmers faced a painful problem: their in-memory objects needed to be persisted in relational databases, and the translation was awkward and expensive.
This friction — called the object-relational impedance mismatch — motivated the development of object-oriented database management systems (OODBMS). The fundamental idea: eliminate the mismatch entirely by storing objects directly in the database, exactly as they exist in memory. No transformation, no mapping, no ORM layer.
Core Concepts of the Object-Oriented Model
Every object in an OODBMS has a unique, system-generated object identifier (OID) — similar to a primary key in a relational database, but managed entirely by the DBMS. The OID is immutable — it never changes regardless of how the object's attribute values change. Relationships between objects are represented by storing the OID of referenced objects (like pointers, but persistent).
Customer object with OID=0x4A2F has attributes (name="Rahul", city="Bengaluru") and a reference to Order OID=0x8B3C. The link is the OID, not a foreign key.Unlike relational attributes (which must be atomic — a single value), object attributes can be complex types: lists, sets, arrays, other objects, or nested structures. A Customer object can have an attribute orders of type List<Order> where each Order is itself an object with its own attributes.
customer.orders = [Order{id:1, items:[...], total:280}, Order{id:2, items:[...], total:450}]
// This nested structure is stored and retrieved as-is — no join required.Objects are instances of classes. Classes can inherit from other classes, inheriting all parent attributes and methods. A SavingsAccount class can extend Account, inheriting account_number and balance, and adding interest_rate. The database understands the inheritance hierarchy.
Account (account_number, balance, owner)
├── SavingsAccount (interest_rate)
└── CurrentAccount (overdraft_limit)
// Query: find all Accounts with balance > 10000 returns both types.Objects in an OODBMS can have methods — executable code stored alongside the data. A BankAccount object can have a transfer(amount, target) method. The logic is part of the stored object. This is a radical departure from the relational model where data and code are strictly separated.
account.deposit(5000) // method stored in the database
account.calculateInterest() // business logic lives WITH the dataThe ODMG (Object Data Management Group) standardised OQL (Object Query Language) in 1993 as the object-oriented equivalent of SQL. It is declarative like SQL but operates on objects and their traversal paths rather than flat table joins.
SELECT c.name, c.orders
FROM customers c
WHERE c.city = "Bengaluru"
AND COUNT(c.orders) > 5
// c.orders is traversed directly — no JOIN across tables neededWhy OODBMS Failed to Replace Relational
Object databases were genuinely useful for specific domains — CAD/CAM engineering systems (where objects are literally mechanical components with complex structures), geographic information systems (GIS), multimedia applications, and scientific simulation data. Objectivity/DB, GemStone, and Versant were commercially successful OODBMS products.
But they never replaced relational databases for general business applications, for several fundamental reasons:
OQL was standardised in 1993 but vendor implementations diverged significantly. Business analysts and managers could not use object databases as easily as they could use SQL. The SQL ecosystem — reporting tools, BI software, ETL tools — all assumed relational tables.
Oracle, PostgreSQL, and SQL Server added object-relational extensions — user-defined types, nested tables, array columns. This allowed relational databases to handle complex types without abandoning the relational core. The relational camp absorbed the best ideas from object databases.
Object databases lacked the decades of query optimiser research that relational databases had accumulated. OQL queries often performed poorly on large datasets compared to equivalent SQL queries on a well-indexed relational database.
By the 1990s, SQL had become the universal language of enterprise data. Decades of developer training, tooling, and institutional knowledge were invested in SQL and relational databases. Object databases required learning an entirely new paradigm with far fewer resources.
The Object-Relational Model — The Best of Both Worlds
The object-relational model emerged in the 1990s as relational databases began incorporating object-oriented features while maintaining their relational core. The result is a hybrid: the structured, queryable, ACID-compliant relational foundation with the ability to store complex types, user-defined types, and inheritance hierarchies.
PostgreSQL is the most complete implementation of the object-relational model available today. It supports user-defined base types, composite types (struct-like column types), array columns, JSON/JSONB columns with indexing, inheritance between tables, and user-defined functions written in multiple languages. Oracle also has extensive object-relational features.
-- 1. Composite type (structured column type)
CREATE TYPE address_t AS (
street VARCHAR(200),
city VARCHAR(100),
state VARCHAR(50),
pincode CHAR(6)
);
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
name VARCHAR(100),
address address_t, -- structured column — not just a string
phone_numbers TEXT[] -- array column — multiple phones natively
);
-- Insert with composite type
INSERT INTO customers (name, address, phone_numbers)
VALUES (
'Rahul Sharma',
ROW('123 MG Road', 'Bengaluru', 'Karnataka', '560001'),
ARRAY['98765-43210', '87654-32109']
);
-- Query into composite type's components
SELECT name, (address).city, (address).pincode
FROM customers
WHERE (address).city = 'Bengaluru';
-- Query arrays
SELECT name FROM customers
WHERE '98765-43210' = ANY(phone_numbers);
-- 2. JSONB column with indexing — semi-structured data in relational DB
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(200),
category VARCHAR(100),
attributes JSONB -- flexible schema for product-specific attributes
);
-- Zomato restaurant: attributes vary by restaurant type
INSERT INTO products (name, category, attributes) VALUES
('Butter Chicken', 'Main Course',
'{"spice_level": "medium", "is_vegetarian": false, "prep_time_mins": 20, "allergens": ["dairy"]}'),
('Masala Dosa', 'Breakfast',
'{"spice_level": "mild", "is_vegetarian": true, "accompaniments": ["sambar", "chutney"], "calories": 350}');
-- GIN index for fast JSONB queries
CREATE INDEX idx_product_attrs ON products USING GIN (attributes);
-- Query JSONB — fully indexed
SELECT name FROM products
WHERE attributes @> '{"is_vegetarian": true}' -- contains this key-value
AND (attributes->>'spice_level') = 'mild';The object-relational model is why PostgreSQL is the most popular database for complex applications in 2026. It gives you SQL, ACID transactions, and mature query optimisation — while also giving you the flexibility to model complex, varying data structures without leaving the relational ecosystem. Swiggy, Razorpay, and PhonePe run on PostgreSQL precisely because it handles both the structured financial data (relational) and the flexible product/menu data (JSONB) in a single system.
The Document Model — When Your Data Is Already a Document
The document model emerged directly from the pain of internet-scale web applications in the mid-2000s. Companies like Google, Amazon, Facebook, and LinkedIn were building systems with user-generated content, flexible schemas, and requirements for horizontal scale across hundreds of servers that relational databases — designed in an era of single-server systems — struggled to meet elegantly.
The document model stores data as self-contained documents — typically JSON or BSON (Binary JSON) format. A document is analogous to a row in a relational table, but with two critical differences: a document can contain nested sub-documents and arrays (eliminating the need for JOIN), and documents in the same collection are not required to have the same structure (schema flexibility).
The Document Model's Core Philosophy — Colocation
The relational model's approach to related data: normalise it into separate tables, then reconstruct it at query time using JOIN. The document model's approach: store everything that belongs together in one place. This is called colocation.
Colocation is a deliberate trade-off: you accept some data duplication in exchange for read performance. If 95% of the time you read a restaurant's full information (name, location, hours, menu, ratings), it is faster to retrieve it as one document than to execute 5 JOIN operations across 5 tables. The document model optimises for the common-case read pattern.
/* ─── RELATIONAL MODEL ─── 5 tables, 5 JOINs needed ─── */
restaurants (restaurant_id, name, address, city, rating, opening_time, closing_time)
menu_categories (category_id, restaurant_id FK, name, display_order)
menu_items (item_id, category_id FK, name, description, price, is_veg, prep_time)
item_images (image_id, item_id FK, url, is_primary)
item_allergens (item_id FK, allergen)
-- Query to get full restaurant info for display:
SELECT r.name, r.address, mc.name as category,
mi.name as item, mi.price, mi.is_veg,
ii.url as image, ia.allergen
FROM restaurants r
JOIN menu_categories mc ON r.restaurant_id = mc.restaurant_id
JOIN menu_items mi ON mc.category_id = mi.category_id
LEFT JOIN item_images ii ON mi.item_id = ii.item_id AND ii.is_primary = true
LEFT JOIN item_allergens ia ON mi.item_id = ia.item_id
WHERE r.restaurant_id = 'R001';
-- 5 JOINs. Multiple table reads. Complex query. Potentially slow at scale.
/* ─── DOCUMENT MODEL ─── 1 collection, 1 read ─── */
// One MongoDB document — entire restaurant in one place
{
"_id": "R001",
"name": "Biryani House",
"address": { "street": "45 Brigade Road", "city": "Bengaluru", "pincode": "560001" },
"rating": 4.6,
"hours": { "open": "11:00", "close": "23:00" },
"menu": [
{
"category": "Biryani",
"items": [
{
"name": "Chicken Biryani",
"price": 280,
"is_veg": false,
"prep_time_mins": 25,
"images": ["https://cdn.example.com/img/cb1.jpg"],
"allergens": ["gluten", "dairy"]
},
{
"name": "Veg Biryani",
"price": 220,
"is_veg": true,
"prep_time_mins": 20,
"images": ["https://cdn.example.com/img/vb1.jpg"],
"allergens": []
}
]
}
]
}
// Query: db.restaurants.findOne({ "_id": "R001" })
// One network round-trip. One disk read (if cached). Full restaurant returned.When the Document Model Struggles — The JOIN Problem
The document model's strength (colocation) becomes a weakness when you need to query across documents in ways that were not anticipated when the schema was designed. The relational model can answer any question about any combination of attributes. The document model can efficiently answer questions about attributes within a single document — but cross-document queries require either denormalisation (storing the same data in multiple documents) or application-level joins (fetching multiple documents and correlating them in code).
// Relational: "Find all restaurants in Bengaluru that serve veg biryani under ₹200"
// Clean single SQL query
SELECT DISTINCT r.name, r.rating
FROM restaurants r
JOIN menu_categories mc ON r.restaurant_id = mc.restaurant_id
JOIN menu_items mi ON mc.category_id = mi.category_id
WHERE r.city = 'Bengaluru'
AND mc.name = 'Biryani'
AND mi.is_veg = true
AND mi.price < 200;
// MongoDB: same query
db.restaurants.find({
"address.city": "Bengaluru",
"menu": {
$elemMatch: {
"category": "Biryani",
"items": {
$elemMatch: {
"is_veg": true,
"price": { $lt: 200 }
}
}
}
}
}, { name: 1, rating: 1 });
// MongoDB can do this — but it gets complex fast.
// Harder to read, harder to debug, harder to index optimally.
// Now: "For each order placed in the last 7 days, show the customer's name,
// the restaurant name, and total items ordered"
// This requires customers + orders + restaurants to be joined.
// In MongoDB, this requires $lookup (server-side join) which is
// significantly more complex and often slower than relational JOIN.
// This is the query the document model was NOT designed for.Schema Flexibility — Gift and Curse
Document databases have no enforced schema — every document in a collection can have different fields. This is liberating during rapid development: you can add new fields to new documents without migrating existing ones. It is dangerous at production scale: without discipline, you end up with documents in the same collection having 15 different slightly incompatible shapes, and application code becomes a maze of null checks and field existence tests.
Modern MongoDB (since version 3.6) supports schema validation — JSON Schema rules that documents must satisfy before being inserted. This is schema-on-write, similar to relational database constraints. Most mature MongoDB deployments use schema validation heavily. In other words, they re-introduce schema rigidity by choice — because the production pain of fully schemaless data is too high.
The Key-Value Model — The Simplest Model, The Fastest Access
The key-value model is, by design, the simplest data model that can meaningfully be called a database. The entire model consists of one concept: a key that uniquely identifies a record, and a value associated with that key. The database makes no assumptions about the structure of the value — it is treated as an opaque blob (in simple key-value stores) or as a typed data structure (in Redis).
This simplicity is not a limitation — it is the source of the key-value model's primary virtue: speed. With no schema to enforce, no relationships to validate, no indexes to maintain on value contents, and no query parsing (just a direct hash table lookup), key-value operations execute in microseconds. Redis, the most widely used key-value store, serves millions of operations per second from a single server.
Redis — Beyond Simple Key-Value
Amazon's Dynamo paper (2007) popularised the key-value model for distributed systems. Redis (2009) took the model further by defining a rich set of value types that can be operated on atomically at the server — making it far more powerful than a simple GET/SET store.
# ─── STRING ─── The fundamental type
SET user:session:abc123 "{"user_id": 1001, "role": "customer", "cart_id": "C789"}"
GET user:session:abc123
EXPIRE user:session:abc123 3600 # Auto-delete after 1 hour (session expiry)
# Use case: session storage, OTP codes, feature flags, short-term caches
SET otp:9876543210 "284731"
EXPIRE otp:9876543210 300 # OTP expires in 5 minutes
# ─── COUNTER ─── Atomic increment
INCR page_views:blog:dbms-introduction # Thread-safe counter
INCRBY api:rate_limit:user:1001 1 # Rate limiting — incr per request
GET api:rate_limit:user:1001 # Check current request count
# ─── HASH ─── Object storage without serialisation
HSET user:1001 name "Rahul Sharma" city "Bengaluru" tier "gold"
HGET user:1001 name # Returns: "Rahul Sharma"
HGETALL user:1001 # Returns all fields
HINCRBY user:1001 order_count 1 # Atomic field increment
# Use case: user profiles, product metadata, any object with many fields
# ─── SORTED SET ─── Leaderboard
ZADD game:leaderboard 9842 "player:Rahul"
ZADD game:leaderboard 8731 "player:Priya"
ZADD game:leaderboard 9156 "player:Arjun"
ZREVRANGE game:leaderboard 0 9 WITHSCORES # Top 10 players, sorted by score
ZRANK game:leaderboard "player:Priya" # Priya's rank
# ─── LIST ─── Message queue / Recent activity
LPUSH notifications:user:1001 "Your order #ORD-4521 was delivered"
RPOP notifications:user:1001 # Consume notification
LRANGE notifications:user:1001 0 9 # Last 10 notifications
# ─── SET ─── Unique collections
SADD online_users 1001 1002 1003
SISMEMBER online_users 1001 # Is user 1001 online? O(1) check
SUNION online_users:server1 online_users:server2 # Union across serversWhy Every Large Application Uses Redis
Redis sits in front of your primary relational database as a cache layer. The architecture is: application first checks Redis; if the data is there (cache hit), return it immediately (microseconds). If not (cache miss), query the relational database (milliseconds), store the result in Redis, return it. For frequently-read data — user profiles, product information, configuration values, search results — the cache hit rate in a well-tuned system is 80–95%. This means 80–95% of reads never touch the database at all. At Swiggy's scale (10 million orders per day), eliminating even 80% of database reads is the difference between a functioning system and one that requires 5x the database infrastructure.
# Python (Flask + Redis + PostgreSQL)
def get_restaurant(restaurant_id: str):
cache_key = f"restaurant:{restaurant_id}"
# Step 1: Check Redis cache
cached = redis_client.get(cache_key)
if cached:
return json.loads(cached) # Cache hit — microseconds
# Step 2: Cache miss — query PostgreSQL
restaurant = db.query(
"SELECT * FROM restaurants WHERE id = %s",
[restaurant_id]
)
# Step 3: Store in Redis with 5-minute TTL
redis_client.setex(
cache_key,
300, # 300 seconds = 5 minutes
json.dumps(restaurant)
)
return restaurant # Milliseconds on first call, microseconds afterThe Column-Family Model — Designed for Petabyte-Scale Writes
Google's Bigtable paper (2006) and Amazon's Dynamo paper (2007) described internal systems built to handle data volumes that no existing database could manage. Apache HBase (2008, based on Bigtable) and Apache Cassandra (2008, combining Bigtable's data model with Dynamo's distributed architecture) brought these ideas to the open-source world.
The column-family model is often misunderstood because its name sounds like it is simply "column-oriented storage" — a performance optimisation used by analytical databases like DuckDB and ClickHouse. It is not. The column-family model is a fundamentally different data model.
Understanding the Column-Family Structure
In the relational model, a row has a fixed set of columns — every row in the table has the same columns, even if some are NULL. In the column-family model, rows can have completely different sets of columns. Columns are grouped into column families — predefined at schema creation time. Within a column family, rows can have any number of column qualifiers (the actual column names), and these can differ from row to row.
// Cassandra table — user activity with time-series data
// Each row is a user's activity record for a specific time bucket
Table: user_activity
Partition key: (user_id, date) -- determines which server stores this row
Clustering key: event_time -- within a partition, rows sorted by time
Row for user 1001, date 2024-03-15:
[user_id: 1001, date: 2024-03-15]
event_time: 09:14:23 | event_type: "login" | device: "Android"
event_time: 09:15:01 | event_type: "view" | page: "home" | session: "S892"
event_time: 09:16:44 | event_type: "search" | query: "biryani" | results: 47
event_time: 09:17:12 | event_type: "view" | page: "restaurant/R001"
event_time: 09:18:55 | event_type: "order_start"| cart_id: "C341"
event_time: 09:21:03 | event_type: "payment" | amount: 280 | method: "UPI"
// Notice: different events have different columns (extra fields).
// "login" has "device". "search" has "query" and "results".
// "payment" has "amount" and "method". This is legal in column-family model.
// Query: get all events for user 1001 on 2024-03-15
SELECT * FROM user_activity
WHERE user_id = 1001 AND date = '2024-03-15';
-- Returns all 6 rows above. One partition = one server read.
// Query: get events between specific times
SELECT * FROM user_activity
WHERE user_id = 1001 AND date = '2024-03-15'
AND event_time >= '09:15:00' AND event_time <= '09:20:00';
-- Efficient: clustering key is sorted, so this is a range scan within the partition.Cassandra's Distributed Architecture — Why It Scales
Cassandra's write performance comes from its architecture: no master node. Every node in a Cassandra cluster is equal. A write request can be handled by any node. The node receiving the request (the coordinator) uses a consistent hashing algorithm to determine which nodes should store the data, forwards the write to those nodes, and returns success once enough nodes have acknowledged (the number is configurable).
This means Cassandra scales linearly: double the number of nodes, roughly double the write throughput. Netflix uses Cassandra to store viewing history for 280+ million subscribers — approximately 1 trillion write operations per day. No relational database can handle that write volume without heroic infrastructure and complex sharding.
Cassandra's biggest gotcha for developers coming from a relational background: you must design your data model around your queries — not around your entities. In a relational database, you model entities (customers, orders, products), create tables, and then JOIN them in any query pattern. In Cassandra, you model your queries first, then design a table that serves exactly that query pattern. If you need to query the same data in two different patterns, you typically create two separate tables with the same data stored in two different arrangements. Data duplication is not just acceptable — it is expected and designed for. This is a radical inversion of relational normalisation philosophy.
The Graph Model — When Relationships Are More Important Than Data
In the relational model, relationships between entities are represented by storing foreign key values and joining tables at query time. Join operations are computed dynamically — the database searches for matching values across two sets of rows. As the depth of relationships increases (friends of friends of friends), the join operations multiply and the performance degrades dramatically.
The graph model takes a fundamentally different approach: store relationships as first-class data structures — actual edges in the graph, with their own properties and direct pointers to the nodes they connect. Traversing a relationship in a graph database is a direct pointer follow — O(1) per hop — rather than a set intersection operation. For highly connected data, this difference is the difference between a query that returns in 50 milliseconds and one that times out.
Graph Model Structure — Nodes, Edges, Properties
A node is an entity in the graph — a person, a product, a city, a transaction. Nodes have labels (like a type tag) and properties (key-value attributes). A single node can have multiple labels: a person might be labelled both :Customer and :Employee.
An edge is a directed, named relationship between two nodes. Every edge has a type (FRIENDS_WITH, PURCHASED, LIVES_IN, MANAGES) and can also have properties (since, weight, amount). Edges are stored as direct pointers — traversal is a pointer follow, not a search.
Both nodes and edges can have properties — arbitrary key-value pairs. A PURCHASED edge can have a purchase_date and amount property. Properties make the graph model extremely expressive without requiring additional tables.
Labels categorise nodes and allow efficient filtering. A node labelled :Product AND :Electronics can be found by either label. Labels are like light-weight table membership — a node can belong to multiple categories simultaneously.
Cypher — The Declarative Graph Query Language
Neo4j's Cypher query language is designed to express graph patterns in a way that visually resembles the graph structure itself. Nodes are represented as(n), relationships as -[r]->. This visual correspondence makes Cypher queries far more readable than their equivalent SQL for graph-shaped problems.
-- SQL: Find all users who are friends-of-friends-of-friends with user 1001
-- (3-hop traversal in relational model)
SELECT DISTINCT u3.user_id, u3.name
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
JOIN friendships f3 ON f2.friend_id = f3.user_id
JOIN users u3 ON f3.friend_id = u3.user_id
WHERE f1.user_id = 1001
AND f3.friend_id != 1001
AND f3.friend_id NOT IN (
SELECT friend_id FROM friendships WHERE user_id = 1001
);
-- 3 self-joins on a 500M-row friendships table.
-- On LinkedIn-scale data, this query can take MINUTES.
-- Cypher: Same query — and it reads like plain English
MATCH (me:User {user_id: 1001})-[:FRIENDS_WITH*3]-(recommendation:User)
WHERE NOT (me)-[:FRIENDS_WITH]-(recommendation)
AND recommendation.user_id <> 1001
RETURN DISTINCT recommendation.user_id, recommendation.name;
-- Neo4j traverses the graph directly.
-- On the same data, this query takes MILLISECONDS.
-- The *3 means: traverse FRIENDS_WITH edges exactly 3 hops.
-- Fraud detection: find circular payment patterns (signs of money laundering)
MATCH path = (a:Account)-[:TRANSFERRED_TO*4..6]-(a)
WHERE ALL(tx IN relationships(path) WHERE tx.amount > 50000)
RETURN path;
-- In SQL: 4-6 self-joins on a transactions table. Possibly intractable.
-- In Cypher: efficient graph traversal returning suspicious circular paths.Real-World Graph Database Use Cases
Every recommendation ("Rahul knows Priya, who knows Arjun, who knows you") is a graph traversal. LinkedIn's entire connection network — 900+ million members, billions of connections — is stored in a graph database. A relational JOIN at this scale for 2-hop recommendations would be unusably slow.
Fraudulent behaviour often creates unusual graph patterns — a single entity connected to many accounts, circular fund transfers, unusually dense connection clusters. These patterns are trivial to detect with graph queries and practically impossible to detect with relational queries at transaction volume.
"Customers who bought X also bought Y and Z" is a graph problem: find products that are connected via shared-purchase edges to the products the user has purchased, weighted by frequency. Graph databases store these purchase-connection graphs and traverse them in real time.
A city's road network is literally a graph — intersections are nodes, roads are edges, distances and traffic data are edge properties. Graph databases with specialised shortest-path algorithms (Dijkstra, A*) are the natural tool for routing and ETA calculation.
Choosing the Right Data Model — A Complete Decision Framework
In real engineering, you rarely choose a single data model for an entire system. Modern architectures use multiple data models simultaneously — each selected for the specific access pattern and data type it handles best. This is called polyglot persistence: using multiple database types in a single system, each serving its own purpose.
| Model | Data Shape | Query Style | Scales | ACID? | Best For | Avoid When |
|---|---|---|---|---|---|---|
| Relational | Structured, tabular | SQL — declarative, flexible | Vertically well, horizontal hard | ✓ Full ACID | Financial data, ERP, any structured business data | Deep nesting, web-scale writes, highly variable schema |
| Document | Nested JSON documents | Pattern matching on fields | Horizontally well | ✓ Single-doc ACID | Product catalogs, content, user profiles, rapidly evolving schema | Many cross-document joins, strict consistency |
| Key-Value | Opaque blobs/typed values | GET/SET by key only | Extremely well | ✓ Single-key ops | Caching, sessions, counters, queues, leaderboards | Complex queries, relations between data |
| Column-Family | Wide rows, sparse columns | Partition key + clustering key | Linearly at petabyte scale | ✕ Eventual consistency by default | Time-series, write-heavy, IoT, activity logging | Ad-hoc queries, complex aggregations, joins |
| Graph | Nodes + edges + properties | Pattern matching traversal | Node depth well, breadth medium | ✓ Full ACID (Neo4j) | Social networks, fraud detection, recommendations, routing | Simple tabular data, non-connected data, bulk aggregations |
| Hierarchical | Tree (parent-child) | Navigational / procedural | Single server | ✕ Varies | File systems, XML, bill of materials, org charts | Many-to-many relationships, ad-hoc queries |
The Polyglot Persistence Architecture — A Real Example
Let's design the data architecture for a comprehensive e-commerce platform (think Flipkart) using the right model for each concern:
The Database Selection Meeting — A Real Engineering Discussion
Architecture decisions involving data models happen most visibly in two situations: when a new system is being built from scratch, and when an existing system is hitting the limits of its current data model. Here is what both conversations sound like.
"We're building a peer-to-peer lending platform. Lenders create loan offers. Borrowers apply. We need to match them, track repayments, and calculate risk scores. What database should we use?"
"MongoDB. It's more scalable than SQL and we don't know the full schema yet."
"That would be a serious mistake. Loan data — offer amounts, repayment schedules, interest calculations, transaction records — is exactly the kind of data that requires ACID transactions. If a repayment is processed and the database half-commits, some borrowers will be marked as paid without money actually being transferred. MongoDB's single-document atomicity is insufficient for cross-document financial transactions."
"PostgreSQL for the core financial data. The schema is actually quite clear: users, loan_offers, loan_applications, loan_agreements, repayment_schedule, repayment_transactions. All structured, all interconnected, all requiring strict integrity. We might add Redis later for caching credit scores and session data, but start with PostgreSQL."
"But what about schema flexibility? We might need to add fields as the product evolves."
"PostgreSQL's ALTER TABLE ADD COLUMN is non-blocking for adding nullable columns — it's a metadata-only operation in modern PostgreSQL, it doesn't lock the table. Schema evolution in PostgreSQL is much less painful than people think. And if we genuinely have flexible fields for loan risk metadata, we use a JSONB column for exactly that — structured data in PostgreSQL, flexible data in a JSONB column, same database."
An e-commerce company's product page is taking 4 seconds to load. A senior engineer runs EXPLAIN ANALYZE and finds the culprit:
-- Product detail page query — currently 800ms at 50K concurrent users
SELECT
p.name, p.description, p.base_price,
b.name AS brand,
array_agg(DISTINCT c.name) AS categories,
array_agg(DISTINCT pi.url ORDER BY pi.sort_order) AS images,
array_agg(DISTINCT pa.attribute_name || ': ' || pa.attribute_value) AS attributes,
json_agg(DISTINCT jsonb_build_object(
'sku', pv.sku, 'size', pv.size, 'color', pv.color,
'price', pv.price, 'stock', pv.stock_count
)) AS variants
FROM products p
JOIN brands b ON p.brand_id = b.brand_id
LEFT JOIN product_categories pc ON p.product_id = pc.product_id
LEFT JOIN categories c ON pc.category_id = c.category_id
LEFT JOIN product_images pi ON p.product_id = pi.product_id
LEFT JOIN product_attributes pa ON p.product_id = pa.product_id
LEFT JOIN product_variants pv ON p.product_id = pv.product_id
WHERE p.product_id = 'P-8821'
GROUP BY p.product_id, p.name, p.description, p.base_price, b.name;The diagnosis: this product data is essentially a document — it's always read as a complete unit, never partially. Seven JOINs at this traffic volume is the wrong model for this access pattern. The decision:
// 1. Migrate product catalog to MongoDB
// (keep orders, customers, payments in PostgreSQL)
// New MongoDB document — same data, one read
const product = await db.collection('products').findOne(
{ product_id: 'P-8821' },
{ projection: { name:1, description:1, brand:1, categories:1, images:1, attributes:1, variants:1 }}
);
// Query time: 12ms (was 800ms)
// Zero JOINs. One document. One disk read.
// 2. Add Redis cache layer
// Product data changes rarely — cache for 30 minutes
const cacheKey = `product:${productId}`;
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached); // 0.2ms
const product = await mongo.findOne(...);
await redis.setex(cacheKey, 1800, JSON.stringify(product));
return product;
// After cache warmup: 0.2ms per product page (was 800ms)
// 4000x improvement. No application logic changes. No new features lost.This is what data model knowledge enables: recognising when a data model mismatch is causing performance problems, and knowing which alternative model fits the access pattern better. A developer who knows only SQL would keep adding indexes to the 7-join query. A developer who understands data models correctly diagnoses the root cause and applies the right tool.
🎯 Key Takeaways
- ✓A data model is a formal framework defining structure (how data is organised), operations (what you can do with data), and constraints (what rules must hold). Every DBMS is built on a specific data model.
- ✓The hierarchical model (1960s, IBM IMS) organises data as trees — one parent per child. Fast for known-path queries, but cannot represent many-to-many relationships without duplication. Still running in banks worldwide.
- ✓The network model (1969, CODASYL) allowed multiple parents per record — solving many-to-many. Still navigational and procedural — no declarative query language. Direct ancestor of modern graph databases.
- ✓The relational model (1970, Codd) built on set theory and predicate logic — the most mathematically rigorous data model ever created. Tables, relational algebra, SQL, and ACID transactions. Dominant for 50+ years because of its formal foundation, physical-logical separation, and declarative SQL.
- ✓The object-oriented model (1980s–90s) stored objects directly — solving the impedance mismatch. Failed to replace relational because relational extended itself (object-relational model, PostgreSQL JSONB), and the SQL ecosystem was too entrenched.
- ✓Document model (MongoDB, 2007+) stores self-contained JSON documents — optimised for colocation (read all related data in one operation). Best for flexible schema and nested data. Weak at cross-document joins and strict multi-record consistency.
- ✓Key-value model (Redis) is the simplest and fastest model — direct hash table lookup in microseconds. Used as a cache layer in front of every primary database in high-traffic systems. Not a primary database.
- ✓Column-family model (Cassandra, HBase) designed for petabyte-scale write workloads with no single point of failure. Requires query-first design — data model must be designed around access patterns, not entities. Netflix, IoT, time-series.
- ✓Graph model (Neo4j) makes relationships first-class — direct pointer traversal instead of JOIN computation. Transformative for connected data (social networks, fraud, recommendations, routing). Poor for non-connected tabular data.
- ✓Polyglot persistence is the correct modern architecture: PostgreSQL for structured financial/transactional data, MongoDB or JSONB for flexible nested data, Redis for caching and sessions, Cassandra for write-heavy time-series, graph databases for connection-heavy workloads. No single model wins everything.
Discussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.