SQL vs NoSQL — The Real Difference
What each one trades off, four NoSQL families from first principles, and how to choose.
The "SQL vs NoSQL" Framing Is Wrong
The phrase "SQL vs NoSQL" implies a competition — one wins, one loses, you pick a side. This framing has misled entire teams into wrong architectural decisions for over a decade. Companies rewrote perfectly fine relational databases into MongoDB because NoSQL was "web-scale." Others kept cramming document data into rigid relational schemas because SQL felt familiar.
The actual question is never "SQL or NoSQL?" The actual question is: what are the access patterns of this data, and which storage model serves those patterns best? SQL databases and NoSQL databases are not competitors — they solve different problems. Many production systems at Indian tech companies use both simultaneously, each handling the workload it is suited for.
As a data engineer, you are a consumer of both types of systems — you ingest from them, you understand their data models, and you know what can and cannot be efficiently extracted from each. Knowing the trade-offs deeply means you can design ingestion strategies that work with each system's strengths rather than fighting against its limitations.
Relational Databases — What Makes Them Relational
The relational model was invented by Edgar Codd at IBM in 1970. Its central insight: organise data into relations (tables of rows and columns) and use a declarative query language (SQL) to retrieve any combination of data across any number of tables at query time. You do not need to know in advance how data will be queried — the relational model handles arbitrary queries over any combination of columns and tables.
The core properties of relational databases
The relational model's weakness — why NoSQL exists
The relational model's greatest strength — flexible querying through arbitrary JOINs — is also the source of its scaling limitations. A JOIN requires the database to correlate rows across multiple tables, which means sorting both sides (merge join), building a hash table (hash join), or probing an index once per outer row (nested-loop join). At internet scale — billions of rows across dozens of tables with thousands of concurrent users — these operations become expensive enough that the relational model alone cannot keep up.
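The cost model behind this is easy to see in miniature. Below is a minimal hash join over two in-memory tables — the `users`/`orders` rows and the `hash_join` helper are purely illustrative, not any engine's actual implementation:

```python
# Minimal hash join: build a hash table on the smaller side, then
# probe it once per row of the larger side. Total work is O(N + M),
# versus O(N * M) for a naive nested loop without an index.
def hash_join(left, right, key):
    index = {}
    for row in left:                       # build phase
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in right:                      # probe phase
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

users = [{"user_id": 1, "name": "Asha"}, {"user_id": 2, "name": "Ravi"}]
orders = [{"user_id": 1, "amount": 499}, {"user_id": 1, "amount": 999}]
print(hash_join(users, orders, "user_id"))
```

Even at O(N + M), the build side must fit in memory (or spill to disk), and every join in a query repeats this work over its intermediate results — which is exactly where single-node JOINs start to hurt at billions of rows.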
The other limitation: rigid schema. When your data is genuinely variable in structure — a product catalogue where each product category has completely different attributes — forcing everything into fixed columns produces either enormous NULL-filled tables or complex EAV (Entity-Attribute-Value) anti-patterns that are painful to query and maintain.
Problem: E-commerce product catalogue
Electronics: RAM, storage, display_size, processor, battery_life
Clothing: size, colour, fabric, gender, sleeve_length, care_instructions
Books: author, isbn, pages, publisher, genre, language, edition
Food: weight, expiry_date, allergens, nutritional_info, storage_temp
Forcing this into one relational table:
CREATE TABLE products (
id INT, name VARCHAR, price DECIMAL, category VARCHAR,
-- Electronics
ram VARCHAR, storage VARCHAR, display_size DECIMAL, processor VARCHAR,
battery_life INT,
-- Clothing
size VARCHAR, colour VARCHAR, fabric VARCHAR, gender VARCHAR,
sleeve_length VARCHAR, care_instructions TEXT,
-- Books
author VARCHAR, isbn VARCHAR, pages INT, publisher VARCHAR,
genre VARCHAR, language VARCHAR, edition VARCHAR,
-- Food
weight DECIMAL, expiry_date DATE, allergens TEXT,
nutritional_info TEXT, storage_temp VARCHAR
-- ... and 50 more product-specific columns
);
Result:
A clothing row has 40+ NULL columns for electronics/books/food fields
Adding a new category requires ALTER TABLE (schema migration)
Table has 80+ columns, 75% NULL for any given row
Document database solution (MongoDB):
Electronics product: {"id":1, "name":"iPhone 15", "ram":"6GB", "storage":"128GB"}
Clothing product: {"id":2, "name":"Cotton Kurta", "size":"M", "fabric":"cotton"}
Each document contains only the fields relevant to it.
New categories need no schema migration — just add new documents.
NoSQL Is Not One Thing — It Is Four Completely Different Things
"NoSQL" is an umbrella term that groups four completely different database designs under one label. This is the core source of confusion. MongoDB, Redis, Cassandra, and Neo4j are all "NoSQL" — but they have nothing in common beyond not being relational. Each was built for a specific access pattern that relational databases handle poorly. Understanding each family separately is the only way to make correct choices.
CAP Theorem — Why Every Distributed Database Makes a Trade-off
The CAP theorem, stated by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002, makes a precise claim about distributed databases: a distributed system can provide at most two of three guarantees simultaneously — Consistency (every read sees the latest write), Availability (every request receives a response), and Partition tolerance (the system keeps operating when the network splits). Understanding it explains why different databases behave differently during network failures — and why that behaviour matters for data pipelines.
Because partition tolerance is not optional in real distributed systems (networks do fail), the practical trade-off is between C and A: CP systems choose consistency over availability during a network partition — they return errors rather than stale data. AP systems choose availability over consistency during a partition — they keep responding but may return stale or inconsistent data.
| Database | Type | Partition behaviour | Why this matters to DEs |
|---|---|---|---|
| PostgreSQL | CP | During a network partition, the primary stops accepting writes rather than risk divergence with replicas | Your CDC pipeline may stall during primary failover — build retry logic |
| MongoDB (default) | CP | Reads from primary by default — no stale reads. Can be configured for AP with eventual consistency reads from secondaries | Change streams from primary are consistent. Secondary reads for bulk extraction may be slightly stale |
| Cassandra | AP | Continues accepting reads and writes during partition. Nodes may have different versions of the same row — reconciled later via Last-Write-Wins | Data extracted from Cassandra may have duplicates or slightly inconsistent values — always deduplicate on order_id or timestamp in Silver layer |
| DynamoDB | AP (default) / CP (opt-in) | Eventually consistent reads by default (stale possible). Strongly consistent reads available at 2× cost | For DE pipelines, always use strongly consistent reads to avoid processing stale records |
| Redis Cluster | AP | Continues serving from available nodes. Keys on failed nodes unavailable until recovery | Cache misses during partition cause DB fallback — pipeline may see temporary slowdown |
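The Cassandra row above is the one that most often bites data engineers: an AP store reconciled via Last-Write-Wins can hand your extract multiple versions of the same row. A minimal Silver-layer dedup sketch — the `order_id`/`updated_at` field names are illustrative:

```python
# Last-write-wins dedup for records extracted from an AP store:
# for each business key, keep only the row with the latest timestamp.
def dedupe_last_write_wins(rows, key="order_id", ts="updated_at"):
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return list(latest.values())

extracted = [
    {"order_id": "A1", "status": "PLACED",  "updated_at": "2025-01-01T10:00"},
    {"order_id": "A1", "status": "SHIPPED", "updated_at": "2025-01-01T12:00"},
    {"order_id": "A1", "status": "PLACED",  "updated_at": "2025-01-01T10:00"},  # duplicate
]
print(dedupe_last_write_wins(extracted))
```

The same pattern expressed in Spark or dbt (window over the key, order by timestamp descending, keep row 1) belongs in every pipeline that reads from Cassandra or DynamoDB.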
How to Choose the Right Database for Any Use Case
The correct database choice is always determined by answering three questions in order. Never start with "which database is most popular?" or "which one does our current stack use?" Start with the data and access pattern.
QUESTION 1: What is the primary access pattern?
"Give me this specific entity by its ID"
→ Key-value (Redis) if speed and simplicity are priorities
→ Document (MongoDB) if the entity has variable structure
"Give me this document with all its related data in one read"
→ Document store (MongoDB, Firestore)
"Give me all records matching these criteria with arbitrary filters"
→ Relational (PostgreSQL, MySQL)
→ OLAP warehouse (Snowflake, BigQuery) for analytical queries
"Write millions of events per second, query by a known key"
→ Column-family (Cassandra, DynamoDB)
→ Time-series (InfluxDB, TimescaleDB) for temporal data
"How are these entities connected? Find N-hop relationships"
→ Graph database (Neo4j, Neptune)
QUESTION 2: How strong must the consistency guarantees be?
"Financial transactions — partial updates are catastrophic"
→ Must have ACID. Relational database.
"Product catalogue — slight staleness is acceptable"
→ AP NoSQL acceptable. MongoDB, DynamoDB.
"User session data — stale session is acceptable, availability must be high"
→ AP is fine. Redis.
"Delivery GPS events — eventual consistency acceptable"
→ AP acceptable. Cassandra.
QUESTION 3: What scale is genuinely needed?
< 10M rows, standard read/write mix
→ PostgreSQL handles this comfortably. No NoSQL needed.
> 100M rows, write-heavy, known access pattern
→ Column-family or key-value if the pattern fits.
> 1B rows, analytical queries
→ Data warehouse (Snowflake/BigQuery), not operational database.
Rule: do not introduce NoSQL complexity until relational cannot
handle the workload. Most applications never reach that scale.
The polyglot persistence pattern — using multiple databases together
At a mature Indian tech company, the right answer is almost never "one database for everything." Different parts of the application have different access patterns and consistency requirements. The polyglot persistence pattern uses the best database for each specific need.
MEESHO (e-commerce platform) — representative polyglot architecture:
PostgreSQL (relational, CP):
→ Orders, payments, settlements, user accounts
→ Needs ACID. Cannot tolerate inconsistency. Normalised.
MongoDB (document, CP):
→ Product catalogue (variable attributes per category)
→ Seller profiles and store configurations
→ Flexible schema — categories change frequently
Redis (key-value, AP):
→ User sessions (logged in / cart contents)
→ API rate limiting per seller per minute
→ Caching product details (avoid DB hit per page load)
→ Real-time inventory counters (INCR/DECR atomic operations)
Cassandra (column-family, AP):
→ User activity events (page views, searches, clicks)
→ Notification delivery logs
→ Write-heavy, partition-key access only needed
Elasticsearch (search index — a fifth type!):
→ Full-text product search ("cotton kurta under 500")
→ Inverted index — not relational, not a traditional NoSQL type
→ Synced from MongoDB product catalogue via pipeline
DATA ENGINEER'S ROLE IN THIS ARCHITECTURE:
→ Ingest from ALL five systems into the data lake
→ Each has a different extraction approach:
PostgreSQL: CDC via Debezium WAL reading
MongoDB: Change streams API
Redis: Periodic snapshots (no built-in CDC)
Cassandra: CDC plugin or Spark Cassandra Connector bulk export
Elasticsearch: Scroll API for bulk export, no CDC
How Data Engineers Ingest From Each Database Type
Every database type has different capabilities and constraints for data extraction. A data engineer who knows only SQL ingestion will be blocked when the team needs data from MongoDB or Cassandra. Here are the practical extraction patterns for each type.
| Database Type | Best ingestion method | Incremental approach | Main challenge |
|---|---|---|---|
| Relational (PostgreSQL) | CDC via Debezium reading WAL logical replication; or JDBC incremental extraction | WAL LSN position (CDC) or WHERE updated_at > last_run | Long transactions block WAL cleanup and stall CDC; must monitor replication lag |
| Document (MongoDB) | MongoDB Change Streams API for real-time; mongodump or Spark connector for bulk | Change stream resume token (survives restarts); or _id / updatedAt field | Schema variation between documents requires schema-on-read handling; nested arrays need exploding |
| Key-Value (Redis) | RDB snapshot file parsing; Redis Streams if events are published there; SCAN + DUMP for selective keys | No built-in CDC; Redis Streams provide append-only event log if application writes to them | In-memory store — no persistent history by default; key expiry means data can disappear before extraction |
| Column-Family (Cassandra) | Spark Cassandra Connector for bulk parallel extraction; Debezium Cassandra CDC connector for streaming | writetime() function for last-written timestamp; CDC connector reads commit log | AP consistency means extracted data may have duplicates from concurrent writes; must deduplicate |
| Graph (Neo4j) | APOC export procedures (JSON/CSV); Bolt protocol streaming; Neo4j to Kafka connector | Transaction log (Enterprise edition); or timestamp properties on nodes/edges | Graph-native queries do not map naturally to tabular format; flattening relationships for DE use requires design decisions |
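The JDBC incremental path in the first row of the table can be sketched as a watermark loop — a hedged illustration in plain Python (the `extract_since` helper and in-memory `rows` stand in for a real `WHERE updated_at > :watermark` query):

```python
# Watermark-based incremental extraction: each run pulls only rows
# changed after the last saved watermark, then advances the watermark.
def extract_since(watermark, source_rows):
    """Return rows changed after `watermark`, plus the new watermark."""
    changed = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

rows = [
    {"id": 1, "updated_at": "2025-03-01"},
    {"id": 2, "updated_at": "2025-03-05"},
    {"id": 3, "updated_at": "2025-03-09"},
]
batch, wm = extract_since("2025-03-04", rows)
print(len(batch), wm)  # picks up rows 2 and 3, watermark advances
```

Note the strictly-greater comparison: rows updated at exactly the watermark are skipped, so late-arriving rows with the same timestamp can be missed — one reason CDC (which tracks log position, not timestamps) is the preferred approach where available.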
Choosing the Wrong Database — A Real Architectural Mistake
A health-tech startup joined an accelerator in 2023. In the excitement of building fast, their backend team chose MongoDB for everything — user accounts, appointments, prescriptions, billing, lab results, doctor notes. "NoSQL is web-scale," someone said. "We might have millions of users someday."
By 2025 they had 80,000 users. Their data team was asked to build a monthly billing report — total revenue by insurance provider, doctor specialty, and city. The query took 45 seconds on MongoDB and required three sequential aggregation pipeline stages across 2 million documents. It regularly timed out.
You are brought in as the data engineer to fix this. Your analysis:
The fundamental problem: MongoDB was the right choice for patient records (variable structure — diabetes patients have different fields than maternity patients) and doctor notes (free-form text). It was the wrong choice for billing data (highly relational: patient → insurance → doctor → service → claim) and appointment scheduling (strong consistency required — two doctors cannot be double-booked).
What you recommend: Migrate billing and appointment data to PostgreSQL, which has proper FOREIGN KEY constraints, fast aggregation on indexed columns, and ACID transactions for booking operations. Keep patient records and doctor notes in MongoDB where the flexible schema genuinely adds value. Build a data pipeline that ingests from both into Snowflake for reporting — a proper separation of operational and analytical concerns.
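The double-booking guarantee mentioned above is the kind of thing a relational database gives you declaratively. A minimal sketch — SQLite stands in for PostgreSQL here, and the table and column names are purely illustrative:

```python
import sqlite3

# One booking per doctor per slot, enforced by the database itself.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE appointments (
        doctor_id INTEGER NOT NULL,
        slot      TEXT    NOT NULL,
        patient   TEXT    NOT NULL,
        UNIQUE (doctor_id, slot)   -- the double-booking guard
    )
""")
conn.execute("INSERT INTO appointments VALUES (7, '2025-06-01 10:00', 'Asha')")
try:
    # A second booking for the same doctor and slot is rejected atomically.
    conn.execute("INSERT INTO appointments VALUES (7, '2025-06-01 10:00', 'Ravi')")
except sqlite3.IntegrityError as exc:
    print("double booking rejected:", exc)
```

Enforcing the same rule in application code over MongoDB documents requires a read-then-write sequence that races under concurrency; the constraint-plus-transaction approach closes that window at the storage layer.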
The result after migration: The monthly billing report query runs in 800 milliseconds in Snowflake. Double-booking incidents disappear with PostgreSQL's transaction semantics. The team can now add new billing queries in minutes. The patient record system stays in MongoDB where flexibility is genuinely needed.
The lesson: MongoDB did not fail. The team used it for the wrong problems. The correct architecture uses the right database for each access pattern — and a data engineer who understands both SQL and NoSQL deeply can recognise and fix these mismatches.
🎯 Key Takeaways
- ✓"SQL vs NoSQL" is the wrong framing. The right question is: what are the access patterns of this data? Relational databases provide flexible querying and strong consistency. NoSQL databases optimise specific access patterns at the cost of flexibility and sometimes consistency.
- ✓There are four completely different NoSQL families, not one: Key-Value (Redis — extreme speed for key lookups), Document (MongoDB — flexible schema for entity-centric data), Column-Family (Cassandra — extreme write throughput with partition-key access), and Graph (Neo4j — fast multi-hop relationship traversal).
- ✓Key-value stores (Redis) provide sub-millisecond lookups by exact key. They have no query language beyond GET/SET. Use for sessions, caching, rate limiting, and real-time counters. Never as a primary persistent data store.
- ✓Document stores (MongoDB) excel at reading and writing complete entities with variable structure. They break down for cross-document aggregations and strongly relational data. Best for product catalogues, user profiles, and CMS content with changing schemas.
- ✓Column-family stores (Cassandra) are designed around specific query patterns, not general entities. Every query must include the partition key. Design one table per query pattern. Best for write-heavy workloads (IoT, event streams) where queries are known in advance.
- ✓Graph databases (Neo4j) store relationships as first-class edges, making multi-hop traversal fast regardless of total graph size. Use for social networks, fraud detection, and recommendation engines. Relational JOIN cost climbs steeply with each additional hop — every hop adds another JOIN over a growing intermediate result — while graph traversal cost depends only on the subgraph actually visited.
- ✓The CAP theorem: distributed databases can guarantee at most two of Consistency, Availability, and Partition Tolerance. Since P is not optional, the real choice is CP (return error rather than stale data) vs AP (return potentially stale data rather than error). Cassandra is AP — always deduplicate its extracts.
- ✓Mature companies use polyglot persistence — PostgreSQL for financial transactions, MongoDB for product catalogues, Redis for sessions and caching, Cassandra for event streams. A data engineer must ingest from all of them.
- ✓Each database type has a different ingestion approach: PostgreSQL via CDC/WAL, MongoDB via Change Streams, Redis via RDB snapshots, Cassandra via Spark Connector or CDC plugin, Neo4j via APOC export. Know at least the first two deeply.
- ✓Most applications never need NoSQL. PostgreSQL handles hundreds of millions of rows comfortably with proper indexing. Do not introduce NoSQL complexity until relational genuinely cannot handle the workload. The wrong database for the access pattern always causes more problems than it solves.