Types of Databases
Relational, Document, Key-Value, Column-Family, Graph, Time-Series — what each one is built for and how Indian companies use them
// Part 01
Why Multiple Types of Databases Exist
When you walk into a hardware store, you do not see one type of screwdriver. You see flathead, Phillips, Torx, hex — each shaped for a specific kind of screw. Using the wrong one strips the head and wastes time. Databases are the same. Relational databases are extraordinarily good at what they do. They are also genuinely wrong for certain problems. The engineers who invented the alternatives covered in this module were not being contrarian — they were solving problems that relational databases could not solve efficiently at scale.
Here is the core tension that drove the creation of every non-relational database type:
Relational databases guarantee consistency and correctness. But guaranteeing consistency in a distributed system — where data is spread across hundreds of servers across multiple data centres — requires coordination between those servers. That coordination takes time. At extreme scale, the time spent coordinating becomes the bottleneck. Some systems need speed so badly that they are willing to trade a degree of consistency for it. That trade-off is what every NoSQL database is fundamentally about.
This module maps every major type of database — what it is built for, where it breaks down, and which Indian companies actually use it. By the end, you will be able to answer the question "which database should we use?" for any system you are ever asked to design.
// Part 02
The CAP Theorem — The Foundation of Every Database Trade-Off
To understand why different databases make different choices, you need to understand the CAP theorem. It is one of the most important concepts in all of distributed systems, and it directly explains every design decision you will see in this module.
The CAP theorem states that a distributed data system can guarantee at most two of the following three properties simultaneously:
- Consistency — every read receives the most recent write (or an error).
- Availability — every request receives a non-error response, even if it may not contain the most recent write.
- Partition Tolerance — the system continues operating despite network failures that cut off communication between nodes.
In any real distributed system, network partitions will happen — servers go down, cables get cut, data centres lose connectivity. So Partition Tolerance is not optional. Every production database must tolerate partitions. This means the real choice is always between C and A — consistency or availability — when a partition occurs.
CP systems (Consistent + Partition Tolerant) choose to return an error or wait rather than return potentially stale data. PostgreSQL and MySQL are CP — during a partition, they will refuse to serve requests rather than risk returning incorrect data. AP systems (Available + Partition Tolerant) always respond, but the response might be slightly stale. Cassandra and CouchDB are AP — they always return a result, but it might not reflect the very latest write from a node that is currently unreachable.
Neither choice is wrong. It depends entirely on what your application cannot tolerate. A banking app cannot tolerate incorrect balance information — CP. A social media feed can tolerate showing a post from 2 seconds ago — AP. Every database type in this module makes one of these two choices.
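The CP-versus-AP choice can be made concrete with a toy sketch. This is a teaching illustration, not a real distributed store: two "replicas" hold a value, a partition flag cuts replication, and the mode decides whether a stale replica answers or refuses.

```python
# Toy sketch of the CP-vs-AP choice during a network partition.
# All class and method names here are invented for illustration.

class Replica:
    def __init__(self):
        self.value = None
        self.version = 0

class ToyCluster:
    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP"
        self.primary = Replica()
        self.secondary = Replica()
        self.partitioned = False

    def write(self, value):
        self.primary.value = value
        self.primary.version += 1
        if not self.partitioned:      # replication only works while the link is up
            self.secondary.value = value
            self.secondary.version = self.primary.version

    def read_from_secondary(self):
        stale = self.secondary.version < self.primary.version
        if self.partitioned and stale and self.mode == "CP":
            raise RuntimeError("unavailable: cannot guarantee latest value")
        return self.secondary.value  # AP: answer anyway, possibly stale

cp = ToyCluster("CP")
cp.write("balance=100")
cp.partitioned = True
cp.write("balance=50")               # the secondary never sees this write
try:
    cp.read_from_secondary()         # CP refuses rather than serve stale data
except RuntimeError as e:
    print(e)

ap = ToyCluster("AP")
ap.write("balance=100")
ap.partitioned = True
ap.write("balance=50")
print(ap.read_from_secondary())      # "balance=100" — stale, but available
```

The banking-versus-feed examples above map directly onto the two branches: the CP cluster raises an error rather than report a wrong balance, the AP cluster serves a feed that is a write behind.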
// Part 03
Relational Databases (RDBMS) — The Default Choice
You already know relational databases from Modules 01 and 02. They store data in tables with rows and columns, enforce relationships through foreign keys, guarantee ACID properties, and are queried with SQL. They are the right choice for the vast majority of business applications — and they have been for 50 years.
When relational databases are the right choice
Use a relational database when your data is structured and well-defined — you know in advance what columns exist and what types they hold. When relationships between entities matter and must be enforced — customers have orders, orders have items. When you need complex queries — multi-table joins, aggregations, subqueries, window functions. When ACID guarantees are non-negotiable — financial transactions, inventory management, booking systems.
When relational databases struggle
An RDBMS starts to struggle when your schema changes frequently and unpredictably — adding new columns to a 500-million-row table takes hours and locks the table. When data is inherently hierarchical or graph-shaped — storing a social network's friend-of-friend relationships in relational tables requires complex recursive queries. When write throughput requirements exceed what a single master node can handle and you need to shard across hundreds of servers — relational databases can be sharded, but it is operationally complex and breaks some SQL features.
| Database | Who uses it in India | Why |
|---|---|---|
| PostgreSQL | Razorpay, CRED, Zerodha, Groww, PhonePe | Open source, ACID compliant, excellent for fintech — handles complex financial queries and JSON payment metadata with JSONB. |
| MySQL | Swiggy, Nykaa, OYO, BookMyShow | Mature, battle-tested at high traffic, excellent read-replica support for consumer apps with millions of concurrent users. |
| MS SQL Server | HDFC Bank, ICICI Bank, Infosys enterprise clients | Enterprise support, Windows ecosystem, deep compliance tooling for RBI-regulated financial institutions. |
| Oracle | TCS, Wipro clients, IRCTC, LIC, government | Legacy enterprise and government. IRCTC runs one of Asia's highest-volume Oracle installations. Expensive but deeply entrenched. |
| SQLite | Every Android and iOS app (device-local storage) | Serverless, zero-config, runs on the device itself. Swiggy, PhonePe, CRED — all store local user data in SQLite on your phone. |
// Part 04
Document Databases — Flexible Schema for Changing Data
A document database stores data as documents — typically JSON or BSON objects — rather than rows in tables. Each document is a self-contained unit that can have any structure. Two documents in the same collection (the NoSQL equivalent of a table) can have completely different fields.
What a document looks like
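Here is an illustrative customer document, expressed as a Python dict since a document is just structured JSON (the field names and values are invented for this example):

```python
import json

# An illustrative customer document. Everything about the customer
# lives in one self-contained unit -- no foreign keys, no JOINs.
customer = {
    "_id": "cust_88421",
    "name": "Aisha Khan",
    "contact": {                       # nested object, not a separate table
        "email": "aisha@example.com",
        "city": "Hyderabad",
    },
    "recent_orders": [                 # embedded array of recent orders
        {"order_id": "ord_9001", "total": 450, "status": "delivered"},
        {"order_id": "ord_9002", "total": 120, "status": "in_transit"},
    ],
    "preferences": ["vegetarian", "no-contact delivery"],  # free-form array
}

print(json.dumps(customer, indent=2))  # the stored form is plain JSON
```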
Notice what is different from the relational model. The customer's contact details are embedded inside the document as a nested object — not in a separate contacts table. Recent orders are embedded as an array. Preferences are a free-form array. There are no foreign keys and no JOINs needed to get this customer's complete profile — it is all in one document, fetched in one read.
The core advantage — read performance for document-shaped data
In a relational database, getting a customer's profile plus their last 5 orders plus their preferences would require JOINs across at least three tables — customers, orders, and preferences. Each JOIN means more disk reads, more coordination, more CPU. In a document database, the same query is a single document fetch — one read, one result. For read-heavy workloads where the access pattern is always "give me everything about this entity," document databases are significantly faster.
The core weakness — no JOINs, limited cross-document consistency
Document databases do not support JOINs. If you need to find all customers who ordered a specific product, you either embed so much data that the document becomes enormous and stale, or you do multiple queries and join them in application code — which is slower and more complex than a SQL JOIN. Transactions across multiple documents in different collections are also either not supported or limited — atomicity within a single document is guaranteed, but cross-document atomicity is not always available.
MongoDB — the dominant document database in India
MongoDB is used by Zomato for restaurant and menu data (menus change constantly — no fixed schema), by Ola for driver and ride metadata, and by many early-stage startups that move fast and cannot afford to define a rigid schema before the product has found its shape. It is also extremely popular for storing event logs, user activity data, and any data where the structure varies per record.
// Part 05
Key-Value Databases — The Fastest Database That Exists
A key-value database is the simplest possible database: every piece of data is stored as a pair — a unique key and a value. The value can be anything: a string, a number, a JSON object, a binary blob. You retrieve data by key. There is no schema, no columns, no relationships, no query language. Just: set(key, value) and get(key).
This radical simplicity is the source of their superpower. Because there is nothing to parse, plan, or coordinate — just a key lookup in memory — key-value databases are extraordinarily fast. Redis, the most popular key-value database, can serve over 1 million operations per second on a single node with sub-millisecond latency. No relational database comes close to this for simple key lookups.
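The entire interface can be sketched in a few lines. This is a toy in-memory store illustrating the set/get model and the TTL expiry that stores like Redis offer — not an implementation of Redis itself:

```python
import time

# Toy key-value store: set(key, value), get(key), optional TTL expiry.
class ToyKV:
    def __init__(self):
        self._data = {}                      # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires = time.monotonic() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]              # lazy expiry on read
            return None
        return value

kv = ToyKV()
kv.set("session:user:42", {"role": "admin"})      # session data, no expiry
kv.set("otp:login:42", "483921", ttl_seconds=120)  # OTP vanishes on its own
print(kv.get("session:user:42"))                   # {'role': 'admin'}
```

A real store does the same lookup in memory on a remote server, which is why a cache hit costs a fraction of a millisecond while a relational query costs parsing, planning, and disk reads.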
What key-value databases are used for
The classic use cases all share the same shape — retrieve a value by its exact key, fast: caching (store the result of an expensive database query and serve it from memory on the next identical request), session storage (login state and permissions checked on every authenticated request), rate limiting (atomic counters with automatic expiry), OTP storage (a short-lived key-value pair by nature), and lightweight pub/sub messaging between services.
Redis — the key-value database every Indian startup uses
Redis (Remote Dictionary Server) stores everything in memory, which is what makes it fast. It also supports persistence — writing snapshots to disk so data survives restarts. Redis is not just a simple key-value store — it supports rich data structures: strings, lists, sets, sorted sets, hashes, streams, and geospatial indexes. This makes it useful for a surprisingly wide range of use cases beyond simple caching.
Almost every Indian tech company of any size runs Redis. Swiggy uses Redis to cache restaurant menus — a menu does not change every second, so serving it from Redis instead of MySQL handles the 50× traffic spike at 7 PM without the database breaking a sweat. Razorpay uses Redis for rate limiting API keys. PhonePe uses it for session management across hundreds of millions of users.
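The rate-limiting pattern is worth seeing concretely. This is a sketch of a fixed-window limiter using a plain dict where production systems use Redis's atomic increment with expiry; the class and key names are invented for illustration:

```python
import time

# Fixed-window rate limiter sketch: at most `limit` requests per API key
# per time window. Production versions use Redis INCR + EXPIRE so the
# counter is shared and atomic across all application servers.
class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}                   # (api_key, window_id) -> count

    def allow(self, api_key, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)  # e.g. one bucket per minute
        bucket = (api_key, window_id)
        count = self.counters.get(bucket, 0) + 1
        self.counters[bucket] = count
        return count <= self.limit

limiter = FixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("key_abc", now=1000.0) for _ in range(4)]
print(results)    # [True, True, True, False]
```

The fourth call in the same window is refused; when the clock crosses into the next window, the counter effectively resets because the bucket key changes.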
// Part 06
Column-Family Databases — Built for Extreme Write Volume
Column-family databases (also called wide-column databases) store data in tables, but the tables work very differently from relational tables. In a column-family database, each row can have a completely different set of columns — and tables can have millions of columns. Data is stored physically by column rather than by row, which makes range scans across specific columns extremely fast.
The dominant column-family database is Apache Cassandra. It was built by Facebook to handle their Inbox search — billions of messages, hundreds of millions of users, write rates that no relational database could sustain. Cassandra is designed from the ground up for massive write throughput and linear horizontal scalability — add more nodes and throughput scales linearly.
How Cassandra is different from PostgreSQL
| Dimension | PostgreSQL | Cassandra |
|---|---|---|
| Write throughput | ~10,000 writes/sec on a single node (tunable) | ~1,000,000 writes/sec across a cluster — writes are always fast |
| Scaling model | Scale up (bigger server) or complex sharding | Scale out — add nodes and throughput scales linearly. No single point of failure |
| Consistency | Strong — always consistent (CP) | Tunable — you choose consistency level per query (AP by default) |
| JOINs | Full SQL JOIN support | No JOINs. One query = one table. Data must be denormalised |
| Schema | Rigid — ALTER TABLE on large tables is painful | Flexible — add new columns without downtime |
| Query patterns | Ad-hoc — query anything with SQL | Fixed — you design tables around your query patterns, not the other way around |
| Best for | Complex relational data, financial systems | Time-series data, event logs, IoT data, write-heavy workloads at massive scale |
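The "design tables around your query patterns" row deserves illustration. The sketch below mimics Cassandra's wide-row access pattern — one partition per order, rows clustered by timestamp, so a time-range read is a contiguous scan — without pretending to be its storage engine. All names and data are invented:

```python
from bisect import bisect_left, insort

# Sketch of the wide-row model: partition key = order_id, clustering
# column = timestamp. A range read touches exactly one partition.
class DeliveryTrackingTable:
    def __init__(self):
        self.partitions = {}     # order_id -> sorted list of (ts, lat, lon)

    def write(self, order_id, ts, lat, lon):
        insort(self.partitions.setdefault(order_id, []), (ts, lat, lon))

    def range_read(self, order_id, ts_from, ts_to):
        rows = self.partitions.get(order_id, [])
        i = bisect_left(rows, (ts_from,))
        return [r for r in rows[i:] if r[0] <= ts_to]

table = DeliveryTrackingTable()
table.write("ord_9001", 100, 17.38, 78.48)
table.write("ord_9001", 105, 17.39, 78.47)
table.write("ord_9001", 2000, 17.40, 78.46)
table.write("ord_7777", 101, 12.97, 77.59)   # a different partition

# "All location updates for order X between time A and time B"
print(table.range_read("ord_9001", 100, 1800))
```

Notice what you cannot do here: there is no way to ask "which orders passed near this location?" without scanning everything. If you need that query, you create a second table keyed by location — denormalisation instead of JOINs.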
Who uses Cassandra in India
Flipkart uses Cassandra for their product catalogue and recommendation engine — hundreds of millions of products, billions of user interaction events, all written at a rate no relational database could absorb. Ola uses Cassandra for ride event data — every GPS ping from every driver, every second, across millions of active rides. Hotstar (now JioCinema) used Cassandra for user watch history and playback state. The pattern is consistent: Cassandra is chosen when write volume is the primary constraint and the data does not need complex relational queries.
// Part 07
Graph Databases — When Relationships Are the Data
In a relational database, relationships are stored implicitly through foreign keys and are reconstructed at query time through JOINs. In a graph database, relationships are first-class citizens — they are stored explicitly as edges with their own properties, and the database is optimised to traverse them. This makes certain types of queries dramatically faster than any relational equivalent.
The graph model
A graph database stores data as nodes (entities — a person, a product, a location) and edges (relationships between entities — FOLLOWS, PURCHASED, LOCATED_IN). Both nodes and edges can have properties. The database is physically designed for traversal — following edges from node to node — which is what makes multi-hop relationship queries fast.
Where relational databases fail at graph queries
Consider this question: "Find all users who might know Aisha Khan — specifically people who are followed by at least 3 of Aisha's direct followers, but who Aisha does not already follow." In SQL this requires multiple levels of self-joins on a users and follows table. On a social network with 100 million users, this query takes minutes — the JOIN fan-out is exponential. In Neo4j (the dominant graph database), the same query is expressed as a simple graph traversal and executes in milliseconds because the edges are physically stored next to their nodes.
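The traversal itself is simple enough to sketch in memory. This toy version implements the "might know Aisha" query from the paragraph above over invented adjacency sets — a graph database does the same hops natively against disk-resident edges:

```python
from collections import Counter

# "People user might know": followed by >= min_followers of user's direct
# followers, excluding user and anyone user already follows. Data invented.
follows = {                      # edges: follower -> set of people they follow
    "ravi":  {"aisha", "meera"},
    "sana":  {"aisha", "meera", "dev"},
    "arjun": {"aisha", "meera"},
    "priya": {"aisha", "dev"},
    "aisha": {"dev"},            # Aisha already follows dev
}

def might_know(user, min_followers=3):
    direct_followers = {f for f, out in follows.items() if user in out}
    votes = Counter()
    for follower in direct_followers:        # one hop out from each follower
        for candidate in follows[follower]:
            votes[candidate] += 1
    excluded = follows.get(user, set()) | {user}
    return {c for c, n in votes.items()
            if n >= min_followers and c not in excluded}

print(might_know("aisha"))       # {'meera'}
```

Here meera is followed by three of Aisha's four direct followers, while dev is excluded because Aisha already follows him. The relational pain is that each extra hop multiplies the rows a JOIN must touch; a graph store just follows pointers.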
Where graph databases are used in India
LinkedIn India's connection graph, fraud detection at fintech companies (Razorpay uses graph analysis to detect fraud rings — accounts that share phones, addresses, or devices form a graph, and suspicious clusters become visible), recommendation engines at e-commerce companies (customers who bought this also bought that — a product graph), and knowledge graphs at content platforms. Neo4j and Amazon Neptune are the most common graph database choices in production.
// Part 08
Time-Series Databases — Built for Metrics, Events, and IoT
A time-series database is optimised for storing and querying data that is indexed by time — measurements taken at regular intervals, or events that occur with a timestamp. Every data point has a timestamp, and the most common query pattern is "give me all values for this metric between time A and time B."
Why time-series data is different
Time-series data has properties that standard databases do not optimise for. It is append-only — you almost never update or delete historical measurements. It arrives in time order — writes are always for "now." It is queried in ranges — "last 6 hours," "last 30 days." It is often aggregated — "average CPU per 5-minute bucket." And it grows without bound — a server emitting one metric every second generates 86,400 data points per day, over 31 million per year.
Relational databases can store this data, but they are not optimised for the write rate (millions of inserts per second across thousands of metric series) or the range-aggregation query pattern. Time-series databases use specialised storage formats — columnar compression, time-partitioned storage — that make them 10–100× more efficient for this specific access pattern.
Popular time-series databases
InfluxDB is the most popular open-source time-series database — used for application metrics, server monitoring, and IoT sensor data. TimescaleDB is PostgreSQL with time-series extensions — you get full SQL plus time-series optimisations, which makes it popular at companies that already run PostgreSQL and want one less database to operate. Prometheus is the standard for infrastructure metrics in Kubernetes environments — almost every Indian startup running on k8s uses Prometheus with Grafana dashboards.
Who uses time-series databases in India
Every company running cloud infrastructure uses time-series databases for monitoring — CPU, memory, latency, error rates, request volume. Swiggy monitors millions of time-series metrics across thousands of microservices. PhonePe tracks transaction success rates per second across payment rails. Ola tracks GPS pings and driver location updates. Any IoT application — smart meters, factory sensors, connected vehicles — is a natural time-series use case. Tata Motors uses time-series databases for vehicle telemetry from their connected car fleet.
// Part 09
How Real Companies Use Multiple Database Types Together
No production company uses just one type of database. Every system of meaningful complexity uses two, three, or four database types simultaneously — each handling the part of the problem it is best suited for. Here is how two well-known Indian companies actually structure their data infrastructure.
Swiggy — food delivery at scale
MySQL holds the authoritative order and transaction records — the data where correctness is non-negotiable. MongoDB stores restaurant and menu content, whose structure varies from restaurant to restaurant. Redis caches hot menus and sessions so the 7 PM traffic spike never reaches the primary database. Cassandra absorbs the firehose of GPS updates from delivery partners — append-only, write-heavy, queried by order and time range. Prometheus tracks metrics across thousands of microservices. Five database types, one product, each doing the one job it is best at.
Razorpay — payment infrastructure
PostgreSQL is the system of record for payments — ACID guarantees are non-negotiable when money moves, and JSONB handles variable payment metadata without giving up relational queries. Redis enforces per-key API rate limits with atomic counters. Graph analysis surfaces fraud rings — accounts sharing phones, addresses, or devices form clusters that are invisible in tabular data. And like every company running cloud infrastructure, time-series monitoring tracks latency and error rates across the payment APIs, second by second.
🎯 Pro Tip
When a system design interview asks "which database would you use?", the wrong answer is picking one and defending it absolutely. The right answer is: "It depends on the access pattern. I would use PostgreSQL for transactional data, Redis for caching and sessions, and Cassandra or MongoDB for [the specific high-volume component]." Knowing when to use each type — and being able to articulate why — is a senior-level signal that most candidates miss.
// Part 10
The Decision Framework — How to Choose the Right Database
When you are designing a system and need to choose a database, run through these five questions in order. The answers will almost always point to the right choice.
1. Are ACID guarantees non-negotiable — money, inventory, bookings? → Relational.
2. Do relationships between entities need to be enforced and queried with JOINs? → Relational.
3. Is the schema unstable, or does the structure vary per record? → Document.
4. Is write volume above roughly 100,000 append-only writes per second? → Column-family.
5. Is the access pattern a pure key lookup with sub-millisecond latency requirements? → Key-value, usually as a cache in front of one of the above.
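The framework can be sketched as a function. The thresholds and the order of checks mirror this module's guidance; real decisions weigh more factors, and the labels are illustrative:

```python
# Decision-framework sketch: checks mirror this module's guidance.
def choose_database(acid_required, needs_enforced_relations, schema_stable,
                    writes_per_sec, access_is_key_lookup):
    if acid_required or needs_enforced_relations:
        return "relational (PostgreSQL / MySQL)"
    if writes_per_sec > 100_000:
        return "column-family (Cassandra)"
    if access_is_key_lookup:
        return "key-value (Redis)"
    if not schema_stable:
        return "document (MongoDB)"
    return "relational (PostgreSQL / MySQL)"   # the safe default

print(choose_database(True,  False, True,  1_000,   False))  # relational
print(choose_database(False, False, True,  500_000, False))  # column-family
print(choose_database(False, False, False, 1_000,   False))  # document
```

Note the ordering: correctness requirements come first, because a database that scales but loses money is not a trade-off anyone accepts.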
For this entire SQL course, you will use DuckDB — a modern analytical relational database that runs in your browser. Every query you write here translates directly to MySQL, PostgreSQL, or any other relational database. The relational model and SQL are universal — learn them once and you can work with any relational database on day one.
// Part 11
What This Looks Like at Work
You are a backend engineer at a Series B fintech startup in Hyderabad. The company processes UPI payments for small merchants. The CTO calls a system design meeting — you have hit 50,000 transactions per day and need to plan for 5 million. You are asked to review the current architecture and recommend database changes.
🎯 Pro Tip
The ability to choose the right database for the right problem — and explain why, in business terms — is a rare skill. Most engineers learn one database well and reach for it for every problem. Engineers who understand the landscape choose databases the way a surgeon chooses instruments: precisely, based on the specific problem, with a clear rationale for each choice.
// Part 12
Interview Prep — 5 Questions With Complete Answers
Q1: What is the difference between SQL and NoSQL databases?
SQL databases are relational databases that store data in tables with rows and columns, enforce relationships through foreign keys, require a defined schema, and are queried using Structured Query Language. They guarantee ACID properties — atomicity, consistency, isolation, and durability — making them the standard choice for transactional data where correctness is critical. Examples: PostgreSQL, MySQL, Oracle.
NoSQL is a broad category covering any database that does not use the relational table model. The term covers four main sub-types: document databases (MongoDB — JSON documents), key-value stores (Redis — simple key lookups), column-family databases (Cassandra — wide rows, extreme write throughput), and graph databases (Neo4j — nodes and edges). They were built to solve specific problems that relational databases handle inefficiently: flexible schemas, horizontal scaling across hundreds of nodes, extreme write throughput, or relationship traversal.
The choice between them is not about which is better — it is about which access pattern and consistency trade-off fits the problem. Most production systems at scale use both: relational for transactional data where ACID matters, NoSQL for the components where scale, flexibility, or speed is the primary constraint.
Q2: Explain the CAP theorem and its practical implications.
The CAP theorem states that a distributed data system can guarantee at most two of three properties: Consistency (every read receives the most recent write), Availability (every request receives a response), and Partition Tolerance (the system continues operating when network partitions occur). Since network partitions are inevitable in any real distributed system, the practical choice is always between consistency and availability when a partition occurs.
CP systems (Consistent + Partition Tolerant) return an error rather than stale data during a partition. PostgreSQL and MySQL are CP. They are appropriate for financial systems, inventory, and any data where an incorrect read is worse than a temporary unavailability. AP systems (Available + Partition Tolerant) always respond, but the response might be slightly stale. Cassandra and CouchDB are AP. They are appropriate for social feeds, search results, and analytics where serving slightly stale data is better than returning an error.
In an interview, CAP theorem comes up in system design questions. The right answer is never "I will use a CP system" or "I will use an AP system" — it is "for the transactional component I need CP because incorrect balance information is catastrophic, and for the activity feed component I can use AP because a 2-second delay in showing a new post is acceptable."
Q3: When is Redis the right choice — and when is it the wrong one?
Redis is appropriate when the access pattern is a key lookup — retrieve a value by its exact key — and the latency requirement is sub-millisecond. The primary use cases are: caching (store the result of expensive database queries and serve from Redis for subsequent identical requests), session storage (user login state, permissions, preferences — retrieved on every authenticated API request), rate limiting (atomic increment of a counter with automatic TTL expiry), OTP storage (a one-time password with a short expiry is a perfect key-value pair), and pub/sub messaging (lightweight event broadcast between services).
Redis should not be used when you need: complex queries (JOIN, GROUP BY, aggregations), strong durability guarantees (Redis is memory-first — data can be lost if the server crashes without persistence configured), or as a primary database for relational data. Redis is most powerful as a layer in front of a relational database — it handles the hot, frequently-accessed data at microsecond speed, while the relational database handles the full dataset with all its query flexibility.
A practical rule: if the question is "give me the value for key X" and the answer is used in less than 1 millisecond, Redis. If the question requires joining, filtering, or aggregating across multiple entities, relational database — possibly with Redis caching the result.
Q4: What is a document database, and what are its trade-offs versus a relational database?
A document database stores data as self-contained documents — typically JSON — rather than as rows in tables. Each document can have any structure, and two documents in the same collection can have entirely different fields. Data that belongs together conceptually is stored together physically — a customer document can embed their address, contact details, and recent orders, rather than spreading that data across multiple tables.
The advantages over relational databases are: flexible schema (add new fields without a migration), fast reads for document-shaped access patterns (one read fetches the complete entity without JOINs), and easier horizontal scaling (documents can be distributed across nodes by a shard key). MongoDB is the dominant document database and is widely used for product catalogues, user profiles, content management systems, and any data where the structure varies per record or evolves frequently.
The trade-offs are significant. No JOINs — if you need to query across collections, you either embed data (creating redundancy) or do multiple queries and join in application code. Cross-document transactions are limited — atomicity within a single document is guaranteed, but operations across multiple documents require careful design. Complex ad-hoc queries are harder — MongoDB's aggregation pipeline is powerful but less expressive than SQL for complex analytical queries. And because schema is not enforced, data quality problems (missing fields, wrong types) accumulate silently over time and must be managed in application code.
Q5: Why would Swiggy use Cassandra instead of MySQL for delivery tracking?
Delivery tracking data has specific characteristics that make Cassandra the right choice. First, write volume: at peak dinner time, Swiggy might have 500,000 active deliveries simultaneously, each emitting a GPS update every 5 seconds. That is 100,000 writes per second of location data alone. MySQL on a single node handles approximately 10,000–50,000 writes per second under optimal conditions — this is already at the limit. Cassandra is designed to handle millions of writes per second across a cluster, and adding nodes scales write throughput linearly.
Second, the access pattern is simple and fixed: "give me all location updates for order X in the last 30 minutes." This is a primary key lookup plus a time range — exactly what Cassandra's data model is optimised for. There is no need for JOINs, GROUP BY, or complex analytics on this data.
Third, the data is append-only. Location updates are never updated or deleted while the delivery is active. After delivery completion, the data is rarely accessed. Cassandra's log-structured storage is optimised for append-only workloads — writes are always fast because they never need to find and update existing records. MySQL's write path requires finding the right page, checking constraints, and updating indexes — all slower for pure-append workloads. Swiggy still uses MySQL for the orders table itself (the authoritative record of the transaction), but delegates the high-volume, time-series-like tracking data to Cassandra.
// Part 13
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓The CAP theorem: distributed systems can guarantee at most 2 of Consistency, Availability, and Partition Tolerance. Since partitions are inevitable, the real choice is between consistency (CP) and availability (AP). Relational databases are CP. Cassandra and most NoSQL databases are AP.
- ✓Relational databases (PostgreSQL, MySQL) are the right default for structured data with relationships, complex queries, and ACID requirements — financial systems, inventory, bookings, anything where correctness is critical.
- ✓Document databases (MongoDB) store flexible JSON documents. Best for data with variable structure, content-heavy applications, and entity-centric access patterns. No JOINs — embed related data or accept multiple queries.
- ✓Key-value databases (Redis) are the fastest database type for simple lookups. Best for caching, sessions, rate limiting, OTP storage, and any use case where you retrieve a value by an exact key with sub-millisecond requirements.
- ✓Column-family databases (Cassandra) are built for extreme write throughput and linear horizontal scalability. Best for append-only data: event logs, delivery tracking, IoT sensor data, audit trails. No JOINs. Design tables around queries.
- ✓Graph databases (Neo4j) store relationships as first-class entities. Best when relationships are the data — social networks, fraud ring detection, recommendation engines, knowledge graphs. Multi-hop traversals are dramatically faster than relational equivalents.
- ✓Time-series databases (InfluxDB, TimescaleDB, Prometheus) are optimised for timestamp-indexed measurements. Best for infrastructure metrics, IoT sensor data, application performance monitoring. 10–100× more efficient than relational databases for this access pattern.
- ✓No production system uses just one database type. Swiggy runs MySQL + MongoDB + Redis + Cassandra + Prometheus simultaneously — each handling the component it is best suited for.
- ✓The database selection framework: Does it need relationships enforced? → Relational. What is the access pattern? → determines the NoSQL type. Write volume above 100k/sec? → Cassandra. ACID non-negotiable? → Relational. Schema unstable? → Document.
- ✓For this entire SQL course, DuckDB runs in your browser. Every query you write translates directly to MySQL and PostgreSQL. The relational model and SQL are universal — master them here and you can work with any relational database from day one.
What comes next
In Module 04, you set up your local SQL environment — install MySQL or PostgreSQL, connect with a client, and run your first query on a real database server. If you prefer to keep using the browser playground for now, you can skip ahead to Module 05 where the SQL writing begins.