Data Warehouse vs Data Lake vs Lakehouse
Three answers to where we store data — the honest trade-offs and how to choose.
The Same Problem Solved Three Different Ways
Every company that collects data eventually faces the same problem: the operational databases that run the business are not suitable for analysis. They are optimised for fast individual record access, not for scanning millions of rows to compute aggregations. The business needs somewhere to store all its data for analysis. The question is where.
Three distinct architectural answers have emerged over the past 40 years, each built on the lessons and failures of the previous one. Understanding why each was invented — what problem it solved and what new problems it created — is the foundation for making intelligent storage architecture decisions.
This is not a story of old being replaced by new. All three architectures coexist in production today. Many large companies have all three in their data platform. Understanding each one deeply — its design, its trade-offs, and where it fits in a modern data architecture — is essential knowledge for every data engineer.
Data Warehouse
A data warehouse is a centralised relational database designed specifically for analytical queries. It is where cleaned, modelled, business-ready data lives. Business analysts connect their BI tools to the warehouse. Data scientists run training queries against it. Finance teams pull reports from it. The warehouse is the single source of truth for historical, clean data in a traditional data architecture.
How a data warehouse is different from an operational database
Both are relational databases that accept SQL queries. The differences are in storage layout, optimisation target, and design philosophy.
An operational database (PostgreSQL, MySQL) is row-oriented and normalised — optimised for fast individual record access with ACID transactions. An analytical query that scans the whole table — aggregating 90 days of orders, say — is slow and disruptive on a production OLTP database.
A data warehouse (Snowflake, BigQuery, Redshift) is columnar and typically denormalised — optimised for aggregate queries that scan many rows across few columns. The same analytical scan that would cripple a production database runs in seconds on a warehouse because of columnar storage, predicate pushdown, and distributed execution across many compute nodes.
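Why columnar storage wins for aggregates can be shown without any database at all. The sketch below is a pure-Python illustration (not a real storage engine): the same table held row-oriented and column-oriented, and a count of how many field values a `SUM(amount)` has to pass over in each layout.

```python
# Illustrative only: the same 100k-row orders table in two layouts,
# and how much data a SUM(amount) aggregate actually touches.

# Row-oriented: each row is one record holding all five fields together.
rows = [
    {"order_id": i, "customer_id": i % 1000, "amount": 10.0,
     "status": "delivered", "created_at": "2026-03-17"}
    for i in range(100_000)
]

# Column-oriented: one contiguous list per column.
columns = {
    "order_id":    [r["order_id"] for r in rows],
    "customer_id": [r["customer_id"] for r in rows],
    "amount":      [r["amount"] for r in rows],
    "status":      [r["status"] for r in rows],
    "created_at":  [r["created_at"] for r in rows],
}

# SUM(amount): the row layout drags the scan past all 5 fields of every
# row; the columnar layout reads exactly the one column it needs.
row_fields_touched = len(rows) * 5           # 500,000 field values
col_fields_touched = len(columns["amount"])  # 100,000 field values

total = sum(columns["amount"])
print(row_fields_touched // col_fields_touched)  # 5x less data touched
```

On a real warehouse the gap is far larger than 5x: compression per column and partition pruning (skipping files outside the 90-day window) multiply the effect.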
BUSINESS QUESTION: What was our daily revenue by city and restaurant
category for the last 90 days?
ON POSTGRESQL (operational database):
SELECT DATE(created_at), r.city, r.category, SUM(o.amount)
FROM orders o
JOIN restaurants r ON o.restaurant_id = r.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
GROUP BY 1, 2, 3
ORDER BY 1;
Result: 4 minutes 23 seconds (100M orders table, full scan)
Problem: this query evicted hot pages from buffer pool, slowing
every concurrent application query for 30 minutes after
ON SNOWFLAKE (data warehouse):
Same query. Same data (loaded into warehouse).
Result: 3.8 seconds
Reason: columnar storage reads only amount and created_at columns,
partitioning prunes to last 90 days only,
distributed execution across 8 compute nodes
WHY WAREHOUSES EXIST:
→ Separate analytical compute from operational compute
No more slowing down the live app with analytical queries
→ Optimised storage layout for analytical patterns
Columnar, compressed, partitioned, clustered
→ Scalable compute on demand
Scale compute nodes up for heavy queries, down when idle
→ Centralised governance and access control
One place for all business data, one place to manage access
→ Schema enforcement at query time
Data is validated and typed before entering the warehouse

The schema-on-write constraint — the warehouse's biggest limitation
The defining characteristic of a data warehouse is schema-on-write: before any data enters the warehouse, it must conform to a defined schema. Column names, data types, NOT NULL constraints, and foreign keys must all be satisfied. Data that does not conform is rejected.
This constraint is the source of the warehouse's trustworthiness — data in the warehouse conforms to a known structure and can be trusted. It is also the source of the warehouse's biggest limitation: you must know what you want to store before you store it, and changing the schema (adding columns, changing types) requires explicit migrations.
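Schema-on-write rejection is easy to demonstrate. In this sketch, Python's built-in sqlite3 stands in for the warehouse (the table DDL and values are made up for illustration): the schema is declared before any data arrives, and a load that violates it fails at write time instead of being stored.

```python
import sqlite3

# sqlite3 standing in for a warehouse: schema declared up front,
# non-conforming data rejected before it ever enters the table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER NOT NULL,
        amount   REAL    NOT NULL,
        status   TEXT    NOT NULL CHECK (status IN ('placed', 'delivered'))
    )
""")

# Conforming row: accepted.
conn.execute("INSERT INTO orders VALUES (1, 499.0, 'delivered')")

# Non-conforming rows: rejected at write time.
try:
    conn.execute("INSERT INTO orders VALUES (2, NULL, 'delivered')")  # NULL amount
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    conn.execute("INSERT INTO orders VALUES (3, 99.0, 'refnded')")    # bad status
except sqlite3.IntegrityError as e:
    print("rejected:", e)

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # only the conforming row made it in → 1
```

This is exactly the trade described above: the rejected rows can be trusted never to pollute the table, but any new field or type change means an explicit migration first.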
For a company in 2010 that generated structured transactional data from known systems, schema-on-write was acceptable. For a company in 2020 that was also collecting raw click streams, IoT sensor data, unstructured log files, and semi-structured JSON events — schema-on-write became a bottleneck. Someone had to define a schema for every new data source before a single byte could be stored. This bottleneck created the data lake.
Storage cost — the warehouse's second limitation
Traditional on-premises data warehouses (Teradata, IBM Netezza, Oracle Exadata) bundled compute and storage together in proprietary hardware. Storage cost was $10,000–$50,000 per TB. This made it economically impossible to store raw, unprocessed data in the warehouse — only clean, valuable data justified the cost. The warehouse became a highly curated space. Raw data was discarded or stored on separate, cheaper but less queryable storage.
Data Lake
The data lake emerged in the early 2010s as a direct reaction to the warehouse's limitations. The invention of cheap, scalable cloud object storage (Amazon S3 launched in 2006) meant that storing raw data became economically viable for the first time. A terabyte in S3 costs $23 per month. The same terabyte in a traditional warehouse hardware appliance cost thousands.
The data lake philosophy is the opposite of the warehouse: store everything now, define structure when you query it (schema-on-read). Raw JSON, CSV files, log files, images, audio, sensor streams — land everything in object storage in its original format. Do not clean, do not transform, do not reject. Worry about structure later.
Schema-on-read — the lake's defining characteristic
In a data lake, the schema is not defined when data is written — it is defined when data is read. You can store a year of raw JSON events without ever defining their schema. When you need to analyse them, you read the files and apply a schema at query time. Apache Spark and query engines like Athena and Presto support this pattern natively — they infer or accept an externally-defined schema at read time.
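The inference step can be shown in miniature with the standard library (Spark's `spark.read.json` does the same thing at scale; the file contents here are invented for illustration): the schema is simply the union of every field seen across the files, applied only at read time.

```python
import json

# Schema-on-read in miniature: raw event files stored as-is, schema
# derived only when the data is read.
monday_file  = '{"order_id": 1, "amount": 450.0}\n{"order_id": 2, "amount": 120.0}'
tuesday_file = '{"order_id": 3, "amount": 99.0, "restaurant_tier": "premium"}'  # new field

records = [json.loads(line)
           for f in (monday_file, tuesday_file)
           for line in f.splitlines()]

# "Infer" the schema: the union of every key seen across all files.
schema = sorted({key for rec in records for key in rec})
print(schema)  # ['amount', 'order_id', 'restaurant_tier']

# Apply it at read time: older records return None for the new field.
table = [{col: rec.get(col) for col in schema} for rec in records]
print(table[0]["restaurant_tier"])  # None — no migration was needed
```

Note the contrast with the warehouse: the `restaurant_tier` field appeared in Tuesday's file and required no ALTER TABLE, no migration, no re-load.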
SCHEMA-ON-WRITE (warehouse):
Before loading order data, you must define:
CREATE TABLE orders (
order_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
amount DECIMAL(10,2) NOT NULL,
status VARCHAR(20) NOT NULL CHECK (status IN (...)),
created_at TIMESTAMP NOT NULL
);
Data that doesn't match this schema → REJECTED
New field added to source (restaurant_tier)?
→ ALTER TABLE, migration, re-load. Days of work.
SCHEMA-ON-READ (data lake):
Store raw JSON as-is in S3:
s3://data-lake/orders/2026-03-17/batch_001.json
No schema definition needed. File lands in seconds.
Query time (Spark / Athena):
spark.read.json("s3://data-lake/orders/2026-03-17/")
→ Spark infers schema from the files
→ Returns DataFrame with inferred column types
New field added to source?
→ Nothing to do. It just appears in new files.
→ Old files that lack the new field return null for it.
→ No migration needed.
Freedom: store ANY format — JSON, CSV, Parquet, images, audio.
Cost: $0.023/GB/month on S3 vs $10,000+/TB for old warehouse hardware.
The tradeoff: you get flexibility and low cost.
You lose: schema enforcement, data quality guarantees, fast SQL.Why the data lake became a data swamp
The "store everything now, worry later" philosophy sounded liberating. In practice, it created a new set of problems — thousands of undocumented files, inconsistent formats, no schema enforcement, no ownership — that became so pervasive they earned their own name: the data swamp. A data lake that nobody can use effectively is a data swamp.
Lakehouse
The lakehouse is the architecture that combines the low cost and flexibility of a data lake with the reliability and governance of a data warehouse. It achieves this by adding a transaction log layer on top of object storage — a metadata layer that tracks every change made to the data, enforces schema evolution rules, and provides ACID transaction guarantees without moving data to an expensive proprietary system.
Databricks introduced Delta Lake in 2019. Apache Iceberg (originally from Netflix) and Apache Hudi (originally from Uber) are the other dominant table formats. All three work on the same principle: your data files are still plain Parquet on S3 or ADLS — but a transaction log alongside them records every operation, making the whole thing behave like a proper database table.
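The core principle fits in a few lines. The sketch below is a toy simulation of the idea, not the real Delta protocol (which additionally relies on atomic commit-file creation, checkpoints, and conflict detection): data files sit in a directory, a `_delta_log` of numbered JSON commits records which files belong to the table, and readers trust the log rather than the directory listing.

```python
import json
import tempfile
from pathlib import Path

# Toy transaction log (illustrative only, not the real Delta protocol).
table_dir = Path(tempfile.mkdtemp())
log_dir = table_dir / "_delta_log"
log_dir.mkdir()

def commit(version, actions):
    # One commit = one numbered JSON file, written only after the data files.
    (log_dir / f"{version:020d}.json").write_text(json.dumps(actions))

def committed_files():
    # Replay the log in order: 'add' includes a file, 'remove' excludes it.
    files = set()
    for entry in sorted(log_dir.iterdir()):
        for action in json.loads(entry.read_text()):
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files

(table_dir / "part-00001.parquet").write_text("monday data")
commit(0, [{"add": "part-00001.parquet"}])

# A writer crashes after writing a data file but BEFORE committing:
(table_dir / "part-00002.parquet").write_text("partial tuesday data")

# Readers see only the committed state — the orphan file is invisible.
print(committed_files())  # {'part-00001.parquet'}
```

This one mechanism is what turns a pile of Parquet files into a table: the orphaned file from the failed job simply never becomes part of the table, and replaying the log up to any version gives you time travel for free.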
How the table format layer works
Plain S3 without Delta Lake (data lake):
s3://data-lake/orders/
part-00001.parquet (written Monday)
part-00002.parquet (written Tuesday)
part-00003.parquet (written Tuesday, overlaps with 00002 — BUG)
Problems:
If Tuesday's job failed partway, 00002 and 00003 partially exist
No way to know which files are "committed" vs in-progress
No way to update or delete specific rows
Two simultaneous writers corrupt the data
WITH DELTA LAKE (lakehouse):
s3://data-lake/orders/
part-00001.parquet
part-00002.parquet
part-00003.parquet
_delta_log/
00000000000000000000.json ← transaction 0: initial table creation
00000000000000000001.json ← transaction 1: add part-00001
00000000000000000002.json ← transaction 2: add part-00002
00000000000000000003.json ← transaction 3: add part-00003
00000000000000000004.json ← transaction 4: DELETE WHERE status='test'
Transaction log entry (JSON):
{
"commitInfo": {"timestamp": 1710698072847, "operation": "WRITE",
"operationParameters": {"mode": "Append"}},
"add": {"path": "part-00002.parquet",
"size": 8472931, "dataChange": true,
"stats": {"numRecords": 100000,
"minValues": {"order_id": 9284751},
"maxValues": {"order_id": 9384750}}}
}
WHAT THE TRANSACTION LOG ENABLES:
✓ ACID transactions: readers see only committed transactions
✓ Time travel: read table as it was at any past transaction
spark.read.format("delta").option("versionAsOf", 3).load(path)
✓ Row-level deletes: DELETE FROM orders WHERE customer_id = 4201938
(marks old file as removed, writes new file without the row)
✓ Schema evolution: ALTER TABLE orders ADD COLUMN delivery_fee DECIMAL
✓ Concurrent writes: optimistic concurrency control, conflict detection
✓ Audit log: full history of every operation ever performed

The three table formats — Delta Lake, Iceberg, Hudi
All three table formats solve the same core problems (ACID transactions, time travel, schema evolution on object storage) but differ in ecosystem support, specific capabilities, and design philosophy.
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Created by | Databricks (2019) | Netflix → Apache (2018) | Uber → Apache (2019) |
| Metadata format | JSON transaction log in _delta_log/ | Metadata JSON tree + manifest files | Timeline + .hoodie metadata directory |
| Primary strength | Deep Spark + Databricks integration, simplest to use | Engine-agnostic, works with Spark + Flink + Trino + Athena + BigQuery | Optimised for high-frequency upserts and near-real-time CDC ingestion |
| Time travel | ✓ Version-based and timestamp-based | ✓ Snapshot-based, very efficient | ✓ Commit-timeline based |
| Schema evolution | ✓ Add columns, drop columns, rename (with config) | ✓ Full evolution with hidden partitioning | ✓ Add columns, evolve types |
| Partition evolution | Manual — must rewrite data to change partitions | ✓ Hidden partitioning — change partitions without rewriting data | Manual partition evolution |
| Best query engines | Apache Spark, Databricks, Trino (via connector) | Spark, Flink, Trino, Athena, BigQuery Omni, Snowflake | Spark, Flink, Hive, Presto |
| Choose when | Databricks is your primary compute platform | Multiple query engines, AWS-heavy or multi-cloud | High-frequency CDC upserts are the dominant workload |
The separation of compute and storage — the lakehouse advantage
Traditional data warehouses bundled compute and storage together — if you needed more storage, you also paid for more compute, and vice versa. This coupling was economically wasteful: most data is rarely queried, so paying for always-on compute to sit next to cold storage made little sense.
The lakehouse completely separates compute from storage. Data lives in S3 or ADLS at $0.023/GB/month. Compute — Spark clusters, Databricks clusters, Trino clusters — spins up when a query runs and shuts down when it finishes. You pay for storage continuously at low cost and pay for compute only while it is actually running. This economic model made petabyte-scale analytics accessible to companies that could never afford traditional warehouse hardware.
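A back-of-envelope comparison makes the economics concrete. All prices below except the S3 rate are illustrative assumptions, not vendor quotes: a bundled always-on appliance versus separated storage plus on-demand compute, for 100 TB of mostly-cold data.

```python
# Illustrative cost sketch — assumed prices, not quotes.
data_tb = 100

# Bundled appliance: storage and compute priced together, always on.
bundled_per_tb_month = 500            # assumed all-in $/TB/month
bundled_monthly = data_tb * bundled_per_tb_month

# Separated: S3-class storage always on, compute only while queries run.
storage_per_tb_month = 23             # ~$0.023/GB/month (S3 standard)
compute_per_hour = 20                 # assumed cluster $/hour
query_hours_per_month = 60            # ~2h/day of queries, then shut down
separated_monthly = (data_tb * storage_per_tb_month
                     + compute_per_hour * query_hours_per_month)

print(bundled_monthly)    # 50000
print(separated_monthly)  # 2300 storage + 1200 compute = 3500
```

The ratio shifts with query volume — a cluster running 24/7 erodes the advantage — but for the common case of large, rarely-queried data, separation is an order of magnitude cheaper.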
Warehouse vs Lake vs Lakehouse — Every Dimension
| Dimension | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Storage cost | High — $100–500+/TB/month | Very low — $23/TB/month (S3) | Very low — same object storage as lake |
| Schema approach | Schema-on-write — enforced before load | Schema-on-read — applied at query time | Schema-on-write with evolution — enforced but changeable |
| Data types accepted | Structured only (tables) | Any — structured, semi, unstructured | Structured + semi-structured (unstructured stored separately) |
| ACID transactions | ✓ Full ACID | ✗ None — append only | ✓ Full ACID via transaction log |
| Row-level updates | ✓ Yes (UPDATE/DELETE) | ✗ No — must rewrite files | ✓ Yes (rewrites affected files) |
| Time travel | Depends — Snowflake yes, Redshift no | ✗ No — no history tracking | ✓ Yes — any historical version |
| SQL performance | Excellent — optimised for SQL | Slow without Parquet + partitioning | Good — slightly below pure warehouse for simple SQL |
| ML / Spark access | Via JDBC/export — slower | ✓ Direct file access — fast | ✓ Direct Spark access — fast |
| Governance | ✓ Strong — schema + access control | ✗ Weak — no schema enforcement | ✓ Strong — schema + audit log + access control |
| Maintenance | Low — managed by vendor | Low to medium — file organisation matters | Medium — compaction, VACUUM, OPTIMIZE jobs required |
| Vendor lock-in | High — proprietary format and SQL | Low — open file formats (Parquet) | Low — open formats, table format is open source |
| Best for | Business analytics, BI dashboards, regulated data | Raw data archive, ML training data, unstructured data | Unified analytics + ML + streaming, modern data platforms |
How All Three Coexist in a Modern Data Platform
The choice is not "pick one." Most mature data platforms at Indian tech companies use all three architectural patterns simultaneously, each serving a different purpose. Understanding how they fit together is more useful than treating them as alternatives.
LAYER ARCHITECTURE PURPOSE
─────────────────────────────────────────────────────────────────────
Landing Zone Data Lake Raw files in S3/ADLS exactly
(object storage) as received — any format,
no schema enforcement,
permanent archive
Bronze Layer Data Lake Parquet files, date-partitioned,
(object storage (organised) metadata added. Still no ACID
+ Parquet) but organised for query efficiency
Silver Layer Lakehouse Delta Lake / Iceberg tables.
(object storage (Delta/Iceberg) ACID transactions, schema
+ table format) enforced, row-level deletes,
time travel enabled.
Cleaned and conformed data.
Gold Layer Lakehouse + Delta tables for Spark consumers
(object storage Warehouse Snowflake/BigQuery tables for
+ warehouse) SQL/BI consumers.
Same data, two serving layers.
ML Feature Store Lakehouse Delta tables with low-latency
(object storage) access for model training.
Versioned features for reproducibility.
Unstructured Data Pure Data Lake Images, audio, PDFs in S3.
(object storage) No table format — just organised
object storage with metadata table
tracking what exists.
REAL EXAMPLE — Swiggy-style architecture:
Raw Kafka events (JSON) → S3 landing zone (data lake)
Landing → Bronze Parquet (data lake, organised)
Bronze → Silver Delta tables (lakehouse, ACID, deduplicated)
Silver → Gold Delta tables (lakehouse, aggregated)
Gold → Snowflake (warehouse, SQL for analysts, Power BI)
Gold → Spark ML jobs (lakehouse, model training)
Product images → S3 (pure lake, ML model reads directly)

When to use just a warehouse (no lake needed)
Not every company needs a data lake or lakehouse. A Series A startup with structured data from a few PostgreSQL tables, a Salesforce CRM, and a payment processor does not need Delta Lake and a Spark cluster. A direct pipeline from those sources into Snowflake, with dbt transformations, is the right architecture. Simple, fast to build, easy to maintain.
The trigger points for adding a lake or lakehouse layer: data volume exceeds what a warehouse can economically store (usually in the multi-TB range for raw data), data types include unstructured content that cannot go in the warehouse, ML teams need direct file access for model training, or schema flexibility requirements exceed what the warehouse's migration process can handle comfortably.
The modern warehouse — blurring the line
Modern cloud warehouses (Snowflake, BigQuery, Redshift) have been adding data lake features. Snowflake can query external S3 files directly. BigQuery supports external tables over GCS. Redshift Spectrum queries Parquet files in S3 without loading them into the warehouse. Simultaneously, the lakehouse (Databricks SQL) has been adding warehouse-quality SQL performance. The boundary between warehouse and lakehouse is blurring — both are converging on the same goal: cheap lake storage with warehouse-quality SQL.
How to Choose the Right Architecture for Your Company
SITUATION 1: Small to mid-size company, mostly structured data
Team: 1–3 data engineers, 2–5 analysts
Data: PostgreSQL, a few SaaS APIs, payment processor
Volume: < 500 GB total
Answer: WAREHOUSE ONLY (Snowflake or BigQuery)
→ Ingest directly into warehouse via dbt
→ No lake needed — volume is manageable
→ Analysts use SQL directly, BI tools connect to warehouse
→ Simple, fast to build, low maintenance
SITUATION 2: Growth stage company, mixed structured + semi-structured
Team: 3–8 data engineers
Data: PostgreSQL, MongoDB, Kafka events, vendor files
Volume: 500 GB – 10 TB
Answer: LAKE + WAREHOUSE (Medallion architecture)
→ Bronze: raw Parquet in S3, preserves all data cheaply
→ Silver/Gold: dbt transformations into Snowflake
→ Analysts query Snowflake, raw lake is for reprocessing
→ Add Delta Lake when update/delete requirements emerge
SITUATION 3: Scale company, ML platform, streaming data
Team: 10+ data engineers, ML team, analytics team
Data: Multiple DBs, Kafka, IoT, unstructured content
Volume: 10 TB – PB scale
Answer: FULL LAKEHOUSE (Delta Lake or Iceberg)
→ Landing zone: raw files in S3/ADLS
→ Bronze: Parquet with date partitioning
→ Silver: Delta/Iceberg tables (ACID, time travel)
→ Gold: Delta/Iceberg + Snowflake serving layer
→ ML: Spark reads directly from Delta tables
→ Analysts: Snowflake or Databricks SQL
SITUATION 4: Regulated industry (banking, healthcare, insurance)
Requirements: full audit trail, GDPR compliance, data lineage
Answer: WAREHOUSE with strong governance OR LAKEHOUSE with Unity Catalog
→ Warehouses (Snowflake) have mature governance features
→ Lakehouse (Databricks + Unity Catalog) for scale + governance
→ Time travel in both enables point-in-time audit queries
→ Row-level security and column-level data masking for PII

Migrating From a Warehouse-Only Architecture to a Lakehouse
A D2C fashion startup raised its Series C in 2024. They have 2 million active customers, 800,000 orders per month, and a data platform built two years ago as a simple Snowflake warehouse. Their data engineers run five dbt pipelines that load data from PostgreSQL and Shopify into Snowflake. It works. But three new requirements have arrived simultaneously that Snowflake alone cannot handle well.
Requirement 1 — ML team needs training data: The ML team wants to train a recommendation model on 18 months of user behaviour events (200 million rows of click and purchase events). Loading all of this into Snowflake for ML training costs $800/month just in compute credits. They need direct file access.
Requirement 2 — GDPR deletion requests: European customers are submitting data deletion requests. Deleting a customer from Snowflake is straightforward. But the raw PostgreSQL CDC events and Shopify export files in S3 still contain their data. Without row-level deletes on S3 files, full GDPR compliance is impossible.
Requirement 3 — Raw data preservation: A business intelligence query found that Snowflake's computed metrics do not match the raw source data for a period six months ago. Someone ran a data migration that corrupted three weeks of order metrics. They need to reprocess from raw. But raw data was never preserved — only the transformed Snowflake tables exist.
Your migration plan: Add a lakehouse layer below the existing Snowflake warehouse without disrupting what already works.
Step 1: land all raw data from every pipeline run into S3 as Parquet before loading to Snowflake. The landing zone costs $8/month to store a year of raw data. Raw data is now preserved for reprocessing.
Step 2: convert the S3 raw layer to Delta Lake tables. GDPR deletion requests can now be executed with a Delta DELETE statement — Delta writes new Parquet files excluding the deleted customer's records. Full GDPR compliance on the lake layer.
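Under the hood, a Delta `DELETE` never edits a Parquet file in place. The simplified sketch below (plain dicts standing in for Parquet rows, file names invented) shows the mechanism: write a new file without the deleted rows, then record a remove/add pair in the transaction log.

```python
import json

# Simplified view of what DELETE FROM orders WHERE customer_id = 4201938
# does in a Delta table: rewrite the affected file, update the log.
old_file = [  # rows of part-00001, as dicts for illustration
    {"order_id": 1, "customer_id": 4201938, "amount": 450.0},
    {"order_id": 2, "customer_id": 7000001, "amount": 120.0},
    {"order_id": 3, "customer_id": 4201938, "amount": 88.0},
]

# New file: every row EXCEPT the deleted customer's.
new_file = [row for row in old_file if row["customer_id"] != 4201938]

# Log entry: old file leaves the table, new file joins it. The old file
# stays on disk until VACUUM, which is why retention policy matters for GDPR.
log_entry = [
    {"remove": "part-00001.parquet"},
    {"add": "part-00004.parquet"},
]

print(len(new_file))  # 1 row survives
print(json.dumps(log_entry))
```

One GDPR-relevant detail follows directly from this design: the removed file still exists physically until a `VACUUM` runs, so the deletion is only complete once old versions past the retention window are physically purged.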
Step 3: expose the Delta Lake Silver tables to the ML team via Databricks. They read training data directly from Delta files on S3 — no Snowflake compute credits consumed. ML training cost drops to near zero.
The result: three blocking requirements resolved without replacing the existing Snowflake warehouse. The analysts still use Snowflake for SQL. The ML team uses Databricks against Delta. Operations use the Delta layer for GDPR compliance. Raw data is preserved for reprocessing. Total additional infrastructure cost: $80/month for S3 storage and Databricks community edition for development.
This is the most common real-world lakehouse adoption pattern — not a full rewrite, but a pragmatic addition of lake and lakehouse layers to a working warehouse architecture.
🎯 Key Takeaways
- ✓The data warehouse (1980s–2000s) is a structured, columnar, SQL-queryable analytical database. It enforces schema on write, delivers fast SQL queries, and provides strong governance — but is expensive per GB and cannot store unstructured data or accept schema changes without migrations.
- ✓The data lake (2010s) is cheap object storage (S3/ADLS/GCS) that accepts any data in any format with no schema requirement. It solves the warehouse's cost and flexibility problems but creates new ones: no ACID transactions, no row-level updates, no schema enforcement, and the risk of becoming a data swamp.
- ✓A data swamp is a data lake that nobody can use — undocumented, inconsistent, ungoverned. Prevent it with four practices: convert raw files to Parquet, use date partitioning, maintain a data catalogue, and add a table format (Delta Lake or Iceberg).
- ✓The lakehouse (2020s) adds a transaction log layer (Delta Lake, Iceberg, Hudi) on top of object storage, giving it ACID transactions, row-level updates and deletes, time travel, and schema evolution — at lake storage prices. The convergence of warehouse reliability with lake economics.
- ✓Delta Lake uses a JSON transaction log (_delta_log/) alongside Parquet files to track every operation. Readers see only committed transactions. Time travel lets you query any historical version. Row-level deletes write new Parquet files excluding the deleted rows.
- ✓The three lakehouse table formats (Delta Lake, Iceberg, Hudi) solve the same core problems but differ in ecosystem: Delta is best for Databricks-heavy stacks, Iceberg is engine-agnostic and best for multi-engine environments, Hudi is best when high-frequency CDC upserts are the dominant workload.
- ✓Modern data platforms use all three architectures simultaneously: raw data in pure lake (S3), Silver/Gold as lakehouse tables (Delta/Iceberg), analytical serving via warehouse (Snowflake/BigQuery). Each layer uses the architecture suited to its purpose.
- ✓Time travel is practically useful for pipeline debugging (compare before/after a bad run), reprocessing (restore previous version after a bug), GDPR audit queries (what was the state on this date), and SCD implementation. Retain snapshots for at least 7 days.
- ✓The choice is not "lake or warehouse" — it is which combination matches your scale, team size, and requirements. A Series A startup with structured data and SQL analysts only needs a warehouse. A scale company with ML workloads, streaming data, and GDPR requirements needs a lakehouse.
- ✓The most impactful single migration a data engineer can make on a legacy CSV data lake is: convert all files to Parquet + add date partitioning + add Delta Lake. This typically reduces query time by 10–20×, reduces storage cost by 5–10×, and adds GDPR-compliant deletion capability — all without changing the downstream SQL interface.