Data Lake Architecture
What a data lake is, why it was invented, zone-based organisation, compute-storage separation, and the rise of the lakehouse.
What a Data Lake Actually Is — Beyond the Buzzword
"Data lake" is one of the most overloaded terms in data engineering. It has been used to mean a dump of raw files on S3, a governed analytical platform, a Hadoop cluster, and a marketing term for whatever a vendor is selling. This module gives a precise, technical definition and the historical context that explains why data lakes were invented, what problems they solve, and where they fall short.
A data lake is a centralised storage repository that holds data in its native raw format — structured, semi-structured, and unstructured — at any scale, with compute separated from storage. The defining characteristics are storage format flexibility (not just tabular rows), schema-on-read (structure applied when querying, not when writing), and the separation of durable cheap object storage from the compute engines that query it.
Why Data Lakes Were Invented — The Problem They Solved
Data lakes did not emerge from an abstract architectural preference. They emerged from a specific set of problems that data warehouses could not solve economically in the early 2010s. Understanding those problems makes the lake's architectural choices — raw storage, schema-on-read, object storage — feel inevitable rather than arbitrary.
The three warehouse limitations that created demand for lakes
WAREHOUSE LIMITATION 1: Cost of unstructured and semi-structured data
────────────────────────────────────────────────────────────────────
A data warehouse of the era (Teradata, Oracle, the newly launched Redshift in 2012) stores structured tables.
Loading semi-structured JSON from application logs requires:
- Flattening the JSON into tabular columns (an ETL step)
- Defining a schema before loading
- Storing in the warehouse's proprietary columnar format
Application logs at Flipkart in 2013: 10 TB/day of JSON event logs
— roughly 3,650 TB accumulated over a year.
Cost to store that year in Teradata: $50,000/TB × 3,650 TB ≈ $182M
Cost to store on HDFS (Hadoop): ~$1,000/TB × 3,650 TB ≈ $3.65M
Cost to store on S3 (2015): $23/TB/month × 3,650 TB ≈ $84,000/month
For unstructured data (images, audio, ML model outputs):
Warehouses simply could not store them at all.
You needed a separate system regardless.
WAREHOUSE LIMITATION 2: Schema-on-write inflexibility
────────────────────────────────────────────────────────────────────
Warehouse schema must be defined before loading.
A new event type requires:
- Schema design (hours to days)
- Table creation and ALTER TABLE (minutes, but blocking)
- ETL pipeline update (days of engineering work)
- Schema approval from DBA (days to weeks at large companies)
At a fast-growing startup: 5 new event types per week.
Each requires a full schema-change workflow.
Result: data arrives 2 weeks after the event type is shipped.
Business: "We can't analyse this event — the data team hasn't added it yet."
Data lake answer: just drop the JSON file in the landing zone.
No schema required at write time. Define it when you query.
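The schema-on-read idea is easy to demonstrate in plain Python: raw JSON lands untouched, and structure is applied only when a query runs. The event names and fields below are invented for illustration:

```python
import json

# Raw events land exactly as emitted -- no schema was declared at write time.
raw_events = [
    '{"event": "add_to_cart", "user_id": 42, "sku": "A-1"}',
    '{"event": "checkout", "user_id": 42, "total": 799.0, "coupon": "NEW10"}',
]

def query_checkouts(lines):
    """Apply structure at read time: pick only the fields this query needs,
    tolerating extra fields the writer added without telling anyone."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("event") == "checkout":
            rows.append({
                "user_id": int(record["user_id"]),
                "total": float(record.get("total", 0.0)),
            })
    return rows

print(query_checkouts(raw_events))  # one typed row, schema applied at query time
```

A new event type needs no schema change at all — it simply sits in the raw files until some query decides to interpret it.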
WAREHOUSE LIMITATION 3: Coupled compute and storage scaling
────────────────────────────────────────────────────────────────────
Traditional warehouse: compute and storage are the same hardware.
To add more storage: buy more nodes (also adds unwanted compute).
To add more compute: buy more nodes (also adds unwanted storage).
Hardware refresh every 3-5 years: millions of dollars, planned outage.
At Airbnb 2011: data growing 3× per year, compute needed for batch jobs.
Adding storage for data growth forced adding compute they did not need.
Cost and complexity scaled together with no ability to separate them.
Data lake answer: S3 for storage (infinite, pay per GB).
Spark cluster for compute (spin up/down, pay per hour).
Scale each independently. Pay only for what you use.
THESE THREE PROBLEMS created the Hadoop + HDFS ecosystem (2009-2014)
and then the cloud object storage + Spark ecosystem (2014-present).
The architectural choices of data lakes are direct responses to these
specific warehouse limitations.

The evolution: Hadoop → Cloud data lake
GENERATION 1: HADOOP + HDFS (2009-2016)
─────────────────────────────────────────────────────────────
Storage: HDFS (Hadoop Distributed File System)
- Runs on commodity hardware in on-premise data centres
- Files replicated 3× across nodes for durability
- Block size: 128 MB — designed for sequential MapReduce reads
Compute: MapReduce → Hive → Spark (2014 onward)
Format: Text files (CSV, JSON), then Parquet and ORC
Problems with Hadoop:
- HDFS and compute tightly coupled (same cluster)
- Operational complexity: ZooKeeper, NameNode HA, YARN tuning
- S3 became cheaper than HDFS by 2015 once 3× replication and operations overhead were counted
- Cluster always running even when no jobs — wasted cost
GENERATION 2: CLOUD OBJECT STORAGE + SPARK (2015-2022)
─────────────────────────────────────────────────────────────
Storage: S3 (AWS), ADLS Gen2 (Azure), GCS (Google Cloud)
- Infinitely scalable, 99.999999999% (11 9s) durability
- $23/TB/month — cheaper than any on-premise storage
- No cluster management, no replication configuration
Compute: Spark on EMR / Databricks / HDInsight (ephemeral clusters)
- Spin up cluster for a job, terminate when done
- No cost when no jobs running
Format: Parquet, Avro, ORC (columnar, compressed, splittable)
Problems with Gen 2:
- No ACID transactions (concurrent writes corrupt partitions)
- No ability to UPDATE or DELETE single rows (only full partition overwrite)
- No schema enforcement — anyone can write anything anywhere
- Metadata catalog (Glue, Hive Metastore) becomes a bottleneck
- "Data swamp": ungoverned lakes become unusable over time
GENERATION 3: OPEN TABLE FORMATS + LAKEHOUSE (2020-present)
─────────────────────────────────────────────────────────────
Storage: S3 / ADLS / GCS (same as Gen 2)
Format: Parquet files + transaction log layer
- Delta Lake (Databricks): _delta_log/ JSON transaction log
- Apache Iceberg (Netflix): metadata tree + manifest files
- Apache Hudi (Uber): timeline-based record-level upserts
Compute: Spark, Trino, Flink, Athena, Snowflake (via external tables)
What open table formats add:
- ACID transactions: concurrent writes without corruption
- Row-level UPDATE and DELETE: not just partition-level operations
- Time travel: query data as of a specific timestamp or version
- Schema evolution: add/rename columns without rewriting data
- Partition evolution: change partition strategy without rewriting
- Optimistic concurrency: multiple writers, serialisable isolation

Data Lake vs Data Warehouse — The Real Technical Differences
The lake vs warehouse comparison is one of the most commonly discussed topics in data engineering interviews and architecture reviews. The differences are not just cost — they are architectural choices about schema management, query patterns, data types, and governance that compound over time.
| Dimension | Data Lake | Data Warehouse |
|---|---|---|
| Schema model | Schema-on-read: structure applied at query time | Schema-on-write: structure enforced at load time |
| Data formats | Any format: CSV, JSON, Parquet, images, logs, audio | Structured tabular data only |
| Storage cost | Very low — S3 at $23/TB/month | Higher — proprietary compressed columnar storage |
| Query speed | Slower — metadata overhead, no native indexing | Faster — columnar indexes, result caching, statistics |
| ACID transactions | No (plain), Yes (with Delta/Iceberg/Hudi) | Yes — native, mature |
| Row-level updates | No (plain), Yes (with Delta/Iceberg MERGE) | Yes — standard UPDATE/DELETE |
| Schema enforcement | Optional (schema-on-read by default) | Mandatory (schema-on-write enforced) |
| Data types | Any — unstructured text, images, ML models | Structured only — numbers, strings, dates |
| Access patterns | Batch analytics, ML training, ad-hoc exploration | BI dashboards, SQL analytics, aggregated reporting |
| Governance | Requires explicit effort — data catalogs, access control | Built-in — column-level permissions, row-level security |
| Scaling model | Decouple: storage and compute scale independently | Coupled or semi-coupled (Snowflake separates more) |
| Best tools 2026 | S3 + Delta Lake + Spark + Databricks | Snowflake, BigQuery, Redshift, ClickHouse |
Zones — How a Data Lake Is Organised Into Layers
A raw dump of files on S3 is not a data lake — it is a data swamp. The difference is organisation. Every mature data lake uses a zone-based architecture that separates data by its quality level, transformation state, and access pattern. The zone model is the foundation of both the Medallion Architecture (covered in Module 30) and every well-governed data lake.
The standard four-zone model
DATA LAKE ZONES (S3 bucket organisation for FreshMart):
s3://freshmart-data-lake/
├── landing/ ← ZONE 1: Landing / Raw Ingestion
│ ├── razorpay/
│ │ └── payments_20260317.json (as received from API)
│ ├── shipfast/
│ │ └── deliveries_week_20260317.csv (as received from vendor SFTP)
│ └── postgres_cdc/
│ └── orders/
│ └── 2026/03/17/ (raw CDC events from Debezium)
│
├── bronze/ ← ZONE 2: Bronze / Raw Standardised
│ ├── payments/
│ │ └── year=2026/month=03/day=17/ (Parquet, Hive-partitioned)
│ ├── deliveries/
│ │ └── year=2026/month=03/week=11/
│ └── orders_cdc/
│ └── date=2026-03-17/
│
├── silver/ ← ZONE 3: Silver / Cleaned and Typed
│ ├── orders/
│ │ └── date=2026-03-17/ (validated, typed, deduped)
│ ├── customers/ (cleaned, PII masked)
│ └── payments/ (reconciled with bank data)
│
└── gold/ ← ZONE 4: Gold / Business-Ready
├── daily_revenue/
│ └── date=2026-03-17/
├── customer_ltv/
│ └── date=2026-03-17/
└── store_performance/
└── date=2026-03-17/
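A small helper that builds keys matching this layout (bucket and entity names are from the FreshMart example; the bronze-style year/month/day partitioning is shown — silver and gold in the tree use a single date= column — and the function itself is a sketch):

```python
from datetime import date

def zone_key(zone: str, entity: str, day: date, filename: str = "") -> str:
    """Build a Hive-partitioned object key for the bronze/silver/gold zones.
    Landing keeps source-native file names, so it is excluded here."""
    assert zone in {"bronze", "silver", "gold"}, "landing uses source-native paths"
    prefix = (f"{zone}/{entity}/"
              f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")
    return prefix + filename

key = zone_key("bronze", "payments", date(2026, 3, 17), "part-001.parquet")
print(key)  # bronze/payments/year=2026/month=03/day=17/part-001.parquet
```

Centralising path construction like this is what keeps a lake from drifting into ad-hoc naming — every pipeline writes through one convention.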
ZONE CHARACTERISTICS:
ZONE 1 — LANDING (Raw Ingestion)
Purpose: Exact copy of data as received from source
Transform: None — file written exactly as received
Format: Whatever the source sends (JSON, CSV, XML)
Retention: 30-90 days (after Bronze conversion, can expire)
Access: Pipeline engineers only
Key rule: IMMUTABLE — never modify, never delete during retention
ZONE 2 — BRONZE (Raw Standardised)
Purpose: Landed data converted to standard format (Parquet)
with minimal transformation — types coerced for load compatibility
Transform: Format conversion, Hive partitioning, add ingestion metadata
Format: Parquet + Snappy/ZSTD compression, Hive-partitioned
Retention: 1-3 years (raw data for reprocessing)
Access: Pipeline engineers, data scientists (raw exploration)
Key rule: APPEND ONLY via CDC/incremental — preserve all history
ZONE 3 — SILVER (Cleaned and Typed)
Purpose: Trusted, clean, validated data for analytical use
Transform: Type casting, deduplication, validation, NULL handling,
normalisation, PII masking, business rule application
Format: Delta Lake / Iceberg (ACID, time travel) partitioned by date
Retention: 2-5 years
Access: Data engineers, analysts, BI tools, ML engineers
Key rule: IDEMPOTENT UPDATES via MERGE — not append-only
ZONE 4 — GOLD (Business-Ready Aggregates)
Purpose: Pre-computed metrics and aggregates for dashboards and APIs
Transform: Aggregation, joining, business metric computation
Format: Delta Lake, typically small (GB not TB)
Retention: Long-term (1+ years)
Access: Analysts, BI tools, downstream APIs
Key rule: REBUILT regularly from Silver — not primary source of truth

Access control by zone
# S3 bucket policy zones — principle of least privilege
# Landing zone: write by ingestion pipelines only, read by bronze conversion
# Bronze zone: write by format conversion, read by silver pipelines and data scientists
# Silver zone: write by transformation pipelines, read by analysts and BI tools
# Gold zone: write by aggregation pipelines, read by everyone
# AWS IAM role structure:
# role/pipeline-ingestion: s3:PutObject on landing/*
# role/pipeline-bronze: s3:GetObject on landing/*, s3:PutObject on bronze/*
# role/pipeline-silver: s3:GetObject on bronze/*, s3:PutObject on silver/*
# role/pipeline-gold: s3:GetObject on silver/*, s3:PutObject on gold/*
# role/analyst: s3:GetObject on silver/*, s3:GetObject on gold/*
# (NO access to landing — raw PII)
# role/data-scientist: s3:GetObject on bronze/*, silver/*, gold/*
# (bronze for raw ML training data)
# CRITICAL: landing and bronze zones contain raw PII (emails, phone numbers)
# Analysts and BI tools must NEVER have direct access to landing/bronze.
# Only masked/anonymised Silver and Gold are accessible to analysts.
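The role matrix above can be captured as data and checked mechanically. This is a sketch using the role and zone names from the example — not a complete IAM policy document — but it shows how the PII invariant can be asserted in a CI check:

```python
# Role -> (action, prefix) pairs mirroring the matrix above.
ROLE_GRANTS = {
    "pipeline-ingestion": [("s3:PutObject", "landing/*")],
    "pipeline-bronze":    [("s3:GetObject", "landing/*"), ("s3:PutObject", "bronze/*")],
    "pipeline-silver":    [("s3:GetObject", "bronze/*"),  ("s3:PutObject", "silver/*")],
    "pipeline-gold":      [("s3:GetObject", "silver/*"),  ("s3:PutObject", "gold/*")],
    "analyst":            [("s3:GetObject", "silver/*"),  ("s3:GetObject", "gold/*")],
    "data-scientist":     [("s3:GetObject", "bronze/*"),  ("s3:GetObject", "silver/*"),
                           ("s3:GetObject", "gold/*")],
}

def can_read(role: str, zone: str) -> bool:
    """True if the role has GetObject on any prefix under the zone."""
    return any(action == "s3:GetObject" and prefix.startswith(zone)
               for action, prefix in ROLE_GRANTS[role])

# The PII invariant: analysts must never read landing or bronze.
assert not can_read("analyst", "landing")
assert not can_read("analyst", "bronze")
print("PII invariant holds")
```

Encoding the matrix once and generating the real IAM policies from it prevents the two from silently drifting apart.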
# Azure ADLS Gen2 equivalent:
# Use POSIX-style ACLs on container directories:
# landing/: execute for pipeline-bronze service principal only
# bronze/: read for data-scientists, execute for pipeline-silver
# silver/: read for analysts, BI service principals
# gold/: read for all authenticated users in the tenant
# Snowflake (for Silver and Gold served via warehouse):
# GRANT SELECT ON SCHEMA silver TO ROLE analyst;
# GRANT SELECT ON SCHEMA gold TO ROLE analyst;
# REVOKE SELECT ON SCHEMA silver.customers_raw FROM ROLE analyst;
# (raw PII tables blocked even within Silver)

Object Storage — The Foundation Everything Else Sits On
Every modern data lake is built on object storage — S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. Understanding how object storage works at a level below the API is what lets a data engineer make correct decisions about file sizes, naming, access patterns, and cost optimisation.
How S3 works internally — what every DE must know
S3 OBJECT MODEL:
Everything is an object. An object has:
- Key: the "path" — "bronze/orders/date=2026-03-17/part-001.parquet"
- Value: the bytes of the file
- Metadata: content-type, user-defined key-value pairs, system metadata
- ETag: MD5 hash of the content (used for integrity checking)
- Version ID: if versioning is enabled
There are no real directories — the key is just a string.
"Directories" are a UI fiction — a key ending in / is a zero-byte object.
s3://bucket/bronze/orders/ is just a prefix filter on keys, not a real dir.
IMPLICATION FOR LISTING:
ListObjectsV2 returns all keys matching a prefix.
s3://bucket/bronze/orders/ with 1,000,000 objects:
→ 1,000 API calls (1,000 objects per page)
→ Athena/Spark must list all keys to find relevant partitions
→ Use Hive-style partitioning to reduce list scope
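The "reduce list scope" point can be made concrete: instead of listing the whole table prefix, an engine lists only the partitions a query's date filter selects. The prefixes below follow the bronze layout used elsewhere in this module; the function is a sketch:

```python
from datetime import date, timedelta

def partition_prefixes(table_prefix: str, start: date, end: date):
    """Yield one Hive-style prefix per day, so ListObjectsV2 calls cover
    only the partitions a date-filtered query can touch."""
    day = start
    while day <= end:
        yield (f"{table_prefix}/year={day.year}/"
               f"month={day.month:02d}/day={day.day:02d}/")
        day += timedelta(days=1)

prefixes = list(partition_prefixes("bronze/orders", date(2026, 3, 16), date(2026, 3, 17)))
print(prefixes)
# ['bronze/orders/year=2026/month=03/day=16/', 'bronze/orders/year=2026/month=03/day=17/']
```

Two list calls instead of a scan over a million keys — this is the practical reason Hive-style partitioning is non-negotiable on object storage.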
S3 CONSISTENCY MODEL (since Dec 2020):
S3 provides strong read-after-write consistency for all operations.
After a PUT: the GET immediately returns the new object. ✓
After a DELETE: the GET immediately returns 404. ✓
This was eventually consistent before Dec 2020 — old code may still
have unnecessary sleep() calls to work around the old model.
S3 PERFORMANCE:
Request rate limits (per prefix):
3,500 PUT/COPY/POST/DELETE requests per second
5,500 GET/HEAD requests per second
Each unique prefix has its own limits:
s3://bucket/a/ and s3://bucket/b/ are SEPARATE rate limits
Use date-partitioning: orders/date=2026-03-17/ and orders/date=2026-03-18/
give each day its own rate limit bucket — effectively unlimited
Multipart upload:
Large objects (AWS recommends multipart for ≥ 100 MB): use multipart upload
(boto3/s3fs switch to it automatically above a size threshold)
Parts uploaded in parallel → fast for large files
Minimum part size: 5 MB (except last part)
Maximum object size: 5 TB
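A quick calculator for how a large object splits into parts under those constraints. The 100 MB default part size here is an arbitrary choice for illustration; the 5 MB minimum, 5 TB ceiling, and 10,000-part limit are S3's:

```python
MIN_PART = 5 * 1024**2      # 5 MB minimum for every part except the last
MAX_OBJECT = 5 * 1024**4    # 5 TB object ceiling
MAX_PARTS = 10_000          # S3 caps a multipart upload at 10,000 parts

def multipart_plan(object_size: int, part_size: int = 100 * 1024**2) -> int:
    """Return the number of parts a multipart upload would use."""
    if object_size > MAX_OBJECT:
        raise ValueError("object exceeds the 5 TB S3 limit")
    if part_size < MIN_PART:
        raise ValueError("parts (except the last) must be at least 5 MB")
    parts = -(-object_size // part_size)   # ceiling division
    if parts > MAX_PARTS:
        raise ValueError("increase part_size: more than 10,000 parts")
    return parts

print(multipart_plan(1 * 1024**3))   # 1 GiB at 100 MiB parts -> 11 parts
```

The 10,000-part cap is why very large objects force larger parts: a 5 TB upload needs parts of at least ~525 MB.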
S3 Transfer Acceleration:
Routes uploads through CloudFront edge nodes
Use for: uploading from mobile devices or geographically distant sources
S3 STORAGE CLASSES AND COST (US East, 2026 approximate):
Standard: $0.023/GB/month — frequently accessed
Standard-IA: $0.0125/GB/month — infrequently accessed, instant retrieval
Glacier Instant: $0.004/GB/month — archive, instant retrieval
Glacier Flexible: $0.0036/GB/month — archive, 3-5 hour retrieval
Glacier Deep Archive: $0.00099/GB/month — coldest archive, 12+ hours
Lifecycle policy automatically transitions objects between classes.
Apply per prefix: e.g. landing/ transitions to IA after 7 days, is deleted after 30.
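The class prices above translate into a simple monthly-cost comparison (prices are the approximate 2026 figures quoted above; decimal TB assumed):

```python
# $/GB/month, from the storage-class table above (US East, approximate)
PRICE = {
    "standard":         0.023,
    "standard_ia":      0.0125,
    "glacier_instant":  0.004,
    "glacier_flexible": 0.0036,
    "glacier_deep":     0.00099,
}

def monthly_cost(tb: float, storage_class: str) -> float:
    """Monthly storage cost in dollars for `tb` terabytes in one class."""
    return round(tb * 1000 * PRICE[storage_class], 2)

# 100 TB of bronze data, before and after a lifecycle transition:
print(monthly_cost(100, "standard"))     # 2300.0
print(monthly_cost(100, "standard_ia"))  # 1250.0
```

Run across a real bucket inventory, a calculation like this is usually the first argument for turning lifecycle policies on.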
S3 NAMING BEST PRACTICES:
Use lowercase: S3 is case-sensitive but tools treat case inconsistently
Use hyphens not underscores in bucket names: freshmart-data-lake
Keep object keys to one consistent convention — Hive-style partition keys like date=2026-03-17 are the de facto standard
Avoid special characters: only a-z, 0-9, /, -, _, .
Never put credentials or PII in bucket names or object keys
(keys appear in logs, access logs, CloudTrail)

ADLS Gen2 — how it differs from S3
ADLS GEN2 vs S3:
HIERARCHICAL NAMESPACE (HNS):
ADLS Gen2 supports a true hierarchical namespace — directories are real,
not just key prefixes. This has performance implications:
- Rename a directory: atomic O(1) operation
S3: must copy all objects to new prefix, delete originals — O(n) time
- Delete a directory: O(1) in ADLS, O(n) in S3
- Directory listing: backed by real directory metadata in ADLS; in S3 it is a prefix scan over all matching keys
For Spark jobs that rename output directories (common in Hadoop code):
ADLS is significantly faster than S3 for this operation.
ACCESS CONTROL:
ADLS supports POSIX-style ACLs on files and directories:
- Owner: full rwx
- Owning group: configurable rwx
- Other: configurable rwx
- Named user/group ACLs (like Linux setfacl)
This is finer-grained than S3's bucket policy + IAM combination.
Azure Active Directory integration: assign ADLS ACLs to AAD groups.
PROTOCOL:
ADLS Gen2 supports both:
- Blob storage API (https://account.blob.core.windows.net/container/blob)
- DFS API (https://account.dfs.core.windows.net/filesystem/path)
Use DFS API for hierarchical operations (rename, directory delete).
Use Blob API for compatibility with older tools that do not know DFS.
In Spark: spark.read.parquet('abfss://container@account.dfs.core.windows.net/path')
COST (2026 approximate):
LRS (Locally Redundant): $0.018/GB/month (similar to S3 Standard)
ZRS (Zone Redundant): $0.023/GB/month
GRS (Geo Redundant): $0.036/GB/month
AUTHENTICATION:
Service Principal (recommended for pipelines):
- Client ID + Client Secret → OAuth 2.0 token
- Assign Storage Blob Data Contributor role to service principal
Managed Identity (best for Azure VMs / AKS):
- No credentials needed — identity from the compute resource
Shared Access Signature (SAS):
- Time-limited URL with specific permissions
- Use for temporary external access
GCS (Google Cloud Storage) is similar to S3:
- Strongly consistent for reads, writes, and listings (as S3 has been since 2020)
- Similar object model, bucket-based
- gs:// URI scheme
- Strong integration with BigQuery (native external tables)

Compute-Storage Separation — Why It Changes Everything
The separation of compute from storage is the architectural decision that makes cloud data lakes economically viable and operationally flexible. It sounds like an infrastructure detail but has profound consequences for how a data platform is designed, scaled, and costed.
TRADITIONAL WAREHOUSE (coupled compute and storage):
─────────────────────────────────────────────────────────────────────
Data stored ON the same machines that query it.
Teradata: storage and CPU on the same proprietary blades.
Redshift (original): storage and compute on the same EC2 instances.
IMPLICATIONS:
To add 10 TB storage: buy 10 TB of nodes (also adds ~10× compute).
Peak CPU needs (end-of-month reports): must always have peak nodes running.
Off-peak (2 AM): same nodes idle, same cost.
Hardware utilisation: typically 10-30% of capacity.
Data growth → hardware growth → capital expense every 3-5 years.
CLOUD DATA LAKE (decoupled compute and storage):
─────────────────────────────────────────────────────────────────────
Data stored on S3 (storage-only, no compute).
Query engines (Spark, Athena, Presto) are separate compute resources.
IMPLICATIONS:
To add 10 TB storage: write more data to S3. Compute unchanged.
Peak CPU needs: spin up a large Spark cluster for 4 hours. Terminate after.
Off-peak: S3 still holds data, zero compute running. Zero compute cost.
Data growth → storage cost growth. Compute cost tied to query volume only.
COST COMPARISON FOR A 100 TB DATA PLATFORM:
Storage: 100 TB × $23/TB/month = $2,300/month always-on
Compute: 20 Spark jobs/day × 10 r5.4xlarge × $1/hr × 1 hr = $200/day
Total: $2,300 + ($200 × 30) = $8,300/month
Traditional warehouse equivalent (Redshift):
dc2.8xlarge (16 nodes for 100 TB): $52,000/month
→ Cloud lake is ~6× cheaper for the same data volume
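The comparison above as a sanity-check script (same assumed prices and job counts as in the figures):

```python
# Decoupled lake: storage is always-on, compute runs only while jobs do.
storage_monthly = 100 * 23                 # 100 TB x $23/TB/month
compute_daily = 20 * 10 * 1.0 * 1          # 20 jobs x 10 nodes x $1/hr x 1 hr
lake_monthly = storage_monthly + compute_daily * 30

warehouse_monthly = 52_000                 # the 16-node Redshift figure above

print(lake_monthly)                                 # 8300.0
print(round(warehouse_monthly / lake_monthly, 1))   # ~6.3x cheaper
```

The ratio moves with query volume: double the Spark jobs and the lake is still ~3.6× cheaper, because only the compute term grows.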
THE TRADE-OFF:
Coupled: faster queries (data near compute, no network hop)
Decoupled: more flexible, cheaper, scales each dimension independently
Modern cloud warehouses (Snowflake, BigQuery) use a hybrid approach:
- Data stored in cloud object storage (S3-equivalent)
- Compute clusters (virtual warehouses) read from storage
- Result cache and local disk cache reduce the network penalty
- Effectively decoupled but with intelligent caching to preserve speed
SPARK ON S3 — THE NETWORK PERFORMANCE GAP:
Spark reading 1 TB from local HDFS: ~100 GB/s aggregate throughput
Spark reading 1 TB from S3: ~10 GB/s per cluster (network bottleneck)
→ S3 reads are ~10× slower than HDFS for the same cluster size
MITIGATION:
1. Use Parquet with predicate pushdown — read only needed columns/rows
2. Use columnar formats — minimise bytes transferred per query
3. Partition data — read only relevant partitions
4. Use Delta Lake data skipping — skip files with no matching rows
5. S3 Express One Zone (new, 2023): 10× lower latency than standard S3
6. Use Snowflake/BigQuery for interactive queries (they cache aggressively)
Use Spark for batch processing where latency is acceptable

Governance — How Data Lakes Become Data Swamps
The data swamp problem is universal: every organisation that builds a data lake without a governance strategy eventually ends up with petabytes of unorganised files where nobody knows what data is in them, who owns it, whether it is trustworthy, or how to query it. The lake degrades from a strategic asset to an expensive dump of uncertain quality.
Governance is not a single tool or policy — it is the ongoing practice of maintaining the discoverability, quality, lineage, and access control of data in the lake. Data lakes require explicit governance investment that data warehouses handle implicitly through schema enforcement.
The four governance pillars
- Data Catalog: what data exists and what it means
- Data Lineage: where data came from and what it feeds
- Data Quality: automated tests on every pipeline run
- Access Control: zone-based permissions with PII protection
The Lakehouse — Converging Lakes and Warehouses
The lakehouse is an architectural pattern that adds warehouse-quality features (ACID transactions, schema enforcement, row-level updates, time travel) directly to the data lake storage layer using open table formats. It is not a new type of system — it is a new set of capabilities layered on top of existing object storage.
The three open table formats — Delta Lake, Apache Iceberg, and Apache Hudi — each implement the lakehouse pattern differently, but they all solve the same core problems with plain object storage: no transactions, no row-level updates, and no reliable schema enforcement.
Delta Lake — the most widely adopted open table format
DELTA LAKE TABLE STRUCTURE:
s3://freshmart-lake/silver/orders/
├── _delta_log/
│ ├── 00000000000000000000.json ← commit 0 (table creation)
│ ├── 00000000000000000001.json ← commit 1 (first write)
│ ├── 00000000000000000042.json ← commit 42 (latest)
│ └── 00000000000000000010.checkpoint.parquet ← periodic checkpoint
├── date=2026-03-15/
│ └── part-00001.parquet
├── date=2026-03-16/
│ └── part-00001.parquet
└── date=2026-03-17/
├── part-00001.parquet ← written in commit 41
└── part-00002.parquet ← written in commit 42
TRANSACTION LOG ENTRY (00000000000000000042.json):
{
"commitInfo": {
"timestamp": 1710706472847,
"operation": "MERGE",
"operationParameters": {"predicate": "target.order_id = source.order_id"}
}
}
{
"add": {
"path": "date=2026-03-17/part-00002.parquet",
"partitionValues": {"date": "2026-03-17"},
"size": 52428800,
"stats": {"numRecords": 50000, "minValues": {"order_id": 9200000},
"maxValues": {"order_id": 9250000}}
}
}
{
"remove": {
"path": "date=2026-03-17/part-00001.parquet",
"deletionTimestamp": 1710706472000
}
}
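The add/remove entries above are what make reads consistent: a reader replays the log and sees only the files the latest commit references. A minimal replay in Python — the commit contents mirror the example, but this is a sketch of the idea, not the Delta protocol:

```python
# Each commit is a list of actions; replaying them yields the live file set.
commits = [
    [{"add": {"path": "date=2026-03-17/part-00001.parquet"}}],             # commit 41
    [{"add": {"path": "date=2026-03-17/part-00002.parquet"}},              # commit 42
     {"remove": {"path": "date=2026-03-17/part-00001.parquet"}}],
]

def live_files(log, as_of=None):
    """Replay commits up to version `as_of` (inclusive); None means latest."""
    files = set()
    for version, actions in enumerate(log, start=41):
        if as_of is not None and version > as_of:
            break
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            if "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(live_files(commits))            # only part-00002 is visible at latest
print(live_files(commits, as_of=41))  # time travel: part-00001 at version 41
```

Note that part-00001 still physically exists on S3 after commit 42 — it is merely unreferenced, which is exactly what time travel reads and what VACUUM later deletes.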
HOW DELTA ACHIEVES ACID:
Atomicity: a write either commits a new JSON log entry (visible)
or does not (Parquet files written but not referenced)
Consistency: schema validation at commit time
Isolation: optimistic concurrency — concurrent writes detected via
log entry conflicts, one writer succeeds, other retries
Durability: S3 11-nines durability — once PUT, object is durable
TIME TRAVEL:
Every commit in the log represents a version of the table.
Read version 40 (before commit 41 and 42):
spark.read.format('delta').option('versionAsOf', 40).load(path)
spark.read.format('delta').option('timestampAsOf', '2026-03-16').load(path)
Useful for: debugging, auditing, recovering from bad writes
Retention: the commit log is kept 30 days by default (configurable), but
time travel also needs the version's data files, which VACUUM
removes after 7 days by default
DELTA OPERATIONS:
OPTIMIZE: compact small files into larger ones
VACUUM: remove files no longer referenced (respects retention)
Z-ORDER: co-locate related data by a column (improves skip rate)
RESTORE: revert table to a previous version
-- Z-ORDER example (Databricks / Spark):
OPTIMIZE silver.orders ZORDER BY (store_id, order_date);
-- Co-locates rows with same store_id and order_date in same files
-- Queries filtering by store_id + order_date skip ~80% of files

Delta Lake vs Iceberg vs Hudi — choosing between the three
| Dimension | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Origin | Databricks (2019) | Netflix (2018) | Uber (2017) |
| Primary strength | Spark integration, DML, time travel | Engine-agnostic design, partition evolution | Upserts and incremental processing |
| ACID support | Full ACID | Full ACID (Iceberg v2) | Full ACID |
| Row-level updates | MERGE INTO, UPDATE, DELETE | MERGE INTO, UPDATE, DELETE (v2) | Native — designed for record-level upserts |
| Partition evolution | Requires data rewrite | Native — no data rewrite needed | Limited |
| Engine support | Spark (best), Flink, Trino (Delta Lake connector) | Spark, Flink, Trino, Athena, BigQuery — widest | Spark (best), Flink, Hive, Presto |
| Cloud native | Databricks native, S3 / ADLS / GCS | All cloud storage — most portable | S3, ADLS, GCS |
| Best for | Databricks-centric stack, strong Spark use | Multi-engine, multi-cloud portability | High-velocity CDC ingestion |
Designing FreshMart's Data Lake From Scratch
FreshMart is migrating from an on-premise Hadoop cluster to a cloud data lake on AWS. You are asked to design the architecture. Here is the complete design decision process a senior data engineer follows.
DECISION 1: Storage choice
Options: S3 Standard vs ADLS Gen2 vs GCS
FreshMart is AWS-native (EC2, RDS, Lambda already on AWS).
Decision: S3 — zero integration friction with existing AWS services.
Bucket structure: one bucket per environment (dev/staging/prod).
prod: s3://freshmart-data-lake-prod/ (versioning on, default encryption)
DECISION 2: Zone naming and organisation
landing/ bronze/ silver/ gold/ — clear, industry-standard names
Within each zone: {entity}/{partition_cols}/
Partition by date for all time-series data, avoid high-cardinality columns.
DECISION 3: File format
Landing: as-received (JSON from APIs, CSV from vendors)
Bronze: Parquet + ZSTD, Hive-partitioned by date
Silver: Delta Lake (ACID, time travel, row-level updates needed for CDC)
Gold: Delta Lake (small tables, need MERGE semantics from dbt incremental)
DECISION 4: Compute engine
Batch transformation: Spark on Databricks (existing team expertise)
Interactive queries: Athena (analyst self-service, serverless, no clusters)
dbt (Silver→Gold transformations): runs against Databricks SQL Warehouse
ML training: Spark on Databricks + SageMaker for model training
DECISION 5: Catalog
AWS Glue Data Catalog — integrated with S3, Athena, EMR, Glue ETL.
Landing and Bronze tables registered manually by pipelines.
Silver and Gold tables: dbt generates schema + docs → synced to Glue.
DataHub deployed for business-facing catalog (lineage, ownership).
DECISION 6: Access control
landing/: only ingestion Lambda + Spark ingestion roles
bronze/: ingestion roles + data engineers + data scientists
silver/: transformation roles + analysts (read-only) + Athena
gold/: transformation roles + analysts + Metabase BI service account
PII data: blocked at Silver — analysts never see raw emails/phones.
DECISION 7: Cost optimisation
Lifecycle policies:
landing/ → Standard-IA after 7 days → delete after 30 days
bronze/ → Standard-IA after 90 days → Glacier after 365 days
silver/ → Standard-IA after 180 days → Glacier after 730 days
gold/ → no lifecycle (small, frequently queried)
Delta VACUUM: weekly, retain 30 days
Delta OPTIMIZE: daily, compact files > 7 days old
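Those lifecycle decisions map directly onto an S3 lifecycle configuration. Below is a sketch of the rules as the dict that boto3's put_bucket_lifecycle_configuration expects (prefixes and retention periods from Decision 7; verify field names against the AWS documentation before relying on it):

```python
# Lifecycle rules mirroring Decision 7 (cost optimisation).
LIFECYCLE = {
    "Rules": [
        {
            "ID": "landing-expire",
            "Filter": {"Prefix": "landing/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 7, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": 30},
        },
        {
            "ID": "bronze-tiering",
            "Filter": {"Prefix": "bronze/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        },
        {
            "ID": "silver-tiering",
            "Filter": {"Prefix": "silver/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 180, "StorageClass": "STANDARD_IA"},
                {"Days": 730, "StorageClass": "GLACIER"},
            ],
        },
        # gold/ intentionally has no rule: small and frequently queried.
    ]
}

# Applied with boto3 (requires AWS credentials; shown for shape only):
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="freshmart-data-lake-prod", LifecycleConfiguration=LIFECYCLE)
```

Keeping the configuration in version control next to the pipeline code means the retention decisions in this design stay auditable.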
DECISION 8: Governance baseline (phase 1)
Data catalog: required for all new Silver/Gold tables before merge
dbt tests: not_null + unique on all primary keys, freshness checks
Row count anomaly detection: alert if count deviates > 20% from 7-day avg
Access audit: CloudTrail → Athena table → weekly access report
RESULTING ARCHITECTURE:
Sources → Kafka → Debezium → Kafka topics
→ Lambda (API polling) → s3://freshmart-data-lake-prod/landing/
landing/ → Spark (Databricks) → s3://freshmart-data-lake-prod/bronze/
(format conversion, minimal typing, Hive partitioning)
bronze/ → dbt on Databricks SQL → s3://freshmart-data-lake-prod/silver/
(Delta Lake, ACID, typed, validated, PII-masked)
silver/ → dbt on Databricks SQL → s3://freshmart-data-lake-prod/gold/
(Delta Lake, pre-computed metrics, business-ready)
gold/ → Athena (self-service SQL) + Metabase (dashboards)
🎯 Key Takeaways
- ✓ A data lake stores data in native raw format at any scale, with compute separated from storage and schema applied at query time. The four defining properties: any data format, any scale (object storage), schema-on-read, and compute-storage separation.
- ✓ Data lakes were invented to solve three warehouse limitations: warehouse storage was too expensive for large volumes of unstructured data, schema-on-write made adding new data types slow and rigid, and coupled compute-storage forced hardware scaling in both dimensions simultaneously.
- ✓ Zone-based organisation prevents data swamps. Four zones: Landing (raw as-received, immutable), Bronze (format-converted to Parquet, minimal typing), Silver (cleaned, validated, ACID via Delta Lake), Gold (pre-computed business aggregates). Each zone has distinct access controls, retention policies, and transformation semantics.
- ✓ Object storage (S3) has no real directories — keys are just strings with prefixes. S3 provides strong read-after-write consistency since 2020. Rate limits are per prefix, so date-partitioned paths effectively have unlimited throughput. Lifecycle policies automate tier transitions for cost optimisation.
- ✓ Compute-storage separation means storage (S3) and compute (Spark, Athena) scale independently. The lake is 6–10× cheaper than a warehouse for equivalent data volume. The trade-off: S3 reads are ~10× slower than HDFS reads — mitigated by Parquet predicate pushdown, columnar formats, and partition pruning.
- ✓ The data swamp problem is universal without explicit governance. The four governance pillars: Data Catalog (what data exists and what it means), Data Lineage (where data came from), Data Quality (automated tests on every pipeline run), Access Control (zone-based permissions with PII protection).
- ✓ Delta Lake adds ACID transactions, row-level updates (MERGE/UPDATE/DELETE), time travel, and schema enforcement to plain Parquet on S3 via a transaction log (_delta_log/). The log makes multi-file writes atomic — files are invisible until the log entry commits.
- ✓ The lakehouse converges lake storage costs with warehouse query capabilities. Delta Lake (Databricks-native), Apache Iceberg (widest engine support — best for portability), and Apache Hudi (optimised for high-velocity CDC upserts) are the three open table formats. Iceberg is the most portable choice for new projects in 2026.
- ✓ Schema-on-read gives write flexibility but requires explicit quality enforcement downstream. Use schema-on-read in landing/bronze (raw data preserved), schema-on-write semantics in silver (Delta enforces schema at MERGE). Never leave Silver as an ungoverned schema-on-read zone.
- ✓ VACUUM must never run with retention below 7 days (168 hours). The 7-day minimum protects concurrent readers and time travel users. Run VACUUM weekly with 168-hour retention. OPTIMIZE (compaction) should run daily on recently written partitions. Z-ORDER on the most common filter columns reduces data skipping overhead.