Lakehouse Architecture
How the lakehouse converges lake and warehouse, open table format mechanics, ACID on object storage, Unity Catalog, Iceberg in practice, and when to choose it.
Why Two Systems Were Worse Than One
Before the lakehouse, every serious data platform maintained two separate systems: a data lake for raw storage, ML training data, and large-scale batch processing, and a data warehouse for structured SQL analytics served to BI tools. This two-system architecture was expensive, inconsistent, and operationally complex in ways that compounded over time.
The lakehouse is the architectural answer: a single storage layer with open table formats that adds warehouse-quality features — ACID transactions, schema enforcement, row-level updates, time travel — directly to the lake. One system. One copy of the data. Every engine that supports the open format can query it.
What the Lakehouse Architecture Looks Like
A lakehouse is not a product — it is an architectural pattern. It consists of three components: cheap, durable object storage at the bottom; an open table format layer that adds ACID semantics and metadata management to the files on that storage; and multiple query engines on top that all speak the same table format protocol.
LAKEHOUSE ARCHITECTURE (FreshMart example):
┌─────────────────────────────────────────────────────────────────┐
│ QUERY / COMPUTE LAYER │
│ │
│ Spark (batch + streaming) Databricks SQL Warehouse │
│ Trino / Athena (ad hoc) dbt (transformations) │
│ Flink (streaming) TensorFlow / PyTorch (ML) │
│ │
│ All engines speak the SAME table format protocol │
└──────────────────────────┬──────────────────────────────────────┘
│ reads/writes through table format API
┌──────────────────────────▼──────────────────────────────────────┐
│ OPEN TABLE FORMAT LAYER │
│ │
│ Delta Lake (Databricks) │ Apache Iceberg │ Apache Hudi │
│ │
│ Provides: │
│ ✓ ACID transactions ✓ Time travel │
│ ✓ Schema enforcement ✓ Row-level DELETE/UPDATE │
│ ✓ Partition evolution ✓ Data skipping / pruning │
│ ✓ Concurrent writer safety ✓ Metadata management │
│ │
│ Implemented as: transaction log + Parquet data files │
└──────────────────────────┬──────────────────────────────────────┘
│ raw files on object storage
┌──────────────────────────▼──────────────────────────────────────┐
│ OBJECT STORAGE LAYER │
│ │
│ AWS S3 │ Azure ADLS Gen2 │ Google Cloud Storage │
│ │
│ - Petabyte-scale, cheap ($23/TB/month) │
│ - 11 nines durability │
│ - All data stored as open-format Parquet files │
│ - No vendor lock-in at storage level │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ GOVERNANCE LAYER (cross-cutting) │
│ │
│ Unity Catalog (Databricks) │ Apache Polaris (open source) │
│ AWS Glue Data Catalog │ Nessie (open source) │
│ │
│ - Table discovery and registration │
│ - Column-level access control │
│ - Data lineage across all engines │
│ - Audit logging │
└─────────────────────────────────────────────────────────────────┘
KEY INSIGHT: every layer is replaceable.
Replace Spark with Trino: same data, same governance, same ACID.
Replace Delta Lake with Iceberg: change metadata format, same Parquet data.
Replace S3 with ADLS: Iceberg adapts its path format.
This composability is what the lakehouse pattern uniquely provides.
Lake vs warehouse vs lakehouse — the three patterns today
| Property | Plain Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage cost | Very low (S3) | Higher (managed) | Very low (S3) |
| ACID transactions | No | Yes | Yes (via open table format) |
| Row-level updates | No (partition overwrite) | Yes | Yes (MERGE, UPDATE, DELETE) |
| SQL analytics | Limited (Presto/Athena) | Excellent | Good (Databricks SQL, Trino) |
| ML training | Excellent (Spark, Python) | Poor (no Spark native) | Excellent (Spark reads same tables) |
| Streaming ingestion | Yes (Spark Streaming) | Limited | Yes (Spark Streaming → same tables) |
| Schema enforcement | No (schema-on-read) | Yes (schema-on-write) | Yes (enforced at commit time) |
| Time travel | No (plain Parquet) | Limited | Yes (transaction log versions) |
| Open format | Yes (Parquet) | No (proprietary) | Yes (Parquet + open table format) |
| Vendor lock-in | Low | High (Snowflake-specific SQL) | Low (open standards) |
| Data duplication | Single copy (no warehouse) | Often double (lake + warehouse) | Single copy (no warehouse needed) |
| Governance maturity | Low (DIY) | High (built-in) | Medium–High (Unity Catalog, Polaris) |
How Open Table Formats Implement ACID on Object Storage
S3 is not a database. It has no transaction coordinator, no locking, no concept of "uncommitted writes." Two concurrent writers can overwrite each other's files silently. Object storage's atomicity guarantee is only at the level of a single object PUT. Getting ACID semantics on top of this requires careful protocol design — which is exactly what Delta Lake, Iceberg, and Hudi implement.
The transaction log as the source of ACID
DELTA LAKE ACID IMPLEMENTATION:
The transaction log (_delta_log/) is the source of truth for table state.
Parquet data files are just bytes — they have no meaning without the log.
The log determines which files are part of the current table version.
ATOMICITY (all or nothing):
Writer's steps:
1. Write new Parquet files to the table directory (side effect-free)
Files written: part-00001.parquet, part-00002.parquet
These files EXIST on S3 but are INVISIBLE to readers — not in the log yet
2. Write a new commit entry to _delta_log/000...042.json
This is a SINGLE S3 PUT — an atomic operation
If step 1 fails: no log entry written → files invisible → table unchanged
If step 2 fails: log entry not written → files invisible → table unchanged
If step 2 succeeds: both files visible simultaneously → atomic commit ✓
CONSISTENCY (schema enforced at every commit):
Before writing log entry, Delta checks:
- New data schema is compatible with table schema
- Column types match (or schema evolution is explicitly allowed)
- Required columns are not missing
If schema check fails: log entry rejected → no data written → consistent ✓
ISOLATION (concurrent writers do not corrupt each other):
Delta uses optimistic concurrency control:
Writer A reads current table version: v41
Writer B reads current table version: v41
Writer A writes Parquet files, attempts to commit log entry v42
Writer B writes Parquet files, attempts to commit log entry v42
→ Put-if-absent on the log entry: only one writer can succeed
(S3 gained conditional PUT via If-None-Match in 2024; on stores
without it, Delta falls back to atomic rename or a coordination
service such as the DynamoDB-backed LogStore)
Writer A succeeds → table is now v42
Writer B detects conflict (log entry v42 already exists):
→ Checks if conflict is real (did A touch the same data?)
→ If appending non-overlapping partitions: REBASE and commit as v43
→ If touching overlapping partitions: ABORT and retry
Result: only valid committed states are visible → isolation ✓
DURABILITY (committed data survives failures):
S3 has 99.999999999% (11 nines) durability for stored objects.
Once the log entry is committed (written to S3 successfully):
The data is durable — S3's redundancy protects it.
Log entries are immutable — once written, never modified.
Recovery after failure: read the log from the beginning (or last checkpoint)
to reconstruct the current table state ✓
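The atomicity and isolation steps above can be condensed into a runnable sketch. `FakeObjectStore`, `commit`, and the file names are illustrative stand-ins, not real Delta Lake APIs; the only primitive assumed is an atomic put-if-absent on the log key:

```python
# Sketch of Delta-style optimistic concurrency on an object store.
# The only storage primitive assumed is an atomic "create if absent" PUT.

class FakeObjectStore:
    """Dict-backed stand-in for S3: put_if_absent models the atomic commit."""
    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key, value):
        # Atomic: succeeds only if no object exists at this key yet.
        if key in self.objects:
            return False
        self.objects[key] = value
        return True

def commit(store, version, added_files):
    """Try to commit a new table version; returns True on success."""
    log_key = f"_delta_log/{version:020d}.json"
    return store.put_if_absent(log_key, {"add": added_files})

store = FakeObjectStore()

# Writers A and B both read version 41 and race to commit version 42:
a_ok = commit(store, 42, ["part-00001.parquet"])
b_ok = commit(store, 42, ["part-00002.parquet"])
assert a_ok and not b_ok   # exactly one writer wins version 42

# The loser detects the conflict, checks for data overlap, and retries:
assert commit(store, 43, ["part-00002.parquet"])
```

The losing writer's data files are already on S3 but harmless: until a log entry references them, no reader will ever see them.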
ICEBERG'S APPROACH (slightly different mechanism):
Instead of a sequential JSON log, Iceberg uses a tree of metadata files:
metadata/
v1.metadata.json ← snapshot list, schema history, partition spec
snap-001-manifest-list.avro ← list of manifest files for this snapshot
manifests/
manifest-001.avro ← list of data files and their statistics
data/
part-00001.parquet, ...
COMMIT PROTOCOL:
1. Write new data files (Parquet) — invisible until committed
2. Write new manifest file listing new data files
3. Write new manifest list referencing new manifest
4. Atomically swap the metadata file pointer:
v1.metadata.json → v2.metadata.json
(catalog-level atomic pointer swap — varies by catalog implementation)
5. If pointer swap succeeds: new snapshot is current state ✓
KEY DIFFERENCE FROM DELTA:
Iceberg uses a tree of metadata objects (not a sequential log)
This enables better performance for very large tables (millions of files)
— reading the manifest list is O(1) rather than scanning the full log
Row-level deletes and updates — how MERGE works
PARQUET FILES ARE IMMUTABLE — you cannot modify bytes inside them.
UPDATE and DELETE work by writing NEW files, not modifying existing ones.
COPY-ON-WRITE (CoW) — Delta Lake and Iceberg default for updates:
UPDATE silver.orders SET status = 'delivered'
WHERE order_id = 9284751 AND order_date = '2026-03-17';
Step 1: Read micro-partition containing order_id 9284751
(file: date=2026-03-17/part-00042.parquet)
Step 2: Apply update in memory:
row where order_id=9284751: status changed to 'delivered'
Step 3: Write ENTIRE FILE with the update applied:
new file: date=2026-03-17/part-00043.parquet (full partition)
Step 4: Commit new log entry:
REMOVE: date=2026-03-17/part-00042.parquet
ADD: date=2026-03-17/part-00043.parquet
RESULT: The new snapshot shows part-00043 (updated), not part-00042 (old).
Old file is still on S3 until VACUUM removes it (supports time travel).
CoW write amplification:
1 row updated in a 128 MB file → 128 MB rewritten
For high-update-rate tables: CoW is expensive at write time
Trade-off: reads are fast (no merge needed — just read the file)
MERGE-ON-READ (MoR) — Hudi default, Iceberg option:
Instead of rewriting the full file on every update:
Write a small "delete file" recording which rows are deleted
Write a small "position delete" or "equality delete" file
New data written as new small files
On READ: engine merges base files + delete files → current state
MoR write cost: cheap (write small delta files only)
MoR read cost: more expensive (must merge delete files on every read)
Use when: high write velocity, low read frequency (e.g. CDC ingestion)
Use CoW when: high read frequency, moderate write rate (e.g. analytics tables)
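The CoW/MoR trade-off can be illustrated with a toy model — plain Python lists standing in for Parquet files; `cow_delete`, `mor_delete`, and `mor_read` are invented names, not table-format APIs:

```python
# Toy contrast between copy-on-write and merge-on-read deletes.
# Rows are dicts; file boundaries are ignored for brevity.

def cow_delete(base_rows, predicate):
    """Copy-on-write: rewrite the whole file without the deleted rows.
    Write cost = the full rewritten file; reads need no merging."""
    return [r for r in base_rows if not predicate(r)]

def mor_delete(delete_log, predicate, base_rows):
    """Merge-on-read write path: only record deleted row ids
    (a small delete file), leaving the base file untouched."""
    delete_log.extend(r["id"] for r in base_rows if predicate(r))

def mor_read(base_rows, delete_log):
    """Merge-on-read pays at read time: filter out deleted ids."""
    dead = set(delete_log)
    return [r for r in base_rows if r["id"] not in dead]

rows = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]

# CoW: one deleted row still costs a full-file rewrite:
assert cow_delete(rows, lambda r: r["id"] == 1) == [{"id": 2, "status": "new"}]

# MoR: cheap write (one id recorded), merge happens on every read:
deletes = []
mor_delete(deletes, lambda r: r["id"] == 1, rows)
assert deletes == [1]
assert mor_read(rows, deletes) == [{"id": 2, "status": "new"}]
```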
MERGE INTO — the SQL syntax for upserts:
-- Snowflake / Databricks / BigQuery equivalent:
MERGE INTO silver.orders AS target
USING (
SELECT order_id, status, amount, updated_at
FROM bronze.orders_cdc
WHERE _bronze_date = '2026-03-17'
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND target.updated_at < source.updated_at
THEN UPDATE SET
status = source.status,
amount = source.amount,
updated_at = source.updated_at
WHEN NOT MATCHED
THEN INSERT (order_id, status, amount, updated_at)
VALUES (source.order_id, source.status, source.amount, source.updated_at);
-- Delta Lake executes this as:
-- 1. Hash join target and source on order_id
-- 2. For matched rows where condition is true: mark old file for removal
-- Write updated rows to new file (CoW)
-- 3. For unmatched rows: write new rows to new file
-- 4. Commit: REMOVE old files, ADD new files in one atomic log entry
Time Travel — Querying Historical Table Versions
Time travel is the ability to query a table as it existed at a previous point in time or at a specific transaction version. It is one of the most practically valuable features of the lakehouse — both for debugging ("what did the data look like before that pipeline bug?") and for regulatory compliance ("prove what we reported to the regulator on March 17").
DELTA LAKE TIME TRAVEL:
-- Query table at a specific version number:
SELECT * FROM silver.orders VERSION AS OF 41;
-- Returns the table as it was after commit 41
-- Query table at a specific timestamp:
SELECT * FROM silver.orders TIMESTAMP AS OF '2026-03-16 23:59:59';
-- Returns the table state at that exact moment in time
-- Using Spark API:
df = (spark.read.format("delta")
    .option("versionAsOf", 41)
    .load("s3://freshmart-lake/silver/orders"))

df = (spark.read.format("delta")
    .option("timestampAsOf", "2026-03-16 23:59:59")
    .load("s3://freshmart-lake/silver/orders"))
-- View table history:
DESCRIBE HISTORY silver.orders;
-- Returns: version, timestamp, userId, operation, operationParameters, ...
-- version 42: MERGE (2026-03-17 06:14:32) — 48,234 rows merged
-- version 41: MERGE (2026-03-16 06:11:47) — 47,892 rows merged
-- ...
-- version 0: CREATE TABLE (2026-01-01 00:00:00)
-- Restore table to a previous version:
RESTORE TABLE silver.orders TO VERSION AS OF 41;
-- or:
RESTORE TABLE silver.orders TO TIMESTAMP AS OF '2026-03-16 23:59:59';
-- Creates a new commit (v43) that points to the same files as v41
-- v41 and v42 still exist in history — restore is non-destructive
ICEBERG TIME TRAVEL:
-- Query at snapshot ID:
SELECT * FROM silver.orders FOR SYSTEM_VERSION AS OF 5765671814693002000;
-- Query at timestamp:
SELECT * FROM silver.orders FOR SYSTEM_TIME AS OF '2026-03-16 23:59:59';
-- View snapshots:
SELECT * FROM silver.orders.snapshots;
-- Returns: committed_at, snapshot_id, operation, manifest_list
-- Rollback to snapshot:
CALL system.rollback_to_snapshot('freshmart.silver.orders', 5765671814693002000);
TIME TRAVEL RETENTION:
Delta Lake: controlled by delta.logRetentionDuration (default: 30 days)
and delta.deletedFileRetentionDuration (default: 7 days)
Files referenced by transactions within retention window: preserved
VACUUM removes files older than the retention window
SET TBLPROPERTIES on Delta:
delta.logRetentionDuration = 'interval 90 days' -- keep 90 days of history
delta.deletedFileRetentionDuration = 'interval 30 days'
CAUTION: longer retention = more storage cost
For GDPR/compliance: long retention is valuable for audit
For high-write tables: long retention is expensive (many old files)
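How a TIMESTAMP AS OF query resolves to a version can be sketched minimally: pick the latest commit at or before the requested timestamp. The commit history here is invented, not read from a real _delta_log:

```python
# Sketch of timestamp-based time travel resolution.
from datetime import datetime

history = [  # (version, commit timestamp) — illustrative values
    (40, datetime(2026, 3, 15, 6, 10)),
    (41, datetime(2026, 3, 16, 6, 11)),
    (42, datetime(2026, 3, 17, 6, 14)),
]

def version_as_of(history, ts):
    """Latest version committed at or before ts (None if table didn't exist)."""
    eligible = [v for v, committed in history if committed <= ts]
    return max(eligible) if eligible else None

# '2026-03-16 23:59:59' falls after v41's commit but before v42's:
assert version_as_of(history, datetime(2026, 3, 16, 23, 59, 59)) == 41
assert version_as_of(history, datetime(2026, 3, 17, 12, 0)) == 42
assert version_as_of(history, datetime(2026, 1, 1)) is None
```

This is also why time travel only works within the retention window: once VACUUM removes the files a version references, that version can no longer be materialised.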
PRACTICAL TIME TRAVEL USE CASES:
1. Debug a pipeline bug:
Pipeline had a bug on 2026-03-10 that wrote wrong revenue figures.
Bug fixed and Silver corrected on 2026-03-11.
To see what was wrong:
SELECT SUM(amount) FROM silver.orders TIMESTAMP AS OF '2026-03-10 23:59:59';
Compare to:
SELECT SUM(amount) FROM silver.orders TIMESTAMP AS OF '2026-03-11 23:59:59';
2. Regulatory audit:
Regulator asks: "What was your total active customer count as of Q4 end?"
SELECT COUNT(*) FROM silver.customers TIMESTAMP AS OF '2026-03-31 23:59:59';
Returns the exact count from that date — provable, reproducible.
3. ML reproducibility:
ML team trained a model on March 1 data.
Need to reproduce the exact training dataset for audit:
df = (spark.read.format("delta")
    .option("timestampAsOf", "2026-03-01 00:00:00")
    .load(silver_orders_path))
Returns the exact same data used for training.
4. Recover from accidental delete:
Someone ran DELETE FROM silver.orders WHERE store_id = 'ST001' by mistake.
RESTORE TABLE silver.orders TO VERSION AS OF (current_version - 1);
Table recovered to state before the accidental delete.
Apache Iceberg — The Most Portable Open Table Format
Module 29 introduced the three open table formats. This module goes deeper on Apache Iceberg specifically, because its engine-agnostic design is the most relevant for teams building multi-engine platforms in 2026. Iceberg is natively supported by Spark, Flink, Trino, Athena, Snowflake, BigQuery, and Hive — you can write with Spark and query with Snowflake on the same Iceberg table, with no conversion.
Iceberg metadata hierarchy
ICEBERG TABLE STRUCTURE ON S3:
s3://freshmart-lake/silver/orders/
├── metadata/
│ ├── v1.metadata.json ← initial table creation
│ ├── v2.metadata.json ← after first write
│ ├── v3.metadata.json ← after second write (current)
│ ├── snap-001-manifest-list.avro ← snapshot 1's manifest list
│ ├── snap-002-manifest-list.avro ← snapshot 2's manifest list
│ └── snap-003-manifest-list.avro ← snapshot 3's manifest list (current)
├── manifests/
│ ├── manifest-001.avro ← lists data files for snapshot 1
│ ├── manifest-002.avro ← lists data files for snapshot 2 additions
│ └── manifest-003.avro ← lists data files for snapshot 3
└── data/
├── date=2026-03-15/
│ └── part-00001-abc123.parquet
├── date=2026-03-16/
│ └── part-00001-def456.parquet
└── date=2026-03-17/
└── part-00001-ghi789.parquet
CATALOG: the external service that stores the current metadata pointer
The catalog maps: table_name → current metadata file path
The catalog is the authoritative source of "which version is current"
Without the catalog, you cannot know which metadata file is current.
Supported Iceberg catalog implementations:
├── Hive Metastore (HMS): stores pointer in Hive's MySQL/PostgreSQL
├── AWS Glue Data Catalog: stores pointer in Glue's managed catalog
├── Apache Nessie: git-like catalog with branching and tagging
├── REST Catalog: generic REST API standard (Tabular, Databricks Unity Catalog)
└── JDBC Catalog: stores pointer in any JDBC database (dev only)
HOW A QUERY READS AN ICEBERG TABLE:
1. CATALOG LOOKUP: client asks catalog for table "silver.orders"
Catalog returns: metadata file = "v3.metadata.json"
2. READ METADATA FILE: v3.metadata.json
Contains: current snapshot ID, schema history, partition spec
Current snapshot: snap-003-manifest-list.avro
3. READ MANIFEST LIST: snap-003-manifest-list.avro
Contains: list of manifest files for this snapshot
Each manifest entry has: partition range stats
Apply partition pruning here: skip manifests with no matching partitions
4. READ RELEVANT MANIFESTS: manifest-003.avro
Contains: list of data files, per-file statistics (min/max/null_count)
Apply data file pruning: skip files with no matching rows
5. READ RELEVANT PARQUET FILES
Apply predicate pushdown within each file (row group level)
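Steps 3–4 (metadata-only pruning) can be sketched with invented manifest entries carrying per-file min/max statistics:

```python
# Sketch of min/max data-file pruning. The manifest contents are
# illustrative, not Iceberg's real Avro manifest schema.

manifest = [  # (file path, min order_date, max order_date)
    ("data/part-00001.parquet", "2026-03-15", "2026-03-15"),
    ("data/part-00002.parquet", "2026-03-16", "2026-03-16"),
    ("data/part-00003.parquet", "2026-03-17", "2026-03-17"),
]

def prune(manifest, lo, hi):
    """Keep only files whose [min, max] range can overlap [lo, hi]."""
    return [path for path, fmin, fmax in manifest if fmin <= hi and fmax >= lo]

# WHERE order_date = '2026-03-17' touches one file; the other two are
# skipped without any data I/O:
assert prune(manifest, "2026-03-17", "2026-03-17") == ["data/part-00003.parquet"]
```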
PERFORMANCE:
Steps 1-4 are metadata-only (small files, fast)
Step 5 is where actual data I/O happens
Well-pruned queries skip 90-99% of data files → dramatic speedup
Iceberg partition evolution — the feature Delta Lake lacks
PROBLEM IN DELTA LAKE:
Table created with PARTITION BY (order_date) in 2024.
In 2026: table has grown to 50 TB, queries filter by (store_id, order_date).
Changing partition strategy in Delta: requires full table rewrite (50 TB!)
Duration: 8-12 hours of compute. Cost: hundreds of dollars. Risk: outage.
ICEBERG PARTITION EVOLUTION (no data rewrite needed):
-- Original partition spec (applied to all data until now):
ALTER TABLE silver.orders ADD PARTITION FIELD months(order_date);
-- All existing data has metadata: partitioned by month(order_date)
-- New partition spec (applied to new data only):
ALTER TABLE silver.orders DROP PARTITION FIELD months(order_date);
ALTER TABLE silver.orders ADD PARTITION FIELD days(order_date);
ALTER TABLE silver.orders ADD PARTITION FIELD identity(store_id);
-- New data written with: partitioned by (day(order_date), store_id)
HOW ICEBERG HANDLES MIXED PARTITION SPECS:
Old data files: still have month-based partition metadata in their manifests
New data files: have (day, store_id) partition metadata
Each manifest records which partition spec was used for its files
QUERY: SELECT * FROM silver.orders WHERE order_date = '2026-03-17' AND store_id = 'ST001'
Iceberg reads manifest list
For old manifests (month spec): prune by month('2026-03') — keeps March
For new manifests (day+store spec): prune by day=2026-03-17 AND store='ST001'
→ No data rewrite. Both old and new data queried efficiently with their
respective partition specs. Pruning works for both.
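The spec-per-manifest behaviour can be sketched as follows; the manifest layout is illustrative, not Iceberg's actual Avro schema:

```python
# Each manifest records the partition spec its files were written under,
# so the predicate is rewritten per manifest rather than per table.

manifests = [
    # Written under the old months(order_date) spec:
    {"spec": "month", "partition": {"month": "2026-03"}, "files": ["old-a.parquet"]},
    {"spec": "month", "partition": {"month": "2026-02"}, "files": ["old-b.parquet"]},
    # Written under the new (days(order_date), store_id) spec:
    {"spec": "day_store", "partition": {"day": "2026-03-17", "store": "ST001"},
     "files": ["new-a.parquet"]},
    {"spec": "day_store", "partition": {"day": "2026-03-17", "store": "ST002"},
     "files": ["new-b.parquet"]},
]

def prune(manifests, day, store):
    """Prune each manifest with the predicate rewritten for ITS OWN spec."""
    keep = []
    for m in manifests:
        if m["spec"] == "month":
            ok = m["partition"]["month"] == day[:7]       # coarse: month match
        else:
            ok = (m["partition"]["day"] == day
                  and m["partition"]["store"] == store)   # fine: exact match
        if ok:
            keep.extend(m["files"])
    return keep

# Old March data survives month-level pruning; new data prunes to one store:
assert prune(manifests, "2026-03-17", "ST001") == ["old-a.parquet", "new-a.parquet"]
```

Old data is pruned more coarsely (month granularity) but still correctly; only newly written data benefits from the finer spec — no rewrite required.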
HIDDEN PARTITION TRANSFORMS:
Iceberg partition transforms allow partitioning by a derived value,
not the raw column value:
years(ts): partition by year of timestamp
months(ts): partition by year-month of timestamp
days(ts): partition by date of timestamp
hours(ts): partition by hour of timestamp
bucket(N, col): hash-partition into N buckets
truncate(W, col): truncate string/number to width W
These are HIDDEN from query writers:
Query: WHERE order_date = '2026-03-17'
Iceberg internally: find all files in the days(2026-03-17) partition
User does not need to know the partition is day-level, not raw timestamp
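The transforms above can be sketched as plain functions. Note the Iceberg spec pins exact semantics (bucket uses a specified Murmur3 hash); Python's built-in `hash` is used here purely for illustration and is NOT the real function:

```python
# Plain-Python sketches of Iceberg's partition transforms.
from datetime import date

def days(d: date) -> str:
    return d.isoformat()                  # days(ts): one partition per date

def months(d: date) -> str:
    return f"{d.year:04d}-{d.month:02d}"  # months(ts): one per year-month

def truncate(width: int, s: str) -> str:
    return s[:width]                      # truncate(W, col): fixed-width prefix

def bucket(n: int, value: str) -> int:
    # Real Iceberg: Murmur3 32-bit hash; hash() here is illustrative only.
    return hash(value) % n                # bucket(N, col): hash into N buckets

d = date(2026, 3, 17)
assert days(d) == "2026-03-17"
assert months(d) == "2026-03"
assert truncate(2, "ST001") == "ST"
assert 0 <= bucket(16, "ST001") < 16
```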
BENEFIT: queries do not break when partition granularity changes.
Old query: WHERE order_date BETWEEN '2026-03-01' AND '2026-03-17'
Old partitioning: months(order_date) → scans all March files
New partitioning: days(order_date) → scans only March 1-17 files
Query unchanged. Performance improved. No user migration needed.
Unity Catalog — Governance for the Lakehouse Era
A lakehouse with no governance is just a large lake with ACID semantics. Unity Catalog (Databricks) is the most mature lakehouse governance layer in production as of 2026. It provides a three-level namespace (catalog.schema.table), column-level access control, row-level security, cross-engine lineage, and audit logging — all from a single control plane.
UNITY CATALOG THREE-LEVEL NAMESPACE:
catalog.schema.table
│ │ └── table or view
│ └── schema (like a PostgreSQL schema / database)
└── catalog (top-level container, maps to a cloud storage account)
freshmart_prod.silver.orders ← production Silver orders
freshmart_prod.gold.daily_revenue ← production Gold metrics
freshmart_dev.silver.orders ← dev environment Silver (separate catalog)
freshmart_prod.ml_features.order_features ← ML feature store tables
UNITY CATALOG ACCESS CONTROL:
Grant/revoke at any level of the three-level hierarchy:
GRANT SELECT ON CATALOG freshmart_prod TO GROUP analysts;
GRANT SELECT ON SCHEMA freshmart_prod.silver TO USER priya@freshmart.com;
GRANT SELECT ON TABLE freshmart_prod.gold.daily_revenue TO GROUP finance;
-- Remove broad table access where PII is present:
REVOKE SELECT ON TABLE freshmart_prod.silver.customers FROM GROUP analysts;
-- Unity Catalog grants are not per-column; expose non-PII columns
-- via a view (or attach column masks, below):
CREATE VIEW freshmart_prod.silver.customers_safe AS
SELECT customer_id, tier, city, lifetime_orders
FROM freshmart_prod.silver.customers;
GRANT SELECT ON VIEW freshmart_prod.silver.customers_safe TO GROUP analysts;
COLUMN MASKING (dynamic data masking):
Unity Catalog attaches a mask function to a column
(note: this differs from Snowflake's CREATE MASKING POLICY syntax):
CREATE OR REPLACE FUNCTION mask_email(val STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('data_engineers')
THEN val -- engineers see raw email
ELSE sha2(val, 256) -- others see hash
END;
ALTER TABLE freshmart_prod.bronze.customers
ALTER COLUMN email
SET MASK mask_email;
-- Everyone querying this table sees hashed email unless they are data_engineers
ROW-LEVEL SECURITY (row filters):
CREATE OR REPLACE FUNCTION store_filter(store_id STRING)
RETURNS BOOLEAN
RETURN is_account_group_member('all_stores') -- global access
OR store_id = current_user_store_id(); -- or just their store (illustrative helper)
ALTER TABLE freshmart_prod.gold.store_performance
SET ROW FILTER store_filter ON (store_id);
-- Store managers only see their own store's performance data
UNITY CATALOG LINEAGE:
Unity Catalog automatically captures data lineage from:
- Spark queries (via SparkListener integration)
- Databricks SQL queries
- dbt runs (via dbt Unity Catalog integration)
For any table, you can query:
SELECT * FROM system.access.column_lineage
WHERE target_table_name = 'daily_revenue';
Returns: upstream columns, transformations, and source tables
→ gold.daily_revenue.net_revenue ← silver.orders.order_amount, silver.orders.discount_amount
METASTORE ARCHITECTURE:
Unity Catalog uses a Databricks-managed metastore (not Hive Metastore)
One Unity Catalog metastore per cloud region per Databricks account
Multiple workspaces share the same metastore → same governance
Tables stored on customer-owned S3/ADLS — Databricks does not hold data
Metastore endpoint: managed by Databricks, not customer
Apache Polaris — the open source Unity Catalog alternative
Apache Polaris (incubating, donated by Snowflake in 2024) is an open source implementation of the Iceberg REST Catalog specification with governance features. It provides a vendor-neutral catalog that any Iceberg-compatible engine can use — enabling Unity Catalog-like governance without Databricks lock-in. As of 2026, Polaris is production-ready and used by teams that want multi-engine governance without a single-vendor dependency.
ICEBERG REST CATALOG SPECIFICATION:
Standard REST API for catalog operations:
GET /v1/namespaces ← list all namespaces
GET /v1/namespaces/{ns}/tables ← list tables in namespace
GET /v1/namespaces/{ns}/tables/{table} ← get table metadata
POST /v1/namespaces/{ns}/tables ← create table
POST /v1/namespaces/{ns}/tables/{table}/metrics ← report metrics
ANY engine that implements this REST client can use ANY catalog server:
Spark → Polaris REST catalog
Flink → same Polaris REST catalog
Trino → same Polaris REST catalog
Athena → same Polaris REST catalog (via AWS Glue Iceberg support)
This is the open standard equivalent of Unity Catalog's proprietary API.
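A minimal sketch of a client that only builds these REST paths (the class is invented and the base URL hypothetical; a real client such as PyIceberg also handles authentication, pagination, and the load-table response body):

```python
# Path-building sketch for the Iceberg REST Catalog endpoints listed above.

class RestCatalogPaths:
    """Builds request URLs for a REST catalog server; no network calls."""
    def __init__(self, base: str):
        self.base = base.rstrip("/")

    def namespaces(self) -> str:
        return f"{self.base}/v1/namespaces"

    def tables(self, ns: str) -> str:
        return f"{self.base}/v1/namespaces/{ns}/tables"

    def table(self, ns: str, name: str) -> str:
        return f"{self.base}/v1/namespaces/{ns}/tables/{name}"

paths = RestCatalogPaths("https://polaris.freshmart.internal/api/catalog")
assert paths.table("silver", "orders").endswith(
    "/v1/namespaces/silver/tables/orders")
```

Because every engine builds the same paths against the same server, swapping Spark for Flink or Trino changes nothing at the catalog layer.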
CONFIGURING SPARK TO USE POLARIS:
spark.sql.catalog.freshmart = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.freshmart.type = rest
spark.sql.catalog.freshmart.uri = https://polaris.freshmart.internal/api/catalog
spark.sql.catalog.freshmart.credential = client_id:client_secret
spark.sql.catalog.freshmart.warehouse = s3://freshmart-lake/iceberg
-- Query Iceberg tables via Polaris catalog:
spark.sql("SELECT * FROM freshmart.silver.orders WHERE order_date = '2026-03-17'")
APACHE NESSIE — the git-like alternative:
Nessie adds git-like branching and tagging to Iceberg catalogs.
Create a branch for data experimentation without affecting production:
nessie branch create experiment-new-transform
spark.sql.catalog.nessie.ref = experiment-new-transform
-- Write experimental data to this branch
-- Merge to main when validated: nessie merge experiment-new-transform --into main
Managed offering: Dremio Arctic (Dremio's hosted Nessie service)
Streaming Ingestion Into Lakehouse Tables
One of the lakehouse's most compelling properties is that streaming and batch workloads can read and write the same tables. A Spark Structured Streaming job appends events to a Delta Lake table in near-real-time. A dbt batch job reads the same table for daily Gold aggregation. An ML training job reads the same table for feature extraction. No copies, no synchronisation.
# CDC STREAM → BRONZE DELTA LAKE (near-real-time)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp, lit
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate())

# CDC event schema (illustrative; fields abbreviated):
order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", StringType()),
])

# Read CDC events from Kafka:
orders_cdc = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "freshmart.cdc.public.orders")
    .load())

# Parse and write to Bronze Delta Lake:
def write_to_bronze(batch_df, batch_id):
    (batch_df
        .select(from_json(col("value").cast("string"), order_schema).alias("e"))
        .select("e.*")
        .withColumn("_bronze_ts", current_timestamp())
        .withColumn("_batch_id", lit(batch_id))  # lit(): batch_id is a plain int
        .write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("s3://freshmart-lake/bronze/orders"))

query = (orders_cdc.writeStream
    .foreachBatch(write_to_bronze)
    .option("checkpointLocation", "s3://freshmart-lake/checkpoints/orders_bronze")
    .trigger(processingTime="1 minute")  # micro-batch every 1 minute
    .start())
# WHILE THIS STREAMING WRITE IS HAPPENING:
# dbt runs against silver.orders (reads Bronze) — NO conflict
# Analysts query silver.orders — NO conflict
# Delta Lake's optimistic concurrency handles concurrent readers
# CHANGE DATA FEED — reading only changed rows from a Delta table:
# Instead of scanning the entire Bronze table to find new rows,
# use Change Data Feed to read only the delta since last processed version:
changes = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", "latest")
    .load("s3://freshmart-lake/bronze/orders"))
# changes includes: _change_type (insert/update_preimage/update_postimage/delete)
# Use this to feed Silver MERGE with only the changed Bronze rows:
def merge_to_silver(batch_df, batch_id):
    from delta.tables import DeltaTable
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number

    # Filter to latest version per order_id (most recent CDC event),
    # then drop ranking and CDF bookkeeping columns so UpdateAll/InsertAll
    # only writes real table columns:
    latest = (batch_df
        .filter(col("_change_type").isin("insert", "update_postimage"))
        .withColumn("rn", row_number().over(
            Window.partitionBy("order_id").orderBy(col("updated_at").desc())))
        .filter(col("rn") == 1)
        .drop("rn", "_change_type", "_commit_version", "_commit_timestamp"))

    (DeltaTable.forPath(spark, "s3://freshmart-lake/silver/orders")
        .alias("target")
        .merge(latest.alias("source"), "target.order_id = source.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

silver_stream = (changes.writeStream
    .foreachBatch(merge_to_silver)
    .option("checkpointLocation", "s3://checkpoints/orders_silver")
    .trigger(processingTime="5 minutes")
    .start())
# ENABLE CHANGE DATA FEED ON A DELTA TABLE:
ALTER TABLE bronze.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- From this point, Delta records change type alongside each row change.
-- CDF adds storage overhead (~10-20% for typical workloads).
-- Only enable when Change Data Feed is actively consumed.
When the Lakehouse Is the Right Choice — and When It Is Not
The lakehouse is not the right architecture for every team or every use case. It adds operational complexity over a plain warehouse, and it does not match the query performance of Snowflake or BigQuery for pure SQL analytics workloads. The decision is architectural — based on the specific characteristics of the platform being built.
Migrating FreshMart From Two Systems to a Lakehouse
FreshMart runs two separate systems: a Spark + S3 data lake (used by ML engineers) and a Snowflake warehouse (used by analysts). The daily ETL job that copies Silver tables from S3 to Snowflake costs $1,800/month in Snowflake credits and takes 4 hours. Analysts see data that is 4 hours stale. ML engineers and analysts have different versions of Silver — each team has found discrepancies. The CTO asks you to propose a consolidation.
CURRENT STATE:
S3 (lake): raw/bronze/silver/gold layers
ML team reads silver via Spark
Snowflake: silver/gold tables (copy of lake silver, 4h stale)
Analyst team queries via Snowflake SQL
ETL pipeline: copies silver from S3 to Snowflake daily (4h, $1,800/mo)
Problem: two copies, inconsistency, stale data, unnecessary cost
TARGET STATE:
S3 (Delta Lake): raw/bronze/silver/gold layers — single copy
Databricks SQL: queries Silver/Gold Delta tables directly (replaces Snowflake)
ML team: reads same Delta tables via Spark (no change)
Analyst team: queries via Databricks SQL (new tool, same SQL)
ETL pipeline: eliminated
MIGRATION PHASES:
PHASE 1 (Weeks 1-4): Enable Delta Lake on existing S3 tables
- Convert Silver and Gold S3 Parquet tables to Delta Lake format:
(spark.read.parquet("s3://freshmart-lake/silver/orders")
    .write.format("delta")
    .save("s3://freshmart-lake/silver_delta/orders"))
- Validate: Delta and Parquet versions produce identical query results
- Set up Unity Catalog: register Delta tables in Unity Catalog
- Create Databricks SQL Warehouse for analysts
PHASE 2 (Weeks 5-8): Dual-write period
- Silver pipeline writes to BOTH S3 Parquet AND Delta simultaneously
- Analysts get Databricks SQL access to Delta tables
- Analysts run queries in parallel against Snowflake (old) and Databricks SQL (new)
- Compare results: if Databricks SQL results match Snowflake → validation passed
- ML team switches to Delta tables: verify training pipelines unchanged
PHASE 3 (Weeks 9-12): Cutover
- Stop writing to S3 Parquet (keep Delta as sole format)
- Stop running ETL copy to Snowflake
- Decommission Snowflake warehouse (or retain for special use cases)
- Migrate all dbt models to run against Delta tables (dbt Databricks adapter)
- Redirect all analyst tools (Metabase) to Databricks SQL endpoint
PHASE 4 (Week 13+): Optimise
- Add CLUSTER BY on Delta tables for analyst query patterns
- Tune Databricks SQL warehouse sizes (X-Small for dashboards, Medium for ad-hoc)
- Configure auto-suspend appropriately
- Enable Unity Catalog lineage and access control
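As a sanity check, the monthly cost deltas claimed in the expected outcomes can be tallied with simple arithmetic (figures are the example's own estimates, not real pricing):

```python
# Tally the FreshMart migration's claimed monthly cost changes.
eliminated = {
    "etl_copy": 1800,           # daily S3 → Snowflake copy job
    "snowflake_compute": 3200,  # warehouse credits
    "snowflake_storage": 800,
}
added = {"databricks_sql": 1600}  # smaller warehouse, less data

net_monthly = sum(eliminated.values()) - sum(added.values())
assert net_monthly == 4200          # $4,200/month net saving
assert net_monthly * 12 == 50400    # $50,400/year
```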
EXPECTED OUTCOMES:
ETL cost eliminated: -$1,800/month
Snowflake compute eliminated: -$3,200/month (warehouse credits)
Snowflake storage eliminated: -$800/month
Databricks SQL new cost: +$1,600/month (smaller warehouse, less data)
Net saving: $4,200/month ($50,400/year)
Data freshness: 4 hours → 15 minutes (pipeline interval)
ML/analyst consistency: guaranteed (one copy)
Operational pipelines to maintain: from 2 (lake + ETL) to 1 (lake only)
🎯 Key Takeaways
- ✓ The lakehouse solves the two-system problem: organisations with both a data lake and a data warehouse maintain two copies of data, an ETL pipeline between them, and two sources of truth that diverge over time. The lakehouse eliminates the copy by adding ACID semantics directly to the lake storage layer.
- ✓ The lakehouse architecture has three components: cheap object storage (S3/ADLS/GCS) at the bottom, an open table format (Delta Lake, Iceberg, or Hudi) that adds ACID semantics in the middle, and multiple compute engines (Spark, Databricks SQL, Trino, Flink) that all read/write the same table format at the top.
- ✓ Delta Lake achieves ACID on S3 via a transaction log (_delta_log/). New Parquet data files are written but invisible to readers until an atomic log entry commits them. The log entry is a single S3 PUT — atomic, all-or-nothing. Concurrent writers use optimistic concurrency with conflict detection and retry.
- ✓ Copy-on-Write (CoW) rewrites the entire affected file on every UPDATE/DELETE — fast reads, expensive writes. Merge-on-Read (MoR) writes small delete/change files and merges on read — cheap writes, more expensive reads. Choose CoW for analytics-heavy tables, MoR for high-velocity write workloads like CDC.
- ✓ Time travel queries return table state at a previous version or timestamp. Delta Lake uses VERSION AS OF and TIMESTAMP AS OF. Iceberg uses FOR SYSTEM_VERSION AS OF and FOR SYSTEM_TIME AS OF. Retention is configurable — longer retention enables longer time travel windows but increases storage cost.
- ✓ Iceberg partition evolution allows changing the partition strategy without rewriting any existing data. Old data retains its original partition spec in manifest metadata. New data uses the new spec. Queries efficiently prune both old and new data using the appropriate spec. Delta Lake lacks this feature — changing partition strategy requires full table rewrite.
- ✓ Unity Catalog provides the governance layer for the Databricks lakehouse: three-level namespace (catalog.schema.table), column-level masking policies, row-level security, automatic cross-engine lineage, and audit logging. Apache Polaris is the open source equivalent for teams that want catalog-level governance without vendor lock-in.
- ✓ Streaming and batch can share the same Delta/Iceberg tables. Spark Structured Streaming writes CDC events to Bronze Delta Lake in micro-batches. dbt batch jobs read the same Silver Delta tables for Gold aggregation. ML training Spark jobs read the same tables for feature extraction. One copy, all consumers.
- ✓ Change Data Feed (CDF) on Delta Lake records which rows changed and how (insert/update/delete) alongside each write. Downstream Silver MERGE jobs can use CDF to read only the changed Bronze rows rather than scanning the entire Bronze table. Enable with ALTER TABLE SET TBLPROPERTIES (delta.enableChangeDataFeed = true).
- ✓ Choose the lakehouse when ML and SQL analytics must share data without duplication, data volume is large enough that duplication is expensive, multiple engines are required, or building a new platform. Stick with a managed warehouse (Snowflake, BigQuery) when the workload is primarily SQL analytics, the team has no Spark expertise, sub-second interactive performance is critical, or an existing warehouse investment is in place.