Medallion Architecture — Bronze, Silver, Gold
What belongs in each layer, the exact transformations at each boundary, dbt and Spark integration, SCD handling, and governing a three-tier lakehouse.
The Medallion Architecture — Three Layers, One Principle
The Medallion Architecture is the dominant data organisation pattern in modern data lakehouses. It structures data into three progressively refined layers — Bronze (raw), Silver (cleaned and trusted), and Gold (business-ready aggregates) — with each layer building on the previous one through well-defined transformation steps. Every layer has a specific contract: what data it accepts, what it guarantees, and who can use it.
The pattern was popularised by Databricks and is now the standard for any platform using Delta Lake or Iceberg. Its value is not the three-layer count — it is the principle each layer embodies: each successive layer makes stronger guarantees about data quality, structure, and fitness for analytical use. Moving data from Bronze to Silver is not just renaming a folder; it is a transformation that adds a contractual guarantee.
Bronze Layer — Raw Data, Preserved Forever
The Bronze layer is the truth of what the source system actually sent. If a vendor CSV had wrong data on 2026-03-15, the Bronze layer still has that wrong data. This is not a failure — it is a feature. When the vendor sends a corrected file, you can prove what the original said and when the correction arrived. When a bug is found in the Silver transformation, Bronze is the raw material for reprocessing.
What the Bronze layer does — and does not do
BRONZE LAYER CONTRACT:
Accepts: Any data from the landing zone
Produces: Parquet files partitioned by ingestion date
Guarantees: Format standardised, ingestion metadata added
TRANSFORMATIONS PERFORMED IN BRONZE:
✓ Format conversion: JSON/CSV/XML → Parquet
✓ Hive-style partitioning: adds date= partition column
✓ Ingestion metadata: adds ingested_at, source_system, pipeline_run_id
✓ Light type coercion: minimum changes to make data loadable as Parquet
(e.g., parse timestamps to TimestampType for Parquet compatibility)
✓ Schema capture: record the schema seen from the source in catalog
TRANSFORMATIONS NOT PERFORMED IN BRONZE:
✗ Business validation (valid statuses, amount ranges) — Silver's job
✗ Deduplication — Silver's job
✗ NULL handling — Silver's job
✗ Join to other tables — Silver's job
✗ Business rule application (tier classification, etc.) — Gold's job
✗ Dropping columns — preserve everything, even columns you don't use yet
THE BRONZE IMMUTABILITY RULE:
Bronze data is append-only via CDC / incremental ingestion.
Once written, a Bronze row is never modified.
If the source sends a correction: append the corrected row.
Silver deduplication will keep the latest version.
This preserves the complete change history — invaluable for audits.
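The append-then-deduplicate flow can be sketched in plain Python; the row shape and field names here are illustrative, not the actual pipeline's:

```python
from datetime import datetime

# Bronze: append-only list of rows, never modified in place.
bronze_payments = []

def ingest(row: dict) -> None:
    """Append to Bronze — corrections arrive as new rows, not updates."""
    bronze_payments.append(row)

# Original row, then a correction from the source two days later:
ingest({"payment_id": 1, "amount": 500, "_ingested_at": datetime(2026, 3, 15)})
ingest({"payment_id": 1, "amount": 550, "_ingested_at": datetime(2026, 3, 17)})

def silver_state(bronze_rows: list) -> dict:
    """Silver-style dedup: keep only the latest version per business key."""
    latest: dict = {}
    for row in bronze_rows:
        key = row["payment_id"]
        if key not in latest or row["_ingested_at"] > latest[key]["_ingested_at"]:
            latest[key] = row
    return latest

print(silver_state(bronze_payments)[1]["amount"])  # the corrected amount, 550
```

Bronze keeps both rows for the audit trail; Silver's dedup exposes only the latest.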
BRONZE DATA VOLUME REALITY:
Bronze retains all CDC events — every INSERT, UPDATE, DELETE.
For a table with 10M rows, a 5% daily update rate, and a similar insert rate:
Daily CDC events: 500,000 inserts + 500,000 updates = 1M events/day
After 1 year: 365M Bronze rows for 10M current rows in source
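A quick back-of-envelope check of the numbers above (the matching insert rate is an assumption made explicit here):

```python
current_rows = 10_000_000
updated_per_day = current_rows * 5 // 100   # 5% daily update rate
inserted_per_day = current_rows * 5 // 100  # assumed similar insert rate

events_per_day = updated_per_day + inserted_per_day
rows_after_one_year = events_per_day * 365

print(events_per_day)        # 1000000
print(rows_after_one_year)   # 365000000
```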
This is expected and correct — Bronze is the event log, Silver is the state.

Bronze implementation — Spark and Python patterns
"""
Bronze pipeline: landing JSON → Bronze Parquet (Delta Lake)
Preserves all source fields, adds ingestion metadata.
"""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
from delta.tables import DeltaTable
from datetime import date
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)
BRONZE_PATH = "s3://freshmart-data-lake-prod/bronze/payments"
LANDING_PATH = "s3://freshmart-data-lake-prod/landing/razorpay"
def load_to_bronze(run_date: date, run_id: str) -> dict:
    """
    Load raw Razorpay JSON from landing to Bronze Delta Lake.
    Converts to Parquet, adds metadata, preserves all source fields.
    """
    landing_file = f"{LANDING_PATH}/payments_{run_date.strftime('%Y%m%d')}.json"

    # Read raw JSON — schema inferred from source (schema-on-read at Bronze)
    raw = spark.read.json(landing_file)

    # Only transformations allowed at Bronze.
    # Note: _ prefix on all metadata columns distinguishes them from source columns.
    bronze = (
        raw
        .withColumn("_ingested_at", F.current_timestamp())
        .withColumn("_source_system", F.lit("razorpay"))
        .withColumn("_pipeline_run_id", F.lit(run_id))
        .withColumn("_source_date", F.lit(str(run_date)))
        .withColumn("_bronze_date", F.to_date(F.current_timestamp()))
    )

    # Write to Delta Lake with date partition:
    (
        bronze.write
        .format("delta")
        .mode("append")                     # Bronze is APPEND ONLY
        .partitionBy("_bronze_date")
        .option("mergeSchema", "true")      # allow new source columns to flow through
        .save(BRONZE_PATH)
    )

    # Register in Glue catalog if partition is new:
    row_count = bronze.count()
    return {
        "rows_written": row_count,
        "path": BRONZE_PATH,
        "partition": str(run_date),
    }
# BRONZE SCHEMA EVOLUTION STRATEGY:
# mergeSchema=true: if source adds a new column, Bronze accepts it
# automatically. Silver must handle the new column explicitly.
# When Silver fails because of an unexpected Bronze column:
# 1. The Bronze data is safe (mergeSchema preserved it)
# 2. Update Silver model to handle the new column
# 3. Reprocess Silver from Bronze — no re-ingestion from source needed
# WHAT TO DO WITH SCHEMA CHANGES:
# Source adds new column "refund_amount":
# Bronze: mergeSchema=true → new column appears in Parquet automatically
# Silver: update stg_payments.sql to SELECT refund_amount
# Gold: update downstream models that need refund_amount
# No re-ingestion from source. No data loss. Controlled migration.

Silver Layer — Trusted, Typed, Deduplicated
Silver is where raw data becomes trustworthy. An analyst can query a Silver table and know that every column has the right type, every required field is populated, the data has been deduplicated, and the values are within expected business ranges. Silver is the single source of truth for the current state of every entity in the platform.
The critical Silver contract: Silver contains entity state, not event history. The Bronze layer has every INSERT, UPDATE, and DELETE that happened over three years. The Silver layer has one row per order_id — the current state of that order, updated whenever Bronze receives a change. Silver answers "what does the order look like now?" Bronze answers "what happened to this order and when?"
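The event-log-versus-state distinction can be shown with a minimal replay, assuming Debezium-style op codes ('c' create, 'u' update, 'd' delete); the event shape is illustrative:

```python
# A Bronze-style CDC event stream for two orders:
events = [
    {"op": "c", "order_id": 1, "status": "placed"},
    {"op": "u", "order_id": 1, "status": "confirmed"},
    {"op": "u", "order_id": 1, "status": "delivered"},
    {"op": "c", "order_id": 2, "status": "placed"},
    {"op": "d", "order_id": 2, "status": "placed"},
]

def replay(events: list) -> dict:
    """Collapse the event history into one current-state row per order_id."""
    state: dict = {}
    for e in events:
        if e["op"] == "d":
            state.pop(e["order_id"], None)
        else:
            state[e["order_id"]] = {k: v for k, v in e.items() if k != "op"}
    return state

current = replay(events)
# Bronze answers "what happened": five events.
# Silver answers "what is it now": one surviving order, status 'delivered'.
```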
The Silver transformation checklist
-- models/silver/stg_payments.sql
WITH source AS (SELECT * FROM {{ ref('bronze_payments') }})
SELECT
    payment_id::BIGINT      AS payment_id,
    order_id::BIGINT        AS order_id,
    amount::DECIMAL(10,2)   AS amount,
    currency                AS currency,
    LOWER(TRIM(status))     AS status,
    created_at::TIMESTAMPTZ AS created_at,
    _ingested_at            AS bronze_ingested_at
FROM source
WHERE TRY_CAST(payment_id AS BIGINT) IS NOT NULL  -- reject if PK unparseable

-- Deduplication: keep the latest version per business key
WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY payment_id
            ORDER BY updated_at DESC, _ingested_at DESC
        ) AS rn
    FROM stg_payments
)
SELECT * EXCLUDE (rn)
FROM deduplicated
WHERE rn = 1

-- Hard rejection for NULL primary key:
WHERE payment_id IS NOT NULL
-- Soft handling for optional fields:
COALESCE(delivery_fee, 0) AS delivery_fee,
COALESCE(promo_code, '')  AS promo_code,  -- only if empty string is valid business default
notes AS notes  -- leave NULL as NULL — NULL has meaning here

-- Business value validation:
WHERE amount >= 0  -- reject negative amounts
  AND status IN ('pending','captured','failed','refunded')
  AND created_at > '2020-01-01'::TIMESTAMPTZ
  AND created_at < NOW() + INTERVAL '1 day'  -- reject future timestamps

-- PII masking:
SHA2(customer_email, 256) AS customer_email_hashed,
REGEXP_REPLACE(phone, '[0-9]', 'X') AS phone_masked,  -- every digit replaced with X
-- Raw fields NOT included in Silver SELECT — they stay in Bronze only

-- dbt relationship test (schema.yml):
-- - name: order_id
--   tests:
--     - relationships:
--         to: ref('silver_orders')
--         field: order_id
--         severity: warn  -- warn not error: some payments may arrive before orders

Silver as Delta Lake — MERGE for idempotent upserts
-- Silver layer in dbt uses the 'incremental' materialisation with Delta MERGE.
-- This is more efficient than full refresh for large tables.
-- models/silver/payments.sql
{{ config(
materialized = 'incremental',
unique_key = 'payment_id',
incremental_strategy = 'merge', -- Delta MERGE INTO
file_format = 'delta',
merge_update_columns = [ -- only update these columns on match
'status', 'amount', 'refund_amount', 'updated_at', 'silver_updated_at'
],
) }}
WITH source AS (
SELECT * FROM {{ ref('stg_payments') }}
{% if is_incremental() %}
-- On incremental runs: only process new/changed Bronze rows
WHERE bronze_ingested_at > (
SELECT MAX(silver_updated_at) FROM {{ this }}
)
{% endif %}
),
deduplicated AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY payment_id
ORDER BY updated_at DESC
) AS rn
FROM source
),
final AS (
SELECT
payment_id, order_id, amount, currency, status,
refund_amount, created_at, updated_at,
CURRENT_TIMESTAMP() AS silver_updated_at
FROM deduplicated
WHERE rn = 1
AND payment_id IS NOT NULL
AND amount >= 0
AND status IN ('pending','captured','failed','refunded')
)
SELECT * FROM final
-- What dbt generates under the hood:
-- MERGE INTO silver.payments AS target
-- USING new_rows AS source
-- ON target.payment_id = source.payment_id
-- WHEN MATCHED THEN UPDATE SET status=..., amount=...
-- WHEN NOT MATCHED THEN INSERT (payment_id, order_id, ...) VALUES (...)
-- IMPORTANT: merge_update_columns prevents overwriting
-- columns like created_at that should never change once set.

Gold Layer — Business Logic Applied, Ready to Serve
Gold is not a copy of Silver with aggregations on top. It is a purpose-built layer — each Gold model is designed for a specific analytical consumer. The daily revenue model exists because the finance team needs it. The customer LTV model exists because the product team needs it. Gold is the translation layer between "what the data says" (Silver) and "what the business asks" (dashboards, reports, APIs).
The Gold layer design principles
GOLD PATTERN 1: PRE-AGGREGATED METRICS (most common)
────────────────────────────────────────────────────────────
Purpose: Replace expensive on-the-fly aggregations with pre-computed results.
A query that joins 500M orders, 10M customers, and 10M payments
in real time takes 4 minutes. Pre-aggregated Gold: < 1 second.
Example — daily store revenue:
-- models/gold/daily_store_revenue.sql
SELECT
DATE(o.created_at AT TIME ZONE 'Asia/Kolkata') AS order_date,
o.store_id,
s.store_name,
s.city,
COUNT(DISTINCT o.order_id) AS order_count,
SUM(o.order_amount) AS gross_revenue,
SUM(o.discount_amount) AS total_discount,
SUM(o.order_amount - o.discount_amount) AS net_revenue,
AVG(o.order_amount) AS avg_order_value,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(CASE WHEN o.status = 'cancelled' THEN 1 ELSE 0 END) AS cancellations,
ROUND(100.0 * SUM(CASE WHEN o.status = 'cancelled' THEN 1 ELSE 0 END)
/ COUNT(o.order_id), 2) AS cancellation_rate_pct  -- 100.0 forces decimal division
FROM {{ ref('silver_orders') }} o
JOIN {{ ref('silver_stores') }} s USING (store_id)
WHERE o.status IN ('delivered', 'cancelled')
GROUP BY 1, 2, 3, 4
ORDER BY 1 DESC, 6 DESC
GOLD PATTERN 2: WIDE DENORMALISED FACT TABLE (for BI tools)
────────────────────────────────────────────────────────────
Purpose: Join multiple Silver tables into one wide flat table.
BI tools (Metabase, Tableau, Looker) work best with single flat tables.
Analysts should not need to join — Gold does it for them.
Example — fct_orders (the "order 360"):
-- models/gold/fct_orders.sql
SELECT
o.order_id, o.order_amount, o.status, o.created_at,
-- Customer attributes (from silver.customers at order time):
c.customer_id, c.city AS customer_city, c.tier AS customer_tier,
-- Store attributes:
s.store_id, s.store_name, s.store_city,
-- Payment attributes:
p.payment_method, p.payment_status, p.captured_at,
-- Delivery attributes:
d.delivery_minutes, d.delivery_partner,
-- Derived metrics:
CASE WHEN o.order_amount >= 2000 THEN 'premium'
WHEN o.order_amount >= 500 THEN 'standard'
ELSE 'economy' END AS order_tier,
CASE WHEN p.captured_at IS NOT NULL THEN TRUE
ELSE FALSE END AS is_paid
FROM {{ ref('silver_orders') }} o
LEFT JOIN {{ ref('silver_customers') }} c USING (customer_id)
LEFT JOIN {{ ref('silver_stores') }} s USING (store_id)
LEFT JOIN {{ ref('silver_payments') }} p USING (order_id)
LEFT JOIN {{ ref('silver_deliveries') }} d USING (order_id)
GOLD PATTERN 3: ENTITY SNAPSHOTS (for historical analysis)
────────────────────────────────────────────────────────────
Purpose: Capture the state of an entity (customer, product) at a specific
point in time for cohort analysis and trend reporting.
Example — daily customer snapshot:
-- models/gold/customer_daily_snapshot.sql
SELECT
snapshot_date,
customer_id,
tier,
total_lifetime_orders,
total_lifetime_spend,
days_since_last_order,
CASE WHEN days_since_last_order <= 30 THEN 'active'
WHEN days_since_last_order <= 90 THEN 'at_risk'
ELSE 'churned' END AS lifecycle_status
FROM {{ ref('silver_customers') }}
CROSS JOIN (
SELECT DISTINCT DATE(created_at) AS snapshot_date
FROM {{ ref('silver_orders') }}
WHERE created_at >= CURRENT_DATE - 365
) dates
-- One row per (customer, date) — track how the customer changed over time

Gold layer governance — who defines business logic
One of the most important governance questions in a data platform is: who owns the Gold layer? Business logic (how to calculate customer tier, what counts as a completed order, how to attribute revenue) should not be scattered across BI tool calculated fields, Excel files, and individual analyst queries. Gold centralises business logic so that every consumer gets the same answer to the same question.
# ANTI-PATTERN: Business logic scattered across consumers
# BI Tool A: customer_tier = order_amount > 1000 THEN 'premium'
# BI Tool B: customer_tier = order_amount > 1500 THEN 'premium'
# Analyst C: customer_tier = lifetime_spend > 10000 THEN 'premium'
# → Three different answers to "how many premium customers do we have"
# → Finance, product, and marketing each have different numbers
# CORRECT PATTERN: Business logic in Gold, one definition
# models/gold/customer_segments.sql:
# customer_tier =
# CASE WHEN lifetime_spend >= 50000 THEN 'platinum'
# WHEN lifetime_spend >= 10000 THEN 'gold'
# WHEN lifetime_spend >= 2000 THEN 'silver'
# ELSE 'standard'
# END
# This definition is in git, tested, reviewed, and versioned.
# Every dashboard that queries customer_tier gets the same answer.
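The same single-definition idea expressed as a plain Python function, using the thresholds from the example above (names and thresholds are illustrative):

```python
def customer_tier(lifetime_spend: float) -> str:
    """The one, version-controlled tier definition every consumer shares."""
    if lifetime_spend >= 50_000:
        return "platinum"
    if lifetime_spend >= 10_000:
        return "gold"
    if lifetime_spend >= 2_000:
        return "silver"
    return "standard"

print(customer_tier(12_500))  # gold
```

Whether the definition lives in a dbt model or a shared function, the point is the same: one tested definition in git, not three divergent ones in BI tools.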
# THE DATA CONTRACT:
# Gold models have explicit consumers listed in their documentation:
# - Finance dashboard: uses daily_store_revenue, customer_ltv
# - Operations dashboard: uses fct_orders, store_performance
# - Product analytics: uses customer_daily_snapshot, feature_adoption
# Before changing a Gold model, check all downstream consumers.
# Breaking changes require:
# 1. Announcing the change to all consumer teams
# 2. Adding the new column alongside the old (backward-compatible)
# 3. Migration period for consumers to update
# 4. Removing the old column only after all consumers have migrated
# dbt docs generate creates a lineage graph showing all downstream models.
# Use it before any Gold model modification.

Bronze vs Silver vs Gold — Every Dimension
| Dimension | 🥉 Bronze | 🥈 Silver | 🥇 Gold |
|---|---|---|---|
| Purpose | Preserve raw source data exactly | Trusted, typed, current entity state | Pre-computed business-ready metrics |
| Data model | Append-only event log | Current state (one row per entity) | Aggregates and denormalised facts |
| Schema model | Schema-on-read (mergeSchema=true) | Schema enforced at MERGE | Explicit, stable, consumer-specific |
| Transformations | Format conversion, metadata only | Types, dedup, nulls, validation, PII mask | Joins, aggregations, business rules |
| Update semantics | APPEND only (CDC events) | MERGE (upsert on business key) | Full rebuild or incremental MERGE |
| Table format | Delta Lake (mergeSchema) | Delta Lake (ACID, time travel) | Delta Lake (small, fast queries) |
| dbt model type | Not in dbt (Python/Spark pipeline) | incremental (merge strategy) | table or incremental |
| Retention | 1-3 years (reprocessing source) | 2-5 years | 1-2 years (rebuilt from Silver) |
| Row count ratio | 10-50× Silver (all history) | 1× (current state) | 0.001-0.1× (aggregated) |
| Access | Engineers, ML scientists only | All engineers + analysts (PII masked) | Everyone + BI tools + APIs |
| Contains PII | Yes — raw data unmasked | No — masked/hashed in transformation | No — derived from masked Silver |
| Source of truth | For raw historical data | For current entity state | Never — always rebuilt from Silver |
Where dbt Fits — And Where It Does Not
dbt is the standard transformation tool for the Silver and Gold layers. It handles the Bronze-to-Silver and Silver-to-Gold transformations with version-controlled SQL, automated testing, lineage documentation, and CI/CD integration. Understanding where dbt belongs — and where Python or Spark is more appropriate — is what produces a well-structured platform.
dbt PROJECT STRUCTURE FOR MEDALLION ARCHITECTURE:
freshmart_dbt/
├── models/
│ ├── staging/ ← Bronze → Silver staging (one-to-one source mapping)
│ │ ├── _sources.yml ← define Bronze tables as dbt sources
│ │ ├── stg_payments.sql ← cast types, rename, light cleaning
│ │ ├── stg_orders.sql
│ │ └── stg_customers.sql
│ │
│ ├── intermediate/ ← optional layer: complex joins / business prep
│ │ └── int_orders_with_payment.sql
│ │
│ ├── silver/ ← Silver layer (trusted, current state)
│ │ ├── _silver.yml ← schema.yml with tests
│ │ ├── orders.sql ← materialised='incremental', merge
│ │ ├── customers.sql
│ │ └── payments.sql
│ │
│ └── gold/ ← Gold layer (business metrics)
│ ├── finance/
│ │ ├── daily_revenue.sql
│ │ └── customer_ltv.sql
│ ├── operations/
│ │ └── store_performance.sql
│ └── product/
│ └── customer_segments.sql
│
├── tests/
│ └── generic/ ← custom dbt tests
│ └── assert_positive_revenue.sql
│
└── macros/
└── generate_schema_name.sql ← route models to correct schemas
MATERIALISATION STRATEGY BY LAYER:
staging models: materialised='view' (no storage cost, always fresh)
intermediate: materialised='view' or 'ephemeral'
silver models: materialised='incremental', incremental_strategy='merge'
gold aggregates: materialised='table' (full rebuild, small size)
gold large facts: materialised='incremental', merge on business key
dbt DOES WELL:
✓ Bronze → Silver staging (SQL type casting, renaming, light cleaning)
✓ Silver → Gold joins and aggregations
✓ Incremental MERGE updates (via Delta Lake connector)
✓ Test automation (not_null, unique, relationships, custom)
✓ Documentation and lineage graphs
✓ CI/CD integration (dbt test on every PR)
dbt DOES NOT DO WELL (use Python/Spark instead):
✗ File format conversion (CSV → Parquet) — use Spark
✗ Large-scale deduplication on billions of rows — use Spark
✗ ML feature engineering — use Python/Spark
✗ Complex JSON flattening with nested arrays — Spark handles better
✗ Real-time/streaming transformations — use Flink or Spark Streaming
✗ PII masking with custom encryption libraries — use Python
THE BOUNDARY RULE:
If the transformation can be expressed as a SQL SELECT and runs within
Snowflake/Databricks SQL Warehouse in reasonable time → use dbt.
If it requires Python libraries, non-SQL operations, or Spark's distributed
execution for scale → use a Python/Spark pipeline that writes to Bronze,
then let dbt handle Bronze → Silver → Gold from there.

Late-Arriving Data — Handling Corrections Across All Three Layers
Late-arriving data — records that are delivered to the platform after their logical timestamp — is handled differently at each layer. The Medallion Architecture's layer separation makes late data handling tractable: Bronze appends it, Silver merges it, Gold rebuilds or corrects it.
SCENARIO:
An order was placed on 2026-03-17.
The mobile app was offline and synced late on 2026-03-19.
The event arrives in the landing zone on 2026-03-19 at 14:00 UTC.
The event's event_time is 2026-03-17 22:15 UTC (when it actually happened).
LAYER 1 — BRONZE: appended with current ingestion date
The Bronze pipeline runs on 2026-03-19.
Row written to: bronze/orders/_bronze_date=2026-03-19/
Row has:
event_time: 2026-03-17 22:15 UTC (from the event — the real time)
_bronze_date: 2026-03-19 (partition — when we received it)
_ingested_at: 2026-03-19 14:02 UTC (when the pipeline ran)
→ Bronze correctly stores it in the 2026-03-19 partition (when received)
→ Bronze DOES NOT retroactively write to the 2026-03-17 partition
→ Both the event time and ingestion time are preserved
LAYER 2 — SILVER: merged using event_time as the business key basis
The Silver pipeline runs with a 30-minute overlap window.
The overlap catches late-arriving Bronze rows.
The MERGE on order_id updates the Silver row if:
source.updated_at > target.updated_at (conditional update)
If the order existed in Silver already:
→ The MERGE updates to the latest state (if source is newer)
If the order did NOT exist in Silver yet (was truly new):
→ The MERGE inserts the row with event_time as created_at
Silver.orders for order_id 9284751 now has:
created_at: 2026-03-17 22:15 UTC (correct business time)
updated_at: 2026-03-19 14:02 UTC (when we last saw it)
silver_updated_at: 2026-03-19 14:07 UTC
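The conditional update rule (`source.updated_at > target.updated_at`) can be sketched as a minimal in-memory MERGE; the row shape is illustrative:

```python
from datetime import datetime

def merge_row(target: dict, source_row: dict) -> None:
    """
    MERGE-style upsert: insert if the key is new; update only when the
    source row is newer, so a late or replayed event can't regress state.
    """
    key = source_row["order_id"]
    existing = target.get(key)
    if existing is None or source_row["updated_at"] > existing["updated_at"]:
        target[key] = source_row

silver: dict = {}
merge_row(silver, {"order_id": 9284751, "status": "delivered",
                   "updated_at": datetime(2026, 3, 19, 14, 2)})
# A stale duplicate arriving afterwards is ignored — state stays 'delivered':
merge_row(silver, {"order_id": 9284751, "status": "placed",
                   "updated_at": datetime(2026, 3, 17, 22, 15)})
```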
LAYER 3 — GOLD: correction depends on Gold model type
CASE A: Gold built from Silver with incremental MERGE
Silver has the correct state for order_id 9284751.
The next Gold dbt run sees the Silver row changed.
Gold is updated via MERGE — the 2026-03-17 aggregate is updated.
The order now appears in the 2026-03-17 daily revenue correctly.
CASE B: Gold built with full date-partition rebuild
The Gold model selects from Silver WHERE order_date = 2026-03-17.
After Silver has the late-arriving order, the next full rebuild
of the 2026-03-17 partition includes it correctly.
CASE C: Gold aggregate has ALREADY been used in finance report
Finance saw ₹42,11,500 for 2026-03-17 in Monday's report.
After the late arrival, the correct total is ₹42,12,380.
Decision: does the business want the correction to appear?
If yes: rebuild 2026-03-17 Gold partition, send correction notice.
If no: accept the lag as a known data characteristic.
The key: the CAPABILITY to correct exists because Silver is correct.
OVERLAP WINDOW CONFIGURATION:
The Silver incremental model uses an overlap to catch late Bronze rows:
{% if is_incremental() %}
WHERE bronze_ingested_at > (
SELECT MAX(silver_updated_at) - INTERVAL '30 minutes' FROM {{ this }}
)
{% endif %}
30 minutes catches most late-arriving Bronze rows without expensive reprocessing.
For data that can be days late: use a larger overlap (24h) + weekly full refresh.

Slowly Changing Dimensions — How Each Layer Handles Them
Module 15 covered SCD patterns in SQL. In the Medallion Architecture, SCDs manifest differently at each layer. Bronze has the complete event history including all changes. Silver has the current state. Gold may need point-in-time state for historical analysis. The three-layer model makes all three representations possible from the same raw data.
BRONZE LAYER — Complete change history (all SCD types possible)
Bronze is append-only CDC events. Every change is stored:
order_id=9284751, status='placed', created_at=2026-03-17 14:00, op='c'
order_id=9284751, status='confirmed', updated_at=2026-03-17 14:02, op='u'
order_id=9284751, status='delivered', updated_at=2026-03-17 15:30, op='u'
Bronze effectively stores all history needed for any SCD type.
SILVER LAYER — Current state (SCD Type 1 equivalent)
Silver MERGE keeps one row per order_id — the most recent state.
This is SCD Type 1: overwrite. No history in Silver itself.
Silver.orders for order_id=9284751:
order_id=9284751, status='delivered', created_at=2026-03-17 14:00
If you need history: query Bronze directly.
SCD TYPE 2 IN SILVER — for entities that need historical state
Some entities genuinely need SCD Type 2 in Silver.
Example: customers change city and we need to know their city at order time.
customers_scd2 table (Silver):
customer_sk ← surrogate key (unique per version)
customer_id ← business key (same across versions)
city ← tracked attribute
valid_from ← when this version became active
valid_to ← when this version was superseded (NULL = current)
is_current ← boolean flag for easy filtering
dbt snapshot generates SCD2 automatically:
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{ config(
target_schema = 'silver',
unique_key = 'customer_id',
strategy = 'check',
check_cols = ['city', 'tier'], -- track changes to these columns
invalidate_hard_deletes = True,
) }}
SELECT customer_id, city, tier, updated_at
FROM {{ source('silver', 'customers') }}
{% endsnapshot %}
dbt runs the snapshot, compares to current state, and:
- If city changed: expires old row (sets valid_to = today)
inserts new row (valid_from = today, valid_to = NULL)
- If unchanged: does nothing
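What the snapshot does on each run can be sketched in plain Python — a minimal 'check' strategy mirroring the expire-and-insert behaviour described above (row shape is illustrative):

```python
from datetime import date

def snapshot(scd2_rows: list, current: dict, today: date,
             check_cols=("city", "tier")) -> None:
    """
    Minimal SCD2 'check' strategy: if a tracked column changed,
    expire the current version and insert a new one; else do nothing.
    """
    live = [r for r in scd2_rows
            if r["customer_id"] == current["customer_id"] and r["valid_to"] is None]
    if live and all(live[0][c] == current[c] for c in check_cols):
        return  # unchanged — no new version
    if live:
        live[0]["valid_to"] = today  # expire the superseded version
    scd2_rows.append({**current, "valid_from": today, "valid_to": None})

history: list = []
snapshot(history, {"customer_id": 7, "city": "Bangalore", "tier": "gold"},
         date(2026, 1, 10))
snapshot(history, {"customer_id": 7, "city": "Hyderabad", "tier": "gold"},
         date(2026, 2, 1))
# history now holds two versions: Bangalore (expired 2026-02-01)
# and Hyderabad (current, valid_to = None).
```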
GOLD LAYER — Point-in-time joins using SCD2
The fct_orders model needs to know the customer's city
AT THE TIME OF THE ORDER, not the current city.
-- models/gold/fct_orders.sql
SELECT
o.order_id,
o.customer_id,
c.city AS customer_city_at_order_time,
o.order_amount,
o.created_at
FROM {{ ref('silver_orders') }} o
-- SCD2 join: find the version of the customer active at order time
JOIN {{ ref('customers_snapshot') }} c
ON o.customer_id = c.customer_id
AND o.created_at BETWEEN c.valid_from
AND COALESCE(c.valid_to, '9999-12-31')
This gives historically accurate customer city for every order.
If Priya moved from Bangalore to Hyderabad on 2026-02-01:
Orders before 2026-02-01: city = 'Bangalore'
Orders from 2026-02-01: city = 'Hyderabad'
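The dated join can also be read as a point-in-time lookup. A sketch using the Priya example — note that a half-open interval (`valid_from <= date < valid_to`) avoids double-matching the boundary day that an inclusive BETWEEN can produce:

```python
from datetime import date

# SCD2 versions for one customer (dates match the example above):
versions = [
    {"customer_id": 7, "city": "Bangalore",
     "valid_from": date(2025, 6, 1), "valid_to": date(2026, 2, 1)},
    {"customer_id": 7, "city": "Hyderabad",
     "valid_from": date(2026, 2, 1), "valid_to": None},
]

def city_at(versions: list, order_date: date) -> str:
    """Return the version whose validity window contains the order date."""
    for v in versions:
        end = v["valid_to"] or date(9999, 12, 31)  # NULL valid_to = still current
        if v["valid_from"] <= order_date < end:
            return v["city"]
    raise LookupError("no version covers this date")

print(city_at(versions, date(2026, 1, 15)))  # Bangalore
print(city_at(versions, date(2026, 3, 1)))   # Hyderabad
```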
IMPLEMENTATION DECISION TREE:
Does the entity change over time?
No: SCD Type 1 in Silver (MERGE overwrite) — simplest
Yes: Does the business need history?
No: SCD Type 1 in Silver — overwrite, no history needed
Yes: Does Gold need point-in-time join accuracy?
No: Silver has current state, that's sufficient for reporting
Yes: dbt snapshot → SCD Type 2 in Silver, Gold uses dated join

Tracing a Wrong Revenue Number Back to the Root Cause
The finance team reports that March 15th revenue in the dashboard shows ₹41,83,000 but the bank statement shows ₹42,15,400. The difference is ₹32,400 — 12 orders worth. You are asked to trace it.
STEP 1: Check Gold — is the dashboard reading the right table?
SELECT SUM(net_revenue) FROM gold.daily_store_revenue
WHERE order_date = '2026-03-15';
-- Returns: ₹41,83,000 (confirms dashboard is reading correct Gold table)
STEP 2: Check Silver — does Silver match Gold?
SELECT SUM(order_amount - discount_amount) AS silver_net
FROM silver.orders
WHERE DATE(created_at AT TIME ZONE 'Asia/Kolkata') = '2026-03-15'
AND status IN ('delivered', 'cancelled');
-- Returns: ₹41,83,000 (Gold and Silver agree)
-- Root cause is upstream of Gold — it is in Silver or Bronze.
STEP 3: Compare Silver row count to expected
SELECT COUNT(*) FROM silver.orders
WHERE DATE(created_at AT TIME ZONE 'Asia/Kolkata') = '2026-03-15';
-- Returns: 9,847 orders
SELECT COUNT(DISTINCT order_id) FROM source.orders
WHERE DATE(created_at AT TIME ZONE 'Asia/Kolkata') = '2026-03-15';
-- Returns: 9,859 orders ← 12 orders missing in Silver!
-- The 12 missing orders × avg ₹2,700 = ₹32,400 — matches the gap exactly.
STEP 4: Identify the missing orders
SELECT s.order_id
FROM source.orders s
LEFT JOIN silver.orders sv USING (order_id)
WHERE DATE(s.created_at AT TIME ZONE 'Asia/Kolkata') = '2026-03-15'
AND sv.order_id IS NULL;
-- Returns 12 order_ids. All have status='refunded'.
STEP 5: Check Silver validation rule
SELECT * FROM silver.orders
WHERE order_id IN (9284891, 9284892, ...) -- the 12 missing ones
-- Returns nothing — they were rejected from Silver.
-- Check DLQ:
SELECT error_message, COUNT(*) FROM pipeline.dead_letter_queue
WHERE pipeline_name = 'orders_incremental'
AND run_date = '2026-03-15'
AND status = 'pending'
GROUP BY 1;
-- Returns:
-- invalid_status: 'refunded' → 12 rows
-- Root cause found: the Silver validation rule has:
-- status IN ('placed','confirmed','delivered','cancelled')
-- 'refunded' was added to the source system on 2026-03-10
-- but was never added to the Silver validation allowlist.
-- 12 orders with status='refunded' were silently rejected to DLQ.
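A hypothetical sketch of the validation bug found above — the allowlist predates the source's new status, so the rows route to the DLQ instead of Silver (function and field names are illustrative, not the actual pipeline code):

```python
# The stale allowlist — 'refunded' was added upstream but never here:
VALID_STATUSES = {"placed", "confirmed", "delivered", "cancelled"}

def validate(row: dict, dead_letter_queue: list) -> bool:
    """Route rows with unknown statuses to the DLQ instead of dropping them."""
    if row["status"] not in VALID_STATUSES:
        dead_letter_queue.append(
            {**row, "error_message": f"invalid_status: '{row['status']}'"}
        )
        return False
    return True

dlq: list = []
validate({"order_id": 9284891, "status": "refunded"}, dlq)
# The row is rejected but preserved — queryable later with its error reason.
# The fix: add 'refunded' to VALID_STATUSES, then reprocess the DLQ.
```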
STEP 6: Fix
-- 1. Add 'refunded' to VALID_STATUSES in pipeline/validate.py
-- 2. Reprocess DLQ records for 2026-03-15:
python dlq_reprocess.py --pipeline orders_incremental --run-date 2026-03-15
-- 3. Silver is updated via MERGE — 12 new rows inserted.
-- 4. dbt run --select gold.daily_store_revenue --full-refresh (2026-03-15 partition)
-- 5. Dashboard now shows ₹42,15,400
TOTAL TIME: 24 minutes from investigation to resolved.
KEY ENABLER: The DLQ preserved the rejected records with their error reason.
Without DLQ: the 12 orders would have been silently lost with no trace.

The Medallion Architecture made this diagnosis possible. Bronze had the raw data. Silver had a clear rejection record in the DLQ. Gold was rebuilt from corrected Silver without re-extracting from source. The three-layer separation gave precise control over exactly which layer had the problem and exactly what to fix.
🎯 Key Takeaways
- ✓ The Medallion Architecture organises data into three layers with increasing quality guarantees. Bronze: raw and preserved. Silver: trusted, typed, deduplicated. Gold: business logic applied, ready to serve. Each layer has a contract — what it accepts, what it guarantees, who can use it.
- ✓ Bronze is append-only and immutable. It stores data exactly as received from the source. The only transformations are format conversion (to Parquet), Hive-style partitioning, and adding ingestion metadata. Never apply business logic, validation, or deduplication in Bronze.
- ✓ Silver is current entity state (one row per business key). Bronze is event history (all changes over time). The Silver MERGE collapses Bronze events into current state. For complete change history, query Bronze. For current state, query Silver. This separation is fundamental to the architecture.
- ✓ The Silver transformation checklist: type casting, deduplication with ROW_NUMBER(), NULL enforcement on required fields, business value validation, PII masking, referential validation. Rows that fail any check go to the DLQ — not to Silver, not silently discarded.
- ✓ Gold is built from Silver and is never the source of truth — Silver always is. When Gold is wrong, fix Silver and rebuild Gold. Gold serves specific consumers (finance dashboard, operations dashboard) — each Gold model is purpose-built, not a generic copy of Silver with aggregations.
- ✓ dbt handles Bronze→Silver staging and Silver→Gold transformations. Bronze ingestion (format conversion, partitioning) uses Spark/Python. Silver uses incremental materialisation with merge strategy. Gold uses table or incremental materialisation. The rule: SQL SELECT that fits in the warehouse → dbt. Python libraries, distributed compute, streaming → Spark/Python first.
- ✓ Late-arriving data is handled differently per layer. Bronze: append to current ingestion partition (preserve event_time from source). Silver: overlap window catches late Bronze rows, MERGE updates entity state. Gold: rebuild or merge on the affected date partition. The correct historical metric is restored without re-extracting from source.
- ✓ SCD handling in Medallion: Bronze has all history (any SCD type derivable). Silver defaults to SCD Type 1 (current state via MERGE). For dimensions that need history (customer city at order time), use dbt snapshots to create SCD Type 2 in Silver. Gold joins to the snapshot with date-range conditions for historically accurate point-in-time queries.
- ✓ Schema changes propagate in controlled steps. Bronze accepts new columns automatically (mergeSchema=true). Silver requires explicit code change (update the staging model SELECT, add tests, code review). Gold includes only what each consumer needs. A source schema change never silently breaks Silver or Gold — it is either automatic (Bronze) or requires deliberate code.
- ✓ The DLQ is the most valuable diagnostic tool in the Medallion Architecture. When Silver has fewer rows than expected, the DLQ shows exactly which rows were rejected and why. When a metric is wrong, the DLQ often contains the missing records with the validation error that explains the discrepancy. Without DLQ, missing data is invisible. With DLQ, it is queryable.