Slowly Changing Dimensions (SCD)
Every SCD type in depth — when each is the right choice, full SQL implementations, dbt snapshot patterns, and the operational pitfalls.
The Problem SCD Solves — When Dimension Attributes Change
Dimension tables describe business entities — customers, stores, products, employees. These entities are not static. A customer moves from Bangalore to Hyderabad. A store changes its manager. A product gets recategorised from "snacks" to "premium snacks." A salesperson moves from one region to another.
When a dimension attribute changes, you face a design question: what should happen to the historical facts that reference the old value? Should past orders show the customer's old city or their new city? Should historical sales reports show the product in its old category or its new one? The answer depends on the business question being answered — and "slowly changing dimension" patterns are the formalised set of answers.
Type 0 — Fixed, and Type 1 — Overwrite
SCD Type 0 — Fixed attributes
Type 0 attributes never change after initial load. They represent immutable facts about the entity. If a value arrives that differs from what is already stored, it is ignored — the original value is the correct one by definition.
TYPE 0 EXAMPLES:
customer.registration_date ← when the customer first registered (never changes)
customer.original_city ← city where the customer first signed up (immutable)
store.opening_date ← when the store opened (historical fact, fixed)
product.sku ← product identifier (never reassigned)
employee.hire_date ← when they joined the company
TYPE 0 IMPLEMENTATION:
On dimension load: INSERT new rows, NEVER update Type 0 columns.
-- dbt Silver model for customers:
INSERT INTO silver.customers (customer_id, registration_date, ...)
VALUES (...)
ON CONFLICT (customer_id) DO UPDATE SET
-- Type 0 columns NOT in the update list:
-- registration_date = EXCLUDED.registration_date ← omitted intentionally
-- Type 1 columns in the update list:
city = EXCLUDED.city,
tier = EXCLUDED.tier,
updated_at = EXCLUDED.updated_at
;
-- registration_date is never overwritten even if source sends a different value.
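The skip-Type-0-columns rule in the upsert above can be sketched in plain Python. The `upsert_customer` helper and the in-memory `customers` dict are illustrative stand-ins for the database merge, not part of any real pipeline:

```python
# Minimal in-memory sketch of a Type 0-safe upsert: incoming values for
# Type 0 columns are ignored once a row exists; Type 1 columns are overwritten.
TYPE0_COLS = {"registration_date"}

def upsert_customer(table: dict, row: dict) -> None:
    """table maps customer_id -> attribute dict."""
    key = row["customer_id"]
    if key not in table:
        table[key] = dict(row)          # first load: accept everything
        return
    existing = table[key]
    for col, value in row.items():
        if col in TYPE0_COLS:
            continue                    # Type 0: never overwrite
        existing[col] = value           # Type 1: overwrite in place

customers = {}
upsert_customer(customers, {"customer_id": 4201938,
                            "registration_date": "2024-01-15",
                            "city": "Bangalore"})
# Source later sends a different registration_date plus a genuine move:
upsert_customer(customers, {"customer_id": 4201938,
                            "registration_date": "2024-02-01",
                            "city": "Hyderabad"})
print(customers[4201938]["registration_date"])  # 2024-01-15 (preserved)
print(customers[4201938]["city"])               # Hyderabad (overwritten)
```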
-- Verification: confirm no Type 0 column was ever changed
SELECT customer_id, COUNT(DISTINCT registration_date) AS date_versions
FROM dimension_history_table
GROUP BY customer_id
HAVING COUNT(DISTINCT registration_date) > 1;
-- Returns: 0 rows — if any rows returned, a Type 0 column was changed incorrectly.

SCD Type 1 — Overwrite
Type 1 overwrites the existing value with the new value. No history is preserved — the dimension always shows the current state. Historical fact rows that were loaded when the old value was active now show the new value when joined to the dimension.
TYPE 1 EXAMPLES:
customer.phone_number ← updated when customer changes phone
customer.email ← updated when customer updates email
store.manager_name ← current manager (past manager irrelevant to most reports)
product.description ← updated when product copy is revised
store.is_active ← current operational status
TYPE 1 WHEN TO USE:
✓ The old value was genuinely wrong (data correction)
✓ History is not needed — reports always want current value
✓ The attribute has no analytical significance historically
✗ When historical accuracy matters for past events
(then use Type 2 instead)
TYPE 1 IMPLEMENTATION:
-- Upsert that overwrites changed attributes:
INSERT INTO dim_store
(store_sk, store_id, store_name, manager_name, is_active, updated_at)
VALUES
(1, 'ST001', 'FreshMart Koramangala', 'Rahul Sharma', TRUE, NOW())
ON CONFLICT (store_id)
DO UPDATE SET
manager_name = EXCLUDED.manager_name, -- Type 1: always overwrite
is_active = EXCLUDED.is_active, -- Type 1: always overwrite
updated_at = EXCLUDED.updated_at
;
EFFECT ON HISTORICAL FACT ROWS:
Before update: manager = 'Priya Nair'
After update: manager = 'Rahul Sharma' (overwritten)
fct_orders joined to dim_store WHERE store_id = 'ST001':
ALL historical orders now show manager_name = 'Rahul Sharma'
— even orders placed when Priya Nair was the manager.
This is the correct behaviour for Type 1.
If you want historical orders to show who the manager was at the time,
you need Type 2 — Type 1 explicitly gives up that capability.
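The "all historical orders show the new manager" effect is easy to see in a toy sketch. The dicts below are illustrative; the point is that the fact rows reference the store by natural key, so a join resolves to whatever value the dimension holds now:

```python
# Tiny sketch of the Type 1 effect: once manager_name is overwritten,
# every historical order resolves to the new manager on join.
dim_store = {"ST001": {"manager_name": "Priya Nair"}}
fct_orders = [
    {"order_id": 1, "store_id": "ST001", "order_date": "2024-06-10"},
    {"order_id": 2, "store_id": "ST001", "order_date": "2026-03-01"},
]

dim_store["ST001"]["manager_name"] = "Rahul Sharma"   # Type 1 overwrite

managers = {o["order_id"]: dim_store[o["store_id"]]["manager_name"]
            for o in fct_orders}
print(managers)   # both orders now show 'Rahul Sharma'
```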
TYPE 1 IN dbt:
-- Silver models use incremental merge with no special SCD logic:
{{ config(materialized='incremental', unique_key='store_id',
incremental_strategy='merge') }}
-- All tracked columns are in the merge update set.
-- No valid_from, valid_to, or is_current needed for Type 1.

SCD Type 2 — Full History Preserved
SCD Type 2 is the most important and most widely used SCD pattern. When a tracked attribute changes, a new row is inserted for the new version, and the old row is expired with a valid_to date. The dimension table accumulates one row per version per entity. The surrogate key uniquely identifies each version, enabling fact tables to join to the exact version that was active at the time of the fact.
Type 2 — the complete data model and change mechanics
DIM_CUSTOMER WITH SCD TYPE 2:
Initial state — customer 4201938 registered from Bangalore:
customer_sk customer_id city tier valid_from valid_to is_current
───────────────────────────────────────────────────────────────────────────────────
1 4201938 Bangalore silver 2024-01-15 NULL TRUE
Customer places order 9284751 on 2024-06-10:
fct_orders: order_sk=..., customer_sk=1, order_amount=380 ← joins to row 1
Customer moves to Hyderabad, updates profile on 2026-02-01:
CHANGE OPERATION:
-- Step 1: expire the current row
UPDATE dim_customer
SET valid_to = '2026-01-31',
is_current = FALSE
WHERE customer_id = 4201938
AND is_current = TRUE;
-- Step 2: insert the new version
INSERT INTO dim_customer
(customer_sk, customer_id, city, tier, valid_from, valid_to, is_current)
VALUES
(2, 4201938, 'Hyderabad', 'silver', '2026-02-01', NULL, TRUE);
RESULTING TABLE STATE:
customer_sk customer_id city tier valid_from valid_to is_current
────────────────────────────────────────────────────────────────────────────────
1 4201938 Bangalore silver 2024-01-15 2026-01-31 FALSE ← expired
2 4201938 Hyderabad silver 2026-02-01 NULL TRUE ← current
Customer places order 9284755 on 2026-03-01:
fct_orders: order_sk=..., customer_sk=2, order_amount=460 ← joins to row 2
POINT-IN-TIME QUERIES:
-- What city was customer 4201938 in when they placed order 9284751 (2024-06-10)?
SELECT c.city
FROM fct_orders f
JOIN dim_customer c ON f.customer_sk = c.customer_sk
WHERE f.order_sk = <order_sk_for_9284751>;
-- Returns: 'Bangalore' ← correct — the fact stored customer_sk=1 at load time
-- What is customer 4201938's current city?
SELECT city FROM dim_customer
WHERE customer_id = 4201938 AND is_current = TRUE;
-- Returns: 'Hyderabad' ← correct
-- Revenue by customer city, historically accurate:
SELECT c.city, SUM(f.order_amount)
FROM fct_orders f
JOIN dim_customer c ON f.customer_sk = c.customer_sk
GROUP BY c.city;
-- order_sk from 2024: joins to customer_sk=1 → Bangalore
-- order_sk from 2026: joins to customer_sk=2 → Hyderabad
-- Both cities get credit for orders placed when the customer was there ✓
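The surrogate-key join above can be mimicked with in-memory dicts to show why it stays point-in-time accurate: each fact row stored the `customer_sk` of the version active when the order was placed, so aggregation by city credits the right location. The data values are illustrative:

```python
# Each version gets its own surrogate key; fact rows pin the version.
dim_customer = {
    1: {"customer_id": 4201938, "city": "Bangalore"},   # 2024 version
    2: {"customer_id": 4201938, "city": "Hyderabad"},   # 2026 version
}
fct_orders = [
    {"order_sk": 100, "customer_sk": 1, "order_amount": 380},  # 2024 order
    {"order_sk": 101, "customer_sk": 2, "order_amount": 460},  # 2026 order
]

revenue_by_city = {}
for f in fct_orders:
    city = dim_customer[f["customer_sk"]]["city"]
    revenue_by_city[city] = revenue_by_city.get(city, 0) + f["order_amount"]
print(revenue_by_city)   # {'Bangalore': 380, 'Hyderabad': 460}
```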
WRONG APPROACH (joining on natural key with is_current):
JOIN dim_customer c ON f.customer_id = c.customer_id AND c.is_current = TRUE
-- This joins ALL orders (including 2024 ones) to the CURRENT version
-- The 2024 Bangalore order now shows 'Hyderabad' — historically wrong ✗

Type 2 — choosing which attributes to track
Not every dimension attribute should be Type 2. Applying Type 2 to all attributes creates excessive historical versions and makes queries complex. The decision rule is simple: does a change in this attribute affect the interpretation of historical facts?
Type 2 — handling multiple changes in one load
# PRODUCTION TYPE 2 LOAD PROCEDURE (Python):
# Handles: new entities, Type 2 tracked changes, Type 1 changes
from datetime import timedelta

def load_dim_customer_scd2(
    source_rows: list[dict],
    dest_conn,
    type2_columns: list[str],  # ['city', 'tier']
    type1_columns: list[str],  # ['phone_masked', 'email_hashed']
) -> dict:
    """
    Load dimension with SCD Type 2 for tracked columns.
    Type 2 columns: expire old row + insert new row on change.
    Type 1 columns: update in-place on current row (no new row).
    """
    stats = {'new': 0, 'type2_change': 0, 'type1_change': 0, 'unchanged': 0}
    for row in source_rows:
        customer_id = row['customer_id']
        # Find existing current row:
        existing = dest_conn.execute("""
            SELECT * FROM dim_customer
            WHERE customer_id = %s AND is_current = TRUE
        """, (customer_id,)).fetchone()
        if existing is None:
            # New entity — insert first version:
            sk = generate_surrogate_key(customer_id, row['updated_at'])
            dest_conn.execute("""
                INSERT INTO dim_customer
                    (customer_sk, customer_id, city, tier, phone_masked,
                     valid_from, valid_to, is_current)
                VALUES (%s, %s, %s, %s, %s, %s, NULL, TRUE)
            """, (sk, customer_id, row['city'], row['tier'],
                  row['phone_masked'], row['updated_at'].date()))
            stats['new'] += 1
            continue
        # Check Type 2 columns for changes:
        type2_changed = any(
            row[col] != existing[col] for col in type2_columns
        )
        if type2_changed:
            # Expire old row (valid_to = day before the new version starts):
            dest_conn.execute("""
                UPDATE dim_customer
                SET valid_to = %s, is_current = FALSE
                WHERE customer_sk = %s
            """, (row['updated_at'].date() - timedelta(days=1),
                  existing['customer_sk']))
            # Insert new version:
            sk = generate_surrogate_key(customer_id, row['updated_at'])
            dest_conn.execute("""
                INSERT INTO dim_customer
                    (customer_sk, customer_id, city, tier, phone_masked,
                     valid_from, valid_to, is_current)
                VALUES (%s, %s, %s, %s, %s, %s, NULL, TRUE)
            """, (sk, customer_id, row['city'], row['tier'],
                  row['phone_masked'], row['updated_at'].date()))
            stats['type2_change'] += 1
        else:
            # No Type 2 change — check Type 1:
            type1_changed = any(
                row[col] != existing[col] for col in type1_columns
            )
            if type1_changed:
                # Update in place (no new row, no expiry):
                dest_conn.execute("""
                    UPDATE dim_customer
                    SET phone_masked = %s, dim_updated_at = NOW()
                    WHERE customer_sk = %s
                """, (row['phone_masked'], existing['customer_sk']))
                stats['type1_change'] += 1
            else:
                stats['unchanged'] += 1
    dest_conn.commit()
    return stats

SCD Type 2 in dbt — Snapshots
dbt provides first-class support for SCD Type 2 through its snapshot feature. A dbt snapshot monitors a source query for changes to specified columns and automatically manages the valid_from, valid_to, and is_current columns. It is the standard way to implement Type 2 dimensions in a dbt-based platform.
-- STRATEGY 1: timestamp — detects changes via an updated_at column
-- Use when: source table has a reliable updated_at timestamp
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{ config(
target_database = 'freshmart_prod',
target_schema = 'snapshots',
unique_key = 'customer_id',
strategy = 'timestamp',
updated_at = 'updated_at', -- column dbt monitors for changes
invalidate_hard_deletes = True, -- expire rows when source row disappears
) }}
SELECT
customer_id,
customer_name,
email_hashed,
city,
state,
tier,
acquisition_channel,
registration_date,
updated_at
FROM {{ source('silver', 'customers') }}
WHERE is_current = TRUE
{% endsnapshot %}
-- dbt adds these columns automatically:
-- dbt_scd_id VARCHAR — unique ID per version (hash of key + dbt_valid_from)
-- dbt_updated_at TIMESTAMP — when dbt last processed this row
-- dbt_valid_from TIMESTAMP — when this version became active
-- dbt_valid_to TIMESTAMP — when this version expired (NULL = current)
-- STRATEGY 2: check — compares column values directly
-- Use when: no reliable updated_at, or when you need to track specific columns only
{% snapshot customers_snapshot %}
{{ config(
target_schema = 'snapshots',
unique_key = 'customer_id',
strategy = 'check',
check_cols = ['city', 'tier'], -- ONLY these columns trigger a new version
-- Changing phone_masked does NOT create a new version (Type 1 for that column)
invalidate_hard_deletes = True,
) }}
SELECT * FROM {{ source('silver', 'customers') }}
{% endsnapshot %}
-- HOW dbt SNAPSHOT RUNS:
-- dbt snapshot reads the source query
-- For each row, checks if any check_cols changed since last run
-- If changed: expires old row (sets dbt_valid_to = NOW())
-- inserts new row (dbt_valid_from = NOW(), dbt_valid_to = NULL)
-- If unchanged: no action
-- If row disappeared from source AND invalidate_hard_deletes=True:
-- expires the current row (marks it as deleted)
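The check-strategy run loop described above can be approximated in Python as a mental model. dbt actually generates MERGE SQL; this sketch, with the hypothetical `run_snapshot` helper, only mirrors the expire-and-insert semantics:

```python
# Illustrative model of dbt's 'check' strategy: compare check_cols between
# source and the current snapshot row; expire + insert a new version on change.
from datetime import datetime, timezone

def run_snapshot(snapshot: list[dict], source: list[dict],
                 unique_key: str, check_cols: list[str]) -> None:
    now = datetime.now(timezone.utc).isoformat()
    current = {r[unique_key]: r for r in snapshot if r["dbt_valid_to"] is None}
    for row in source:
        existing = current.get(row[unique_key])
        if existing is None:
            # New entity: first version
            snapshot.append({**row, "dbt_valid_from": now, "dbt_valid_to": None})
        elif any(row[c] != existing[c] for c in check_cols):
            existing["dbt_valid_to"] = now          # expire old version
            snapshot.append({**row, "dbt_valid_from": now, "dbt_valid_to": None})
        # unchanged rows: no action

snap: list[dict] = []
run_snapshot(snap, [{"customer_id": 4201938, "city": "Bangalore", "tier": "silver"}],
             "customer_id", ["city", "tier"])
run_snapshot(snap, [{"customer_id": 4201938, "city": "Hyderabad", "tier": "silver"}],
             "customer_id", ["city", "tier"])
print(len(snap))   # 2: expired Bangalore version + current Hyderabad version
```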
-- BUILDING dim_customer FROM THE SNAPSHOT:
-- models/gold/dims/dim_customer.sql
{{ config(materialized='table') }}
WITH snapshot AS (
SELECT * FROM {{ ref('customers_snapshot') }}
)
SELECT
{{ dbt_utils.generate_surrogate_key(['customer_id', 'dbt_valid_from']) }}
AS customer_sk,
customer_id,
customer_name,
email_hashed,
city,
CASE
WHEN state IN ('Karnataka','Tamil Nadu','Kerala','Andhra Pradesh','Telangana')
THEN 'South'
WHEN state IN ('Maharashtra','Gujarat','Goa') THEN 'West'
WHEN state IN ('Delhi','Uttar Pradesh','Haryana','Punjab','Rajasthan') THEN 'North'
ELSE 'East'
END AS region,
tier,
acquisition_channel,
registration_date,
CAST(dbt_valid_from AS DATE) AS valid_from,
CAST(dbt_valid_to AS DATE) AS valid_to,
CASE WHEN dbt_valid_to IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM snapshot
-- RUNNING SNAPSHOTS:
-- dbt snapshot ← run all snapshots
-- dbt snapshot -s customers_snapshot ← run one snapshot
-- IMPORTANT: dbt snapshot should run MORE FREQUENTLY than dbt run.
-- If a customer changes city twice in one day and snapshot only runs nightly:
-- the intermediate city is never captured — only the final day-end state.
-- For high-change dimensions: run snapshot every 15-30 minutes.

Snapshot backfill — what to do when deploying Type 2 to an existing table
CHALLENGE: you are deploying SCD Type 2 on the customers dimension for the
first time. The dimension currently exists as a Type 1 table (no history).
You need to populate the snapshot with the existing customer data.
OPTION A: full-refresh (simplest, loses any history that existed)
dbt snapshot --full-refresh
This drops and recreates the snapshot table from scratch.
All existing customers get one row with:
dbt_valid_from = NOW()
dbt_valid_to = NULL
is_current = TRUE
Result: going forward, all changes are captured. Past history: lost.
Acceptable when: no meaningful historical changes existed before this point.
OPTION B: seed historical versions from a separate data source
If you have an audit log, CDC history in Bronze, or source system history:
Build a seed file or staging model with historical versions:
customer_id city tier updated_at
4201938 Bangalore silver 2024-01-15 ← original registration
4201938 Hyderabad silver 2026-02-01 ← after move
Manually insert these into the snapshot table in the correct format
BEFORE running dbt snapshot for the first time.
Then dbt snapshot manages all future changes.
OPTION C: change_cols detection requires source data to carry history
If the Silver customers table has change history (via CDC + Bronze):
Create a staging model that produces one row per version:
SELECT customer_id, city, tier,
change_timestamp AS updated_at
FROM silver.customers_cdc_history
ORDER BY customer_id, change_timestamp
Point the dbt snapshot at this staging model.
dbt snapshot processes each row, creating version rows as they appear.
Result: full historical SCD2 table built from CDC history.
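Option C can also be done directly in Python: fold ordered CDC rows into SCD2 version rows. The `cdc_to_scd2` helper below is a sketch under the assumption of one CDC row per change; note it uses the dbt convention where a version's valid_to equals the successor's valid_from, rather than the day-before convention shown earlier:

```python
# Fold CDC change rows into SCD2 version rows (valid_from/valid_to) per customer.
from itertools import groupby
from operator import itemgetter

def cdc_to_scd2(cdc_rows: list[dict]) -> list[dict]:
    versions = []
    cdc_rows = sorted(cdc_rows, key=itemgetter("customer_id", "updated_at"))
    for _, group in groupby(cdc_rows, key=itemgetter("customer_id")):
        group = list(group)
        for i, row in enumerate(group):
            versions.append({
                "customer_id": row["customer_id"],
                "city": row["city"],
                "valid_from": row["updated_at"],
                # next change closes this version; the last version stays open
                "valid_to": group[i + 1]["updated_at"] if i + 1 < len(group) else None,
                "is_current": i + 1 == len(group),
            })
    return versions

history = [
    {"customer_id": 4201938, "city": "Bangalore", "updated_at": "2024-01-15"},
    {"customer_id": 4201938, "city": "Hyderabad", "updated_at": "2026-02-01"},
]
versions = cdc_to_scd2(history)   # two version rows, only the last current
```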
MONITORING SNAPSHOT HEALTH:
-- Check how many version rows exist per customer:
SELECT customer_id, COUNT(*) AS version_count
FROM customers_snapshot
GROUP BY customer_id
ORDER BY version_count DESC
LIMIT 20;
-- If max versions > 50: investigate — rapid changes may indicate bad data
-- Check for gaps in valid_from/valid_to continuity:
SELECT customer_id
FROM customers_snapshot
WHERE dbt_valid_to IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM customers_snapshot s2
WHERE s2.customer_id = customers_snapshot.customer_id
AND s2.dbt_valid_from = customers_snapshot.dbt_valid_to
)
-- Returns rows where an expired version has no successor — data gap

SCD Type 3 — Previous Value in a Separate Column
Type 3 adds a column to store the previous value of a tracked attribute, alongside the current value. It captures exactly one change — the current value and the immediately preceding value. It sacrifices full history for simplicity and the ability to query both current and previous values from a single row without any joins.
TYPE 3 TABLE STRUCTURE:
dim_customer:
customer_sk BIGINT PRIMARY KEY
customer_id BIGINT
city VARCHAR(100) ← CURRENT city
previous_city VARCHAR(100) ← PREVIOUS city (one level back)
city_changed_at DATE ← when the city last changed
tier VARCHAR(20) ← CURRENT tier
previous_tier VARCHAR(20) ← PREVIOUS tier
...
INITIAL STATE:
customer_sk customer_id city previous_city tier previous_tier
1 4201938 Bangalore NULL silver NULL
CUSTOMER MOVES TO HYDERABAD (2026-02-01):
UPDATE dim_customer
SET previous_city = city, -- save current → previous
city = 'Hyderabad', -- new current
city_changed_at = '2026-02-01'
WHERE customer_id = 4201938;
RESULTING ROW:
customer_sk customer_id city previous_city tier previous_tier
1 4201938 Hyderabad Bangalore silver NULL
QUERIES ENABLED BY TYPE 3:
-- Revenue from customers who recently moved to each city:
SELECT
city AS current_city,
SUM(CASE WHEN f.order_date > c.city_changed_at THEN f.order_amount ELSE 0 END)
AS revenue_after_move,
previous_city AS came_from
FROM fct_orders f
JOIN dim_customer c USING (customer_sk)
WHERE c.city_changed_at IS NOT NULL -- only customers who have moved
GROUP BY 1, 3;
TYPE 3 LIMITATIONS:
✗ Only one level of history (current + previous)
If customer moves again (Hyderabad → Mumbai):
previous_city becomes Hyderabad (Bangalore is LOST)
✗ No point-in-time accuracy for fact table joins
All orders always join to the same single row — no version control
A 2024 order in Bangalore and a 2026 order in Hyderabad both join
to the same row (current city = Mumbai eventually)
✗ Works only when the change trajectory is: old → new (two states)
Not suitable for attributes that change frequently
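The one-level-of-history loss is worth seeing concretely. A minimal sketch (the `move` helper is illustrative) shows the oldest value being discarded on the second change:

```python
# Type 3 limitation: a second move discards the oldest value.
row = {"city": "Bangalore", "previous_city": None}

def move(row: dict, new_city: str) -> None:
    row["previous_city"] = row["city"]   # current -> previous
    row["city"] = new_city               # new current

move(row, "Hyderabad")
move(row, "Mumbai")
print(row)   # {'city': 'Mumbai', 'previous_city': 'Hyderabad'}; Bangalore is gone
```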
TYPE 3 WHEN TO USE:
✓ When you need a simple "compare current vs previous" view
✓ When the attribute changes at most once or twice in the entity's lifetime
✓ When simplicity is more important than full history
✓ Common use case: sales territory reassignment
salesperson.territory: shows current territory + previous territory
"How did revenue change after the territory reshuffle?"
TYPE 3 IS RARELY THE BEST CHOICE:
Type 1: history not needed
Type 2: full point-in-time history needed
Type 3: somewhere between — often superseded by Type 6 (hybrid)

Type 4 — History Table, and Type 6 — The Hybrid
SCD Type 4 — separate history table
Type 4 keeps the current dimension table small and fast by separating all historical versions into a separate history table. The main dimension always contains only the current version. The history table contains all previous versions. This pattern is useful when the main dimension table is queried frequently for current values and must remain as lean as possible.
TYPE 4 STRUCTURE:
dim_customer (current versions only — lean table):
customer_sk customer_id city tier updated_at
──────────────────────────────────────────────────────
1 4201938 Hyderabad silver 2026-02-01
dim_customer_history (all historical versions):
customer_history_sk customer_id city tier valid_from valid_to
────────────────────────────────────────────────────────────────────────────
100 4201938 Bangalore silver 2024-01-15 2026-01-31
101 4201938 Hyderabad silver 2026-02-01 NULL
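A Type 4 change touches both tables: the lean current table is overwritten in place and the outgoing version is appended to the history table. A minimal sketch with in-memory stand-ins (the `apply_type4_change` helper is illustrative):

```python
# Type 4 change: overwrite the current table, archive the old version.
def apply_type4_change(dim: dict, history: list, customer_id: int,
                       new_city: str, change_date: str) -> None:
    old = dim[customer_id]
    history.append({**old, "valid_to": change_date})  # archive outgoing version
    dim[customer_id] = {**old, "city": new_city,
                        "updated_at": change_date}    # current table stays lean

dim_customer = {4201938: {"customer_id": 4201938, "city": "Bangalore",
                          "updated_at": "2024-01-15"}}
dim_customer_history = []
apply_type4_change(dim_customer, dim_customer_history,
                   4201938, "Hyderabad", "2026-02-01")
print(dim_customer[4201938]["city"])      # Hyderabad (one row per customer)
print(dim_customer_history[0]["city"])    # Bangalore (archived)
```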
BENEFITS:
dim_customer stays small → fast for current-state queries
dim_customer_history contains full history → available for audit / time travel
WHEN TYPE 4 IS USEFUL:
✓ Very large dimension tables where adding version rows slows down current queries
✓ When 95% of queries only need current values and the history table is rarely joined
✓ Compliance / audit use cases where a separate history table is required by policy
LIMITATION:
More complex to query — must choose between dim_customer (current) and
dim_customer_history (full history) depending on use case.
Most teams prefer Type 2 — keeping the version rows in one table is simpler.

SCD Type 6 — the hybrid (Type 1 + Type 2 + Type 3)
Type 6 combines Types 1, 2, and 3 in a single row. It preserves full historical accuracy (Type 2) while also making the current value of a tracked attribute available in every row (Type 1 overwrite) and storing the previous value in a separate column (Type 3). The result is a dimension that supports both historical analysis and simple current-state queries without joins to the current row.
TYPE 6 TABLE STRUCTURE:
dim_customer (Type 6 — history + current value in every row):
customer_sk BIGINT PK ← unique per version (surrogate)
customer_id BIGINT ← natural key
city VARCHAR ← city AS OF THIS VERSION (historical accuracy)
current_city VARCHAR ← current city for all versions (Type 1 overwrite)
previous_city VARCHAR ← previous city (Type 3)
tier VARCHAR ← tier AS OF THIS VERSION
current_tier VARCHAR ← current tier for all versions
valid_from DATE ← when this version became active
valid_to DATE ← when this version expired (NULL = current)
is_current BOOLEAN
TABLE STATE (customer moved Bangalore → Hyderabad):
customer_sk customer_id city current_city valid_from valid_to is_current
──────────────────────────────────────────────────────────────────────────────────────
1 4201938 Bangalore Hyderabad 2024-01-15 2026-01-31 FALSE
2 4201938 Hyderabad Hyderabad 2026-02-01 NULL TRUE
NOTE: current_city = 'Hyderabad' in BOTH rows, even the historical row.
city = 'Bangalore' in the historical row (point-in-time accurate).
QUERIES ENABLED BY TYPE 6:
-- Historical revenue by city (point-in-time accurate):
SELECT c.city AS historical_city, SUM(f.order_amount)
FROM fct_orders f JOIN dim_customer c ON f.customer_sk = c.customer_sk
GROUP BY c.city;
-- Uses c.city (the version-specific city) ← historically correct ✓
-- Current revenue by city (where customers ARE TODAY):
SELECT c.current_city, SUM(f.order_amount)
FROM fct_orders f JOIN dim_customer c ON f.customer_sk = c.customer_sk
GROUP BY c.current_city;
-- Uses c.current_city ← all orders attributed to Hyderabad (where customer is now) ✓
-- No need to join only to is_current=TRUE rows — current_city is in every row
-- Both queries from ONE fact table join — no separate dim query needed.
-- This is Type 6's key advantage over Type 2 alone.
UPDATE PROCEDURE (when city changes):
-- Step 1: update current_city in ALL existing rows for this customer:
UPDATE dim_customer
SET current_city = 'Hyderabad' -- Type 1 overwrite on all versions
WHERE customer_id = 4201938;
-- Step 2: expire the current row + insert new version (Type 2):
UPDATE dim_customer
SET valid_to = '2026-01-31', is_current = FALSE
WHERE customer_id = 4201938 AND is_current = TRUE;
INSERT INTO dim_customer
(customer_sk, customer_id, city, current_city, valid_from, valid_to, is_current)
VALUES (2, 4201938, 'Hyderabad', 'Hyderabad', '2026-02-01', NULL, TRUE);
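The two-step procedure above can be wrapped into one function over in-memory rows. This is a sketch: the `type6_city_change` helper is illustrative, and the expire date and effective date are passed separately because the old version closes the day before the new one opens:

```python
# Type 6 change: Type 1 overwrite of current_city on all versions,
# Type 2 expire of the current row, then insert of the new version.
def type6_city_change(rows: list[dict], customer_id: int, new_city: str,
                      expire_date: str, effective_date: str, new_sk: int) -> None:
    for r in rows:
        if r["customer_id"] == customer_id:
            r["current_city"] = new_city      # Step 1: overwrite on ALL versions
            if r["is_current"]:
                r["valid_to"] = expire_date   # Step 2a: expire current version
                r["is_current"] = False
    rows.append({"customer_sk": new_sk, "customer_id": customer_id,
                 "city": new_city, "current_city": new_city,
                 "valid_from": effective_date, "valid_to": None,
                 "is_current": True})         # Step 2b: insert new version

rows = [{"customer_sk": 1, "customer_id": 4201938, "city": "Bangalore",
         "current_city": "Bangalore", "valid_from": "2024-01-15",
         "valid_to": None, "is_current": True}]
type6_city_change(rows, 4201938, "Hyderabad",
                  expire_date="2026-01-31", effective_date="2026-02-01", new_sk=2)
print([r["current_city"] for r in rows])   # ['Hyderabad', 'Hyderabad']
```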
dbt SNAPSHOT does NOT natively support Type 6.
Type 6 requires a custom dbt macro or a Python pipeline.

SCD Type 7 — Dual Foreign Keys in the Fact Table
Type 7 solves the same problem as Type 6 — enabling both historical and current-state queries — but using two foreign keys in the fact table rather than redundant columns in the dimension. The fact table stores both a history_customer_sk (the surrogate key for the version active at the time of the fact) and a current_customer_sk (always pointing to the is_current=TRUE row). This keeps the dimension table pure Type 2 without any Type 1 overwrite columns.
SCD TYPE 7 TABLE STRUCTURE:
dim_customer (pure Type 2 — no current_city column needed):
customer_sk customer_id city tier valid_from valid_to is_current
───────────────────────────────────────────────────────────────────────────────
1 4201938 Bangalore silver 2024-01-15 2026-01-31 FALSE
2 4201938 Hyderabad silver 2026-02-01 NULL TRUE
fct_orders (with DUAL surrogate keys):
order_sk history_customer_sk current_customer_sk order_amount order_date
─────────────────────────────────────────────────────────────────────────────
100 1 2 380.00 2024-06-10
101 2 2 460.00 2026-03-01
history_customer_sk: the SK active at order time (stored at fact load time)
current_customer_sk: the SK of the current version (updated when customer changes)
QUERIES:
-- Historical revenue by city (point-in-time accurate):
SELECT c.city, SUM(f.order_amount)
FROM fct_orders f
JOIN dim_customer c ON f.history_customer_sk = c.customer_sk
GROUP BY c.city;
-- Order 100: joins to SK=1 → city='Bangalore'
-- Order 101: joins to SK=2 → city='Hyderabad' ✓ historically accurate
-- Current revenue by city (where customers are TODAY):
SELECT c.city, SUM(f.order_amount)
FROM fct_orders f
JOIN dim_customer c ON f.current_customer_sk = c.customer_sk
WHERE c.is_current = TRUE
GROUP BY c.city;
-- Both orders join to SK=2 → city='Hyderabad'
-- "How much revenue came from customers who are NOW in Hyderabad?" ✓
TYPE 7 COMPLEXITY:
Requires updating current_customer_sk in the fact table when
a customer's current version changes — this means updating fact rows,
which is expensive for large fact tables.
Most teams avoid this unless the use case specifically requires it.
Type 6 is more common in practice (current_city column in the dimension
row is cheaper to maintain than updating millions of fact table rows).

Choosing the Right SCD Type — The Decision Framework
| SCD Type | History preserved | Point-in-time joins | Current-state queries | Complexity | Best for |
|---|---|---|---|---|---|
| Type 0 (Fixed) | N/A — never changes | N/A | ✓ Simple | Lowest | Immutable attributes: registration date, original city, hire date |
| Type 1 (Overwrite) | ✗ No history | ✗ No — joins always to current | ✓ Simple — one row | Low | Corrections, contact info, flags where history irrelevant |
| Type 2 (Add Row) | ✓ Full | ✓ Yes — via surrogate key at load time | Need is_current=TRUE filter | Medium | Most tracked attributes — customer city/tier, product category |
| Type 3 (Add Column) | ✓ One level only | ✗ No — one row, no version control | ✓ Current column | Low–Medium | Attributes that change once: territory reassignment, store type upgrade |
| Type 4 (History Table) | ✓ Full (in history table) | ✓ Via history table join | ✓ Fast via main table | Medium–High | Very large dimensions where current queries must be fast |
| Type 6 (1+2+3) | ✓ Full | ✓ Via city column | ✓ Via current_city column (no filter) | High | When BOTH historical and current queries are equally important and frequent |
| Type 7 (Dual FK) | ✓ Full | ✓ Via history_sk FK | ✓ Via current_sk FK | Highest | Rare — when Type 6 overhead in dimension table is unacceptable |
DECISION TREE:
Does the attribute change?
No → Type 0 (fixed)
Yes → Does historical accuracy matter for analysis?
No → Type 1 (overwrite)
Yes → How many historical states do you need?
"Just current and previous" → Type 3
"Full history required" → Continue...
Do you need both historical and current queries from the same join?
No → Type 2 (standard — use is_current filter when needed)
Yes → Type 6 (add current_city column to dimension)
Is the dimension table very large (10M+ rows)?
Yes → Consider Type 4 (separate history table)
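The decision tree above can be encoded as a function. This is a sketch of the reasoning order, not a substitute for judgment, and the flag names are invented for illustration:

```python
# Encode the SCD decision tree; questions are asked in the same order
# as the tree above.
def choose_scd_type(changes: bool, history_matters: bool,
                    only_previous_needed: bool = False,
                    current_and_historical_joins: bool = False,
                    very_large_dimension: bool = False) -> str:
    if not changes:
        return "Type 0"
    if not history_matters:
        return "Type 1"
    if only_previous_needed:
        return "Type 3"
    if current_and_historical_joins:
        return "Type 6"
    if very_large_dimension:
        return "Type 4"
    return "Type 2"   # the default for tracked attributes

print(choose_scd_type(changes=True, history_matters=True))  # Type 2
```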
PRACTICAL GUIDANCE (2026):
80% of use cases: TYPE 2 (with dbt snapshot)
15% of use cases: TYPE 1 (for corrections and non-analytical attributes)
5% of use cases: TYPE 6, 3, or 4 (special requirements)
TYPE 7: almost never needed — Type 6 handles the same use case more simply
REAL EXAMPLES FROM FOOD DELIVERY PLATFORMS:
customer.city: Type 2 (revenue attribution changes with location)
customer.tier: Type 2 (LTV and cohort analysis by acquisition tier)
customer.phone_masked: Type 1 (contact info — never needed historically)
customer.registration_date: Type 0 (immutable — when they first joined)
store.manager_name: Type 1 (most reports — manager history not tracked)
store.region: Type 2 (if region changes affect territory reporting)
product.category: Type 2 (if sales reporting by category matters)
product.price: NOT in dimension — put price in the fact table as a fact

Discovering That a Key Dimension Was Never Type 2 — The Revenue Attribution Fix
The growth team presents a report showing Hyderabad revenue growing 50% in Q1 2026. The Bangalore team disputes the numbers — several major customers they know personally appear to be attributed to Hyderabad. You are asked to investigate.
-- Step 1: Check the customer dimension for suspected customers
SELECT customer_id, city, valid_from, valid_to, is_current
FROM dim_customer
WHERE customer_id IN (4201938, 4201939, 4201940, 4201941)
ORDER BY customer_id;
-- Returns only ONE row per customer:
-- 4201938 Hyderabad 2024-01-15 NULL TRUE
-- 4201939 Hyderabad 2024-03-02 NULL TRUE
-- etc.
-- Only one row per customer — no version history.
-- is_current = TRUE for all (only one version exists).
-- valid_from = their registration date (no type 2 tracking).
-- dim_customer was built with Type 1 (overwrite) — not Type 2.
-- Step 2: Check Bronze CDC history for actual customer location changes
SELECT customer_id, city, updated_at, _change_type
FROM bronze.customers_cdc
WHERE customer_id = 4201938
ORDER BY updated_at;
-- Returns:
-- 4201938 Bangalore 2024-01-15 insert ← registered in Bangalore
-- 4201938 Hyderabad 2026-02-01 update ← moved to Hyderabad
-- Confirmed: customer 4201938 was in Bangalore until 2026-02-01.
-- The dim_customer table overwrote 'Bangalore' with 'Hyderabad' (Type 1).
-- ALL historical orders from 2024 and 2025 now show city = 'Hyderabad'.
-- This is why Hyderabad revenue looks inflated and Bangalore looks deflated.
-- Step 3: Estimate the impact
SELECT
c.city AS wrong_city,
DATE_PART('year', f.order_date) AS year,
COUNT(*) AS affected_orders,
SUM(f.order_amount) AS misattributed_revenue
FROM fct_orders f
JOIN dim_customer c ON f.customer_sk = c.customer_sk
JOIN bronze.customers_cdc cdc ON f.customer_id = cdc.customer_id
AND f.order_date < '2026-02-01' -- orders placed before the move
WHERE c.city = 'Hyderabad' -- currently attributed to Hyderabad
AND cdc.city = 'Bangalore' -- but were actually in Bangalore
AND cdc._change_type = 'insert' -- initial registration city
GROUP BY 1, 2;
-- Shows: 18,234 orders, ₹1.47 crore misattributed from Bangalore to Hyderabad
-- MIGRATION PLAN:
-- 1. Build SCD Type 2 snapshot from Bronze CDC history
-- 2. Rebuild dim_customer with version rows from CDC
-- 3. Reload fct_orders — re-lookup customer_sk using the date-range join
-- 4. Rebuild Gold revenue models
-- Full migration: 4 days of engineering, 1 day of validation
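Step 3 of the plan, the surrogate-key re-lookup, amounts to a date-range join: for each order, find the dimension version whose validity window contains the order date. A sketch (the `lookup_sk` helper is illustrative; ISO date strings compare correctly as text):

```python
# Re-derive each order's customer_sk from the rebuilt Type 2 dimension.
def lookup_sk(versions: list[dict], order_date: str) -> int:
    for v in versions:
        if v["valid_from"] <= order_date and (
                v["valid_to"] is None or order_date <= v["valid_to"]):
            return v["customer_sk"]
    raise ValueError(f"no version covers {order_date}")

versions = [
    {"customer_sk": 1, "valid_from": "2024-01-15", "valid_to": "2026-01-31"},
    {"customer_sk": 2, "valid_from": "2026-02-01", "valid_to": None},
]
print(lookup_sk(versions, "2024-06-10"))   # 1 (Bangalore-era version)
print(lookup_sk(versions, "2026-03-01"))   # 2 (Hyderabad-era version)
```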
-- PREVENTION:
-- dbt snapshot runs every hour on the silver.customers source
-- strategy = 'check', check_cols = ['city', 'tier', 'state']
-- Future city changes create a new version row automatically

This is one of the most common SCD incidents in production — a dimension was built with Type 1 when the business question required Type 2. The data engineer who built it did not ask "do historical facts need to reflect the value that was true at the time?" Revenue attribution by city requires exactly that. The fix required rebuilding the dimension from the preserved Bronze CDC history — which is why preserving raw Bronze data is so valuable.
🎯 Key Takeaways
- ✓SCD patterns answer the question: when a dimension attribute changes, what should happen to the historical facts that referenced the old value? The answer depends entirely on whether historical accuracy matters for the business questions being answered.
- ✓Type 0 (fixed) — attribute never changes. Type 1 (overwrite) — update in place, no history kept. Type 2 (add row) — new row per version, full history. Type 3 (add column) — current + one previous value. Type 4 (history table) — separate table for history. Type 6 (hybrid) — full history + current value in every row. Type 7 (dual FK) — two surrogate keys in the fact table.
- ✓Type 2 is the most important and most widely used SCD pattern. When a tracked attribute changes: expire the current row (set valid_to, is_current=FALSE) and insert a new version row (valid_from=today, valid_to=NULL, is_current=TRUE). Surrogate keys uniquely identify each version, enabling point-in-time fact joins.
- ✓The key to correct Type 2 joins: the fact table must store the surrogate key at load time (the SK of the version active when the fact occurred). At query time, join on f.customer_sk = c.customer_sk — not on customer_id with is_current filter. The latter assigns all historical orders to the current version, destroying historical accuracy.
- ✓dbt snapshots implement Type 2 automatically. Two strategies: timestamp (uses updated_at column to detect changes — efficient but depends on accurate source timestamps) and check (compares listed column values directly — more explicit, works without a reliable updated_at). Run snapshots more frequently than dbt runs for high-change dimensions.
- ✓Not every attribute should be Type 2. Attribute needs Type 2 when: changing it would make historical fact analysis wrong. customer.city → Type 2 (revenue by city requires historical accuracy). customer.phone → Type 1 (no analytical use for historical phone). store.manager → depends on whether "performance by manager" is a business question.
- ✓Type 3 adds a previous_value column alongside the current value. Supports exactly one level of history. Useful for territory reassignments and once-in-a-lifetime changes. Unsuitable for frequently changing attributes or when full point-in-time accuracy is required.
- ✓Type 6 (hybrid 1+2+3) stores the current_city value in EVERY version row via Type 1 overwrite, while preserving the historically accurate city value for each version. Enables both "where was the customer when they ordered?" (use city column) and "where is the customer today?" (use current_city column) from the same join without extra filtering.
- ✓dbt snapshot operational pitfalls: running too infrequently loses intermediate states and causes valid_from dates to be set to snapshot run time rather than actual change time. Running too frequently (every minute on large tables) adds significant metadata table load. The right frequency matches the business's tolerance for staleness and the frequency of actual attribute changes.
- ✓The most common SCD incident in production: a dimension built with Type 1 when the business question required Type 2. Revenue attribution, cohort analysis, and territory performance all depend on historical accuracy. When diagnosed, the fix requires rebuilding the dimension from Bronze CDC history and reloading fact table surrogate keys. This is why Bronze CDC history preservation is so valuable.