Data Modelling — Dimensional Modelling, Star Schema, Facts and Dimensions
Dimensional modelling from first principles — grain, fact types, dimension design, surrogate keys, conformed dimensions, and the modern wide-table pattern.
Data Modelling — The Design Decision That Compounds
Data modelling is the discipline of organising data into structures that serve analytical questions efficiently and accurately. A poorly modelled Gold layer means analysts write complex SQL for simple questions, BI tools are slow, metrics disagree across reports, and the platform accrues technical debt that compounds every quarter. A well-modelled Gold layer means analysts answer new questions with familiar patterns, dashboards load in milliseconds, and the same metric always returns the same answer.
This module covers dimensional modelling — the dominant approach for analytical data modelling since Ralph Kimball formalised it in the 1990s and still the most relevant framework for data engineers in 2026. It also covers the modern wide-table pattern that has emerged as a practical complement or replacement for star schemas in lakehouse architectures.
Grain — The Most Important Decision in Dimensional Modelling
The grain of a fact table is the precise definition of what one row represents. Declaring the grain before any other design decision is the first and most critical step. Every column must be true at that grain. Violating the grain — mixing measurements at different levels of granularity — is the most common modelling mistake and the hardest to fix.
GRAIN EXAMPLES FOR A FOOD DELIVERY PLATFORM:
OPTION A: One row per order
Grain declaration: "Each row represents one order placed by a customer"
Valid fact columns at this grain:
order_amount ← total for this order ✓
discount_amount ← discount applied to this order ✓
delivery_fee ← delivery charge for this order ✓
num_items ← count of items in this order ✓
Valid dimension FKs at this grain:
customer_sk, store_sk, order_date_sk, payment_method_sk
VIOLATES this grain:
product_id ✗ — an order has many products (that is the line-item grain)
promo_code ✗ — an order might have multiple promos
OPTION B: One row per order line item
Grain declaration: "Each row represents one product line within one order"
Valid facts: line_quantity, line_amount, line_discount
Valid FKs: order_sk, product_sk, customer_sk, store_sk, date_sk
VIOLATES this grain:
delivery_fee ✗ — delivery applies to the whole order, not each line
OPTION C: One row per delivery event
Grain: "Each row represents one delivery attempt for one order"
Valid facts: delivery_duration_mins, distance_km, driver_rating
CHOOSING THE GRAIN:
Ask: what business questions must be answered at the most atomic level?
"What is total revenue by store per day?" → order grain ✓
"Which products sell most?" → line item grain ✓
RULE: declare the atomic grain first. Higher-level summaries are derived
by aggregating. You CANNOT disaggregate — a sum at order level cannot
tell you which product within the order contributed what.
Common mistake: choosing order grain then adding product_id "sometimes."
Adding product_id forces multiple rows per order → secretly changes grain
to line-item without declaring it → SUM(order_amount) double-counts.
Declare the grain explicitly and keep every column honest to it.
Fact Tables — Four Types, Each for a Different Use Case
Fact tables are not all alike. Kimball identified four distinct fact table types based on the nature of the business process being measured. Choosing the wrong type produces models that cannot answer the questions the business actually asks.
FACT TABLE TYPE 1: TRANSACTION FACT TABLE (most common)
One row per business transaction event.
Immutable — once written, never updated.
Row count grows with business volume.
fct_orders:
order_sk, customer_sk, store_sk, order_date_sk, payment_method_sk
order_amount DECIMAL, discount_amount DECIMAL, delivery_fee DECIMAL
num_items INT, is_first_order BOOLEAN
Use when: measuring events that occurred — orders, payments, clicks.
FACT TABLE TYPE 2: PERIODIC SNAPSHOT FACT TABLE
One row per entity per time period — state at regular intervals.
Appends new snapshot for each period — does NOT replace old rows.
Row count = entities × time periods.
fct_daily_inventory:
snapshot_date_sk INT, product_sk BIGINT, store_sk BIGINT
units_on_hand INT, units_received INT, units_sold INT
days_of_supply DECIMAL
Use when: tracking ongoing state — inventory levels, account balances,
open tickets. "How many?" questions at a specific point in time.
FACT TABLE TYPE 3: ACCUMULATING SNAPSHOT FACT TABLE
One row per business process instance — updated as process progresses.
Tracks lifecycle through multiple stages.
Row count = number of process instances (not events).
fct_order_fulfillment:
order_sk BIGINT
placed_date_sk INT, confirmed_date_sk INT
picked_up_date_sk INT, delivered_date_sk INT, cancelled_date_sk INT
current_status VARCHAR(20)
confirm_lag_minutes DECIMAL, pickup_lag_minutes DECIMAL
delivery_minutes DECIMAL, total_fulfillment_minutes DECIMAL
Use when: tracking multi-step processes — order fulfillment, loan approvals.
Rows are UPDATED as stages complete — unlike transaction facts.
Multiple date FKs — one per stage.
FACT TABLE TYPE 4: FACTLESS FACT TABLE
No numeric facts — just dimension FKs recording that an event occurred.
fct_promotional_coverage:
promo_date_sk INT, product_sk BIGINT
store_sk BIGINT, promotion_sk BIGINT
(no numeric columns — the ROW EXISTING is the fact)
Use when: "Did this product run this promotion at this store on this date?"
Bridge tables for many-to-many relationships between dimensions.
Query: "which promoted products had zero sales?"
LEFT JOIN fct_orders WHERE fct_orders.order_sk IS NULL
Additive, semi-additive, and non-additive facts
ADDITIVE: can be summed across ALL dimensions
order_amount: SUM across stores ✓ SUM across dates ✓ SUM across customers ✓
delivery_fee: fully additive
Most numeric business measures are additive.
Store atomic values — do NOT pre-aggregate in the fact table.
SEMI-ADDITIVE: can be summed across SOME dimensions (not time)
units_on_hand (inventory):
SUM across stores ✓ (total inventory across all stores on one day)
SUM across dates ✗ (Mon + Tue + Wed inventory is meaningless)
For time dimension: use AVG, MAX, or MIN — never SUM.
account_balance:
SUM across accounts ✓, SUM across months ✗ (snapshot, not flows)
NON-ADDITIVE: cannot be meaningfully summed across ANY dimension
is_first_order (boolean): COUNT(WHERE is_first_order = TRUE)
cancellation_rate (ratio): compute from numerator/denominator at query time
avg_order_value: store order_amount and an order count; compute the average at query time
RULE: never store derived ratios or percentages in fact tables.
Store components (numerator, denominator) and compute ratios at query time.
Storing cancellation_rate = 0.12 is correct today, wrong if denominator changes.
Storing cancellation_count=12 and order_count=100 is always correct.
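A tiny plain-Python sketch of why components beat pre-computed rates — the 12/100 figures mirror the example above:

```python
# Components stored in the fact/mart layer:
cancellation_count, order_count = 12, 100
stored_rate = cancellation_count / order_count  # 0.12 — correct today

# Late-arriving orders change the denominator:
order_count += 20

# Recomputing from components stays correct; the stored 0.12 is now stale.
fresh_rate = cancellation_count / order_count
print(round(stored_rate, 2), round(fresh_rate, 2))  # 0.12 0.1
```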
DEGENERATE DIMENSIONS:
A dimension with only one attribute (the key itself), no descriptive context.
Store it directly in the fact table — no separate dimension table needed.
Examples: order_number, invoice_number, transaction_id.
fct_orders.order_number VARCHAR(50) — no dim_order_number table.
Dimension Tables — The Context That Makes Facts Meaningful
A fact without context is just a number. ₹380 has no analytical value until you know it was an order from a premium customer in Bangalore at a FreshMart store on a weekday evening. Dimension tables provide that context. Understanding what belongs in dimensions, how to structure them, and how surrogate keys work is essential for building models analysts can use intuitively.
DIMENSION TABLE: dim_customer
CREATE TABLE gold.dim_customer (
customer_sk BIGINT NOT NULL PRIMARY KEY, -- surrogate key
customer_id BIGINT NOT NULL, -- natural key
customer_name VARCHAR(200),
email_hashed VARCHAR(64) NOT NULL, -- PII masked
phone_masked VARCHAR(20),
registration_date DATE,
city VARCHAR(100),
state VARCHAR(50),
region VARCHAR(50), -- hierarchy: city → state → region
tier VARCHAR(20), -- standard/silver/gold/platinum
acquisition_channel VARCHAR(50),
-- SCD Type 2 tracking:
valid_from DATE NOT NULL,
valid_to DATE, -- NULL = current version
is_current BOOLEAN NOT NULL DEFAULT TRUE,
dim_updated_at TIMESTAMPTZ NOT NULL
);
DIMENSION TABLE: dim_date (always pre-built, shared across all fact tables)
CREATE TABLE gold.dim_date (
date_sk INT NOT NULL PRIMARY KEY, -- integer YYYYMMDD: 20260317
full_date DATE NOT NULL,
day_of_week INT NOT NULL, -- 1=Monday, 7=Sunday
day_name VARCHAR(10),
day_of_month INT,
week_of_year INT,
month_number INT,
month_name VARCHAR(10),
quarter INT,
year INT,
fiscal_year INT,
fiscal_quarter INT,
is_weekday BOOLEAN,
is_weekend BOOLEAN,
is_holiday BOOLEAN,
holiday_name VARCHAR(100), -- 'Diwali', 'Independence Day', ...
is_sale_day BOOLEAN,
season VARCHAR(20) -- 'festive', 'regular', 'summer'
);
-- Generate 2020-01-01 through 2030-12-31 (~4,000 rows — tiny table)
-- dbt: {{ dbt_utils.date_spine(datepart="day", start_date="'2020-01-01'",
-- end_date="'2030-12-31'") }}
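For environments without dbt, a plain-Python sketch of what the date spine produces — one row per day with derived attributes (holiday and fiscal columns omitted for brevity):

```python
from datetime import date, timedelta

def date_dimension(start: date, end: date):
    """Build one dict per calendar day between start and end, inclusive."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_sk": int(d.strftime("%Y%m%d")),  # integer YYYYMMDD key
            "full_date": d.isoformat(),
            "day_of_week": d.isoweekday(),         # 1=Monday, 7=Sunday
            "month_number": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
            "is_weekend": d.isoweekday() >= 6,
        })
        d += timedelta(days=1)
    return rows

rows = date_dimension(date(2020, 1, 1), date(2030, 12, 31))
print(len(rows))            # 4018 — eleven years is still a tiny table
print(rows[0]["date_sk"])   # 20200101
```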
DIMENSION TABLE: dim_store (hierarchy embedded — denormalised)
CREATE TABLE gold.dim_store (
store_sk BIGINT NOT NULL PRIMARY KEY,
store_id VARCHAR(10) NOT NULL, -- natural key: 'ST001'
store_name VARCHAR(200),
store_type VARCHAR(50), -- 'dark_store', 'franchise', 'owned'
city VARCHAR(100),
city_tier VARCHAR(10), -- 'tier1', 'tier2', 'tier3'
state VARCHAR(50),
region VARCHAR(50), -- hierarchy all in ONE table (not snowflaked)
latitude DECIMAL(9,6),
longitude DECIMAL(9,6),
opening_date DATE,
is_active BOOLEAN,
city_population BIGINT,
city_metro_area VARCHAR(100)
);
HIERARCHY DESIGN NOTE:
dim_store embeds city, state, and region directly (denormalised).
Do NOT create a separate dim_city table and join dim_store → dim_city.
That is "snowflaking" — adds join complexity for negligible benefit.
Analysts filter by dim_store.region — no extra join needed.
Surrogate keys vs natural keys — why surrogates are non-negotiable
NATURAL KEY: identifier from the source system
customer_id = 4201938 (from PostgreSQL application DB)
store_id = 'ST001' (from store management system)
SURROGATE KEY: warehouse-generated integer, one per dimension row
customer_sk = 1 (first row in dim_customer — Bangalore version)
customer_sk = 2 (second row — same customer after moving to Hyderabad)
WHY SURROGATE KEYS ARE REQUIRED:
REASON 1: SCD Type 2 — each historical version needs a unique key
Customer 4201938 moved from Bangalore to Hyderabad on 2026-02-01.
dim_customer has two rows:
customer_sk=1, customer_id=4201938, city='Bangalore', valid_from=2024-01-15, valid_to=2026-01-31
customer_sk=2, customer_id=4201938, city='Hyderabad', valid_from=2026-02-01, valid_to=NULL
Fact table stores customer_sk at load time:
ORDER 9284751 placed 2026-01-10: stored customer_sk=1 → city='Bangalore' ✓
ORDER 9284755 placed 2026-03-01: stored customer_sk=2 → city='Hyderabad' ✓
Without surrogates: join on customer_id matches BOTH dimension rows.
With is_current=TRUE filter: ALL orders show 'Hyderabad' — historically wrong.
Surrogates are the only correct solution for point-in-time accuracy.
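The Bangalore→Hyderabad example can be verified directly. A self-contained sqlite3 sketch with the same (hypothetical) IDs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_sk INT, customer_id INT,
                           city TEXT, is_current INT);
INSERT INTO dim_customer VALUES (1, 4201938, 'Bangalore', 0),
                                (2, 4201938, 'Hyderabad', 1);
CREATE TABLE fct_orders (order_id INT, customer_id INT, customer_sk INT);
INSERT INTO fct_orders VALUES (9284751, 4201938, 1),  -- placed in Bangalore
                              (9284755, 4201938, 2);  -- placed after the move
""")
# Natural-key join + is_current filter: history is rewritten.
wrong = con.execute("""SELECT d.city FROM fct_orders f
    JOIN dim_customer d ON f.customer_id = d.customer_id AND d.is_current = 1
    WHERE f.order_id = 9284751""").fetchone()[0]
# Surrogate-key join: the order keeps the city that was true at order time.
right = con.execute("""SELECT d.city FROM fct_orders f
    JOIN dim_customer d ON f.customer_sk = d.customer_sk
    WHERE f.order_id = 9284751""").fetchone()[0]
print(wrong, right)  # Hyderabad Bangalore
```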
REASON 2: Source system independence
Source migrates customer_id from integer to UUID in 2027.
WITHOUT surrogate keys: must update millions of fact table FK columns.
WITH surrogate keys: customer_sk=1 remains valid, unchanged.
Only dim_customer.customer_id column changes — fact table untouched.
REASON 3: Multiple source system integration
FreshMart acquires a competitor. Both had customer_id = 4201938.
Surrogate keys: assign unique customer_sk per entity — no collision.
Without surrogates: manual prefix/remap of all customer IDs — painful.
REASON 4: Join performance
INTEGER FK joins faster than VARCHAR or UUID.
Integer equality: O(1). VARCHAR: character-by-character comparison.
SURROGATE KEY GENERATION in dbt:
{{ dbt_utils.generate_surrogate_key(['customer_id', 'valid_from']) }}
Returns a deterministic hash — same input always produces same key.
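A simplified Python analogue of that behaviour — the real dbt_utils macro additionally coalesces NULLs to a sentinel and casts every field to string before hashing; this sketch keeps only the core idea:

```python
import hashlib

def surrogate_key(*fields) -> str:
    """md5 of the '-'-joined field values: deterministic, not sequential."""
    joined = "-".join(str(f) for f in fields)
    return hashlib.md5(joined.encode()).hexdigest()

k1 = surrogate_key(4201938, "2024-01-15")
k2 = surrogate_key(4201938, "2024-01-15")  # rerun → identical key
k3 = surrogate_key(4201938, "2026-02-01")  # new SCD2 version → new key
print(k1 == k2, k1 == k3)  # True False
```

Same input always yields the same key, so reloading a dimension cannot mint duplicate keys for existing versions.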
Enables idempotent dimension loads — safe to rerun without creating duplicates.
Star Schema — The Centrepiece of Dimensional Modelling
A star schema places one fact table at the centre, surrounded by dimension tables connected to it via foreign keys. The shape resembles a star. It is the standard structure for analytical models because it is queryable with simple, predictable SQL patterns and explains itself visually to analysts who are not data engineers.
STAR SCHEMA (ASCII diagram):

                          dim_date
                        (date_sk PK)
                             │ order_date_sk FK
                             ▼
                         fct_orders
                      ┌──────────────┐
  dim_customer ──────►│ order_sk PK  │◄────── dim_store
 (customer_sk PK)     │ customer_sk  │       (store_sk PK)
                      │ store_sk     │
 dim_payment_method ─►│ date_sk      │
  (payment_sk PK)     │ payment_sk   │
                      │ order_amount │ ← FACTS
                      │ discount_amt │
                      │ delivery_fee │
                      │ num_items    │
                      └──────────────┘
CANONICAL STAR SCHEMA QUERY PATTERN:
SELECT
d.year,
d.month_name,
s.city,
s.region,
c.tier AS customer_tier,
SUM(f.order_amount) AS gross_revenue,
SUM(f.discount_amount) AS total_discount,
SUM(f.order_amount - f.discount_amount) AS net_revenue,
COUNT(DISTINCT f.order_sk) AS order_count,
COUNT(DISTINCT f.customer_sk) AS unique_customers
FROM gold.fct_orders f
JOIN gold.dim_date d ON f.order_date_sk = d.date_sk
JOIN gold.dim_store s ON f.store_sk = s.store_sk
JOIN gold.dim_customer c ON f.customer_sk = c.customer_sk
JOIN gold.dim_payment_method p ON f.payment_method_sk = p.payment_sk
WHERE d.year = 2026
AND d.quarter = 1
AND s.region = 'South India'
AND c.is_current = TRUE
GROUP BY 1, 2, 3, 4, 5
ORDER BY 1, 2;
WHY STAR SCHEMA QUERIES ARE FAST:
1. Fact table has only integer FKs and numeric facts — narrow, fast to scan.
2. Dimension joins use integer equality — fastest join type.
3. Filter on dim attributes prunes dimensions first:
WHERE s.region = 'South India' → 2 store rows → only their fact rows scanned.
4. Aggregations operate on pre-filtered fact subsets — small, fast.
5. Columnar storage: only joined FK columns + aggregate columns read from disk.
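The whole pattern fits in a self-contained sqlite3 sketch — toy data, with names mirroring the example tables above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store (store_sk INT PRIMARY KEY, city TEXT, region TEXT);
INSERT INTO dim_store VALUES (1, 'Bangalore', 'South India'),
                             (2, 'Delhi', 'North India');
CREATE TABLE dim_date (date_sk INT PRIMARY KEY, year INT, quarter INT);
INSERT INTO dim_date VALUES (20260115, 2026, 1), (20260420, 2026, 2);
CREATE TABLE fct_orders (order_sk INT, store_sk INT,
                         order_date_sk INT, order_amount REAL);
INSERT INTO fct_orders VALUES (1, 1, 20260115, 500.0),
                              (2, 2, 20260115, 300.0),
                              (3, 1, 20260420, 200.0);
""")
# Filter on dimension attributes, aggregate the surviving fact rows.
rows = con.execute("""
    SELECT s.region, SUM(f.order_amount)
    FROM fct_orders f
    JOIN dim_date d ON f.order_date_sk = d.date_sk
    JOIN dim_store s ON f.store_sk = s.store_sk
    WHERE d.year = 2026 AND d.quarter = 1 AND s.region = 'South India'
    GROUP BY s.region""").fetchall()
print(rows)  # [('South India', 500.0)] — only Q1 South India orders survive
```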
STAR VS SNOWFLAKE SCHEMA:
3NF (snowflaked): store → city → state → region (3 extra joins for region)
Star (correct): region is a column in dim_store (0 extra joins)
VERDICT: always star. Snowflake schemas add join complexity for negligible benefit.
Dimension tables are small — redundant city names in dim_store cost bytes, not GBs.
Use snowflake schema only when the dimension itself has millions of rows.
Conformed Dimensions — The Integration Layer
A conformed dimension is shared across multiple fact tables with identical structure and meaning. When dim_customer is used in both fct_orders and fct_payments with the same surrogate keys and the same attributes, it is conformed. This enables drilling across fact tables — comparing order metrics to payment metrics for the same customer in the same query.
CONFORMED DIMENSION: dim_customer shared across two fact tables
  fct_orders                           fct_payments
 ┌──────────────┐                    ┌──────────────┐
 │ customer_sk ─┼──────┐      ┌──────┼─ customer_sk │
 │ order_amount │      │      │      │ payment_amt  │
 └──────────────┘      ▼      ▼      └──────────────┘
                 ┌──────────────┐
                 │ dim_customer │ ← CONFORMED: same table, same SK, same meaning
                 │ customer_sk  │   used by BOTH fact tables
                 │ tier, city   │
                 └──────────────┘
CROSS-PROCESS QUERY:
-- Payment success rate by customer tier:
SELECT
c.tier,
COUNT(DISTINCT o.order_sk) AS total_orders,
COUNT(DISTINCT p.payment_sk) AS successful_payments,
ROUND(COUNT(DISTINCT p.payment_sk)::NUMERIC
/ COUNT(DISTINCT o.order_sk), 3) AS payment_success_rate
FROM gold.fct_orders o
JOIN gold.dim_customer c USING (customer_sk)
LEFT JOIN gold.fct_payments p ON o.order_sk = p.order_sk
WHERE p.status = 'captured' OR p.status IS NULL
GROUP BY c.tier;
-- Works ONLY because customer_sk=1 means the same customer in BOTH marts.
WHAT BREAKS CROSS-PROCESS ANALYSIS:
Each mart builds its own customer dimension with different SK numbering:
customer_sk=1 in orders mart = customer 4201938
customer_sk=1 in payments mart = customer 4201939 ← DIFFERENT CUSTOMER
Cross-mart join produces nonsense silently.
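A minimal illustration of that silent failure, with invented IDs — each mart assigns surrogate keys independently, so the same SK value points at different customers:

```python
# customer_sk → natural customer_id, as assigned by each mart's own load job
orders_mart = {1: 4201938, 2: 4201939}
payments_mart = {1: 4201939, 2: 4201938}  # built separately: numbering differs

# Every cross-mart join on customer_sk pairs two DIFFERENT customers:
mismatches = [sk for sk in orders_mart if orders_mart[sk] != payments_mart[sk]]
print(mismatches)  # [1, 2] — no error is raised; the results are just wrong
```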
CONFORMED dim_date: ALWAYS conformed. Every fact table uses the same dim_date.
Never build a separate date dimension per fact table.
date_sk=20260317 means March 17, 2026 everywhere.
JUNK DIMENSIONS:
Low-cardinality flags that do not belong in any existing dimension:
is_promo_order (Y/N), is_late_delivery (Y/N), is_first_order (Y/N)
Consolidate into a single "junk dimension" table:
CREATE TABLE gold.dim_order_flags (
order_flags_sk INT PRIMARY KEY, -- pre-built all combinations
is_promo_order BOOLEAN,
is_late_delivery BOOLEAN,
is_first_order BOOLEAN,
is_weekend_order BOOLEAN
);
-- 16 rows for 4 boolean flags (2^4)
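Pre-building every flag combination is a one-liner — a hypothetical Python sketch using the same four flags:

```python
from itertools import product

flags = ["is_promo_order", "is_late_delivery",
         "is_first_order", "is_weekend_order"]
# One row per combination, keyed by a small integer surrogate key:
rows = [dict(zip(flags, combo), order_flags_sk=i)
        for i, combo in enumerate(product([False, True], repeat=len(flags)),
                                  start=1)]
print(len(rows))  # 16 — the complete junk dimension, built once
```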
-- Fact table: one FK order_flags_sk → replaces 4 individual columns
Wide Tables — The Practical Alternative to Star Schemas
The star schema was designed for row-oriented databases where joins were expensive. Columnar lakehouse engines (Spark, BigQuery, Databricks) are efficient at joins — but they excel even more at scanning wide tables because column pruning eliminates unused column I/O. This has led to the wide table (One Big Table / OBT) pattern as a practical complement or replacement for star schemas.
WIDE TABLE (OBT — One Big Table):
All dimension attributes denormalised into a single, very wide fact table.
No joins required at query time.
fct_orders_wide (80 columns, no joins needed):
┌──────────────────────────────────────────────────────────────────┐
│ ORDER: order_id, order_amount, discount_amount, delivery_fee, │
│ num_items, status, created_at, order_date, order_tier │
│ CUSTOMER: customer_id, customer_tier, customer_city, region, │
│ acquisition_channel │
│ STORE: store_id, store_name, store_city, store_type, store_region│
│ PAYMENT: payment_method, payment_status, captured_at │
│ DELIVERY: delivery_minutes, delivery_partner, driver_rating │
└──────────────────────────────────────────────────────────────────┘
WIDE TABLE QUERY:
SELECT store_region, customer_tier, SUM(order_amount) AS revenue
FROM gold.fct_orders_wide
WHERE order_date = '2026-03-17'
GROUP BY 1, 2;
-- ZERO joins. Columnar storage: only the 4 columns above read from disk.
WHEN WIDE TABLES WIN:
✓ Non-technical analysts / auto-generated BI SQL — no joins to learn
✓ Columnar lakehouse engines — wide scans are efficient
✓ Moderate data volume (< 100M rows) — storage duplication acceptable
✓ Attributes change slowly — SCD2 in a wide table is complex
WHEN STAR SCHEMA WINS:
✓ Attributes change frequently — SCD2 simpler with separate dim tables
✓ Multiple fact tables at different grains — cannot embed all into one table
✓ Very large fact tables (billions of rows) — redundant attributes cost storage
✓ Complex cross-process analysis via conformed dimensions
THE PRAGMATIC HYBRID (2026 recommendation):
Build star schema for the canonical model (SCD2, conformed dimensions).
Derive a wide table for BI tool consumption:
-- models/gold/fct_orders_wide.sql
SELECT
f.*,
c.tier AS customer_tier, c.city AS customer_city, c.region AS customer_region,
s.store_name, s.store_region,
p.payment_method
FROM gold.fct_orders f
JOIN gold.dim_customer c ON f.customer_sk = c.customer_sk
JOIN gold.dim_store s ON f.store_sk = s.store_sk
JOIN gold.dim_payment_method p ON f.payment_method_sk = p.payment_sk
-- Analysts query fct_orders_wide.
-- Modelling rigour lives in the star schema.
-- Best of both worlds.
Implementing Dimensional Models in dbt
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{ config(
target_schema = 'snapshots',
unique_key = 'customer_id',
strategy = 'check',
check_cols = ['city', 'state', 'tier', 'acquisition_channel'],
invalidate_hard_deletes = True,
) }}
SELECT customer_id, customer_name, email_hashed, city, state, tier,
acquisition_channel, registration_date, updated_at
FROM {{ source('silver', 'customers') }}
WHERE is_current = TRUE
{% endsnapshot %}
-- models/gold/dims/dim_customer.sql
{{ config(materialized='table', unique_key='customer_sk') }}
WITH snapshot AS (SELECT * FROM {{ ref('customers_snapshot') }})
SELECT
{{ dbt_utils.generate_surrogate_key(['customer_id', 'dbt_valid_from']) }}
AS customer_sk,
customer_id,
INITCAP(TRIM(customer_name)) AS customer_name,
email_hashed,
LOWER(TRIM(city)) AS city,
LOWER(TRIM(state)) AS state,
CASE
WHEN state IN ('Karnataka','Tamil Nadu','Kerala','Andhra Pradesh','Telangana') THEN 'South'
WHEN state IN ('Maharashtra','Gujarat','Goa') THEN 'West'
WHEN state IN ('Delhi','Uttar Pradesh','Haryana','Punjab','Rajasthan') THEN 'North'
ELSE 'East'
END AS region,
tier,
acquisition_channel,
registration_date,
dbt_valid_from AS valid_from,
dbt_valid_to AS valid_to,
CASE WHEN dbt_valid_to IS NULL THEN TRUE ELSE FALSE END AS is_current,
CURRENT_TIMESTAMP() AS dim_updated_at
FROM snapshot
-- models/gold/facts/fct_orders.sql
{{ config(
materialized='incremental',
unique_key='order_sk',
incremental_strategy='merge',
file_format='delta',
) }}
WITH orders AS (
SELECT * FROM {{ ref('silver_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(order_updated_at) FROM {{ this }})
{% endif %}
)
SELECT
{{ dbt_utils.generate_surrogate_key(['o.order_id']) }} AS order_sk,
-- Dimension surrogate keys (looked up at load time):
c.customer_sk,
s.store_sk,
d.date_sk AS order_date_sk,
COALESCE(p.payment_sk, -1) AS payment_method_sk,
-- Degenerate dimension:
o.order_id,
-- Additive facts:
o.order_amount,
o.discount_amount,
o.delivery_fee,
o.num_items,
o.order_amount - o.discount_amount AS net_revenue,
-- Non-additive flags:
o.is_first_order,
o.has_promo,
-- Audit:
o.created_at AS order_created_at,
o.updated_at AS order_updated_at,
CURRENT_TIMESTAMP() AS fact_loaded_at
FROM orders o
JOIN {{ ref('dim_date') }} d ON DATE(o.created_at) = d.full_date
JOIN {{ ref('dim_customer') }} c ON o.customer_id = c.customer_id
AND o.created_at BETWEEN c.valid_from
AND COALESCE(c.valid_to, '9999-12-31')
JOIN {{ ref('dim_store') }} s ON o.store_id = s.store_id
LEFT JOIN {{ ref('dim_payment_method') }} p ON o.payment_method = p.payment_method_name
Two Dashboards, Same Month, Different Revenue — The Modelling Root Cause
Finance reports March revenue as ₹4.21 crore. Operations reports ₹3.87 crore. Same company, same month — ₹34 lakh difference. The CEO asks for an explanation.
-- Finance dashboard SQL (Metabase auto-generated):
SELECT SUM(order_amount) FROM fct_orders
WHERE order_date >= '2026-03-01' AND order_date < '2026-04-01';
-- Returns: 4.21 crore
-- Operations dashboard SQL (analyst hand-written):
SELECT SUM(order_amount) FROM fct_orders f
JOIN dim_date d ON f.order_date_sk = d.date_sk
WHERE d.month_name = 'March' AND d.year = 2026
AND f.status = 'delivered';
-- Returns: 3.87 crore
-- DIFFERENCES FOUND:
-- 1. Finance includes ALL statuses; Operations filters to 'delivered' only.
-- 2. The date filters are equivalent (both cover March 2026) — so the
--    status filter must explain the entire gap.
-- Check status breakdown:
SELECT status, SUM(order_amount) AS revenue
FROM fct_orders
WHERE order_date >= '2026-03-01' AND order_date < '2026-04-01'
GROUP BY 1 ORDER BY 2 DESC;
-- placed: 0.12 crore ← Finance includes (not yet delivered)
-- confirmed: 0.08 crore ← Finance includes
-- delivering: 0.06 crore ← Finance includes
-- delivered: 3.87 crore ← Operations reports only this ✓
-- cancelled: 0.08 crore ← Finance includes CANCELLED orders!
-- Root cause: both queries are "correct" — they measure different things.
-- Finance: GMV (all orders placed)
-- Operations: completed revenue (delivered only)
-- But BOTH are labelled "March revenue" — that is the problem.
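A quick arithmetic check (values copied from the status breakdown above) confirms the non-delivered statuses account for exactly the gap:

```python
# ₹ crore, per status, from the breakdown query above:
breakdown = {"placed": 0.12, "confirmed": 0.08, "delivering": 0.06,
             "delivered": 3.87, "cancelled": 0.08}
finance_total = round(sum(breakdown.values()), 2)  # all statuses
operations_total = breakdown["delivered"]          # delivered only
gap = round(finance_total - operations_total, 2)   # 0.34 crore = ₹34 lakh
print(finance_total, operations_total, gap)  # 4.21 3.87 0.34
```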
-- MODELLING FIX: define canonical metric in dbt, one place:
-- models/gold/metrics/mrt_monthly_revenue.sql
SELECT
d.year, d.month_number, d.month_name,
SUM(CASE WHEN f.status = 'delivered' THEN f.order_amount ELSE 0 END)
AS delivered_revenue,
SUM(CASE WHEN f.status != 'cancelled' THEN f.order_amount ELSE 0 END)
AS gross_order_value,
SUM(CASE WHEN f.status = 'cancelled' THEN f.order_amount ELSE 0 END)
AS cancelled_value
FROM fct_orders f
JOIN dim_date d USING (order_date_sk)
GROUP BY 1, 2, 3;
-- Both Finance and Operations now query mrt_monthly_revenue.
-- Finance: SELECT gross_order_value AS gmv
-- Operations: SELECT delivered_revenue
-- Each named metric now has exactly ONE definition. The disagreement is eliminated structurally.
The root cause was not a data quality issue — it was a missing canonical metric definition. Centralising business logic in a dbt Gold model eliminated the disagreement. No pipeline work would have fixed this — it required a modelling decision.
🎯 Key Takeaways
- ✓Dimensional modelling organises analytical data into facts (measurements of business events — numeric) and dimensions (context — who, what, where, when). The grain declares exactly what one row in the fact table represents and must be declared before any other design decision.
- ✓Declare the grain first and keep every column honest to it. Every fact column must be a single value at the declared grain. Every dimension FK must apply at that grain. Mixing measurements at different granularities in the same fact table corrupts aggregations and is the hardest modelling error to fix after the fact.
- ✓Four fact table types: Transaction (one row per business event — immutable), Periodic Snapshot (one row per entity per period — appended, tracks state over time), Accumulating Snapshot (one row per process lifecycle — updated as stages complete, multiple date FKs), Factless (no numeric facts — records that an event occurred or a relationship exists).
- ✓Additive facts sum across all dimensions. Semi-additive facts sum across some dimensions but not time (inventory levels — use AVG for time periods). Non-additive facts cannot be summed at all (ratios, booleans). Store component numerators and denominators, compute ratios at query time — never store pre-computed percentages in fact tables.
- ✓Surrogate keys (warehouse-generated integers) are required for: SCD Type 2 point-in-time joins (each historical dimension version needs a unique key), source system independence (source key changes don't propagate to fact FKs), multiple source integration (no natural key collisions), and join performance (integer equality is fastest).
- ✓Star schema: fact table at the centre, denormalised dimension tables connected via integer FKs. All hierarchy levels in one dimension table row. Simple, predictable SQL. Snowflake schemas (normalised dimensions) add join complexity for negligible benefit — avoid them for analytical Gold layers.
- ✓Conformed dimensions are shared across multiple fact tables with identical structure and meaning. dim_date is always conformed. Conformed dimensions enable cross-process analysis — comparing orders to payments for the same customer correctly. Non-conformed dimensions make cross-mart queries produce wrong results silently.
- ✓Wide tables (OBT) embed all dimension attributes into a single table — zero joins at query time. Best for self-service analysts, BI tools, and columnar engines. Star schemas are better for SCD Type 2, multiple fact tables at different grains, and very large fact tables. The hybrid approach: star schema for canonical model + derived wide table for BI consumption.
- ✓In dbt: use dbt_utils.generate_surrogate_key() for deterministic hash-based surrogate keys, dbt snapshots with strategy="check" for SCD Type 2 dimension tables, incremental materialisation with merge strategy for large fact tables, and the date-range join for point-in-time fact-to-dimension lookups.
- ✓Centralise business logic in dbt Gold models. A "revenue" metric disagreement between Finance and Operations (₹4.21 crore vs ₹3.87 crore) is always a missing canonical definition problem, not a data quality problem. One dbt model defines delivered_revenue, gross_order_value, and cancelled_value — both teams query the canonical model and can no longer accidentally apply different filters to the same metric name.