SQL for Data Engineers — Beyond the Basics
Window functions, CTEs, deduplication, NULL handling, and the queries every interview tests.
The SQL a Data Engineer Writes Is Not the SQL You Learned First
Basic SQL — SELECT, WHERE, GROUP BY, JOINs — gets you far enough to query a database. It does not get you far enough to build the transformation layer of a production data platform. The SQL a data engineer writes daily is different in kind, not just complexity.
A data engineer's SQL calculates running totals and moving averages without self-joins. It deduplicates millions of rows in a single pass using ranking functions. It tracks slowly changing dimensions across historical snapshots. It handles NULL in ways that prevent silent aggregation errors. It is written as modular, testable CTEs rather than nested subqueries. And it is written with query performance in mind — because a query that takes 4 minutes instead of 4 seconds is a pipeline SLA violation.
Examples use three tables: silver.orders, silver.customers, silver.stores. SQL dialect is standard PostgreSQL / Snowflake compatible unless noted.
Window Functions — The Most Powerful SQL Feature You Must Know
Window functions perform calculations across a set of rows that are related to the current row — without collapsing those rows into a single group the way GROUP BY does. They are the feature that separates junior SQL writers from senior ones, and they appear in virtually every data engineering interview.
The fundamental structure is: function() OVER (PARTITION BY ... ORDER BY ... ROWS/RANGE BETWEEN ...). The OVER clause is what makes it a window function rather than a regular aggregation. Breaking down each part:
SELECT
order_id,
store_id,
order_amount,
-- The window function:
SUM(order_amount) OVER (
PARTITION BY store_id -- divide rows into groups by store
-- calculations are independent per partition
ORDER BY order_date -- within each partition, sort by date
ROWS BETWEEN -- define the "window" of rows to include
UNBOUNDED PRECEDING -- from the first row of the partition
AND CURRENT ROW -- to the current row
) AS running_total_by_store
FROM silver.orders;
-- PARTITION BY is optional — omit it to treat all rows as one partition:
SUM(order_amount) OVER (ORDER BY order_date) AS running_total_all_stores
-- ORDER BY inside OVER is optional — omit it for unordered aggregations:
SUM(order_amount) OVER (PARTITION BY store_id) AS store_total
-- this gives every row the store's total — same as a correlated subquery but much faster
-- The frame clause (ROWS/RANGE BETWEEN) is optional:
-- Default with ORDER BY: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- Default without ORDER BY: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Ranking functions — ROW_NUMBER, RANK, DENSE_RANK
SELECT
order_id,
store_id,
order_amount,
-- ROW_NUMBER: sequential unique number within partition
-- No ties — every row gets a unique number
ROW_NUMBER() OVER (
PARTITION BY store_id
ORDER BY order_amount DESC
) AS row_num,
-- RANK: tied rows get the same rank, then skips numbers
-- Scores: 100, 100, 80 → ranks: 1, 1, 3
RANK() OVER (
PARTITION BY store_id
ORDER BY order_amount DESC
) AS rank_with_gaps,
-- DENSE_RANK: tied rows get the same rank, no skipping
-- Scores: 100, 100, 80 → ranks: 1, 1, 2
DENSE_RANK() OVER (
PARTITION BY store_id
ORDER BY order_amount DESC
) AS rank_no_gaps
FROM silver.orders
WHERE order_date = '2026-03-17';
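The tie behaviour of the three ranking functions is easy to verify with SQLite's window functions from Python. This is a toy table (made-up order_ids and amounts) reproducing the 100, 100, 80 example from the comments:

```python
import sqlite3

# Toy scores table: two rows tie at 100, one row at 80.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (order_id INTEGER, amount INTEGER);
INSERT INTO scores VALUES (1, 100), (2, 100), (3, 80);
""")
rows = conn.execute("""
SELECT order_id,
       ROW_NUMBER() OVER (ORDER BY amount DESC, order_id) AS row_num,
       RANK()       OVER (ORDER BY amount DESC) AS rank_gaps,
       DENSE_RANK() OVER (ORDER BY amount DESC) AS rank_dense
FROM scores
ORDER BY row_num
""").fetchall()
for r in rows:
    print(r)
# (1, 1, 1, 1)
# (2, 2, 1, 1)
# (3, 3, 3, 2)
```

The two tied rows share RANK 1 and DENSE_RANK 1; ROW_NUMBER still assigns distinct 1 and 2, and the 80 row gets RANK 3 (gap) but DENSE_RANK 2 (no gap).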
-- REAL USE CASE: get the top 3 orders per store
WITH ranked AS (
SELECT
order_id,
store_id,
order_amount,
ROW_NUMBER() OVER (
PARTITION BY store_id
ORDER BY order_amount DESC
) AS rn
FROM silver.orders
WHERE order_date = '2026-03-17'
)
SELECT order_id, store_id, order_amount
FROM ranked
WHERE rn <= 3; -- keep only top 3 per store
-- WHY ROW_NUMBER FOR DEDUP vs RANK:
-- ROW_NUMBER guarantees one row per group even when values tie
-- RANK may return 2 rows when two rows tie for position 1
-- For deduplication, always use ROW_NUMBER
LAG and LEAD — comparing current row to previous or next
-- LAG: access the value from a previous row in the partition
-- LEAD: access the value from a following row in the partition
SELECT
store_id,
order_date,
daily_revenue,
-- Yesterday's revenue for this store
LAG(daily_revenue, 1, 0) OVER (
PARTITION BY store_id
ORDER BY order_date
) AS prev_day_revenue,
-- Args: (column, offset, default_if_null)
-- offset=1 means one row back; default=0 when no previous row exists
-- Day-over-day change
daily_revenue - LAG(daily_revenue, 1, 0) OVER (
PARTITION BY store_id
ORDER BY order_date
) AS day_over_day_change,
-- Day-over-day % change (careful with division — LAG can be 0)
CASE
WHEN LAG(daily_revenue, 1) OVER (PARTITION BY store_id ORDER BY order_date) IS NULL
OR LAG(daily_revenue, 1) OVER (PARTITION BY store_id ORDER BY order_date) = 0
THEN NULL
ELSE ROUND(
(daily_revenue - LAG(daily_revenue, 1) OVER (PARTITION BY store_id ORDER BY order_date))
/ LAG(daily_revenue, 1) OVER (PARTITION BY store_id ORDER BY order_date) * 100,
2
)
END AS pct_change,
-- Tomorrow's revenue (from the future row's perspective)
LEAD(daily_revenue, 1) OVER (
PARTITION BY store_id
ORDER BY order_date
) AS next_day_revenue
FROM gold.daily_store_revenue
WHERE order_date BETWEEN '2026-03-01' AND '2026-03-17'
ORDER BY store_id, order_date;
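The LAG-with-default pattern can be sanity-checked with SQLite's window functions in Python. This uses a toy three-day revenue series (assumed values) for a single store:

```python
import sqlite3

# Toy daily revenue series to exercise LAG(value, offset, default).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (order_date TEXT, revenue INTEGER);
INSERT INTO daily VALUES ('2026-03-01', 100), ('2026-03-02', 150), ('2026-03-03', 120);
""")
rows = conn.execute("""
SELECT order_date, revenue,
       LAG(revenue, 1, 0) OVER (ORDER BY order_date) AS prev_rev,
       revenue - LAG(revenue, 1, 0) OVER (ORDER BY order_date) AS dod_change
FROM daily
ORDER BY order_date
""").fetchall()
print(rows)
# The first row has no predecessor, so prev_rev falls back to the default 0
# and dod_change equals the full day's revenue.
```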
-- REAL DE USE CASE: detect gaps in daily data
-- Find dates where a store had no data (gap detection)
WITH store_dates AS (
SELECT
store_id,
order_date,
LEAD(order_date) OVER (PARTITION BY store_id ORDER BY order_date) AS next_date
FROM gold.daily_store_revenue
)
SELECT store_id, order_date, next_date,
next_date - order_date - 1 AS missing_days
FROM store_dates
WHERE next_date - order_date > 1 -- gap of more than 1 day
ORDER BY missing_days DESC;
Running totals and moving averages
SELECT
store_id,
order_date,
daily_revenue,
-- Running total (cumulative sum from start of partition to current row)
SUM(daily_revenue) OVER (
PARTITION BY store_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue,
-- 7-day moving average (last 7 days including today)
AVG(daily_revenue) OVER (
PARTITION BY store_id
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d,
-- 7-day moving sum
SUM(daily_revenue) OVER (
PARTITION BY store_id
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_sum,
-- Month-to-date total (from first day of current month to today)
SUM(daily_revenue) OVER (
PARTITION BY store_id, DATE_TRUNC('month', order_date)
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS mtd_revenue,
-- Percentage of store's total monthly revenue
daily_revenue / SUM(daily_revenue) OVER (
PARTITION BY store_id, DATE_TRUNC('month', order_date)
) * 100 AS pct_of_monthly_total
FROM gold.daily_store_revenue
ORDER BY store_id, order_date;
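The cumulative and moving-window frames above can be verified with SQLite from Python. A toy four-day series (made-up values) with a 3-row moving average keeps the arithmetic small enough to check by hand:

```python
import sqlite3

# Four days of toy revenue: 10, 20, 30, 40.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (order_date TEXT, revenue INTEGER);
INSERT INTO daily VALUES
  ('2026-03-01', 10), ('2026-03-02', 20),
  ('2026-03-03', 30), ('2026-03-04', 40);
""")
rows = conn.execute("""
SELECT order_date, revenue,
       SUM(revenue) OVER (ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative,
       AVG(revenue) OVER (ORDER BY order_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3
FROM daily
ORDER BY order_date
""").fetchall()
for r in rows:
    print(r)
# cumulative: 10, 30, 60, 100
# moving_avg_3: 10.0, 15.0, 20.0, 30.0 (window shrinks at the start)
```

Note how the moving-average window is smaller than 3 rows for the first two days: the frame clips at the partition edge rather than producing NULL.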
-- ROWS vs RANGE — an important distinction:
--
-- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
-- → exactly 6 rows before current row, regardless of value gaps
-- → correct for "last 7 rows"
--
-- RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
-- → all rows within the last 6 days by value (calendar days)
-- → correct for "last 7 calendar days" even if some days have no data
-- → only works when ORDER BY column is DATE or TIMESTAMP
--
-- For moving averages over time series with gaps, RANGE is more correct.
-- For moving averages over sequences of rows, ROWS is more correct.
NTILE and PERCENT_RANK — distribution functions
SELECT
customer_id,
total_spend_ytd,
-- NTILE(n): divide rows into n equal-sized buckets
-- Useful for quartile/decile segmentation
NTILE(4) OVER (ORDER BY total_spend_ytd DESC) AS spend_quartile,
-- 1 = top 25%, 2 = next 25%, 3 = next 25%, 4 = bottom 25%
NTILE(10) OVER (ORDER BY total_spend_ytd DESC) AS spend_decile,
-- 1 = top 10%, ..., 10 = bottom 10%
-- PERCENT_RANK: what percentile is this row in? (0.0 to 1.0)
ROUND(PERCENT_RANK() OVER (ORDER BY total_spend_ytd) * 100, 1)
AS percentile_rank,
-- CUME_DIST: cumulative distribution (fraction of rows <= current)
ROUND(CUME_DIST() OVER (ORDER BY total_spend_ytd) * 100, 1)
AS cumulative_pct
FROM silver.customer_annual_stats
ORDER BY total_spend_ytd DESC;
-- REAL USE CASE: customer segmentation for marketing
SELECT
customer_id,
total_orders,
total_spend_ytd,
CASE NTILE(5) OVER (ORDER BY total_spend_ytd DESC)
WHEN 1 THEN 'Platinum' -- top 20%
WHEN 2 THEN 'Gold' -- next 20%
WHEN 3 THEN 'Silver' -- next 20%
WHEN 4 THEN 'Bronze' -- next 20%
WHEN 5 THEN 'Standard' -- bottom 20%
END AS customer_tier
FROM silver.customer_annual_stats;
Common Table Expressions — The Foundation of Readable SQL
A CTE (Common Table Expression) is a named temporary result set defined within a query using the WITH clause. CTEs transform a single deeply-nested subquery mess into a series of named, readable, independently-understandable steps. They are how data engineers write SQL that other engineers can review, debug, and modify.
In dbt, every model is essentially a CTE chain — each model builds on the previous one, with each CTE representing one logical transformation step. The ability to write clear, modular CTEs is the most important SQL skill for a data engineer.
-- BAD: nested subquery — unreadable, hard to debug, impossible to test steps
SELECT
s.store_name,
ranked.daily_revenue,
ranked.revenue_rank
FROM (
SELECT
store_id,
SUM(order_amount) AS daily_revenue,
RANK() OVER (ORDER BY SUM(order_amount) DESC) AS revenue_rank
FROM (
SELECT order_id, store_id, order_amount
FROM silver.orders
WHERE order_date = '2026-03-17'
AND status = 'delivered'
) filtered
GROUP BY store_id
) ranked
JOIN silver.stores s ON ranked.store_id = s.store_id
WHERE ranked.revenue_rank <= 5;
-- GOOD: CTE chain — each step is named, readable, testable independently
WITH
-- Step 1: filter to the orders we care about
delivered_orders AS (
SELECT order_id, store_id, order_amount
FROM silver.orders
WHERE order_date = '2026-03-17'
AND status = 'delivered'
),
-- Step 2: aggregate by store
store_revenue AS (
SELECT
store_id,
SUM(order_amount) AS daily_revenue,
COUNT(*) AS order_count
FROM delivered_orders
GROUP BY store_id
),
-- Step 3: rank stores by revenue
ranked_stores AS (
SELECT
store_id,
daily_revenue,
order_count,
RANK() OVER (ORDER BY daily_revenue DESC) AS revenue_rank
FROM store_revenue
)
-- Final: join to store names and filter top 5
SELECT
s.store_name,
s.city,
rs.daily_revenue,
rs.order_count,
rs.revenue_rank
FROM ranked_stores rs
JOIN silver.stores s ON rs.store_id = s.store_id
WHERE rs.revenue_rank <= 5
ORDER BY rs.revenue_rank;
Multiple CTEs — building a full transformation pipeline in SQL
-- Typical dbt Gold model: daily customer metrics
WITH
-- Base: all orders in the analysis window
base_orders AS (
SELECT
o.order_id,
o.customer_id,
o.store_id,
o.order_amount,
o.status,
o.order_date,
o.created_at
FROM silver.orders o
WHERE o.order_date BETWEEN '2026-01-01' AND '2026-03-17'
AND o.status IN ('delivered', 'cancelled')
),
-- Customer first order date (for cohort analysis)
customer_first_order AS (
SELECT
customer_id,
MIN(order_date) AS first_order_date,
MIN(order_id) AS first_order_id
FROM base_orders
WHERE status = 'delivered'
GROUP BY customer_id
),
-- Delivered orders only (for revenue metrics)
delivered AS (
SELECT * FROM base_orders WHERE status = 'delivered'
),
-- Customer-level aggregation
customer_metrics AS (
SELECT
d.customer_id,
COUNT(DISTINCT d.order_id) AS total_orders,
SUM(d.order_amount) AS total_revenue,
AVG(d.order_amount) AS avg_order_value,
MAX(d.order_date) AS last_order_date,
COUNT(DISTINCT d.store_id) AS stores_visited,
cfo.first_order_date,
-- Days since first order
('2026-03-17'::DATE - cfo.first_order_date) AS customer_age_days,
-- Order frequency (orders per month since first order)
ROUND(
COUNT(DISTINCT d.order_id)::NUMERIC
/ GREATEST(
('2026-03-17'::DATE - cfo.first_order_date) / 30.0,
1
),
2
) AS orders_per_month
FROM delivered d
JOIN customer_first_order cfo USING (customer_id)
GROUP BY d.customer_id, cfo.first_order_date
),
-- Add customer dimension attributes
final AS (
SELECT
cm.*,
c.customer_name,
c.city,
c.tier,
-- Classify by value
CASE
WHEN cm.total_revenue >= 50000 THEN 'high_value'
WHEN cm.total_revenue >= 10000 THEN 'mid_value'
ELSE 'low_value'
END AS value_segment,
-- Classify by recency
CASE
WHEN '2026-03-17'::DATE - cm.last_order_date <= 30 THEN 'active'
WHEN '2026-03-17'::DATE - cm.last_order_date <= 90 THEN 'at_risk'
ELSE 'churned'
END AS recency_segment
FROM customer_metrics cm
JOIN silver.customers c USING (customer_id)
)
SELECT * FROM final;
Recursive CTEs — for hierarchical and graph data
-- Recursive CTEs traverse hierarchical data:
-- parent-child relationships, org charts, category trees
-- Example: product category hierarchy
-- categories table: (category_id, name, parent_category_id)
-- Root categories have parent_category_id = NULL
WITH RECURSIVE category_tree AS (
-- Base case: start with root categories (no parent)
SELECT
category_id,
name,
parent_category_id,
name AS full_path,
0 AS depth
FROM silver.categories
WHERE parent_category_id IS NULL
UNION ALL
-- Recursive case: join each category to its children
SELECT
c.category_id,
c.name,
c.parent_category_id,
ct.full_path || ' > ' || c.name AS full_path, -- build path
ct.depth + 1 AS depth
FROM silver.categories c
JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT
category_id,
name,
full_path,
depth
FROM category_tree
ORDER BY full_path;
-- Result:
-- Electronics depth=0
-- Electronics > Phones depth=1
-- Electronics > Phones > Smartphones depth=2
-- Electronics > Phones > Feature Phones depth=2
-- Electronics > Laptops depth=1
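SQLite also supports WITH RECURSIVE, so the path-building traversal can be reproduced from Python on a toy three-level hierarchy (made-up category names and ids):

```python
import sqlite3

# Minimal categories table: Electronics > Phones > Smartphones.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (category_id INTEGER, name TEXT, parent_category_id INTEGER);
INSERT INTO categories VALUES
  (1, 'Electronics', NULL),
  (2, 'Phones', 1),
  (3, 'Smartphones', 2);
""")
rows = conn.execute("""
WITH RECURSIVE category_tree AS (
    -- Base case: roots (no parent)
    SELECT category_id, name, name AS full_path, 0 AS depth
    FROM categories WHERE parent_category_id IS NULL
    UNION ALL
    -- Recursive case: append each child to its parent's path
    SELECT c.category_id, c.name, ct.full_path || ' > ' || c.name, ct.depth + 1
    FROM categories c
    JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT full_path, depth FROM category_tree ORDER BY full_path
""").fetchall()
print(rows)
# [('Electronics', 0), ('Electronics > Phones', 1),
#  ('Electronics > Phones > Smartphones', 2)]
```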
-- GUARD AGAINST INFINITE LOOPS:
-- Add WHERE depth < 10 to the recursive case
-- Some databases (e.g. SQL Server) support a MAXRECURSION hint
Deduplication — The Most Common Data Engineering SQL Task
Duplicate records are one of the most frequent data quality problems in any pipeline. They come from source systems that emit the same event twice during retries, from CDC tools that deliver at-least-once, from pipeline reruns that re-insert already-loaded records, and from UNION operations that do not account for shared rows between sources.
SQL deduplication using window functions is the standard, efficient approach. It handles all deduplication scenarios in a single pass without expensive self-joins.
-- ── SCENARIO 1: Exact duplicates (all columns identical) ────────────────────
-- Keep one copy of each row where every column is the same
SELECT DISTINCT *
FROM silver.orders;
-- Simple, but it scans every row and you cannot control which copy is kept
-- More explicit, using ROW_NUMBER in a CTE:
WITH deduped AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY order_id, customer_id, amount, status) AS rn
FROM silver.orders
)
SELECT * EXCLUDE (rn) FROM deduped WHERE rn = 1;
-- Note: SELECT * EXCLUDE (...) is Snowflake syntax; in PostgreSQL, list the columns explicitly
-- ── SCENARIO 2: Same primary key, keep most recent version ───────────────────
-- Multiple rows with the same order_id — keep the one with the latest updated_at
WITH deduped AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id -- group by business key
ORDER BY updated_at DESC -- most recent first
) AS rn
FROM silver.orders
)
SELECT * EXCLUDE (rn)
FROM deduped
WHERE rn = 1;
-- For each order_id, keeps exactly one row — the one with the latest updated_at
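Scenario 2 can be verified end to end with SQLite in Python. A toy table (assumed values) where order 1 appears twice, once stale and once fresh:

```python
import sqlite3

# Order 1 has two versions; keep the one with the latest updated_at.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, status TEXT, updated_at TEXT);
INSERT INTO orders VALUES
  (1, 'placed',    '2026-03-17 09:00'),
  (1, 'delivered', '2026-03-17 10:30'),
  (2, 'placed',    '2026-03-17 09:15');
""")
rows = conn.execute("""
WITH deduped AS (
    SELECT order_id, status,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM orders
)
SELECT order_id, status FROM deduped WHERE rn = 1 ORDER BY order_id
""").fetchall()
print(rows)
# [(1, 'delivered'), (2, 'placed')] -- the stale 'placed' version of order 1 is dropped
```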
-- ── SCENARIO 3: Same primary key, keep first seen version ────────────────────
-- Keep the original record, ignore later duplicates
WITH deduped AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY created_at ASC, -- oldest first
ingested_at ASC -- tiebreak by when pipeline saw it
) AS rn
FROM silver.orders
)
SELECT * EXCLUDE (rn)
FROM deduped
WHERE rn = 1;
-- ── SNOWFLAKE: QUALIFY clause (no outer CTE needed) ──────────────────────────
-- Snowflake supports QUALIFY to filter on window function results directly
SELECT *
FROM silver.orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY updated_at DESC
) = 1;
-- Cleaner than wrapping in CTE — equivalent result
-- ── IN-PLACE DEDUPLICATION: remove duplicates from an existing table ─────────
-- Use when you need to clean a table that already has duplicates
-- Step 1: identify rows to keep (in temp table or CTE)
CREATE TEMP TABLE orders_deduped AS
SELECT * EXCLUDE (rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM silver.orders
) t
WHERE rn = 1;
-- Step 2: swap (truncate original, insert clean version)
BEGIN;
TRUNCATE silver.orders;
INSERT INTO silver.orders SELECT * FROM orders_deduped;
COMMIT;
-- Or use DELETE with a CTE (PostgreSQL only: ctid is a Postgres-specific physical row id):
WITH duplicates AS (
SELECT ctid,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM silver.orders
)
DELETE FROM silver.orders
WHERE ctid IN (SELECT ctid FROM duplicates WHERE rn > 1);
NULL — The Most Misunderstood Value in SQL
NULL in SQL does not mean zero. It does not mean empty string. It means unknown. This single semantic distinction causes more silent data errors in data engineering than almost any other SQL concept. When you add NULL to a number, the result is NULL. When you compare NULL to anything — even another NULL — the result is neither TRUE nor FALSE, but UNKNOWN. And UNKNOWN in a WHERE clause means the row is excluded.
-- NULL arithmetic — results are always NULL
SELECT 380 + NULL; -- NULL
SELECT NULL * 0; -- NULL (not 0!)
SELECT 'hello' || NULL; -- NULL (not 'hello')
-- NULL comparisons — never use = NULL
SELECT * FROM orders WHERE promo_code = NULL; -- WRONG: returns 0 rows
SELECT * FROM orders WHERE promo_code IS NULL; -- CORRECT: returns rows with NULL promo_code
SELECT * FROM orders WHERE promo_code IS NOT NULL;
-- NULL in three-valued logic:
-- TRUE AND NULL = NULL (not FALSE)
-- FALSE AND NULL = FALSE
-- TRUE OR NULL = TRUE
-- FALSE OR NULL = NULL (not FALSE)
-- NOT NULL = NULL
-- Consequence: WHERE clause excludes rows where condition is NULL
-- This means: if promo_code is NULL, these rows are silently excluded:
SELECT * FROM orders WHERE promo_code != 'SAVE10';
-- Rows where promo_code IS NULL are NOT returned — they fail the != comparison
-- FIX:
SELECT * FROM orders WHERE promo_code != 'SAVE10' OR promo_code IS NULL;
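These exclusion rules are easy to demonstrate with SQLite from Python. A toy orders table (assumed values) where one row has a NULL promo_code:

```python
import sqlite3

# Row 3 has promo_code NULL; it is invisible to both = and != comparisons.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, promo_code TEXT);
INSERT INTO orders VALUES (1, 'SAVE10'), (2, 'FLAT50'), (3, NULL);
""")
eq_null = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE promo_code = NULL").fetchone()[0]
neq = conn.execute(
    "SELECT order_id FROM orders WHERE promo_code != 'SAVE10' ORDER BY order_id"
).fetchall()
neq_fixed = conn.execute(
    "SELECT order_id FROM orders WHERE promo_code != 'SAVE10' OR promo_code IS NULL "
    "ORDER BY order_id").fetchall()
print(eq_null, neq, neq_fixed)
# 0 [(2,)] [(2,), (3,)] -- the NULL row only appears once IS NULL is spelled out
```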
-- ── NULL IN AGGREGATIONS ──────────────────────────────────────────────────────
-- NULL values are IGNORED in all aggregate functions EXCEPT COUNT(*)
SELECT
COUNT(*) AS total_rows, -- counts ALL rows including NULL
COUNT(promo_code) AS rows_with_promo, -- counts only non-NULL promo_code
COUNT(*) - COUNT(promo_code)
AS rows_without_promo, -- difference = NULL rows
AVG(order_amount) AS avg_amount, -- NULL amounts excluded from avg
SUM(discount_amount) AS total_discount -- SUM ignores NULL discounts
-- NOTE: SUM ignores NULLs rather than treating them as 0. The total comes out
-- the same either way, unless every value is NULL: then SUM returns NULL, not 0.
FROM silver.orders;
-- Safe pattern: use COALESCE to treat NULL as 0 in sums
SELECT SUM(COALESCE(discount_amount, 0)) AS total_discount FROM silver.orders;
-- Now NULL discount_amount is explicitly treated as 0
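The aggregate behaviour can be checked with SQLite from Python. A toy table (assumed values) with two NULL discounts:

```python
import sqlite3

# Three rows; discount is NULL for two of them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (amount INTEGER, discount INTEGER);
INSERT INTO t VALUES (100, 10), (200, NULL), (300, NULL);
""")
row = conn.execute("""
SELECT COUNT(*),                    -- counts all rows, NULL or not
       COUNT(discount),             -- counts only non-NULL discounts
       SUM(discount),               -- NULLs skipped
       SUM(COALESCE(discount, 0)),  -- NULLs made explicit zeros
       AVG(amount)                  -- no NULL amounts here
FROM t
""").fetchone()
all_null = conn.execute(
    "SELECT SUM(discount) FROM t WHERE discount IS NULL").fetchone()[0]
print(row, all_null)
# (3, 1, 10, 10, 200.0) None -- SUM over only-NULL rows is NULL, not 0
```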
-- ── COALESCE: return first non-NULL value ─────────────────────────────────────
SELECT
order_id,
COALESCE(promo_code, 'NO_PROMO') AS promo_code_safe,
COALESCE(delivery_fee, 0) AS delivery_fee_safe,
COALESCE(notes, special_inst, '') AS display_notes -- try notes, then special_inst, then ''
FROM silver.orders;
-- ── NULLIF: return NULL if value equals a specific value ─────────────────────
-- Prevents division by zero elegantly
SELECT
store_id,
total_revenue / NULLIF(total_orders, 0) AS avg_order_value
-- if total_orders = 0, NULLIF returns NULL; anything divided by NULL is NULL (not an error)
FROM gold.store_metrics;
-- Clean up dummy/placeholder values:
SELECT NULLIF(phone_number, 'N/A') AS phone_clean FROM customers;
-- Returns NULL where phone_number = 'N/A'
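A quick check of NULLIF mechanics from Python. One caveat on the sketch: SQLite itself returns NULL on division by zero, whereas PostgreSQL and Snowflake raise an error, which is exactly why the NULLIF guard matters in those engines:

```python
import sqlite3

# NULLIF(a, b) is NULL when a = b, otherwise a.
conn = sqlite3.connect(":memory:")
vals = conn.execute(
    "SELECT NULLIF(0, 0), NULLIF(5, 0), 10.0 / NULLIF(0, 0)").fetchone()
print(vals)
# (None, 5, None) -- the division propagates NULL instead of failing
```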
-- ── NULL IN JOINs ─────────────────────────────────────────────────────────────
-- NULL = NULL is UNKNOWN, so NULL join keys never match
SELECT o.*, c.name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;
-- If o.customer_id IS NULL → no match → c.name IS NULL in result
-- Common mistake: assuming all LEFT JOIN nulls come from missing matches
-- Some may come from NULL join keys in the left table itself
-- Distinguish them:
SELECT
o.order_id,
o.customer_id,
c.name,
CASE
WHEN o.customer_id IS NULL THEN 'null_key_in_orders'
WHEN c.customer_id IS NULL THEN 'missing_customer'
ELSE 'matched'
END AS join_status
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;
Set Operations — UNION, INTERSECT, EXCEPT
Set operations combine the results of two queries that have the same columns. They are essential for a data engineer who combines data from multiple sources, audits for data completeness, or identifies records that appear in one dataset but not another.
-- ── UNION: combine rows from two queries ────────────────────────────────────
-- UNION ALL: include all rows from both queries (keeps duplicates)
-- Use UNION ALL by default — it is faster because no deduplication step
SELECT order_id, order_amount, 'online' AS channel FROM silver.online_orders
UNION ALL
SELECT order_id, order_amount, 'instore' AS channel FROM silver.instore_orders;
-- UNION (without ALL): deduplicates the combined result, removing every duplicate row,
-- whether it appears in one input or both
-- Slower — requires a sort or hash to find duplicates
SELECT customer_id FROM silver.email_subscribers
UNION
SELECT customer_id FROM silver.sms_subscribers;
-- Returns unique customer_ids who subscribed to either channel
-- ── INTERSECT: rows that appear in BOTH queries ───────────────────────────────
-- Find customers subscribed to both email AND SMS
SELECT customer_id FROM silver.email_subscribers
INTERSECT
SELECT customer_id FROM silver.sms_subscribers;
-- Equivalent with JOIN (sometimes more explicit):
SELECT DISTINCT e.customer_id
FROM silver.email_subscribers e
INNER JOIN silver.sms_subscribers s USING (customer_id);
-- ── EXCEPT: rows in first query but NOT in second ────────────────────────────
-- (Called MINUS in Oracle/some other databases)
-- Find email subscribers who are NOT SMS subscribers
SELECT customer_id FROM silver.email_subscribers
EXCEPT
SELECT customer_id FROM silver.sms_subscribers;
-- REAL DE USE CASE: data completeness check
-- Find order_ids in the source (PostgreSQL) but missing from the warehouse
SELECT order_id FROM source.orders WHERE order_date = '2026-03-17'
EXCEPT
SELECT order_id FROM silver.orders WHERE order_date = '2026-03-17';
-- Returns order_ids present in the source but missing from the warehouse
-- If this returns rows: the pipeline has a gap
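The completeness check can be reproduced with SQLite from Python, on hypothetical source and warehouse tables where one order never landed:

```python
import sqlite3

# Order 3 exists in the source but never reached the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (order_id INTEGER);
CREATE TABLE warehouse_orders (order_id INTEGER);
INSERT INTO source_orders VALUES (1), (2), (3), (4);
INSERT INTO warehouse_orders VALUES (1), (2), (4);
""")
missing = conn.execute("""
SELECT order_id FROM source_orders
EXCEPT
SELECT order_id FROM warehouse_orders
""").fetchall()
print(missing)
# [(3,)] -- a non-empty result means the pipeline has a gap
```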
-- ── UNION ALL for combining multiple sources (common DE pattern) ──────────────
-- Combine payments from multiple payment providers into one table
WITH all_payments AS (
SELECT payment_id, merchant_id, amount, 'razorpay' AS provider, created_at
FROM silver.razorpay_payments
UNION ALL
SELECT payment_id, merchant_id, amount, 'paytm' AS provider, created_at
FROM silver.paytm_payments
UNION ALL
SELECT payment_id, merchant_id, amount, 'phonepe' AS provider, created_at
FROM silver.phonepe_payments
)
SELECT
provider,
DATE_TRUNC('day', created_at) AS payment_date,
COUNT(*) AS transaction_count,
SUM(amount) AS total_volume
FROM all_payments
GROUP BY 1, 2
ORDER BY 2, 1;
Slowly Changing Dimensions in SQL — Types 1, 2, and 3
A Slowly Changing Dimension (SCD) is a dimension table where the attribute values change over time — a customer changes their city, a product changes its category, a store changes its manager. How you handle these changes determines whether historical analysis produces correct or misleading results. SCD handling is one of the most-tested topics in data engineering interviews.
-- SCD TYPE 1: update in place, discard old value
-- Use when: you only care about current state, history is not needed
-- Example: correct a data entry mistake in an address
-- UPSERT pattern (INSERT or UPDATE):
INSERT INTO silver.customers (customer_id, name, city, updated_at)
VALUES (4201938, 'Priya Sharma', 'Hyderabad', NOW())
ON CONFLICT (customer_id)
DO UPDATE SET
city = EXCLUDED.city,
updated_at = EXCLUDED.updated_at;
-- If customer 4201938 exists: update city from 'Bangalore' to 'Hyderabad'
-- If not: insert as new row
-- PROBLEM: all historical analysis now shows Hyderabad
-- "How much did Priya spend while she lived in Bangalore?" → impossible to answer
-- Use SCD Type 2 if that question matters to the business
-- SCD TYPE 2: add a new row for each change, expire the old row
-- Use when: you need to track history — "what was X at time T?"
-- Example: customer moves city — track both old and new for historical analysis
-- SCD Type 2 table structure:
CREATE TABLE silver.customers_scd2 (
customer_sk BIGSERIAL PRIMARY KEY, -- surrogate key (new per version)
customer_id BIGINT NOT NULL, -- business key (same across versions)
name VARCHAR NOT NULL,
city VARCHAR NOT NULL,
tier VARCHAR NOT NULL,
-- SCD metadata columns:
valid_from DATE NOT NULL, -- when this version became active
valid_to DATE, -- when this version expired (NULL = current)
is_current BOOLEAN NOT NULL DEFAULT TRUE,
UNIQUE (customer_id, valid_from)
);
-- STEP 1: expire the current row
UPDATE silver.customers_scd2
SET
valid_to = CURRENT_DATE - INTERVAL '1 day',
is_current = FALSE
WHERE customer_id = 4201938
AND is_current = TRUE;
-- STEP 2: insert the new version
INSERT INTO silver.customers_scd2
(customer_id, name, city, tier, valid_from, valid_to, is_current)
VALUES
(4201938, 'Priya Sharma', 'Hyderabad', 'Gold', CURRENT_DATE, NULL, TRUE);
-- QUERY: "What city was Priya in when she placed order 9284751?"
SELECT c.city
FROM silver.orders o
JOIN silver.customers_scd2 c
ON o.customer_id = c.customer_id
AND o.order_date BETWEEN c.valid_from AND COALESCE(c.valid_to, '9999-12-31')
WHERE o.order_id = 9284751;
-- Returns 'Bangalore' if the order was placed before the move
-- Returns 'Hyderabad' if placed after
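The point-in-time join can be verified with SQLite from Python on a toy two-version history (made-up ids and dates; ISO date strings compare correctly as text):

```python
import sqlite3

# Customer 42 moved from Bangalore to Hyderabad on 2026-03-01.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers_scd2 (customer_id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT);
INSERT INTO customers_scd2 VALUES
  (42, 'Bangalore', '2024-01-01', '2026-02-28'),
  (42, 'Hyderabad', '2026-03-01', NULL);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
INSERT INTO orders VALUES (1, 42, '2025-06-15'), (2, 42, '2026-03-10');
""")
rows = conn.execute("""
SELECT o.order_id, c.city
FROM orders o
JOIN customers_scd2 c
  ON o.customer_id = c.customer_id
 AND o.order_date BETWEEN c.valid_from AND COALESCE(c.valid_to, '9999-12-31')
ORDER BY o.order_id
""").fetchall()
print(rows)
# [(1, 'Bangalore'), (2, 'Hyderabad')] -- each order resolves to the city
# that was current when it was placed
```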
-- dbt snapshot pattern (generates SCD2 automatically):
-- {% snapshot customers_snapshot %}
-- {{ config(target_schema='snapshots', unique_key='customer_id',
-- strategy='check', check_cols=['city', 'tier']) }}
-- SELECT customer_id, name, city, tier FROM {{ source('prod', 'customers') }}
-- {% endsnapshot %}
-- SCD TYPE 3: add a column for the previous value
-- Use when: you only need to track one change back (current + previous)
-- Example: customer segment changes — track current and previous segment
CREATE TABLE silver.customers_scd3 (
customer_id BIGINT PRIMARY KEY,
name VARCHAR NOT NULL,
current_tier VARCHAR NOT NULL,
previous_tier VARCHAR, -- NULL if never changed
tier_changed_at TIMESTAMPTZ, -- when the most recent change happened
city VARCHAR NOT NULL
);
-- UPDATE pattern:
UPDATE silver.customers_scd3
SET
previous_tier = current_tier, -- save current as previous
current_tier = 'Platinum', -- set new current
tier_changed_at = NOW()
WHERE customer_id = 4201938;
-- LIMITATION: only tracks one previous value
-- After a second change: previous_tier is overwritten
-- Cannot answer "what was the tier two changes ago?"
-- Use SCD Type 2 for full history
Date and Time SQL — Arithmetic, Truncation, and Timezone Handling
Date and time manipulation is in almost every data engineering query. Daily aggregations, week-over-week comparisons, cohort analysis by signup month, session duration calculations — all require solid date SQL skills. The syntax varies slightly between databases; we cover PostgreSQL with notes on Snowflake and BigQuery differences.
-- ── DATE TRUNCATION ──────────────────────────────────────────────────────────
-- DATE_TRUNC rounds down to the start of the specified period
SELECT
order_id,
created_at,
DATE_TRUNC('day', created_at) AS order_day, -- 2026-03-17 00:00:00
DATE_TRUNC('week', created_at) AS order_week_start, -- 2026-03-16 (Monday)
DATE_TRUNC('month', created_at) AS order_month, -- 2026-03-01 00:00:00
DATE_TRUNC('year', created_at) AS order_year, -- 2026-01-01 00:00:00
DATE_TRUNC('hour', created_at) AS order_hour -- 2026-03-17 20:00:00
FROM silver.orders;
-- In Snowflake: DATE_TRUNC('month', created_at) — same syntax
-- In BigQuery: DATE_TRUNC(created_at, MONTH) — arguments reversed!
-- ── DATE ARITHMETIC ───────────────────────────────────────────────────────────
-- PostgreSQL:
SELECT
CURRENT_DATE, -- today: 2026-03-17
CURRENT_DATE - INTERVAL '7 days', -- 7 days ago: 2026-03-10
CURRENT_DATE - INTERVAL '1 month', -- 1 month ago: 2026-02-17
CURRENT_DATE + INTERVAL '30 days', -- 30 days ahead: 2026-04-16
-- Date parts:
EXTRACT(DOW FROM CURRENT_DATE), -- day of week (0=Sunday, 6=Saturday)
EXTRACT(DOY FROM CURRENT_DATE), -- day of year (1–365, 366 in leap years)
EXTRACT(MONTH FROM created_at), -- month number (1–12)
EXTRACT(EPOCH FROM created_at), -- Unix timestamp (seconds since 1970)
-- Date difference:
CURRENT_DATE - order_date AS days_since_order, -- integer
AGE(CURRENT_DATE, order_date) AS age_interval, -- interval '14 days'
(CURRENT_DATE - order_date::DATE)::INT AS days_int -- date - date is already an integer; no DATE_PART needed
FROM silver.orders;
-- ── COMMON DATE PATTERNS ──────────────────────────────────────────────────────
-- Last 7 days (inclusive of today):
WHERE order_date >= CURRENT_DATE - INTERVAL '6 days'
-- Month-to-date:
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
AND order_date < CURRENT_DATE + INTERVAL '1 day'
-- Previous full month:
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
AND order_date < DATE_TRUNC('month', CURRENT_DATE)
-- Same day last year:
WHERE order_date = CURRENT_DATE - INTERVAL '1 year'
-- Last complete week (Monday–Sunday):
WHERE order_date >= DATE_TRUNC('week', CURRENT_DATE - INTERVAL '7 days')
AND order_date < DATE_TRUNC('week', CURRENT_DATE)
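The boundary arithmetic behind these patterns can be sketched in pure Python, pinned to an assumed "today" of 2026-03-17 so the results are deterministic:

```python
from datetime import date, timedelta

# Assumed reporting date, matching the examples in this article.
today = date(2026, 3, 17)

last_7_start = today - timedelta(days=6)          # last 7 days inclusive of today
mtd_start = today.replace(day=1)                  # DATE_TRUNC('month', today)
prev_month_end = mtd_start - timedelta(days=1)    # last day of previous month
prev_month_start = prev_month_end.replace(day=1)  # first day of previous month

print(last_7_start, mtd_start, prev_month_start, prev_month_end)
# 2026-03-11 2026-03-01 2026-02-01 2026-02-28
```

Going back one day from the first of the month and re-truncating is the same trick DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month') performs in SQL.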
-- ── TIMEZONE HANDLING ─────────────────────────────────────────────────────────
-- Always work in UTC internally, convert to IST (UTC+5:30) only for display
-- Convert UTC timestamp to IST for reporting:
SELECT
order_id,
created_at AT TIME ZONE 'UTC' AT TIME ZONE 'Asia/Kolkata' AS created_at_ist
FROM silver.orders;
-- Snowflake timezone conversion:
SELECT CONVERT_TIMEZONE('UTC', 'Asia/Kolkata', created_at) AS ist_time FROM orders;
-- BigQuery timezone conversion:
SELECT DATETIME(created_at, 'Asia/Kolkata') AS ist_time FROM orders;
-- Safe IST daily grouping (orders placed between midnight and midnight IST):
SELECT
DATE(created_at AT TIME ZONE 'Asia/Kolkata') AS order_date_ist,
COUNT(*) AS order_count
FROM silver.orders
GROUP BY 1
ORDER BY 1;
Cohort analysis — the classic date SQL challenge
-- Monthly cohort retention: for each signup month cohort,
-- what % of customers are still ordering in month N?
WITH
-- Step 1: assign each customer to their signup cohort (month of first order)
cohort_assignment AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM silver.orders
WHERE status = 'delivered'
GROUP BY customer_id
),
-- Step 2: for each customer, list every month they had at least one order
active_months AS (
SELECT DISTINCT
customer_id,
DATE_TRUNC('month', order_date) AS active_month
FROM silver.orders
WHERE status = 'delivered'
),
-- Step 3: join and calculate cohort_period (0 = cohort month, 1 = one month later, etc.)
cohort_data AS (
SELECT
ca.cohort_month,
am.active_month,
-- How many months after cohort month is this?
EXTRACT(YEAR FROM AGE(am.active_month, ca.cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(am.active_month, ca.cohort_month))
AS cohort_period,
COUNT(DISTINCT am.customer_id) AS customers_active
FROM cohort_assignment ca
JOIN active_months am USING (customer_id)
WHERE am.active_month >= ca.cohort_month
GROUP BY 1, 2, 3
),
-- Step 4: calculate cohort size (customers in month 0)
cohort_sizes AS (
SELECT cohort_month, customers_active AS cohort_size
FROM cohort_data
WHERE cohort_period = 0
)
-- Final: retention rates
SELECT
cd.cohort_month,
cs.cohort_size,
cd.cohort_period,
cd.customers_active,
ROUND(cd.customers_active::NUMERIC / cs.cohort_size * 100, 1) AS retention_pct
FROM cohort_data cd
JOIN cohort_sizes cs USING (cohort_month)
ORDER BY cohort_month, cohort_period;
Query Optimisation — Writing SQL That Is Fast, Not Just Correct
Correct SQL and fast SQL are not the same thing. A data engineer whose queries take 40 minutes when they should take 40 seconds is a pipeline SLA risk. Query optimisation is not premature optimisation — it is the difference between a pipeline that meets its morning SLA and one that does not.
Reading EXPLAIN ANALYZE in PostgreSQL and Snowflake
-- ── POSTGRESQL: EXPLAIN ANALYZE ─────────────────────────────────────────────
EXPLAIN ANALYZE
SELECT
s.store_name,
COUNT(*) AS order_count,
SUM(o.amount) AS total_revenue
FROM silver.orders o
JOIN silver.stores s ON o.store_id = s.store_id
WHERE o.order_date >= '2026-01-01'
GROUP BY s.store_name;
-- Sample output:
-- HashAggregate (cost=... rows=10) (actual time=892ms rows=10)
-- -> Hash Join (cost=... rows=480k) (actual time=24ms rows=482k)
-- Hash Cond: (o.store_id = s.store_id)
-- -> Index Scan using idx_orders_date on orders o
-- (actual time=0.04ms rows=482193)
-- Index Cond: (order_date >= '2026-01-01')
-- -> Hash (actual time=0.03ms rows=10)
-- -> Seq Scan on stores s (actual time=0.02ms rows=10)
-- Planning Time: 1.2 ms
-- Execution Time: 896 ms
-- KEY THINGS TO LOOK FOR:
-- "Seq Scan" on a large table → likely missing index
-- "Index Scan" → index is being used ✓
-- "Hash Join" → joining medium tables efficiently ✓
-- "Nested Loop" on large tables → may be slow
-- rows estimate vs actual rows differing by 10×+ → stale statistics
-- high "actual time" on one node → that node is the bottleneck
-- ── SNOWFLAKE: EXPLAIN ────────────────────────────────────────────────────────
-- Snowflake does not use traditional indexes — it uses micro-partition pruning
EXPLAIN
SELECT store_name, SUM(amount) FROM orders WHERE order_date >= '2026-01-01'
GROUP BY store_name;
-- Look for in Snowflake:
-- "Partition pruning: 847 of 1024 partitions pruned" → filter is working ✓
-- "TableScan: ALL PARTITIONS" → no partition pruning → check clustering key
-- "SpillToLocalStorage" → query is spilling — increase warehouse size or
--   rewrite query to reduce intermediate result size

The ten most impactful optimisation rules
-- 1. FILTER EARLY — push WHERE conditions as early as possible in CTEs
-- BAD: filter after the JOIN
SELECT * FROM orders o JOIN customers c USING (customer_id)
WHERE o.order_date >= '2026-03-01';
-- GOOD: filter before the JOIN — fewer rows enter the join
-- (many optimisers push simple predicates down automatically, but being
-- explicit guarantees it, especially in engines that materialise CTEs)
WITH recent_orders AS (
SELECT * FROM orders WHERE order_date >= '2026-03-01'
)
SELECT * FROM recent_orders o JOIN customers c USING (customer_id);
-- 2. AVOID SELECT * IN PRODUCTION — select only columns you need
-- SELECT * reads all columns including ones your query never uses
-- On columnar warehouses, SELECT * reads every column — negates columnar benefit
SELECT order_id, customer_id, amount FROM orders; -- not SELECT *
-- 3. AVOID FUNCTIONS ON INDEXED COLUMNS IN WHERE — prevents index use
-- BAD: applying function to indexed column prevents index use
WHERE LOWER(email) = 'priya@example.com'
WHERE EXTRACT(YEAR FROM order_date) = 2026 -- full scan!
-- GOOD: compute the comparison value instead, leave the column raw
WHERE email = LOWER('Priya@Example.com') -- or store email pre-lowercased
WHERE order_date >= '2026-01-01' -- range on the column directly
AND order_date < '2027-01-01'
-- 4. USE JOINS INSTEAD OF CORRELATED SUBQUERIES
-- BAD: correlated subquery runs once per outer row
SELECT
o.*,
(SELECT name FROM customers WHERE customer_id = o.customer_id) AS customer_name
FROM orders o;
-- GOOD: JOIN runs once
SELECT o.*, c.name AS customer_name
FROM orders o
JOIN customers c USING (customer_id);
-- 5. COUNT(1) vs COUNT(*) — no practical difference in modern databases
-- Both count rows. COUNT(*) is the SQL standard. Use it.
SELECT COUNT(*) FROM orders;
-- 6. AVOID DISTINCT when not needed — it triggers an extra sort/hash step
-- If you know the join produces unique results, omit DISTINCT
-- If you need uniqueness, check whether a better JOIN design eliminates duplicates
-- 7. MATERIALISE INTERMEDIATE CTEs for reuse in Snowflake / BigQuery
-- A CTE referenced multiple times may be re-executed each time (database-dependent)
-- In Snowflake, use CREATE TEMP TABLE for CTEs referenced > once in complex queries
-- 8. PARTITION PRUNING — always filter on partition columns
-- Snowflake: cluster by the column you filter on most frequently
-- Parquet/Iceberg: filter on the partition column in WHERE clause
-- Without partition filter: full table scan regardless of other filters
-- 9. JOIN ORDER — the hash table should be built on the smaller input
-- PostgreSQL's optimiser usually picks the build side automatically
-- In Snowflake/BigQuery, inspect the query profile when a large join is slow —
-- a mis-ordered join shows up as an oversized hash build or shuffle
-- 10. APPROXIMATE FUNCTIONS for exploration on very large datasets
-- Instead of COUNT(DISTINCT customer_id) which requires loading all IDs:
SELECT APPROX_COUNT_DISTINCT(customer_id) FROM orders; -- Snowflake
SELECT APPROX_COUNT_DISTINCT(customer_id) FROM orders; -- BigQuery (HLL++ based)
-- Typically within 1–2% of exact count, but runs much faster

Warehouse-Specific SQL — Snowflake, BigQuery, Redshift Differences
SQL is standardised but every warehouse adds its own extensions and has its own quirks. When you switch warehouses or work across multiple, these differences cause silent errors or failed queries. Here are the most important differences a data engineer needs to know.
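These quirks are easy to verify locally too. As one more illustration of how JSON access differs per engine, SQLite's built-in JSON functions use BigQuery-style JSONPath strings. A minimal runnable sketch with a made-up payload:

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")  # SQLite ships with the JSON1 functions
conn.execute("CREATE TABLE orders (payload TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (?)",
    (json.dumps({"order_id": 42, "customer": {"city": "Pune"},
                 "items": [{"name": "tea"}]}),),
)

# json_extract takes a '$.path' string, like BigQuery's JSON_VALUE,
# and supports nested paths and array indexing
row = conn.execute("""
    SELECT json_extract(payload, '$.order_id'),
           json_extract(payload, '$.customer.city'),
           json_extract(payload, '$.items[0].name')
    FROM orders
""").fetchone()
print(row)  # (42, 'Pune', 'tea')
```

Same idea, third syntax: a useful reminder that JSON access is one of the least portable parts of warehouse SQL, and worth isolating in one layer of your models.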
-- ── JSON / SEMI-STRUCTURED DATA ──────────────────────────────────────────────
-- PostgreSQL:
SELECT payload->>'order_id' AS order_id, -- text value
payload->'amount' AS amount_json, -- JSON value
(payload->>'amount')::DECIMAL -- cast to numeric
FROM orders WHERE payload IS NOT NULL;
-- Snowflake (VARIANT column):
SELECT
payload:order_id::INTEGER AS order_id,
payload:customer.city::VARCHAR AS city, -- nested path
payload:items[0]:name::VARCHAR AS first_item -- array index
FROM orders;
-- BigQuery (JSON column):
SELECT
JSON_VALUE(payload, '$.order_id') AS order_id,
JSON_VALUE(payload, '$.customer.city') AS city
FROM orders;
-- ── ARRAY HANDLING ────────────────────────────────────────────────────────────
-- PostgreSQL: UNNEST
SELECT order_id, UNNEST(items) AS item FROM orders;
-- Snowflake: FLATTEN
SELECT o.order_id, f.value:item_name::VARCHAR AS item_name
FROM orders o, LATERAL FLATTEN(input => o.items) f;
-- BigQuery: UNNEST
SELECT order_id, item
FROM orders, UNNEST(items) AS item;
-- ── WINDOW FUNCTION QUALIFY (Snowflake / BigQuery only) ──────────────────────
-- PostgreSQL requires a CTE:
WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM orders)
SELECT * FROM ranked WHERE rn = 1;
-- Snowflake / BigQuery — QUALIFY filters window function results directly:
SELECT * FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1;
-- ── STRING FUNCTIONS ─────────────────────────────────────────────────────────
-- Concatenation:
SELECT 'Hello' || ' ' || 'World'; -- PostgreSQL, Snowflake, BigQuery
SELECT CONCAT('Hello', ' ', 'World'); -- All three
-- String splitting:
-- PostgreSQL: SPLIT_PART(str, delimiter, field_number)
-- Snowflake: SPLIT_PART(str, delimiter, field_number) -- same
-- BigQuery: SPLIT(str, delimiter)[OFFSET(0)] -- returns array
-- Regex:
-- PostgreSQL: ~ for match, REGEXP_REPLACE(), REGEXP_MATCH() / SUBSTRING(str FROM pattern)
-- Snowflake: REGEXP_LIKE(), REGEXP_REPLACE(), REGEXP_SUBSTR()
-- BigQuery: REGEXP_CONTAINS(), REGEXP_REPLACE(), REGEXP_EXTRACT()
-- ── DATE FUNCTIONS ────────────────────────────────────────────────────────────
-- Get current date:
CURRENT_DATE -- PostgreSQL, Snowflake
CURRENT_DATE() -- BigQuery
-- Date truncation:
DATE_TRUNC('month', ts) -- PostgreSQL: DATE_TRUNC(part, value)
DATE_TRUNC('month', ts) -- Snowflake: same
DATE_TRUNC(ts, MONTH) -- BigQuery: DATE_TRUNC(value, part) — reversed!
-- Add intervals:
ts + INTERVAL '7 days' -- PostgreSQL
DATEADD('day', 7, ts) -- Snowflake
DATE_ADD(ts, INTERVAL 7 DAY) -- BigQuery
-- ── LIMITING ROWS ────────────────────────────────────────────────────────────
SELECT * FROM orders LIMIT 10; -- PostgreSQL, Snowflake, BigQuery
SELECT TOP 10 * FROM orders; -- SQL Server / Redshift (also supports LIMIT)
SELECT * FROM orders FETCH FIRST 10 ROWS ONLY; -- SQL standard

A Complete dbt Gold Model — Revenue Dashboard SQL
The analytics team needs a Gold layer table that powers the daily revenue dashboard. Requirements: daily revenue per store per category, with running month-to-date totals, day-over-day change, and store ranking within category. This is a complete production-style model using every pattern from this module — in an actual dbt project, the silver.* references below would be {{ source(...) }} or {{ ref(...) }} calls.
-- Gold model: daily revenue metrics by store and category
-- Powers the FreshMart Revenue Dashboard
-- Refresh: daily at 06:00 AM IST
WITH
-- Step 1: Base — delivered orders in analysis window
-- Filter pushed down to Silver before any aggregation
base AS (
SELECT
o.order_id,
o.store_id,
p.category,
-- Item-level amount (column name assumed from the order_items grain):
-- summing o.order_amount after the fan-out join to order_items would
-- double-count each order once per item
oi.item_amount AS order_amount,
DATE(o.created_at AT TIME ZONE 'Asia/Kolkata') AS order_date
FROM silver.orders o
JOIN silver.order_items oi ON o.order_id = oi.order_id
JOIN silver.products p ON oi.product_id = p.product_id
WHERE o.status = 'delivered'
AND o.created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '2 months')
-- 2 months back to support MoM comparisons
),
-- Step 2: Daily aggregation by store + category
daily_agg AS (
SELECT
store_id,
category,
order_date,
COUNT(DISTINCT order_id) AS order_count,
SUM(order_amount) AS daily_revenue
FROM base
GROUP BY 1, 2, 3
),
-- Step 3: Window function layer — running totals, comparisons, rankings
enriched AS (
SELECT
store_id,
category,
order_date,
order_count,
daily_revenue,
-- Running month-to-date revenue
SUM(daily_revenue) OVER (
PARTITION BY store_id, category,
DATE_TRUNC('month', order_date)
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS mtd_revenue,
-- Day-over-day revenue change
daily_revenue
- COALESCE(LAG(daily_revenue) OVER (
PARTITION BY store_id, category
ORDER BY order_date
), 0)
AS dod_revenue_change,
-- 7-day moving average
ROUND(AVG(daily_revenue) OVER (
PARTITION BY store_id, category
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2) AS moving_avg_7d,
-- Rank within category on each day (which store performed best?)
RANK() OVER (
PARTITION BY category, order_date
ORDER BY daily_revenue DESC
) AS store_rank_in_category
FROM daily_agg
),
-- Step 4: Add store dimension attributes
final AS (
SELECT
e.*,
s.store_name,
s.city,
s.store_manager
FROM enriched e
JOIN silver.stores s USING (store_id)
)
SELECT * FROM final
ORDER BY order_date DESC, category, store_rank_in_category;

This model uses every major pattern from this module: CTE chain for readability and testability, filter pushdown in the base CTE, window functions for running totals and rankings, LAG with COALESCE for safe day-over-day comparison, and a clean final JOIN to dimension attributes. A data engineer who can write this kind of SQL confidently is ready for a production data engineering role.
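The window-function layer is the part most worth unit-testing. A toy check of the running-total and LAG logic, using SQLite as a stand-in and three hand-computable rows for one hypothetical store:

```python
import sqlite3

# Tiny in-memory check of the model's window-function layer:
# running month-to-date total and LAG-based day-over-day change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_agg (store_id INT, order_date TEXT, daily_revenue REAL);
    INSERT INTO daily_agg VALUES
        (1, '2026-03-01', 100.0),
        (1, '2026-03-02', 150.0),
        (1, '2026-03-03', 120.0);
""")
rows = conn.execute("""
    SELECT order_date,
           SUM(daily_revenue) OVER (
               PARTITION BY store_id ORDER BY order_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS mtd_revenue,
           daily_revenue - COALESCE(LAG(daily_revenue) OVER (
               PARTITION BY store_id ORDER BY order_date
           ), 0) AS dod_change
    FROM daily_agg
    ORDER BY order_date
""").fetchall()
print(rows)
# [('2026-03-01', 100.0, 100.0), ('2026-03-02', 250.0, 50.0), ('2026-03-03', 370.0, -30.0)]
```

Note the COALESCE: without it, the first day's day-over-day change would be NULL rather than the full day's revenue, and any downstream arithmetic on it would silently go NULL too.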
5 Interview Questions — With Complete Answers
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓ Window functions perform calculations across related rows without collapsing them like GROUP BY does. The OVER clause defines the window: PARTITION BY divides rows into groups, ORDER BY sequences them, and the frame clause (ROWS/RANGE BETWEEN) defines which rows are included per calculation.
- ✓ ROW_NUMBER, RANK, and DENSE_RANK all assign sequential numbers within a partition. Use ROW_NUMBER for deduplication — it guarantees exactly one row per group. RANK allows ties and skips numbers. DENSE_RANK allows ties without skipping. ROW_NUMBER is the correct choice when you need exactly one row kept per key.
- ✓ LAG and LEAD access values from previous and next rows without self-joins. Always provide a default value as the third argument to handle the first/last row in a partition where no previous/next row exists: LAG(amount, 1, 0).
- ✓ CTEs (WITH clauses) are the foundation of readable, maintainable data engineering SQL. Every complex query should be written as a CTE chain — each step named, documented, and independently understandable. Avoid nested subqueries in production code.
- ✓ Deduplication uses ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY timestamp DESC) = 1. This efficiently keeps one row per business key in a single pass. In Snowflake and BigQuery, QUALIFY eliminates the need for a wrapping CTE.
- ✓ NULL in SQL means unknown — not zero, not empty. NULL != 'anything' evaluates to NULL (not TRUE), causing rows to be silently excluded from WHERE clauses. Always handle NULL explicitly: use IS NULL / IS NOT NULL for comparison, COALESCE for defaults, NULLIF for converting values to NULL.
- ✓ SCD Type 1 overwrites — simple but loses history. SCD Type 2 adds a new row with valid_from/valid_to dates — preserves full history, enables point-in-time queries. SCD Type 3 adds a previous-value column — simple but only one change back. Use SCD2 for dimensions where historical analysis matters.
- ✓ DATE_TRUNC is the standard way to group by time period. Always use TIMESTAMPTZ for event timestamps and convert to IST (AT TIME ZONE 'Asia/Kolkata') only for display, not for storage. BigQuery reverses the argument order of DATE_TRUNC — a common cross-warehouse bug.
- ✓ Query optimisation priorities: filter early in CTEs to reduce rows before joins, avoid SELECT *, never apply functions to indexed columns in WHERE (prevents index use), prefer JOINs over correlated subqueries, always filter on partition columns in cloud warehouses.
- ✓ UNION ALL is almost always the right choice over UNION. UNION ALL is faster (no deduplication step) and preserves all records from all sources. Use UNION only when you explicitly want distinct values across two sets. When deduplication is needed, do it explicitly with ROW_NUMBER rather than implicitly with UNION.
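The NULL takeaway is the one that bites silently, so it deserves a runnable check. A minimal SQLite sketch with a made-up three-row table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT, status TEXT);
    INSERT INTO orders VALUES (1, 'delivered'), (2, 'cancelled'), (3, NULL);
""")

# NULL != 'cancelled' evaluates to NULL, not TRUE, so row 3 silently disappears
naive = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status != 'cancelled'"
).fetchone()[0]

# Handling NULL explicitly keeps the row
safe = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status != 'cancelled' OR status IS NULL"
).fetchone()[0]

print(naive, safe)  # 1 2
```

Two rows are not cancelled, but the naive filter returns one. No error, no warning: the NULL row is simply gone, which is exactly why production WHERE clauses must treat NULL as a first-class case.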