SQL for Data Engineers
Not SQL for beginners — SQL for people who need to build pipelines, debug data quality issues, and answer hard questions from analysts. The parts that actually come up in interviews and on the job.
Why SQL is not optional
Every data engineering interview has SQL questions. Every data engineering job uses SQL daily — to validate pipeline output, to write Gold layer transformations, to debug data quality issues, to answer ad-hoc questions from analysts.
PySpark, Python, and cloud tools are important. But SQL is the one skill that never goes away regardless of which cloud, which tool, or which company you work at.
The JOIN types — and when each one actually applies
Most people know INNER JOIN. The ones that catch you in interviews are LEFT JOIN and the edge cases.
-- INNER JOIN: only rows where both sides match
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
-- LEFT JOIN: every row from orders, even if no matching customer
-- (useful for finding orphaned records — data quality check)
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL -- orders with no matching customer
-- SELF JOIN: join a table to itself
-- (useful for finding duplicates or hierarchies)
SELECT a.order_id, b.order_id as duplicate_id
FROM orders a
JOIN orders b
ON a.customer_id = b.customer_id
AND a.order_date = b.order_date
AND a.order_id < b.order_id -- avoid matching row with itselfWindow functions — the most important thing to master
Window functions let you do calculations across rows without collapsing them into groups. Every senior data engineer uses them constantly. Every interview asks about them.
-- ROW_NUMBER: unique rank per partition, no ties
SELECT
order_id,
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_seq
FROM orders
-- Result: each customer's orders numbered 1,2,3... in date order
-- RANK vs DENSE_RANK (the interview trap question)
SELECT
product_name,
revenue,
RANK() OVER (ORDER BY revenue DESC) as rank_with_gaps, -- 1,2,2,4
DENSE_RANK() OVER (ORDER BY revenue DESC) as rank_no_gaps -- 1,2,2,3
FROM product_revenue
-- LAG and LEAD: access previous or next row's value
SELECT
order_date,
daily_revenue,
LAG(daily_revenue) OVER (ORDER BY order_date) as prev_day_revenue,
daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date) as day_over_day_change
FROM daily_sales
-- Running total (cumulative sum)
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date ROWS UNBOUNDED PRECEDING) as running_total
FROM daily_sales
-- Percentage of total using window function
SELECT
region,
revenue,
revenue * 100.0 / SUM(revenue) OVER () as pct_of_total
FROM regional_salesCTEs — how to write SQL that is actually readable
CTE stands for Common Table Expression. It is a named temporary result set you define with a WITH clause. Use them to break complex queries into readable steps.
-- Without CTEs (hard to read and debug)
SELECT customer_id, SUM(amount) as ltv
FROM (
SELECT o.customer_id, o.amount
FROM orders o
WHERE o.status = 'completed'
AND o.order_date >= '2024-01-01'
) filtered
GROUP BY customer_id
HAVING SUM(amount) > 1000
-- With CTEs (easy to read, easy to debug one step at a time)
WITH completed_orders AS (
SELECT customer_id, amount
FROM orders
WHERE status = 'completed'
AND order_date >= '2024-01-01'
),
customer_ltv AS (
SELECT customer_id, SUM(amount) as ltv
FROM completed_orders
GROUP BY customer_id
)
SELECT *
FROM customer_ltv
WHERE ltv > 1000GROUP BY vs HAVING — the one that trips people up
-- WHERE filters rows BEFORE grouping
-- HAVING filters groups AFTER grouping
-- Wrong: this errors because you can't use aggregate in WHERE
SELECT region, COUNT(*) as order_count
FROM orders
WHERE COUNT(*) > 100 -- ERROR
GROUP BY region
-- Correct: use HAVING for aggregate conditions
SELECT region, COUNT(*) as order_count
FROM orders
GROUP BY region
HAVING COUNT(*) > 100 -- works
-- Both together
SELECT region, COUNT(*) as order_count
FROM orders
WHERE order_date >= '2024-01-01' -- filter rows first
GROUP BY region
HAVING COUNT(*) > 100 -- then filter groupsA real data quality check query
Data engineers write queries like this to validate pipeline output. You run this against your Silver layer to make sure the Bronze-to-Silver transformation worked correctly.
-- Check 1: null counts per column
SELECT
COUNT(*) as total_rows,
COUNT(*) - COUNT(order_id) as null_order_ids,
COUNT(*) - COUNT(customer_id) as null_customer_ids,
COUNT(*) - COUNT(amount) as null_amounts,
COUNT(*) - COUNT(order_date) as null_dates
FROM silver.orders
-- Check 2: duplicate order IDs
SELECT order_id, COUNT(*) as cnt
FROM silver.orders
GROUP BY order_id
HAVING COUNT(*) > 1
-- Check 3: amount range sanity check
SELECT
MIN(amount) as min_amount,
MAX(amount) as max_amount,
AVG(amount) as avg_amount,
COUNT(CASE WHEN amount < 0 THEN 1 END) as negative_amounts,
COUNT(CASE WHEN amount > 100000 THEN 1 END) as suspiciously_large
FROM silver.orders
-- Check 4: row count comparison (Silver should have fewer rows than Bronze after dedup)
SELECT
(SELECT COUNT(*) FROM bronze.orders) as bronze_count,
(SELECT COUNT(*) FROM silver.orders) as silver_count,
(SELECT COUNT(*) FROM bronze.orders) - (SELECT COUNT(*) FROM silver.orders) as removed_rowsMonth-over-month comparison — a classic analyst request
Analysts ask for this constantly. Know how to write it without thinking.
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) as change,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
1
) as pct_change
FROM monthly_revenue
ORDER BY monthNotice NULLIF(..., 0) — this prevents division by zero if the previous month had zero revenue. Small details like this are what separate people who write SQL versus people who write production-quality SQL.
Top N per group — another common interview question
-- Top 3 products by revenue per region
WITH ranked_products AS (
SELECT
region,
product_name,
revenue,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) as rn
FROM product_sales
)
SELECT region, product_name, revenue
FROM ranked_products
WHERE rn <= 3
ORDER BY region, rnWhat to practice
Do not just read these examples. Type them out. Change them. Break them on purpose and fix them. The best free practice resources are LeetCode (filter for SQL, medium difficulty) and StrataScratch which has real interview questions from actual companies.
When you can write a window function without looking it up and explain WHY you used it — you are ready.
Wrote complex analytical SQL using window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER) for business reporting
Optimized slow queries using indexing, partitioning, and query execution plan analysis — reducing runtime by 80%
Built data reconciliation queries using CTEs and correlated subqueries to validate pipeline output accuracy
Knowledge Check
5 questions · Earn 50 XP for passing · Score 60% or more to pass
SCD Type 1, 2, and 3 — what they are, when to use each, and how to implement them.
Bronze, Silver, Gold — the pattern behind every modern data lake.
What to check, when to check it, and what to do when checks fail.
Discussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.