Analytics with LAG and LEAD
Period-over-period comparison, retention analysis, funnel drop-off, session gap detection, first and last event patterns — every time-series analytics pattern built on offset window functions
// Part 01
The Core Insight — Comparing Rows Without Self-Joins
Before window functions, comparing each row to the previous row required a self-join — joining the table to itself on a relationship like "previous order date for the same customer." Self-joins are verbose, error-prone, and slow on large tables because they force a Cartesian product before filtering.
LAG and LEAD replace every self-join for adjacent-row comparison with a single function call. LAG accesses the value from N rows before the current row within the window. LEAD accesses the value from N rows after. The window (PARTITION BY + ORDER BY) defines what "before" and "after" mean — previous order for the same customer, previous day's revenue for the same store, previous status in the same event sequence.
// Part 02
Period-Over-Period Comparison — MoM, WoW, YoY
Period-over-period (PoP) comparison is the most common use of LAG in production analytics. Every business dashboard shows "this month vs last month" — LAG makes this a single expression rather than a self-join or two separate subqueries.
Month-over-month revenue change
Week-over-week per store
Same period last year — LAG with offset 12 (monthly data)
// Part 03
Retention Analysis — Did the Customer Come Back?
Retention analysis answers "of the customers who ordered in period N, how many ordered again in period N+1?" LAG and LEAD make this query straightforward — for each customer's order, LEAD shows their next order date, enabling cohort retention measurement without complex subqueries.
Days to next order — repeat purchase gap
Cohort retention — first order month vs return rate
// Part 04
Session Gap Detection — Finding Breaks in Event Streams
Session gap detection identifies when a sequence of events has a meaningful gap — defining "sessions" in clickstream data, finding delivery delays, or flagging unusual pauses in order streams. The pattern: use LAG to compute the gap between consecutive events, then flag gaps above a threshold.
Detecting gaps in order sequences
Customer order gap analysis — churn signal detection
// Part 05
First and Last Event Patterns — Identifying Boundaries
Identifying the first and last occurrence of an event per entity is a foundational analytics pattern. Combined with LAG and LEAD, it powers acquisition analysis (first order), churn detection (last order), and event sequence boundaries.
First order detection — is this the customer's first?
First-order uplift — do first orders differ from repeat orders?
// Part 06
State Change Detection — Tracking Status Transitions
State change detection identifies when a row's categorical value changes — order status transitions, loyalty tier upgrades, product availability changes. The pattern: LAG the status column and compare to the current value. When they differ, a transition occurred.
Loyalty tier upgrade detection
// Part 07
Funnel Analysis — Drop-off Between Steps
Funnel analysis measures how many users/customers complete each step in a sequence and where drop-off occurs. LAG and LEAD make it easy to compare adjacent funnel steps without multiple subqueries.
// Part 08
Running Comparisons — Each Row vs Its Previous N Rows
Beyond simple one-step LAG, production analytics often needs to compare a value to an average or sum of the previous N values — a baseline computed from the recent history. The combination of LAG-based values and frame-based window aggregates handles this cleanly.
// Part 09
Islands and Gaps — Consecutive Sequence Analysis
The islands and gaps problem: given a sequence of events, identify consecutive runs ("islands") and the breaks between them ("gaps"). Classic examples: consecutive days a store was active, runs of increasing revenue, streaks of on-time deliveries. The LAG-based solution is elegant.
Consecutive active days per store
🎯 Pro Tip
The islands-and-gaps pattern always follows the same structure: (1) LAG to compute the gap from the previous row, (2) flag rows where gap > threshold as island starts, (3) cumulative SUM of the flag gives each island a unique number, (4) GROUP BY the island number to summarise each island. This pattern works for any "consecutive run" problem — daily activity streaks, consecutive revenue increases, runs of on-time delivery.
// Part 10
What This Looks Like at Work
You are a data analyst at DoorDash. The retention team needs a comprehensive customer health report: for each customer who placed their most recent order within the last 60 days, show their order history summary, their average gap between orders, whether their gaps are increasing or decreasing (churn signal), and their predicted next order window based on their typical gap.
🎯 Pro Tip
When building predictive signals from order gaps, always filter to customers with at least 2-3 orders before computing gap trends. A customer with one order has no gap; a customer with two orders has one gap but no trend. Meaningful gap trends require at least 3 orders (two gaps to compare). Add a WHERE total_orders >= 3 filter to the final report to avoid misleading signals from customers with limited history.
// Part 11
Interview Prep — 5 Questions With Complete Answers
The pattern: aggregate data to monthly granularity in a CTE using DATE_TRUNC('month', date), then apply LAG(revenue) OVER (ORDER BY month_start) in the outer SELECT to access the previous month's value. The MoM change is current_revenue - LAG(revenue), and the growth percentage is (current - previous) / NULLIF(previous, 0) * 100.
Full implementation: WITH monthly AS (SELECT DATE_TRUNC('month', order_date)::DATE AS month_start, ROUND(SUM(total_amount), 2) AS revenue FROM orders WHERE order_status = 'Delivered' GROUP BY DATE_TRUNC('month', order_date)) SELECT month_start, revenue, LAG(revenue) OVER (ORDER BY month_start) AS prev_revenue, ROUND(revenue - LAG(revenue) OVER (ORDER BY month_start), 2) AS mom_change, ROUND((revenue - LAG(revenue) OVER (ORDER BY month_start)) / NULLIF(LAG(revenue) OVER (ORDER BY month_start), 0) * 100, 1) AS mom_pct FROM monthly ORDER BY month_start.
Important details: NULLIF(LAG(revenue), 0) prevents division by zero if a prior month had zero revenue. The first month's LAG returns NULL — handle this in the display (show '— First period' instead of a percentage). For per-entity PoP (per store, per product), add PARTITION BY entity_id to the OVER clause — LAG then looks at the previous month for the same entity, not the globally previous month. For year-over-year using monthly data, use LAG(revenue, 12) OVER (ORDER BY month_start) — the offset 12 reaches back 12 months.
Churn detection combines LAG (to compute gaps between orders) with aggregation (to compare recent gaps to historical averages). The pattern: for each customer's orders, compute the gap in days between consecutive orders using order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date). Then aggregate per customer: the latest gap, the average historical gap, and the ratio between them.
A customer shows churn signals when their latest gap significantly exceeds their historical average: latest_gap > avg_gap * 1.5 is "at risk," latest_gap > avg_gap * 2 is "likely churned." Customers whose last order date is beyond their predicted next order date (last_order_date + avg_gap < CURRENT_DATE) are overdue. Combining these signals produces an actionable churn risk score: Green (ordering within normal cadence), Amber (gap longer than usual), Red (significantly overdue).
A complete signal: for customers whose last order was more than 90 days ago AND whose last gap is more than twice their average gap, send a re-engagement campaign. Use LEAD instead of LAG for forward-looking analysis — LEAD(order_date) OVER (...) on the most recent order returns NULL (no next order yet), which is itself a churn signal for customers whose last gap has exceeded their typical cadence. Filter to customers with at least 3 orders before computing trends — fewer orders produce unreliable gap averages.
The islands and gaps problem identifies consecutive runs of events (islands) separated by breaks (gaps). Examples: consecutive days a user was active, consecutive months of positive revenue growth, consecutive on-time deliveries. The challenge is grouping consecutive rows into the same island when there is no explicit island identifier in the data.
The LAG-based solution in four steps. Step 1: use LAG to compute the gap between each row and its predecessor — active_date - LAG(active_date) OVER (PARTITION BY entity ORDER BY active_date). Step 2: flag rows where the gap exceeds the threshold as island starts — CASE WHEN gap > 1 OR LAG is NULL THEN 1 ELSE 0 END AS is_island_start. Step 3: compute a cumulative sum of the island start flag using SUM(is_island_start) OVER (PARTITION BY entity ORDER BY active_date) — each island gets a unique incrementing number. Step 4: GROUP BY the island number to summarise each consecutive run — MIN(date) is the island start, MAX(date) is the island end, COUNT(*) is the island length.
The elegance of this approach: the cumulative SUM of the is_island_start flag increments exactly once per island. All rows in the same consecutive run share the same cumulative sum value — the same island number. Any non-consecutive gap increments the flag, creating a new island number for all subsequent rows. This pattern generalises to any consecutive-run problem: consecutive days of positive revenue (flag when revenue turns negative or date jumps), consecutive on-time deliveries (flag when a late delivery occurs), consecutive months of sales growth (flag when growth rate goes negative).
Use LAG to compute the gap for each order, then aggregate per customer. The gap for each order is order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date, order_id). The first order per customer has LAG = NULL (no prior order) — this NULL is excluded from AVG automatically, so AVG correctly computes the average gap only across orders that have a preceding order.
Implementation: WITH order_gaps AS (SELECT customer_id, order_date, order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date, order_id) AS gap_days FROM orders WHERE order_status = 'Delivered') SELECT customer_id, COUNT(*) AS order_count, ROUND(AVG(gap_days), 1) AS avg_gap_days, MIN(gap_days) AS min_gap, MAX(gap_days) AS max_gap FROM order_gaps GROUP BY customer_id HAVING COUNT(*) >= 2. The HAVING COUNT(*) >= 2 excludes single-order customers who have no meaningful gap to compute.
The result is used for: predicted next order date (last_order_date + avg_gap_days::INTEGER), churn signal (CURRENT_DATE - last_order_date > avg_gap_days * 1.5), and segment classification (customers with avg_gap < 7 days are weekly shoppers, avg_gap 7-30 days are monthly shoppers). Always add a tiebreaker to the ORDER BY inside the LAG (order_id is a good choice) — when two orders share the same date, the gap computation must be deterministic.
A funnel analysis needs: the count at each step, the drop-off from the previous step, and the overall conversion from the top of the funnel. LAG provides the previous step's count, FIRST_VALUE provides the top-of-funnel count, and simple arithmetic computes the rates.
Pattern: define each funnel step as a row in a CTE — either using a VALUES clause for known step counts, or computing counts per step from event tables. Then in the outer SELECT, apply LAG(users) OVER (ORDER BY step_num) for the previous step count. Drop-off = LAG(users) - users. Drop-off rate = drop-off / LAG(users) * 100. Overall conversion = users / FIRST_VALUE(users) OVER (ORDER BY step_num ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) * 100 — this gives the cumulative conversion from step 1.
For a real event-based funnel (not pre-aggregated counts), aggregate each step separately in a CTE — count distinct users who performed action A, action B, action C — then UNION ALL the step counts into one result set that LAG can process. The key is computing steps independently (not chaining — A then B then C — which double-counts and misses users who skip steps) and then joining the step counts. Use DISTINCT user counts at each step, not cumulative — a user who completes step 3 should not be double-counted at step 2. The funnel shows what percentage of the initial population makes it to each step, independent of whether they did all prior steps.
// Part 12
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓LAG(col, n, default) returns the value n rows before the current row within the window. LEAD(col, n, default) returns n rows after. Both require ORDER BY — without it, before/after are undefined.
- ✓LAG replaces every self-join for adjacent-row comparison. The window (PARTITION BY + ORDER BY) defines what adjacent means — previous order for the same customer, previous day for the same store.
- ✓Period-over-period pattern: aggregate to period granularity in a CTE (DATE_TRUNC), then apply LAG(revenue) OVER (ORDER BY month_start) in the outer SELECT. MoM change = revenue - LAG(revenue). Growth rate uses NULLIF on the denominator.
- ✓For per-entity PoP (per store, per product), add PARTITION BY entity_id to the OVER clause — LAG looks at the previous period for the same entity, not the globally previous row.
- ✓Churn signal: compute gap_days = order_date - LAG(order_date), then compare latest_gap to avg_gap per customer. latest_gap > avg_gap * 1.5 is at risk; > 2x is likely churned.
- ✓First-order detection: LAG(order_id) IS NULL flags the first order per customer (no predecessor). LEAD(order_id) IS NULL flags the last order (no successor).
- ✓Islands and gaps: (1) LAG to compute gap, (2) flag gap > threshold as island_start = 1, (3) SUM(island_start) OVER (...) cumulative gives each island a unique number, (4) GROUP BY island_num to summarise runs.
- ✓Funnel drop-off: define step counts in a CTE, then LAG(users) OVER (ORDER BY step_num) gives the prior step count. Drop-off rate = (LAG - current) / LAG * 100. Overall conversion uses FIRST_VALUE.
- ✓Trailing average excluding current row: ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING. This frame ends one row before the current row — the average is from prior rows only, a true trailing baseline.
- ✓Always pre-aggregate to the correct granularity in a CTE before applying LAG/LEAD. Window functions on raw row-level data when you need daily or monthly values produce one result per raw row, not per period.
What comes next
In Module 56, you learn Recursive CTEs — querying hierarchical data like org charts, category trees, and bill-of-materials without fixed-depth self-joins, using the WITH RECURSIVE pattern.
Module 56 → Recursive CTEsDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.