Ranking Functions
Advanced ranking — percentile rank, cumulative distribution, conditional rankings, multi-level leaderboards, and every production pattern where ranking drives business decisions
// Part 01
The Full Ranking Toolkit
Module 52 introduced ROW_NUMBER, RANK, DENSE_RANK, and NTILE. This module goes deeper — covering the statistical ranking functions PERCENT_RANK and CUME_DIST, the ordered-set aggregate PERCENTILE_CONT, conditional and multi-level ranking, and the production patterns that appear in leaderboards, cohort analysis, and performance tiers.
// Part 02
PERCENT_RANK and CUME_DIST — Statistical Position
PERCENT_RANK answers "what percentage of rows rank strictly below this one?" — ranging from 0.0 (lowest) to 1.0 (highest). CUME_DIST answers "what fraction of rows have a value less than or equal to this row's value?" — always greater than 0.0, up to 1.0. Both are useful for expressing a row's position as a relative fraction rather than an absolute rank number.
PERCENT_RANK vs CUME_DIST — the key difference
// Part 03
PERCENTILE_CONT and PERCENTILE_DISC — True Percentile Values
PERCENT_RANK and CUME_DIST tell you where a row sits in the distribution. PERCENTILE_CONT and PERCENTILE_DISC answer the inverse question: what value sits at the Nth percentile? These are ordered-set aggregates — they collapse rows like GROUP BY and return a single value representing a specific point in the distribution.
// Part 04
Conditional Ranking — Rank Only a Subset
Sometimes you need to rank rows only when they meet a condition — rank only active products, rank only delivered orders, rank employees within their own department. The FILTER clause and CASE expressions extend ranking functions to handle these scenarios cleanly.
Rank within a filtered subset using a CTE
Rank with a FILTER clause — conditional aggregation inside ranking
// Part 05
Multi-Level Rankings — Ranking Within Ranking
Production leaderboards often need hierarchy: rank stores within a city, then cities within a region. Or rank products within a category, then rank categories overall. Multi-level rankings use multiple window functions with different PARTITION BY clauses in the same query.
// Part 06
Leaderboard Patterns — Production-Grade Rankings
Production leaderboards need more than a simple RANK() — they need movement indicators (up/down from last period), badges for sustained performance, and handling of new entrants. These patterns combine ranking functions with LAG for previous-period comparison.
Leaderboard with movement arrows
Sustained performance badge — consistently top-ranked
// Part 07
Percentile-Based Segmentation — Deciles and Quartiles
Percentile-based segmentation divides a continuous distribution into equal-sized groups for analysis. Quartiles (4 groups), deciles (10 groups), and percentiles (100 groups) are standard tools for customer segmentation, product tiering, and performance distribution analysis.
// Part 08
Ranking for Deduplication — Keep One Row Per Group
ROW_NUMBER is the cleanest way to deduplicate — keep exactly one row per group based on a defined priority. The pattern appears constantly in data engineering: keep the most recent record, keep the highest-value record, keep the first occurrence.
🎯 Pro Tip
When deduplicating, always add a tiebreaker to the ORDER BY inside ROW_NUMBER. If two rows have the same primary sort value (same order_date, same revenue), the one that gets rn = 1 is non-deterministic without a tiebreaker. Add a unique column (order_id, product_id) as the final ORDER BY key to ensure deterministic, reproducible results across query executions.
// Part 09
What This Looks Like at Work
You are a data analyst at Amazon. The seller success team runs a monthly seller health report: for each seller, their revenue rank within their category, their overall revenue rank, their percentile position, their quartile segment, and whether they moved up or down compared to the previous month. The ranking logic alone would take multiple queries without window functions — with them, it is one SELECT on two pre-aggregated CTEs.
🎯 Pro Tip
When building leaderboard reports, always compute rankings in a CTE layer rather than in the final SELECT. This avoids repeating the same RANK() OVER (...) expression multiple times (for the badge CASE and for display). Extract ranks into a CTE with columns rank_jan and rank_feb, then reference those columns in the final SELECT for comparisons and badges. The query runs faster (each OVER() is computed once) and is easier to read.
// Part 10
Interview Prep — 5 Questions With Complete Answers
Both express a row's position in the distribution as a fraction between 0 and 1, but they compute different things. PERCENT_RANK answers "what fraction of rows rank strictly below this row?" It is computed as (rank - 1) / (total_rows - 1). The lowest value always gets 0.0 (zero rows rank below it) and the highest always gets 1.0. Tied rows get the same PERCENT_RANK value. If two rows share the minimum value, both get PERCENT_RANK = 0.0.
CUME_DIST answers "what fraction of rows have a value less than or equal to this row's value?" It is computed as rows_with_value_leq_current / total_rows. CUME_DIST is always greater than zero — even the lowest-value rows have themselves counted, so the minimum is 1/total_rows. For the highest value, CUME_DIST = 1.0. All rows sharing the same value get the same CUME_DIST value, and it includes themselves in the count.
Practical implication: PERCENT_RANK is better for "is this row in the top X%?" comparisons — WHERE PERCENT_RANK() OVER (...) >= 0.8 finds the top 20% cleanly. CUME_DIST is useful when you want to know what proportion of the population has a value at or below a specific point — for example, "what fraction of orders are at or below ₹500?" Both are window functions requiring OVER() with ORDER BY — unlike PERCENTILE_CONT and PERCENTILE_DISC which are ordered-set aggregates that collapse rows like GROUP BY.
NTILE(n) divides rows into exactly n buckets with approximately equal row counts — it is a row-count-based division. NTILE(4) assigns each row to quartile 1, 2, 3, or 4, where each quartile contains roughly 25% of the rows. The bucket assignment depends on row position, not value — two rows with very different values could both be in quartile 3 if they are adjacent in the ordered sequence.
PERCENT_RANK is value-based — it computes the row's fractional rank relative to the distribution of values. Filtering WHERE PERCENT_RANK() OVER (...) >= 0.75 finds the top 25% by value, not by count. If the top 5% of customers account for 60% of revenue (a right-skewed distribution), NTILE splits 25% of customers per bucket regardless of value distribution, while PERCENT_RANK >= 0.75 identifies customers whose spend puts them in the top 25% by value rank.
Choose NTILE for equal-count segments — marketing campaigns that need to reach exactly 25% of the customer base per segment, A/B test splits where equal group sizes matter. Choose PERCENT_RANK for value-based thresholds — "find customers above the 75th percentile of spending", "identify products in the bottom 10% of revenue", "classify orders by where they fall in the distribution". NTILE can produce unintuitive results when the data is skewed — a product that just barely makes the top quartile may be far below the actual 75th percentile value if most revenue is concentrated in the top few products. For business reporting where "top 25%" means value-based, PERCENT_RANK is more meaningful.
When using ROW_NUMBER for deduplication (keeping exactly one row per group), ties in the ORDER BY column create non-deterministic behaviour — the database may assign rn = 1 to any of the tied rows, and the choice may change between query executions or database versions. This is a correctness problem: the same query run twice could return different rows, making results unreproducible.
The solution is always to add a tiebreaker column as the final ORDER BY key — a column that is guaranteed unique within the partition, typically the primary key. ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC, order_id DESC) — if two orders have the same order_date, the one with the higher order_id gets rn = 1. This ensures deterministic, reproducible results regardless of physical row order or internal optimiser decisions.
The tiebreaker column choice matters semantically: order_id DESC for the most recently inserted duplicate, created_at ASC for the original/oldest record. Document the tiebreaker logic — when someone asks "which record do you keep when there are duplicates?", the ORDER BY clause inside ROW_NUMBER should have a clear, business-meaningful answer. For RANK and DENSE_RANK (which are designed to express ties), no tiebreaker is needed — tied rows intentionally share the same rank. Only ROW_NUMBER requires a tiebreaker because its contract is to produce unique sequential integers.
Both compute the value at a specified percentile of a distribution, but they use different methods for non-integer positions. PERCENTILE_CONT uses linear interpolation — when the percentile falls between two actual data values, it returns the weighted average of those two values. PERCENTILE_DISC returns the nearest actual value in the dataset — no interpolation.
For the median (50th percentile) of [100, 200, 300, 400]: PERCENTILE_CONT(0.5) returns 250 (the midpoint of 200 and 300, interpolated). PERCENTILE_DISC(0.5) returns 200 or 300 (the actual data value nearest the 50th percentile position). For an odd number of values like [100, 200, 300]: both would return 200 since 200 is exactly at the 50th percentile.
Choose PERCENTILE_CONT for statistical analysis where mathematical precision matters — it follows the standard definition of percentile that most statisticians and data scientists expect. Choose PERCENTILE_DISC when you need the result to be an actual value from the dataset — for example, when selecting the median salary to identify a specific employee as the median earner. These are ordered-set aggregates (unlike the window function PERCENT_RANK) — they use WITHIN GROUP (ORDER BY column) syntax and collapse rows like GROUP BY. They do not have OVER() — they are called like standard aggregate functions: SELECT store_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) AS median FROM orders GROUP BY store_id.
The pattern requires computing ranks for two time periods independently, then comparing them. Use two CTEs — one per period — each aggregating the relevant metric (revenue, order count, score) per entity. Then a final SELECT joins the two CTEs and applies RANK() to each period's metric with separate OVER() clauses. The rank movement is the previous period rank minus the current period rank — positive means improved (moved up), negative means declined (moved down), zero means unchanged.
Implementation: WITH current_period AS (SELECT entity_id, SUM(metric) AS score FROM ... WHERE date IN current_range GROUP BY entity_id), previous_period AS (SELECT entity_id, SUM(metric) AS score FROM ... WHERE date IN previous_range GROUP BY entity_id), combined AS (SELECT COALESCE(c.entity_id, p.entity_id) AS entity_id, COALESCE(c.score, 0) AS current_score, COALESCE(p.score, 0) AS prev_score FROM current_period c FULL OUTER JOIN previous_period p ON c.entity_id = p.entity_id) SELECT *, RANK() OVER (ORDER BY current_score DESC) AS current_rank, RANK() OVER (ORDER BY prev_score DESC) AS prev_rank, RANK() OVER (ORDER BY prev_score DESC) - RANK() OVER (ORDER BY current_score DESC) AS rank_change FROM combined.
Two important details: use FULL OUTER JOIN between periods to include entities that appear in one period but not the other (new entrants have no previous rank; churned entities have no current rank). Use COALESCE(score, 0) so new entrants get rank assigned based on zero score rather than being excluded. For the movement indicator display, convert rank_change to a symbol: CASE WHEN rank_change > 0 THEN '▲ ' || rank_change WHEN rank_change < 0 THEN '▼ ' || ABS(rank_change) ELSE '—' END. For entities that did not exist in the previous period, show 'NEW' instead of a movement arrow. The FULL OUTER JOIN naturally handles this — entities with no previous record have prev_rank based on zero score.
// Part 11
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓PERCENT_RANK: fractional rank from 0.0 (lowest) to 1.0 (highest) — fraction of rows ranking strictly below. Formula: (rank - 1) / (total_rows - 1). Use WHERE PERCENT_RANK() >= 0.8 to find top 20%.
- ✓CUME_DIST: cumulative distribution — fraction of rows with value ≤ current row. Always > 0, up to 1.0. Differs from PERCENT_RANK by including the current row in the count.
- ✓PERCENTILE_CONT(f) WITHIN GROUP (ORDER BY col): returns the interpolated value at fraction f of the distribution. Uses GROUP BY syntax, not OVER(). PERCENTILE_DISC returns the nearest actual value from the dataset.
- ✓NTILE(n): equal-count buckets. When rows do not divide evenly by n, early buckets get one extra row. Use for equal-sized segments; use PERCENT_RANK for value-based thresholds.
- ✓Conditional ranking: wrap the window function in a CASE — return the rank when the condition is true, NULL otherwise. Or pre-filter in a CTE and apply the ranking to the filtered subset.
- ✓Multi-level rankings (store within city, city overall) must be computed in separate CTE layers at their respective granularities. Never compute city-level ranks in a store-level query — duplicate values produce incorrect rank assignments.
- ✓ROW_NUMBER deduplication: always add a unique tiebreaker column as the final ORDER BY key. Without a tiebreaker, ties produce non-deterministic assignment of rn = 1 — different rows may be kept on different executions.
- ✓Leaderboard with movement: FULL OUTER JOIN two period CTEs, compute RANK() OVER (...) on each period's metric independently, subtract to get rank_change. Positive = improved, negative = declined.
- ✓Pre-compute ranks in a CTE when the same RANK() expression appears in multiple CASE branches — the CTE computes it once, the outer SELECT references the named column without repeated computation.
- ✓PERCENT_RANK and CUME_DIST require ORDER BY in OVER(). PERCENTILE_CONT and PERCENTILE_DISC use WITHIN GROUP (ORDER BY) and are aggregate functions — no OVER() clause.
What comes next
In Module 54, you learn Analytics with LAG and LEAD — every period-over-period pattern, retention analysis, funnel drop-off, session gap detection, and time-series analytics built on offset window functions.
Module 54 → Analytics with LAG and LEADDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.