GROUP BY
Split rows into groups and compute aggregates per group — the engine behind every analytics report, dashboard metric, and business intelligence query
// Part 01
From One Summary to Many — What GROUP BY Does
In Module 27 you learned aggregate functions — they collapse all rows into a single summary number. But most real business questions are not "what is the total revenue?" They are "what is the total revenue per store?" or "what is the average order value per city?" or "how many orders were placed per month?"
GROUP BY is the clause that splits rows into groups before aggregating — so instead of one summary number for the whole table, you get one summary number per group. It is the difference between a company-wide total and a per-department breakdown. Between a product's average rating and a category's average rating.
Same table, same WHERE filter, same aggregate functions — but GROUP BY store_id splits the rows into one bucket per store and computes the aggregates separately for each bucket. The result is one row per store instead of one row total.
// Part 02
How GROUP BY Works Internally
Understanding what the database does when it executes GROUP BY makes you write better GROUP BY queries — and explains every error you will encounter.
The execution steps
The key insight: GROUP BY runs before SELECT. This is why you cannot use a SELECT alias in GROUP BY in standard SQL — the alias has not been computed yet when GROUP BY runs. It is also why aggregate functions appear in SELECT (step 5) and HAVING (step 4), but never in WHERE (step 2) — WHERE runs before groups are formed, so aggregates do not exist yet.
What happens inside each group
After GROUP BY, each group contains all rows that share the same value(s) in the GROUP BY column(s). SELECT then processes each group independently — the aggregate functions see only the rows belonging to that group. COUNT(*) counts the rows in that group. SUM(total_amount) adds up the total_amount values for that group. MIN and MAX find the extremes within that group.
// Part 03
GROUP BY Single Column — The Foundation
Group by category
// Part 04
GROUP BY Multiple Columns — Two-Dimensional Grouping
GROUP BY can list multiple columns. The database creates one group for each unique combination of all GROUP BY columns — not one group per unique value in each column independently.
The key rule — every non-aggregate SELECT column must be in GROUP BY
// Part 05
GROUP BY with WHERE — Filter Before Grouping
WHERE filters rows before GROUP BY processes them. Only rows that pass the WHERE condition are included in any group. This is the most efficient way to filter — rows excluded by WHERE never participate in grouping or aggregation.
// Part 06
GROUP BY with Expressions — Grouping on Computed Values
GROUP BY does not have to group by raw column values. You can group by any expression — a calculated value, a date part, a CASE WHEN classification. This is how you group by month, by price band, by day of week, or by any computed category.
Grouping by date parts
Grouping by CASE WHEN — custom bands
// Part 07
GROUP BY with JOIN — Multi-Table Aggregation
The most powerful GROUP BY queries join multiple tables before grouping — bringing together data from customers, orders, products, and stores, then aggregating the combined result.
// Part 08
NULL Values in GROUP BY
Unlike most SQL contexts where NULL ≠ NULL, GROUP BY treats all NULL values as a single group. All rows with NULL in the GROUP BY column(s) are placed together in one NULL group. This is consistent with the DISTINCT behaviour you learned in Module 10.
The NULL group is a valid group — it gets its own aggregate row just like any other group. If you want to exclude the NULL group from results, add WHERE column IS NOT NULL before the GROUP BY. If you want to label the NULL group, use COALESCE in the GROUP BY expression.
// Part 09
ROLLUP — Subtotals and Grand Totals
ROLLUP is an extension of GROUP BY that automatically generates subtotals and a grand total. It is used heavily in reporting — the kind of hierarchical totals you see in Excel pivot tables.
// Part 10
Performance — How GROUP BY Uses Indexes
GROUP BY performance on large tables depends on whether the database can use indexes to avoid sorting all rows.
How the database executes GROUP BY
To group rows, the database must find all rows with the same GROUP BY value and put them together. Two approaches: Sort-based grouping — sort all rows by the GROUP BY columns, then scan sequentially (identical values are adjacent after sorting). Or Hash-based grouping — compute a hash of each row's GROUP BY values and bucket rows into a hash table. Hash-based is usually faster when the number of groups is manageable in memory.
When indexes help GROUP BY
If the GROUP BY column(s) are indexed and the query also filters by those columns, the database can use the index to avoid a full sort. For very selective GROUP BY (few groups), a partial scan of the index can be faster than processing all rows. For GROUP BY on the primary key or a unique column, each group has exactly one row — effectively no grouping needed.
Practical performance guidance
🎯 Pro Tip
The most impactful GROUP BY optimisation: ensure a strong WHERE clause filters rows before GROUP BY processes them. A GROUP BY that runs on 10,000 rows after WHERE filtering is 100x faster than running on 1,000,000 rows without filtering. Always think of WHERE as the pre-filter that determines how much work GROUP BY has to do.
// Part 11
Complete Analytics Examples — Real Business Reports
These are the types of GROUP BY queries you will write in real analytics work — complete, production-quality reports.
Monthly revenue trend
Product performance report
Customer RFM summary (Recency, Frequency, Monetary)
// Part 12
What This Looks Like at Work
You are a data analyst at Venmo. The product team is reviewing payment method adoption across different merchant categories. They need a breakdown of transaction volume and value by payment method and merchant type — a two-dimensional GROUP BY report.
🎯 Pro Tip
Two-dimensional GROUP BY queries (grouping by two columns) are the foundation of every business pivot table. The moment you find yourself wanting to see "X by Y" in a report — revenue by city by month, orders by payment method by product category, customers by tier by city — that is a GROUP BY with two columns. Learn to read business questions and immediately translate "X by Y" to GROUP BY x, y.
// Part 13
Interview Prep — 5 Questions With Complete Answers
GROUP BY divides rows into groups based on one or more columns, so that aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute separately for each group rather than across all rows. Without GROUP BY, an aggregate function returns one value for the entire result set. With GROUP BY, it returns one value per group — one row per unique combination of GROUP BY column values.
The relationship is fundamental: aggregate functions and GROUP BY are designed to work together. GROUP BY defines the groups; aggregate functions compute statistics for each group. SELECT COUNT(*), SUM(total_amount) FROM orders returns one summary row. SELECT store_id, COUNT(*), SUM(total_amount) FROM orders GROUP BY store_id returns one summary row per store — the aggregates are computed independently within each store's group of rows.
The execution order is critical: GROUP BY runs after WHERE (which filters individual rows before grouping) but before SELECT (which computes the output columns including aggregates). This means: WHERE filters which rows participate in groups; GROUP BY determines the groups; SELECT aggregates within each group. Because GROUP BY runs before SELECT, you cannot use SELECT aliases in GROUP BY in standard SQL — the aliases do not exist yet. And because WHERE runs before GROUP BY, you cannot filter groups using WHERE — that requires HAVING, which runs after GROUP BY.
Every column in the SELECT list of a GROUP BY query must be either (1) listed in the GROUP BY clause, or (2) wrapped in an aggregate function. This is the fundamental rule of grouped queries and it follows directly from what GROUP BY does.
The reason: after GROUP BY, each group is collapsed into one output row. A non-aggregate column like first_name might have different values within a group — if you group by city, one Seattle group might contain customers named Aisha, Rahul, and Priya. The database cannot return a single first_name for this group because there is no single value. It can only return values that are the same for every row in the group (the GROUP BY columns themselves) or computed aggregate values that reduce the group to one number.
PostgreSQL and most standard SQL databases enforce this strictly — a non-aggregate, non-GROUP BY column in SELECT throws an error. MySQL with sql_mode not including ONLY_FULL_GROUP_BY was historically more permissive — it would pick an arbitrary value from the group for non-aggregate columns, which is usually wrong and unpredictable. This is why MySQL queries sometimes appear to "work" without proper GROUP BY usage, but return unreliable results. Always follow the rule: every SELECT column is either in GROUP BY or in an aggregate function.
WHERE and HAVING both filter data, but they operate at different stages of the GROUP BY execution pipeline and on different things. WHERE filters individual rows before groups are formed — it determines which rows participate in grouping. HAVING filters groups after they have been formed and aggregated — it determines which groups appear in the final result.
Concrete example: to find stores with more than 3 delivered orders. WHERE order_status = 'Delivered' filters individual orders to include only delivered ones. HAVING COUNT(*) > 3 filters store groups to include only stores that have more than 3 delivered orders. Both conditions are necessary: WHERE selects which orders count, HAVING selects which stores qualify. Using WHERE COUNT(*) > 3 is an error — WHERE runs before grouping, so aggregate results do not exist yet when WHERE evaluates.
Performance implication: WHERE filtering is almost always more efficient than HAVING filtering for the same condition. WHERE filters rows before they enter the grouping process — fewer rows means less grouping work. HAVING filters after all the grouping and aggregation is done — the work was already performed on the excluded groups. The rule: use WHERE for any condition that can be expressed on individual row values. Use HAVING only for conditions that require aggregate values (COUNT > N, SUM > value, AVG > threshold). Moving a filter from HAVING to WHERE (when the logic allows it) can dramatically improve query performance.
GROUP BY treats all NULL values in a GROUP BY column as belonging to the same group — all rows with NULL in the GROUP BY column are placed together in a single NULL group. This is different from WHERE and JOIN conditions, where NULL ≠ NULL. For grouping purposes, the database considers NULLs to be equal to each other.
This means a GROUP BY query on a nullable column will produce a row in the result for the NULL group, alongside rows for each non-null distinct value. GROUP BY department from employees might produce rows for 'Management', 'Operations', 'Sales', and NULL — the NULL row represents all employees without a department assignment. The aggregate functions compute normally for the NULL group: COUNT(*) counts all NULL-department employees, SUM(salary) adds their salaries.
If you want to exclude the NULL group from results, add WHERE column IS NOT NULL before the GROUP BY — this filters out NULL rows before they can form a group. If you want to give the NULL group a meaningful label in the output, use COALESCE in the GROUP BY expression: GROUP BY COALESCE(department, 'Unassigned') also changes the group value to 'Unassigned' in the SELECT output. Note that grouping by COALESCE(department, 'Unassigned') and department = 'Unassigned' are different: the COALESCE version groups NULLs and actual 'Unassigned' values together, while grouping by the raw column keeps them separate (though in practice, both should not exist in a well-constrained schema).
A basic GROUP BY can find the maximum sales value per category: SELECT category, MAX(units_sold) FROM product_sales GROUP BY category. But this only returns the maximum value — not which product achieved it. To find the specific product name alongside the maximum, you need to join the GROUP BY result back to the original data or use a different approach.
Approach 1 — subquery: SELECT p.category, p.product_name, ps.units_sold FROM product_sales ps JOIN products p ON ... WHERE (p.category, ps.units_sold) IN (SELECT p2.category, MAX(ps2.units_sold) FROM product_sales ps2 JOIN products p2 ON ... GROUP BY p2.category). This finds the product where the (category, units_sold) pair matches the maximum units_sold for that category.
Approach 2 — window functions (the modern approach): SELECT DISTINCT ON (category) category, product_name, units_sold FROM product_sales ORDER BY category, units_sold DESC. In PostgreSQL, DISTINCT ON returns the first row per category after sorting by units_sold descending — effectively the top product per category. The even cleaner approach using ROW_NUMBER(): SELECT category, product_name, units_sold FROM (SELECT category, product_name, units_sold, ROW_NUMBER() OVER (PARTITION BY category ORDER BY units_sold DESC) AS rn FROM product_sales) ranked WHERE rn = 1. This ranks products within each category and selects only rank 1 — the top seller. Window functions (Module 45) are the professional solution for "top N per group" queries.
// Part 14
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓GROUP BY splits rows into groups based on column values, then aggregate functions compute separately per group. One output row per unique combination of GROUP BY column values.
- ✓Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. GROUP BY runs before SELECT — SELECT aliases cannot be used in GROUP BY in standard SQL.
- ✓Every non-aggregate column in SELECT must appear in GROUP BY. Aggregate columns (COUNT, SUM, AVG) do not need to be in GROUP BY — they compute per group automatically.
- ✓WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER grouping. Use WHERE for row-level conditions; use HAVING only for conditions on aggregate results.
- ✓GROUP BY on multiple columns creates one group per unique combination. GROUP BY city, loyalty_tier creates one group per (city, tier) pair — potentially many groups.
- ✓GROUP BY on expressions (EXTRACT(MONTH FROM date), CASE WHEN salary >= 70000) groups by computed values. Repeat the full expression in GROUP BY — do not rely on aliases.
- ✓GROUP BY treats NULLs as a single group — all NULL rows in the GROUP BY column form one group. Use WHERE col IS NOT NULL to exclude or COALESCE to rename the NULL group.
- ✓Fan-out bug: joining a one-to-many relationship before aggregating inflates counts. Pre-aggregate in a subquery or CTE, or use COUNT(DISTINCT key) to deduplicate.
- ✓ROLLUP extends GROUP BY to add subtotals and grand totals automatically — one row per group, plus subtotal rows for each partial combination, plus one grand total row.
- ✓The most impactful performance optimisation: filter aggressively with WHERE before GROUP BY. Reducing rows before grouping is the single biggest lever for GROUP BY query speed.
What comes next
In Module 29, you learn HAVING — the clause that filters groups after aggregation, completing the WHERE + GROUP BY + HAVING triad that powers every analytical query.
Module 29 → HAVINGDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.