Python · SQL · Web Dev · Java · AI/ML tracks launching soon — your one platform for all of IT
Intermediate

GROUP BY

Split rows into groups and compute aggregates per group — the engine behind every analytics report, dashboard metric, and business intelligence query

14–18 min April 2026
Section 6 · Aggregation
Aggregation · 3 modulesModule 28

// Part 01

From One Summary to Many — What GROUP BY Does

In Module 27 you learned aggregate functions — they collapse all rows into a single summary number. But most real business questions are not "what is the total revenue?" They are "what is the total revenue per store?" or "what is the average order value per city?" or "how many orders were placed per month?"

GROUP BY is the clause that splits rows into groups before aggregating — so instead of one summary number for the whole table, you get one summary number per group. It is the difference between a company-wide total and a per-department breakdown. Between a product's average rating and a category's average rating.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

Same table, same WHERE filter, same aggregate functions — but GROUP BY store_id splits the rows into one bucket per store and computes the aggregates separately for each bucket. The result is one row per store instead of one row total.

// Part 02

How GROUP BY Works Internally

Understanding what the database does when it executes GROUP BY makes you write better GROUP BY queries — and explains every error you will encounter.

The execution steps

01
FROM
Identify the source table(s)
02
WHERE
Filter individual rows — only matching rows proceed
03
GROUP BY
Split remaining rows into groups — one bucket per unique combination of GROUP BY columns
04
HAVING
Filter groups — discard groups that do not satisfy the HAVING condition
05
SELECT
Compute the output columns for each surviving group — aggregate functions run here
06
ORDER BY
Sort the group-level result rows
07
LIMIT
Truncate the sorted result

The key insight: GROUP BY runs before SELECT. This is why you cannot use a SELECT alias in GROUP BY in standard SQL — the alias has not been computed yet when GROUP BY runs. It is also why aggregate functions appear in SELECT (step 5) and HAVING (step 4), but never in WHERE (step 2) — WHERE runs before groups are formed, so aggregates do not exist yet.

What happens inside each group

After GROUP BY, each group contains all rows that share the same value(s) in the GROUP BY column(s). SELECT then processes each group independently — the aggregate functions see only the rows belonging to that group. COUNT(*) counts the rows in that group. SUM(total_amount) adds up the total_amount values for that group. MIN and MAX find the extremes within that group.

// Part 03

GROUP BY Single Column — The Foundation

Group by category

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 04

GROUP BY Multiple Columns — Two-Dimensional Grouping

GROUP BY can list multiple columns. The database creates one group for each unique combination of all GROUP BY columns — not one group per unique value in each column independently.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

The key rule — every non-aggregate SELECT column must be in GROUP BY

The GROUP BY rule — every non-aggregate must be grouped
-- WRONG: store_name is in SELECT but not in GROUP BY
SELECT store_id, store_name, COUNT(*)
FROM orders
JOIN stores ON orders.store_id = stores.store_id
GROUP BY store_id;
-- ERROR: column "stores.store_name" must appear in GROUP BY
-- or be used in an aggregate function

-- RIGHT: include all non-aggregate columns in GROUP BY
SELECT store_id, store_name, COUNT(*)
FROM orders
JOIN stores ON orders.store_id = stores.store_id
GROUP BY store_id, store_name;

-- Or aggregate store_name (rarely needed):
SELECT store_id, MIN(store_name) AS store_name, COUNT(*)
FROM orders
JOIN stores ON orders.store_id = stores.store_id
GROUP BY store_id;

// Part 05

GROUP BY with WHERE — Filter Before Grouping

WHERE filters rows before GROUP BY processes them. Only rows that pass the WHERE condition are included in any group. This is the most efficient way to filter — rows excluded by WHERE never participate in grouping or aggregation.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 06

GROUP BY with Expressions — Grouping on Computed Values

GROUP BY does not have to group by raw column values. You can group by any expression — a calculated value, a date part, a CASE WHEN classification. This is how you group by month, by price band, by day of week, or by any computed category.

Grouping by date parts

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

Grouping by CASE WHEN — custom bands

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
💡 Note
In MySQL and DuckDB (this playground), you can use a SELECT alias in GROUP BY: GROUP BY order_tier. In PostgreSQL, this is not allowed — the standard requires repeating the full expression. For portability, always repeat the expression in GROUP BY. If the expression is very long and repetition feels painful, use a CTE or subquery to define the group column first.

// Part 07

GROUP BY with JOIN — Multi-Table Aggregation

The most powerful GROUP BY queries join multiple tables before grouping — bringing together data from customers, orders, products, and stores, then aggregating the combined result.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 08

NULL Values in GROUP BY

Unlike most SQL contexts where NULL ≠ NULL, GROUP BY treats all NULL values as a single group. All rows with NULL in the GROUP BY column(s) are placed together in one NULL group. This is consistent with the DISTINCT behaviour you learned in Module 10.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

The NULL group is a valid group — it gets its own aggregate row just like any other group. If you want to exclude the NULL group from results, add WHERE column IS NOT NULL before the GROUP BY. If you want to label the NULL group, use COALESCE in the GROUP BY expression.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 09

ROLLUP — Subtotals and Grand Totals

ROLLUP is an extension of GROUP BY that automatically generates subtotals and a grand total. It is used heavily in reporting — the kind of hierarchical totals you see in Excel pivot tables.

ROLLUP syntax
-- Standard GROUP BY: one row per group
GROUP BY store_id, payment_method

-- ROLLUP: rows per group + subtotals + grand total
GROUP BY ROLLUP(store_id, payment_method)

-- This generates:
-- One row per (store_id, payment_method) combination
-- One subtotal row per store_id (NULL for payment_method)
-- One grand total row (NULL for both store_id and payment_method)
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 10

Performance — How GROUP BY Uses Indexes

GROUP BY performance on large tables depends on whether the database can use indexes to avoid sorting all rows.

How the database executes GROUP BY

To group rows, the database must find all rows with the same GROUP BY value and put them together. Two approaches: Sort-based grouping — sort all rows by the GROUP BY columns, then scan sequentially (identical values are adjacent after sorting). Or Hash-based grouping — compute a hash of each row's GROUP BY values and bucket rows into a hash table. Hash-based is usually faster when the number of groups is manageable in memory.

When indexes help GROUP BY

If the GROUP BY column(s) are indexed and the query also filters by those columns, the database can use the index to avoid a full sort. For very selective GROUP BY (few groups), a partial scan of the index can be faster than processing all rows. For GROUP BY on the primary key or a unique column, each group has exactly one row — effectively no grouping needed.

Practical performance guidance

GROUP BY performance patterns
-- FAST: GROUP BY on an indexed column with selective WHERE
SELECT store_id, COUNT(*), SUM(total_amount)
FROM orders
WHERE order_date >= '2024-01-01'  -- filtered by indexed date column
GROUP BY store_id;

-- FAST: GROUP BY on low-cardinality column (few distinct values)
-- 'order_status' has 4 values — hash grouping is trivial
SELECT order_status, COUNT(*), SUM(total_amount)
FROM orders
GROUP BY order_status;

-- SLOWER: GROUP BY on high-cardinality column, no filtering
-- customer_id has 20 distinct values here but in production could be millions
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- On millions of rows: needs full sort or large hash table

-- OPTIMIZATION: filter with WHERE first to reduce rows before grouping
SELECT customer_id, COUNT(*)
FROM orders
WHERE order_date >= '2024-03-01'  -- process only recent rows
GROUP BY customer_id;

🎯 Pro Tip

The most impactful GROUP BY optimisation: ensure a strong WHERE clause filters rows before GROUP BY processes them. A GROUP BY that runs on 10,000 rows after WHERE filtering is 100x faster than running on 1,000,000 rows without filtering. Always think of WHERE as the pre-filter that determines how much work GROUP BY has to do.

// Part 11

Complete Analytics Examples — Real Business Reports

These are the types of GROUP BY queries you will write in real analytics work — complete, production-quality reports.

Monthly revenue trend

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

Product performance report

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

Customer RFM summary (Recency, Frequency, Monetary)

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 12

What This Looks Like at Work

You are a data analyst at Venmo. The product team is reviewing payment method adoption across different merchant categories. They need a breakdown of transaction volume and value by payment method and merchant type — a two-dimensional GROUP BY report.

2:00 PM
Request arrives
The product manager needs: for each combination of payment method and merchant category, show the transaction count, total value, average transaction size, and the percentage of total transactions. This is a multi-column GROUP BY with percentage calculations.
2:15 PM
You build the query (adapted for FreshCart)
Adapted to FreshCart: payment method × product category breakdown.
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
2:40 PM
Report delivered
The product manager immediately spots that COD (Cash on Delivery) is disproportionately high for Staples orders — customers trust FreshCart enough to pay digitally for premium products but default to COD for everyday groceries. This insight drives a new COD-to-digital conversion campaign targeting staple product orders.

🎯 Pro Tip

Two-dimensional GROUP BY queries (grouping by two columns) are the foundation of every business pivot table. The moment you find yourself wanting to see "X by Y" in a report — revenue by city by month, orders by payment method by product category, customers by tier by city — that is a GROUP BY with two columns. Learn to read business questions and immediately translate "X by Y" to GROUP BY x, y.

// Part 13

Interview Prep — 5 Questions With Complete Answers

Q: What does GROUP BY do and how does it relate to aggregate functions?

GROUP BY divides rows into groups based on one or more columns, so that aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute separately for each group rather than across all rows. Without GROUP BY, an aggregate function returns one value for the entire result set. With GROUP BY, it returns one value per group — one row per unique combination of GROUP BY column values.

The relationship is fundamental: aggregate functions and GROUP BY are designed to work together. GROUP BY defines the groups; aggregate functions compute statistics for each group. SELECT COUNT(*), SUM(total_amount) FROM orders returns one summary row. SELECT store_id, COUNT(*), SUM(total_amount) FROM orders GROUP BY store_id returns one summary row per store — the aggregates are computed independently within each store's group of rows.

The execution order is critical: GROUP BY runs after WHERE (which filters individual rows before grouping) but before SELECT (which computes the output columns including aggregates). This means: WHERE filters which rows participate in groups; GROUP BY determines the groups; SELECT aggregates within each group. Because GROUP BY runs before SELECT, you cannot use SELECT aliases in GROUP BY in standard SQL — the aliases do not exist yet. And because WHERE runs before GROUP BY, you cannot filter groups using WHERE — that requires HAVING, which runs after GROUP BY.

Q: What is the rule about non-aggregate columns in GROUP BY queries?

Every column in the SELECT list of a GROUP BY query must be either (1) listed in the GROUP BY clause, or (2) wrapped in an aggregate function. This is the fundamental rule of grouped queries and it follows directly from what GROUP BY does.

The reason: after GROUP BY, each group is collapsed into one output row. A non-aggregate column like first_name might have different values within a group — if you group by city, one Seattle group might contain customers named Aisha, Rahul, and Priya. The database cannot return a single first_name for this group because there is no single value. It can only return values that are the same for every row in the group (the GROUP BY columns themselves) or computed aggregate values that reduce the group to one number.

PostgreSQL and most standard SQL databases enforce this strictly — a non-aggregate, non-GROUP BY column in SELECT throws an error. MySQL with sql_mode not including ONLY_FULL_GROUP_BY was historically more permissive — it would pick an arbitrary value from the group for non-aggregate columns, which is usually wrong and unpredictable. This is why MySQL queries sometimes appear to "work" without proper GROUP BY usage, but return unreliable results. Always follow the rule: every SELECT column is either in GROUP BY or in an aggregate function.

Q: What is the difference between WHERE and HAVING in a GROUP BY query?

WHERE and HAVING both filter data, but they operate at different stages of the GROUP BY execution pipeline and on different things. WHERE filters individual rows before groups are formed — it determines which rows participate in grouping. HAVING filters groups after they have been formed and aggregated — it determines which groups appear in the final result.

Concrete example: to find stores with more than 3 delivered orders. WHERE order_status = 'Delivered' filters individual orders to include only delivered ones. HAVING COUNT(*) > 3 filters store groups to include only stores that have more than 3 delivered orders. Both conditions are necessary: WHERE selects which orders count, HAVING selects which stores qualify. Using WHERE COUNT(*) > 3 is an error — WHERE runs before grouping, so aggregate results do not exist yet when WHERE evaluates.

Performance implication: WHERE filtering is almost always more efficient than HAVING filtering for the same condition. WHERE filters rows before they enter the grouping process — fewer rows means less grouping work. HAVING filters after all the grouping and aggregation is done — the work was already performed on the excluded groups. The rule: use WHERE for any condition that can be expressed on individual row values. Use HAVING only for conditions that require aggregate values (COUNT > N, SUM > value, AVG > threshold). Moving a filter from HAVING to WHERE (when the logic allows it) can dramatically improve query performance.

Q: How does GROUP BY handle NULL values?

GROUP BY treats all NULL values in a GROUP BY column as belonging to the same group — all rows with NULL in the GROUP BY column are placed together in a single NULL group. This is different from WHERE and JOIN conditions, where NULL ≠ NULL. For grouping purposes, the database considers NULLs to be equal to each other.

This means a GROUP BY query on a nullable column will produce a row in the result for the NULL group, alongside rows for each non-null distinct value. GROUP BY department from employees might produce rows for 'Management', 'Operations', 'Sales', and NULL — the NULL row represents all employees without a department assignment. The aggregate functions compute normally for the NULL group: COUNT(*) counts all NULL-department employees, SUM(salary) adds their salaries.

If you want to exclude the NULL group from results, add WHERE column IS NOT NULL before the GROUP BY — this filters out NULL rows before they can form a group. If you want to give the NULL group a meaningful label in the output, use COALESCE in the GROUP BY expression: GROUP BY COALESCE(department, 'Unassigned') also changes the group value to 'Unassigned' in the SELECT output. Note that grouping by COALESCE(department, 'Unassigned') and department = 'Unassigned' are different: the COALESCE version groups NULLs and actual 'Unassigned' values together, while grouping by the raw column keeps them separate (though in practice, both should not exist in a well-constrained schema).

Q: You need to find the top-selling product in each category. How do you approach this with GROUP BY?

A basic GROUP BY can find the maximum sales value per category: SELECT category, MAX(units_sold) FROM product_sales GROUP BY category. But this only returns the maximum value — not which product achieved it. To find the specific product name alongside the maximum, you need to join the GROUP BY result back to the original data or use a different approach.

Approach 1 — subquery: SELECT p.category, p.product_name, ps.units_sold FROM product_sales ps JOIN products p ON ... WHERE (p.category, ps.units_sold) IN (SELECT p2.category, MAX(ps2.units_sold) FROM product_sales ps2 JOIN products p2 ON ... GROUP BY p2.category). This finds the product where the (category, units_sold) pair matches the maximum units_sold for that category.

Approach 2 — window functions (the modern approach): SELECT DISTINCT ON (category) category, product_name, units_sold FROM product_sales ORDER BY category, units_sold DESC. In PostgreSQL, DISTINCT ON returns the first row per category after sorting by units_sold descending — effectively the top product per category. The even cleaner approach using ROW_NUMBER(): SELECT category, product_name, units_sold FROM (SELECT category, product_name, units_sold, ROW_NUMBER() OVER (PARTITION BY category ORDER BY units_sold DESC) AS rn FROM product_sales) ranked WHERE rn = 1. This ranks products within each category and selects only rank 1 — the top seller. Window functions (Module 45) are the professional solution for "top N per group" queries.

// Part 14

Errors You Will Hit — And Exactly Why They Happen

ERROR: column 'first_name' must appear in GROUP BY or be used in an aggregate function

Cause: A column in SELECT is neither inside an aggregate function nor listed in the GROUP BY clause. When GROUP BY is used, every GROUP BY row represents multiple source rows collapsed into one. A non-aggregate column could have different values across those source rows — the database cannot determine a single value to return, so it refuses the query.

Fix: Add the column to GROUP BY: GROUP BY store_id, first_name. Or wrap it in an aggregate: SELECT MAX(first_name) AS name (picks one value from the group). Or remove it from SELECT if it is not needed. In analytics queries, the most common fix is adding all descriptive columns (name, city, status) to GROUP BY alongside the grouping key.

GROUP BY returns more rows than expected — expected one row per store but got many

Cause: The GROUP BY columns have more unique combinations than expected — often because a JOIN introduced extra rows that created new unique combinations, or because a column you thought was a one-to-one with the GROUP BY key actually varies. For example, grouping by store_id and store_name should give one row per store, but if store_name has trailing whitespace in some rows ('Seattle ' vs 'Seattle'), it creates two distinct groups.

Fix: Inspect the GROUP BY columns: SELECT store_id, store_name, COUNT(*) FROM stores GROUP BY store_id, store_name — if this returns more than one row per store_id, there are inconsistent store_name values. Fix the data (TRIM whitespace, normalise capitalisation) or group only by the primary key (store_id) and use MIN(store_name) or MAX(store_name) to get a single name per group in SELECT.

WHERE clause on aggregate result fails — WHERE COUNT(*) > 3 gives syntax error

Cause: You tried to use an aggregate function in WHERE. WHERE executes before GROUP BY in the logical order — aggregate values do not exist yet when WHERE runs. WHERE evaluates row-by-row before any grouping occurs, so it cannot reference aggregated values.

Fix: Move the aggregate condition to HAVING, which runs after GROUP BY: GROUP BY store_id HAVING COUNT(*) > 3. HAVING filters groups after they have been formed and aggregated. Use WHERE for conditions on individual row values (column = value, column > threshold), and HAVING for conditions on aggregate results (COUNT(*) > N, SUM(amount) > value, AVG(rating) >= 4.0).

GROUP BY query gives wrong totals when joining tables — revenue is inflated

Cause: A many-to-many or one-to-many join multiplied rows before aggregation. If orders are joined to order_items (one order → many items), each order row is duplicated for every item. SUM(orders.total_amount) then adds the total_amount multiple times — once per item in the order. This is called a fan-out bug and is one of the most common data quality issues in analytics queries.

Fix: Aggregate at the correct level before joining. For order-level metrics (total_amount), aggregate in orders first. For item-level metrics (line_total, quantity), aggregate in order_items. Use subqueries or CTEs to pre-aggregate one side of the join: WITH order_totals AS (SELECT order_id, SUM(line_total) AS total FROM order_items GROUP BY order_id) then join order_totals to orders. Or use COUNT(DISTINCT order_id) instead of COUNT(*) to deduplicate after a fan-out join.

GROUP BY expression query works in MySQL but fails in PostgreSQL

Cause: You used a SELECT alias in GROUP BY: SELECT EXTRACT(MONTH FROM order_date) AS month FROM orders GROUP BY month. MySQL and DuckDB allow GROUP BY to reference SELECT aliases. PostgreSQL follows the SQL standard strictly — GROUP BY runs before SELECT, so aliases are not yet defined when GROUP BY executes.

Fix: Repeat the full expression in GROUP BY: GROUP BY EXTRACT(MONTH FROM order_date). For complex expressions that are painful to repeat, use a subquery or CTE: WITH monthly AS (SELECT EXTRACT(MONTH FROM order_date) AS month, total_amount FROM orders) SELECT month, SUM(total_amount) FROM monthly GROUP BY month. The CTE defines the alias first, making it available for GROUP BY in the outer query.

Try It Yourself

The FreshCart management team needs a store performance summary for their Q1 2024 review (January through March 2024). Write a GROUP BY query that shows for each store: store_id, city, total delivered orders, unique customers served, total delivered revenue (rounded to 2 decimal places), average order value (rounded to 2 decimal places), fastest delivery in days, and a performance_band column using CASE: 'Star' if revenue above ₹3,000, 'Good' if above ₹1,500, 'Needs Support' otherwise. Only include stores that had at least 1 delivered order. Sort by total revenue descending.

🎯 Key Takeaways

  • GROUP BY splits rows into groups based on column values, then aggregate functions compute separately per group. One output row per unique combination of GROUP BY column values.
  • Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. GROUP BY runs before SELECT — SELECT aliases cannot be used in GROUP BY in standard SQL.
  • Every non-aggregate column in SELECT must appear in GROUP BY. Aggregate columns (COUNT, SUM, AVG) do not need to be in GROUP BY — they compute per group automatically.
  • WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER grouping. Use WHERE for row-level conditions; use HAVING only for conditions on aggregate results.
  • GROUP BY on multiple columns creates one group per unique combination. GROUP BY city, loyalty_tier creates one group per (city, tier) pair — potentially many groups.
  • GROUP BY on expressions (EXTRACT(MONTH FROM date), CASE WHEN salary >= 70000) groups by computed values. Repeat the full expression in GROUP BY — do not rely on aliases.
  • GROUP BY treats NULLs as a single group — all NULL rows in the GROUP BY column form one group. Use WHERE col IS NOT NULL to exclude or COALESCE to rename the NULL group.
  • Fan-out bug: joining a one-to-many relationship before aggregating inflates counts. Pre-aggregate in a subquery or CTE, or use COUNT(DISTINCT key) to deduplicate.
  • ROLLUP extends GROUP BY to add subtotals and grand totals automatically — one row per group, plus subtotal rows for each partial combination, plus one grand total row.
  • The most impactful performance optimisation: filter aggressively with WHERE before GROUP BY. Reducing rows before grouping is the single biggest lever for GROUP BY query speed.

What comes next

In Module 29, you learn HAVING — the clause that filters groups after aggregation, completing the WHERE + GROUP BY + HAVING triad that powers every analytical query.

Module 29 → HAVING
Share

Discussion

0

Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.

Continue with GitHub
Loading...