SQL — Module 28Intermediate

GROUP BY

Split rows into groups and compute aggregates per group — the engine behind every analytics report, dashboard metric, and business intelligence query

14–18 min April 2026

Section 6 · Aggregation

Aggregation · 3 modulesModule 28

Aggregate Grouping Filtering

// Part 01

From One Summary to Many — What GROUP BY Does

In Module 27 you learned aggregate functions — they collapse all rows into a single summary number. But most real business questions are not "what is the total revenue?" They are "what is the total revenue per store?" or "what is the average order value per city?" or "how many orders were placed per month?"

GROUP BY is the clause that splits rows into groups before aggregating — so instead of one summary number for the whole table, you get one summary number per group. It is the difference between a company-wide total and a per-department breakdown. Between a product's average rating and a category's average rating.

Loading FreshCart DB…
-- Without GROUP BY: one total for the whole table
SELECT
  COUNT(*)              AS total_orders,
  SUM(total_amount)     AS total_revenue
FROM orders
WHERE order_status = 'Delivered';Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- With GROUP BY: one row per store, aggregates per store
SELECT
  store_id,
  COUNT(*)              AS order_count,
  ROUND(SUM(total_amount), 2)   AS store_revenue
FROM orders
WHERE order_status = 'Delivered'
GROUP BY store_id
ORDER BY store_revenue DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Same table, same WHERE filter, same aggregate functions — but GROUP BY store_id splits the rows into one bucket per store and computes the aggregates separately for each bucket. The result is one row per store instead of one row total.

// Part 02

How GROUP BY Works Internally

Understanding what the database does when it executes GROUP BY makes you write better GROUP BY queries — and explains every error you will encounter.

The execution steps

FROM

Identify the source table(s)

WHERE

Filter individual rows — only matching rows proceed

GROUP BY

Split remaining rows into groups — one bucket per unique combination of GROUP BY columns

HAVING

Filter groups — discard groups that do not satisfy the HAVING condition

SELECT

Compute the output columns for each surviving group — aggregate functions run here

ORDER BY

Sort the group-level result rows

LIMIT

Truncate the sorted result

The key insight: GROUP BY runs before SELECT. This is why you cannot use a SELECT alias in GROUP BY in standard SQL — the alias has not been computed yet when GROUP BY runs. It is also why aggregate functions appear in SELECT (step 5) and HAVING (step 4), but never in WHERE (step 2) — WHERE runs before groups are formed, so aggregates do not exist yet.

What happens inside each group

After GROUP BY, each group contains all rows that share the same value(s) in the GROUP BY column(s). SELECT then processes each group independently — the aggregate functions see only the rows belonging to that group. COUNT(*) counts the rows in that group. SUM(total_amount) adds up the total_amount values for that group. MIN and MAX find the extremes within that group.

// Part 03

GROUP BY Single Column — The Foundation

Group by category

Loading FreshCart DB…
-- Product count and average price per category
SELECT
  category,
  COUNT(*)                        AS product_count,
  ROUND(AVG(unit_price), 2)       AS avg_price,
  MIN(unit_price)                 AS cheapest,
  MAX(unit_price)                 AS most_expensive
FROM products
GROUP BY category
ORDER BY product_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Orders per payment method — how customers prefer to pay
SELECT
  payment_method,
  COUNT(*)                        AS order_count,
  ROUND(SUM(total_amount), 2)     AS total_revenue,
  ROUND(AVG(total_amount), 2)     AS avg_order_value
FROM orders
GROUP BY payment_method
ORDER BY order_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Customer count per city — where are FreshCart's customers?
SELECT
  city,
  COUNT(*)                        AS customer_count,
  COUNT(DISTINCT loyalty_tier)    AS tiers_present
FROM customers
GROUP BY city
ORDER BY customer_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Orders per status — the operations dashboard summary
SELECT
  order_status,
  COUNT(*)                        AS order_count,
  ROUND(SUM(total_amount), 2)     AS total_value,
  ROUND(AVG(total_amount), 2)     AS avg_value
FROM orders
GROUP BY order_status
ORDER BY order_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 04

GROUP BY Multiple Columns — Two-Dimensional Grouping

GROUP BY can list multiple columns. The database creates one group for each unique combination of all GROUP BY columns — not one group per unique value in each column independently.

Loading FreshCart DB…
-- Orders per store per payment method
-- One row per (store_id, payment_method) pair
SELECT
  store_id,
  payment_method,
  COUNT(*)                        AS order_count,
  ROUND(SUM(total_amount), 2)     AS revenue
FROM orders
WHERE order_status = 'Delivered'
GROUP BY store_id, payment_method
ORDER BY store_id, revenue DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Customers per city per loyalty tier
-- One row per (city, loyalty_tier) combination
SELECT
  city,
  loyalty_tier,
  COUNT(*)                        AS customer_count
FROM customers
GROUP BY city, loyalty_tier
ORDER BY city,
  CASE loyalty_tier
    WHEN 'Platinum' THEN 1
    WHEN 'Gold'     THEN 2
    WHEN 'Silver'   THEN 3
    ELSE 4
  END;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Product count per category per in_stock status
SELECT
  category,
  in_stock,
  COUNT(*)                        AS product_count,
  ROUND(AVG(unit_price), 2)       AS avg_price
FROM products
GROUP BY category, in_stock
ORDER BY category, in_stock DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

The key rule — every non-aggregate SELECT column must be in GROUP BY

The GROUP BY rule — every non-aggregate must be grouped

-- WRONG: store_name is in SELECT but not in GROUP BY
SELECT store_id, store_name, COUNT(*)
FROM orders
JOIN stores ON orders.store_id = stores.store_id
GROUP BY store_id;
-- ERROR: column "stores.store_name" must appear in GROUP BY
-- or be used in an aggregate function

-- RIGHT: include all non-aggregate columns in GROUP BY
SELECT store_id, store_name, COUNT(*)
FROM orders
JOIN stores ON orders.store_id = stores.store_id
GROUP BY store_id, store_name;

-- Or aggregate store_name (rarely needed):
SELECT store_id, MIN(store_name) AS store_name, COUNT(*)
FROM orders
JOIN stores ON orders.store_id = stores.store_id
GROUP BY store_id;

// Part 05

GROUP BY with WHERE — Filter Before Grouping

WHERE filters rows before GROUP BY processes them. Only rows that pass the WHERE condition are included in any group. This is the most efficient way to filter — rows excluded by WHERE never participate in grouping or aggregation.

Loading FreshCart DB…
-- Revenue per store for DELIVERED orders only
-- WHERE filters first, then GROUP BY processes surviving rows
SELECT
  o.store_id,
  s.city,
  COUNT(o.order_id)               AS delivered_orders,
  ROUND(SUM(o.total_amount), 2)   AS delivered_revenue
FROM orders AS o
JOIN stores AS s ON o.store_id = s.store_id
WHERE o.order_status = 'Delivered'   -- filter BEFORE grouping
GROUP BY o.store_id, s.city
ORDER BY delivered_revenue DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Product metrics for in-stock items only
SELECT
  category,
  COUNT(*)                        AS in_stock_count,
  ROUND(AVG(unit_price), 2)       AS avg_price,
  ROUND(AVG((unit_price - cost_price) / unit_price * 100), 1) AS avg_margin_pct
FROM products
WHERE in_stock = true             -- filter before grouping
GROUP BY category
ORDER BY avg_margin_pct DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Employee salary stats per department — management dept only
SELECT
  department,
  role,
  COUNT(*)                        AS headcount,
  MIN(salary)                     AS min_salary,
  MAX(salary)                     AS max_salary,
  ROUND(AVG(salary), 0)           AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'   -- only employees hired from 2020 onwards
GROUP BY department, role
ORDER BY department, avg_salary DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 06

GROUP BY with Expressions — Grouping on Computed Values

GROUP BY does not have to group by raw column values. You can group by any expression — a calculated value, a date part, a CASE WHEN classification. This is how you group by month, by price band, by day of week, or by any computed category.

Grouping by date parts

Loading FreshCart DB…
-- Orders per month — group by the month extracted from order_date
SELECT
  EXTRACT(YEAR  FROM order_date)  AS year,
  EXTRACT(MONTH FROM order_date)  AS month,
  COUNT(*)                        AS order_count,
  ROUND(SUM(total_amount), 2)     AS monthly_revenue,
  ROUND(AVG(total_amount), 2)     AS avg_order_value
FROM orders
WHERE order_status = 'Delivered'
GROUP BY
  EXTRACT(YEAR  FROM order_date),
  EXTRACT(MONTH FROM order_date)
ORDER BY year, month;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Orders per day of week (1=Sunday...7=Saturday in DuckDB)
SELECT
  EXTRACT(DOW FROM order_date)    AS day_of_week,
  COUNT(*)                        AS order_count,
  ROUND(AVG(total_amount), 2)     AS avg_order_value
FROM orders
GROUP BY EXTRACT(DOW FROM order_date)
ORDER BY day_of_week;Ctrl + Enter to run
Loading FreshCart database in your browser…

Grouping by CASE WHEN — custom bands

Loading FreshCart DB…
-- Group orders into value tiers
SELECT
  CASE
    WHEN total_amount >= 2000 THEN 'Premium (₹2000+)'
    WHEN total_amount >= 800  THEN 'Standard (₹800-1999)'
    WHEN total_amount >= 300  THEN 'Basic (₹300-799)'
    ELSE 'Micro (<₹300)'
  END                             AS order_tier,
  COUNT(*)                        AS order_count,
  ROUND(SUM(total_amount), 2)     AS tier_revenue,
  ROUND(AVG(total_amount), 2)     AS avg_in_tier
FROM orders
GROUP BY
  CASE
    WHEN total_amount >= 2000 THEN 'Premium (₹2000+)'
    WHEN total_amount >= 800  THEN 'Standard (₹800-1999)'
    WHEN total_amount >= 300  THEN 'Basic (₹300-799)'
    ELSE 'Micro (<₹300)'
  END
ORDER BY avg_in_tier DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Group employees by salary band
SELECT
  CASE
    WHEN salary >= 70000 THEN 'Senior (₹70k+)'
    WHEN salary >= 50000 THEN 'Mid (₹50k-70k)'
    WHEN salary >= 35000 THEN 'Junior (₹35k-50k)'
    ELSE 'Entry (<₹35k)'
  END                             AS salary_band,
  COUNT(*)                        AS headcount,
  ROUND(AVG(salary), 0)           AS avg_salary,
  SUM(salary)                     AS total_payroll
FROM employees
GROUP BY
  CASE
    WHEN salary >= 70000 THEN 'Senior (₹70k+)'
    WHEN salary >= 50000 THEN 'Mid (₹50k-70k)'
    WHEN salary >= 35000 THEN 'Junior (₹35k-50k)'
    ELSE 'Entry (<₹35k)'
  END
ORDER BY avg_salary DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

💡 Note

In MySQL and DuckDB (this playground), you can use a SELECT alias in GROUP BY: GROUP BY order_tier. In PostgreSQL, this is not allowed — the standard requires repeating the full expression. For portability, always repeat the expression in GROUP BY. If the expression is very long and repetition feels painful, use a CTE or subquery to define the group column first.

// Part 07

GROUP BY with JOIN — Multi-Table Aggregation

The most powerful GROUP BY queries join multiple tables before grouping — bringing together data from customers, orders, products, and stores, then aggregating the combined result.

Loading FreshCart DB…
-- Revenue, orders, and unique customers per store city
SELECT
  s.city,
  COUNT(DISTINCT o.order_id)      AS total_orders,
  COUNT(DISTINCT o.customer_id)   AS unique_customers,
  ROUND(SUM(o.total_amount), 2)   AS total_revenue,
  ROUND(AVG(o.total_amount), 2)   AS avg_order_value,
  s.monthly_target,
  ROUND(SUM(o.total_amount) / s.monthly_target * 100, 1) AS pct_of_target
FROM stores AS s
LEFT JOIN orders AS o
  ON s.store_id = o.store_id
  AND o.order_status = 'Delivered'
GROUP BY s.city, s.monthly_target
ORDER BY total_revenue DESC NULLS LAST;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Top product categories by units sold and revenue
SELECT
  p.category,
  COUNT(DISTINCT oi.order_id)     AS orders_with_category,
  SUM(oi.quantity)                AS total_units_sold,
  ROUND(SUM(oi.line_total), 2)    AS total_revenue,
  ROUND(AVG(oi.unit_price), 2)    AS avg_selling_price,
  COUNT(DISTINCT p.product_id)    AS distinct_products
FROM products AS p
JOIN order_items AS oi ON p.product_id = oi.product_id
JOIN orders AS o      ON oi.order_id   = o.order_id
WHERE o.order_status = 'Delivered'
GROUP BY p.category
ORDER BY total_revenue DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Customer order behaviour by loyalty tier
SELECT
  c.loyalty_tier,
  COUNT(DISTINCT c.customer_id)   AS customers_in_tier,
  COUNT(o.order_id)               AS total_orders,
  ROUND(SUM(o.total_amount), 2)   AS total_revenue,
  ROUND(AVG(o.total_amount), 2)   AS avg_order_value,
  ROUND(COUNT(o.order_id)::DECIMAL
        / COUNT(DISTINCT c.customer_id), 1) AS orders_per_customer
FROM customers AS c
LEFT JOIN orders AS o
  ON c.customer_id = o.customer_id
  AND o.order_status = 'Delivered'
GROUP BY c.loyalty_tier
ORDER BY
  CASE c.loyalty_tier
    WHEN 'Platinum' THEN 1
    WHEN 'Gold'     THEN 2
    WHEN 'Silver'   THEN 3
    ELSE 4
  END;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 08

NULL Values in GROUP BY

Unlike most SQL contexts where NULL ≠ NULL, GROUP BY treats all NULL values as a single group. All rows with NULL in the GROUP BY column(s) are placed together in one NULL group. This is consistent with the DISTINCT behaviour you learned in Module 10.

Loading FreshCart DB…
-- Group by sub_category — which includes NULL (products without sub-category)
-- NULL sub_category rows form their own group
SELECT
  sub_category,
  COUNT(*)                        AS product_count,
  ROUND(AVG(unit_price), 2)       AS avg_price
FROM products
GROUP BY sub_category
ORDER BY product_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Employees grouped by store_id — head office employees have NULL store_id
SELECT
  store_id,
  COUNT(*)                        AS employee_count,
  ROUND(AVG(salary), 0)           AS avg_salary,
  SUM(salary)                     AS total_payroll
FROM employees
GROUP BY store_id
ORDER BY store_id NULLS LAST;Ctrl + Enter to run
Loading FreshCart database in your browser…

The NULL group is a valid group — it gets its own aggregate row just like any other group. If you want to exclude the NULL group from results, add WHERE column IS NOT NULL before the GROUP BY. If you want to label the NULL group, use COALESCE in the GROUP BY expression.

Loading FreshCart DB…
-- Label the NULL store as 'Head Office'
SELECT
  COALESCE(store_id, 'Head Office')   AS store,
  COUNT(*)                            AS employee_count,
  ROUND(AVG(salary), 0)               AS avg_salary
FROM employees
GROUP BY COALESCE(store_id, 'Head Office')
ORDER BY store;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 09

ROLLUP — Subtotals and Grand Totals

ROLLUP is an extension of GROUP BY that automatically generates subtotals and a grand total. It is used heavily in reporting — the kind of hierarchical totals you see in Excel pivot tables.

ROLLUP syntax

-- Standard GROUP BY: one row per group
GROUP BY store_id, payment_method

-- ROLLUP: rows per group + subtotals + grand total
GROUP BY ROLLUP(store_id, payment_method)

-- This generates:
-- One row per (store_id, payment_method) combination
-- One subtotal row per store_id (NULL for payment_method)
-- One grand total row (NULL for both store_id and payment_method)

Loading FreshCart DB…
-- Revenue with subtotals per store and grand total
-- NULL in store_id = subtotal/grand total row
SELECT
  COALESCE(store_id, 'ALL STORES') AS store,
  COALESCE(payment_method, 'ALL METHODS') AS payment,
  COUNT(*)                          AS order_count,
  ROUND(SUM(total_amount), 2)       AS revenue
FROM orders
WHERE order_status = 'Delivered'
GROUP BY ROLLUP(store_id, payment_method)
ORDER BY store, payment;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 10

Performance — How GROUP BY Uses Indexes

GROUP BY performance on large tables depends on whether the database can use indexes to avoid sorting all rows.

How the database executes GROUP BY

To group rows, the database must find all rows with the same GROUP BY value and put them together. Two approaches: Sort-based grouping — sort all rows by the GROUP BY columns, then scan sequentially (identical values are adjacent after sorting). Or Hash-based grouping — compute a hash of each row's GROUP BY values and bucket rows into a hash table. Hash-based is usually faster when the number of groups is manageable in memory.

When indexes help GROUP BY

If the GROUP BY column(s) are indexed and the query also filters by those columns, the database can use the index to avoid a full sort. For very selective GROUP BY (few groups), a partial scan of the index can be faster than processing all rows. For GROUP BY on the primary key or a unique column, each group has exactly one row — effectively no grouping needed.

Practical performance guidance

GROUP BY performance patterns

-- FAST: GROUP BY on an indexed column with selective WHERE
SELECT store_id, COUNT(*), SUM(total_amount)
FROM orders
WHERE order_date >= '2024-01-01'  -- filtered by indexed date column
GROUP BY store_id;

-- FAST: GROUP BY on low-cardinality column (few distinct values)
-- 'order_status' has 4 values — hash grouping is trivial
SELECT order_status, COUNT(*), SUM(total_amount)
FROM orders
GROUP BY order_status;

-- SLOWER: GROUP BY on high-cardinality column, no filtering
-- customer_id has 20 distinct values here but in production could be millions
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- On millions of rows: needs full sort or large hash table

-- OPTIMIZATION: filter with WHERE first to reduce rows before grouping
SELECT customer_id, COUNT(*)
FROM orders
WHERE order_date >= '2024-03-01'  -- process only recent rows
GROUP BY customer_id;

🎯 Pro Tip

The most impactful GROUP BY optimisation: ensure a strong WHERE clause filters rows before GROUP BY processes them. A GROUP BY that runs on 10,000 rows after WHERE filtering is 100x faster than running on 1,000,000 rows without filtering. Always think of WHERE as the pre-filter that determines how much work GROUP BY has to do.

// Part 11

Complete Analytics Examples — Real Business Reports

These are the types of GROUP BY queries you will write in real analytics work — complete, production-quality reports.

Monthly revenue trend

Loading FreshCart DB…
-- Month-by-month revenue trend with growth metrics
SELECT
  EXTRACT(YEAR  FROM order_date)              AS year,
  EXTRACT(MONTH FROM order_date)              AS month,
  COUNT(*)                                    AS orders,
  ROUND(SUM(total_amount), 2)                 AS revenue,
  ROUND(AVG(total_amount), 2)                 AS avg_order,
  COUNT(DISTINCT customer_id)                 AS unique_customers
FROM orders
WHERE order_status = 'Delivered'
GROUP BY
  EXTRACT(YEAR  FROM order_date),
  EXTRACT(MONTH FROM order_date)
ORDER BY year, month;Ctrl + Enter to run
Loading FreshCart database in your browser…

Product performance report

Loading FreshCart DB…
-- Full product performance: sales, revenue, margin
SELECT
  p.product_name,
  p.category,
  p.brand,
  p.unit_price,
  COUNT(DISTINCT oi.order_id)                 AS times_ordered,
  SUM(oi.quantity)                            AS units_sold,
  ROUND(SUM(oi.line_total), 2)                AS total_revenue,
  ROUND(SUM(oi.quantity * (oi.unit_price - p.cost_price)), 2) AS gross_profit,
  ROUND(
    SUM(oi.quantity * (oi.unit_price - p.cost_price))
    / NULLIF(SUM(oi.line_total), 0) * 100
  , 1)                                        AS margin_pct
FROM products AS p
JOIN order_items AS oi ON p.product_id = oi.product_id
JOIN orders AS o ON oi.order_id = o.order_id
WHERE o.order_status = 'Delivered'
GROUP BY p.product_id, p.product_name, p.category, p.brand, p.unit_price, p.cost_price
ORDER BY gross_profit DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

Customer RFM summary (Recency, Frequency, Monetary)

Loading FreshCart DB…
-- RFM analysis: how recently, how often, how much per customer
SELECT
  c.customer_id,
  c.first_name || ' ' || c.last_name         AS customer,
  c.loyalty_tier,
  MAX(o.order_date)                           AS last_order_date,
  CURRENT_DATE - MAX(o.order_date)            AS days_since_last_order,
  COUNT(o.order_id)                           AS order_frequency,
  ROUND(SUM(o.total_amount), 2)               AS total_spend,
  ROUND(AVG(o.total_amount), 2)               AS avg_order_value
FROM customers AS c
LEFT JOIN orders AS o
  ON c.customer_id = o.customer_id
  AND o.order_status = 'Delivered'
GROUP BY c.customer_id, c.first_name, c.last_name, c.loyalty_tier
ORDER BY total_spend DESC NULLS LAST;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 12

What This Looks Like at Work

You are a data analyst at Venmo. The product team is reviewing payment method adoption across different merchant categories. They need a breakdown of transaction volume and value by payment method and merchant type — a two-dimensional GROUP BY report.

2:00 PM

Request arrives

The product manager needs: for each combination of payment method and merchant category, show the transaction count, total value, average transaction size, and the percentage of total transactions. This is a multi-column GROUP BY with percentage calculations.

2:15 PM

You build the query (adapted for FreshCart)

Adapted to FreshCart: payment method × product category breakdown.

Loading FreshCart DB…
-- Payment method adoption by product category
-- Two-dimensional GROUP BY with percentage of total
SELECT
  p.category,
  o.payment_method,
  COUNT(DISTINCT o.order_id)                  AS order_count,
  ROUND(SUM(oi.line_total), 2)                AS category_payment_revenue,
  ROUND(AVG(o.total_amount), 2)               AS avg_order_value,
  -- Percentage of total orders (all categories, all methods)
  ROUND(
    COUNT(DISTINCT o.order_id) * 100.0
    / SUM(COUNT(DISTINCT o.order_id)) OVER ()
  , 1)                                        AS pct_of_total_orders
FROM orders AS o
JOIN order_items AS oi ON o.order_id  = oi.order_id
JOIN products    AS p  ON oi.product_id = p.product_id
WHERE o.order_status = 'Delivered'
GROUP BY p.category, o.payment_method
ORDER BY p.category, order_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

2:40 PM

Report delivered

The product manager immediately spots that COD (Cash on Delivery) is disproportionately high for Staples orders — customers trust FreshCart enough to pay digitally for premium products but default to COD for everyday groceries. This insight drives a new COD-to-digital conversion campaign targeting staple product orders.

🎯 Pro Tip

Two-dimensional GROUP BY queries (grouping by two columns) are the foundation of every business pivot table. The moment you find yourself wanting to see "X by Y" in a report — revenue by city by month, orders by payment method by product category, customers by tier by city — that is a GROUP BY with two columns. Learn to read business questions and immediately translate "X by Y" to GROUP BY x, y.

// Part 13

Interview Prep — 5 Questions With Complete Answers

Q: What does GROUP BY do and how does it relate to aggregate functions?

GROUP BY divides rows into groups based on one or more columns, so that aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute separately for each group rather than across all rows. Without GROUP BY, an aggregate function returns one value for the entire result set. With GROUP BY, it returns one value per group — one row per unique combination of GROUP BY column values.

The relationship is fundamental: aggregate functions and GROUP BY are designed to work together. GROUP BY defines the groups; aggregate functions compute statistics for each group. SELECT COUNT(*), SUM(total_amount) FROM orders returns one summary row. SELECT store_id, COUNT(*), SUM(total_amount) FROM orders GROUP BY store_id returns one summary row per store — the aggregates are computed independently within each store's group of rows.

The execution order is critical: GROUP BY runs after WHERE (which filters individual rows before grouping) but before SELECT (which computes the output columns including aggregates). This means: WHERE filters which rows participate in groups; GROUP BY determines the groups; SELECT aggregates within each group. Because GROUP BY runs before SELECT, you cannot use SELECT aliases in GROUP BY in standard SQL — the aliases do not exist yet. And because WHERE runs before GROUP BY, you cannot filter groups using WHERE — that requires HAVING, which runs after GROUP BY.

Q: What is the rule about non-aggregate columns in GROUP BY queries?

Every column in the SELECT list of a GROUP BY query must be either (1) listed in the GROUP BY clause, or (2) wrapped in an aggregate function. This is the fundamental rule of grouped queries and it follows directly from what GROUP BY does.

The reason: after GROUP BY, each group is collapsed into one output row. A non-aggregate column like first_name might have different values within a group — if you group by city, one Seattle group might contain customers named Aisha, Rahul, and Priya. The database cannot return a single first_name for this group because there is no single value. It can only return values that are the same for every row in the group (the GROUP BY columns themselves) or computed aggregate values that reduce the group to one number.

PostgreSQL and most standard SQL databases enforce this strictly — a non-aggregate, non-GROUP BY column in SELECT throws an error. MySQL with sql_mode not including ONLY_FULL_GROUP_BY was historically more permissive — it would pick an arbitrary value from the group for non-aggregate columns, which is usually wrong and unpredictable. This is why MySQL queries sometimes appear to "work" without proper GROUP BY usage, but return unreliable results. Always follow the rule: every SELECT column is either in GROUP BY or in an aggregate function.

Q: What is the difference between WHERE and HAVING in a GROUP BY query?

WHERE and HAVING both filter data, but they operate at different stages of the GROUP BY execution pipeline and on different things. WHERE filters individual rows before groups are formed — it determines which rows participate in grouping. HAVING filters groups after they have been formed and aggregated — it determines which groups appear in the final result.

Concrete example: to find stores with more than 3 delivered orders. WHERE order_status = 'Delivered' filters individual orders to include only delivered ones. HAVING COUNT(*) > 3 filters store groups to include only stores that have more than 3 delivered orders. Both conditions are necessary: WHERE selects which orders count, HAVING selects which stores qualify. Using WHERE COUNT(*) > 3 is an error — WHERE runs before grouping, so aggregate results do not exist yet when WHERE evaluates.

Performance implication: WHERE filtering is almost always more efficient than HAVING filtering for the same condition. WHERE filters rows before they enter the grouping process — fewer rows means less grouping work. HAVING filters after all the grouping and aggregation is done — the work was already performed on the excluded groups. The rule: use WHERE for any condition that can be expressed on individual row values. Use HAVING only for conditions that require aggregate values (COUNT > N, SUM > value, AVG > threshold). Moving a filter from HAVING to WHERE (when the logic allows it) can dramatically improve query performance.

Q: How does GROUP BY handle NULL values?

GROUP BY treats all NULL values in a GROUP BY column as belonging to the same group — all rows with NULL in the GROUP BY column are placed together in a single NULL group. This is different from WHERE and JOIN conditions, where NULL ≠ NULL. For grouping purposes, the database considers NULLs to be equal to each other.

This means a GROUP BY query on a nullable column will produce a row in the result for the NULL group, alongside rows for each non-null distinct value. GROUP BY department from employees might produce rows for 'Management', 'Operations', 'Sales', and NULL — the NULL row represents all employees without a department assignment. The aggregate functions compute normally for the NULL group: COUNT(*) counts all NULL-department employees, SUM(salary) adds their salaries.

If you want to exclude the NULL group from results, add WHERE column IS NOT NULL before the GROUP BY — this filters out NULL rows before they can form a group. If you want to give the NULL group a meaningful label in the output, use COALESCE in the GROUP BY expression: GROUP BY COALESCE(department, 'Unassigned') also changes the group value to 'Unassigned' in the SELECT output. Note that grouping by COALESCE(department, 'Unassigned') and department = 'Unassigned' are different: the COALESCE version groups NULLs and actual 'Unassigned' values together, while grouping by the raw column keeps them separate (though in practice, both should not exist in a well-constrained schema).

Q: You need to find the top-selling product in each category. How do you approach this with GROUP BY?

A basic GROUP BY can find the maximum sales value per category: SELECT category, MAX(units_sold) FROM product_sales GROUP BY category. But this only returns the maximum value — not which product achieved it. To find the specific product name alongside the maximum, you need to join the GROUP BY result back to the original data or use a different approach.

Approach 1 — subquery: SELECT p.category, p.product_name, ps.units_sold FROM product_sales ps JOIN products p ON ... WHERE (p.category, ps.units_sold) IN (SELECT p2.category, MAX(ps2.units_sold) FROM product_sales ps2 JOIN products p2 ON ... GROUP BY p2.category). This finds the product where the (category, units_sold) pair matches the maximum units_sold for that category.

Approach 2 — window functions (the modern approach): SELECT DISTINCT ON (category) category, product_name, units_sold FROM product_sales ORDER BY category, units_sold DESC. In PostgreSQL, DISTINCT ON returns the first row per category after sorting by units_sold descending — effectively the top product per category. The even cleaner approach using ROW_NUMBER(): SELECT category, product_name, units_sold FROM (SELECT category, product_name, units_sold, ROW_NUMBER() OVER (PARTITION BY category ORDER BY units_sold DESC) AS rn FROM product_sales) ranked WHERE rn = 1. This ranks products within each category and selects only rank 1 — the top seller. Window functions (Module 45) are the professional solution for "top N per group" queries.

// Part 14

Errors You Will Hit — And Exactly Why They Happen

ERROR: column 'first_name' must appear in GROUP BY or be used in an aggregate function

Cause: A column in SELECT is neither inside an aggregate function nor listed in the GROUP BY clause. When GROUP BY is used, every GROUP BY row represents multiple source rows collapsed into one. A non-aggregate column could have different values across those source rows — the database cannot determine a single value to return, so it refuses the query.

Fix: Add the column to GROUP BY: GROUP BY store_id, first_name. Or wrap it in an aggregate: SELECT MAX(first_name) AS name (picks one value from the group). Or remove it from SELECT if it is not needed. In analytics queries, the most common fix is adding all descriptive columns (name, city, status) to GROUP BY alongside the grouping key.

GROUP BY returns more rows than expected — expected one row per store but got many

Cause: The GROUP BY columns have more unique combinations than expected — often because a JOIN introduced extra rows that created new unique combinations, or because a column you thought was a one-to-one with the GROUP BY key actually varies. For example, grouping by store_id and store_name should give one row per store, but if store_name has trailing whitespace in some rows ('Seattle ' vs 'Seattle'), it creates two distinct groups.

Fix: Inspect the GROUP BY columns: SELECT store_id, store_name, COUNT(*) FROM stores GROUP BY store_id, store_name — if this returns more than one row per store_id, there are inconsistent store_name values. Fix the data (TRIM whitespace, normalise capitalisation) or group only by the primary key (store_id) and use MIN(store_name) or MAX(store_name) to get a single name per group in SELECT.

WHERE clause on aggregate result fails — WHERE COUNT(*) > 3 gives syntax error

Cause: You tried to use an aggregate function in WHERE. WHERE executes before GROUP BY in the logical order — aggregate values do not exist yet when WHERE runs. WHERE evaluates row-by-row before any grouping occurs, so it cannot reference aggregated values.

Fix: Move the aggregate condition to HAVING, which runs after GROUP BY: GROUP BY store_id HAVING COUNT(*) > 3. HAVING filters groups after they have been formed and aggregated. Use WHERE for conditions on individual row values (column = value, column > threshold), and HAVING for conditions on aggregate results (COUNT(*) > N, SUM(amount) > value, AVG(rating) >= 4.0).

GROUP BY query gives wrong totals when joining tables — revenue is inflated

Cause: A many-to-many or one-to-many join multiplied rows before aggregation. If orders are joined to order_items (one order → many items), each order row is duplicated for every item. SUM(orders.total_amount) then adds the total_amount multiple times — once per item in the order. This is called a fan-out bug and is one of the most common data quality issues in analytics queries.

Fix: Aggregate at the correct level before joining. For order-level metrics (total_amount), aggregate in orders first. For item-level metrics (line_total, quantity), aggregate in order_items. Use subqueries or CTEs to pre-aggregate one side of the join: WITH order_totals AS (SELECT order_id, SUM(line_total) AS total FROM order_items GROUP BY order_id) then join order_totals to orders. Or use COUNT(DISTINCT order_id) instead of COUNT(*) to deduplicate after a fan-out join.

GROUP BY expression query works in MySQL but fails in PostgreSQL

Cause: You used a SELECT alias in GROUP BY: SELECT EXTRACT(MONTH FROM order_date) AS month FROM orders GROUP BY month. MySQL and DuckDB allow GROUP BY to reference SELECT aliases. PostgreSQL follows the SQL standard strictly — GROUP BY runs before SELECT, so aliases are not yet defined when GROUP BY executes.

Fix: Repeat the full expression in GROUP BY: GROUP BY EXTRACT(MONTH FROM order_date). For complex expressions that are painful to repeat, use a subquery or CTE: WITH monthly AS (SELECT EXTRACT(MONTH FROM order_date) AS month, total_amount FROM orders) SELECT month, SUM(total_amount) FROM monthly GROUP BY month. The CTE defines the alias first, making it available for GROUP BY in the outer query.

Try It Yourself

The FreshCart management team needs a store performance summary for their Q1 2024 review (January through March 2024). Write a GROUP BY query that shows for each store: store_id, city, total delivered orders, unique customers served, total delivered revenue (rounded to 2 decimal places), average order value (rounded to 2 decimal places), fastest delivery in days, and a performance_band column using CASE: 'Star' if revenue above ₹3,000, 'Good' if above ₹1,500, 'Needs Support' otherwise. Only include stores that had at least 1 delivered order. Sort by total revenue descending.

🎯 Key Takeaways

✓GROUP BY splits rows into groups based on column values, then aggregate functions compute separately per group. One output row per unique combination of GROUP BY column values.
✓Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. GROUP BY runs before SELECT — SELECT aliases cannot be used in GROUP BY in standard SQL.
✓Every non-aggregate column in SELECT must appear in GROUP BY. Aggregate columns (COUNT, SUM, AVG) do not need to be in GROUP BY — they compute per group automatically.
✓WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER grouping. Use WHERE for row-level conditions; use HAVING only for conditions on aggregate results.
✓GROUP BY on multiple columns creates one group per unique combination. GROUP BY city, loyalty_tier creates one group per (city, tier) pair — potentially many groups.
✓GROUP BY on expressions (EXTRACT(MONTH FROM date), CASE WHEN salary >= 70000) groups by computed values. Repeat the full expression in GROUP BY — do not rely on aliases.
✓GROUP BY treats NULLs as a single group — all NULL rows in the GROUP BY column form one group. Use WHERE col IS NOT NULL to exclude or COALESCE to rename the NULL group.
✓Fan-out bug: joining a one-to-many relationship before aggregating inflates counts. Pre-aggregate in a subquery or CTE, or use COUNT(DISTINCT key) to deduplicate.
✓ROLLUP extends GROUP BY to add subtotals and grand totals automatically — one row per group, plus subtotal rows for each partial combination, plus one grand total row.
✓The most impactful performance optimisation: filter aggressively with WHERE before GROUP BY. Reducing rows before grouping is the single biggest lever for GROUP BY query speed.

What comes next

In Module 29, you learn HAVING — the clause that filters groups after aggregation, completing the WHERE + GROUP BY + HAVING triad that powers every analytical query.

Module 29 → HAVING

Discussion

Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.

Continue with GitHub