SQL · Module 36 · Intermediate

Subqueries

Queries inside queries — scalar subqueries in SELECT, subqueries in WHERE and FROM, correlated subqueries, and when to use each type versus a JOIN or CTE

14–20 min · April 2026

Section 8 · Subqueries & Set Operations

// Part 01

What a Subquery Is

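A subquery is a SELECT statement nested inside another SQL statement. The outer query treats the subquery's result as a table, a single value, or a list, depending on where the subquery appears and what it returns. Conceptually, a subquery that does not reference the outer query is evaluated first and its result is then used to evaluate the outer query; a correlated subquery (Part 07) is instead evaluated for each outer row. The optimiser is free to reorder or rewrite this work as long as the result is the same.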

Subqueries solve a specific class of problem: queries that need data computed from one query in order to filter, compute, or define the scope of another. They are the mechanism for composing queries — building complex analysis from simpler pieces. Understanding where subqueries can appear and what each placement means is the entire substance of this module.

Subquery anatomy

-- Outer query
SELECT customer_id, first_name, total_amount
FROM orders
WHERE total_amount > (
  -- Inner query (subquery) — executes first
  SELECT AVG(total_amount)
  FROM orders
  WHERE order_status = 'Delivered'
);
-- The subquery returns a single number (the average)
-- The outer query compares each row's total_amount to that number

// Part 02

The Four Subquery Types

Scalar subquery

Appears in

SELECT clause, WHERE clause, HAVING clause

Returns

Exactly one row, one column — a single value

Use for

Compare a column to a computed aggregate (avg, max, min). Add a computed metric to each row.

Row subquery

Appears in

WHERE clause with = or IN

Returns

Exactly one row, multiple columns

Use for

Match a row against a tuple of values from another query.

Table subquery (derived table)

Appears in

FROM clause — after FROM or JOIN

Returns

Multiple rows and columns — a virtual table

Use for

Pre-aggregate, pre-filter, or transform data before the outer query joins or filters it.

Correlated subquery

Appears in

WHERE, SELECT, or HAVING — references outer query

Returns

One value per outer row — runs once per outer row

Use for

Row-level comparison against an aggregate computed for that specific row's group.
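
As a minimal sketch of the row subquery described above, the example below compares a (category, unit_price) pair against a tuple returned by the inner query. It wraps the SQL in Python's built-in sqlite3 module only so that it runs as-is; the products table and its four rows are made up for illustration and are not the FreshCart data. Row-value comparisons of this kind are assumed to be available (SQLite 3.15 or later, PostgreSQL, MySQL); not every engine supports them.

```python
import sqlite3

# Made-up mini table for illustration (not the FreshCart schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER, category TEXT, unit_price REAL);
INSERT INTO products VALUES
  (1, 'Dairy',  2.50), (2, 'Dairy',  4.00),
  (3, 'Bakery', 3.00), (4, 'Bakery', 4.00);
""")

# Row subquery with =: the inner query returns exactly one row of two columns,
# and the outer row's (category, unit_price) pair must equal that tuple.
priciest_dairy = conn.execute("""
    SELECT product_id
    FROM products
    WHERE (category, unit_price) = (
      SELECT category, MAX(unit_price)
      FROM products
      WHERE category = 'Dairy'
      GROUP BY category
    )
""").fetchall()

# Same idea with IN: the inner query returns one tuple per category,
# so one product per category can match.
priciest_per_category = conn.execute("""
    SELECT product_id
    FROM products
    WHERE (category, unit_price) IN (
      SELECT category, MAX(unit_price)
      FROM products
      GROUP BY category
    )
    ORDER BY product_id
""").fetchall()

print(priciest_dairy)          # [(2,)]
print(priciest_per_category)   # [(2,), (4,)]
```

Product 4 costs the same as product 2 but is not returned by the first query, because the category half of the tuple does not match.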

// Part 03

Scalar Subquery in WHERE — Compare to a Computed Value

The most common subquery type: a subquery in WHERE that returns a single value. The outer query compares each row against that value. This is the pattern for "find rows where the value is above/below the overall average" — a question that cannot be answered with a simple WHERE condition because the average is computed from the data being filtered.

Orders above the average order value

-- Find orders above the overall average order value
-- The subquery computes the average; the outer query filters rows
SELECT
  o.order_id,
  o.order_date,
  o.total_amount,
  o.order_status,
  c.first_name || ' ' || c.last_name  AS customer
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
WHERE o.total_amount > (
  SELECT AVG(total_amount)
  FROM orders
  WHERE order_status = 'Delivered'
)
ORDER BY o.total_amount DESC;

-- Products priced above the overall average product price
-- This version uses a non-correlated scalar subquery for the overall average
-- Comparing against each product's own category average needs a correlated subquery (Part 07)
SELECT
  product_name,
  category,
  unit_price,
  ROUND(unit_price - (SELECT AVG(unit_price) FROM products), 2) AS above_avg_by
FROM products
WHERE unit_price > (SELECT AVG(unit_price) FROM products)
ORDER BY above_avg_by DESC;

Scalar subquery in SELECT — add a computed reference to every row

-- Show each order's total_amount alongside the overall average
-- and the difference between the two
SELECT
  o.order_id,
  o.order_date,
  o.total_amount,
  ROUND((SELECT AVG(total_amount) FROM orders
         WHERE order_status = 'Delivered'), 2)   AS overall_avg,
  ROUND(o.total_amount -
        (SELECT AVG(total_amount) FROM orders
         WHERE order_status = 'Delivered'), 2)   AS vs_avg
FROM orders AS o
WHERE o.order_status = 'Delivered'
ORDER BY vs_avg DESC
LIMIT 10;

-- Each product with its price and the category's average price
-- Uses a scalar subquery in SELECT (correlated — runs per row)
SELECT
  product_id,
  product_name,
  category,
  unit_price,
  ROUND((
    SELECT AVG(p2.unit_price)
    FROM products AS p2
    WHERE p2.category = p.category
  ), 2)                                   AS category_avg_price,
  ROUND(unit_price - (
    SELECT AVG(p2.unit_price)
    FROM products AS p2
    WHERE p2.category = p.category
  ), 2)                                   AS vs_category_avg
FROM products AS p
ORDER BY category, vs_category_avg DESC;

💡 Note

A scalar subquery in SELECT that references the outer query (like the category example above) is a correlated scalar subquery. It executes once per row in the outer query — if the outer query has 100 rows, the subquery runs 100 times. This is correct but can be slow for large tables. A JOIN to a pre-aggregated subquery or a window function is usually more efficient for the same result.
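
The window-function alternative mentioned in the note can be sketched as follows. This is an illustration under assumptions rather than the course's own solution: the five-row products table is invented, and the SQL is run through Python's sqlite3 module (which needs SQLite 3.25 or later for window functions) so the comparison is reproducible.

```python
import sqlite3

# Invented sample data, not the FreshCart tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_name TEXT, category TEXT, unit_price REAL);
INSERT INTO products VALUES
  ('Milk',  'Dairy',  2.0), ('Cheese', 'Dairy',  6.0),
  ('Bread', 'Bakery', 3.0), ('Cake',   'Bakery', 9.0), ('Bun', 'Bakery', 3.0);
""")

# Correlated scalar subquery: the inner AVG is tied to the outer row's category.
correlated = conn.execute("""
    SELECT p.product_name,
           (SELECT AVG(p2.unit_price)
            FROM products AS p2
            WHERE p2.category = p.category) AS category_avg
    FROM products AS p
    ORDER BY p.product_name
""").fetchall()

# Window function: one pass over the table, averaging within each category partition.
windowed = conn.execute("""
    SELECT product_name,
           AVG(unit_price) OVER (PARTITION BY category) AS category_avg
    FROM products
    ORDER BY product_name
""").fetchall()

print(correlated == windowed)  # True
print(windowed[0])             # ('Bread', 5.0)
```

Both forms attach the same per-category average to every row; the window version states the grouping once instead of repeating the subquery for each computed column.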

// Part 04

Subquery with IN — Filter Against a List

A subquery after IN returns a list of values. The outer query keeps rows where the column matches any value in that list. This is the multi-value version of the scalar subquery — instead of one comparison value, IN provides many.

-- Orders placed by customers from metro cities
-- Subquery returns a list of customer_ids from metro cities
SELECT
  o.order_id,
  o.order_date,
  o.total_amount,
  o.order_status
FROM orders AS o
WHERE o.customer_id IN (
  SELECT customer_id
  FROM customers
  WHERE city IN ('Seattle', 'New York', 'Delhi', 'Austin', 'Chicago')
)
ORDER BY o.total_amount DESC
LIMIT 10;

-- Products that appear in at least one delivered order
-- Subquery returns all product_ids from delivered orders
SELECT
  p.product_id,
  p.product_name,
  p.category,
  p.unit_price
FROM products AS p
WHERE p.product_id IN (
  SELECT DISTINCT oi.product_id
  FROM order_items AS oi
  JOIN orders AS o ON oi.order_id = o.order_id
  WHERE o.order_status = 'Delivered'
)
ORDER BY p.category, p.unit_price DESC;

-- Stores that have delivered more than 3 orders
-- Subquery identifies qualifying store IDs
SELECT
  s.store_id,
  s.store_name,
  s.city,
  s.monthly_target
FROM stores AS s
WHERE s.store_id IN (
  SELECT store_id
  FROM orders
  WHERE order_status = 'Delivered'
  GROUP BY store_id
  HAVING COUNT(*) > 3
)
ORDER BY s.city;

NOT IN — exclude rows matching the list

-- Products that have NEVER been ordered
-- NOT IN: products whose product_id is not in any order_items row
SELECT
  p.product_id,
  p.product_name,
  p.category,
  p.unit_price,
  p.in_stock
FROM products AS p
WHERE p.product_id NOT IN (
  SELECT DISTINCT product_id
  FROM order_items
  WHERE product_id IS NOT NULL  -- guards against the NULL trap described below
)
ORDER BY p.category, p.unit_price DESC;

⚠️ Important

NOT IN is dangerous when the subquery can return NULL values. A condition such as col NOT IN (1, 2, NULL) can never be TRUE: when col matches a listed value the result is FALSE, and otherwise the comparison col = NULL evaluates to NULL, which makes the whole NOT IN condition NULL. Either way the row is filtered out, so the query returns no rows. Always add WHERE subquery_column IS NOT NULL inside the subquery when using NOT IN, or use NOT EXISTS instead.
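
The NULL trap is easiest to believe once you see it. The sketch below uses Python's sqlite3 module and two tiny made-up tables (three products, two order lines, one of which has a NULL product_id) to compare the naive NOT IN with the two safe forms. The table contents and the ids helper are assumptions for illustration only.

```python
import sqlite3

# Made-up data: product 1 has been ordered, and one order line has a NULL product_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER, product_name TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER);
INSERT INTO products VALUES (1, 'Milk'), (2, 'Bread'), (3, 'Eggs');
INSERT INTO order_items VALUES (10, 1), (11, NULL);
""")

def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

# Naive NOT IN: the NULL in the subquery result makes the condition never TRUE.
naive_not_in = ids("""
    SELECT product_id FROM products
    WHERE product_id NOT IN (SELECT product_id FROM order_items)
    ORDER BY product_id
""")

# Fix 1: filter NULLs out of the subquery.
filtered_not_in = ids("""
    SELECT product_id FROM products
    WHERE product_id NOT IN (
      SELECT product_id FROM order_items WHERE product_id IS NOT NULL
    )
    ORDER BY product_id
""")

# Fix 2: NOT EXISTS, which is unaffected by NULLs in order_items.
not_exists = ids("""
    SELECT p.product_id FROM products AS p
    WHERE NOT EXISTS (
      SELECT 1 FROM order_items AS oi WHERE oi.product_id = p.product_id
    )
    ORDER BY p.product_id
""")

print(naive_not_in)     # []
print(filtered_not_in)  # [2, 3]
print(not_exists)       # [2, 3]
```

The naive query raises no error and simply returns nothing, which is why this bug tends to go unnoticed.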

// Part 05

Subquery in FROM — Derived Tables

A subquery in the FROM clause is called a derived table or inline view. The outer query treats it exactly like a regular table — it can be joined, filtered, grouped, and sorted. Derived tables are essential for multi-step analytical queries where an intermediate aggregation must be computed before the outer query can use it.

Pre-aggregate then join

-- Average order value per store, joined back to store details
-- The derived table computes per-store aggregates
-- The outer query joins store details to the aggregated result
SELECT
  s.store_id,
  s.store_name,
  s.city,
  s.monthly_target,
  store_stats.order_count,
  store_stats.avg_order_value,
  store_stats.total_revenue,
  ROUND(store_stats.total_revenue / s.monthly_target * 100, 1) AS target_pct
FROM stores AS s
JOIN (
  SELECT
    store_id,
    COUNT(*)                     AS order_count,
    ROUND(AVG(total_amount), 2)  AS avg_order_value,
    ROUND(SUM(total_amount), 2)  AS total_revenue
  FROM orders
  WHERE order_status = 'Delivered'
  GROUP BY store_id
) AS store_stats ON s.store_id = store_stats.store_id
ORDER BY total_revenue DESC;

Filter before joining — reduce rows early

-- Delivered orders placed by Platinum customers
-- Pre-filter customers in a derived table before joining to orders
SELECT
  plat.first_name || ' ' || plat.last_name  AS customer,
  plat.city,
  o.order_id,
  o.order_date,
  o.total_amount
FROM (
  SELECT customer_id, first_name, last_name, city
  FROM customers
  WHERE loyalty_tier = 'Platinum'
) AS plat
JOIN orders AS o
  ON plat.customer_id = o.customer_id
  AND o.order_status = 'Delivered'
ORDER BY o.total_amount DESC;

Two-level aggregation — aggregate an aggregate

-- Average of per-store average order values
-- (average of averages — requires two levels of aggregation)
SELECT
  ROUND(AVG(store_avg), 2)    AS avg_of_store_avgs,
  MIN(store_avg)              AS lowest_store_avg,
  MAX(store_avg)              AS highest_store_avg
FROM (
  SELECT
    store_id,
    ROUND(AVG(total_amount), 2) AS store_avg
  FROM orders
  WHERE order_status = 'Delivered'
  GROUP BY store_id
) AS store_averages;

-- Customer segments: high-value customers (top 25% by spend)
-- Step 1: compute each customer's total spend (inner query)
-- Step 2: compute the 75th percentile threshold (middle query)
-- Step 3: filter customers above that threshold (outer query)
SELECT
  c.customer_id,
  c.first_name || ' ' || c.last_name   AS customer,
  c.loyalty_tier,
  cust_spend.total_spend
FROM customers AS c
JOIN (
  SELECT
    customer_id,
    ROUND(SUM(total_amount), 2) AS total_spend
  FROM orders
  WHERE order_status = 'Delivered'
  GROUP BY customer_id
) AS cust_spend ON c.customer_id = cust_spend.customer_id
WHERE cust_spend.total_spend >= (
  SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_spend)
  FROM (
    SELECT customer_id, SUM(total_amount) AS total_spend
    FROM orders WHERE order_status = 'Delivered'
    GROUP BY customer_id
  ) AS spend_distribution
)
ORDER BY cust_spend.total_spend DESC;

// Part 06

Subquery in HAVING — Filter Groups by Computed Values

-- Stores whose revenue exceeds the average store revenue
-- The subquery computes the average; HAVING filters groups above it
SELECT
  store_id,
  COUNT(*)                     AS order_count,
  ROUND(SUM(total_amount), 2)  AS revenue
FROM orders
WHERE order_status = 'Delivered'
GROUP BY store_id
HAVING SUM(total_amount) > (
  -- Average revenue per store
  SELECT AVG(store_rev)
  FROM (
    SELECT store_id, SUM(total_amount) AS store_rev
    FROM orders
    WHERE order_status = 'Delivered'
    GROUP BY store_id
  ) AS store_revenues
)
ORDER BY revenue DESC;

-- Product categories where average margin exceeds overall average margin
SELECT
  category,
  COUNT(*)                                                     AS product_count,
  ROUND(AVG((unit_price - cost_price) / unit_price * 100), 1) AS avg_margin_pct
FROM products
GROUP BY category
HAVING AVG((unit_price - cost_price) / unit_price * 100) > (
  SELECT AVG((unit_price - cost_price) / unit_price * 100)
  FROM products
)
ORDER BY avg_margin_pct DESC;

// Part 07

Correlated Subquery — Runs Once Per Outer Row

A correlated subquery references a column from the outer query. This means it cannot be evaluated independently — it must be re-executed for each row of the outer query, with the outer row's values substituted into the subquery. The result is one value per outer row.

Correlated subqueries are powerful but can be slow on large tables: conceptually they execute N times for N outer rows, although some optimisers rewrite them into joins. For performance-critical queries, a JOIN or window function often provides the same result more efficiently. Use correlated subqueries when the logic is most clearly expressed as "for each row, compute X from related rows."

Each employee compared to their department average

-- Each employee with their salary and department average
-- Correlated subquery: for each employee row, compute
-- the average salary for employees in the SAME department
SELECT
  e.employee_id,
  e.first_name || ' ' || e.last_name   AS employee,
  e.department,
  e.salary,
  ROUND((
    SELECT AVG(e2.salary)
    FROM employees AS e2
    WHERE e2.department = e.department  -- references outer row's department
  ), 0)                                AS dept_avg_salary,
  ROUND(e.salary - (
    SELECT AVG(e2.salary)
    FROM employees AS e2
    WHERE e2.department = e.department
  ), 0)                                AS vs_dept_avg
FROM employees AS e
ORDER BY e.department, e.salary DESC;

Each product vs its category's max price

-- Each product's price vs the maximum price in its category
SELECT
  p.product_name,
  p.category,
  p.unit_price,
  (
    SELECT MAX(p2.unit_price)
    FROM products AS p2
    WHERE p2.category = p.category   -- correlated: matches outer row's category
  )                                   AS category_max_price,
  ROUND(
    (SELECT MAX(p2.unit_price) FROM products AS p2
     WHERE p2.category = p.category) - p.unit_price
  , 2)                                AS below_max_by
FROM products AS p
ORDER BY p.category, p.unit_price DESC;

Correlated subquery in WHERE — the EXISTS pattern

-- Customers who have at least one order above ₹1,000
-- EXISTS: stops as soon as the first matching row is found (efficient)
SELECT
  c.customer_id,
  c.first_name || ' ' || c.last_name  AS customer,
  c.city,
  c.loyalty_tier
FROM customers AS c
WHERE EXISTS (
  SELECT 1
  FROM orders AS o
  WHERE o.customer_id = c.customer_id    -- correlated
    AND o.total_amount > 1000
    AND o.order_status = 'Delivered'
)
ORDER BY c.loyalty_tier, c.customer_id;

-- Stores that have at least one employee
-- EXISTS vs IN: EXISTS short-circuits on first match, often faster
SELECT
  s.store_id,
  s.store_name,
  s.city
FROM stores AS s
WHERE EXISTS (
  SELECT 1
  FROM employees AS e
  WHERE e.store_id = s.store_id    -- correlated: checks this specific store
)
ORDER BY s.city;

// Part 08

EXISTS and NOT EXISTS — Existence Checks

EXISTS returns TRUE if the subquery returns at least one row, FALSE if it returns no rows. It is the cleanest way to check whether a related record exists, and it short-circuits — the database stops scanning as soon as the first matching row is found, making it very efficient for existence checks.

EXISTS vs IN — when to prefer each

-- EXISTS: efficient, NULL-safe, semantically clear
-- Use for existence checks — "does at least one related row exist?"
WHERE EXISTS (
  SELECT 1           -- SELECT 1 is conventional — column content irrelevant
  FROM orders AS o
  WHERE o.customer_id = c.customer_id
    AND o.total_amount > 1000
)

-- IN: readable, good for fixed lists and small subquery results
-- Problematic with NULLs in the subquery result
WHERE customer_id IN (
  SELECT customer_id FROM orders WHERE total_amount > 1000
)

-- Both return the same rows for non-NULL values
-- Prefer EXISTS for large subquery results or when NULLs are possible
-- Prefer IN for small fixed value lists (IN (1, 2, 3)) or readable code
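
The two WHERE fragments above can be put into complete queries to check that they return the same customers. The sketch below does that with Python's sqlite3 module and a handful of invented rows, and adds a plain JOIN for contrast. The table contents and the customer_ids helper are assumptions for illustration.

```python
import sqlite3

# Invented sample rows; customer 1 has two orders above 1000.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, first_name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total_amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chloe');
INSERT INTO orders VALUES
  (10, 1, 1500), (11, 1, 2000),
  (12, 2, 500),
  (13, 3, 1200);
""")

def customer_ids(sql):
    """Run a query and return the customer_id column as a list."""
    return [row[0] for row in conn.execute(sql)]

via_exists = customer_ids("""
    SELECT c.customer_id
    FROM customers AS c
    WHERE EXISTS (
      SELECT 1 FROM orders AS o
      WHERE o.customer_id = c.customer_id AND o.total_amount > 1000
    )
    ORDER BY c.customer_id
""")

via_in = customer_ids("""
    SELECT customer_id
    FROM customers
    WHERE customer_id IN (
      SELECT customer_id FROM orders WHERE total_amount > 1000
    )
    ORDER BY customer_id
""")

# A plain JOIN answers a different question: one row per matching order.
via_join = customer_ids("""
    SELECT c.customer_id
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.total_amount > 1000
    ORDER BY c.customer_id
""")

print(via_exists)  # [1, 3]
print(via_in)      # [1, 3]
print(via_join)    # [1, 1, 3]
```

EXISTS and IN each return a customer at most once, while the JOIN repeats customer 1 for each qualifying order, so it would need DISTINCT to serve as an existence check.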

-- NOT EXISTS: customers who have NO orders at all
-- NULL-safe alternative to NOT IN; returns the same rows as LEFT JOIN ... IS NULL
SELECT
  c.customer_id,
  c.first_name || ' ' || c.last_name  AS customer,
  c.email,
  c.joined_date
FROM customers AS c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders AS o
  WHERE o.customer_id = c.customer_id
)
ORDER BY c.joined_date;

-- NOT EXISTS: products that have never been ordered
SELECT
  p.product_id,
  p.product_name,
  p.category,
  p.unit_price
FROM products AS p
WHERE NOT EXISTS (
  SELECT 1
  FROM order_items AS oi
  WHERE oi.product_id = p.product_id
)
ORDER BY p.category, p.unit_price DESC;

// Part 09

Subquery vs JOIN vs CTE — When to Use Each

Subqueries, JOINs, and CTEs often produce the same result. Choosing between them is about readability, performance, and reuse.

Approach	Best for	Avoid when	Performance
Scalar subquery	Adding one computed value per row; simple threshold comparisons	Needs to reference multiple columns from the subquery; correlated over large tables	Fine for small tables; correlated versions can be O(n²)
IN subquery	Filtering against a list computed from another table; readable anti-join with NOT IN	Subquery can return NULLs (use EXISTS instead); very large result lists	Good; optimiser often converts to JOIN internally
Derived table (FROM subquery)	Pre-aggregation before joining; two-level aggregation; isolating complex logic	The same subquery is needed multiple times (use CTE); deep nesting degrades readability	Equivalent to CTE; may or may not be materialised by optimiser
Correlated subquery	Row-level comparison against a group aggregate; EXISTS/NOT EXISTS checks	Large outer tables — runs once per row; better replaced by JOIN + GROUP BY or window function	Can be slow — O(n) subquery executions; optimiser sometimes rewrites to JOIN
JOIN	Combining columns from multiple tables; most analytics queries; when the relationship is explicit	Single existence check (use EXISTS); when the relationship is aggregate-based	Usually fastest; uses indexes; optimiser has full plan flexibility
CTE (WITH clause)	Complex multi-step logic; when the same subquery is referenced more than once; self-documenting named steps	Very simple one-use subqueries where inline is clearer	Usually equivalent to a derived table; PostgreSQL 12+ inlines a non-recursive, side-effect-free CTE that is referenced once and materialises one that is referenced more than once (override with MATERIALIZED or NOT MATERIALIZED)

Rewriting a correlated subquery as a JOIN — performance improvement

-- Correlated subquery version (runs AVG once per employee row)
SELECT
  e.first_name,
  e.department,
  e.salary,
  ROUND((SELECT AVG(e2.salary) FROM employees e2
         WHERE e2.department = e.department), 0) AS dept_avg
FROM employees AS e
ORDER BY e.department, e.salary DESC;

-- JOIN version (computes AVG once per department — much faster at scale)
SELECT
  e.first_name,
  e.department,
  e.salary,
  da.dept_avg
FROM employees AS e
JOIN (
  SELECT department, ROUND(AVG(salary), 0) AS dept_avg
  FROM employees
  GROUP BY department
) AS da ON e.department = da.department
ORDER BY e.department, e.salary DESC;

Both produce identical results. The JOIN version computes the department average once per department, whereas the correlated version conceptually computes it once per employee. For a company with 10,000 employees across 20 departments, that is 20 aggregate computations instead of 10,000, or 500 times fewer, unless the optimiser already rewrites the correlated form into a join.
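
To check the equivalence claim on a small scale, the sketch below runs both versions against a five-row employees table through Python's sqlite3 module and compares the results. The rows are invented for illustration, and the last line only restates the arithmetic from the paragraph above.

```python
import sqlite3

# Invented sample data: two departments, five employees.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (first_name TEXT, department TEXT, salary REAL);
INSERT INTO employees VALUES
  ('Ana', 'Ops',   40000), ('Bo', 'Ops',   50000), ('Cy', 'Ops', 60000),
  ('Di',  'Sales', 30000), ('Ed', 'Sales', 50000);
""")

# Correlated version: the AVG subquery depends on the outer row's department.
correlated = conn.execute("""
    SELECT e.first_name, e.department, e.salary,
           ROUND((SELECT AVG(e2.salary) FROM employees AS e2
                  WHERE e2.department = e.department), 0) AS dept_avg
    FROM employees AS e
    ORDER BY e.department, e.salary DESC
""").fetchall()

# JOIN version: averages are computed once per department, then joined back.
joined = conn.execute("""
    SELECT e.first_name, e.department, e.salary, da.dept_avg
    FROM employees AS e
    JOIN (
      SELECT department, ROUND(AVG(salary), 0) AS dept_avg
      FROM employees
      GROUP BY department
    ) AS da ON e.department = da.department
    ORDER BY e.department, e.salary DESC
""").fetchall()

# 10,000 per-employee computations versus 20 per-department computations.
ratio = 10_000 // 20

print(correlated == joined)  # True
print(joined[0])             # ('Cy', 'Ops', 60000.0, 50000.0)
print(ratio)                 # 500
```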

// Part 10

Nested Subqueries — Subqueries Inside Subqueries

Subqueries can be nested — a subquery can itself contain a subquery. SQL has no theoretical limit on nesting depth, but readability degrades rapidly beyond two levels. When you find yourself writing a three-level nested subquery, it is almost always clearer as a CTE.

-- Nested subqueries: orders from stores with above-average revenue
SELECT
  o.order_id,
  o.store_id,
  o.total_amount,
  o.order_status
FROM orders AS o
WHERE o.store_id IN (
  -- Level 1: stores whose revenue is at or above the average store revenue
  SELECT store_id
  FROM (
    -- Level 2: per-store revenue totals
    SELECT store_id, SUM(total_amount) AS store_revenue
    FROM orders
    WHERE order_status = 'Delivered'
    GROUP BY store_id
  ) AS store_revs
  WHERE store_revenue >= (
    -- Level 3: average revenue per store (a mean, not a true median)
    SELECT AVG(store_revenue)
    FROM (
      SELECT store_id, SUM(total_amount) AS store_revenue
      FROM orders
      WHERE order_status = 'Delivered'
      GROUP BY store_id
    ) AS all_store_revs
  )
)
ORDER BY o.total_amount DESC
LIMIT 10;

-- Same logic, much cleaner as a CTE
WITH store_revenues AS (
  SELECT store_id, SUM(total_amount) AS revenue
  FROM orders
  WHERE order_status = 'Delivered'
  GROUP BY store_id
),
avg_revenue AS (
  -- Average revenue per store (a mean, not a true median)
  SELECT AVG(revenue) AS avg_rev FROM store_revenues
),
top_stores AS (
  SELECT sr.store_id
  FROM store_revenues AS sr
  CROSS JOIN avg_revenue AS ar
  WHERE sr.revenue >= ar.avg_rev
)
SELECT o.order_id, o.store_id, o.total_amount, o.order_status
FROM orders AS o
WHERE o.store_id IN (SELECT store_id FROM top_stores)
ORDER BY o.total_amount DESC
LIMIT 10;

🎯 Pro Tip

The rule for nesting depth: one level of inline subquery is fine. Two levels is acceptable for simple cases. Three or more levels should almost always be refactored into CTEs. Each CTE gets a descriptive name that documents what it computes — the multi-level CTE version reads like a step-by-step explanation of the logic. The nested subquery version requires the reader to work inside-out to understand what it does.

// Part 11

What This Looks Like at Work

You are a data analyst at Stripe. The product team needs a customer health report for a quarterly review. They want: every customer with their total spend, their spend relative to the average for their loyalty tier, whether they qualify as "high value" (top 25% spenders in their tier), and their most recent order date. This requires several computed values that depend on group-level aggregates — a classic multi-subquery problem.

3:00 PM

Requirements received

Customer ID, name, loyalty tier, total spend, tier average spend, spend vs tier average, high-value flag (top 25% in tier), last order date.

3:20 PM

You plan the query structure

Three computed values need subqueries: tier average spend (correlated or JOIN), 75th percentile per tier (derived table), last order date (scalar correlated or JOIN). You choose CTEs for clarity.

-- Customer health report using CTEs and subqueries
WITH customer_spend AS (
  -- Total delivered spend per customer
  SELECT
    customer_id,
    ROUND(SUM(total_amount), 2)  AS total_spend,
    COUNT(order_id)              AS order_count,
    MAX(order_date)              AS last_order_date
  FROM orders
  WHERE order_status = 'Delivered'
  GROUP BY customer_id
),
tier_stats AS (
  -- Average and 75th percentile spend per loyalty tier
  SELECT
    c.loyalty_tier,
    ROUND(AVG(cs.total_spend), 2)   AS tier_avg_spend,
    ROUND(PERCENTILE_CONT(0.75)
      WITHIN GROUP (ORDER BY cs.total_spend), 2) AS tier_p75_spend
  FROM customers AS c
  JOIN customer_spend AS cs ON c.customer_id = cs.customer_id
  GROUP BY c.loyalty_tier
)
SELECT
  c.customer_id,
  c.first_name || ' ' || c.last_name   AS customer,
  c.loyalty_tier,
  COALESCE(cs.total_spend, 0)          AS total_spend,
  COALESCE(cs.order_count, 0)          AS order_count,
  cs.last_order_date,
  ts.tier_avg_spend,
  ROUND(COALESCE(cs.total_spend, 0) - ts.tier_avg_spend, 2) AS vs_tier_avg,
  CASE
    WHEN cs.total_spend >= ts.tier_p75_spend THEN 'High value'
    WHEN cs.total_spend IS NULL              THEN 'No orders'
    ELSE 'Standard'
  END                                  AS customer_segment
FROM customers AS c
LEFT JOIN customer_spend AS cs ON c.customer_id = cs.customer_id
JOIN tier_stats AS ts ON c.loyalty_tier = ts.loyalty_tier
ORDER BY c.loyalty_tier, total_spend DESC NULLS LAST;

4:00 PM

Report complete — delivered 40 minutes early

Two CTEs, customer_spend and tier_stats, each do one clean computation and are assembled in the final SELECT. Every computed metric is clearly named and the logic is readable without any nested subquery archaeology. The product team gets the complete customer health report in one query with all requested fields.

🎯 Pro Tip

The CTE-first pattern for complex reports: identify every intermediate computation the report needs (customer total spend, tier averages, percentiles), give each its own named CTE, then assemble in the final SELECT. The final SELECT reads like plain English — join customer to their spend to their tier stats. This structure makes the query maintainable: adding a new metric means adding one CTE, not restructuring a nested subquery stack.

// Part 12

Interview Prep — 5 Questions With Complete Answers

Q: What is a subquery and what are the different types?

A subquery is a SELECT statement nested inside another SQL statement. The outer query uses the subquery's result as if it were a value, a list, or a table. Conceptually, a non-correlated subquery is evaluated first and the outer query then uses its result, while a correlated subquery is evaluated for each outer row.

There are four types, based on where the subquery appears and what it returns. A scalar subquery appears in SELECT, WHERE, or HAVING and returns exactly one row and one column: a single value. It is used to compare a column against a computed aggregate or to add a computed reference value to each row. A row subquery returns exactly one row with several columns and is matched against a tuple of values in WHERE. A table subquery (derived table) appears in the FROM clause and returns multiple rows and columns: a virtual table that the outer query can JOIN, filter, and aggregate. It is essential for pre-aggregation and multi-step logic. A correlated subquery references columns from the outer query, making it dependent on the outer row, so it conceptually executes once per outer row. It is used for row-level comparisons against group aggregates and for EXISTS/NOT EXISTS existence checks. Separately from these four, a subquery used with IN or NOT IN returns a single-column list of values that the outer WHERE clause filters against.

The choice of subquery type follows from what the query needs: a single comparison value → scalar subquery in WHERE; a list of valid IDs → IN subquery; pre-aggregated data to join → derived table in FROM; per-row group aggregate → correlated subquery (or JOIN to a derived table for performance); existence check → EXISTS correlated subquery.

Q: What is a correlated subquery and how does it differ from a non-correlated subquery?

A non-correlated subquery executes independently of the outer query — it produces a single result (a value, a list, or a table) and the outer query uses that result. The subquery runs once, and the outer query uses the cached result for every row it evaluates. SELECT AVG(salary) FROM employees is non-correlated — it produces one number regardless of which employee row the outer query is examining.

A correlated subquery references a column from the outer query — it cannot be evaluated without knowing the current outer row's values. It executes once per outer row, substituting the current row's values each time. SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.department = e.department is correlated — it computes the average salary for the department of the current employee being examined by the outer query. For each employee row in the outer query, the subquery reruns with a different department value.

The performance implication is the key distinction: a non-correlated subquery runs once regardless of outer table size, while a correlated subquery conceptually runs N times for N outer rows. On a table with 10,000 employees, a correlated subquery that computes the department average runs 10,000 times, even though there are only 20 departments. The JOIN equivalent computes the department average once per department (20 times) and joins, which is 500 times fewer aggregate computations. Some optimisers decorrelate such subqueries automatically, but it is still worth considering a JOIN to a pre-aggregated derived table or a window function for large-table performance.

Q: When would you use EXISTS instead of IN?

Use EXISTS when you are checking for the existence of a related record — you need to know whether at least one row satisfying a condition exists in another table, but you do not need the actual values from that table. EXISTS is semantically clearest for existence checks and is always NULL-safe — it never has the NULL-returns-nothing problem that NOT IN has.

EXISTS can short-circuit: as soon as one matching row is found, the database can stop scanning for that outer row. Whether this makes EXISTS faster than IN depends on the engine, because many optimisers turn both forms into the same semi-join plan; the difference matters most on engines that build the full IN list before filtering. EXISTS also handles NULLs correctly: NOT EXISTS still works when the subquery column contains NULLs, whereas NOT IN over a subquery that returns a NULL silently returns zero rows.

Use IN when you have a small fixed list of values (IN (1, 2, 3) — no subquery), or when the subquery returns a manageable list of values and readability matters more than maximum performance. IN is slightly more readable for simple filtering: WHERE customer_id IN (SELECT customer_id FROM vip_customers) clearly states "get me rows where the customer is in this list." EXISTS requires understanding the correlated reference: WHERE EXISTS (SELECT 1 FROM vip_customers WHERE vip_customers.customer_id = orders.customer_id) is more verbose but semantically equivalent. The practical rule: use EXISTS for large subquery results, nullable columns, and all NOT IN use cases. Use IN for small lists and when you are certain the subquery returns no NULLs.

Q: What is a derived table and when would you use one instead of a CTE?

A derived table is a subquery in the FROM clause: a temporary result set that the outer query treats as a regular table. It is defined inline within the query: FROM (SELECT store_id, SUM(total_amount) AS revenue FROM orders GROUP BY store_id) AS store_revenues. The alias (AS store_revenues) is the name the outer query uses to reference it; MySQL and PostgreSQL before version 16 require one, so it is safest to always supply it. Derived tables are not stored or cached between queries; they are computed each time the query runs.

A CTE (Common Table Expression) defined with WITH serves the same purpose — it creates a named intermediate result — but it appears before the main query rather than inline. Both are equivalent in terms of what they compute. The difference is primarily readability and reuse. A CTE can be referenced multiple times in the same query; a derived table defined inline can only be used once at the location it is defined.

When to use derived table vs CTE: use an inline derived table for simple, single-use pre-aggregation where the subquery is short and adding a CTE name would be more overhead than clarity. Use a CTE when the same intermediate result is needed more than once (CTE is defined once, referenced by name wherever needed), when the logic has multiple sequential steps that benefit from named intermediate results, or when the derived table subquery is complex enough that naming it adds meaningful documentation. As a practical guide: if a subquery in FROM is more than 5-6 lines, extract it to a named CTE for readability. If it is 2-3 lines, keeping it inline is often cleaner.
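
The reuse argument can be sketched with a query that needs per-store revenue twice: once as the rows to filter and once to compute the average to filter against. The example runs through Python's sqlite3 module on six made-up orders rows; the data is an assumption for illustration, not the FreshCart tables.

```python
import sqlite3

# Made-up orders; the cancelled order must not count towards revenue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, store_id INTEGER,
                     total_amount REAL, order_status TEXT);
INSERT INTO orders VALUES
  (1, 1, 100, 'Delivered'), (2, 1, 200, 'Delivered'),
  (3, 2,  50, 'Delivered'), (6, 2, 900, 'Cancelled'),
  (4, 3, 400, 'Delivered'), (5, 3, 100, 'Delivered');
""")

# CTE: store_revenues is defined once and referenced twice by name.
with_cte = conn.execute("""
    WITH store_revenues AS (
      SELECT store_id, SUM(total_amount) AS revenue
      FROM orders
      WHERE order_status = 'Delivered'
      GROUP BY store_id
    )
    SELECT store_id, revenue
    FROM store_revenues
    WHERE revenue > (SELECT AVG(revenue) FROM store_revenues)
    ORDER BY store_id
""").fetchall()

# Derived tables: the same aggregation has to be written out twice.
with_derived = conn.execute("""
    SELECT store_id, revenue
    FROM (
      SELECT store_id, SUM(total_amount) AS revenue
      FROM orders
      WHERE order_status = 'Delivered'
      GROUP BY store_id
    ) AS store_revenues
    WHERE revenue > (
      SELECT AVG(revenue)
      FROM (
        SELECT store_id, SUM(total_amount) AS revenue
        FROM orders
        WHERE order_status = 'Delivered'
        GROUP BY store_id
      ) AS store_revenues_again
    )
    ORDER BY store_id
""").fetchall()

print(with_cte)                  # [(1, 300.0), (3, 500.0)]
print(with_cte == with_derived)  # True
```

The results match, but in the derived-table form a change to the revenue definition has to be made in two places.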

Q: How would you find customers whose total order value is above the average for their loyalty tier?

This requires comparing each customer's total spend against the average total spend for customers in the same loyalty tier. The tier average is an aggregate that varies by tier — it cannot be computed with a single scalar subquery (which would give the overall average, not the per-tier average).

Approach 1 — correlated subquery: for each customer, run a subquery that computes the average total spend for customers in the same tier.

SELECT c.customer_id, c.loyalty_tier, SUM(o.total_amount) AS total_spend
FROM customers AS c
JOIN orders AS o ON c.customer_id = o.customer_id
WHERE o.order_status = 'Delivered'
GROUP BY c.customer_id, c.loyalty_tier
HAVING SUM(o.total_amount) > (
  SELECT AVG(tier_spend)
  FROM (
    SELECT c2.customer_id, SUM(o2.total_amount) AS tier_spend
    FROM customers AS c2
    JOIN orders AS o2 ON c2.customer_id = o2.customer_id
    WHERE o2.order_status = 'Delivered'
      AND c2.loyalty_tier = c.loyalty_tier
    GROUP BY c2.customer_id
  ) AS tier_totals
);

This is correct but the correlated subquery in HAVING makes it verbose and potentially slow.

Approach 2 — CTE with JOIN (preferred):

WITH customer_totals AS (
  SELECT c.customer_id, c.loyalty_tier, SUM(o.total_amount) AS total_spend
  FROM customers AS c
  JOIN orders AS o ON c.customer_id = o.customer_id
  WHERE o.order_status = 'Delivered'
  GROUP BY c.customer_id, c.loyalty_tier
),
tier_averages AS (
  SELECT loyalty_tier, AVG(total_spend) AS avg_spend
  FROM customer_totals
  GROUP BY loyalty_tier
)
SELECT ct.customer_id, ct.loyalty_tier, ct.total_spend, ta.avg_spend
FROM customer_totals AS ct
JOIN tier_averages AS ta ON ct.loyalty_tier = ta.loyalty_tier
WHERE ct.total_spend > ta.avg_spend;

The CTE approach computes tier averages once (not per customer), is readable, and scales efficiently to large datasets.

// Part 13

Errors You Will Hit — And Exactly Why They Happen

ERROR: subquery returns more than one row — scalar subquery used where one value expected

Cause: A scalar subquery (used in SELECT, WHERE with =, or HAVING with =) returned multiple rows instead of at most one. For example, WHERE total_amount = (SELECT total_amount FROM orders WHERE customer_id = 5) — if customer 5 has multiple orders, the subquery returns multiple rows and the = operator fails. A scalar subquery must never return more than one row; if it returns zero rows, its value is NULL.

Fix: Three options: (1) Add an aggregate to force one row: = (SELECT MAX(total_amount) FROM orders WHERE customer_id = 5). (2) Use IN instead of = if multiple matching values are acceptable: WHERE total_amount IN (SELECT total_amount FROM orders WHERE customer_id = 5). (3) Add a LIMIT 1 with ORDER BY to force exactly one row: = (SELECT total_amount FROM orders WHERE customer_id = 5 ORDER BY order_date DESC LIMIT 1). Choose based on which value is semantically correct.
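The three fixes are not interchangeable, which a small sketch makes visible. The orders columns and rows below are assumptions invented for the illustration; run it in an empty scratch database.

```sql
-- Toy data (assumed for illustration only): customer 5 has two orders.
CREATE TABLE orders (
  order_id     INTEGER,
  customer_id  INTEGER,
  order_date   DATE,
  total_amount DECIMAL(10,2)
);

INSERT INTO orders (order_id, customer_id, order_date, total_amount) VALUES
  (1, 5, '2026-01-10', 300.00),
  (2, 5, '2026-02-15', 120.00),
  (3, 7, '2026-02-20', 300.00),
  (4, 8, '2026-03-01', 120.00),
  (5, 9, '2026-03-02',  75.00);

-- The failing form, left commented out so the script runs end to end.
-- PostgreSQL and MySQL reject it with a "more than one row" error;
-- SQLite does not raise an error and silently uses the first row.
-- SELECT order_id FROM orders
-- WHERE total_amount = (SELECT total_amount FROM orders WHERE customer_id = 5);

-- Fix 1: aggregate. Compares against customer 5's largest order (300.00).
SELECT order_id FROM orders
WHERE total_amount = (SELECT MAX(total_amount) FROM orders WHERE customer_id = 5)
ORDER BY order_id;
-- Returns orders 1 and 3.

-- Fix 2: IN. Matches any of customer 5's amounts (300.00 or 120.00).
SELECT order_id FROM orders
WHERE total_amount IN (SELECT total_amount FROM orders WHERE customer_id = 5)
ORDER BY order_id;
-- Returns orders 1, 2, 3, and 4.

-- Fix 3: ORDER BY + LIMIT 1. Compares against customer 5's latest order (120.00).
SELECT order_id FROM orders
WHERE total_amount = (
  SELECT total_amount FROM orders
  WHERE customer_id = 5
  ORDER BY order_date DESC
  LIMIT 1
)
ORDER BY order_id;
-- Returns orders 2 and 4.
```

Each fix silences the error but answers a different question, so pick the one whose meaning matches the report you are building.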

NOT IN subquery returns zero rows — even though unmatched rows clearly exist

Cause: The NOT IN subquery returns at least one NULL value. NOT IN (1, 2, NULL) is equivalent to col != 1 AND col != 2 AND col != NULL. Since col != NULL evaluates to NULL (not TRUE), the AND chain can never be TRUE: it is FALSE when col matches one of the listed values and NULL otherwise. WHERE keeps only rows where the condition is TRUE, so the query returns zero results. This is the most common and most dangerous subquery NULL trap.

Fix: Add WHERE subquery_column IS NOT NULL inside the subquery: WHERE p.product_id NOT IN (SELECT product_id FROM order_items WHERE product_id IS NOT NULL). Or switch to NOT EXISTS which is NULL-safe: WHERE NOT EXISTS (SELECT 1 FROM order_items oi WHERE oi.product_id = p.product_id). Or use LEFT JOIN + IS NULL. Before using NOT IN in production, always verify the subquery cannot return NULLs: SELECT COUNT(*) FROM subquery_table WHERE join_column IS NULL.
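The trap and its three fixes can be reproduced with a few rows. The products and order_items tables are named in the fix above; their columns and contents here are assumptions invented for the illustration. Run it in an empty scratch database.

```sql
-- Toy data (assumed for illustration only): one order item has a NULL product_id.
CREATE TABLE products (product_id INTEGER, product_name VARCHAR(50));
CREATE TABLE order_items (order_item_id INTEGER, product_id INTEGER);

INSERT INTO products (product_id, product_name) VALUES
  (1, 'Keyboard'), (2, 'Mouse'), (3, 'Monitor');
INSERT INTO order_items (order_item_id, product_id) VALUES
  (1001, 1), (1002, NULL);

-- The trap: returns zero rows, although products 2 and 3 were never sold.
SELECT p.product_id
FROM products AS p
WHERE p.product_id NOT IN (SELECT oi.product_id FROM order_items AS oi);

-- Fix A: filter NULLs out of the subquery.
SELECT p.product_id
FROM products AS p
WHERE p.product_id NOT IN (
  SELECT oi.product_id FROM order_items AS oi WHERE oi.product_id IS NOT NULL
)
ORDER BY p.product_id;

-- Fix B: NOT EXISTS, which is unaffected by NULLs in order_items.
SELECT p.product_id
FROM products AS p
WHERE NOT EXISTS (
  SELECT 1 FROM order_items AS oi WHERE oi.product_id = p.product_id
)
ORDER BY p.product_id;

-- Fix C: LEFT JOIN and keep only the rows with no match.
SELECT p.product_id
FROM products AS p
LEFT JOIN order_items AS oi ON oi.product_id = p.product_id
WHERE oi.product_id IS NULL
ORDER BY p.product_id;

-- Fixes A, B, and C each return products 2 and 3.
```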

Correlated subquery is extremely slow — query takes minutes instead of seconds

Cause: The correlated subquery is evaluated once per outer row (unless the optimizer manages to rewrite it as a join). With 10,000 outer rows, the subquery runs 10,000 times. If the subquery itself does a full table scan (no index on the correlated column), total work is 10,000 × full_scan_cost. On large tables, this produces O(n²) complexity: the work grows quadratically with table size, far worse than a single-pass JOIN.

Fix: Replace the correlated subquery with a JOIN to a pre-aggregated derived table or CTE. Instead of SELECT AVG(e2.salary) FROM employees e2 WHERE e2.department = e.department per row, compute: (SELECT department, AVG(salary) AS avg_sal FROM employees GROUP BY department) AS dept_avgs — once per department — and JOIN to it. Also ensure the correlated column (e2.department in the subquery) is indexed. Use EXPLAIN ANALYZE to confirm the rewrite reduces total scans.
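The rewrite described in this fix can be sketched as follows. The employees table, its department and salary columns, and the dept_avgs alias come from the text; the employee_id column, the index name, and all rows are assumptions invented for the illustration. Run it in an empty scratch database.

```sql
-- Toy data (assumed for illustration only)
CREATE TABLE employees (
  employee_id INTEGER,
  department  VARCHAR(30),
  salary      DECIMAL(10,2)
);

INSERT INTO employees (employee_id, department, salary) VALUES
  (1, 'Sales',        50000.00),
  (2, 'Sales',        70000.00),
  (3, 'Engineering',  90000.00),
  (4, 'Engineering', 110000.00),
  (5, 'Engineering', 100000.00);

-- Before: the department average is recomputed for each outer row.
SELECT e.employee_id, e.salary
FROM employees AS e
WHERE e.salary > (
  SELECT AVG(e2.salary)
  FROM employees AS e2
  WHERE e2.department = e.department
)
ORDER BY e.employee_id;

-- After: averages are computed once per department, then joined.
SELECT e.employee_id, e.salary
FROM employees AS e
JOIN (
  SELECT department, AVG(salary) AS avg_sal
  FROM employees
  GROUP BY department
) AS dept_avgs
  ON dept_avgs.department = e.department
WHERE e.salary > dept_avgs.avg_sal
ORDER BY e.employee_id;

-- Both queries return employees 2 and 4
-- (Sales average 60000, Engineering average 100000).

-- Index on the correlated column (hypothetical index name).
CREATE INDEX idx_employees_department ON employees (department);
```

On five rows the two forms are indistinguishable in speed; the point of the sketch is only that they return the same rows, so the rewrite is safe to benchmark on real data.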

Derived table alias missing — syntax error near the subquery closing parenthesis

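Cause: A subquery used as a derived table (in the FROM clause or as a JOIN target) has no alias. In MySQL, SQL Server, and PostgreSQL before version 16, FROM (SELECT ...) without an alias is a syntax error. A few engines (PostgreSQL 16 and later, Oracle, SQLite) accept the unnamed form, but the outer query still needs a name to reference the derived table's columns unambiguously, so treat the alias as required.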

Fix: Add an alias immediately after the closing parenthesis: FROM (SELECT ...) AS my_derived_table. The alias can then be used in JOIN conditions, WHERE clauses, and SELECT column prefixes. Choose a descriptive alias that explains what the derived table contains: AS store_revenues, AS customer_spend, AS top_products — not AS t1 or AS sub.

EXISTS subquery returns unexpected results — seems to match rows it should not

Cause: The correlated condition in the EXISTS subquery is missing or too broad. EXISTS (SELECT 1 FROM orders) without a WHERE condition that links to the outer query is TRUE for every outer row as long as the orders table contains at least one row. It only checks whether orders has any rows at all, not whether a specific customer has orders. This turns the WHERE EXISTS into a no-op filter.

Fix: Always include a correlated condition in EXISTS that links the subquery to the current outer row: WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id). The WHERE inside EXISTS must reference the outer alias (c.customer_id). Without this, EXISTS evaluates once globally, not per row. To test, run the subquery on its own as a SELECT * with a literal value in place of the outer reference (for example, WHERE o.customer_id = 42) and verify that it returns the rows you expect for that customer.
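A short sketch of the failure and the fix, using the customers and orders tables and the c and o aliases from the text. The columns and rows are assumptions invented for the illustration; run it in an empty scratch database.

```sql
-- Toy data (assumed for illustration only): customer 2 has no orders.
CREATE TABLE customers (customer_id INTEGER, first_name VARCHAR(30));
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);

INSERT INTO customers (customer_id, first_name) VALUES
  (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders (order_id, customer_id) VALUES
  (101, 1), (102, 1), (103, 3);

-- Bug 1: no link to the outer row. Returns customers 1, 2, and 3,
-- because the orders table is simply non-empty.
SELECT c.customer_id
FROM customers AS c
WHERE EXISTS (SELECT 1 FROM orders)
ORDER BY c.customer_id;

-- Bug 2: the unqualified customer_id resolves to the inner table first,
-- so the condition compares o.customer_id with itself.
-- Also returns customers 1, 2, and 3.
SELECT c.customer_id
FROM customers AS c
WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = customer_id)
ORDER BY c.customer_id;

-- Fixed: the subquery references the outer alias. Returns customers 1 and 3.
SELECT c.customer_id
FROM customers AS c
WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id)
ORDER BY c.customer_id;

-- Debugging check: run the inner query alone with a literal customer id.
-- Zero rows for customer 2 confirms that customer should be filtered out.
SELECT * FROM orders AS o WHERE o.customer_id = 2;
```

Qualifying every column inside the subquery with its table alias is the simplest guard against the second bug.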

Try It Yourself

Write a single query that produces a product performance report using subqueries. For each product show: product_id, product_name, category, unit_price, total_units_sold (0 if never sold), total_revenue (0 if never sold), a 'vs_category_avg_revenue' column showing how the product's revenue compares to the average revenue of products in the same category that were sold, and a performance_tier. The tiers are: 'Unsold' if the product was never sold; otherwise 'Top' if revenue is above the category average, 'Average' if revenue is at or below the category average but within 20% of it, and 'Below' if revenue is more than 20% below the category average. Use a derived table in FROM for sales aggregation and a correlated subquery for the category average.

🎯 Key Takeaways

✓ A subquery is a SELECT inside another SQL statement. For a non-correlated subquery, the database conceptually evaluates the inner query first, and the outer query uses the result as a value, list, or table.
✓ Four types: scalar subquery (one value — in SELECT, WHERE, HAVING), IN subquery (a list of values), derived table (virtual table in FROM), correlated subquery (references outer row — runs once per outer row).
✓ Scalar subquery in WHERE: compare each row against a computed aggregate like average or maximum. Must return at most one row (zero rows yields NULL). An aggregate without GROUP BY, such as MAX(), MIN(), or AVG(), always returns exactly one row.
✓ IN subquery: filter rows against a list returned by the subquery. NOT IN is dangerous when the subquery can return NULLs — use NOT EXISTS or LEFT JOIN IS NULL instead.
✓ Derived table in FROM: pre-aggregate or pre-filter data before the outer query uses it. Always give a derived table an alias; most databases require one. Two-level aggregation (average of averages) requires a derived table or a CTE.
✓ Correlated subquery: references the outer query's columns — executes once per outer row. Powerful but potentially O(n²) at scale. Replace with JOIN to pre-aggregated derived table for large tables.
✓ EXISTS: semantically clearest for existence checks. Short-circuits on first match. NULL-safe. NOT EXISTS is the correct alternative to NOT IN when NULLs are possible.
✓ Subquery vs JOIN: JOINs are usually more efficient and flexible. Use subqueries when the logic requires a computed value as the filter threshold or when an existence check is needed.
✓ Subquery vs CTE: derived tables and CTEs are computationally equivalent. CTEs are preferred when: the same subquery is referenced more than once, the logic has multiple steps, or naming the intermediate result improves readability.
✓ Nesting depth: one level is fine, two is acceptable, three or more should be refactored into CTEs. Deeply nested subqueries are hard to read and maintain.

What comes next

In Module 37, you learn correlated subqueries in depth — every pattern, performance implications, and when to rewrite them as window functions or JOINs for production-scale queries.

