Subqueries
Queries inside queries — scalar subqueries in SELECT, subqueries in WHERE and FROM, correlated subqueries, and when to use each type versus a JOIN or CTE
// Part 01
What a Subquery Is
A subquery is a SELECT statement nested inside another SQL statement. The outer query treats the subquery's result as if it were a table, a single value, or a list — depending on where the subquery appears and what it returns. The database executes the inner query first, then uses the result to evaluate the outer query.
Subqueries solve a specific class of problem: queries that need data computed from one query in order to filter, compute, or define the scope of another. They are the mechanism for composing queries — building complex analysis from simpler pieces. Understanding where subqueries can appear and what each placement means is the entire substance of this module.
// Part 02
The Four Subquery Types
// Part 03
Scalar Subquery in WHERE — Compare to a Computed Value
The most common subquery type: a subquery in WHERE that returns a single value. The outer query compares each row against that value. This is the pattern for "find rows where the value is above/below the overall average" — a question that cannot be answered with a simple WHERE condition because the average is computed from the data being filtered.
Orders above the average order value
Scalar subquery in SELECT — add a computed reference to every row
// Part 04
Subquery with IN — Filter Against a List
A subquery after IN returns a list of values. The outer query keeps rows where the column matches any value in that list. This is the multi-value version of the scalar subquery — instead of one comparison value, IN provides many.
NOT IN — exclude rows matching the list
// Part 05
Subquery in FROM — Derived Tables
A subquery in the FROM clause is called a derived table or inline view. The outer query treats it exactly like a regular table — it can be joined, filtered, grouped, and sorted. Derived tables are essential for multi-step analytical queries where an intermediate aggregation must be computed before the outer query can use it.
Pre-aggregate then join
Filter before joining — reduce rows early
Two-level aggregation — aggregate an aggregate
// Part 06
Subquery in HAVING — Filter Groups by Computed Values
// Part 07
Correlated Subquery — Runs Once Per Outer Row
A correlated subquery references a column from the outer query. This means it cannot be evaluated independently — it must be re-executed for each row of the outer query, with the outer row's values substituted into the subquery. The result is one value per outer row.
Correlated subqueries are powerful but can be slow on large tables — they execute N times for N outer rows. For performance-critical queries, a JOIN or window function often provides the same result more efficiently. Use correlated subqueries when the logic is clearest expressed as "for each row, compute X from related rows."
Each employee compared to their department average
Each product vs its category's max price
Correlated subquery in WHERE — the EXISTS pattern
// Part 08
EXISTS and NOT EXISTS — Existence Checks
EXISTS returns TRUE if the subquery returns at least one row, FALSE if it returns no rows. It is the cleanest way to check whether a related record exists, and it short-circuits — the database stops scanning as soon as the first matching row is found, making it very efficient for existence checks.
// Part 09
Subquery vs JOIN vs CTE — When to Use Each
Subqueries, JOINs, and CTEs often produce the same result. Choosing between them is about readability, performance, and reuse.
| Approach | Best for | Avoid when | Performance |
|---|---|---|---|
| Scalar subquery | Adding one computed value per row; simple threshold comparisons | Needs to reference multiple columns from the subquery; correlated over large tables | Fine for small tables; correlated versions can be O(n²) |
| IN subquery | Filtering against a list computed from another table; readable anti-join with NOT IN | Subquery can return NULLs (use EXISTS instead); very large result lists | Good; optimiser often converts to JOIN internally |
| Derived table (FROM subquery) | Pre-aggregation before joining; two-level aggregation; isolating complex logic | The same subquery is needed multiple times (use CTE); deep nesting degrades readability | Equivalent to CTE; may or may not be materialised by optimiser |
| Correlated subquery | Row-level comparison against a group aggregate; EXISTS/NOT EXISTS checks | Large outer tables — runs once per row; better replaced by JOIN + GROUP BY or window function | Can be slow — O(n) subquery executions; optimiser sometimes rewrites to JOIN |
| JOIN | Combining columns from multiple tables; most analytics queries; when the relationship is explicit | Single existence check (use EXISTS); when the relationship is aggregate-based | Usually fastest; uses indexes; optimiser has full plan flexibility |
| CTE (WITH clause) | Complex multi-step logic; when the same subquery is referenced more than once; self-documenting named steps | Very simple one-use subqueries where inline is clearer | Equivalent to derived table; PostgreSQL materialises by default (MATERIALIZED hint available) |
Rewriting a correlated subquery as a JOIN — performance improvement
Both produce identical results. The JOIN version computes the department average once per department (N_departments queries) rather than once per employee (N_employees queries). For a company with 10,000 employees across 20 departments, the JOIN version is 500x less work.
// Part 10
Nested Subqueries — Subqueries Inside Subqueries
Subqueries can be nested — a subquery can itself contain a subquery. SQL has no theoretical limit on nesting depth, but readability degrades rapidly beyond two levels. When you find yourself writing a three-level nested subquery, it is almost always clearer as a CTE.
🎯 Pro Tip
The rule for nesting depth: one level of inline subquery is fine. Two levels is acceptable for simple cases. Three or more levels should almost always be refactored into CTEs. Each CTE gets a descriptive name that documents what it computes — the multi-level CTE version reads like a step-by-step explanation of the logic. The nested subquery version requires the reader to work inside-out to understand what it does.
// Part 11
What This Looks Like at Work
You are a data analyst at Stripe. The product team needs a customer health report for a quarterly review. They want: every customer with their total spend, their spend relative to the average for their loyalty tier, whether they qualify as "high value" (top 25% spenders in their tier), and their most recent order date. This requires several computed values that depend on group-level aggregates — a classic multi-subquery problem.
🎯 Pro Tip
The CTE-first pattern for complex reports: identify every intermediate computation the report needs (customer total spend, tier averages, percentiles), give each its own named CTE, then assemble in the final SELECT. The final SELECT reads like plain English — join customer to their spend to their tier stats. This structure makes the query maintainable: adding a new metric means adding one CTE, not restructuring a nested subquery stack.
// Part 12
Interview Prep — 5 Questions With Complete Answers
A subquery is a SELECT statement nested inside another SQL statement. The outer query uses the subquery's result as if it were a value, a list, or a table. The database executes the subquery first, then uses the result to evaluate the outer query.
Four types based on where the subquery appears and what it returns. A scalar subquery appears in SELECT, WHERE, or HAVING and returns exactly one row and one column — a single value. It is used to compare a column against a computed aggregate or to add a computed reference value to each row. A table subquery (derived table) appears in the FROM clause and returns multiple rows and columns — a virtual table that the outer query can JOIN, filter, and aggregate. It is essential for pre-aggregation and multi-step logic. A correlated subquery references columns from the outer query, making it dependent on the outer row — it executes once per outer row. It is used for row-level comparisons against group aggregates and for EXISTS/NOT EXISTS existence checks. An IN/NOT IN subquery returns a list of values that the outer WHERE clause filters against.
The choice of subquery type follows from what the query needs: a single comparison value → scalar subquery in WHERE; a list of valid IDs → IN subquery; pre-aggregated data to join → derived table in FROM; per-row group aggregate → correlated subquery (or JOIN to a derived table for performance); existence check → EXISTS correlated subquery.
A non-correlated subquery executes independently of the outer query — it produces a single result (a value, a list, or a table) and the outer query uses that result. The subquery runs once, and the outer query uses the cached result for every row it evaluates. SELECT AVG(salary) FROM employees is non-correlated — it produces one number regardless of which employee row the outer query is examining.
A correlated subquery references a column from the outer query — it cannot be evaluated without knowing the current outer row's values. It executes once per outer row, substituting the current row's values each time. SELECT AVG(e2.salary) FROM employees AS e2 WHERE e2.department = e.department is correlated — it computes the average salary for the department of the current employee being examined by the outer query. For each employee row in the outer query, the subquery reruns with a different department value.
The performance implication is the key distinction: a non-correlated subquery runs once regardless of outer table size. A correlated subquery runs N times for N outer rows. On a table with 10,000 employees, a correlated subquery that computes the department average runs 10,000 times — even though there are only 20 departments. The JOIN equivalent computes the department average once per department (20 times) and joins — 500x less work. Always consider replacing correlated subqueries with JOIN to a pre-aggregated derived table or with window functions for large-table performance.
Use EXISTS when you are checking for the existence of a related record — you need to know whether at least one row satisfying a condition exists in another table, but you do not need the actual values from that table. EXISTS is semantically clearest for existence checks and is always NULL-safe — it never has the NULL-returns-nothing problem that NOT IN has.
EXISTS short-circuits: as soon as one matching row is found, the database stops scanning. For large tables where matches are common, this makes EXISTS significantly faster than IN — IN must evaluate all matching rows and build the complete list before filtering. EXISTS also handles NULLs correctly — NOT EXISTS (subquery with NULLs) still works correctly, whereas NOT IN (subquery with NULLs) silently returns zero rows.
Use IN when you have a small fixed list of values (IN (1, 2, 3) — no subquery), or when the subquery returns a manageable list of values and readability matters more than maximum performance. IN is slightly more readable for simple filtering: WHERE customer_id IN (SELECT customer_id FROM vip_customers) clearly states "get me rows where the customer is in this list." EXISTS requires understanding the correlated reference: WHERE EXISTS (SELECT 1 FROM vip_customers WHERE vip_customers.customer_id = orders.customer_id) is more verbose but semantically equivalent. The practical rule: use EXISTS for large subquery results, nullable columns, and all NOT IN use cases. Use IN for small lists and when you are certain the subquery returns no NULLs.
A derived table is a subquery in the FROM clause — a temporary result set that the outer query treats as a regular table. It is defined inline within the query: FROM (SELECT store_id, SUM(total_amount) AS revenue FROM orders GROUP BY store_id) AS store_revenues. The alias (AS store_revenues) is mandatory — the outer query references it by that name. Derived tables are not stored or cached — they are computed each time the query runs.
A CTE (Common Table Expression) defined with WITH serves the same purpose — it creates a named intermediate result — but it appears before the main query rather than inline. Both are equivalent in terms of what they compute. The difference is primarily readability and reuse. A CTE can be referenced multiple times in the same query; a derived table defined inline can only be used once at the location it is defined.
When to use derived table vs CTE: use an inline derived table for simple, single-use pre-aggregation where the subquery is short and adding a CTE name would be more overhead than clarity. Use a CTE when the same intermediate result is needed more than once (CTE is defined once, referenced by name wherever needed), when the logic has multiple sequential steps that benefit from named intermediate results, or when the derived table subquery is complex enough that naming it adds meaningful documentation. As a practical guide: if a subquery in FROM is more than 5-6 lines, extract it to a named CTE for readability. If it is 2-3 lines, keeping it inline is often cleaner.
This requires comparing each customer's total spend against the average total spend for customers in the same loyalty tier. The tier average is an aggregate that varies by tier — it cannot be computed with a single scalar subquery (which would give the overall average, not the per-tier average).
Approach 1 — correlated subquery: for each customer, run a subquery that computes the average total spend for customers in the same tier. SELECT c.customer_id, c.loyalty_tier, SUM(o.total_amount) AS total_spend FROM customers AS c JOIN orders AS o ON c.customer_id = o.customer_id WHERE o.order_status = 'Delivered' GROUP BY c.customer_id, c.loyalty_tier HAVING SUM(o.total_amount) > (SELECT AVG(tier_spend) FROM (SELECT c2.customer_id, SUM(o2.total_amount) AS tier_spend FROM customers AS c2 JOIN orders AS o2 ON c2.customer_id = o2.customer_id WHERE o2.order_status = 'Delivered' AND c2.loyalty_tier = c.loyalty_tier GROUP BY c2.customer_id) AS tier_totals). This is correct but the correlated subquery in HAVING makes it verbose and potentially slow.
Approach 2 — CTE with JOIN (preferred): WITH customer_totals AS (SELECT c.customer_id, c.loyalty_tier, SUM(o.total_amount) AS total_spend FROM customers AS c JOIN orders AS o ON c.customer_id = o.customer_id WHERE o.order_status = 'Delivered' GROUP BY c.customer_id, c.loyalty_tier), tier_averages AS (SELECT loyalty_tier, AVG(total_spend) AS avg_spend FROM customer_totals GROUP BY loyalty_tier) SELECT ct.customer_id, ct.loyalty_tier, ct.total_spend, ta.avg_spend FROM customer_totals AS ct JOIN tier_averages AS ta ON ct.loyalty_tier = ta.loyalty_tier WHERE ct.total_spend > ta.avg_spend. The CTE approach computes tier averages once (not per customer), is readable, and scales efficiently to large datasets.
// Part 13
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓A subquery is a SELECT inside another SQL statement. The database executes the inner query first and the outer query uses the result as a value, list, or table.
- ✓Four types: scalar subquery (one value — in SELECT, WHERE, HAVING), IN subquery (a list of values), derived table (virtual table in FROM), correlated subquery (references outer row — runs once per outer row).
- ✓Scalar subquery in WHERE: compare each row against a computed aggregate like average or maximum. Must return exactly one row — use MAX(), MIN(), or AVG() to guarantee this.
- ✓IN subquery: filter rows against a list returned by the subquery. NOT IN is dangerous when the subquery can return NULLs — use NOT EXISTS or LEFT JOIN IS NULL instead.
- ✓Derived table in FROM: pre-aggregate or pre-filter data before the outer query uses it. Every derived table must have an alias. Two-level aggregation (average of averages) requires a derived table.
- ✓Correlated subquery: references the outer query's columns — executes once per outer row. Powerful but potentially O(n²) at scale. Replace with JOIN to pre-aggregated derived table for large tables.
- ✓EXISTS: semantically clearest for existence checks. Short-circuits on first match. NULL-safe. NOT EXISTS is the correct alternative to NOT IN when NULLs are possible.
- ✓Subquery vs JOIN: JOINs are usually more efficient and flexible. Use subqueries when the logic requires a computed value as the filter threshold or when an existence check is needed.
- ✓Subquery vs CTE: derived tables and CTEs are computationally equivalent. CTEs are preferred when: the same subquery is referenced more than once, the logic has multiple steps, or naming the intermediate result improves readability.
- ✓Nesting depth: one level is fine, two is acceptable, three or more should be refactored into CTEs. Deeply nested subqueries are hard to read and maintain.
What comes next
In Module 37, you learn correlated subqueries in depth — every pattern, performance implications, and when to rewrite them as window functions or JOINs for production-scale queries.
Module 37 → Correlated SubqueriesDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.