Query Processing & Optimization
What actually happens between typing a SQL query and seeing results — parsing, algebra transformation, cost estimation, join algorithms, and how the database chooses the fastest execution plan.
What Happens When You Run a SQL Query — The Complete Pipeline
You type a SQL query and press Enter. Milliseconds later, the database returns thousands of rows sorted, filtered, joined, and aggregated exactly as you asked. What happened in those milliseconds is a remarkably sophisticated process involving parsing, semantic analysis, algebraic transformation, cost estimation, physical plan selection, and execution. Understanding this pipeline is what separates engineers who write SQL from engineers who understand why some SQL queries are fast and others are catastrophically slow.
The most important stage by far is the query optimiser. The difference between the fastest and slowest plan for a complex query can be five or six orders of magnitude: 2 milliseconds vs 30 minutes. The optimiser's job is to find the fast plan. Understanding how it works lets you write queries that are easy for the optimiser to handle — and diagnose when it makes a suboptimal choice.
Relational Algebra — The Internal Language of Query Processing
SQL is a declarative language — you specify what you want, not how to get it. Internally, the database translates SQL into relational algebra — a procedural language of set operations where each operator takes one or two relations as input and produces a relation as output. Unlike SQL, relational algebra expressions specify a precise order of operations. The query optimiser manipulates relational algebra expressions to find the most efficient ordering.
The Core Relational Algebra Operators
Filters rows. σ_condition(R) returns all rows from relation R where the condition is true.
σ_{city='Bengaluru'}(customers) — all customers in Bengaluru
O(n) — must examine every row unless an index exists. With index on the selection attribute: O(log n + k) where k = matching rows.
Selects specific columns. π_{col1,col2,...}(R) returns only the specified columns from all rows.
π_{name, city}(customers) — only name and city columns
O(n) — must process every row. If DISTINCT is required, additional sort or hash step: O(n log n).
Combines rows from two relations where values of common attributes match. Automatically joins on all attributes with the same name.
orders ⋈ customers — join on customer_id (common attribute)
Depends on algorithm: O(n²) naive, O(n+m) hash join, O(n log n + m log m) sort-merge join.
Returns every combination of rows from two relations. If R has m rows and S has n rows, R × S has m×n rows.
customers × restaurants — every customer-restaurant pair
O(m×n) — the output size is the product of the input sizes, so cost explodes multiplicatively. A join with a predicate is always preferred.
Returns all rows from either relation, eliminating duplicates. Both relations must have the same schema.
active_customers ∪ premium_customers
O(m+n) for union all; O((m+n) log(m+n)) for union with deduplication.
Returns rows in the first relation that are not in the second.
all_customers − customers_with_orders
O((m+n) log(m+n)) with sort; O(m+n) with hash.
Renames a relation or its attributes. Necessary when self-joining a table to give each copy a distinct name.
ρ_{mgr}(employees) — rename employees to mgr for the manager copy
O(1) — purely a metadata operation, no data movement.
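Three of the operators above (σ, π, ⋈) can be sketched as ordinary functions over lists of dicts — a toy model with invented sample rows, not how a real engine represents data:

```python
# Toy relational algebra over lists of dicts. Illustrative only:
# real engines operate on pages of tuples, not Python lists.

def select(rows, pred):          # sigma: filter rows, O(n) without an index
    return [r for r in rows if pred(r)]

def project(rows, cols):         # pi: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def natural_join(left, right, key):  # join on one shared attribute
    index = {}
    for r in right:              # build a hash index on the join key: O(m)
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:               # probe the index: O(n) for equality joins
        for r in index.get(l[key], []):
            out.append({**r, **l})   # columns from both sides
    return out

customers = [
    {"customer_id": 1, "name": "Asha", "city": "Bengaluru"},
    {"customer_id": 2, "name": "Ravi", "city": "Mumbai"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total_amount": 450},
    {"order_id": 11, "customer_id": 2, "total_amount": 900},
]

blr = select(customers, lambda r: r["city"] == "Bengaluru")
joined = natural_join(blr, orders, "customer_id")
print(project(joined, ["name", "total_amount"]))
# [{'name': 'Asha', 'total_amount': 450}]
```

The hash-index trick inside `natural_join` is exactly the hash join discussed later; the naive alternative would be a doubly nested loop.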
SQL to Relational Algebra — The Translation
-- SQL QUERY:
SELECT c.name, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'Bengaluru'
AND o.status = 'delivered'
ORDER BY o.total_amount DESC;
-- NAIVE RELATIONAL ALGEBRA TRANSLATION (what a simple parser produces):
π_{c.name, o.total_amount} (
τ_{total_amount DESC} (
σ_{c.city='Bengaluru' AND o.status='delivered'} (
customers ⋈_{c.customer_id=o.customer_id} orders
)
)
)
-- This reads:
-- 1. Form the join of customers and orders (EXPENSIVE — all rows combined first)
-- 2. Filter by city AND status (filter happens AFTER the expensive join)
-- 3. Project to name and total_amount
-- 4. Sort by total_amount DESC
-- OPTIMISED RELATIONAL ALGEBRA (what the optimiser produces):
π_{c.name, o.total_amount} (
τ_{total_amount DESC} (
σ_{city='Bengaluru'}(customers)
⋈_{customer_id=customer_id}
σ_{status='delivered'}(orders)
)
)
-- This reads:
-- 1. Filter customers to only Bengaluru customers FIRST (reduces rows from 10M to 50K)
-- 2. Filter orders to only delivered FIRST (reduces rows from 50M to 20M)
-- 3. Join the two reduced relations (much smaller input to the join)
-- 4. Project to name and total_amount
-- 5. Sort the result
-- The key optimisation: PREDICATE PUSHDOWN
-- Push filters (σ) as close to the data source as possible.
-- Reduces the size of intermediate results at every step.

Algebraic Equivalences — The Rules the Optimiser Uses to Transform Queries
The query optimiser applies algebraic equivalences to transform a query tree into an equivalent but cheaper form. Two relational algebra expressions are equivalent if they produce exactly the same result for every valid database instance. These equivalences are the mathematical foundation of query optimisation — the optimiser applies them systematically to search the space of equivalent plans.
Selections can be pushed through joins. Apply filters as early as possible — before the join — to reduce the number of rows that the join must process.
Impact: Reducing customers from 10M rows to 50K before joining with orders reduces the join input by 200x.
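The impact is easy to quantify. A minimal sketch in Python, assuming the section's example cardinalities and a simplified linear cost model for hash joins:

```python
# Back-of-envelope arithmetic for predicate pushdown, using this section's
# example figures (10M customers → 50K in Bengaluru; 50M orders → 20M delivered).
def hash_join_work(build_rows, probe_rows):
    # rough model: a hash join touches each input row about once
    return build_rows + probe_rows

customers, orders = 10_000_000, 50_000_000
blr_customers, delivered = 50_000, 20_000_000    # after pushing the filters down

late = hash_join_work(customers, orders)         # filter applied after the join
early = hash_join_work(blr_customers, delivered) # filters pushed below the join
print(round(late / early))                       # ~3x fewer rows through the join
print(customers // blr_customers)                # 200x smaller hash build side
```

The second number is the one that matters for memory: the hash table is built from the filtered customers side, so pushdown shrinks it 200-fold.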
-- Before pushdown: join all rows, then filter
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE c.city = 'Bengaluru';
-- After pushdown (done automatically by optimiser):
-- Filter customers first → join smaller set

Projections can be pushed through joins as long as the columns required by the join condition are retained. Carrying fewer columns through intermediate results reduces memory and I/O.
Impact: If a row is 500 bytes and you only need 50 bytes of it, pushing projection reduces intermediate result size by 10x.
-- Only need name and total_amount, but join requires customer_id
-- Optimiser projects each table to only needed columns before joining

Joins are commutative — R ⋈ S = S ⋈ R. The optimiser uses this to try the smaller relation first, placing it in the "build" phase of a hash join for better memory usage.
Impact: Always build the hash table from the smaller relation. For a 100-row and 10M-row join, always hash the 100-row table.
-- The order of tables in FROM/JOIN does not determine execution order
-- The optimiser chooses the better order based on statistics

Joins are associative — (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T). Combined with commutativity, this means n tables admit (2n-2)!/(n-1)! possible join orderings. The optimiser searches this space for the cheapest.
Impact: counting left-deep trees alone there are n! join orders (120 for 5 tables, about 3.6 million for 10); counting all bushy trees, (2n-2)!/(n-1)! orderings (1,680 for 5 tables, roughly 17.6 billion for 10). The optimiser uses dynamic programming to find the best without exhaustive search.
-- 4-table join: 24 left-deep orderings (4!)
-- Optimiser picks cheapest based on cardinality estimates

A selection with a conjunction (AND) can be split into a cascade of individual selections. This allows each part of the condition to be pushed to where it is most effective.
Impact: WHERE city='Bengaluru' AND age>30 AND is_active=true can use three separate indexes if available, combining their results.
-- WHERE city='Bengaluru' AND status='delivered' AND amount>500
-- Each condition can be pushed to its respective table independently

A selection on a Cartesian product can be converted to a join with that condition as the join predicate. The join uses the condition to filter during combination rather than after.
Impact: Avoids materialising the full Cartesian product (m×n rows) before filtering. The join directly combines matching rows.
-- A SELECT with WHERE that combines two tables is a join, not a cross product
-- The optimiser always converts σ(R×S) to R⋈S form

Cost Estimation — How the Optimiser Predicts Plan Cost Without Running It
After generating candidate execution plans using algebraic transformations, the optimiser must choose the cheapest one. It cannot run all plans and pick the fastest — that would defeat the purpose. Instead it uses a cost model to estimate the cost of each plan without executing it. The accuracy of cost estimation is the single most important factor in optimiser quality.
What "Cost" Means — The Cost Model Components
The cost model estimates the total resources consumed by a plan. In disk-based systems, I/O cost (number of page reads from disk) dominates and is the primary metric. CPU cost (number of operations like comparisons and hash computations) is secondary. Memory cost (how much buffer pool space the plan requires) is also tracked because running out of memory forces expensive disk-spill operations.
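Assuming PostgreSQL's default parameter values (quoted in the block that follows), the seq-scan vs index-scan trade-off reduces to simple arithmetic. A sketch — the 30-page index size is an invented figure:

```python
# PostgreSQL-style cost arithmetic with the default cost parameters.
# Simplified: ignores correlation, bitmap scans, and caching; the
# 30-page index size below is an assumption for illustration.
SEQ_PAGE, RANDOM_PAGE = 1.0, 4.0
CPU_TUPLE, CPU_INDEX_TUPLE = 0.01, 0.005

def seq_scan_cost(pages, rows):
    return pages * SEQ_PAGE + rows * CPU_TUPLE

def index_scan_cost(index_pages, matching_rows):
    # index traversal + one random heap fetch per matching row (worst case)
    return (index_pages * RANDOM_PAGE
            + matching_rows * CPU_INDEX_TUPLE
            + matching_rows * RANDOM_PAGE)

pages, rows = 6250, 50_000            # the orders table used in this section
print(seq_scan_cost(pages, rows))     # 6750.0, regardless of selectivity
print(index_scan_cost(30, 500))       # 1% selectivity: far cheaper than the scan
print(index_scan_cost(30, 25_000))    # 50% selectivity: far costlier than the scan
```

This reproduces the crossover noted below: a highly selective index wins easily, but at 50% selectivity the random heap fetches make the index scan lose to a plain sequential scan.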
-- COST MODEL INPUTS (what PostgreSQL tracks in pg_class and pg_statistic):
-- 1. TABLE STATISTICS (updated by ANALYZE):
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('customers', 'orders');
-- reltuples: estimated row count (cardinality)
-- relpages: number of 8KB pages the table occupies
-- 2. COLUMN STATISTICS (updated by ANALYZE):
SELECT attname, n_distinct, correlation, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders';
-- n_distinct: number of distinct values (positive = exact, negative = fraction of rows)
-- correlation: how well column order matches physical storage order (1.0 = perfectly correlated)
-- most_common_vals: array of most frequent values and their frequencies
-- histogram_bounds: bucket boundaries for estimating range predicates
-- 3. COST PARAMETERS (in postgresql.conf):
SHOW seq_page_cost; -- cost of reading a page in a sequential scan (default: 1.0)
SHOW random_page_cost; -- cost of reading a page via random access (default: 4.0)
-- random_page_cost > seq_page_cost because sequential disk reads are faster
-- For SSDs: set random_page_cost = 1.1 (almost same as sequential)
SHOW cpu_tuple_cost; -- cost of processing one row (default: 0.01)
SHOW cpu_index_tuple_cost; -- cost of processing one index entry (default: 0.005)
SHOW cpu_operator_cost; -- cost of a comparison or function call (default: 0.0025)
-- COST FORMULA for sequential scan:
-- cost = relpages * seq_page_cost + reltuples * cpu_tuple_cost
-- For orders table: 6250 pages * 1.0 + 50000 rows * 0.01 = 6250 + 500 = 6750
-- COST FORMULA for index scan:
-- cost = (index pages read) * random_page_cost + (matching rows) * cpu_index_tuple_cost
-- + (matching rows) * random_page_cost (heap fetches)
-- For highly selective index (1% selectivity): far cheaper than full scan
-- For low-selectivity index (50% of table): more expensive than full scan!

Cardinality Estimation — Predicting Row Counts Through Operations
The most critical and most error-prone part of cost estimation is predicting how many rows each operation will produce. These row count estimates (called cardinality estimates) propagate through the query plan — an error early in the plan compounds into large errors in later stages.
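The textbook estimation rules described below can be written as a few small functions — a simplification, since real planners layer most-common-value lists and histograms on top of these formulas:

```python
# The standard cardinality-estimation formulas, as plain functions.
def eq_selectivity(n_distinct):              # col = value
    return 1.0 / n_distinct

def and_selectivity(a, b):                   # assumes independence!
    return a * b

def or_selectivity(a, b):                    # inclusion-exclusion
    return 1 - (1 - a) * (1 - b)

def join_cardinality(n_r, n_s, nd_a, nd_b):  # |R ⋈ S| on R.A = S.B
    return n_r * n_s / max(nd_a, nd_b)

city = eq_selectivity(10)      # 10 distinct cities → selectivity 0.1
status = 0.2                   # assumed selectivity of status='delivered'
print(round(and_selectivity(city, status), 4))   # 0.02 → 2% of rows
print(round(or_selectivity(city, status), 4))    # 0.28
print(join_cardinality(100_000, 5_000_000, 100_000, 100_000))  # 5000000.0
```

The `and_selectivity` line is where plans go wrong: multiplying assumes the two conditions are independent, which the correlated-columns discussion later in this section revisits.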
-- SELECTIVITY: fraction of rows that satisfy a condition
-- = (estimated output rows) / (total input rows)
-- EQUALITY CONDITION: col = value
-- selectivity = 1 / n_distinct(col)
-- Example: city = 'Bengaluru', n_distinct(city) = 10 cities
-- selectivity = 1/10 = 0.1 = 10% of rows
-- If customers has 1M rows: estimated output = 1M * 0.1 = 100,000 rows
-- RANGE CONDITION: col BETWEEN a AND b
-- Uses histogram: count buckets that fall within [a,b] / total buckets
-- Example: salary BETWEEN 50000 AND 80000
-- Histogram bounds: [20K, 40K, 60K, 80K, 100K, 150K] → 5 buckets; [50K, 80K]
-- covers half of bucket [40K,60K] plus all of [60K,80K] ≈ 1.5 of 5 buckets → selectivity ≈ 0.3
-- CONJUNCTION (AND): multiply selectivities (assumes independence)
-- P(A AND B) ≈ P(A) * P(B)
-- WHERE city='Bengaluru' AND status='delivered'
-- selectivity = 0.1 * 0.2 = 0.02 = 2% of rows
-- PROBLEM: if city and status are correlated (Bengaluru users place more delivered orders),
-- the independence assumption underestimates the true selectivity → wrong plan
-- DISJUNCTION (OR): 1 - (1-P(A)) * (1-P(B))
-- WHERE city='Bengaluru' OR status='delivered'
-- selectivity = 1 - (1-0.1) * (1-0.2) = 1 - 0.9*0.8 = 1 - 0.72 = 0.28
-- JOIN CARDINALITY ESTIMATE:
-- |R ⋈_{R.A=S.B} S| ≈ |R| * |S| / max(n_distinct(R.A), n_distinct(S.B))
-- Assumes uniform distribution of join attribute values
-- customers ⋈ orders on customer_id:
-- |customers| = 100K, |orders| = 5M, n_distinct(customer_id) = 100K (same)
-- Estimated join output: 100K * 5M / 100K = 5M rows
-- (Each customer has ~50 orders on average → total 5M order rows, correct)
-- STALE STATISTICS = BAD PLANS:
-- If ANALYZE hasn't been run recently:
-- The optimiser thinks the table has the OLD row count
-- Might choose a full scan instead of an index scan (or vice versa)
-- Always run: VACUUM ANALYZE table_name; after large data loads
-- CHECK if statistics are stale:
SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_analyze NULLS FIRST;

When Statistics Fail — Multi-Column Correlation
The independence assumption in cardinality estimation (multiply selectivities for AND conditions) fails when columns are correlated. PostgreSQL 10+ introduced extended statistics to address this.
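The failure mode is easy to reproduce on synthetic data. A sketch where city functionally determines state — the cities and proportions are invented:

```python
import random

# Synthetic rows where city functionally determines state: the exact
# situation extended statistics exist for. Data is invented.
random.seed(0)
pairs = [("Mumbai", "Maharashtra"), ("Pune", "Maharashtra"),
         ("Bengaluru", "Karnataka"), ("Chennai", "Tamil Nadu")]
rows = [random.choice(pairs) for _ in range(10_000)]

p_city = sum(c == "Mumbai" for c, _ in rows) / len(rows)
p_state = sum(s == "Maharashtra" for _, s in rows) / len(rows)
actual = sum(c == "Mumbai" and s == "Maharashtra" for c, s in rows) / len(rows)

estimate = p_city * p_state      # the planner's independence assumption
print(round(estimate, 3), round(actual, 3))
# The estimate is ~2x too low here: every Mumbai row is a Maharashtra row,
# so P(Mumbai AND Maharashtra) = P(Mumbai), not P(Mumbai) * P(Maharashtra).
```

A planner using the multiplied estimate would budget for half the rows that actually arrive, which is exactly the situation `CREATE STATISTICS` below corrects.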
-- PROBLEM: city and state are correlated (city='Mumbai' always means state='Maharashtra')
-- Bad estimate: P(city='Mumbai' AND state='Maharashtra') = P(city) * P(state)
-- = 0.05 * 0.1 = 0.005 (only 0.5% of rows)
-- Actual: 30% of all rows (Mumbai is a major city in Maharashtra — high correlation)
-- Optimiser makes wrong plan based on the bad estimate
-- SOLUTION: Create extended statistics for correlated columns
CREATE STATISTICS stat_city_state ON city, state FROM customers;
ANALYZE customers;
-- PostgreSQL now tracks the joint distribution of city and state together
-- Selectivity estimates for conditions on both columns are far more accurate
-- VIEW extended statistics:
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'customers'::regclass;
-- stxkind: d = n_distinct, f = functional dependencies, m = MCV (most common values)
-- Extended statistics types:
-- DEPENDENCIES: one column functionally determines another (zip → city)
-- MCV: track most common combinations of values
-- NDISTINCT: track distinct combinations
CREATE STATISTICS stat_zip_city (dependencies)
ON zip_code, city FROM addresses;
-- Now optimiser knows: given zip_code='400001', city is always 'Mumbai'

Join Algorithms — The Three Ways Databases Execute Joins
The join operation is the most computationally expensive operation in most SQL queries. The algorithm chosen for each join in a query plan significantly impacts performance. There are three main join algorithms. The optimiser chooses among them based on the size of the inputs, available memory, and available indexes.
Join Algorithm Selection Summary
| Algorithm | Best When | Cost | Memory | Supports Non-Equi |
|---|---|---|---|---|
| Nested Loop | Outer is tiny OR index on inner | O(n×m) → O(n log m) with index | Very low | ✓ Yes |
| Sort-Merge | Both large + already sorted + ORDER BY needed | O(n log n + m log m) | Medium (sort buffers) | ✗ Equality only |
| Hash | Both large, equality join, fits in work_mem | O(n+m) | High (hash table) | ✗ Equality only |
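The three strategies in the table can be sketched as toy functions over lists of (key, value) pairs — real engines add batching, memory management, and disk spilling:

```python
# The three join strategies, minimally. All three produce the same rows.

def nested_loop_join(outer, inner):        # O(n*m); works for any predicate
    return [(k1, v1, v2) for k1, v1 in outer
                         for k2, v2 in inner if k1 == k2]

def hash_join(build, probe):               # O(n+m); equality joins only
    table = {}
    for k, v in build:                     # hash the (ideally smaller) side
        table.setdefault(k, []).append(v)
    return [(k, bv, pv) for k, pv in probe for bv in table.get(k, [])]

def merge_join(left, right):               # O(n log n + m log m) incl. sorts
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            k = left[i][0]                 # emit the cross product of the
            li = [v for kk, v in left if kk == k]    # matching key groups
            rj = [v for kk, v in right if kk == k]
            out += [(k, a, b) for a in li for b in rj]
            while i < len(left) and left[i][0] == k:
                i += 1
            while j < len(right) and right[j][0] == k:
                j += 1
    return out

L = [(1, "a"), (2, "b"), (2, "c")]
R = [(2, "x"), (3, "y")]
assert nested_loop_join(L, R) == hash_join(L, R) == merge_join(L, R)
```

Same output, very different cost profiles: the nested loop re-scans `inner` once per outer row, the hash join touches each side once, and the merge join pays the sorts up front but then streams.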
Reading Execution Plans — Understanding Every Node in EXPLAIN Output
The execution plan is the optimiser's output — a tree of physical operators that the execution engine will run. Reading plans fluently is a production skill. Every time a query is slow, the first step is reading its plan.
Plan Structure — Trees and Data Flow
A query plan is a tree. Data flows from the leaf nodes (table scans) upward through intermediate nodes (joins, sorts, aggregations) to the root (which produces the final result). At each node, the optimiser estimates the startup cost (time to produce the first row) and total cost (time to produce all rows).
-- QUERY:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.name, r.name AS restaurant, COUNT(o.order_id) AS order_count,
SUM(o.total_amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN restaurants r ON o.restaurant_id = r.restaurant_id
WHERE c.city = 'Bengaluru'
AND o.status = 'delivered'
GROUP BY c.customer_id, c.name, r.restaurant_id, r.name
HAVING COUNT(o.order_id) >= 3
ORDER BY total_spent DESC
LIMIT 10;
-- EXAMPLE OUTPUT (annotated):
Sort (cost=8234.56..8234.59 rows=10 width=68)
(actual time=45.123..45.127 rows=10 loops=1)
Sort Key: (sum(o.total_amount)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=3240, read=127
-> HashAggregate (cost=8100.00..8200.00 rows=847 width=68)
(actual time=44.001..44.890 rows=847 loops=1)
Group Key: c.customer_id, c.name, r.restaurant_id, r.name
Filter: (count(o.order_id) >= 3)
Rows Removed by Filter: 12341
Buffers: shared hit=3240, read=127
-> Hash Join (cost=312.00..7500.00 rows=24000 width=52)
(actual time=5.456..38.234 rows=23188 loops=1)
Hash Cond: (o.restaurant_id = r.restaurant_id)
Buffers: shared hit=3240, read=127
-> Hash Join (cost=145.00..6800.00 rows=24000 width=32)
(actual time=2.123..31.456 rows=23188 loops=1)
Hash Cond: (o.customer_id = c.customer_id)
-> Seq Scan on orders o
(cost=0.00..6200.00 rows=48000 width=20)
(actual time=0.025..24.678 rows=47921 loops=1)
Filter: ((status)::text = 'delivered')
Rows Removed by Filter: 2079
-> Hash (cost=120.00..120.00 rows=2000 width=16)
(actual time=1.987..1.987 rows=1952 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 112kB
-> Seq Scan on customers c
(cost=0.00..120.00 rows=2000 width=16)
(actual time=0.013..1.456 rows=1952 loops=1)
Filter: ((city)::text = 'Bengaluru')
Rows Removed by Filter: 8048
-> Hash (cost=80.00..80.00 rows=500 width=24)
(actual time=1.234..1.234 rows=500 loops=1)
Buffers: shared hit=5
-> Seq Scan on restaurants r
(cost=0.00..80.00 rows=500 width=24)
(actual time=0.009..0.789 rows=500 loops=1)
Planning Time: 2.456 ms
Execution Time: 45.234 ms
-- ─────────────────────────────────────────────────────────────────
-- DECODING EVERY ELEMENT:
-- ─────────────────────────────────────────────────────────────────
-- cost=X..Y: X = startup cost (first row), Y = total cost (all rows)
-- cost units are arbitrary but relative: higher = more expensive
-- estimated vs actual: if they differ greatly → stale statistics
-- actual time=X..Y rows=Z loops=N:
-- X = milliseconds to first row, Y = milliseconds total
-- Z = actual rows output by this node
-- N = how many times this node was executed (inner side of NLJ runs many times)
-- Buffers: shared hit=X read=Y:
-- hit = pages found in buffer pool (fast, RAM)
-- read = pages read from disk (slow)
-- A high read count on a leaf node = candidate for an index
-- "Rows Removed by Filter: 12341" on HashAggregate → HAVING filtered 12K groups
-- "Rows Removed by Filter: 2079" on Seq Scan → WHERE filtered some rows
-- Large numbers here = predicate pushdown didn't fire or no index available
-- The PLAN TREE reads bottom-up:
-- Deepest indented nodes run FIRST (leaf nodes)
-- Root node (Sort) runs LAST and produces the final output

Diagnosing Bad Plans — The Five Most Common Problems
Problem 1 — Seq Scan on a large table with a selective filter
Diagnosis: A sequential scan is reading 5M rows to find 12K matches. This is 99.75% waste.
Fix: Add an index on the filtered column — a composite or partial index matching the exact query predicate, e.g. CREATE INDEX idx_orders_status ON orders(status) WHERE status != 'delivered'.
Problem 2 — Estimated rows wildly different from actual rows
Diagnosis: Estimated 15 rows, got 48K. The optimiser made catastrophically wrong plan choices based on these bad estimates — it probably chose a nested loop when it should have chosen a hash join.
Fix: Run VACUUM ANALYZE on the relevant tables. Check for column correlation (use extended statistics). Check for non-uniform distributions that histograms miss.
Problem 3 — Hash join spilling to disk (Batches > 1)
Diagnosis: Batches: 8 means the hash table didn't fit in work_mem and was split into 8 batches spilled to disk. This is dramatically slower than an in-memory hash join.
Fix: Increase work_mem for this session: SET work_mem = '256MB'. Or reduce the size of the build side (push more filters earlier to reduce input rows).
Problem 4 — External merge sort on disk
Diagnosis: The sort couldn't fit in work_mem and had to merge-sort from disk. Dramatically slower.
Fix: Increase work_mem. Or add an index on the sort column — if the data is already in index order, no separate sort is needed (an index scan returns pre-sorted data).
Problem 5 — Nested loop with a huge loop count
Diagnosis: The inner side of a nested loop ran 84,523 times — once per outer row. Even though each lookup is fast (0.05 ms), 84K × 0.05 ms ≈ 4.2 seconds total.
Fix: If both sides are large, a hash join or sort-merge join would process each side once. This nested loop choice may indicate stale statistics (outer side estimated too small). Run ANALYZE.
Inside the Query Optimiser — How PostgreSQL Chooses a Plan
The query optimiser is one of the most sophisticated components in any software system. For a query joining 5 tables, there are over 100 possible join orderings alone — and for each ordering, multiple algorithm choices at every node. The search space is enormous. Understanding the optimiser's strategy explains both when it works brilliantly and when it fails.
Dynamic Programming — The Standard Join Ordering Algorithm
PostgreSQL uses dynamic programming (the Selinger algorithm) to find the optimal join order when the number of tables is below geqo_threshold (default 12); join_collapse_limit (default 8) caps how many explicitly JOINed tables are collapsed into a single reordering problem. The algorithm builds up optimal solutions for subsets of tables incrementally — the optimal plan for joining {A,B,C} is built from the optimal plan for {A,B} plus C, or {A,C} plus B, or {B,C} plus A. Taking the minimum cost across all these options gives the globally optimal plan.
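The recurrence can be sketched in a few lines of Python. The cardinalities, scan costs, and the tiny hash-build term in the cost formula are toy stand-ins, roughly matching the walkthrough that follows:

```python
from itertools import combinations

# Selinger-style dynamic programming over table subsets — a sketch with
# invented per-table costs and a deliberately crude cost/cardinality model.
scan_cost = {"customers": 120, "orders": 6250, "restaurants": 5}
card = {"customers": 100_000, "orders": 5_000_000, "restaurants": 500}
edges = {frozenset(["customers", "orders"]),      # where join predicates exist
         frozenset(["orders", "restaurants"])}    # (no customers-restaurants edge)

def joinable(a, b):
    return any(frozenset([x, y]) in edges for x in a for y in b)

tables = sorted(scan_cost)
# best[subset] = (cost, output cardinality estimate, plan string)
best = {frozenset([t]): (scan_cost[t], card[t], t) for t in tables}

for size in range(2, len(tables) + 1):
    for subset in map(frozenset, combinations(tables, size)):
        for t in sorted(subset):                  # try each table as the last join
            rest = subset - {t}
            if rest not in best or not joinable(rest, {t}):
                continue                          # skip cross products
            r_cost, r_card, r_plan = best[rest]
            # toy hash-join cost: cost so far + scan of t + small build term
            cost = r_cost + scan_cost[t] + min(r_card, card[t]) * 1e-3
            out_card = max(r_card, card[t])       # crude join-output estimate
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, out_card, f"({r_plan} ⋈ {t})")

print(best[frozenset(tables)][2])  # ((restaurants ⋈ orders) ⋈ customers)
```

Even with this crude model, the DP lands on joining orders with restaurants first and bringing customers in last — the same shape as Option B in the walkthrough below.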
-- 3-TABLE JOIN: customers, orders, restaurants
-- Possible join orderings:
-- (customers ⋈ orders) ⋈ restaurants
-- (customers ⋈ restaurants) ⋈ orders ← cross join! no direct predicate between these two
-- (orders ⋈ restaurants) ⋈ customers
-- DYNAMIC PROGRAMMING APPROACH:
-- Step 1: Cost of single-table access (scan or index):
-- cost(customers) = 120 pages * 1.0 = 120
-- cost(orders) = 6250 pages * 1.0 = 6250
-- cost(restaurants) = 5 pages * 1.0 = 5
-- Step 2: Cost of all 2-table joins:
-- cost(customers ⋈ orders) via hash join = 120 + 6250 + 120(build) = ~6490
-- output: 5M rows (each customer has ~50 orders)
-- cost(orders ⋈ restaurants) via hash join = 6250 + 5 + 5(build) = ~6260
-- output: 5M rows (each order has one restaurant)
-- cost(customers ⋈ restaurants) = cross join: not applicable (no join predicate)
-- Step 3: Best 3-table join options:
-- Option A: (customers ⋈ orders) ⋈ restaurants
-- = cost 6490 + join 5M rows with 500 rows via hash = 6490 + 5000 = ~11490
-- Option B: (orders ⋈ restaurants) ⋈ customers
-- = cost 6260 + join 5M rows with 100K customers via hash = 6260 + 5000 = ~11260
-- Option B is slightly cheaper → CHOSEN
-- With predicate pushdown applied first:
-- After filtering customers WHERE city='Bengaluru': 1952 rows (not 100K)
-- After filtering orders WHERE status='delivered': 48K rows (not 5M)
-- Option B revisited: (filtered_orders ⋈ restaurants) ⋈ filtered_customers
-- = 6260(scan+filter) + small hash join = much cheaper
-- This is why predicate pushdown is applied before join ordering decisions.
-- GEQO (Genetic Query Optimization): for queries > join_collapse_limit tables
-- Uses genetic algorithm instead of exhaustive DP
-- Finds near-optimal but not guaranteed optimal solution
-- Faster than exhaustive search for very large join counts
SHOW geqo_threshold; -- default 12: use GEQO when a query joins 12 or more tables
SHOW join_collapse_limit; -- default 8: explicit JOINs beyond this keep their written order

Hints and Optimiser Control — When the Optimiser Gets It Wrong
The optimiser makes wrong choices when its statistics are stale or when the data has unusual distributions that the statistics model poorly. PostgreSQL does not support query hints natively (unlike Oracle's /*+ INDEX(table, idx) */). Instead, it provides configuration parameters to guide the optimiser.
-- DISABLE specific join types (force the optimiser away from bad choices):
SET enable_nestloop = off; -- disable NLJ (force hash or merge join)
SET enable_hashjoin = off; -- disable hash join
SET enable_mergejoin = off; -- disable sort-merge join
SET enable_seqscan = off; -- disable sequential scans (force index scans)
-- WARNING: these are session-level settings. Restore after debugging.
-- Also: disabling an operator doesn't mean it's never used —
-- if no alternative exists, the disabled operator is used anyway.
-- FORCE a specific index:
-- PostgreSQL doesn't have direct index hints.
-- Workaround: use SET enable_seqscan = off temporarily,
-- OR rewrite the query to be more index-friendly.
-- UPDATE STATISTICS when the optimiser is using stale estimates:
ANALYZE customers; -- update stats for one table
ANALYZE; -- update stats for all tables
VACUUM ANALYZE orders; -- clean dead tuples AND update stats
-- CHECK why the optimiser chose a plan:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT ...;
-- VERBOSE: adds output column lists and schema-qualified object names to each plan node
-- pg_hint_plan extension: provides Oracle-style hints for PostgreSQL
-- After installing: /*+ IndexScan(orders idx_orders_customer) */ in query
-- Not built-in — requires extension installation
-- SET STATISTICS: increase histogram precision for problem columns
ALTER TABLE orders ALTER COLUMN total_amount SET STATISTICS 500;
-- Default: 100 histogram buckets
-- For columns with complex distributions: increase to 500
ANALYZE orders (total_amount);
-- Now the histogram for total_amount has 500 buckets → better range estimates

The Performance Debugging Session — Taking a 45-Second Query to 80ms
This is the full realistic workflow of diagnosing and fixing a slow query in production. Every step uses concepts from this module.
-- Business requirement: monthly restaurant performance report for operations team
SELECT
r.name AS restaurant_name,
r.city,
r.cuisine_type,
COUNT(DISTINCT o.customer_id) AS unique_customers,
COUNT(o.order_id) AS total_orders,
SUM(o.total_amount) AS gross_revenue,
AVG(o.total_amount) AS avg_order_value,
COUNT(rev.review_id) AS review_count,
AVG(rev.rating) AS avg_rating
FROM restaurants r
LEFT JOIN orders o
ON r.restaurant_id = o.restaurant_id
AND o.order_date >= '2024-01-01'
AND o.order_date < '2024-02-01'
AND o.status = 'delivered'
LEFT JOIN reviews rev
ON o.order_id = rev.order_id
GROUP BY r.restaurant_id, r.name, r.city, r.cuisine_type
ORDER BY gross_revenue DESC NULLS LAST;
-- Runtime: 45 seconds on a 50M row orders table

EXPLAIN (ANALYZE, BUFFERS)
-- [query above]
-- RELEVANT PARTS OF OUTPUT:
Sort (actual time=44823.456..44825.234 rows=843 loops=1)
-> HashAggregate (actual time=44820.123..44822.456 rows=843 loops=1)
-> Hash Left Join (actual time=44650.789..44810.234 rows=48234 loops=1)
Hash Cond: (o.order_id = rev.order_id)
-> Hash Left Join (actual time=34530.456..44620.123 rows=48234 loops=1)
Hash Cond: (r.restaurant_id = o.restaurant_id)
-> Seq Scan on restaurants r (rows=843 loops=1)
-> Hash (actual time=34528.345..34528.345 rows=48234 loops=1)
-> Seq Scan on orders o
(actual time=0.025..34521.234 rows=48234 loops=1)
Filter: (order_date >= '2024-01-01' AND order_date < '2024-02-01'
AND status = 'delivered')
Rows Removed by Filter: 49951766
-- KEY OBSERVATIONS:
-- 1. Seq Scan on orders reads all 50M rows (34.5 seconds) to keep only 48K matches → MASSIVE bottleneck
-- 2. No index on order_date + status → full scan every time
-- 3. The reviews join hashes the entire reviews table (no index on reviews.order_id)

-- FIX 1: Add composite index matching the filter (date range + status)
CREATE INDEX CONCURRENTLY idx_orders_date_status
ON orders(order_date, status)
WHERE status = 'delivered';
-- Partial index: only delivered orders (most relevant for reporting)
-- With status fixed by the partial predicate, leading with order_date makes
-- the date range a single contiguous index scan
-- FIX 2: Add index on reviews for the join
CREATE INDEX CONCURRENTLY idx_reviews_order
ON reviews(order_id)
INCLUDE (rating);
-- INCLUDE rating → the join and AVG(rating) can be served from the index
-- (add review_id to INCLUDE to also cover COUNT(rev.review_id))
-- FIX 3: Rewrite with a CTE for clarity (PostgreSQL 12+ inlines it, so the plan is unaffected)
WITH delivered_jan AS (
SELECT
restaurant_id,
order_id,
customer_id,
total_amount
FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2024-02-01'
AND status = 'delivered' -- uses new partial index
)
SELECT
r.name, r.city, r.cuisine_type,
COUNT(DISTINCT d.customer_id) AS unique_customers,
COUNT(d.order_id) AS total_orders,
SUM(d.total_amount) AS gross_revenue,
AVG(d.total_amount) AS avg_order_value,
COUNT(rev.review_id) AS review_count,
AVG(rev.rating) AS avg_rating
FROM restaurants r
LEFT JOIN delivered_jan d ON r.restaurant_id = d.restaurant_id
LEFT JOIN reviews rev ON d.order_id = rev.order_id
GROUP BY r.restaurant_id, r.name, r.city, r.cuisine_type
ORDER BY gross_revenue DESC NULLS LAST;
-- AFTER FIXES — EXPLAIN output:
-- Index Scan on orders using idx_orders_date_status
-- (actual time=0.123..234.567 rows=48234 loops=1) ← fetches only the 48K
-- matching rows via the index instead of scanning all 50M
-- Hash Left Join (actual time=...) rows=48234 loops=1
-- Total Execution Time: 82.456 ms ← from 45,000 ms to 82 ms = 548x faster

The optimisation process was: identify the bottleneck (sequential scan on orders), understand why (missing index on date + status), add the right index (partial composite covering the filter), and verify with EXPLAIN that the plan changed. The result is a 548× improvement, from a query that times out API requests to one that completes in 82 milliseconds. This exact workflow applies to every slow query in production.
Query Processing Interview Questions — Complete Answers
Predicate pushdown moves filter conditions (WHERE clauses) as close to the data source as possible — ideally before any join operation. It is the most impactful optimisation because intermediate result sizes have a multiplicative effect on query cost. A join between two 10M-row tables produces up to 100 trillion row combinations in the worst case. Filtering each table to 10K rows before joining reduces the join input to at most 100M combinations — a million times less work. In relational algebra: σ_{condition}(R ⋈ S) is rewritten to σ_{R_condition}(R) ⋈ σ_{S_condition}(S). Modern optimisers apply this automatically for any condition that only references one table's attributes. The application-level implication: always put the most selective WHERE conditions on individual tables, not on derived tables or CTEs if possible, to help the optimiser push them down.
The optimiser chooses nested loop join when: (1) The outer relation is very small — if the outer side has 10 rows, the loop runs only 10 times regardless of the inner table size. With an index on the inner table's join column, 10 index lookups is extremely cheap. (2) An index exists on the inner table's join column — this turns each inner "scan" into an O(log n) index lookup, making the total cost O(outer_size × log(inner_size)). (3) The join condition is non-equality (theta join) — hash joins only work for equality conditions. Sort-merge join can handle some inequalities but nested loop is the general solution. (4) The optimiser estimates the inner side is tiny after applying filters. Hash joins win when both sides are large and fit in work_mem. Sort-merge wins when data is already sorted on the join key or when output ordering is needed.
A logical plan expresses the query in terms of relational algebra operations — selection, projection, join, aggregation — without specifying how each operation will be physically executed. It describes what data to retrieve and how it relates, in an implementation-independent way. A physical plan specifies the exact algorithms for each operation: which specific scan type (sequential scan, index scan, bitmap index scan), which join algorithm (nested loop, hash join, sort-merge), how sorting will be done (in-memory quicksort, external merge sort, using an existing index). The logical plan is the output of the query rewriter. The optimiser transforms the logical plan into the optimal physical plan by assigning physical algorithms to each logical operation based on cost estimation. EXPLAIN shows the physical plan — the actual algorithms the execution engine will run.
SQL is declarative — multiple SQL expressions can describe the same result. The database produces correct results for all of them but the execution plans differ dramatically. Example: EXISTS vs IN vs JOIN can all find customers who have orders, but the execution plan for each differs: EXISTS stops at the first match (short-circuit), IN may materialise the entire subquery result, JOIN may build a hash table. Another example: correlated subquery vs window function — both compute row-level aggregates, but the correlated subquery re-executes once per row (O(n²)) while the window function makes one pass (O(n)). Filter placement also matters: WHERE on an indexed column uses the index; applying the same filter inside a subquery or CTE may prevent index use. The key principle: the same logical result can be expressed with wildly different execution costs. Understanding query plans tells you which expression the optimiser can execute most efficiently.
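The correlated-subquery vs window-function gap is worth seeing concretely. This sketch (toy data) computes the same per-customer order count both ways and counts row visits:

```python
# Same logical result, O(n^2) vs O(n): attach a per-group count to every row.
from collections import Counter

rows = [{"cust": i % 5, "amt": i} for i in range(1000)]

# Correlated-subquery style: one full re-scan of `rows` per row -> O(n^2)
scans = 0
def count_for(cust):
    global scans
    scans += len(rows)                 # every call walks the whole table
    return sum(1 for r in rows if r["cust"] == cust)

slow = [{**r, "n_orders": count_for(r["cust"])} for r in rows]

# Window-function style: one counting pass, then one lookup pass -> O(n)
counts = Counter(r["cust"] for r in rows)
fast = [{**r, "n_orders": counts[r["cust"]]} for r in rows]

assert slow == fast
print(scans)   # 1,000,000 row visits vs ~2,000 for the single-pass version
```

At 1,000 rows the gap is already 500×; at 1M rows the correlated form does 10¹² row visits, which is the difference between milliseconds and hours.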
The optimiser chooses a sequential scan over an index scan in several legitimate cases: (1) Low selectivity — if the query returns more than ~15-20% of the table rows, a sequential scan is actually cheaper. Fetching many random pages via an index generates lots of random I/O, while a sequential scan reads all pages in order (fast sequential I/O). (2) Small table — if the table fits in a few pages, a sequential scan reads those pages in one operation. An index scan would add the overhead of traversing the index tree plus fetching the same pages. (3) Stale statistics — if ANALYZE hasn't been run recently, the optimiser may believe the table is small or the condition is not selective, leading it to choose a scan. Solution: VACUUM ANALYZE. (4) Correlation near 0 — if the indexed column's physical order is random relative to its values, index scans generate random I/O. The optimiser's correlation statistic models this. Solution: CLUSTER the table on the index to restore physical ordering, or on SSDs lower random_page_cost so random I/O is no longer over-penalised.
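A back-of-envelope version of cases (1) and (4) shows where the crossover sits. The constants mirror the documented defaults (seq_page_cost = 1.0, random_page_cost = 4.0); rows-per-page and the formulas are simplified assumptions, not PostgreSQL's actual costing, which also adds CPU terms:

```python
# Simplified planner arithmetic: when does an index scan beat a seq scan?
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
ROWS_PER_PAGE = 100          # assumed heap density

def seq_scan_cost(total_rows):
    return (total_rows / ROWS_PER_PAGE) * SEQ_PAGE_COST

def index_scan_cost(matching_rows, clustered):
    # clustered (correlation ~ 1): matching rows share heap pages
    # random layout (correlation ~ 0): worst case, one page fetch per row
    pages = matching_rows / ROWS_PER_PAGE if clustered else matching_rows
    return pages * RANDOM_PAGE_COST

total = 1_000_000
for pct in (1, 10, 20, 30):
    matching = total * pct // 100
    winner = "index" if index_scan_cost(matching, clustered=True) < seq_scan_cost(total) else "seq"
    print(f"{pct:>2}% of rows, clustered index: {winner} scan wins")

# With a random physical layout the index loses even at 1% selectivity:
print(index_scan_cost(10_000, clustered=False) > seq_scan_cost(total))  # True
```

With these toy constants the clustered crossover lands around 25%, in the same ballpark as the ~15-20% rule of thumb, while a correlation-0 layout makes the index lose almost immediately — which is exactly why the planner tracks the correlation statistic.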
🎯 Key Takeaways
- ✓Query processing pipeline: Parser (syntax check) → Semantic Analyser (validate tables/columns/permissions) → Query Rewriter (rule-based transforms) → Optimiser (cost-based plan selection) → Execution Engine (run the plan). The optimiser is where most of the intelligence lives.
- ✓Relational algebra is the internal representation of queries. Seven core operators: σ (selection/WHERE), π (projection/SELECT columns), ⋈ (join), × (cartesian product), ∪ (union), − (difference/EXCEPT), ρ (rename/alias). Each takes relations as input and produces a relation as output.
- ✓Predicate pushdown is the most impactful query optimisation: apply WHERE filters before joins to reduce intermediate result sizes. The optimiser does this automatically, but writing queries with explicit per-table filters helps the optimiser and makes intent clear.
- ✓The cost model estimates I/O (page reads), CPU (comparisons), and memory (buffer usage) for each plan node. random_page_cost (default 4.0) > seq_page_cost (default 1.0) reflects that a random page read is assumed to be roughly 4× as expensive as a sequential one on spinning disks. On SSDs, set random_page_cost = 1.1.
- ✓Cardinality estimation (predicting row counts) is the most error-prone part of optimisation. The independence assumption (P(A AND B) = P(A)×P(B)) fails for correlated columns. Use extended statistics (CREATE STATISTICS) to capture correlations. Stale statistics produce wrong estimates — run VACUUM ANALYZE regularly.
- ✓Three join algorithms: Nested Loop (best for small outer + index on inner), Sort-Merge (best for large + already sorted + ORDER BY needed), Hash Join (best for large equality joins that fit in work_mem). The optimiser chooses based on input sizes and index availability.
- ✓Hash join with Batches > 1 means work_mem was too small and the hash table spilled to disk. Increase work_mem (SET work_mem = '256MB') for analytical queries. Sort with "external merge" also means disk spill — same fix.
- ✓EXPLAIN ANALYZE is the primary diagnostic tool. Key metrics: actual rows vs estimated rows (large difference = stale statistics), Rows Removed by Filter (large = index opportunity), Buffers read (large = index or memory issue), loops count on inner side of NLJ (large = potential hash join candidate).
- ✓PostgreSQL uses dynamic programming (the Selinger algorithm) for join ordering when the number of tables is below geqo_threshold (default 12), and a genetic algorithm (GEQO) beyond that. It finds the plan with minimum estimated cost among all join orderings and algorithm combinations. Cannot use hints natively — guide via ANALYZE, SET enable_xxx = off, or CREATE STATISTICS.
- ✓A 548× improvement (45 seconds → 82ms) from adding one index is not unusual. The workflow is always: find the slow query, run EXPLAIN ANALYZE, identify the bottleneck node (usually a Seq Scan with many rows removed), understand why (no index, stale stats, wrong join order), apply the fix, verify the plan changed.
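The hash join the takeaways keep returning to is simple at its core: build an in-memory hash table on the (ideally smaller) side, then probe it in a single pass. When the build side exceeds work_mem, PostgreSQL instead partitions both inputs into on-disk batches — that spill is what "Batches > 1" in EXPLAIN reports. A minimal in-memory sketch with invented table data:

```python
# Hash join, build + probe: the shape behind "Hash Join" nodes in EXPLAIN.
from collections import defaultdict

def hash_join(build_side, probe_side, build_key, probe_key):
    # Build phase: hash the smaller side into memory (this is what work_mem holds)
    table = defaultdict(list)
    for row in build_side:
        table[row[build_key]].append(row)
    # Probe phase: one pass over the larger side, O(1) lookups per row
    return [(b, p) for p in probe_side for b in table.get(p[probe_key], [])]

customers = [{"cid": c, "name": f"c{c}"} for c in range(1000)]   # build side
orders = [{"oid": o, "cid": o % 1000} for o in range(5000)]      # probe side

joined = hash_join(customers, orders, "cid", "cid")
print(len(joined))   # every order matches exactly one customer -> 5000
```

Total work is one pass over each input, which is why hash join dominates for large equality joins — provided the build table fits in memory; otherwise the batching fallback trades that single pass for extra disk reads and writes.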