Indexes
Why queries slow down at scale, how indexes fix them, and the data structures powering every fast database lookup — explained from scratch.
Why Queries Slow Down — And What Indexes Do About It
Imagine a physical library with one million books, completely unsorted. No catalogue, no shelves organised by topic, no numbering system. Just books piled randomly across thousands of shelves. Someone walks in and asks: "Do you have a book called Database System Concepts by Silberschatz?" The only way to find it is to walk through every single shelf and read every single book title until you either find it or exhaust every possibility. In the worst case — the book isn't there — you checked one million books for nothing.
This is exactly what a database does when you query a table with no index. It reads every single row from disk, evaluates your WHERE condition against each one, and keeps the matches. This operation is called a full table scan (or sequential scan), and at small data volumes it is perfectly acceptable. At millions of rows, it becomes the difference between a query taking 2 milliseconds and 8 seconds.
Now imagine the library installs a catalogue — a small organised book that tells you exactly which shelf and position every book is on, sorted alphabetically by title. Finding "Database System Concepts" takes two seconds: open the catalogue, flip to D, find the entry, walk directly to shelf 47 row 3. Done. You checked perhaps 20 entries in the catalogue instead of one million books on the shelves. That catalogue is a database index.
Query: SELECT * FROM orders WHERE customer_id = 'C001'
Without an index: on a table with 50 million rows, the database reads all 50 million rows, evaluates customer_id on each, and keeps the matches — touching every data page of the heap file in the worst case.
Query time: ~8 seconds
Disk reads: potentially all data pages
With an index: same query on the same 50 million rows. The database descends the B+ tree index — a structure only 3–4 levels deep (about log₂(50,000,000) ≈ 26 key comparisons in total) — and jumps directly to the matching rows.
Query time: ~2 milliseconds
Disk reads: 3–4 pages maximum
That is not a small difference. It is a 4000× speedup for one missing index on one query. At Swiggy's scale — tens of millions of orders, queries running thousands of times per second — a missing index can bring down an entire service. Adding the right index can fix a production incident in under a minute. This is why indexes are one of the most practically important topics in all of database engineering.
What an Index Actually Is
An index is a separate data structure stored alongside the table, maintained automatically by the DBMS. It contains a sorted copy of the values in the indexed column(s), along with pointers to the actual row locations on disk. When you query using an indexed column, the database uses the index to find the right disk location directly — without reading unrelated rows at all.
The key properties of an index: it speeds up reads dramatically, it costs storage space (an index can be as large as the table itself for wide tables), and it slows down writes slightly (every INSERT, UPDATE, and DELETE must also update every index on that table). This trade-off — faster reads, slower writes, more storage — is the central tension of index design.
-- CREATE an index (the most common operation)
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Index name convention: idx_{table}_{column(s)}
-- This creates a B+ tree index by default in PostgreSQL and MySQL
-- CREATE a UNIQUE index (also enforces uniqueness constraint)
CREATE UNIQUE INDEX idx_customers_email ON customers(email);
-- Roughly equivalent to: ALTER TABLE customers ADD UNIQUE (email)
-- CREATE INDEX CONCURRENTLY (PostgreSQL — safe on live production tables)
CREATE INDEX CONCURRENTLY idx_orders_date ON orders(order_date);
-- Without CONCURRENTLY: locks the table during index creation (dangerous in production)
-- With CONCURRENTLY: builds the index without blocking reads or writes (takes longer but safe)
-- DROP an index
DROP INDEX idx_orders_customer;
-- View indexes on a table (PostgreSQL)
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';
-- View indexes on a table (MySQL)
SHOW INDEX FROM orders;

How Databases Store Data on Disk — Pages, Blocks, and Heap Files
To understand why indexes work the way they do, you need to understand how databases physically store data. This is the layer that index structures are specifically designed to navigate efficiently.
Pages — The Fundamental Unit of Disk Storage
Databases never read or write individual rows to disk. They operate in fixed-size units called pages (also called blocks). A page is typically 8KB in PostgreSQL (configurable), 16KB in MySQL InnoDB. Every read from disk retrieves at least one full page. Every write writes at least one full page.
A single page contains many rows. For a typical orders table where each row is about 200 bytes, an 8KB page holds roughly 40 rows. This means reading one order from disk also reads the 39 rows stored near it on the same page — whether you wanted those rows or not. This is why sequential scans are expensive: they read every page in the table, even if only a fraction of rows on each page match your query.
// Sequential scan: read EVERY page of the heap file, filtering as you go
// Index scan: follow index pointer → read EXACTLY the target page(s)
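The arithmetic behind these numbers is worth doing once by hand. A minimal sketch, using the assumed figures from the paragraph above (8 KB pages, ~200-byte rows, 50 million rows):

```python
# Back-of-envelope I/O arithmetic (assumed numbers: 8 KB pages, ~200-byte rows).
PAGE_SIZE = 8 * 1024   # bytes per page (PostgreSQL default)
ROW_SIZE = 200         # assumed average row width in bytes
N_ROWS = 50_000_000

rows_per_page = PAGE_SIZE // ROW_SIZE          # how many rows fit on one page
pages_full_scan = -(-N_ROWS // rows_per_page)  # ceil division: pages a seq scan reads
pages_index_lookup = 4                         # root + internal + leaf + one data page

print(rows_per_page)      # 40
print(pages_full_scan)    # 1250000 pages for a full table scan
print(pages_index_lookup) # 4 pages for a point lookup via the index
```

The 1,250,000-page figure is where the full-scan cost in the rest of this section comes from.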
Heap Files — Unordered Row Storage
By default, database tables store rows in a heap file — an unordered collection of pages where rows are inserted wherever space is available. There is no particular order to the rows. A row inserted today might be on page 1. The next row might go to page 847 because that is where the last available space was. Heap files make inserts fast (just find free space and write) but make targeted lookups slow (no structure to navigate to a specific row without an index).
This is why a new table with no indexes behaves as we described: finding one row requires reading the entire heap file page by page. An index is a separately maintained structure that provides the navigation shortcut the heap file inherently lacks.
The Buffer Pool — How Databases Cache Pages in RAM
The database does not read from disk every time it needs a page. It maintains a buffer pool — a cache of recently accessed pages kept in RAM. When a query needs a page, the database checks the buffer pool first. If the page is there (a cache hit), it uses the RAM copy — microseconds. If not (a cache miss), it reads from disk — milliseconds.
This means query performance depends on the buffer pool hit rate. Frequently accessed index pages (especially the upper levels of a B+ tree) stay permanently in the buffer pool on a busy system — making index lookups even faster in practice than the raw numbers suggest. The root and upper levels of a B+ tree on a production system are almost always in RAM.
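Why the upper tree levels stay resident can be seen with a toy LRU cache. This is a simplified illustration, not a real buffer pool (real ones use clock-sweep variants and pin counts); the page names are invented:

```python
from collections import OrderedDict

class BufferPool:
    """Minimal LRU page cache sketch (illustration only, not a real DBMS buffer pool)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page contents (omitted here)
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            self.hits += 1
            return "RAM"                     # cache hit: microseconds
        self.misses += 1                     # cache miss: simulate a disk read
        self.pages[page_id] = object()
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return "disk"                        # milliseconds

pool = BufferPool(capacity=100)
for i in range(10_000):
    pool.get("btree_root")        # every lookup touches the root, so it stays hot
    pool.get(f"leaf_{i % 5000}")  # 5000 distinct leaf pages compete for 100 slots

print(pool.hits, pool.misses)     # 9999 10001 — the root hits on every access after the first
```

The root page is re-touched constantly, so LRU never evicts it; the long tail of leaf pages churns through the cache. The same dynamic keeps B+ tree upper levels in RAM on production systems.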
B+ Tree — The Data Structure Behind Every Standard Index
The vast majority of database indexes use a data structure called the B+ tree. Understanding why B+ trees specifically — and not binary search trees, red-black trees, or hash tables — requires understanding the constraints of disk-based storage.
Why Not Binary Search Trees?
A binary search tree (BST) gives O(log₂ n) search time and seems like a natural choice for an index. But BSTs have one fatal property for disk-based storage: each node has at most 2 children. For a table with 50 million rows, a balanced BST has a height of log₂(50,000,000) ≈ 26 levels. Each level potentially requires a separate disk read (if the node is not in the buffer pool). 26 disk reads × ~5 milliseconds per disk read = 130 milliseconds per query. At 10,000 queries per second, this is unacceptable.
The solution: make each node much wider. Instead of 2 children, allow thousands of children per node. This dramatically reduces the height of the tree. A B+ tree node with 1000 children on 50 million rows has height log₁₀₀₀(50,000,000) ≈ 3 levels. Three disk reads instead of 26. This is the core design insight of B+ trees.
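The height comparison is a one-liner to verify. A quick check of the two logarithms used above:

```python
import math

n = 50_000_000
# Tree height ≈ log_base(n) levels, rounded up.
bst_height = math.ceil(math.log2(n))         # binary tree: 2 children per node
btree_height = math.ceil(math.log(n, 1000))  # B+ tree: ~1000 children per node

print(bst_height)    # 26 levels → up to 26 page reads per lookup
print(btree_height)  # 3 levels → 3 page reads per lookup
```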
B+ Tree Structure — Internal Nodes and Leaf Nodes
A B+ tree has two types of nodes with different purposes:
Internal nodes: store only keys — the indexed values, with no pointers to actual table rows. Their only purpose is navigation: given a search key, which child node should we follow next? Internal nodes can have hundreds to thousands of children, keeping the tree shallow.
[ k₁ | k₂ | k₃ | k₄ | k₅ ]
↓    ↓    ↓    ↓    ↓    ↓
(6 child pointers)
Leaf nodes: store the actual key values AND pointers to the corresponding rows in the heap file (or the actual row data for clustered indexes). This is where every search ultimately terminates. All leaf nodes are linked together in a doubly-linked list:
[30→row][35→row][40→row] ↔ [50→row][55→row][60→row]
The Critical Property — All Leaf Nodes Are Linked
The defining feature that makes B+ trees superior to B-trees for databases: all leaf nodes are connected in a doubly-linked list from left to right. After finding the first matching key using the tree traversal, you can follow the linked list forward to retrieve all subsequent matching keys in sorted order — without going back up the tree.
This is what makes range queries (WHERE age BETWEEN 25 AND 35, WHERE order_date > '2024-01-01') fast. Find the first matching leaf node using tree traversal, then walk the linked list collecting all subsequent matches until the range condition is no longer satisfied. No re-traversal of the tree needed for each match.
// Step 1: Traverse root→level2→leaf containing C040 (3 reads)
// Step 2: Walk linked list forward: C040, C041, C042... until C060
// Total disk reads: 3 (tree) + N (matching leaf pages) — not all pages
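The pattern in these comments — descend the tree once, then walk the leaf chain — can be simulated with a toy two-level tree. The keys and node layout are invented for illustration; counting one "page read" per node visited:

```python
import bisect

# Toy two-level B+ tree over customer IDs (illustrative keys, not real pages).
# Leaves hold sorted keys and a `next` index forming the left-to-right chain.
leaves = [
    {"keys": ["C010", "C020", "C030"], "next": 1},
    {"keys": ["C040", "C041", "C042", "C050"], "next": 2},
    {"keys": ["C055", "C060", "C070"], "next": None},
]
# Root separators: child i covers keys below keys[i]; the last child covers the rest.
root = {"keys": ["C040", "C055"], "children": [0, 1, 2]}

def range_scan(lo, hi):
    """Descend from the root once, then walk the leaf chain — no re-traversal."""
    page_reads = 1  # reading the root counts as one page
    leaf_idx = root["children"][bisect.bisect_right(root["keys"], lo)]
    matches = []
    while leaf_idx is not None:
        page_reads += 1
        leaf = leaves[leaf_idx]
        matches.extend(k for k in leaf["keys"] if lo <= k <= hi)
        if leaf["keys"][-1] > hi:  # passed the end of the range: stop
            break
        leaf_idx = leaf["next"]    # follow the leaf linked list forward
    return matches, page_reads

matches, reads = range_scan("C040", "C060")
print(matches)  # ['C040', 'C041', 'C042', 'C050', 'C055', 'C060']
print(reads)    # 3: the root plus two leaf pages
```

Note that the tree is traversed exactly once, for the lower bound; every further match comes from following `next` pointers.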
B+ Tree Search — Step by Step
// QUERY: SELECT * FROM orders WHERE customer_id = 'C042'
// B+ tree index exists on customer_id
// STEP 1: Read the ROOT node (almost always in buffer pool — cached in RAM)
// Root contains: [C030 | C060 | C090]
// C042 > C030 AND C042 < C060 → follow pointer to SECOND child
// STEP 2: Read the INTERNAL NODE at level 2 (may be cached)
// Node contains: [C040 | C050 | C060]
// C042 > C040 AND C042 < C050 → follow pointer to FIRST child of this node
// STEP 3: Read the LEAF NODE
// Leaf contains: [C040→page12, C041→page18, C042→page7, C043→page22, C044→page9]
// C042 found! Pointer says: the row is on page 7
// STEP 4: Read DATA PAGE 7 from the heap file
// Retrieve the actual row data for customer C042
// TOTAL DISK READS: 4 (root + level2 + leaf + data page)
// For 50 million rows with branching factor 1000: 3 levels + 1 = 4 reads
// Compare: full table scan would read every page of the heap file
// For 50M rows at 40 rows/page: 1,250,000 page reads
// RANGE QUERY: WHERE customer_id BETWEEN 'C040' AND 'C060'
// Step 1-3: same as above — navigate to leaf containing C040
// Step 4: read all matching rows from data pages for C040, C041... C060
// Step 5: follow leaf linked list forward to C041's leaf page (if different page)
// Continue until C060 is passed
// TOTAL: 3 tree reads + (number of matching leaf pages) + (number of matching data pages)
// Still dramatically fewer reads than a full scan

B+ Tree Insertion — How the Tree Grows
Insertions are what make indexes cost something. Every time a row is inserted into the table, the database must also insert the new key into every index on that table. B+ tree insertion works as follows: navigate to the correct leaf node, insert the key in sorted position. If the leaf node is full (has reached maximum capacity), split it into two nodes and push the middle key up to the parent internal node. If the parent is also full, split it too. This splitting can cascade all the way to the root — when the root splits, a new root is created and the tree grows one level taller.
// ORDER: INSERT INTO orders VALUES ('C042', ...) — index on customer_id
// Before insertion — leaf node is full (max 4 keys in this example):
// Leaf: [C040 | C041 | C043 | C044]
// Insert C042 in sorted position → overflow:
// [C040 | C041 | C042 | C043 | C044] → exceeds capacity
// SPLIT: divide into two leaf nodes
// Left leaf: [C040 | C041 | C042]
// Right leaf: [C043 | C044]
// COPY the first key of the right leaf (C043) up to the parent as a separator
// (in a B+ tree, leaf keys are copied up, not moved — every key stays in a leaf)
// Parent internal node before: [C030 | C060]
// After copy-up: [C030 | C043 | C060] — with new pointer to the right leaf
// Parent not full → no further splitting needed
// INSERT COST: O(log_B n) comparisons + potentially O(log_B n) page writes
// B = branching factor (typically 100-1000)
// At 50M rows: ~3 page reads + potentially 3 page writes for splits (rare)
// Most insertions do NOT cause splits — leaf nodes are often half-empty

B+ Tree Deletion
Deletion is the inverse of insertion. Navigate to the leaf node containing the key and remove it. If the leaf node falls below the minimum occupancy threshold (typically 50%), either borrow a key from a sibling node (redistribution) or merge with a sibling (underflow merge). Merging can cascade up the tree similarly to how splits cascade during insertion. In practice, databases often leave slightly underfull pages without immediate merging — merging is triggered periodically by maintenance operations (VACUUM in PostgreSQL).
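The leaf-split step from the insertion walkthrough can be sketched in a few lines. This is a toy sketch, not a real page manager, and it follows one common convention: the first key of the right half is copied up as the separator:

```python
import bisect

def split_leaf(keys, new_key):
    """Insert new_key into a full leaf, then split it B+ tree style.

    Returns (left_keys, right_keys, separator). The separator is the first
    key of the right leaf and is COPIED up to the parent, not moved: in a
    B+ tree every key remains present in some leaf.
    """
    bisect.insort(keys, new_key)  # insert in sorted position → overflow
    mid = (len(keys) + 1) // 2    # left half keeps the extra key
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

left, right, sep = split_leaf(["C040", "C041", "C043", "C044"], "C042")
print(left)   # ['C040', 'C041', 'C042']
print(right)  # ['C043', 'C044']
print(sep)    # 'C043' — goes to the parent: [C030 | C060] becomes [C030 | C043 | C060]
```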
Every Index Type — What It Is, When to Use It, When to Avoid It
When NOT to Add an Index — The Cases That Hurt More Than They Help
Every index has a cost. Storage space for the index structure itself. CPU and disk writes to maintain the index on every INSERT, UPDATE, and DELETE. Buffer pool space competing with actual data pages. More index options for the query optimiser to evaluate. Adding indexes blindly — indexing every column "just in case" — is a common mistake that makes write-heavy systems significantly slower without meaningful read improvement.
Small tables. For small tables, a full table scan reads only a handful of pages and completes in microseconds. The overhead of maintaining and traversing an index exceeds the benefit of avoiding a sequential scan. The query optimiser will likely ignore your index and do a sequential scan anyway — it can calculate which approach is cheaper.
Low-cardinality columns. Cardinality is the number of distinct values in a column. A column with only 2 distinct values (is_active: true/false) has very low cardinality. An index on a low-cardinality column is often useless because the index lookup returns half the table — at that point a sequential scan is equally fast and avoids the two-step lookup overhead.
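Cardinality is easy to measure before deciding to index. A small sqlite3 demo with an invented orders table (the same COUNT(DISTINCT …) query works in PostgreSQL and MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, is_active) VALUES (?, ?)",
    [(f"C{i:05d}", i % 2) for i in range(10_000)],  # every customer_id unique
)

def cardinality(column):
    # String interpolation is fine for this local demo — never for user input.
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM orders"
    ).fetchone()
    return distinct, distinct / total  # (distinct values, selectivity)

print(cardinality("customer_id"))  # (10000, 1.0)    → high cardinality: index-worthy
print(cardinality("is_active"))    # (2, 0.0002)     → low cardinality: index rarely helps
```

A selectivity near 1.0 means an equality lookup narrows to roughly one row; a selectivity near zero means it returns a large fraction of the table.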
Write-heavy, rarely-read tables. Tables that receive thousands of inserts per second (event logs, clickstreams, sensor data, audit trails) pay the index maintenance cost on every write. If these tables are rarely queried interactively, the performance cost of maintaining indexes far exceeds the benefit of faster occasional reads.
Columns never used for filtering or sorting. An index is only useful when it matches a query's access pattern. Indexing a column that is only ever selected (in the SELECT list) but never used to filter or sort provides zero query performance benefit while adding full write overhead.
Bulk loads. When loading large amounts of data into a table, it is faster to drop all non-essential indexes before the load, perform the bulk insert, then recreate the indexes afterward. Maintaining indexes during a bulk load of 10 million rows is dramatically slower than rebuilding them once after the load completes.
EXPLAIN and EXPLAIN ANALYZE — Reading the Database's Execution Plan
The EXPLAIN command shows you how the database plans to execute a query — which indexes it will use, which join algorithms it chooses, how many rows it estimates at each step. EXPLAIN ANALYZE actually executes the query and shows both the plan AND the actual measured runtime statistics. This is the primary diagnostic tool for query performance.
Key Scan Types — What They Mean
Seq Scan: sequential (full table) scan. Reads every row. Expected for small tables or when no useful index exists. On large tables with a selective WHERE clause: a problem.
Index Scan: uses an index to find rows, then fetches the actual row data from the heap (two-step). Good for selective queries returning a small fraction of the table.
Index Only Scan: uses a covering index — all needed columns are in the index. Never touches the heap file. The fastest possible access pattern.
Bitmap Heap Scan: builds a bitmap of matching row locations from the index, then fetches all matching heap pages in physical order. Efficient for queries matching many rows or combining multiple indexes.
-- Run EXPLAIN ANALYZE on a slow query
EXPLAIN ANALYZE
SELECT o.order_id, c.name, o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'pending'
AND o.total_amount > 500;
-- EXAMPLE OUTPUT:
-- Hash Join (cost=1500.00..8400.00 rows=450 width=64)
-- (actual time=45.123..156.789 rows=423 loops=1)
-- Hash Cond: (o.customer_id = c.customer_id)
-- -> Seq Scan on orders o
-- (cost=0.00..6500.00 rows=850 width=40)
-- (actual time=0.025..145.234 rows=847 loops=1)
-- Filter: ((status = 'pending') AND (total_amount > 500))
-- Rows Removed by Filter: 9153153
-- -> Hash (cost=600.00..600.00 rows=72000 width=28)
-- (actual time=32.456..32.456 rows=72000 loops=1)
-- -> Seq Scan on customers c
-- (cost=0.00..600.00 rows=72000 width=28)
-- (actual time=0.012..18.234 rows=72000 loops=1)
-- Planning Time: 2.345 ms
-- Execution Time: 157.123 ms
-- READING THE OUTPUT:
-- "Seq Scan on orders" → full table scan on 10M row table → PROBLEM
-- "Rows Removed by Filter: 9153153" → reading 10M rows to find 847 → 99.99% waste
-- "Execution Time: 157ms" → acceptable? not for a user-facing API endpoint
-- DIAGNOSIS:
-- The Seq Scan on orders is the bottleneck
-- WHERE clause: status = 'pending' AND total_amount > 500
-- FIX: Add a partial index on pending orders with total_amount
CREATE INDEX idx_orders_pending_large
ON orders(total_amount)
WHERE status = 'pending';
-- AFTER INDEX:
-- Index Scan using idx_orders_pending_large on orders
-- (actual time=0.123..2.456 rows=847 loops=1)
-- Index Cond: (total_amount > 500)
-- Execution Time: 4.567 ms
-- 35x improvement from adding one targeted index
-- KEY NUMBERS TO LOOK FOR IN EXPLAIN ANALYZE:
-- "Rows Removed by Filter" large → index could help filter earlier
-- "actual rows" >> "estimated rows" → statistics are stale → run ANALYZE
-- "actual time" large for a node → that node is the bottleneck
-- "loops" > 1 → this node runs multiple times (e.g., inner side of nested loop join)
-- cost=X..Y → X=startup cost (first row), Y=total cost (all rows)

Common Query Optimisation Patterns
-- PATTERN 1: Leading wildcard LIKE cannot use B+ tree index
-- SLOW: leading % forces full scan
SELECT * FROM customers WHERE name LIKE '%Sharma%';
-- The B+ tree is sorted by name — we don't know where names containing "Sharma" start
-- BETTER: trailing wildcard CAN use index (scans from prefix)
SELECT * FROM customers WHERE name LIKE 'Sharma%';
-- Uses index: starts at "Sharma" in sorted order, reads forward
-- BEST for arbitrary substring search: Full-Text Search
CREATE INDEX idx_customers_name_fts ON customers USING GIN(to_tsvector('english', name));
SELECT * FROM customers WHERE to_tsvector('english', name) @@ to_tsquery('Sharma');
-- PATTERN 2: Function on indexed column disables the index
-- SLOW: applying function to column prevents index use
SELECT * FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2024;
-- Function applied to order_date → cannot use B+ tree index on order_date
-- FAST: use range instead of function
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Range query → B+ tree index on order_date used efficiently
-- PATTERN 3: Implicit type conversion disables index
-- SLOW: comparing VARCHAR column to integer causes implicit cast
SELECT * FROM customers WHERE customer_id = 42; -- customer_id is VARCHAR
-- FAST: match types explicitly
SELECT * FROM customers WHERE customer_id = '42'; -- same type as column
-- PATTERN 4: OR conditions often prevent index use
-- SLOW: OR can force full scan (depends on optimizer sophistication)
SELECT * FROM orders WHERE status = 'pending' OR status = 'confirmed';
-- FAST: use IN (optimizer handles better)
SELECT * FROM orders WHERE status IN ('pending', 'confirmed');
-- OR BEST: UNION ALL if each condition benefits from different index
SELECT * FROM orders WHERE status = 'pending'
UNION ALL
SELECT * FROM orders WHERE status = 'confirmed';
-- PATTERN 5: ORDER BY + LIMIT uses index efficiently
-- If index matches ORDER BY column, no separate sort step is needed
CREATE INDEX idx_orders_date ON orders(order_date DESC);
-- This query is extremely fast:
SELECT * FROM orders ORDER BY order_date DESC LIMIT 20;
-- Uses index scan directly — no sort step; stops after the first 20 index entries

Index Maintenance — Bloat, Fragmentation, and Keeping Indexes Healthy
Indexes degrade over time. Heavy insert and delete workloads leave partially empty pages (fragmentation). Deleted rows leave dead entries in index leaf nodes (bloat). Both conditions make the index slower than it should be — the database must read more pages to retrieve the same data, and the index takes more disk space than necessary. Regular maintenance is required.
Index Bloat and Fragmentation
When a row is deleted from a table, the corresponding entry in every index is marked as dead — but the space is not immediately reclaimed. Over time, especially in high-churn tables (many deletes and updates), indexes can become bloated with dead entries. The index size grows while the useful data shrinks. PostgreSQL's VACUUM process cleans dead tuples, but indexes may still become fragmented (many half-empty pages in a non-contiguous layout on disk).
-- POSTGRESQL: VACUUM cleans dead tuples from tables and indexes
VACUUM orders; -- clean dead tuples, non-blocking
VACUUM ANALYZE orders; -- clean + update statistics (optimizer uses statistics)
VACUUM FULL orders; -- full rewrite, reclaim all space (BLOCKS table — use in maintenance window)
-- Check for index bloat (PostgreSQL):
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Find unused indexes (indexes never used since last stats reset):
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexname NOT LIKE '%_pkey' -- exclude primary keys
ORDER BY schemaname, tablename;
-- Consider dropping indexes with idx_scan = 0 — they are pure overhead
-- REINDEX: rebuild an index from scratch (removes bloat, fixes fragmentation)
REINDEX INDEX idx_orders_customer; -- specific index
REINDEX TABLE orders; -- all indexes on table
REINDEX INDEX CONCURRENTLY idx_orders_customer; -- non-blocking (PostgreSQL 12+)
-- MYSQL: ANALYZE TABLE updates statistics
ANALYZE TABLE orders;
-- MYSQL: OPTIMIZE TABLE rebuilds table and indexes (equivalent to VACUUM FULL + REINDEX)
OPTIMIZE TABLE orders; -- ⚠ locks table during operation — use in maintenance window
-- Statistics matter: the query optimizer uses row count estimates and
-- value distributions to choose execution plans. Stale statistics cause
-- the optimizer to make wrong decisions (wrong join order, wrong index).
-- Run ANALYZE regularly on frequently updated tables.

The 3am Alert — Diagnosing and Fixing a Missing Index in Production
This is the most realistic production scenario involving indexes. It happens at every company that operates at scale. Understanding this sequence — alert, diagnosis, fix — is what makes you a valuable engineer.
The post-mortem lesson: The new feature was deployed without checking its query plan. The WHERE clause changed from just customer_id (indexed) to customer_id + status (not fully covered by the index). The index strategy should be reviewed in code review for any query that changes filter conditions on large tables. One missing index caused a 9-minute production incident affecting thousands of users.
Index Interview Questions — With Complete Answers
What is the difference between a clustered and a non-clustered index? A clustered index determines the physical storage order of rows in the table. The table data pages ARE the leaf nodes of the index — data and index are one structure. A table can have only one clustered index because rows can only be physically sorted one way. A non-clustered index is a separate structure from the table. Its leaf nodes contain the indexed key values and pointers back to the actual rows in the heap file. A table can have many non-clustered indexes. In MySQL InnoDB, the primary key is automatically the clustered index. In PostgreSQL, all tables are heap files by default (non-clustered).
You have a composite index on (A, B, C). Which WHERE clauses can use it? WHERE A=1: YES — uses the index (first column is A, leftmost prefix rule satisfied). WHERE B=2: NO (or very limited) — B is the second column; without filtering on A first, the index cannot be traversed to find all B=2 entries efficiently. WHERE A=1 AND B=2: YES — uses both columns optimally (leftmost prefix A, then narrows by B). WHERE B=2 AND C=3: NO — neither B nor C is the leftmost prefix; without A, the index cannot be used for navigation. The leftmost prefix rule: an index on (A,B,C) efficiently answers queries that filter on A, or (A,B), or (A,B,C) — not B alone, C alone, or (B,C) without A.
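The leftmost prefix rule is easy to see in a live planner. The sketch below uses sqlite3 (bundled with Python) as a stand-in for PostgreSQL or MySQL; `EXPLAIN QUERY PLAN` is SQLite's equivalent of `EXPLAIN`, and the table and index names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column d is deliberately outside the index so SELECT * must touch the table.
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
conn.execute("CREATE INDEX idx_t_abc ON t(a, b, c)")

def plan(where):
    # EXPLAIN QUERY PLAN rows end with a human-readable summary column.
    rows = conn.execute(f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE {where}").fetchall()
    return rows[0][-1]

print(plan("a = 1"))            # SEARCH ... USING INDEX idx_t_abc — leftmost prefix OK
print(plan("a = 1 AND b = 2"))  # SEARCH ... USING INDEX idx_t_abc — uses both columns
print(plan("b = 2"))            # SCAN ... — B alone cannot navigate the index
```

The same experiment against PostgreSQL with `EXPLAIN` shows the equivalent Index Scan vs Seq Scan split.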
When would you use a partial index? Use a partial index when queries consistently target a specific subset of table rows that is much smaller than the full table. Classic cases: queries on active records only (WHERE is_active = true), pending/unprocessed items (WHERE status = 'pending'), recent data (WHERE created_at > '2024-01-01'). A partial index on 1% of rows is 100x smaller than a full index, updates 100x less frequently on writes, and scans 100x faster. It is ideal when the indexed subset is highly selective and the queries targeting it are frequent and performance-critical.
Why can't a B+ tree index serve a leading-wildcard LIKE query? A B+ tree index is sorted — it allows efficient lookup by navigating from a known starting point in the sorted order. A leading wildcard (%keyword%) means the search key can start with any character. There is no "starting point" in the sorted tree — every entry could potentially match. The database must scan the entire index (or table) to find matches. This is as slow as a full table scan. Only trailing wildcards (keyword%) can use B+ tree indexes because there IS a known starting point (start at "keyword" in sorted order and scan forward). For arbitrary substring search, use full-text search indexes (GIN with tsvectors in PostgreSQL) which use inverted indexes designed for this purpose.
What is index cardinality and why does it matter? Index cardinality refers to the number of distinct values in an indexed column. High cardinality means many distinct values (e.g., customer_id — nearly as many distinct values as rows). Low cardinality means few distinct values (e.g., status with 5 values, is_active with 2 values). Cardinality matters because it determines how selective an index is. A highly selective index (high cardinality) narrows down the result set dramatically — it is worth the lookup overhead. A low-selectivity index (low cardinality) might return 50% of the table — at that point a full table scan is equally fast and avoids the double-lookup overhead. The query optimiser uses cardinality statistics to decide whether to use an index or scan the table directly.
What is a covering index and when should you design one? A covering index is one that contains all columns needed by a specific query — both the columns used in WHERE/ORDER BY conditions and the columns in the SELECT list. When a query is answered entirely from the index without touching the actual table rows, it is called an "Index Only Scan" in PostgreSQL. This is the fastest possible access pattern because: (1) the index is typically much smaller than the table, (2) index pages are more likely to be cached in the buffer pool, (3) there is no second lookup to fetch actual row data. Use covering indexes for high-frequency, performance-critical queries. Design them by putting search columns in the index key and additional needed columns in INCLUDE. The cost is larger index size and more write overhead — justified only for queries with very high frequency.
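Covering behaviour can also be demonstrated with sqlite3 (which, unlike PostgreSQL, has no INCLUDE clause, so the extra column simply becomes a trailing index column). Table and index names here are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id TEXT, total REAL, notes TEXT)"
)
# customer_id is the search key; total rides along so the query below can be
# answered entirely from the index, never touching the table itself.
conn.execute("CREATE INDEX idx_orders_cust_total ON orders(customer_id, total)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, total FROM orders WHERE customer_id = 'C001'"
).fetchall()[0][-1]
print(detail)  # SEARCH ... USING COVERING INDEX idx_orders_cust_total ...
```

SQLite explicitly labels the plan "COVERING INDEX"; PostgreSQL reports the equivalent as an Index Only Scan. Selecting `notes` as well would break coverage and force a heap lookup.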
🎯 Key Takeaways
- ✓ A full table scan reads every page in the table. An index lookup reads only the relevant pages. For a 50-million-row table, this is the difference between 1.25 million page reads and 4 page reads — more than five orders of magnitude less I/O.
- ✓ Databases operate in fixed-size pages (8KB in PostgreSQL). Every disk read retrieves at least one full page. Indexes work by reducing the number of pages that need to be read to answer a query.
- ✓ B+ trees are the standard index structure because they are wide (thousands of children per node), shallow (3-4 levels for billions of rows), and have linked leaf nodes enabling efficient range scans. Binary search trees are too deep for disk-based storage.
- ✓ The linked leaf node list in a B+ tree is what makes range queries fast. Find the first match via tree traversal, then walk forward through the linked list — no re-traversal needed for each subsequent match.
- ✓ Clustered index: table rows physically stored in index key order — one per table. Non-clustered index: separate structure with pointers to heap rows — many per table. Clustered is fastest for range queries on the key; non-clustered adds a heap lookup step.
- ✓ Composite index leftmost prefix rule: index on (A,B,C) efficiently serves queries filtering on A, (A,B), or (A,B,C). It cannot efficiently serve queries filtering on B alone, C alone, or (B,C) without A. Column order in composite indexes is a critical design decision.
- ✓ Covering index: all columns needed by a query are in the index — enables Index Only Scan with zero table page reads. Design by putting WHERE/ORDER BY columns in the index key and additional SELECT columns in INCLUDE.
- ✓ Hash indexes are O(1) for equality lookups but useless for range queries and ordering. B+ tree is the safe default for almost every use case; a hash index is a micro-optimisation for columns only ever compared by equality.
- ✓ Partial indexes cover only rows satisfying a WHERE condition. A partial index on 1% of rows is 100x smaller, faster, and cheaper to maintain than a full index. Use for queries that consistently target a selective subset.
- ✓ Do NOT index: small tables (full scan is fast), low-cardinality columns (index returns too many rows), write-heavy tables with rare reads (write overhead exceeds read benefit), columns never used in WHERE/JOIN/ORDER BY. Use EXPLAIN ANALYZE to diagnose and pg_stat_user_indexes to find unused indexes.