Storage & File Organization
The physical foundation everything else rests on — how databases talk to disks, how data is laid out in pages, how files are organised for fast access, and how the buffer pool keeps the most useful data in RAM.
The Storage Hierarchy — Why Physical Storage Determines Everything
Every design decision in a DBMS — how data is organised on disk, how the buffer pool works, why B+ trees have high fan-out, why sequential reads are preferred over random reads — traces back to one fundamental constraint: different storage media have wildly different performance characteristics, and most database data lives on the slowest of them.
The storage hierarchy orders storage types by speed and capacity. The fastest storage (CPU registers, L1 cache) holds bytes at a time and operates in nanoseconds. The slowest practical storage (spinning hard drives) holds terabytes but operates in milliseconds — a million times slower. Understanding this hierarchy is understanding why database engineering is fundamentally a problem of minimising slow storage accesses.
The Fundamental Access Pattern Rule
The storage hierarchy creates one rule that governs all of database storage design: minimise random I/O, maximise sequential I/O. On a spinning disk, reading 1 MB sequentially takes under 10ms. Reading the same 1 MB as 256 random 4KB reads takes roughly 3 seconds (at ~12ms per random access) — hundreds of times slower. On NVMe SSDs, the gap is smaller but still significant (sequential bandwidth is typically 2–5× higher than random-access throughput). Every data structure and file organisation decision in this module is ultimately about this principle.
// SPINNING DISK:
// Sequential read throughput: ~150 MB/s
// Random read latency: ~8ms (seek + rotational delay)
// Random read throughput: 1 read / 8ms = 125 IOPS = ~500 KB/s
// RATIO: sequential is 300× faster for the same data volume
// NVMe SSD:
// Sequential read throughput: ~5,000 MB/s
// Random read latency: ~100µs
// Random read IOPS: ~500,000 = ~2,000 MB/s
// RATIO: sequential is ~2.5× faster
// DRAM (buffer pool):
// Sequential throughput: ~50,000 MB/s
// Random access latency: ~100ns
// Effectively no penalty for random access in DRAM
// IMPLICATION FOR DATABASE DESIGN:
// Full table scan (sequential): read all N pages in order
// → exploits OS prefetching, disk elevator algorithm, hardware readahead
// → cost: N × (page_size / sequential_throughput)
// Index scan with many random heap fetches (random):
// → each page fetch may require a separate disk seek
// → at some selectivity threshold (~15-20% of table), full scan beats index scan
// → this is why PostgreSQL's planner sometimes chooses seq scan over index scan
// B+ tree design: high fan-out minimises height, minimises random reads (one per level)
// Bulk loading: fills pages sequentially, exploiting sequential write throughput
// Clustered index: keeps related rows physically adjacent → range scan = sequential read

Disk Mechanics — Why Seek Time Is the Enemy
Understanding spinning disk mechanics explains why every database optimisation targets sequential access. Even as SSDs become dominant, the mental model of disk mechanics explains buffer pool design, file organisation choices, and index structure decisions that remain relevant regardless of storage medium.
Platter: Circular magnetic disks that spin at 5,400–15,000 RPM. Data is encoded magnetically on concentric circular tracks. Multiple platters share a common spindle.
Read/write head: Electromagnetic head that floats nanometres above the platter surface. One head per platter surface. All heads move together on a common arm — seeking one track moves all heads simultaneously.
Track: A concentric circle on a platter. The outermost track is track 0. Each track is divided into sectors (typically 512 bytes or 4KB per sector). Data is written sector by sector along a track.
Cylinder: The set of tracks at the same radial position across all platters and both surfaces. Data on the same cylinder can be accessed without moving the arm — important for sequential multi-platter reads.
Seek time: Time to move the read/write arm to the correct track. 3–12ms typical. This is the dominant cost of random disk access and the primary reason random I/O is so expensive.
Rotational latency: Time waiting for the desired sector to rotate under the head after the arm has seeked. Average = half a rotation. At 7,200 RPM: 60s/7200 × 0.5 ≈ 4ms average.
Transfer time: Time to actually read/write the data once the head is positioned. Proportional to data size. At 150 MB/s, reading 8KB takes ~0.05ms — tiny compared to seek + rotational latency.
Total access time: Seek time + Rotational latency + Transfer time. For a random 8KB read: ~8ms + ~4ms + 0.05ms ≈ 12ms. For sequential: only transfer time after the first seek, so ~0.05ms per 8KB page.
// RANDOM READ ACCESS TIME (spinning disk):
// seek_time = 8ms (average arm movement)
// rot_latency = 4ms (average half-rotation at 7200 RPM)
// transfer_time = 0.05ms (8KB page at 150 MB/s)
// total = 12.05ms per random page read
// IOPS = 1000ms / 12.05ms ≈ 83 random IOPS
// SEQUENTIAL READ THROUGHPUT:
// After initial seek + rotation (12ms for first page):
// Each subsequent page: only transfer_time = 0.05ms
// Because: head stays at the same track, platters keep spinning
// Effective throughput ≈ 150 MB/s = 18,750 pages/second
// COMPARISON:
// Random: 83 IOPS × 8KB = 664 KB/s effective throughput
// Sequential: 18,750 pages/s × 8KB = 150 MB/s
// Sequential is 225× faster for the same data volume
// HOW THIS SHAPES DATABASE DESIGN:
// 1. Full table scan: sequential — reads pages in disk order
// → 18,750 pages/second → 150 MB/s
// 2. Index scan (many heap fetches): random — one seek per row
// → 83 fetches/second → 664 KB/s
// Crossover: at ~0.44% selectivity (664/150,000 ≈ 0.44%)
// For selectivity > 0.44%: full scan can be FASTER than index scan
// This is exactly why PostgreSQL sometimes ignores your index
// SSDS CHANGE THE NUMBERS:
// NVMe: seek ≈ 0, rot ≈ 0, transfer ≈ 100µs per random 4KB read
// Sequential NVMe: 5,000 MB/s = still 2.5× faster than random NVMe
// The principle remains: sequential > random, just less dramatically

Pages and Blocks — The Fundamental Unit of Database I/O
Neither the database nor the OS reads individual bytes or rows from disk. All disk I/O happens in fixed-size chunks. The OS works in blocks (typically 4KB, determined by the filesystem). The DBMS works in pages (a configurable multiple of the OS block size). PostgreSQL defaults to 8KB pages. MySQL InnoDB uses 16KB pages. The DBMS page is the unit of all buffer pool management — pages are loaded into and evicted from the buffer pool as complete units.
The choice of page size involves a trade-off. Larger pages mean fewer I/O operations to read the same amount of data (better for sequential scans) but waste more space when only part of a page is needed (worse for point lookups on sparse data). 8KB is the sweet spot for most OLTP workloads because it balances scan efficiency with per-lookup overhead.
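The arithmetic behind the page-size trade-off is easy to check. A short Python sketch (the 200-byte average row width and 24-byte page header are illustrative assumptions):

```python
import math

PAGE_SIZE = 8192       # PostgreSQL default page size (bytes)
PAGE_HEADER = 24       # approximate fixed page header (assumption)
ROW_SIZE = 200         # assumed average row width

rows_per_page = (PAGE_SIZE - PAGE_HEADER) // ROW_SIZE
table_rows = 1_000_000
pages_to_read = math.ceil(table_rows / rows_per_page)

print(rows_per_page)   # 40 — this many rows travel together on every page I/O
print(pages_to_read)   # 25000 page reads to scan 1M rows, vs 1M row-sized reads
```

With 40 rows per page, a point lookup wastes 39 rows of each I/O, while a scan amortises the page cost across all of them — which is exactly the trade-off larger pages amplify.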
The Slotted Page Format — How a Page Is Laid Out Internally
Inside each page, the DBMS must track which rows are present, where each row starts, and which slots are free. The standard layout for variable-length records is the slotted page format.
// SLOTTED PAGE STRUCTURE:
// Page header (fixed, at page start):
// page_id: which page this is in the file
// lsn: log sequence number (for crash recovery)
// num_slots: current number of slot entries
// free_space_ptr: pointer to start of free space (between slot array and records)
// flags: dirty flag, page type, etc.
// Slot array (grows forward from end of header):
// Each slot: [offset: 2 bytes | length: 2 bytes]
// slot[0] → (offset=7900, length=48): row 0 starts 7900 bytes into page, is 48 bytes
// slot[1] → (offset=7848, length=52): row 1 starts 7848 bytes into page, is 52 bytes
// slot[2] → (offset=0, length=0): slot 2 is DELETED (marked with null offset)
// Record data (grows backward from end of page):
// Row 0: [customer_id=1 | name=Rahul | city=Bengaluru | age=28] at offset 7900
// Row 1: [customer_id=2 | name=Priya | city=Hyderabad | age=31] at offset 7848
// ROW ID (RID): (page_id, slot_number)
// RID = (page_47, slot_0) → always identifies the same row
// Even if the row MOVES within the page (compaction), slot[0] is updated
// to point to the new offset — RID remains stable
// This is critical for indexes: indexes store RIDs, not byte offsets
// INSERT a new row:
// 1. Check free_space: is there room for the row + one new slot entry?
// 2. Write new row at current free_space_ptr (from the end)
// 3. Decrement free_space_ptr by row size
// 4. Add new slot entry: (offset=new_row_start, length=row_size)
// 5. Increment num_slots
// 6. If no room: request a new page or compact this page (VACUUM)
// DELETE a row (logical delete):
// 1. Set slot[i].offset = 0 (or a null marker)
// 2. The space is NOT immediately reclaimed
// 3. VACUUM later compacts the page: moves remaining rows, reclaims dead space
// UPDATE a row (variable-length change):
// If new row is SAME SIZE or SMALLER: update in place, adjust slot length
// If new row is LARGER: delete old slot, insert new row at free space
// If too large for this page: move to a different page, leave a forwarding pointer
// WHY THIS DESIGN:
// Variable-length records: rows have different sizes (VARCHAR columns)
// The slot array at the front gives O(1) access to any row by slot number
// Records at the back can be compacted without changing the slot array
// RIDs through the slot array remain stable even after compaction

File Organisation — How Pages Are Arranged Into Files
A database table is stored as a collection of pages in one or more files on disk. How those pages are arranged — their logical and physical ordering — determines how efficiently different types of operations can be performed. There are four major file organisation strategies — heap files, sorted files, ISAM, and clustered files — each with distinct performance characteristics for search, insertion, and deletion.
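To make "distinct performance characteristics" concrete, here is a back-of-envelope page-I/O cost model for the two simplest organisations, a heap file versus a sorted file (N is the number of pages in the table; the 25,000 figure is hypothetical):

```python
import math

N = 25_000  # pages in the table (hypothetical)

# Heap file: pages in insertion order, no ordering guarantee
heap = {
    "insert": 1,               # append to the last page: O(1)
    "equality_search": N / 2,  # expected pages scanned without an index: O(N)
    "range_search": N,         # must scan every page: O(N)
}

# Sorted file: pages kept in key order
sorted_file = {
    "insert": N / 2,                  # shift half the file on average: O(N)
    "equality_search": math.log2(N),  # binary search over pages: O(log N)
    "range_search": math.log2(N),     # find the start, then read sequentially
}

print(heap["equality_search"])                    # 12500.0 pages
print(round(sorted_file["equality_search"], 1))   # 14.6 pages
```

The ~850× gap in search cost, against the O(1) vs O(N) gap in insert cost, is the whole trade-off — and why B+ trees (which get O(log N) search without O(N) inserts) displaced both for indexed access.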
Record Formats — Fixed-Length vs Variable-Length Records
How individual records (rows) are laid out within a page affects how efficiently the DBMS can access individual fields, how much space is consumed, and how updates are handled. There are two fundamental record formats.
Fixed-length records: Every record has exactly the same size. Fields are stored at fixed, predetermined byte offsets within the record. To access field N, compute byte_offset = sum of preceding field sizes.
- Field access: O(1) — compute the offset directly
- Space: wasteful for short strings (CHAR pads with spaces)
- Inserts: simple — always the same size
- Updates: always in-place — same size
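That O(1) field access is pure offset arithmetic. A sketch using Python's struct module (the three-column schema is hypothetical):

```python
import struct

# Schema: customer_id int32 | name char(16) | age int32 → every record is 24 bytes
RECORD_FMT = "<i16si"
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 4 + 16 + 4 = 24

record = struct.pack(RECORD_FMT, 1, b"Rahul".ljust(16), 28)

# Read field 2 (age) directly: offset = 4 (id) + 16 (name) = 20 — no parsing
(age,) = struct.unpack_from("<i", record, offset=20)
print(age)   # 28
```

Note the space cost: "Rahul" occupies 16 bytes regardless of its 5-character length — the CHAR padding waste mentioned above.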
Variable-length records: Records have different sizes depending on actual data values. VARCHAR, TEXT, and BLOB columns are variable-length. The record stores both data and metadata about field boundaries.
- Field access: O(n) — must parse the header to find variable fields
- Space: efficient — stores only the actual string length
- Inserts: require fitting the actual record size
- Updates: if the new value is larger, the row may need to move to a different page
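A variable-length record instead carries a small header of field end-offsets, so every field access goes through the header first. A toy encoding (this exact layout is illustrative — it is not PostgreSQL's actual tuple format):

```python
import struct

def pack_record(fields: list) -> bytes:
    # Header: field count, then the cumulative end offset of each field's data
    body = b"".join(fields)
    offsets, pos = [], 0
    for f in fields:
        pos += len(f)
        offsets.append(pos)
    header = struct.pack(f"<H{len(fields)}H", len(fields), *offsets)
    return header + body

def get_field(record: bytes, i: int) -> bytes:
    # O(header) work before any field can be located
    (n,) = struct.unpack_from("<H", record, 0)
    offsets = struct.unpack_from(f"<{n}H", record, 2)
    base = 2 + 2 * n                       # body starts after the header
    start = offsets[i - 1] if i else 0
    return record[base + start : base + offsets[i]]

rec = pack_record([b"2", b"Priya", b"Hyderabad"])
print(get_field(rec, 2))   # b'Hyderabad'
```

Each field costs only its actual length plus 2 header bytes — the space win over CHAR padding — at the price of header parsing on every access.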
TOAST — Handling Very Large Values
What happens when a single field value is larger than an entire page? A TEXT column storing a 10MB document cannot fit in an 8KB page. PostgreSQL handles this with TOAST (The Oversized-Attribute Storage Technique). Values larger than approximately 2KB are automatically compressed or moved to a separate TOAST table, with a pointer stored in the main row.
-- TOAST strategies per column (PostgreSQL):
-- PLAIN: no compression, no out-of-line storage (for small fixed types)
-- EXTENDED: compress first, then out-of-line if still too large (default for TEXT, JSONB)
-- EXTERNAL: out-of-line without compression (fast access, more space)
-- MAIN: compress before considering out-of-line
-- Check a table's TOAST table:
SELECT relname, reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'products';
-- Check column storage strategies:
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'products'::regclass AND attnum > 0;
-- attstorage: 'x' = extended (default), 'e' = external, 'm' = main, 'p' = plain
-- TOAST impact on queries:
-- SELECT * FROM articles: always decompresses/fetches large TEXT columns
-- SELECT id, title FROM articles: does NOT touch TOAST if TEXT columns not selected
-- Always select only needed columns to avoid unnecessary TOAST access

Buffer Pool Management — The DBMS's Window Into Disk
The buffer pool is the DBMS-managed area of main memory (RAM) that caches disk pages. It is the single most important performance component in a database system. When a query needs a page, the DBMS checks the buffer pool first. If the page is there (a hit), it is returned in nanoseconds. If not (a miss), the DBMS must read it from disk (milliseconds) and load it into the buffer pool, possibly evicting another page to make room.
The buffer pool is organised as an array of fixed-size frames, each holding exactly one page. A page table maps page IDs to frame numbers. Each frame also maintains a dirty bit (modified since it was read from disk — must be written back before eviction) and a pin count (number of currently active threads using this frame — cannot be evicted while pinned).
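That bookkeeping — frames, a page table, dirty bits, pin counts — can be sketched in a few lines of Python. This is a naive model: eviction here just takes the first unpinned frame, where a real system would apply one of the replacement policies discussed later in this module:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

PAGE_SIZE = 8192

@dataclass
class Frame:
    page_id: Optional[int] = None
    dirty: bool = False        # modified since last read from disk?
    pin_count: int = 0         # active users; cannot evict while > 0
    data: bytearray = field(default_factory=lambda: bytearray(PAGE_SIZE))

class BufferPool:
    def __init__(self, n_frames: int, read_page: Callable[[int], bytes]):
        self.frames: List[Frame] = [Frame() for _ in range(n_frames)]
        self.page_table: Dict[int, int] = {}   # page_id -> frame index
        self.read_page = read_page             # stands in for a disk read

    def pin(self, page_id: int) -> int:
        if page_id in self.page_table:                 # HIT: pure RAM access
            idx = self.page_table[page_id]
        else:                                          # MISS: evict, then read
            idx = next(i for i, f in enumerate(self.frames) if f.pin_count == 0)
            victim = self.frames[idx]
            # a real system writes victim.data back to disk here if victim.dirty
            if victim.page_id is not None:
                del self.page_table[victim.page_id]
            victim.page_id, victim.dirty = page_id, False
            victim.data[:] = self.read_page(page_id)   # the slow disk I/O
            self.page_table[page_id] = idx
        self.frames[idx].pin_count += 1
        return idx

    def unpin(self, idx: int, modified: bool = False) -> None:
        self.frames[idx].pin_count -= 1
        if modified:
            self.frames[idx].dirty = True   # write-back: defer the disk write
```

Usage mirrors the request flow below: `pin` a page, read or modify it in RAM, then `unpin` it, passing `modified=True` so the frame is marked dirty rather than flushed immediately.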
Buffer Pool Mechanics — The Complete Request Flow
// HOW THE BUFFER POOL HANDLES A PAGE REQUEST:
// Step 1: Query needs page P (e.g., page 47 of orders table)
// Step 2: Look up page table: is page 47 in any frame?
// CASE A: PAGE HIT (page is in buffer pool)
// Page 47 found in frame 23
// Increment pin_count[23] (mark as in-use, cannot evict while pinned)
// Return pointer to frame 23 contents
// Thread accesses the data directly in RAM — O(1) ns
// CASE B: PAGE MISS (page not in buffer pool)
// Step 2b: Find a free frame (evict if necessary)
// a. Find a frame with pin_count = 0 (not currently in use)
// b. If dirty_bit[frame] = 1:
// Write frame contents back to disk first (WAL write)
// Reset dirty_bit = 0
// Step 2c: Read page 47 from disk into the chosen frame
// disk_read(page_47_disk_location → frame_X)
// Cost: ~8-12ms for spinning disk, ~100µs for NVMe
// Step 2d: Update page table: page 47 → frame X
// Step 2e: Pin the frame: pin_count[X]++
// Step 2f: Return pointer to frame X
// When thread FINISHES using the page:
// Decrement pin_count[frame]
// If the page was MODIFIED: set dirty_bit[frame] = 1
// (Do NOT write to disk immediately — buffer pool batches writes)
// BUFFER POOL SIZE MATTERS ENORMOUSLY:
-- Check PostgreSQL buffer pool configuration:
SHOW shared_buffers; -- default: 128MB (way too small for production)
-- Recommendation: 25% of RAM for a dedicated database server
-- 32GB RAM server: shared_buffers = 8GB
-- This allows 8GB / 8KB = 1,048,576 pages in the buffer pool
-- Check buffer pool hit rate (should be > 99% for OLTP):
SELECT
sum(heap_blks_hit) AS heap_hits,
sum(heap_blks_read) AS heap_reads,
ROUND(
sum(heap_blks_hit)::numeric
/ NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100,
2
) AS hit_rate_pct
FROM pg_statio_user_tables;
-- If hit_rate_pct < 95%: increase shared_buffers or add RAM

Page Replacement Policies — Which Page to Evict
When the buffer pool is full and a new page must be loaded, the DBMS must choose which existing page to evict. This is the page replacement problem — the same problem CPU caches face. The choice of replacement policy significantly affects the buffer pool hit rate.
LRU (Least Recently Used): Evict the page that was least recently accessed. The intuition: if a page hasn't been used in a long time, it is unlikely to be used soon. Tracks a timestamp or order of last access for every frame.
Implementation: Maintain a doubly-linked list of frames ordered by last access time. On each access, move the frame to the head of the list. Evict from the tail. O(1) per access with a hashmap for frame lookup.
The LRU problem (sequential flooding): A full table scan reads every page in the table exactly once. With basic LRU, these scan pages flood the buffer pool, evicting the frequently-used index and "hot" pages that are accessed constantly. After the scan, those hot pages must be re-read from disk. The scan, which needed each page only once, has destroyed the working set.
Clock (second chance): An efficient approximation of LRU using a circular buffer with a single "clock hand" pointer. Each frame has a reference bit (0 or 1). When a frame is accessed: set reference bit to 1. When the clock hand sweeps to a frame with bit=1: reset it to 0 and advance (give it a second chance). When the hand reaches a frame with bit=0: evict it.
// CLOCK ALGORITHM:
// frames = circular array of buffer frames
// hand = current position of clock hand
// ref_bit[i] = 0 or 1 for each frame
// On PAGE ACCESS (hit or after loading):
ref_bit[frame] = 1 // mark recently used
// On EVICTION NEEDED:
while True:
if pin_count[hand] > 0:
hand = (hand + 1) % total_frames // skip pinned frames
elif ref_bit[hand] == 1:
ref_bit[hand] = 0 // give second chance
hand = (hand + 1) % total_frames
else:
// ref_bit[hand] == 0: evict this frame
if dirty_bit[hand]:
write frame to disk
load new page into hand's frame
ref_bit[hand] = 1
break
// ADVANTAGE: O(1) amortised, much lower overhead than true LRU
// (no linked list manipulation on every access)
// USED BY: PostgreSQL's buffer pool uses a clock-sweep algorithm

MRU (Most Recently Used): Evict the most recently used page instead of the least recently used. This sounds counterintuitive but is optimal for sequential table scans — a page just scanned will not be needed again during the scan, so evicting it immediately frees space for the next page.
When to use MRU: specifically for pages being read by a sequential scan. Modern databases use "buffer ring" strategies where sequential scans use a small private ring buffer (a few frames cycling with MRU) instead of the main buffer pool — preventing scan pages from evicting the working set.
-- PostgreSQL buffer ring for sequential scans:
-- A sequential scan larger than 1/4 of shared_buffers gets a "bulk read" ring
-- The ring has only ~256KB (32 pages) of buffer frames
-- Scan pages cycle through this ring — never polluting the main buffer pool
-- The main buffer pool's working set is preserved
-- This is why a large table scan doesn't destroy buffer pool performance
-- for concurrent OLTP queries in PostgreSQL.
-- Check if a scan is using the ring strategy:
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM orders;
-- "Buffers: shared hit=X read=Y"
-- If Y is large but hit rate for other queries is still high:
-- The ring strategy is working correctly

LRU-K: Tracks the timestamps of the last K accesses to each page (not just the most recent). The eviction candidate is the page whose K-th most recent access was furthest in the past. With K=2, pages that have only been accessed once (like scan pages) have no second access timestamp and are treated as if their second access was infinitely far in the past — making them the first eviction candidates. Frequently accessed pages (many recent accesses) are protected.
LRU-2 is the most common choice. It effectively solves the sequential flooding problem: scan pages (accessed once) are immediately vulnerable to eviction. Index root pages (accessed thousands of times per second) have very recent K-th access timestamps and are practically never evicted.
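The LRU-2 bookkeeping is compact enough to sketch directly (an illustrative model, not any particular engine's implementation):

```python
import math

class LRU2:
    """Track the last two access times per page; evict the page whose
    second-most-recent access is oldest (pages seen once count as -inf)."""

    def __init__(self):
        self.history = {}   # page_id -> up to 2 access timestamps, newest last
        self.clock = 0

    def access(self, page_id):
        self.clock += 1
        h = self.history.setdefault(page_id, [])
        h.append(self.clock)
        del h[:-2]                       # keep only the last 2 timestamps

    def victim(self):
        def second_most_recent(pid):
            h = self.history[pid]
            return h[0] if len(h) == 2 else -math.inf
        return min(self.history, key=second_most_recent)

pool = LRU2()
for _ in range(3):
    pool.access("index_root")            # hot page: accessed repeatedly
for p in ("scan_p1", "scan_p2"):
    pool.access(p)                       # table-scan pages: touched once

print(pool.victim())   # scan_p1 — a one-touch scan page, never the hot page
```

The hot page has a real second-access timestamp, so it always ranks above the one-touch scan pages — exactly the sequential-flooding protection described above.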
Write-Back vs Write-Through — When Dirty Pages Hit Disk
// WRITE-THROUGH: write to disk immediately on every modification
// Pro: disk always has current data — simple recovery
// Con: every write requires a disk I/O — catastrophically slow for write-heavy workloads
// NOT used by production databases (except for specific WAL writes)
// WRITE-BACK (standard in all production databases):
// Modifications go to buffer pool only (dirty bit = 1)
// Dirty pages written to disk:
// a. When evicted from buffer pool (must write before evicting)
// b. By background writer process (periodically, before frames get too dirty)
// c. At CHECKPOINT (all dirty pages flushed to disk)
// Pro: batches writes, dramatically reduces disk I/O
// Con: data in buffer pool is newer than data on disk
// crash between modification and disk write → need WAL for recovery
-- PostgreSQL background writer (bgwriter) configuration:
SHOW bgwriter_delay; -- default: 200ms between write rounds
SHOW bgwriter_lru_maxpages; -- max pages written per round (default: 100)
SHOW bgwriter_lru_multiplier; -- lookahead for pages likely to be needed soon
-- CHECKPOINT: flush all dirty buffer pool pages to disk
-- Occurs at: explicit CHECKPOINT command, wal segment filled, periodic timeout
SHOW checkpoint_completion_target; -- default: 0.9 (spread writes over 90% of interval)
SHOW checkpoint_timeout; -- default: 5min (checkpoint at least every 5min)
-- After a crash: WAL replay restores any dirty pages not yet flushed
-- The checkpoint determines how far back WAL replay must go
-- More frequent checkpoints = faster crash recovery, more write I/O
-- Less frequent checkpoints = slower crash recovery, less write I/O

Row Store vs Column Store — Why Data Warehouses Store Data Differently
All the file organisations discussed so far are row-oriented (NSM — N-ary Storage Model): all columns of a row are stored together on the same page. This is optimal for OLTP workloads that access entire rows (SELECT * FROM orders WHERE order_id = 5001 — fetch all columns of one row).
Data warehouse (OLAP) workloads have a completely different access pattern: they access a few columns from millions of rows (SELECT AVG(total_amount), COUNT(*), city FROM orders GROUP BY city — three columns from all rows). For this pattern, row storage is terrible — every page read brings in all columns, but only 3 of perhaps 20 columns are needed. 85% of every I/O is wasted reading irrelevant columns. Columnar storage (DSM — Decomposition Storage Model) solves this.
Row store (NSM):
- Point lookup (all columns of one row): fast — the whole row sits on one page
- INSERT one row: fast — written to a single page
- UPDATE one row (few columns): fast — modified in place
- Full column scan (SELECT AVG(amount) FROM orders): slow — every page drags in all columns
- Compression (mixed types per page): poor

Column store (DSM):
- Point lookup (all columns of one row): slow — one fetch per column file
- INSERT one row (must update all column files): slow
- Full column scan (SELECT AVG(amount)): fast — reads only the needed column
- Compression (same type per file → RLE, delta encoding): excellent
- SIMD vectorised operations on column arrays: supported
Column Store Compression — Why Columnar Data Compresses So Well
Storing all values of one column together enables dramatically better compression than row stores, because adjacent values in a column are often similar or even identical. Several compression techniques are especially effective on columnar data:
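These techniques are simple enough to demonstrate in a few lines of Python (a toy sketch of run-length, delta, and dictionary encoding — not a real columnar engine):

```python
from itertools import groupby

def rle_encode(values):
    # run-length encoding: collapse each run into (value, run_length)
    return [(v, len(list(g))) for v, g in groupby(values)]

def delta_encode(values):
    # keep the first value, then store successive differences
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def dict_encode(values):
    # replace each distinct string with a small integer code
    codes, dictionary = [], {}
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

cities = ["Bengaluru"] * 4 + ["Hyderabad"] * 2 + ["Mumbai"]
print(rle_encode(cities))        # [('Bengaluru', 4), ('Hyderabad', 2), ('Mumbai', 1)]

timestamps = [1704067200, 1704067260, 1704067320, 1704067380]
print(delta_encode(timestamps))  # [1704067200, 60, 60, 60]

dictionary, codes = dict_encode(["Bengaluru", "Hyderabad", "Mumbai", "Bengaluru"])
print(codes)                     # [0, 1, 2, 0]
```

Note how sorting helps RLE (runs get longer) and how delta-encoded timestamps become tiny repeated values — which in turn RLE-compress extremely well, so real engines chain these encodings.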
// 1. RUN-LENGTH ENCODING (RLE): compress runs of repeated values
// city column (sorted): [Bengaluru, Bengaluru, Bengaluru, Bengaluru, Hyderabad, Hyderabad, Mumbai]
// RLE compressed: [(Bengaluru, 4), (Hyderabad, 2), (Mumbai, 1)]
// Compression ratio: 7 values → 3 entries (57% smaller)
// Works best on: low-cardinality sorted columns (status, country, category)
// 2. DELTA ENCODING: store differences instead of absolute values
// timestamp column: [1704067200, 1704067260, 1704067320, 1704067380, ...]
// Delta encoded: [1704067200, +60, +60, +60, ...] (first value + deltas)
// For regularly-spaced timestamps: [base, step=60, count=N] → 3 values for N rows!
// Works best on: monotonically increasing or slowly changing numeric values
// 3. DICTIONARY ENCODING: replace strings with integer codes
// city column: [Bengaluru, Hyderabad, Mumbai, Bengaluru, Pune, Bengaluru, ...]
// Dictionary: {0: Bengaluru, 1: Hyderabad, 2: Mumbai, 3: Pune}
// Encoded: [0, 1, 2, 0, 3, 0, ...]
// String comparisons → integer comparisons (much faster)
// Works best on: moderate-cardinality string columns (city, category, status)
// 4. BIT-PACKING: store small integers using minimal bits
// A column with values 0-15 needs only 4 bits per value, not 4 bytes
// Column with 1,000,000 rows and max value 255: 1MB instead of 4MB
// COMBINED IMPACT:
-- Snowflake reports: typical columnar compression ratio 5-10×
-- A 100GB row-store table becomes 10-20GB columnar
-- Fewer pages to read → fewer I/Os → faster queries even without algorithmic changes
-- Plus: vectorised SIMD operations on contiguous same-type arrays: 4-16× faster computation
// USED BY:
-- Snowflake, BigQuery, Redshift, ClickHouse: native columnar storage
-- PostgreSQL: heap file (row store) + optional columnar extensions (e.g. Citus columnar)
-- Apache Parquet: columnar file format for data lakes (used by Spark, Pandas, dbt)

Tuning a Production PostgreSQL Instance — Storage and Buffer Pool
This is the checklist a data engineer or DBA runs when setting up or tuning a PostgreSQL instance for production workloads. Every item connects to a concept from this module.
# ─────────────────────────────────────────────────────────────────
# MEMORY SETTINGS
# ─────────────────────────────────────────────────────────────────
# Buffer pool — the most important setting. Set to 25% of total RAM.
# On a 64GB RAM server:
shared_buffers = 16GB
# Per-sort/hash work memory. Affects sort, hash join, hash aggregate.
# Set to: RAM_for_connections / (max_connections × 3)
# On 64GB server, 100 connections: 64GB × 0.75 / (100 × 3) ≈ 160MB
work_mem = 128MB
# WARNING: each query node (sort, hash join) gets work_mem
# A complex query with 5 operations uses 5 × 128MB = 640MB
# Memory for VACUUM and CREATE INDEX operations
maintenance_work_mem = 2GB
# OS cache (effective_cache_size tells the planner how much RAM is available
# for caching, including OS page cache + shared_buffers)
effective_cache_size = 48GB # shared_buffers + OS cache estimate
# ─────────────────────────────────────────────────────────────────
# STORAGE TYPE SETTINGS
# ─────────────────────────────────────────────────────────────────
# For NVMe SSD (random I/O almost as fast as sequential):
random_page_cost = 1.1 # default is 4.0 (for spinning disks)
seq_page_cost = 1.0
# With random_page_cost = 1.1: planner will aggressively use indexes
# With random_page_cost = 4.0: planner prefers seq scans for lower selectivity
# ─────────────────────────────────────────────────────────────────
# WRITE-AHEAD LOG (WAL) SETTINGS
# ─────────────────────────────────────────────────────────────────
# Durability vs performance trade-off:
synchronous_commit = on # default: durable, ~1-10ms added per commit
# For non-critical writes (logging, analytics): off saves commit latency, at the
# cost of losing up to ~3 × wal_writer_delay of recently committed transactions on crash
# WAL buffer (in-memory buffer before WAL is flushed to disk)
wal_buffers = 64MB # default: -1 (auto = 1/32 of shared_buffers, capped at one 16MB WAL segment)
# ─────────────────────────────────────────────────────────────────
# CHECKPOINT SETTINGS
# ─────────────────────────────────────────────────────────────────
# Spread checkpoint I/O over this fraction of the checkpoint interval:
checkpoint_completion_target = 0.9 # write dirty pages over 90% of interval
# Prevents I/O spike at checkpoint time
checkpoint_timeout = 15min # maximum time between checkpoints
max_wal_size = 4GB # trigger checkpoint if WAL grows past this
# ─────────────────────────────────────────────────────────────────
# AUTOVACUUM (MVCC dead tuple cleanup)
# ─────────────────────────────────────────────────────────────────
autovacuum = on
autovacuum_vacuum_scale_factor = 0.1 # vacuum when 10% of rows are dead
autovacuum_analyze_scale_factor = 0.05 # analyze when 5% of rows changed
autovacuum_vacuum_cost_delay = 2ms # IO throttle for autovacuum
# ─────────────────────────────────────────────────────────────────
# MONITORING COMMANDS
# ─────────────────────────────────────────────────────────────────
-- Check buffer pool hit rate:
SELECT ROUND(100 * sum(blks_hit) / NULLIF(sum(blks_hit) + sum(blks_read), 0), 2)
AS hit_rate_pct FROM pg_stat_database;
-- Target: > 99% for OLTP workloads
-- Check pages currently in buffer pool:
SELECT c.relname, count(*) AS pages_in_pool,
pg_size_pretty(count(*) * 8192) AS size_in_pool
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
GROUP BY c.relname ORDER BY pages_in_pool DESC LIMIT 20;

Storage Interview Questions — Complete Answers
Q: Why do databases read and write fixed-size pages instead of individual rows?
Disk I/O operates in fixed-size blocks — the hardware reads or writes a minimum of one block at a time (typically 512 bytes to 4KB). Reading a single byte requires reading an entire block containing it. Database pages are sized to match or exceed the disk block size (PostgreSQL uses 8KB, MySQL InnoDB uses 16KB) so that each I/O operation is maximally useful. Reading one page brings in approximately 40 rows (for a 200-byte average row). If rows were accessed individually, the disk would need one I/O per row — 40× more I/O for the same data. Pages are also the unit of buffer pool management — it is practical to track 1 million 8KB frames in a buffer pool, whereas tracking 40 million individual row locations would be impractical.
Q: What is the difference between a dirty page and a pinned page?
A dirty page is a buffer pool frame that has been modified since it was last read from disk — its in-memory content is newer than the disk copy. Before a dirty page can be evicted from the buffer pool, its content must be written back to disk (or at least its changes recorded in the WAL) to preserve durability. A pinned page is a frame that is currently in use by one or more active threads — its pin count is greater than zero. A pinned frame cannot be evicted regardless of its dirty status, because evicting it while a thread is reading from or writing to it would cause data corruption. The pin count is incremented when a thread starts using a page and decremented when it finishes. A frame can be both dirty (modified) and pinned (in use) simultaneously.
Q: Why does the query optimiser sometimes prefer a sequential scan over an index scan?
An index scan on a non-clustered index involves two steps: read the index to get the row ID, then fetch the actual row from the heap (a random I/O). If the query returns many rows, each row fetch is a separate random disk access — on spinning disk each takes ~8-12ms. A sequential scan reads the entire table in physical page order, exploiting sequential I/O (which is 200-300× faster than random I/O per unit of data on spinning disk). The crossover point: when more than approximately 15-20% of the table rows match the query, a sequential scan reads less total data (all rows read sequentially) vs the index scan (random heap fetches for each matching row). The database query optimiser computes this crossover using the table's statistics (row count, selectivity estimates) and the cost parameters (random_page_cost vs seq_page_cost). Setting random_page_cost = 1.1 for SSDs (where random I/O is much cheaper) causes the optimiser to prefer index scans at much lower selectivity thresholds.
Q: What is LRU sequential flooding, and how do databases prevent it?
LRU sequential flooding occurs when a full table scan fills the buffer pool with pages that are each needed only once — while evicting the frequently-accessed "hot" pages (index root nodes, popular rows) that were providing high buffer pool hit rates. After the scan, all those hot pages must be re-read from disk, causing a spike in disk I/O and query latency. Databases solve this several ways. PostgreSQL uses a "buffer ring" strategy: sequential scans larger than 1/4 of shared_buffers use a small dedicated ring of ~32 pages with MRU replacement. Scan pages cycle through this ring and never touch the main buffer pool. LRU-K (especially LRU-2) is another solution: pages accessed only once (scan pages) have no second access timestamp and are immediately eligible for eviction, while frequently accessed pages (with many recent accesses) are protected. SQL Server uses a "stolen" buffer pool strategy where sequential scan allocations are tracked separately and are first candidates for eviction.
Q: When would you choose columnar storage over row storage?
Choose columnar storage for analytical (OLAP) workloads that scan a small number of columns across a large number of rows — aggregations, GROUP BY, time-series analysis, reporting. The benefits are: (1) I/O reduction — only the needed columns are read, avoiding irrelevant columns (a query on 3 of 20 columns reads 85% less data). (2) Better compression — a column file contains one data type, enabling run-length encoding, delta encoding, and dictionary encoding that can achieve 5-10× compression ratios, further reducing I/O. (3) SIMD vectorisation — processors can apply the same operation to multiple values simultaneously when they are packed contiguously with the same type. Choose row storage for OLTP workloads that access complete rows — point lookups by primary key, inserts, updates, deletes. Inserting one row into a column store requires updating all column files. Use both: row store for transactional data, column store for the analytical data warehouse built from it.
🎯 Key Takeaways
- ✓ Storage hierarchy from fastest to slowest: CPU registers → L1/L2/L3 cache → DRAM (buffer pool) → NVMe SSD → SATA SSD → spinning HDD → network storage. The DBMS is primarily concerned with the DRAM↔disk boundary. Everything above DRAM is managed automatically by the CPU.
- ✓ Sequential I/O is 200-300× faster than random I/O on spinning disk, and still 2-5× faster on SSDs. All database storage design — page size, file organisation, B+ tree fan-out, bulk loading — optimises for sequential access and minimises random I/O.
- ✓ Disk access time = seek time + rotational latency + transfer time. Random 8KB read on HDD: ~12ms. Sequential reads after the first: ~0.05ms each. This gap explains why full table scans sometimes beat index scans at high selectivity.
- ✓ Pages (8KB in PostgreSQL, 16KB in MySQL InnoDB) are the fundamental unit of all database I/O. The buffer pool holds pages in RAM frames. The slotted page format stores variable-length records with a slot array at the front and records growing backward from the end.
- ✓ Heap file: unordered, fast inserts (O(1)), O(n) scan without index, default in PostgreSQL. Sorted file: O(log n) range scans, expensive inserts. ISAM: historical, superseded by B+ trees. Clustered file: related tables interleaved for fast joins.
- ✓ Buffer pool: array of frames in DRAM, page table maps page IDs to frames, dirty bit tracks modifications, pin count prevents eviction of in-use pages. Hit rate should exceed 99% for OLTP. Set shared_buffers to 25% of RAM.
- ✓ LRU replacement policy: evict least recently used page. Clock policy: efficient approximation with reference bits, O(1). MRU: optimal for sequential scan pages — evict the most recently used. LRU-K: tracks K most recent accesses, protects frequently-used pages from scan flooding.
- ✓ Write-back: modifications held in buffer pool until eviction or checkpoint. Dirty pages must be written to disk (or WAL must record changes) before a frame can be reused. Checkpoint: all dirty pages flushed to disk, establishes recovery starting point.
- ✓ Row store (NSM): all columns of a row stored together. Optimal for OLTP (point lookups, inserts, updates). Column store (DSM): each column stored separately. Optimal for OLAP (aggregations, few columns across many rows). 5-10× better compression via RLE, delta encoding, dictionary encoding.
- ✓ The practical architecture: row-store transactional database (PostgreSQL/MySQL) for application writes → ETL/ELT pipeline → columnar data warehouse (Snowflake/BigQuery/Redshift) for analytics. This separation is the foundation of every modern data engineering stack.