Databases — What They Are and How They Work Internally
Storage engines, B-trees, indexes, buffer pools, WAL — the inside story.
Why Every Data Engineer Must Understand Database Internals
Most people learn databases from the outside — they learn SQL syntax, how to create tables, how to run queries. This is enough to get data out of a database. It is not enough to understand why a query takes 4 minutes instead of 4 seconds, why a pipeline slows down a live application, why two transactions that seem independent can deadlock, or why replication lag causes data inconsistencies in your pipelines.
Every database you will ever work with — PostgreSQL, MySQL, Snowflake, MongoDB, Redis, Cassandra — has internals built on the same small set of fundamental concepts. A B-tree is a B-tree whether you are in PostgreSQL or MongoDB. Write-Ahead Logging exists in PostgreSQL, MySQL, SQLite, and modern cloud warehouses. Buffer pool management appears in every database that reads from disk.
Learn these concepts once, at depth, and every database you encounter for the rest of your career becomes easier to understand, debug, and optimise.
Storage Engines — How Data Is Physically Organised on Disk
A storage engine is the component of a database responsible for how data is physically written to and read from disk. The rest of the database — query parsing, optimisation, transaction management — sits on top of the storage engine. The engine's design determines what operations are fast, what operations are slow, and what consistency guarantees the database can offer.
There are two dominant storage engine designs: B-tree based (used by PostgreSQL, MySQL InnoDB, SQLite) and LSM-tree based (used by RocksDB, Cassandra, LevelDB, Apache HBase). Understanding the trade-off between them explains the performance profile of almost every database you will encounter.
B-tree based storage engines
B-tree engines organise data in a tree structure of fixed-size pages (typically 8–16 KB each). Every read and write operation navigates this tree to find the right page. Updates are done in-place — the old value on the page is directly overwritten with the new value.
B-tree page structure (PostgreSQL uses 8 KB pages):
Root page (level 3)
├── Internal page A (level 2) — covers order_id 1–5,000,000
│ ├── Internal page A1 (level 1) — covers 1–500,000
│ │ ├── Leaf page 1 — rows: order_id 1–100
│ │ ├── Leaf page 2 — rows: order_id 101–200
│ │ └── ...
│ ├── Internal page A2 — covers 500,001–1,000,000
│ └── ...
└── Internal page B (level 2) — covers 5,000,001–10,000,000
Leaf pages contain the actual row data (or pointers to it).
Internal pages contain only keys and child page pointers.
What this means for operations:
READ by primary key (order_id = 9284751):
Root → Internal page → Leaf page → row found
= 3–4 page reads regardless of table size — O(log n)
At 100M rows, still only 4 page reads. Fast.
FULL TABLE SCAN (no WHERE clause, or filtering on a non-indexed column):
Must read ALL leaf pages sequentially
100M rows ÷ 100 rows/page = 1M page reads. Slow.
UPDATE existing row:
Find the page (3–4 reads), modify the value in-place,
write the page back. Fast for individual updates.
INSERT new row:
Find the correct leaf page, insert in sorted order.
If page is full → page split (expensive operation).
LSM-tree based storage engines
LSM-tree (Log-Structured Merge-tree) engines work completely differently. They never modify data in-place. Instead, every write is appended to an in-memory buffer. When the buffer fills, it is flushed to a new immutable file on disk. Periodically, the engine merges these files together in a background compaction process.
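This write path condenses into a toy model: a MemTable that flushes to an immutable SSTable at just two entries. `ToyLSM` and its flush threshold are purely illustrative, not any real engine's API:

```python
# Toy LSM engine: writes go to an in-memory MemTable; when it fills, it is
# flushed as an immutable, sorted SSTable. Reads check the MemTable first,
# then SSTables newest-to-oldest.
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # immutable dicts, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # only RAM is touched; old files stay intact
        if len(self.memtable) >= self.limit:
            self.sstables.append(dict(sorted(self.memtable.items())))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest first
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all SSTables, keeping only the newest version of each key.
        merged = {}
        for table in self.sstables:             # later tables overwrite earlier
            merged.update(table)
        self.sstables = [merged]

db = ToyLSM()
db.put(9284751, "placed")
db.put(1, "x")              # triggers flush: SSTable 1
db.put(9284751, "confirmed")
db.put(2, "y")              # triggers flush: SSTable 2
print(db.get(9284751))      # → confirmed (newest version wins)
db.compact()
print(len(db.sstables))     # → 1 after compaction
```

Real engines put a WAL in front of the MemTable and compact in the background; the shape of the read path (memory first, then files newest-to-oldest) is the part that carries over.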
LSM-tree write path (Cassandra, RocksDB, HBase):
WRITE (INSERT or UPDATE order 9284751):
1. Write to in-memory buffer (MemTable) ← very fast
2. Append to Write-Ahead Log (for crash safety) ← sequential write
3. Return success to client ← done, no disk random write
BACKGROUND (when MemTable fills up):
4. Flush MemTable to disk as new SSTable file ← immutable sorted file
5. Periodically COMPACT multiple SSTables ← merge + remove old versions
into fewer, larger sorted files
RESULT of multiple updates to same key:
SSTable 1 (oldest): {order_id: 9284751, status: "placed"}
SSTable 2: {order_id: 9284751, status: "confirmed"}
SSTable 3 (newest): {order_id: 9284751, status: "delivered"}
READ order 9284751: check MemTable first,
then SSTables newest-to-oldest, return first match found.
Bloom filters on each SSTable tell you if key exists
without reading the whole file.
TRADE-OFFS vs B-tree:
✓ Writes are much faster (sequential append, no page splits)
✓ Better write throughput at high velocity
✗ Reads are slower (check multiple SSTables)
✗ Compaction consumes background CPU and I/O
✗ Space amplification (same data in multiple SSTables until compaction)
Best for: write-heavy workloads, time-series data, IoT streams, Kafka consumer state

| Property | B-tree Engine | LSM-tree Engine |
|---|---|---|
| Write speed | Moderate (random writes, possible page splits) | Very fast (sequential append to memory + log) |
| Read speed | Fast (O(log n) via tree navigation) | Moderate (check MemTable + multiple SSTables) |
| Update | In-place overwrite | Append new version, old version removed at compaction |
| Delete | Mark as deleted (vacuum reclaims space) | Write tombstone marker, removed at compaction |
| Space usage | Efficient (updates overwrite) | Higher temporarily (multiple versions until compaction) |
| Best for | Mixed read/write, OLTP workloads | Write-heavy, high-velocity, time-series data |
| Examples | PostgreSQL, MySQL (InnoDB), SQLite, SQL Server | Cassandra, RocksDB, HBase, LevelDB, InfluxDB |
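The Bloom-filter shortcut mentioned in the read path above (answering "might this SSTable contain the key?" without reading the file) can be sketched in a few lines; the bit-array size and hash count here are arbitrary:

```python
import hashlib

class ToyBloomFilter:
    """May say 'possibly present' falsely, but never says 'absent' falsely."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions per key from a salted hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
for order_id in (9284751, 12345, 99999):
    bf.add(order_id)

print(bf.might_contain(9284751))   # → True: added keys are always reported
print(bf.might_contain(55555))     # almost certainly False: skip this SSTable
```

The guarantee that matters for LSM reads is the absence of false negatives: if the filter says "no", the engine can skip the SSTable without any disk I/O.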
Indexes — The Mechanism That Makes Databases Usable
An index is a separate data structure that the database maintains alongside the actual table data. It stores a subset of column values in a sorted, searchable format along with pointers back to the full rows. Without indexes, every query that cannot use the primary key requires reading every row in the table. With the right index, that same query reads only a handful of pages.
Understanding indexes at this level — not just "add an index to make queries faster" but specifically how they work — is what allows you to diagnose why a query is slow, whether an index is being used, and whether an index is hurting write performance more than it helps read performance.
B-tree index structure
A B-tree index for a column stores the column values in sorted order in a tree of pages. The leaf pages of the index store the indexed values alongside pointers (row IDs or primary key values) to the actual table rows. Navigation from root to leaf takes O(log n) time — for a table of 100 million rows with a balanced B-tree of height 4, finding a value requires reading exactly 4 pages.
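That height claim is easy to sanity-check with a back-of-envelope sketch, assuming an illustrative fan-out of 100 entries per page (matching the ~100 rows-per-page figure used elsewhere in this article):

```python
def btree_levels(n_rows, fanout=100):
    """Pages read for one lookup = height of the tree."""
    levels, reach = 1, fanout       # one leaf level covers `fanout` rows
    while reach < n_rows:
        reach *= fanout             # each extra level multiplies coverage
        levels += 1
    return levels

print(btree_levels(100))            # → 1
print(btree_levels(100_000_000))    # → 4 (root + 2 internal levels + leaf)
print(btree_levels(10_000_000_000)) # → 5 (100× more data, one more page read)
```

This is the practical meaning of O(log n): the table can grow a hundredfold and a point lookup costs a single extra page read.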
Table: orders (100 million rows)
Index: idx_orders_city ON orders(city)
INDEX STRUCTURE (simplified):
Root page: [Chennai | Hyderabad | Mumbai]
↓ ↓ ↓
Level 1: [Ahm..Che] [Hy..Mum] [Mum..Pun] [Pun..Vis]
↓
Leaf pages: [Mumbai row_ptr_1, Mumbai row_ptr_2, ...]
[Mumbai row_ptr_3, Mumbai row_ptr_4, ...]
QUERY: SELECT * FROM orders WHERE city = 'Mumbai'
EXECUTION WITH INDEX:
1. Read root page → navigate right subtree (Mumbai > Hyderabad)
2. Read level-1 internal page → find leaf page range for Mumbai
3. Read first leaf page → find first Mumbai entry + row pointer
4. Follow row pointer → read actual table page for this row
5. Continue through leaf pages until city > 'Mumbai'
Total: ~4 index pages + N data pages (one per matching row)
For 50,000 Mumbai orders in 100M rows: ~50,004 page reads
QUERY WITHOUT INDEX:
Full table scan: 100,000,000 rows / 100 rows per page = 1,000,000 page reads
With index: ~50,004 page reads
Speedup: ~20× for this query
RANGE QUERY benefit:
SELECT * FROM orders WHERE city = 'Mumbai' AND amount > 500
Index on city narrows to Mumbai rows, then filters amount.
Index on (city, amount) composite — narrows to both criteria in one pass.
Index types every data engineer must know
Beyond the default B-tree index, the types worth knowing are: hash indexes (equality lookups only), composite indexes on multiple columns (column order matters: the leading column must appear in the filter for the index to be used), partial indexes (index only rows matching a condition, such as WHERE status = 'active'), covering indexes (include extra columns so a query is answered from the index alone, without touching the table), and specialised structures such as PostgreSQL's GIN and GiST for full-text search, arrays, and JSONB.
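The page-read arithmetic from the Mumbai example can be reproduced with a tiny cost model; the numbers and helper names are illustrative:

```python
def pages_with_index(matching_rows, index_levels=4):
    # Descend the index, then one table-page read per matching row
    # (worst case: every matching row lives on a different page).
    return index_levels + matching_rows

def pages_full_scan(total_rows, rows_per_page=100):
    # No index: read every leaf page of the table.
    return total_rows // rows_per_page

indexed = pages_with_index(50_000)          # 50,004 page reads
scan    = pages_full_scan(100_000_000)      # 1,000,000 page reads
print(indexed, scan, round(scan / indexed)) # → 50004 1000000 20
```

Note what the model also reveals: as `matching_rows` grows toward the table size, the index advantage evaporates, which is why optimisers choose sequential scans for low-selectivity filters.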
The Buffer Pool — The Most Important Performance Variable
A database's buffer pool (also called page cache or shared buffers) is the in-memory cache between the disk and the query engine. When the database needs a page — an 8 KB block of data from a table or index — it first checks whether that page is already in the buffer pool. A cache hit means reading from RAM in nanoseconds. A cache miss means reading from disk in milliseconds. The ratio between hits and misses is often the single biggest factor in overall database performance.
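That hit/miss ratio translates directly into average page-read latency. A weighted-average sketch, assuming ~100 ns for a RAM hit and ~1 ms for a disk miss:

```python
RAM_NS, DISK_NS = 100, 1_000_000   # ~100 ns cache hit, ~1 ms disk miss

def avg_read_us(hit_ratio):
    # Expected latency = weighted average of the hit and miss costs.
    ns = RAM_NS * hit_ratio + DISK_NS * (1 - hit_ratio)
    return ns / 1000               # microseconds

for ratio in (0.99, 0.95, 0.80, 0.50):
    print(f"{ratio:.0%} hit ratio → ~{avg_read_us(ratio):.0f} µs per page read")
# 99% → ~10 µs, 95% → ~50 µs, 80% → ~200 µs, 50% → ~500 µs
```

The miss cost dominates so completely that going from a 99% to a 95% hit ratio makes the average read roughly five times slower, even though the ratio "only" dropped four points.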
How the buffer pool works
Buffer pool: 4 GB RAM allocated to PostgreSQL shared_buffers
State before a query runs:
Buffer pool (4 GB): [page_1847, page_9234, page_2341 ... 500,000 pages cached]
QUERY: SELECT * FROM orders WHERE order_id = 9284751
Step 1: Query executor needs leaf page of orders index (page_8823)
→ Check buffer pool: page_8823 is in cache? YES → cache hit
→ Read from RAM: ~100 nanoseconds
Step 2: Follow row pointer to data page (page_44219)
→ Check buffer pool: page_44219 in cache? NO → cache miss
→ Read from disk: ~1 millisecond (10,000× slower than RAM)
→ Load page_44219 into buffer pool (evict least-recently-used page)
Step 3: Return row data to client
BUFFER POOL EVICTION (when cache is full):
PostgreSQL uses Clock Sweep (approximate LRU):
Each page has a usage counter
Pages used recently have higher counters
When eviction needed, scan and evict first page with counter = 0
Dirty pages (modified but not yet written to disk) must be flushed
before eviction — adds latency
WHY THIS MATTERS FOR DATA ENGINEERS:
When you run a large analytical query on a production OLTP database,
it reads millions of pages from disk into the buffer pool.
This EVICTS the hot pages that application queries use frequently.
After your analytical query, every application query hits cache misses
until the buffer pool refills with hot pages.
Result: you ran a "read-only" analytical query and slowed down
the production application for the next 10–30 minutes.
This is why analytical queries must NOT run on production databases.
Buffer pool sizing — why it matters so much
PostgreSQL's default shared_buffers setting is 128 MB — far too small for any production workload. The recommended setting is 25% of total RAM. On a 64 GB server, that is 16 GB of buffer pool. A properly sized buffer pool means frequently-accessed data lives in RAM permanently and query performance is consistently fast. An undersized buffer pool means constant cache misses and slow, variable query performance.
Buffer pool hit ratio = (cache hits) / (cache hits + cache misses) × 100%
-- PostgreSQL: check buffer pool hit ratio
SELECT
sum(heap_blks_hit) AS cache_hits,
sum(heap_blks_read) AS disk_reads,
ROUND(
sum(heap_blks_hit)::numeric /
NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2
) AS hit_ratio_pct
FROM pg_statio_user_tables;
Interpreting the result:
> 99% → Excellent. Frequently-accessed data is in memory.
95–99% → Good. Some disk reads, acceptable for mixed workloads.
90–95% → Warning. Significant disk I/O. Consider increasing shared_buffers.
< 90% → Problem. Most queries hit disk. Database is I/O bound.
Either increase RAM, reduce working set size, or both.
Real impact of hit ratio on query latency:
99% hit ratio: average page read = 100ns×0.99 + 1ms×0.01 = ~10 μs
95% hit ratio: average page read = 100ns×0.95 + 1ms×0.05 = ~50 μs
80% hit ratio: average page read = 100ns×0.80 + 1ms×0.20 = ~200 μs
50% hit ratio: average page read = 100ns×0.50 + 1ms×0.50 = ~500 μs
Write-Ahead Log — Crash Safety, Replication, and CDC
The Write-Ahead Log (WAL) — called the redo log in MySQL and the transaction log in SQL Server — is one of the most important concepts in database engineering. It is the mechanism that makes databases crash-safe, it is the source of replication between primary and replica databases, and it is the source of Change Data Capture (CDC) that data engineers use to ingest database changes in real-time.
What the WAL is and why it exists
Every change a database makes to data goes through two steps: write to the WAL first, then modify the actual data page in the buffer pool. The WAL is a sequential, append-only log of every database operation — every INSERT, UPDATE, and DELETE, with enough information to reproduce or reverse the operation.
Why write to the WAL before changing the data? Because the WAL is sequential. Sequential writes to disk are dramatically faster than random writes (up to 100× faster on spinning disks, and still significantly faster on SSDs). By writing changes to the WAL first in one sequential pass, and flushing the WAL to disk, the database can safely acknowledge a transaction as committed even before the actual data pages have been written back to disk. If the server crashes before the data pages are written, the database replays the WAL at startup to recover the missing writes.
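A minimal sketch of this discipline: append to the log, acknowledge, apply data pages lazily, replay after a crash. `ToyWAL` is illustrative and is not PostgreSQL's record format:

```python
class ToyWAL:
    """Write-ahead discipline: log first, acknowledge, apply pages later."""
    def __init__(self):
        self.wal = []        # durable, append-only (would be fsync'd to disk)
        self.pages = {}      # data pages, flushed lazily at checkpoints

    def commit(self, key, new_value):
        self.wal.append((key, new_value))  # 1. sequential append + fsync
        return "committed"                 # 2. safe to acknowledge already

    def checkpoint(self):
        for key, value in self.wal:        # 3. write data pages later, in bulk
            self.pages[key] = value
        self.wal.clear()

    def crash_recover(self):
        # Data pages may be stale after a crash; replaying the WAL restores
        # every acknowledged change.
        for key, value in self.wal:
            self.pages[key] = value

db = ToyWAL()
db.commit(9284751, "delivered")
# -- crash here: db.pages is still stale, but the WAL survived --
db.crash_recover()
print(db.pages[9284751])   # → delivered: no acknowledged write was lost
```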
WITHOUT WAL (naive approach):
UPDATE orders SET status = 'delivered' WHERE order_id = 9284751
1. Find the data page containing order 9284751 (random read)
2. Modify the status value in the page in buffer pool
3. Write the modified page to its original disk location (random write)
4. Return success to application
Problem: step 3 is a random write — slow.
If crash between step 2 and step 3: data change is LOST.
WITH WAL (actual approach — PostgreSQL, MySQL, SQL Server):
1. Find the data page (random read → probably buffer pool hit)
2. Write WAL record to WAL file (sequential append — very fast):
{txn_id: 847291, operation: UPDATE, table: orders,
old: {status: "confirmed"}, new: {status: "delivered"},
row_pointer: page_44219/slot_7}
3. Flush WAL file to disk (fsync — ensures durability)
4. Modify the page in buffer pool (in memory only, not yet on disk)
5. Return success to application ← WAL guarantees durability
Data page will be written to disk LATER during checkpoint.
If crash before checkpoint: replay WAL at startup → data recovered.
WAL write is sequential → fast even under high write load.
WAL record contains everything needed to reproduce or reverse a change:
- Transaction ID
- Table and page location
- Before image (old values) — used for rollback and MVCC
- After image (new values) — used for redo on crash recovery
WAL as the source of replication
Read replicas in PostgreSQL work by streaming the WAL from the primary database to the replica and replaying it. The replica receives each WAL record, applies the same change to its own copy of the data, and stays in sync with the primary. This is called streaming replication, and it is the basis of every high-availability database setup in production.
WAL as the source of Change Data Capture
This is the most relevant WAL concept for data engineers. CDC (Change Data Capture) — the technique of capturing every INSERT, UPDATE, and DELETE from a database and streaming them into a data pipeline — works by reading the WAL. Tools like Debezium read the PostgreSQL WAL through its logical replication protocol, decode each change record, and publish it as a structured event to Kafka. This enables near-real-time data pipelines without polling the source database repeatedly.
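The decoding step amounts to a mapping from a WAL-style change record to a Debezium-shaped envelope. Field names mirror the example below; the function itself is hypothetical:

```python
def wal_record_to_cdc_event(record):
    """Map a decoded WAL change record to a Debezium-style envelope."""
    op_codes = {"INSERT": "c", "UPDATE": "u", "DELETE": "d"}
    return {
        "before": record.get("old"),        # absent for inserts
        "after": record.get("new"),         # absent for deletes
        "op": op_codes[record["op"]],
        "source": {"table": record["rel"], "lsn": record["lsn"],
                   "txId": record["txn"]},
    }

record = {"lsn": 28437128, "txn": 847291, "op": "UPDATE", "rel": "orders",
          "old": {"order_id": 9284751, "status": "confirmed"},
          "new": {"order_id": 9284751, "status": "delivered"}}

event = wal_record_to_cdc_event(record)
print(event["op"], event["after"]["status"])   # → u delivered
```

The important property: the event carries both before and after images plus the LSN, so a downstream consumer can deduplicate, order, and resume from any position in the stream.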
PostgreSQL WAL record (internal binary format):
{lsn: 0/1A3F2B8, txn: 847291, op: UPDATE, rel: orders,
old: {order_id:9284751, status:"confirmed"},
new: {order_id:9284751, status:"delivered"}}
Debezium decodes WAL → structured Kafka event:
{
"before": {
"order_id": 9284751,
"status": "confirmed"
},
"after": {
"order_id": 9284751,
"status": "delivered"
},
"op": "u", ← u=update, c=create, d=delete
"ts_ms": 1710698072847, ← timestamp of the change
"source": {
"db": "production",
"table": "orders",
"lsn": 28437128, ← log sequence number (position in WAL)
"txId": 847291
}
}
This Kafka event is published to topic "production.public.orders"
Any consumer (data pipeline, search indexer, cache invalidator)
can subscribe and react to every database change in near-real-time.
Latency: change committed in PostgreSQL → Kafka event available
typically 50–500 milliseconds
Key benefit for data engineers: your data lake sees every change
within seconds, not the next morning's batch.
Transactions, ACID, and MVCC — How Databases Handle Concurrency
A transaction is a group of database operations that execute as a single atomic unit — either all succeed together or all fail together with no partial state left behind. Transactions are what make databases safe for concurrent use by thousands of connections simultaneously. Without transaction semantics, concurrent reads and writes produce inconsistent, corrupt, or impossible data.
ACID — the four guarantees of a transaction
Atomicity: every operation in the transaction applies, or none does; a failure mid-transaction leaves no partial state behind. Consistency: a committed transaction moves the database from one valid state to another, with all constraints satisfied. Isolation: concurrent transactions behave as if they ran alone; none sees another's uncommitted changes. Durability: once a commit is acknowledged, the change survives a crash. Durability is exactly what the WAL provides, and isolation is what MVCC (covered next) implements.
MVCC — how isolation works without blocking reads
The naive way to implement isolation would be to lock every row being read or written so other transactions cannot touch it. This works but causes read queries to wait for write transactions to complete, and write transactions to wait for read queries to finish — a serial bottleneck that makes concurrent databases slow.
Modern databases — PostgreSQL, Oracle, MySQL InnoDB — use a technique called MVCC (Multi-Version Concurrency Control) that allows readers and writers to operate concurrently without blocking each other, by giving each transaction a consistent snapshot of the data as it existed at a specific point in time.
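The snapshot rule can be expressed directly in terms of the hidden xmin/xmax fields described below (a simplification that ignores in-progress and aborted transactions):

```python
def visible_version(versions, snapshot_xid):
    """Return the row version a snapshot taken at `snapshot_xid` sees:
    created by a transaction the snapshot includes (xmin <= snapshot)
    and not yet deleted from its point of view (xmax == 0 or > snapshot)."""
    for v in versions:
        not_yet_deleted = v["xmax"] == 0 or v["xmax"] > snapshot_xid
        if v["xmin"] <= snapshot_xid and not_yet_deleted:
            return v["status"]
    return None

versions = [  # all three versions coexist on disk, as in the timeline below
    {"status": "placed",    "xmin": 100, "xmax": 200},
    {"status": "confirmed", "xmin": 200, "xmax": 300},
    {"status": "delivered", "xmin": 300, "xmax": 0},
]
print(visible_version(versions, 250))  # → confirmed (the analyst's snapshot)
print(visible_version(versions, 350))  # → delivered (after txn 300 commits)
```

Two snapshots, two answers, zero locks taken: that is the whole point of MVCC.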
PostgreSQL MVCC mechanism:
Every row has two hidden fields:
xmin: transaction ID that created this row version
xmax: transaction ID that deleted this row version (0 if not deleted)
Timeline:
T=100 (txn 100): INSERT order {id:9284751, status:'placed'}
Row: {id:9284751, status:'placed', xmin:100, xmax:0}
T=200 (txn 200): UPDATE status = 'confirmed'
OLD row: {id:9284751, status:'placed', xmin:100, xmax:200} ← marked deleted
NEW row: {id:9284751, status:'confirmed', xmin:200, xmax:0} ← new version
T=300 (txn 300): UPDATE status = 'delivered'
OLD row: {id:9284751, status:'confirmed', xmin:200, xmax:300} ← marked deleted
NEW row: {id:9284751, status:'delivered', xmin:300, xmax:0} ← current version
All three versions exist on disk simultaneously!
NOW: two queries run concurrently:
QUERY A (analyst, started before txn 300):
Snapshot: sees all transactions committed before T=250
Sees: status='confirmed' (txn 200 committed, txn 300 not yet)
Returns: 'confirmed'
QUERY B (application write, started at T=300):
Sees: status='delivered' (txn 300 committed)
Returns: 'delivered'
NEITHER QUERY BLOCKED THE OTHER.
Readers see a consistent point-in-time snapshot.
Writers create new row versions, not locking old ones.
This is why SELECT queries in PostgreSQL never block UPDATE queries
and UPDATE queries never block SELECT queries.
VACUUM: old row versions are eventually cleaned up by the
background VACUUM process once no transaction can see them.
Without VACUUM, the table grows forever (table bloat).
How a SQL Query Becomes a Physical Execution Plan
When you run a SQL query, the database does not execute it literally in the order you wrote it. It parses the SQL, builds a logical plan, optimises it into a physical plan, and then executes the physical plan. Understanding this process is what allows you to read EXPLAIN output, understand why the database chose one plan over another, and rewrite queries that choose inefficient plans.
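The last step of this pipeline, the pull-based "volcano" model in which each plan node pulls rows from its children, can be sketched with Python generators (toy data, not a real executor):

```python
def seq_scan(table):                      # leaf node: yields raw rows
    yield from table

def filter_node(child, predicate):        # pulls from child, keeps matches
    return (row for row in child if predicate(row))

def limit_node(child, n):                 # asks the child for at most n rows
    for _ in range(n):
        row = next(child, None)
        if row is None:
            return
        yield row

orders = [{"id": 1, "amount": 250}, {"id": 2, "amount": 900},
          {"id": 3, "amount": 1200}, {"id": 4, "amount": 4000}]

# Equivalent of: SELECT * FROM orders WHERE amount > 500 LIMIT 2
plan = limit_node(filter_node(seq_scan(orders), lambda r: r["amount"] > 500), 2)
print([row["id"] for row in plan])   # → [2, 3]
```

Because each node pulls rows only on demand, the LIMIT stops the whole pipeline early: once two rows satisfy the filter, the scan never reads row 4. This laziness is why LIMIT queries can be fast even on huge tables.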
SQL string: SELECT c.name, SUM(o.amount) as total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= '2026-01-01'
GROUP BY c.name
ORDER BY total DESC
LIMIT 10
STEP 1: PARSING
Lexer tokenises: [SELECT, c.name, SUM, (, o.amount, ), ...]
Parser builds Abstract Syntax Tree (AST)
Validates syntax — catches typos, missing clauses
STEP 2: SEMANTIC ANALYSIS
Resolves table and column names against catalog
Validates types: created_at >= '2026-01-01' — date comparison valid?
Expands SELECT * to explicit column names if needed
STEP 3: LOGICAL PLAN
Relational algebra tree — what to compute, not how:
Limit(10)
Sort(total DESC)
Aggregate(GROUP BY c.name, SUM(o.amount))
Join(o.customer_id = c.id)
Filter(o.created_at >= '2026-01-01')
Scan(orders)
Scan(customers)
STEP 4: OPTIMISATION (the most important step)
Query optimizer rewrites the logical plan for efficiency:
- Push filters DOWN before joins (filter orders before joining customers)
- Choose join algorithm: hash join? merge join? nested loop?
- Choose index scans vs sequential scans based on statistics
- Decide join order (smaller table first)
Statistics used: row counts, column cardinality, value distributions
Optimizer estimates cost of each plan in abstract "cost units"
Chooses plan with lowest estimated cost
STEP 5: PHYSICAL PLAN
Concrete execution plan with specific algorithms:
Limit
Sort (external sort — result too large for memory)
HashAggregate (hash table for GROUP BY)
Hash Join (hash customers table, probe with filtered orders)
Index Scan on idx_orders_created_at (filter pushdown)
Sequential Scan on customers
STEP 6: EXECUTION
Each node in the plan pulls rows from its children (volcano model)
Rows flow up the plan from leaf nodes to the root
Result returned to client
To see the plan PostgreSQL chose:
EXPLAIN ANALYZE SELECT ... (same query)
Shows: plan nodes, estimated vs actual rows, actual timing per node
Reading EXPLAIN output — the most practical skill for query optimisation
EXPLAIN ANALYZE
SELECT * FROM orders WHERE city = 'Bangalore' AND amount > 1000;
Output:
Bitmap Heap Scan on orders (cost=892.14..4821.33 rows=12847 width=89)
(actual time=12.847..89.234 rows=13102 loops=1)
Recheck Cond: (city = 'Bangalore')
Filter: (amount > 1000::numeric)
Rows Removed by Filter: 28432
-> Bitmap Index Scan on idx_orders_city
(cost=0.00..888.93 rows=41279 width=0)
(actual time=11.234..11.234 rows=41534 loops=1)
Index Cond: (city = 'Bangalore')
Planning Time: 0.847 ms
Execution Time: 94.127 ms
HOW TO READ THIS:
"cost=892.14..4821.33" → estimated cost (startup..total) in arbitrary units
"rows=12847" → estimated number of rows (optimizer's guess)
"actual time=12.847..89.234" → actual timing in milliseconds
"rows=13102" → actual rows returned (compare to estimate!)
"Rows Removed by Filter: 28432" → 28k rows passed index but failed amount filter
KEY INSIGHT from this output:
The index returned 41,534 rows for city='Bangalore'
But only 13,102 had amount > 1000 (filter removed 28,432)
This means a composite index on (city, amount) would be better:
CREATE INDEX idx_orders_city_amount ON orders(city, amount);
New plan would return only ~13,102 rows from the index directly,
skipping the 28,432 wasted reads.
Warning sign: estimated rows vs actual rows differ by 10×+
Means statistics are stale → run ANALYZE to update them
OLTP vs OLAP — Two Fundamentally Different Database Designs
Now that you understand the internals, the OLTP vs OLAP distinction becomes fully clear. These are not just two categories of databases — they are two fundamentally different design philosophies optimised for completely different workloads.
| Dimension | OLTP | OLAP |
|---|---|---|
| Stands for | Online Transaction Processing | Online Analytical Processing |
| Primary purpose | Run the business — record and retrieve individual transactions fast | Analyse the business — aggregate large volumes of historical data |
| Query pattern | Read/write individual rows by primary key or indexed lookup | Scan millions of rows, aggregate columns, compute metrics |
| Storage layout | Row-oriented (all columns of a row together) | Columnar (all values of a column together) |
| Optimised for | Low latency individual operations (< 10ms) | High throughput analytical scans (seconds to minutes) |
| Concurrent users | Thousands of concurrent writers + readers | Tens of analysts, not thousands of app connections |
| Data volume | Current/recent operational data (GB to low TB) | Full historical data (TB to PB) |
| Schema style | Normalised (3NF) — data in one place, JOINs required | Denormalised — pre-joined for fast query performance |
| Indexes | Many indexes on common lookup columns | Partitioning + clustering keys instead of traditional indexes |
| Transactions | Full ACID, high-concurrency write transactions | Append-only or batch load — minimal concurrent writes |
| Examples | PostgreSQL, MySQL, SQL Server, Oracle | Snowflake, BigQuery, Redshift, Azure Synapse, ClickHouse |
The data engineer's job is to bridge these two worlds — extract data from OLTP systems where it is created and load it into OLAP systems where it can be efficiently analysed. Understanding why you cannot run analytical queries on OLTP systems (buffer pool pollution, write amplification from index scans, contention with application queries) and why you cannot run application transactions on OLAP systems (no row-level locks, no fast primary-key lookups, slow individual writes) is fundamental knowledge.
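The row-versus-columnar distinction in the table above can be made concrete: the same four orders in both layouts, and what each access pattern has to touch. A toy illustration:

```python
# Same four orders, two physical layouts.
row_store = [                       # OLTP: all columns of a row stored together
    {"id": 1, "city": "Mumbai",  "amount": 250},
    {"id": 2, "city": "Chennai", "amount": 900},
    {"id": 3, "city": "Mumbai",  "amount": 1200},
    {"id": 4, "city": "Pune",    "amount": 4000},
]
column_store = {                    # OLAP: all values of a column together
    "id":     [1, 2, 3, 4],
    "city":   ["Mumbai", "Chennai", "Mumbai", "Pune"],
    "amount": [250, 900, 1200, 4000],
}

# SELECT SUM(amount): the row store must walk every row (touching every
# column on the way); the column store reads one contiguous array.
print(sum(row["amount"] for row in row_store))  # → 6350
print(sum(column_store["amount"]))              # → 6350

# Fetching one complete order (the OLTP pattern) favours the row store:
print(row_store[1])                                          # one record
print({col: vals[1] for col, vals in column_store.items()})  # stitched together
```

Same answers, very different I/O: for an aggregate over one column of a billion-row table, the columnar layout reads a small fraction of the bytes, and that asymmetry is the entire case for warehouses.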
Diagnosing a Slow Pipeline Using Database Internals
Your daily ingestion pipeline pulls the last 24 hours of transactions from PostgreSQL. It runs every morning at 5 AM and normally finishes in 18 minutes. This morning it ran for 3 hours and is still going when you arrive at 9 AM. Nothing in the pipeline code changed. What happened?
Step 1 — Check if the query is running: You connect to PostgreSQL and run SELECT pid, query, state, wait_event_type, wait_event, query_start FROM pg_stat_activity. You find your ingestion query in the "active" state with wait_event_type = "Lock". It is waiting for a lock.
Step 2 — Find the blocking query: You run SELECT pg_blocking_pids(pid), passing the PID of your ingestion process. It returns an array of blocking PIDs — here, a single one. You look that PID up in pg_stat_activity — it is a transaction started at 11 PM last night (10 hours ago) by a developer running an ad-hoc UPDATE statement that never committed. It has been sitting with an open transaction, holding row locks, for 10 hours.
Step 3 — Understand why this blocked you: Under MVCC, plain reads never wait for writers, so a pure SELECT would not have blocked. A lock wait here tells you the extraction step also writes to the source (for example, marking rows as exported or updating a watermark), and those writes needed row locks that the abandoned UPDATE still held. Row locks are released only when a transaction commits or rolls back, so the forgotten transaction blocked the pipeline indefinitely. This is locking working exactly as designed, just with an unintended victim.
Step 4 — Resolution: You confirm with the developer that the transaction can be safely terminated (it was a test query that was never meant to stay open). You run SELECT pg_terminate_backend(pid). The blocking transaction is terminated, the row locks are released, and your ingestion pipeline completes in 19 minutes.
The fix going forward: You set idle_in_transaction_session_timeout = '30min' on the PostgreSQL server — any transaction that sits idle for more than 30 minutes without executing a query is automatically terminated. This prevents a single forgotten developer query from blocking production pipelines overnight.
This scenario plays out at real companies regularly. The data engineer who understands database internals — transactions, locks, MVCC, pg_stat_activity — diagnoses and resolves it in 20 minutes. The one who does not spends hours restarting the pipeline, escalating to DBAs, and not understanding why.
🎯 Key Takeaways
- ✓ Every database is built on six internal concepts: storage engine, B-tree index, buffer pool, Write-Ahead Log, transactions and MVCC, and query execution. Learn these once and every database you encounter becomes understandable.
- ✓ B-tree storage engines store data in fixed-size pages organised as a tree. Lookups are O(log n) — fast. Full table scans read every page — slow. LSM-tree engines append writes to memory and flush to immutable disk files — writes are faster, reads require checking multiple files.
- ✓ A B-tree index allows the query engine to jump directly to matching rows without scanning the whole table. Without an index on a filter column, every query is a full scan. The cost of an index is write overhead — every INSERT, UPDATE, or DELETE must also update all indexes on the table.
- ✓ The buffer pool is the in-memory cache between disk and queries. A 99% cache hit ratio means queries read mostly from RAM. Running a large analytical scan on an OLTP database evicts hot pages from the buffer pool, slowing down application queries for minutes to hours after the scan completes.
- ✓ The Write-Ahead Log (WAL) makes databases crash-safe by recording every change before modifying data pages. WAL is also the source of replication (replicas stream and replay WAL) and Change Data Capture (Debezium reads WAL to publish database changes to Kafka in near-real-time).
- ✓ MVCC (Multi-Version Concurrency Control) allows readers and writers to operate concurrently without blocking each other. Each transaction sees a consistent snapshot of the database as it existed when the transaction started. Old row versions accumulate as dead rows and must be reclaimed by VACUUM.
- ✓ Long-running pipeline transactions prevent VACUUM from reclaiming dead row versions, causing table bloat. Always keep transaction scope as small as possible in ingestion code. Set idle_in_transaction_session_timeout on source databases to auto-terminate abandoned sessions.
- ✓ The query executor translates SQL into a physical plan via parsing, semantic analysis, logical planning, optimisation, and execution. EXPLAIN ANALYZE shows the actual plan chosen, estimated vs actual row counts, and timing per node — essential for diagnosing slow queries.
- ✓ OLTP databases are row-oriented, normalised, and optimised for fast individual record access under high write concurrency. OLAP databases are columnar, denormalised, and optimised for fast aggregation over large datasets. Never run analytical queries on OLTP databases — the buffer pool pollution alone degrades the application.
- ✓ The most practical database internals skills for a data engineer: reading EXPLAIN ANALYZE output, identifying and terminating blocking transactions with pg_stat_activity, monitoring buffer pool hit ratio, understanding why WAL is the source of CDC, and knowing when table bloat is causing slow queries.