ACID Properties and Transactions
Why ACID exists, what each property means, and what happens when it breaks.
The Problem ACID Was Invented to Solve
Imagine a bank transfer. Priya sends ₹5,000 to Rahul. Two things must happen: subtract ₹5,000 from Priya's account, add ₹5,000 to Rahul's account. Simple. Now imagine the server crashes after the subtraction but before the addition. Priya's account is ₹5,000 lighter. Rahul's account shows nothing. The money has vanished.
Or consider a different failure: while the bank transfer is being processed, another query reads both account balances at the exact moment between the subtraction and the addition. It sees Priya's balance already reduced but Rahul's not yet increased. It calculates total money in the system as ₹5,000 less than the actual total. It is seeing a reality that never truly existed.
These are not hypothetical edge cases. They happen constantly in systems without proper transaction semantics. Every concurrent database — which means every database worth using — faces these problems at scale. ACID is the set of four guarantees that databases provide to ensure these problems never occur. Understanding ACID deeply is not just theoretical knowledge — it directly determines how you design data pipelines that handle failures correctly.
Atomicity — All or Nothing
Atomicity is the guarantee that a transaction — which may contain dozens of individual operations — is treated as a single indivisible unit. Either every single operation in the transaction succeeds and the changes are committed, or the entire transaction is rolled back and the database is left exactly as it was before the transaction began. There is no partial state. There is no "most of the changes succeeded."
The word "atomic" comes from the Greek word for indivisible. An atomic operation cannot be split. You observe it before it starts, or after it finishes. You never observe it in the middle.
How atomicity works internally
Atomicity is implemented using the Write-Ahead Log (WAL) we covered in Module 09. Every operation in a transaction is written to the WAL before the data pages are modified. When a transaction commits, a commit record is written to the WAL and flushed to disk. If a crash occurs before the commit record is written, the database replays the WAL on restart, sees no commit record for that transaction, and discards all its changes. If the commit record was written, all changes are replayed and applied.
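The replay logic can be sketched as a toy simulation. The record format below is invented for illustration only; a real WAL stores binary page-level changes, not tuples like these. The key behaviour is the same: a transaction's changes are applied on replay only if its commit record made it into the log.

```python
# Toy WAL replay: apply a transaction's changes only if its COMMIT
# record reached the log before the crash. (Illustrative record
# format — real WALs store binary page deltas, not tuples.)

def replay(wal: list[tuple]) -> dict:
    """Rebuild account balances from a write-ahead log."""
    balances = {}
    pending = {}  # txid -> list of (account, delta) not yet committed
    for record in wal:
        if record[0] == "write":
            _, txid, account, delta = record
            pending.setdefault(txid, []).append((account, delta))
        elif record[0] == "commit":
            _, txid = record
            # Commit record present: apply every change of this txn
            for account, delta in pending.pop(txid, []):
                balances[account] = balances.get(account, 0) + delta
    # Transactions left in `pending` had no commit record -> discarded
    return balances

# Crash after step 1, before the commit record: money is not lost
wal_crash = [("write", "tx1", "A001", -5000)]
assert replay(wal_crash) == {}   # rolled back, no partial state

# Commit record made it to disk: both steps are applied on replay
wal_ok = [("write", "tx1", "A001", -5000),
          ("write", "tx1", "B001", 5000),
          ("commit", "tx1")]
assert replay(wal_ok) == {"A001": -5000, "B001": 5000}
```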
BANK TRANSFER TRANSACTION:
BEGIN;
UPDATE accounts SET balance = balance - 5000 WHERE account_id = 'A001'; -- Step 1
UPDATE accounts SET balance = balance + 5000 WHERE account_id = 'B001'; -- Step 2
COMMIT;
SCENARIO 1: Server crashes AFTER step 1, BEFORE step 2
WAL contains: Step 1 operation, NO commit record
On restart: database sees no commit → ROLLS BACK step 1
Result: A001 balance unchanged, B001 balance unchanged
Money is NOT lost. ✓
SCENARIO 2: Server crashes AFTER commit record is written
WAL contains: Step 1, Step 2, COMMIT record
On restart: database replays both steps
Result: A001 reduced by 5000, B001 increased by 5000
Transaction is fully applied. ✓
SCENARIO 3: Step 2 fails due to a constraint violation
(e.g. B001 account has been frozen and cannot receive funds)
Database automatically rolls back step 1 within the transaction
Result: A001 balance unchanged (step 1 reversed)
Error returned to application: transaction failed. ✓
No partial state. ✓
WITHOUT ATOMICITY (naive implementation):
Step 1 executes and modifies A001 directly
Step 2 fails for any reason
Result: A001 reduced, B001 unchanged — money GONE
This is why atomicity is not optional for financial systems.
Atomicity in data pipelines
Atomicity in data pipelines means: either the entire batch of rows is written or none of them are. If your pipeline writes 100,000 rows in a transaction and fails on row 73,241, the entire batch is rolled back. The next run starts from the beginning of the batch.
This is why idempotency — the ability to safely rerun a pipeline — is built on atomicity. If writes are atomic, a failed run leaves no partial data behind. The next run can safely write the complete batch without worrying about duplicating what was already written.
# Python pipeline — atomic batch insert using a transaction
import psycopg2
def load_orders_batch(orders: list[dict], conn) -> None:
"""Load a batch of orders atomically — all succeed or all fail."""
with conn.cursor() as cur:
try:
# Everything inside this block is one transaction
cur.execute("BEGIN;")
for order in orders:
cur.execute(
"""INSERT INTO silver.orders
(order_id, customer_id, amount, status, created_at)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (order_id) DO UPDATE
SET status = EXCLUDED.status,
updated_at = NOW()""",
(order['id'], order['customer_id'],
order['amount'], order['status'], order['created_at'])
)
cur.execute("COMMIT;")
print(f"Successfully committed {len(orders)} orders")
except Exception as e:
cur.execute("ROLLBACK;") # ← atomicity: undo all changes
print(f"Transaction rolled back due to: {e}")
raise # re-raise so the orchestrator knows it failed
# Without the transaction:
# If row 73,241 fails, rows 1–73,240 are already committed
# Next run re-inserts rows 1–73,240 → duplicates
# With ON CONFLICT this is handled, but messy
#
# With the transaction:
# If row 73,241 fails, rows 1–73,240 are rolled back
# Next run starts cleanly from row 1
Consistency — Valid Before and After
Consistency is the guarantee that a transaction takes the database from one valid state to another valid state. Every constraint, rule, and invariant defined in the schema must be satisfied both before and after the transaction. A transaction that would violate any constraint is aborted entirely — atomicity ensures the invalid intermediate state is never persisted.
Consistency is the only ACID property that is partly the application's responsibility. The database enforces its own constraints — foreign keys, CHECK constraints, NOT NULL, UNIQUE — automatically. But business rules that are not expressed as database constraints must be enforced by the application or pipeline code.
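A rule like "a delivered order cannot be cancelled" is typically enforced with a guard in application code before the write ever reaches the database. A minimal sketch, with an illustrative transition table:

```python
# Allowed order-status transitions (illustrative business rule that a
# schema constraint alone cannot express).
ALLOWED_TRANSITIONS = {
    "placed":    {"confirmed", "cancelled"},
    "confirmed": {"delivered", "cancelled"},
    "delivered": set(),   # terminal: a delivered order cannot change
    "cancelled": set(),   # terminal
}

def validate_transition(current: str, new: str) -> None:
    """Raise BEFORE the UPDATE is sent — the DB cannot catch this."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current!r} -> {new!r}")

validate_transition("confirmed", "delivered")   # fine, no exception
try:
    validate_transition("delivered", "cancelled")
except ValueError as e:
    print(e)   # illegal transition 'delivered' -> 'cancelled'
```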
What consistency means in practice
DATABASE-ENFORCED CONSISTENCY (automatic):
Schema constraints the database checks on every transaction:
NOT NULL: customer_id cannot be NULL
UNIQUE: order_id cannot repeat
FOREIGN KEY: customer_id must exist in customers table
CHECK: order_amount must be > 0
status must be in ('placed','confirmed','delivered','cancelled')
Example: trying to insert an order with amount = -100
INSERT INTO orders (order_id, customer_id, amount, status)
VALUES (9284751, 4201938, -100.00, 'placed');
→ ERROR: new row violates check constraint "chk_order_amount"
→ Transaction aborted. No partial state. ✓
APPLICATION-ENFORCED CONSISTENCY (your responsibility):
Business rules that cannot be expressed as simple DB constraints:
"A delivered order cannot be cancelled"
"A customer cannot have more than 5 active orders simultaneously"
"An order amount cannot exceed the customer's credit limit"
"Delivery time cannot be set before the order was placed"
These rules must be checked in application/pipeline code BEFORE
attempting the database write. The database has no way to know
these business rules exist.
Example of a consistency violation the DB cannot catch:
-- Order was already 'delivered' at 8:14 PM
-- At 8:45 PM, a bug in the status service sends 'cancelled'
UPDATE orders SET status = 'cancelled'
WHERE order_id = 9284751;
→ DB allows this (no constraint prevents it)
→ But it violates a business rule: you cannot cancel a delivered order
→ Application code must check current status BEFORE allowing the update
This is why data quality checks in pipelines matter:
the database enforces its constraints; your pipeline must enforce
the business rules the database cannot express as constraints.
Consistency and data pipelines
For data engineers, consistency means your pipeline must leave the data in a valid state after every run. This includes not just database constraints but business-level invariants: row counts that match expected volumes, referential completeness (every order_id in order_items has a corresponding row in orders), and value distributions within expected ranges. Data quality tests in dbt are exactly this — they are consistency checks for business rules the database cannot enforce natively.
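A referential-completeness check of this kind can be written directly in pipeline code. The row shapes below are illustrative, and the check mirrors what a dbt relationships test does:

```python
def orphan_order_ids(order_items: list[dict], orders: list[dict]) -> set:
    """Return order_ids in order_items with no matching row in orders."""
    known = {o["order_id"] for o in orders}
    return {i["order_id"] for i in order_items} - known

orders = [{"order_id": 1}, {"order_id": 2}]
items  = [{"order_id": 1, "sku": "MI-001"},
          {"order_id": 3, "sku": "MI-002"}]   # order 3 has no parent row

# A non-empty result should fail the pipeline run
assert orphan_order_ids(items, orders) == {3}
```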
Isolation — Transactions Do Not Interfere
Isolation is the guarantee that concurrent transactions are isolated from each other — a transaction in progress cannot see the uncommitted changes of another transaction, and two transactions do not produce inconsistent results just because they run at the same time. Every transaction behaves as if it were the only transaction running, even when thousands of transactions execute simultaneously.
Isolation is the most complex ACID property to implement and the one with the most engineering nuance. Full isolation (every transaction sees a perfectly consistent snapshot with no interference) comes at a performance cost. Databases offer different isolation levels that trade off correctness guarantees against concurrency performance.
The four isolation problems — what happens without isolation
Dirty read: a transaction reads another transaction's uncommitted changes
Non-repeatable read: the same row, read twice in one transaction, returns different values
Phantom read: the same query, run twice in one transaction, returns a different set of rows
Lost update: two transactions read-modify-write the same row and one update silently overwrites the other
Isolation levels — the trade-off between correctness and performance
Rather than providing maximum isolation always (which is expensive), SQL databases offer four isolation levels. Each level prevents certain anomalies while allowing others, trading consistency guarantees for higher concurrency.
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use when |
|---|---|---|---|---|
| READ UNCOMMITTED | ✗ Possible | ✗ Possible | ✗ Possible | Almost never — allows reading uncommitted data. Only acceptable for approximate analytics where staleness is fine. |
| READ COMMITTED (default in PostgreSQL) | ✓ Prevented | ✗ Possible | ✗ Possible | Most OLTP applications. Each statement sees only committed data, but two reads in same transaction may differ. |
| REPEATABLE READ (default in MySQL InnoDB) | ✓ Prevented | ✓ Prevented | ✗ Possible (standard) ✓ Prevented (PostgreSQL) | Financial reports, bank statements — where same-row reads must be consistent within one transaction. |
| SERIALIZABLE | ✓ Prevented | ✓ Prevented | ✓ Prevented | Critical financial operations, inventory management — where all anomalies are unacceptable. Slowest. |
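Under SERIALIZABLE, the database may abort a transaction with a serialization error under contention, and the application is expected to retry it. A sketch of the retry loop: SerializationError here is a stand-in for the driver's real exception (psycopg2 raises psycopg2.errors.SerializationFailure).

```python
import time

class SerializationError(Exception):
    """Stand-in for the driver's serialization-failure exception."""

def run_serializable(txn_fn, max_retries=3, backoff_s=0.0):
    """Run txn_fn, retrying when the DB aborts it under SERIALIZABLE."""
    for attempt in range(max_retries):
        try:
            return txn_fn()
        except SerializationError:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff

# Simulate a transaction that conflicts once, then succeeds on retry
attempts = []
def txn():
    attempts.append(1)
    if len(attempts) == 1:
        raise SerializationError("could not serialize access")
    return "committed"

assert run_serializable(txn) == "committed"
assert len(attempts) == 2   # one conflict, one successful retry
```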
What isolation level should data pipelines use?
INGESTION QUERIES (reading from production databases):
Use: READ COMMITTED (PostgreSQL default) or REPEATABLE READ
Why: you want to read committed data, not in-progress changes
READ COMMITTED: each of your SELECT statements sees a fresh snapshot
REPEATABLE READ: your entire transaction sees one consistent snapshot
(better for long-running analytical extractions)
PostgreSQL example:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- All reads within this transaction see the same snapshot
SELECT * FROM orders WHERE created_at >= '2026-03-17';
-- Even if new orders are committed during this query,
-- they will NOT appear here (snapshot was taken at BEGIN)
COMMIT;
WRITE PIPELINES (loading data into warehouse/lake tables):
Use: READ COMMITTED or SERIALIZABLE depending on conflict risk
For simple appends: READ COMMITTED is sufficient
For read-modify-write patterns (count then insert):
Use SERIALIZABLE to prevent lost updates
Example of dangerous read-modify-write without SERIALIZABLE:
-- Two pipeline instances run simultaneously:
T1: SELECT COUNT(*) FROM daily_summary WHERE date='2026-03-17' → 0
T2: SELECT COUNT(*) FROM daily_summary WHERE date='2026-03-17' → 0
T1: INSERT INTO daily_summary (date, count) VALUES ('2026-03-17', 48234)
T2: INSERT INTO daily_summary (date, count) VALUES ('2026-03-17', 48234)
Result: DUPLICATE ROW in daily_summary
With SERIALIZABLE: T2 fails with serialization error → retry → correct
PRACTICAL RULE FOR DATA ENGINEERS:
For reading: always use at least READ COMMITTED — never READ UNCOMMITTED
For long analytical reads: use REPEATABLE READ for snapshot consistency
For concurrent writes to shared tables: use SERIALIZABLE
Most pipelines: READ COMMITTED is fine because writes are controlled
(only one pipeline writes to a table at a time)
Durability — Committed Data Survives Everything
Durability is the guarantee that once a transaction is committed, its changes are permanent. Not "probably permanent." Not "permanent unless the server crashes." Permanent — they survive server crashes, power failures, OS crashes, and restarts. If the database told the application "transaction committed," those changes will be in the database after any failure.
Durability is implemented through the Write-Ahead Log. Before the database sends the "committed" response to the application, the commit record is flushed to the WAL on durable storage (disk or SSD). Even if the server immediately crashes after sending the commit response, the WAL replay on restart will recover all committed changes. This is why the WAL flush (the fsync call that forces the WAL buffer to physical disk) is the most performance-critical operation in a high-write database.
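The ordering guarantee can be sketched with an ordinary file standing in for the WAL. The invariant is that os.fsync() returns before the caller is ever told "committed":

```python
import os
import tempfile

def commit_txn(wal_path: str, records: list[str]) -> str:
    """Append a transaction to the WAL and fsync BEFORE acknowledging."""
    with open(wal_path, "a") as wal:
        for r in records:
            wal.write(r + "\n")
        wal.write("COMMIT\n")
        wal.flush()              # push Python's buffer to the OS
        os.fsync(wal.fileno())   # force the OS page cache to disk
    return "committed"           # only reachable after fsync succeeded

wal_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wal").name
assert commit_txn(wal_file, ["A001 -5000", "B001 +5000"]) == "committed"
# After a crash, replay would find the COMMIT marker and apply both rows
```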
DURABILITY IMPLEMENTATION:
Transaction COMMIT sequence:
1. All operations in transaction written to WAL buffer (in memory)
2. fsync() called → WAL buffer flushed to durable disk storage
3. "COMMIT" response sent to application ← only after fsync completes
4. Data pages modified in buffer pool (can happen lazily later)
5. At checkpoint: modified data pages written from buffer to disk
The critical point: step 3 cannot happen before step 2.
"Committed" means "safely on disk." Not "in memory."
PERFORMANCE IMPACT OF DURABILITY:
fsync() is expensive — forces all pending writes to physical disk
Typical SSD fsync latency: 0.5 – 5 milliseconds
Without fsync: database could acknowledge commits at memory speed (~microseconds)
With fsync: every transaction pays the disk latency penalty
PostgreSQL synchronous_commit settings:
on (default): fsync after every commit → fully durable, slowest writes
off: no fsync → commits acknowledged at memory speed
risk: up to ~0.6 seconds of committed transactions can be
lost on crash (the WAL buffer that was not flushed)
local: fsync to local disk, not to replicas
For data pipelines: synchronous_commit = on is correct for all
financial and critical data. For analytics data that can be
reprocessed from source, synchronous_commit = off is acceptable
and dramatically improves write throughput.
CLOUD DATABASE DURABILITY:
RDS PostgreSQL, Cloud SQL, Azure Database for PostgreSQL:
→ Multi-AZ deployments replicate WAL to a standby in a different
availability zone synchronously before acknowledging commits
→ Survive data centre failure, not just single server failure
Snowflake / BigQuery:
→ Multiple copies of data written across multiple storage nodes
→ Commit acknowledged only after all copies are confirmed
→ Effectively more durable than a single-server PostgreSQL
Durability vs performance — the real trade-off
Every database that offers durability pays a performance cost for it. The fsync that makes durability possible is the same operation that limits write throughput. Some high-throughput data systems deliberately relax durability guarantees for better performance — accepting that some recently committed data might be lost on crash because it can be reprocessed from the source.
As a data engineer, you must make this trade-off consciously for each pipeline. Financial transaction data: full durability always. Analytics event data that can be reprocessed from Kafka: relaxed durability is acceptable. Understand the recovery cost of lost data before choosing a durability level.
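In PostgreSQL this trade-off can be made per transaction rather than server-wide. SET LOCAL scopes the setting to the current transaction only, and it reverts automatically at COMMIT or ROLLBACK. The table names below are illustrative:

```sql
BEGIN;
-- Relax durability for this transaction only: the commit is
-- acknowledged without waiting for the WAL fsync. Acceptable for
-- data that can be reprocessed from source; never for financial data.
SET LOCAL synchronous_commit = off;
INSERT INTO analytics_events SELECT * FROM staging_events;
COMMIT;
```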
Writing and Managing Transactions — The Practical Patterns
Understanding ACID conceptually is necessary. Using transactions correctly in code is what actually protects your data. Here are the practical patterns every data engineer must know.
Basic transaction syntax
-- SQL transaction syntax:
BEGIN; -- start transaction (also: START TRANSACTION)
UPDATE accounts SET balance = balance - 5000 WHERE id = 'A001';
UPDATE accounts SET balance = balance + 5000 WHERE id = 'B001';
INSERT INTO transfer_log (from_id, to_id, amount, ts)
VALUES ('A001', 'B001', 5000, NOW());
COMMIT; -- make all changes permanent
-- If something goes wrong:
BEGIN;
UPDATE accounts SET balance = balance - 5000 WHERE id = 'A001';
-- Something fails here...
ROLLBACK; -- undo all changes in this transaction
-- SAVEPOINT — partial rollback within a transaction:
BEGIN;
UPDATE orders SET status = 'confirmed' WHERE order_id = 9284751;
SAVEPOINT after_confirm;
UPDATE inventory SET stock = stock - 1 WHERE item_id = 'MI-001';
-- If inventory update fails:
ROLLBACK TO SAVEPOINT after_confirm; -- undo only inventory update
-- Order confirmation is preserved, inventory update is rolled back
-- Continue with other work...
COMMIT;
-- Python with psycopg2:
with psycopg2.connect(conn_string) as conn:
with conn.cursor() as cur:
try:
cur.execute("UPDATE accounts SET balance = balance - 5000 WHERE id = 'A001'")
cur.execute("UPDATE accounts SET balance = balance + 5000 WHERE id = 'B001'")
conn.commit() # explicit commit
except Exception:
conn.rollback() # explicit rollback on any error
raise
# Python with context manager (autocommit=False by default in psycopg2):
with psycopg2.connect(conn_string) as conn:
# conn is automatically committed on __exit__ if no exception
# conn is automatically rolled back on __exit__ if exception occurs
with conn.cursor() as cur:
cur.execute("UPDATE ...")
Transaction scope — the most common mistake
The most common transaction mistake in data pipelines is making transactions too large. Wrapping one million row inserts in a single transaction means holding locks on all resources for the duration of the entire operation. Long transactions: block other queries waiting for locks, consume enormous amounts of WAL space, prevent VACUUM from reclaiming dead row versions (as covered in Module 09), and take a long time to roll back if they fail.
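The micro-batch pattern depends on a chunking helper like the chunks() used in the sketch below it; a minimal implementation:

```python
from itertools import islice

def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

# One transaction per batch, not one per pipeline run
batches = list(chunks(range(25), 10))
assert [len(b) for b in batches] == [10, 10, 5]
assert batches[-1] == [20, 21, 22, 23, 24]
```

Because it consumes any iterable lazily, the same helper works for a list of dicts, a database cursor, or a generator reading a file.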
TOO LARGE (one transaction for the entire batch):
BEGIN;
INSERT INTO orders VALUES (row 1);
INSERT INTO orders VALUES (row 2);
...
INSERT INTO orders VALUES (row 1,000,000); -- after 45 minutes...
COMMIT;
Problems:
→ Holds row locks for 45 minutes
→ If row 900,000 fails, ALL 900,000 previous rows are rolled back
→ 45 minutes of WAL accumulation before any data is visible
→ Other queries that need those rows are blocked for 45 minutes
CORRECTLY SIZED (one transaction per micro-batch):
BATCH_SIZE = 10_000
for batch in chunks(all_orders, BATCH_SIZE):
with conn.cursor() as cur:
try:
cur.execute("BEGIN;")
for order in batch:
cur.execute("INSERT INTO orders ...", order)
cur.execute("COMMIT;")
checkpoint.save(batch[-1]['order_id']) # track progress
except Exception as e:
cur.execute("ROLLBACK;")
# Only this batch of 10,000 rows is lost
# Previous committed batches are safe
# Retry only from the last checkpoint
raise
Benefits:
→ Each transaction holds locks for seconds, not hours
→ On failure, only 10,000 rows need retrying (not 1,000,000)
→ Committed data is visible progressively
→ Checkpoint tracking enables resumable pipelines
RULE: keep transactions as small as possible while still being
atomic for the unit of work that must succeed or fail together.
For pipelines: one transaction per batch, not one transaction
for the entire pipeline run.
Autocommit — the hidden default that catches beginners
Most database clients operate in autocommit mode by default — every statement is automatically wrapped in its own transaction and committed immediately. This means there is no "undo" for a plain UPDATE or DELETE run without explicit BEGIN. Understanding this prevents the classic mistake of running a bulk update, realising it was wrong, and discovering there is nothing to roll back.
PSQL (PostgreSQL command line) — autocommit is ON by default:
UPDATE orders SET status = 'cancelled'; -- Runs and commits immediately
-- There is no ROLLBACK possible after this.
-- If you meant to add a WHERE clause, the data is already modified.
Safe practice for destructive operations:
BEGIN; -- start explicit transaction
UPDATE orders SET status = 'cancelled'; -- run the update
SELECT count(*) FROM orders WHERE status = 'cancelled'; -- verify
-- If count looks wrong:
ROLLBACK; -- undo safely
-- If count looks right:
COMMIT; -- make permanent
PYTHON / psycopg2 — autocommit is OFF by default:
conn = psycopg2.connect(...)
cur = conn.cursor()
cur.execute("UPDATE orders SET status = 'cancelled'")
# NOT committed yet! Still in an implicit transaction.
# conn.commit() must be called explicitly.
# Without conn.commit(), changes are LOST when connection closes.
# This is why pipelines must call conn.commit() explicitly
# or use context managers that commit on clean exit.
SQLALCHEMY — autocommit behaviour:
session.execute(update_stmt) # not committed
session.commit() # committed
# or
with session.begin():
session.execute(update_stmt) # commits when with block exits cleanly
# rolls back if exception raised
THE SAFEST RULE:
In any data pipeline code:
1. Always use explicit BEGIN/COMMIT (not autocommit)
2. Always wrap writes in try/except with explicit ROLLBACK on error
3. Test your rollback path — not just the happy path
ACID in Data Warehouses, Lakehouses, and NoSQL
ACID was first implemented in traditional relational databases. As the data landscape expanded, different systems made different choices about which ACID properties to support, how strictly, and at what performance cost. Understanding these choices tells you what you can and cannot rely on in each system.
| System | Atomicity | Consistency | Isolation | Durability |
|---|---|---|---|---|
| PostgreSQL | ✓ Full — WAL-based rollback | ✓ Full — all constraints enforced | ✓ MVCC — configurable isolation levels | ✓ Full — WAL fsync before commit |
| MySQL (InnoDB) | ✓ Full — redo/undo log | ✓ Full — all constraints enforced | ✓ MVCC — Repeatable Read default | ✓ Full — innodb_flush_log_at_trx_commit |
| Snowflake | ✓ Full — multi-statement transactions | ⚠ Partial — PK/FK/UNIQUE defined but NOT enforced at runtime (NOT NULL is enforced) | ✓ Snapshot isolation — Serializable available | ✓ Full — multi-copy cloud storage |
| BigQuery | ✓ Per-statement always; multi-statement transactions supported via BEGIN TRANSACTION (with limits) | ⚠ Partial — no enforced FK/PK/UNIQUE at runtime | ✓ Snapshot isolation per query | ✓ Full — Colossus distributed storage |
| Delta Lake (Lakehouse) | ✓ Full — transaction log ensures all-or-nothing | ⚠ Partial — schema enforced, constraints not enforced | ✓ Snapshot isolation via transaction log | ✓ Full — object storage durability |
| Cassandra | ⚠ Lightweight transactions (LWT) only — expensive, use rarely | ⚠ Eventual consistency — AP system | ⚠ None by default — concurrent writes merge via LWW | ✓ Configurable consistency level on writes |
| MongoDB | ✓ Multi-document transactions since v4.0 (replica sets; sharded clusters since v4.2) | ⚠ Schema validation optional — not enforced by default | ✓ Snapshot isolation in transactions | ✓ Configurable write concern (w: majority) |
BASE — What NoSQL Systems Do Instead
Not all databases implement ACID. Many NoSQL systems — Cassandra, DynamoDB, CouchDB — use a different model called BASE, which trades consistency for availability and performance. Understanding BASE is important because many of the source systems data engineers ingest from are BASE systems.
What BASE means for data engineering pipelines
When you ingest data from a BASE system (Cassandra, DynamoDB, CouchDB), you are ingesting from a system that may have temporarily inconsistent data across its nodes. The practical implications:
CASSANDRA (AP / eventual consistency):
Problem 1 — Duplicate writes:
At-least-once delivery semantics mean the same event can be
written to Cassandra more than once during network issues.
Your extraction will see duplicate rows.
Solution: always deduplicate on (partition_key, clustering_key)
in your Silver transformation. Keep the row with the
most recent writetime().
Problem 2 — Slightly stale reads:
If you read from a non-primary replica (LOCAL_ONE consistency),
you may read a row version that was updated seconds ago on another
node but hasn't replicated yet.
Solution: use QUORUM consistency for extractions where accuracy matters.
QUORUM requires a majority of replicas to agree → no stale reads.
Problem 3 — Tombstones (delete markers):
Cassandra "deletes" rows by writing a tombstone marker.
The actual data is still present until compaction removes it.
An extraction that happens between the tombstone write and compaction
may see both the tombstone and the original data.
Solution: filter out tombstones. Check TTL columns and deletion flags.
DYNAMODB (AP by default / CP with strongly consistent reads):
Default: eventually consistent reads (may be slightly stale)
Option: strongly consistent reads (read from primary, 2× cost)
For DE pipelines: ALWAYS use strongly consistent reads (ConsistentRead=True)
to ensure you do not extract stale data.
A Payment Pipeline Failure — Tracing it Through ACID
It is the last day of March 2026. The finance team is running the month-end reconciliation between the payment gateway's records and your data warehouse. The payment gateway shows ₹4,82,73,841 in settlements for March. Your warehouse shows ₹4,81,98,293. A ₹75,548 discrepancy. You are assigned to find it.
Investigation step 1 — check for pipeline failures: You look at the Airflow dashboard. March 14th's settlement ingestion pipeline shows a yellow warning — it completed but with a retry. The first attempt failed at 3:17 AM; the retry succeeded at 3:24 AM.
Investigation step 2 — check atomicity: You look at the pipeline logs for March 14th's first attempt. It failed with a database connection timeout at row 8,241 of 15,892. Because the pipeline was not using explicit transactions, the first 8,240 rows were committed before the failure. The retry then loaded all 15,892 rows again. Rows 1–8,240 exist twice.
The root cause: The pipeline was loading settlement rows with plain INSERT statements without a transaction boundary. When the connection timed out, 8,240 rows were already committed. The retry's ON CONFLICT DO UPDATE clause should have handled duplicates — except that settlement_id (the conflict key) had no UNIQUE constraint in the warehouse. The INSERT silently created duplicate rows instead of upserting.
The fix: You deduplicate the affected date's data, add a UNIQUE constraint on settlement_id, and rewrite the pipeline to wrap each file's worth of inserts in an explicit transaction. The ₹75,548 discrepancy resolves to zero. Total time to find and fix: four hours.
The lesson: Two missing things — an explicit transaction wrapping the batch and a UNIQUE constraint on the business key — allowed a normal operational failure (connection timeout) to produce a financial data discrepancy. ACID properties are not academic. They are what makes the difference between data you can trust for financial decisions and data that quietly diverges from reality.
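The deduplication step from the fix can be sketched in pure Python: keep exactly one row per business key, preferring the most recently loaded copy. The row shape here is illustrative.

```python
def dedupe_on_key(rows: list[dict], key: str, ts: str) -> list[dict]:
    """Keep one row per business key — the most recently loaded wins."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"settlement_id": "S1", "amount": 75548, "loaded_at": "03:17"},
    {"settlement_id": "S1", "amount": 75548, "loaded_at": "03:24"},  # retry duplicate
    {"settlement_id": "S2", "amount": 1200,  "loaded_at": "03:24"},
]
clean = dedupe_on_key(rows, "settlement_id", "loaded_at")
assert len(clean) == 2
assert {r["settlement_id"] for r in clean} == {"S1", "S2"}
```

In the warehouse itself the same logic is usually a ROW_NUMBER() OVER (PARTITION BY settlement_id ORDER BY loaded_at DESC) filter, backed by the UNIQUE constraint so the duplicates cannot recur.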
🎯 Key Takeaways
- ✓ ACID solves four problems that arise in concurrent databases: partial updates (Atomicity), constraint violations after update (Consistency), reading in-progress changes (Isolation), and losing committed data on crash (Durability). Every database without ACID is vulnerable to all four.
- ✓ Atomicity guarantees all-or-nothing execution. A transaction with ten operations either commits all ten or rolls back all ten. No partial state. Implemented via the WAL — changes are only made permanent after the commit record is flushed to disk.
- ✓ Consistency ensures the database moves from one valid state to another valid state. The database enforces schema constraints automatically. Business rules not expressible as constraints must be enforced by your pipeline code and data quality tests.
- ✓ Isolation prevents transactions from seeing each other's in-progress changes. Four isolation levels trade correctness for performance: READ UNCOMMITTED (no protection), READ COMMITTED (no dirty reads, PostgreSQL default), REPEATABLE READ (no non-repeatable reads), SERIALIZABLE (full isolation, slowest).
- ✓ Durability guarantees that committed data survives crashes. Implemented via WAL fsync — the commit response is only sent after the WAL is on durable storage. Every committed transaction can be recovered on restart by replaying the WAL.
- ✓ Transactions must be correctly sized for pipelines. One million rows in one transaction holds locks for the entire duration, wastes all work on failure, and consumes enormous WAL space. Batch into 10,000–100,000 row transactions with checkpoint tracking for resumability.
- ✓ Autocommit is ON by default in most SQL clients — every statement is its own transaction. In Python psycopg2 it is OFF — you must call conn.commit() explicitly. Understand your client's default and always use explicit transaction control for pipeline code.
- ✓ Cloud data warehouses (Snowflake, BigQuery, Redshift) provide atomicity, isolation, and durability — but do NOT enforce PRIMARY KEY, UNIQUE, or FOREIGN KEY constraints at runtime. Your dbt tests are the consistency enforcement layer for warehouse data.
- ✓ BASE (Basically Available, Soft State, Eventually Consistent) is the alternative to ACID used by Cassandra, DynamoDB, and other AP NoSQL systems. BASE systems trade consistency for availability. Always deduplicate data extracted from BASE systems — at-least-once delivery means duplicates are expected.
- ✓ The most common ACID violation in production data pipelines is not a theoretical failure — it is missing explicit transaction boundaries on batch writes combined with missing UNIQUE constraints on business keys. These two omissions together turn a routine connection timeout into a financial data discrepancy.