Incremental Loading — How to Process Only New Data in Your Pipelines
Loading all data from scratch every pipeline run is safe but slow and expensive. Incremental loading — processing only new or changed records since the last run — is what makes production pipelines fast enough to run every hour instead of once a day.
Full load vs incremental load
Full load: truncate the target table, reload everything from the source. Safe — it always produces correct results. But a table with 5 years of history gets reprocessed in full on every run; at scale this becomes impractical.
Incremental load: identify records that are new or changed since the last run, process only those, and merge them into the target. Faster, cheaper, but requires careful implementation to avoid missing data or creating duplicates.
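The trade-off can be sketched in a few lines of Python. This is a toy in-memory illustration, not a real pipeline: the "tables" are lists of dicts, and the function names (full_load, incremental_load) are illustrative.

```python
from datetime import datetime

def full_load(source, target):
    """Truncate-and-reload: always correct, but reprocesses every row."""
    target.clear()
    target.extend(source)
    return len(source)          # rows processed = entire source

def incremental_load(source, target, last_watermark):
    """Process only rows newer than the last watermark."""
    new_rows = [r for r in source if r["updated_at"] > last_watermark]
    target.extend(new_rows)
    return len(new_rows)        # rows processed = only the delta
```

The return values make the cost difference concrete: the full load touches every source row on every run, while the incremental load touches only the rows past the watermark.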
Watermark-based incremental loading
The most common approach: track the maximum value of a timestamp column (updated_at, created_at) from the last successful run. On the next run, read only records where updated_at > last_watermark.
Store the watermark in a metadata table after each successful run. If the pipeline fails, the watermark is not updated — the next run reprocesses the same window, catching any missed records.
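A minimal sketch of this failure-safe watermark pattern, assuming an in-memory dict stands in for the metadata table and `process` is a hypothetical downstream write step:

```python
from datetime import datetime

# In a real pipeline this "metadata table" lives in a database, not a dict.
metadata = {"watermark": datetime(2024, 1, 1)}   # last successful high-water mark

def run_incremental(source_rows, process):
    last_wm = metadata["watermark"]
    batch = [r for r in source_rows if r["updated_at"] > last_wm]
    process(batch)                                # may raise on failure
    # Advance the watermark ONLY after the batch succeeds, so a failed
    # run leaves it untouched and the next run re-reads the same window.
    if batch:
        metadata["watermark"] = max(r["updated_at"] for r in batch)
    return len(batch)
```

The key design choice is the ordering: write first, commit the watermark second. If the watermark were advanced before the write succeeded, a crash in between would silently skip that window forever.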
This is the pattern behind Azure Data Factory's watermark-based incremental copy, AWS Glue job bookmarks, and custom PySpark incremental pipelines.
Handling late-arriving and updated records
Watermark-based loading misses records that arrive late (after the watermark has already passed) and records that are updated (updated_at changes but the record was already loaded).
For updates: use MERGE (upsert) instead of INSERT when writing to Silver or Gold. MERGE matches on primary key — if the record exists, update it; if not, insert it. Delta Lake MERGE, Synapse MERGE, and BigQuery MERGE all work this way.
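The match-on-key semantics of MERGE can be shown with a plain-Python upsert against a dict keyed by primary key. This is an illustration of the logic, not the engine implementation; `merge_into` is a hypothetical name.

```python
def merge_into(target, changed_rows, key="id"):
    """Upsert changed_rows into target (a dict keyed by primary key)."""
    inserted = updated = 0
    for row in changed_rows:
        if row[key] in target:
            target[row[key]].update(row)   # WHEN MATCHED -> update
            updated += 1
        else:
            target[row[key]] = dict(row)   # WHEN NOT MATCHED -> insert
            inserted += 1
    return inserted, updated
```

In Delta Lake the same logic is expressed declaratively, along the lines of `MERGE INTO target t USING updates u ON t.id = u.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *`; Synapse and BigQuery use closely related syntax.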
For late arrivals: add a lookback buffer to the processing window. Instead of loading records where updated_at > last_watermark, load where updated_at > last_watermark - 2 hours. This catches late arrivals at the cost of reprocessing a small window — and because the write is a MERGE, the re-read records are upserted rather than duplicated.
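A sketch of the buffered watermark query, assuming the 2-hour buffer from above (in practice you tune it to the source's worst-case lag):

```python
from datetime import datetime, timedelta

LATE_BUFFER = timedelta(hours=2)   # worst-case expected arrival lag

def select_incremental(rows, last_watermark, buffer=LATE_BUFFER):
    """Re-read a lookback window so late arrivals are not missed.
    Rows inside the buffer are read on every run, so the write side
    must be an idempotent MERGE, not a blind INSERT."""
    cutoff = last_watermark - buffer
    return [r for r in rows if r["updated_at"] > cutoff]
```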
Change Data Capture (CDC)
CDC is the most sophisticated form of incremental loading. Instead of querying the source table, you capture every INSERT, UPDATE, and DELETE operation from the source database transaction log.
Tools: Debezium (open source), AWS DMS, Azure Data Factory Data Flow with CDC support.
CDC gives you exact change history — not just the current state of changed records but every intermediate state. Essential for audit requirements, slowly changing dimensions, and low-latency replication.
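Consuming a CDC stream means replaying each operation against the target in order. The sketch below uses Debezium-style single-letter op codes ("c" create, "u" update, "d" delete) on a deliberately simplified event shape — a real Debezium envelope carries before/after images, source metadata, and transaction info.

```python
def apply_cdc_events(target, events, key="id"):
    """Replay a CDC event stream onto a target keyed by primary key."""
    for ev in events:
        op, row = ev["op"], ev["row"]
        if op in ("c", "u"):
            target[row[key]] = dict(row)   # insert or update -> upsert current state
        elif op == "d":
            target.pop(row[key], None)     # delete -> remove (deletes that
                                           # watermark loading can never see)
    return target
```

Note what this gives you that watermarks cannot: the delete is applied, and every intermediate state passed through the loop — which is exactly what audit trails and slowly changing dimensions need to capture.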