ETL vs ELT — History, Difference, When to Use Each
Why ETL dominated for 30 years, why ELT replaced it, and when each still belongs in a modern stack.
ETL vs ELT — More Than a Letter Swap
The difference between ETL and ELT is not which letters appear in what order. It is a fundamental difference in where transformation logic lives, who can see and change it, what happens to the raw data, how the pipeline fails, and what tools are used to build and maintain it. Two teams that choose different patterns for the same data problem end up with architectures that look and behave completely differently five years later.
Understanding this distinction deeply matters for three reasons. First, you will be asked about it in every data engineering interview. Second, you will inherit codebases built on both patterns and need to extend them intelligently. Third, you will make the choice yourself for new projects and the decision will compound in importance as the platform scales.
Why ETL Dominated for 30 Years — And Why It No Longer Does
ETL was not a design choice — it was a constraint. Understanding the constraint explains everything about why ETL looks the way it does and why ELT became possible only when specific infrastructure conditions changed.
The 1990s–2000s: storage was expensive, compute was centralised
ECONOMIC REALITY OF DATA WAREHOUSING IN 2000:
On-premises warehouse hardware (Teradata, IBM Netezza, Oracle Exadata):
Storage cost: $5,000 – $50,000 per terabyte
Typical warehouse: 5–20 TB = $25,000 – $1,000,000 just for storage
Hardware refresh: every 3–5 years, another multi-million-dollar outlay
CONSEQUENCE: warehouse storage was precious
You could NOT afford to store raw, unprocessed data in the warehouse.
Every byte had to earn its place. Only clean, modelled, business-ready
data was allowed through the warehouse door.
THEREFORE: transformation MUST happen before loading (ETL)
Extract from source systems (nightly, during off-peak hours)
Transform on a separate processing server (ETL server)
→ clean, type, deduplicate, join, validate
→ discard raw data after transformation
Load ONLY the clean data into the warehouse
→ warehouse stores only modelled, compressed, valuable data
ETL tools of the era:
IBM DataStage, Informatica PowerCenter, Oracle Data Integrator,
Microsoft SSIS (SQL Server Integration Services), Talend
→ GUI-based, proprietary, $100k+ enterprise licences
→ Each transformation step configured in drag-and-drop interfaces
→ Version control was difficult; collaboration was harder
→ "ETL developers" were a separate job title from software engineers
The 2010s: cloud storage changed the economics entirely
WHAT CHANGED IN THE 2010s:
Amazon S3 launched 2006:
Storage cost: $0.023 per GB per month ($23 per TB)
Compare to on-premises warehouse hardware: $5,000–$50,000/TB
Reduction: 99.5% cheaper
→ Suddenly affordable to store ALL raw data indefinitely
Cloud data warehouses (Redshift 2012, BigQuery 2012, Snowflake 2014):
Separated compute from storage
Compute: pay-per-query or pay-per-warehouse-minute
Storage: cheap (Snowflake ~$40/TB/month, BigQuery ~$20/TB/month)
Massively parallel: 100s of nodes available on demand
→ Warehouse compute was no longer the bottleneck
CONSEQUENCE: the ETL constraint evaporated
You CAN afford to store raw data (S3 is nearly free)
You CAN afford to transform inside the warehouse (cheap on-demand compute)
You DO NOT need a separate ETL server (warehouse IS the compute engine)
THEREFORE: transformation can happen AFTER loading (ELT)
Extract from source and load raw to S3 or warehouse staging
Transform inside the warehouse with SQL
→ Raw data preserved in S3 forever
→ No separate transformation server
→ SQL replaces GUI-based ETL tools
→ dbt (data build tool, launched 2016) codifies this with SQL + Git
THE RESULT: ETL tools lost their purpose
When the warehouse can run SQL on 100 nodes at $0.25/query,
there is no reason to transform on a separate server first.
The transformation logic belongs in SQL, in the warehouse, under version control.
This is not "ETL is dead" — it is "the reason ETL was necessary no longer exists for most workloads." ETL tools and patterns still serve specific purposes that ELT cannot, which we will cover in Part 05.
ETL — Extract, Transform, Load
In ETL, the transformation engine sits between the source and the destination. Data is extracted from the source, flows through a pipeline that applies all transformations, and the output — clean, typed, ready-to-use data — is loaded into the destination. The destination never sees the raw data.
ETL data flow
ETL DATA FLOW:
SOURCE ETL ENGINE (Python/Spark/Informatica) DESTINATION
─────────────────────────────────────────────────────────────────────────
PostgreSQL → [Extract] Snowflake
orders table SELECT * FROM orders silver.orders
WHERE updated_at > checkpoint
[Transform — all in pipeline code]
├── Cast types (string → decimal)
├── Deduplicate (ROW_NUMBER window)
├── Normalise status (lowercase/trim)
├── Join to customers (enrichment)
├── Validate (amount > 0, status in set)
├── Reject invalid rows to DLQ
└── Add ingested_at, pipeline_run_id
[Load]
INSERT INTO silver.orders
ON CONFLICT (order_id) DO UPDATE ...
WHAT THE DESTINATION RECEIVES:
Typed, validated, deduplicated, enriched rows only.
Invalid rows never reach the destination.
Raw source data is never stored in the destination.
WHAT IS LOST:
The original raw data is discarded after transformation.
If a bug was introduced in the transformation:
→ Cannot reprocess raw data from the destination
→ Must re-extract from the source (if still available)
→ Or accept the incorrect data until a correction run
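The Transform stage sketched above can be made concrete with a minimal Python version of the pipeline code. This is an illustrative sketch, not FreshMart's real pipeline — the field names, statuses, and the in-memory DLQ are assumptions standing in for real infrastructure:

```python
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation

VALID_STATUSES = {"placed", "confirmed", "delivering", "delivered", "cancelled"}

def transform(raw_rows, run_id):
    """ETL-style transform: only clean rows continue to the Load step;
    invalid rows are diverted to a dead-letter queue (DLQ)."""
    clean, dlq = {}, []
    for row in raw_rows:
        try:
            amount = Decimal(row["amount"])          # cast string -> decimal
            status = row["status"].strip().lower()   # normalise status
            if amount <= 0 or status not in VALID_STATUSES:
                raise ValueError("failed validation")
        except (KeyError, InvalidOperation, ValueError) as exc:
            dlq.append({"row": row, "error": str(exc)})
            continue
        existing = clean.get(row["order_id"])
        # deduplicate: keep only the latest version of each order_id
        if existing is None or row["updated_at"] > existing["updated_at"]:
            clean[row["order_id"]] = {
                **row,
                "amount": amount,
                "status": status,
                "ingested_at": datetime.now(timezone.utc).isoformat(),
                "pipeline_run_id": run_id,
            }
    return list(clean.values()), dlq
```

The key ETL property is visible here: the function's output is the only thing the destination ever sees, and `raw_rows` is garbage-collected when the run ends.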
TYPICAL ETL TOOLS AND WHEN USED:
Python + Pandas/PySpark: Most common in modern pipelines
Apache Spark: Large-scale transformations
Apache Flink: Streaming transformations
Informatica PowerCenter: Legacy enterprise pipelines
IBM DataStage: Legacy enterprise, banking/finance
Microsoft SSIS: SQL Server ecosystem
Talend: Open-source ETL alternative
AWS Glue: Serverless Spark ETL on AWS
When ETL is genuinely the right choice
ETL remains the right pattern when raw data cannot safely touch the warehouse: when PII must be masked or hashed before it reaches any analytical system, when transformations require Python or ML capabilities that SQL cannot express, or when the destination has strict schema requirements that prevent landing raw data.
ELT — Extract, Load, Transform
In ELT, the extraction layer does as little as possible — it reads from the source and lands the data in the destination with minimal transformation. All meaningful transformation happens inside the destination using SQL, typically powered by dbt. The destination holds both the raw staging data and the transformed analytical tables.
ELT data flow
ELT DATA FLOW:
SOURCE EL LAYER (thin Python loader) DESTINATION (warehouse/lake)
──────────────────────────────────────────────────────────────────────────────────
PostgreSQL → [Extract] Snowflake
orders table SELECT * FROM orders ├── raw.orders (staging)
WHERE updated_at > checkpoint ├── silver.orders (cleaned)
└── gold.daily_rev (aggregated)
[Load — minimal transformation]
INSERT INTO raw.orders ← raw data lands here
(just type coercion for load compatibility)
↓ (separate step, runs after load)
dbt TRANSFORMS INSIDE THE WAREHOUSE:
├── silver/orders.sql:
│ WITH source AS (SELECT * FROM raw.orders),
│ deduped AS (... ROW_NUMBER ...),
│ cleaned AS (... LOWER(status), DECIMAL cast, validated ...)
│ SELECT * FROM cleaned
│
├── gold/daily_revenue.sql:
│ WITH orders AS (SELECT * FROM silver.orders)
│ SELECT DATE(created_at), store_id, SUM(amount) ...
│ FROM orders GROUP BY 1, 2
│
└── schema.yml (tests):
- name: silver_orders
columns:
- name: order_id
tests: [not_null, unique]
- name: amount
tests: [{dbt_utils.accepted_range: {min_value: 0}}]
WHAT IS PRESERVED:
raw.orders contains the original source data — always.
If a bug is found in silver.orders 6 months later:
→ dbt run --select silver.orders --full-refresh
→ Reprocess from raw.orders in the same warehouse
→ No re-extraction from source needed
WHAT THE DESTINATION CONTAINS:
raw schema: exact source data, typed minimally for load
silver schema: cleaned, validated, deduplicated
gold schema: aggregated, business-ready metrics
Why raw data preservation is ELT's most underappreciated advantage
The ability to reprocess from raw data without re-extracting from the source is transformative. In ETL, if a bug is discovered in a six-month-old transformation and the raw data was discarded, you need the source system to provide historical data again — which may be impossible if the source has deleted it, changed its schema, or has rate limits that make a six-month re-extraction impractical.
In ELT, the raw data is in your warehouse (or S3). You fix the dbt model, run dbt run --full-refresh, and six months of correctly-transformed data is available in hours. This reprocessing capability compounds in value over time as bugs are inevitably discovered.
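This recovery property can be shown in a few lines of Python. It is a deliberately toy sketch — in a real ELT stack the "full refresh" is `dbt run --full-refresh` against a raw schema, not a function call — but the mechanism is identical: because the raw rows still exist, fixing the transformation and re-running it recovers correct history:

```python
# Toy illustration: raw data preserved -> transformation bugs are recoverable.
raw_orders = [
    {"order_id": 1, "amount": "100.75"},
    {"order_id": 2, "amount": "250.00"},
]

def buggy_transform(rows):
    # Bug: truncates to whole currency units, silently losing cents.
    return [{**r, "amount": int(float(r["amount"]))} for r in rows]

def fixed_transform(rows):
    # Fix: keep the full decimal value.
    return [{**r, "amount": float(r["amount"])} for r in rows]

silver = buggy_transform(raw_orders)   # bug ships to production
# Six months later the bug is found. Because raw_orders is preserved,
# a "full refresh" is just re-running the fixed model over raw:
silver = fixed_transform(raw_orders)
```

In ETL, `raw_orders` would have been discarded after the first run, and the only path back to correct history would run through the source system.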
dbt — the tool that operationalises ELT
dbt (data build tool) is the standard transformation layer for ELT pipelines.
It takes SQL SELECT statements and manages running them in the right order,
testing the outputs, and generating documentation — all from within the warehouse.
PROJECT STRUCTURE:
freshmart_dbt/
├── models/
│ ├── staging/ # stg_ models: raw → typed (thin layer)
│ │ ├── stg_orders.sql
│ │ └── _sources.yml # define source tables here
│ ├── intermediate/ # int_ models: joins, complex logic
│ │ └── int_orders_with_customer.sql
│ └── marts/
│ ├── core/ # dim_ and fct_ models
│ │ └── fct_orders.sql
│ └── finance/ # domain-specific Gold models
│ └── daily_revenue.sql
├── tests/
├── macros/
└── dbt_project.yml
A SIMPLE dbt MODEL (models/staging/stg_orders.sql):
-- Materialised as a table in staging schema
{{ config(materialized='table') }}
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
renamed AS (
SELECT
order_id::BIGINT AS order_id,
customer_id::BIGINT AS customer_id,
amount::DECIMAL(10,2) AS order_amount,
LOWER(TRIM(status)) AS status,
created_at::TIMESTAMPTZ AS created_at
FROM source
WHERE order_id IS NOT NULL
)
SELECT * FROM renamed
dbt COMMANDS:
dbt run # run all models
dbt run -s stg_orders # run one model
dbt run -s +fct_orders # run fct_orders and all its ancestors
dbt test # run all data quality tests
dbt docs generate # generate documentation
dbt docs serve # view documentation in browser
dbt TESTS (schema.yml):
models:
- name: stg_orders
columns:
- name: order_id
tests: [not_null, unique]
- name: order_amount
tests: [not_null, {dbt_utils.accepted_range: {min_value: 0}}]
- name: status
tests:
- accepted_values:
values: [placed, confirmed, delivering, delivered, cancelled]
HOW dbt FITS INTO ELT:
1. Python/Fivetran/Airbyte: Extract + Load raw data to warehouse
2. dbt: Transform raw → staging → intermediate → Gold (inside warehouse)
3. BI tools (Metabase, Superset): Query Gold tables
No separate compute engine. No code outside the warehouse for analytics logic.
ETL vs ELT — Every Dimension
| Dimension | ETL | ELT |
|---|---|---|
| Where transform happens | In a separate processing engine before the destination | Inside the destination (warehouse/lake) after loading |
| Raw data preserved? | No — raw data discarded after transformation | Yes — raw data always in staging layer |
| Reprocessing on bug fix | Must re-extract from source (may be unavailable) | Run dbt run --full-refresh from raw already in warehouse |
| Transformation language | Python, Java, Scala, SQL, GUI tools | SQL (dbt) — occasionally Python for complex logic |
| Version control | Varies — some GUI tools have poor git integration | Native — dbt models are SQL files in git |
| Testability | Unit tests in Python, integration tests hard | dbt tests built-in, run automatically in CI |
| Compute location | Separate ETL server/cluster | The destination itself (Snowflake, BigQuery, Databricks) |
| PII handling | ✓ Can mask/hash before data reaches destination | Raw PII lands first — requires separate masking step |
| Schema flexibility at load | Low — must know schema before extraction | High — raw lands with minimal schema, transform later |
| Analyst access to logic | Hard — transformation in Python/Java code | Easy — transformation in SQL in the warehouse |
| Suitable for semi-structured | Yes — Python handles JSON natively | Yes (Snowflake VARIANT, BigQuery JSON) but SQL is awkward for deep nesting |
| Cost model | ETL compute cost + warehouse storage | Warehouse compute + storage (one platform) |
| Best tools (2026) | PySpark, Python/Pandas, AWS Glue, Flink | dbt + Fivetran/Airbyte, Snowflake, BigQuery |
| Dominant era | 1990s–2015 | 2015–present |
The Modern Stack — ELT by Default, ETL Where Necessary
The framing of "ETL vs ELT" implies a binary choice. In practice, mature data platforms use both, each handling the workloads it is best suited for. The modern pattern is ELT by default, with ETL inserted for specific transformations that cannot or should not happen inside the warehouse.
MODERN DATA PLATFORM ARCHITECTURE — BOTH PATTERNS IN USE:
SOURCES
─────────────────────────────────────────────────────────────────────────
PostgreSQL (orders, customers)
Razorpay API (payments)
ShipFast API (deliveries)
Internal customer review data (contains PII)
ML pipeline (model predictions)
INGESTION LAYER (thin EL — Python scripts / Fivetran / Airbyte)
─────────────────────────────────────────────────────────────────────────
PostgreSQL CDC → raw.orders (EL — no transformation)
Razorpay API → raw.payments (EL — minimal typing for load)
ShipFast API → raw.deliveries (EL — minimal typing for load)
ETL EXCEPTIONS — Python pipelines that transform BEFORE loading:
Customer reviews → [Python ETL: mask email, hash user_id, extract sentiment]
→ raw.reviews (PII-free, sentiment_score added)
ML predictions → [Python ETL: parse model output, apply business rules]
→ raw.risk_scores
TRANSFORMATION LAYER (dbt — ELT inside Snowflake)
─────────────────────────────────────────────────────────────────────────
raw.orders → stg_orders (typed, deduplicated)
raw.payments → stg_payments (typed, validated)
raw.deliveries → stg_deliveries (typed, validated)
raw.reviews → stg_reviews (already cleaned by ETL)
stg_orders + stg_customers → int_orders_enriched (join)
int_orders_enriched → fct_orders (Gold fact table)
fct_orders → daily_revenue (Gold aggregate)
fct_orders → customer_ltv (Gold aggregate)
SERVING LAYER
─────────────────────────────────────────────────────────────────────────
Metabase / Power BI → queries Gold tables in Snowflake
FastAPI / GraphQL → queries Silver tables for product features
ML training → reads raw.* tables in S3 directly (Spark)
DECISION RULE: ELT or ETL for each source?
Ask: can the raw data touch the warehouse safely?
Yes (no PII, no compliance issue, schema is loadable): ELT
No (PII masking needed, complex Python transform, incompatible schema): ETL
Commercial ELT tools — Fivetran, Airbyte, and the data movement layer
The EL part of ELT — the extraction and loading of raw data into the warehouse — has been largely commoditised by managed connector tools. These tools pre-build and maintain connectors to hundreds of data sources, handling authentication, pagination, incremental extraction, and schema evolution automatically.
| Tool | Type | Best for | Trade-off |
|---|---|---|---|
| Fivetran | Managed, fully hosted | SaaS sources (Salesforce, HubSpot, Stripe, Google Analytics) | Expensive ($1/1000 rows in some plans). No customisation of connector logic. |
| Airbyte | Open-source + managed cloud | When you need custom connectors or open-source flexibility | Self-hosted option requires Kubernetes management. Cloud version costs money. |
| Stitch | Managed, mid-market | Smaller teams, simpler sources | Fewer connectors than Fivetran. Limited transformation options. |
| dlt (data load tool) | Python library (open-source) | Custom API sources where you write the extraction code but want automatic schema management and loading | Requires Python code. No UI. More engineering than Fivetran. |
| Custom Python | Hand-built | Sources with unusual auth, pagination, or data structures that managed tools cannot handle | Full maintenance burden — schema changes, auth refresh, pagination bugs are yours to fix. |
ELT Anti-Patterns — Mistakes That Look Like ELT But Are Not
ELT has specific failure modes that appear regularly in teams transitioning from ETL. These anti-patterns adopt the form of ELT — raw data in staging, dbt transformations — without the substance, producing platforms that are fragile despite looking modern. The four most common:
1. Transforming in the EL layer. Business logic in the loader is ETL in disguise: invisible to analysts, untested by dbt, and it breaks the ability to reprocess from raw.
2. Not preserving raw data. If the staging layer is already cleaned — effectively already Silver — reprocessing from raw is impossible and ELT's defining advantage is lost.
3. Monolithic dbt models. One 300-line model that does typing, joins, business rules, and aggregation in a single file cannot be tested, reviewed, or reused layer by layer.
4. No tests on raw or staging models. Bugs in the early layers propagate silently into Gold tables, where they are hardest to diagnose.
Migrating a Legacy ETL Pipeline to ELT — A Realistic Scenario
FreshMart's original data platform was built in 2019. It uses a 500-line Informatica workflow that extracts from PostgreSQL, applies 40 transformation steps in a GUI-based ETL tool, and loads clean data into Redshift. The data team hates it: changes require opening the GUI, transformation logic is invisible to git, and when a bug is found, re-extraction from PostgreSQL takes 8 hours.
You are asked to migrate this to a modern ELT architecture. Here is the approach a senior data engineer uses.
Step 1 — Audit the existing ETL transformations. Export the Informatica workflow documentation and categorise every transformation:
TRANSFORMATION AUDIT (from Informatica workflow):
Step 1: Extract orders from PostgreSQL WHERE updated_at > checkpoint
→ Category: EXTRACTION — keep in Python EL layer
Step 2: Cast amount from VARCHAR to DECIMAL
→ Category: TYPE CAST — move to stg_orders.sql (dbt staging model)
Step 3: Lowercase and trim status field
→ Category: NORMALISE — move to stg_orders.sql
Step 4: Filter rows where status = 'test_order'
→ Category: BUSINESS FILTER — move to stg_orders.sql
Step 5: Join to customer lookup table for city enrichment
→ Category: ENRICHMENT JOIN — move to int_orders_with_customer.sql
Step 6: Compute order_tier (high/mid/low) based on amount thresholds
→ Category: BUSINESS RULE — move to fct_orders.sql
Step 7: Hash customer_phone for privacy
→ Category: PII MASKING — KEEP in ETL (Python), must happen BEFORE load
Step 8: Compute daily revenue aggregates
→ Category: AGGREGATION — move to gold/daily_revenue.sql
Step 9: Reject rows with NULL order_id to error file
→ Category: VALIDATION — add dbt test not_null on order_id in staging
add Python DLQ for load-time rejections
Summary:
8 of 9 steps → ELT (dbt models)
1 of 9 steps → Stays ETL (PII masking before raw data reaches warehouse)
This is the typical pattern — most logic migrates to dbt, PII stays in Python
Step 2 — Build the EL layer. Write a thin Python loader that extracts from PostgreSQL with an incremental watermark, applies the PII masking (the one step that must stay ETL), and loads to Snowflake's raw schema with no other transformations.
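The thin EL loader from Step 2 can be sketched as follows. This is a hedged illustration with in-memory stand-ins for PostgreSQL and Snowflake — a real loader would use a database driver and a checkpoint store — but it shows the two things the EL layer is allowed to do (incremental watermark, PII masking) and nothing else:

```python
import hashlib

def mask_pii(row):
    """The one transformation that must stay in ETL:
    hash PII before it can reach the warehouse."""
    masked = dict(row)
    if "customer_phone" in masked:
        masked["customer_phone"] = hashlib.sha256(
            masked["customer_phone"].encode()
        ).hexdigest()
    return masked

def extract_load(source_rows, raw_table, checkpoint):
    """Thin EL step: extract rows newer than the watermark, mask PII,
    append to the raw table with NO business logic, return new watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > checkpoint]
    raw_table.extend(mask_pii(r) for r in new_rows)
    return max((r["updated_at"] for r in new_rows), default=checkpoint)
```

Everything else — type casting, deduplication, business filters, joins — deliberately stays out of this code and lives in the dbt models of Step 3.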
Step 3 — Build the dbt models. Create stg_orders.sql (type casting, normalisation, business filter), int_orders_with_customer.sql (enrichment join), fct_orders.sql (business rules), and daily_revenue.sql (aggregation). Add schema.yml tests at each layer.
Step 4 — Run both in parallel and compare outputs. For two weeks, both the old Informatica pipeline and the new ELT pipeline run. Daily row counts, SUM(revenue), and a sample of individual rows are compared. When the outputs match, the old pipeline is decommissioned.
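The Step 4 comparison reduces to a reconciliation check: per-day row counts and revenue sums must agree between the old and new pipelines. A minimal sketch, using Python lists as stand-ins for the two warehouse tables:

```python
from collections import defaultdict

def daily_summary(rows):
    """Aggregate row count and total revenue per day for reconciliation."""
    summary = defaultdict(lambda: [0, 0.0])
    for r in rows:
        day = r["created_at"][:10]   # ISO timestamp -> YYYY-MM-DD
        summary[day][0] += 1
        summary[day][1] += r["amount"]
    return {day: (n, round(rev, 2)) for day, (n, rev) in summary.items()}

def reconcile(old_rows, new_rows):
    """Return the set of days on which the two pipelines disagree."""
    old, new = daily_summary(old_rows), daily_summary(new_rows)
    return {d for d in old.keys() | new.keys() if old.get(d) != new.get(d)}
```

An empty result over the two-week parallel run is the signal that the old Informatica pipeline can be decommissioned; any non-empty day pinpoints where to sample individual rows.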
The result after migration: the transformation logic is a handful of small SQL files in git. Any engineer on the team can read, review, and change them. CI runs dbt tests on every PR. When a bug is found six months later, it is fixed in one SQL file and dbt run --full-refresh reprocesses from the preserved raw data in 40 minutes. The 8-hour re-extraction from PostgreSQL is gone forever.
🎯 Key Takeaways
- ✓ ETL transforms before loading — raw data is processed in a separate engine and only clean data reaches the destination. ELT loads raw data first and transforms inside the destination using SQL. The letters describe where transformation happens, which determines tool choice, raw data preservation, and reprocessing capability.
- ✓ ETL dominated for 30 years because warehouse storage was expensive ($5,000–$50,000/TB). You could not afford to store raw data, so it had to be cleaned before entering. Cloud storage ($23/TB on S3) and cloud warehouses (cheap on-demand SQL compute) eliminated this constraint, making ELT economically rational.
- ✓ ELT's defining advantage is raw data preservation. In ELT, the raw staging layer always contains the original source data. When a transformation bug is discovered months later, fix the dbt model and rerun with dbt run --full-refresh. In ETL, if raw data was discarded, historical correction requires re-extraction from the source, which may be unavailable.
- ✓ ETL is still the right choice when: raw data contains PII that must be masked before touching any analytical system, transformations require Python or ML capabilities unavailable in SQL, or the destination has strict schema requirements that prevent landing raw data.
- ✓ dbt is the standard transformation layer in ELT. SQL SELECT statements in text files, version-controlled in git, with built-in testing (not_null, unique, accepted_values, relationships) and auto-generated documentation. dbt test runs in CI on every PR, catching data quality issues before they reach production.
- ✓ The modern pattern is ELT by default with targeted ETL exceptions. Most logic lives in dbt SQL models. Python ETL is used only for PII masking, ML inference, and complex transformations unavailable in SQL. This is not a binary choice — mature platforms use both.
- ✓ The staging layer in ELT must contain near-raw source data. Never apply business logic in the EL loader. Business logic belongs in dbt models where it is visible, version-controlled, tested, and independent of the loading mechanism. Applying business logic in the loader makes reprocessing from raw ineffective.
- ✓ Fivetran and Airbyte are managed EL tools that pre-build connectors to hundreds of SaaS sources. Use them for standard sources (Salesforce, Stripe, Google Analytics) to save engineering time. Build custom Python EL for proprietary APIs, internal databases, and sources with unusual authentication or pagination.
- ✓ The four ELT anti-patterns to avoid: transforming in the EL layer (secretly doing ETL), not preserving raw data (staging layer is already Silver), monolithic dbt models (one 300-line file doing everything), and no tests on raw or staging models (bugs propagate silently to Gold).
- ✓ When comparing ETL vs ELT for reliability: both can enforce data quality equally well with the same checks. But ELT additionally preserves the ability to recover from both known and unknown bugs by reprocessing from raw. Over the lifetime of a pipeline — typically years — this reprocessing capability is the more important reliability property.