Azure Data Factory (ADF)
ADF is the orchestration backbone of most Azure data pipelines. It copies data from 90+ sources through built-in connectors, triggers Databricks notebooks, handles failures and retries, and runs everything on a schedule — all without writing infrastructure code.
What ADF is — and what it is not
Azure Data Factory is not a processing engine. It does not transform data. It does not run Python or SQL. What it does is move data and orchestrate other services that do those things. Think of it as the project manager of your data pipeline — it tells everything else what to do, in what order, and handles what happens when something goes wrong.
In a typical Azure pipeline, ADF handles three things: connecting to source systems and copying raw data into your data lake, triggering Databricks notebooks to run transformations in the right sequence, and scheduling all of this to happen automatically on a daily, hourly, or event-driven basis.
The six core concepts
Pipeline: the top-level container. A logical grouping of activities that together perform a task. "Ingest Sales Data" is one pipeline. You can have dozens in one ADF instance.
Activity: a single step inside a pipeline. Three types: data movement (Copy Data), transformation (run Databricks, run SQL), and control flow (If Condition, ForEach, Wait).
Dataset: a named reference to data — a specific table, file, or folder. Points to a Linked Service. For example: "the sales table in this SQL Server" or "the /bronze/sales/ folder in ADLS Gen2".
Linked Service: the connection definition. Holds the connection string (or Key Vault reference) for a data source. Create one per external system and reuse it everywhere.
Trigger: what starts a pipeline run. Three types: Schedule (set time), Storage Event (new file arrives), Tumbling Window (for historical backfills).
Integration Runtime (IR): the compute ADF uses to run activities. Azure IR handles cloud-to-cloud. Self-hosted IR runs on-premises to connect ADF to on-prem databases.
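To see how Datasets and Linked Services fit together, here is a sketch of a Dataset definition in ADF's JSON (the names DS_ADLS_Bronze_Sales and LS_ADLS_Bronze are illustrative). Note that the Dataset only describes where the data lives and what shape it has — the connection itself is resolved through the Linked Service:

```json
// Sketch of a Dataset — a named pointer to the /bronze/sales/ folder,
// resolved through an assumed LS_ADLS_Bronze Linked Service
{
  "name": "DS_ADLS_Bronze_Sales",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "LS_ADLS_Bronze", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "bronze",
        "folderPath": "sales"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

Because the Dataset holds no credentials, you can repoint an entire factory at a new storage account by editing one Linked Service.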
Linked Services — always use Key Vault
Every data source ADF connects to needs a Linked Service. The most important rule: never store credentials directly in the Linked Service. Always reference Azure Key Vault. This is a hard requirement in every production environment, and interviewers specifically ask about this.
// ADF Linked Service — how ADF connects to ADLS Gen2
// Configure once in the ADF UI, then reuse across all pipelines
{
  "name": "LS_ADLS_Bronze",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://yourstorageaccount.dfs.core.windows.net",
      "accountKey": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "LS_KeyVault", "type": "LinkedServiceReference" },
        "secretName": "adls-storage-key"
      }
    }
  }
}

A real pipeline — Copy + Databricks in sequence
Here is what a typical ADF pipeline looks like. Two activities chained together: a Copy activity that moves data from SQL Server to Bronze, then a Databricks Notebook activity that runs once the copy succeeds. The dependsOn field is how you control execution order — the second activity only runs if the first one succeeded.
// ADF Pipeline — Copy activity + Databricks Notebook chained together
{
  "name": "PL_Ingest_Sales_Daily",
  "properties": {
    "activities": [
      {
        "name": "Copy_Sales_To_Bronze",
        "type": "Copy",
        "inputs": [{ "referenceName": "DS_SQLServer_Sales", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "DS_ADLS_Bronze_Sales", "type": "DatasetReference" }],
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "SELECT * FROM dbo.Sales WHERE LoadDate = '@{pipeline().parameters.runDate}'"
          },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "Run_Databricks_Transform",
        "type": "DatabricksNotebook",
        "dependsOn": [{ "activity": "Copy_Sales_To_Bronze", "dependencyConditions": ["Succeeded"] }],
        "linkedServiceName": { "referenceName": "LS_Databricks", "type": "LinkedServiceReference" },
        "typeProperties": {
          "notebookPath": "/Shared/bronze_to_silver",
          "baseParameters": { "run_date": "@pipeline().parameters.runDate" }
        }
      }
    ],
    "parameters": {
      "runDate": { "type": "string", "defaultValue": "2024-01-01" }
    }
  }
}

Two details worth noting. The Databricks activity needs a linkedServiceName pointing at your Databricks workspace (LS_Databricks here is an illustrative name). And parameter default values must be literals: expressions like @utcNow() are not evaluated in defaults, so pass the real run date from the trigger instead.

Trigger types and when to use each
Schedule: runs at a fixed time on a recurring schedule. The most common type. Example: every day at 2am UTC.
Storage Event: fires when a file lands in ADLS Gen2. A partner drops a CSV — the pipeline triggers automatically.
Tumbling Window: runs for non-overlapping time windows. Perfect for reprocessing 90 days of history one day at a time, in parallel.
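As a sketch, here is what a daily 2am UTC Schedule trigger attached to the pipeline above looks like (the trigger name TR_Daily_2AM is illustrative). This is also where the run date should come from — the trigger passes its scheduled time into the pipeline's runDate parameter:

```json
// Sketch of a Schedule trigger — fires PL_Ingest_Sales_Daily every day at 2am UTC
// and passes the scheduled date into the pipeline's runDate parameter
{
  "name": "TR_Daily_2AM",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z",
        "timeZone": "UTC",
        "schedule": { "hours": [2], "minutes": [0] }
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "PL_Ingest_Sales_Daily", "type": "PipelineReference" },
        "parameters": { "runDate": "@formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd')" }
      }
    ]
  }
}
```

Storage Event and Tumbling Window triggers follow the same shape, with typeProperties describing the blob path filter or the window size instead of a recurrence.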
🎯 Key Takeaways
- ✓ ADF is an orchestration tool — it moves data and triggers other services; it does not transform data itself
- ✓ Six core concepts: Pipelines (containers), Activities (steps), Datasets (data references), Linked Services (connections), Triggers (what starts a run), Integration Runtimes (compute)
- ✓ Always store credentials in Azure Key Vault — never hardcode them directly in Linked Services
- ✓ dependsOn controls execution order — the next activity only runs when the previous one succeeds
- ✓ Three trigger types: Schedule (time-based), Storage Event (file arrives), Tumbling Window (backfills)
- ✓ The Monitor tab in ADF is your first stop when a pipeline run fails in production
- ✓ ADF integrates natively with Databricks — triggering notebooks with parameters is the most common pipeline pattern
Built Azure Data Factory pipelines orchestrating Databricks notebook activities with dependency chaining and retry logic
Configured ADF Schedule and Storage Event triggers to automate daily batch pipeline execution at 2am UTC
Implemented ADF pipeline monitoring and alerting using Azure Monitor — achieving 99.5% pipeline success rate