Project 03 – Parameterized Pipeline with Run Date
Build a fully automated pipeline where you pass a date at runtime and ADF constructs the correct file names and folder paths automatically. Add a scheduled trigger and the pipeline runs every night at midnight with zero human involvement.
A pipeline that takes a date as input, builds the correct file names and folder paths automatically, and copies all 10 store files into date-partitioned ADLS folders – triggered automatically every night at midnight.
Real World Problem
Let's be honest about what Projects 01 and 02 did and did not solve:
What they solved:
- Moving one file to the cloud (Project 01)
- Moving multiple files with ForEach (Project 02)

What they did not:
- File names are the same every day
- Someone still presses Debug manually
- Miss a day and that data is gone
- No way to tell which file belongs to which day
Here is what FreshMart's IT team actually needs:
"Every night at 11:30 PM the billing system exports files automatically. The file names include the date – like store_ST001_sales_20240115.csv for January 15th. We need a pipeline to run automatically at midnight, pick up that night's files, and copy them to ADLS – without anyone pressing a button."
This is how every production data pipeline in the real world works.
Concepts You Must Understand First
Why Pass run_date as a Parameter?
This is the most important design decision in this project. Here is what happens without a parameter:
- Pipeline fails on Monday
- You rerun it on Tuesday
- It processes Tuesday's data again
- Monday's data is lost forever
- You cannot reprocess historical dates

And here is what happens with a run_date parameter:
- Pipeline fails on Monday
- You rerun with run_date = "2024-01-15"
- It correctly reprocesses Monday's data
- No data lost. Full control.
- Backfill any past date anytime
The run_date parameter is how you achieve it.

Important ADF Limitation – Parameter Defaults Cannot Be Expressions
@{formatDateTime(utcNow(), 'yyyy-MM-dd')} → ADF treats this as plain text, not an expression, and you get an error. The fix is simple: use a plain static date as the default value, and let the trigger pass the real dynamic date at runtime.
2024-01-15 → a plain static date works perfectly as a default. It is used during Debug; the trigger passes the real date when it fires.
How Do Dynamic Expressions Work with Dates?
ADF has built-in date formatting functions. These are the ones we use in this project:
@pipeline().parameters.run_date → "2024-01-15" (exactly what you passed in)
@formatDateTime(pipeline().parameters.run_date, 'yyyyMMdd') → "20240115" (no dashes – for file names)
@formatDateTime(pipeline().parameters.run_date, 'yyyy-MM-dd') → "2024-01-15" (with dashes – for folder names)
@formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd') → the night the trigger fired, e.g. "2024-01-16"
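Outside ADF, the same two format patterns map to Python's strftime codes. A quick sketch, for illustration only (Python is not part of the pipeline):

```python
from datetime import date

# Stand-in for @pipeline().parameters.run_date
run_date = date(2024, 1, 15)

# ADF 'yyyyMMdd'   ->  strftime '%Y%m%d'   (no dashes, for file names)
file_part = run_date.strftime("%Y%m%d")

# ADF 'yyyy-MM-dd' ->  strftime '%Y-%m-%d' (with dashes, for folder names)
folder_part = run_date.strftime("%Y-%m-%d")

print(file_part)    # 20240115
print(folder_part)  # 2024-01-15
```

Note that ADF's patterns are .NET-style: 'MM' is months and 'mm' would be minutes, so mixing them up silently produces wrong names.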
The @{ } inside a text string is called string interpolation. ADF evaluates what is inside the braces and inserts the result:
store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv → produces store_ST001_sales_20240115.csv (when item() = ST001 and run_date = 2024-01-15)
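If you think of @{ } as an f-string, the behavior becomes familiar. A rough Python analogue (illustrative only; the real evaluation happens inside ADF):

```python
from datetime import datetime

run_date = "2024-01-15"  # pipeline parameter arrives as yyyy-MM-dd text
item = "ST001"           # the current ForEach item

# Equivalent of formatDateTime(pipeline().parameters.run_date, 'yyyyMMdd')
yyyymmdd = datetime.strptime(run_date, "%Y-%m-%d").strftime("%Y%m%d")

# The f-string plays the role of ADF's @{ } string interpolation
file_name = f"store_{item}_sales_{yyyymmdd}.csv"
print(file_name)  # store_ST001_sales_20240115.csv
```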
Hive-Style Partitioning – The Folder Structure We Are Building
Good data engineers organize ADLS by date. This makes it easy to find any day's data and lets analytics tools skip folders they do not need β dramatically faster and cheaper to query.
raw/sales/
├── date=2024-01-15/
│   ├── store_ST001_sales_20240115.csv
│   └── ... (10 files)
├── date=2024-01-16/
│   └── ... (10 files)
└── date=2024-01-17/
    └── ... (10 files)
The date=YYYY-MM-DD folder naming convention is the industry standard, called Hive-style partitioning. Databricks, Synapse, and Athena all understand it natively.
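The payoff can be sketched in a few lines: a reader that only opens the one partition folder it needs. This is a local-filesystem stand-in for what Databricks or Synapse do against ADLS (the function name is ours, not an Azure API):

```python
from pathlib import Path

def read_partition(root: Path, run_date: str) -> list[Path]:
    """Return the CSVs for one date only -- other partitions are never scanned.

    Assumes the Hive-style layout built in this project:
    root/date=YYYY-MM-DD/*.csv
    """
    return sorted((root / f"date={run_date}").glob("*.csv"))

# A query for Jan 15 touches one folder, no matter how many dates exist:
# read_partition(Path("raw/sales"), "2024-01-15")
```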
What is a Trigger?
Schedule trigger – runs on a fixed schedule: every day at midnight, every hour, every Monday. The most common type, and what we use in this project.
Tumbling window trigger – like a schedule trigger but with built-in backfill. If the pipeline was down for 3 days, it automatically queues 3 missed runs.
Storage event trigger – fires when a file arrives in ADLS. "As soon as a new file lands, start the pipeline." We use this in Project 05.
Step 1 – Create Date-Based CSV Files
This time the file names include the date: store_ST001_sales_20240115.csv. We create files for two dates (January 15 and 16) so we can test backfill – running the pipeline for different dates without changing anything.
On your Desktop create a folder called freshmart_dated_files. Inside it create two subfolders: 20240115 and 20240116.
Desktop folder 'freshmart_dated_files' – showing two subfolders: 20240115 and 20240116
Inside 20240115, create all 10 files. Here are the first two as templates. Follow the same pattern for stores ST003–ST010.
order_id,store_id,product_name,category,quantity,unit_price,order_date
ORD1001,ST001,Basmati Rice 5kg,Grocery,12,299.00,2024-01-15
ORD1002,ST001,Samsung TV 43inch,Electronics,2,32000.00,2024-01-15
ORD1003,ST001,Amul Butter 500g,Dairy,25,240.00,2024-01-15
ORD1004,ST001,Colgate Toothpaste,Personal Care,30,89.00,2024-01-15
ORD1005,ST001,Nike Running Shoes,Apparel,5,4500.00,2024-01-15

order_id,store_id,product_name,category,quantity,unit_price,order_date
ORD2001,ST002,Sunflower Oil 1L,Grocery,18,145.00,2024-01-15
ORD2002,ST002,iPhone 14,Electronics,1,75000.00,2024-01-15
ORD2003,ST002,Amul Milk 1L,Dairy,40,62.00,2024-01-15
ORD2004,ST002,Dove Soap 100g,Personal Care,50,65.00,2024-01-15
ORD2005,ST002,Levis Jeans,Apparel,8,2999.00,2024-01-15
Create stores ST003–ST010 with the same column structure, using their store IDs and order_date = 2024-01-15. Then for the 20240116 folder, duplicate all 10 files, changing only the date in the file name, the order IDs, and the order_date column to 2024-01-16.
Inside the 20240115 folder – all 10 store CSV files with dates in their names
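Hand-writing 20 CSVs is tedious, so here is an optional Python sketch that generates them with placeholder product rows (the rows are illustrative; swap in the real templates above if you want exact data):

```python
import csv
from pathlib import Path

HEADER = ["order_id", "store_id", "product_name", "category",
          "quantity", "unit_price", "order_date"]

def make_files(root: Path, yyyymmdd: str, order_date: str) -> None:
    """Write 10 store files for one date into root/<yyyymmdd>/."""
    folder = root / yyyymmdd
    folder.mkdir(parents=True, exist_ok=True)
    for n in range(1, 11):
        store = f"ST{n:03d}"
        with (folder / f"store_{store}_sales_{yyyymmdd}.csv").open(
                "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(HEADER)
            for i in range(1, 6):  # 5 placeholder orders per store
                w.writerow([f"ORD{n}{i:03d}", store, f"Product {i}",
                            "Grocery", 10, 99.00, order_date])

root = Path.home() / "Desktop" / "freshmart_dated_files"
make_files(root, "20240115", "2024-01-15")
make_files(root, "20240116", "2024-01-16")
```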
Step 2 – Upload Files to Landing Container
Go to Azure Portal → stfreshmartdev → Containers → landing → click the store_sales folder.
Click "+ Add Directory" → name it exactly: date=2024-01-15
The date= prefix is the Hive partition convention. Keep it exactly like this in both landing and raw containers so the folder structure mirrors across both sides.
Add Directory dialog – 'date=2024-01-15' typed in
Click into date=2024-01-15 → "Upload" → select all 10 files from your 20240115 local folder → "Upload".
landing/store_sales/date=2024-01-15/ – all 10 dated CSV files uploaded
Go back to store_sales → create another directory: date=2024-01-16 → upload all 10 files from your 20240116 local folder.
landing/store_sales/ – showing two date folders: date=2024-01-15 and date=2024-01-16
The source and sink datasets we build next need two parameters each – one for the date folder, one for the file name. Both values will be passed from the pipeline at runtime.
Step 3 – Create Source Dataset With Two Parameters
In ADF Studio → Author → Datasets → "+" → "New dataset" → "Azure Blob Storage" → "Continue" → "DelimitedText" → "Continue". Name it ds_src_blob_dated_store_sales.
Click "OK" → click the "Parameters" tab → "+ New" → add BOTH parameters: run_date_folder (String) and file_name (String).
Dataset Parameters tab – both run_date_folder and file_name parameters listed
Click the "Connection" tab. Set the three path fields, starting with Container: landing.
For the Directory field: click "Add dynamic content" → in the editor, type the full expression: store_sales/@{dataset().run_date_folder} → click "OK".
For the File field: click "Add dynamic content" → under Parameters → click file_name → click "OK".
Connection tab fully configured – container 'landing', directory with dynamic expression, file with @dataset().file_name
Click 💾 Save.
Step 4 – Create Sink Dataset With Two Parameters
Click "+" next to Datasets → "Azure Data Lake Storage Gen2" → "DelimitedText". Name it ds_sink_adls_dated_sales.
Click "OK" → Parameters tab → add the same two parameters: run_date_folder (String) and file_name (String).
Sink dataset Parameters tab – run_date_folder and file_name parameters added
Click the Connection tab and set: Container: raw, Directory: sales/@{dataset().run_date_folder}, File: @{dataset().file_name}.
Sink dataset Connection tab – raw/sales/@{dataset().run_date_folder} for directory, @dataset().file_name for file
Click 💾 Save.
Step 5 – Create New Pipeline
In ADF Studio → Author → "+" next to Pipelines → "New pipeline". Name it pl_copy_store_sales_by_date in the Properties panel.
New blank pipeline canvas – name 'pl_copy_store_sales_by_date' in Properties panel
Step 6 – Add the run_date Parameter
Click on empty canvas → Parameters tab at the bottom → "+ New". Name: run_date, Type: String, Default value: 2024-01-15.
The default is 2024-01-15 – not @{formatDateTime(...)}. The trigger will pass the real dynamic date at runtime; the static default is just for when you manually Debug.
run_date parameter – default value showing plain '2024-01-15' with no expression syntax
Step 7 – Add the store_ids Array Parameter
Still in the Parameters tab → "+ New". Name: store_ids, Type: Array, Default value: ["ST001","ST002","ST003","ST004","ST005","ST006","ST007","ST008","ST009","ST010"].
Notice: in Project 02 the array stored full file names like store_ST001_sales.csv. Now we store just the store ID, like ST001. The pipeline builds the full file name using run_date. This means the array never needs to change – even as dates change every night.
Pipeline Parameters tab – both run_date (String) and store_ids (Array) parameters visible
Step 8 – Add a Pipeline Variable
We need a variable to hold the computed folder name date=2024-01-15. Computing it once in a variable means we can use it in multiple places without repeating the expression.
Click empty canvas → Variables tab → "+ New". Name: run_date_folder, Type: String.
Variables tab – run_date_folder variable of type String added
Step 9 – Add a Set Variable Activity
This activity runs first. It takes run_date (e.g. 2024-01-15) and stores date=2024-01-15 in the variable. Every other activity then reads this variable instead of re-computing it.
Left panel → expand "General" → drag "Set variable" onto the canvas.
Set variable activity placed on the main canvas
Click the Set variable activity → configure:
General Tab: Name it set_run_date_folder.
Variables Tab (inside the activity)
Click the Variables tab in the bottom properties panel (this is the activity's configuration, not the pipeline's Variables tab). Select run_date_folder as the variable name.
Click "Add dynamic content" for the Value field → type this expression in the editor:
date=@{pipeline().parameters.run_date} → produces date=2024-01-15 (when run_date is 2024-01-15)
This works because run_date already comes in as yyyy-MM-dd format; we just prepend date= to it. Simple and clean.
Set variable activity Variables tab – name 'run_date_folder', value showing date=@{pipeline().parameters.run_date}
Step 10 – Add ForEach and Connect It to Set Variable
Left panel → "Iteration & conditionals" → drag "ForEach" onto the canvas.
Now connect the two activities: hover over set_run_date_folder → drag the green arrow on its right edge → drop it onto the ForEach. This forces Set Variable to finish before ForEach starts.
Canvas – set_run_date_folder connected to ForEach_store_ids with a green arrow showing the execution order
Click the ForEach activity → configure:
General Tab: Name it ForEach_store_ids.
Settings Tab: Sequential unchecked, Batch count 4, Items: @pipeline().parameters.store_ids (via "Add dynamic content").
ForEach Settings tab – Sequential off, Batch count 4, Items showing @pipeline().parameters.store_ids
Step 11 – Add Copy Activity Inside ForEach
Click the "+" button inside the ForEach box → from the inner canvas left panel → drag "Copy data".
Copy data activity placed inside the ForEach inner canvas
Step 12 – Configure Source With Date Expressions
Click the Copy activity → bottom panel:
General Tab: Name it copy_dated_store_file.
Source Tab
Select ds_src_blob_dated_store_sales. Two Dataset properties fields appear.
Source tab – ds_src_blob_dated_store_sales selected, Dataset properties showing run_date_folder and file_name fields
For run_date_folder: click "Add dynamic content" → under Variables → click run_date_folder.
@variables('run_date_folder') → produces date=2024-01-15
Dynamic content editor – @variables('run_date_folder') expression with run_date_folder visible under Variables section
For file_name: click "Add dynamic content" → type this expression in the editor:
store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv → produces store_ST001_sales_20240115.csv (when item() = ST001 and run_date = 2024-01-15)
How the file name expression builds the value:
- store_ → "store_" (literal text)
- @{item()} → "ST001" (the current store ID from the ForEach loop)
- _sales_ → "_sales_" (literal text)
- @{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')} → "20240115" (date without dashes)
- .csv → ".csv" (literal text)

Source tab fully configured – run_date_folder showing the @variables expression, file_name showing the full dynamic file name expression
Step 13 – Configure Sink
Click the Sink tab → select ds_sink_adls_dated_sales. Two Dataset properties appear; fill them with the exact same expressions as the source.
Sink tab fully configured – same expressions as source, writing to raw/sales/date=2024-01-15/
Click the back arrow to return to the main pipeline canvas.
Main canvas – set_run_date_folder → ForEach_store_ids connected in sequence
Step 14 – Validate and Debug (Run for Jan 15)
Click "Validate" → should show no errors. Then click "Debug".
Validation successful – no errors found
The parameter dialog appears with the defaults pre-filled. Leave run_date as 2024-01-15 and click "OK".
Debug dialog – run_date = 2024-01-15, store_ids array pre-filled
Watch the canvas: Set Variable completes first (green), then ForEach starts and runs 10 iterations, 4 at a time.
Pipeline running – set_run_date_folder green, ForEach running with progress indicator
All completed – both activities showing green checkmarks
Verify in ADLS: Azure Portal → stfreshmartdev → Containers → raw → sales.
raw/sales/date=2024-01-15/ – all 10 dated files visible with correct names and timestamps
Step 15 – Test Backfill (Run for Jan 16 Without Changing Anything)
This is where parameters prove their value. Click "Debug" again and change only run_date:
Debug dialog – run_date changed to 2024-01-16, everything else the same
Click "OK". Check ADLS: you now have two date partitions.
raw/sales/ – showing BOTH date=2024-01-15 and date=2024-01-16 folders side by side
This is backfill. If a pipeline fails on any day, rerun it with that date; it fills the missing data without touching any other day's folder.
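The same idea scales to a multi-day outage: loop over the gap and rerun once per date. A small sketch of generating the run_date values (each one would go into the Debug dialog, or into a scripted rerun via the Azure CLI or SDK):

```python
from datetime import date, timedelta

def backfill_dates(start: date, end: date):
    """Yield every run_date string (yyyy-MM-dd) from start to end, inclusive."""
    d = start
    while d <= end:
        yield d.strftime("%Y-%m-%d")
        d += timedelta(days=1)

# Pipeline was down Jan 15-17 -> three reruns, each filling one partition:
for rd in backfill_dates(date(2024, 1, 15), date(2024, 1, 17)):
    print(rd)  # pass as the pipeline's run_date parameter
```

Because each run writes only into its own date=YYYY-MM-DD folder, reruns are idempotent: running the same date twice just overwrites that one partition.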
Step 16 – Create the Schedule Trigger
Go back to the main pipeline canvas → click "Add trigger" in the top toolbar → "New/Edit" → "+ New".
Top toolbar – 'Add trigger' button highlighted, dropdown showing '+ New'
The New trigger panel opens. Fill in: Name: trigger_daily_midnight, Type: Schedule, Recurrence: every 1 day starting at 00:00, Time zone: India Standard Time.
New trigger panel – name, type Schedule, recurrence set to daily at 00:00 IST filled in
Click "OK".
Step 17 – Set What the Trigger Passes to the Pipeline
After clicking OK, a "Trigger Run Parameters" dialog appears. This is where you tell the trigger what to send as run_date and store_ids each night.
Trigger Run Parameters dialog – run_date and store_ids fields to fill
For run_date: Click "Add dynamic content" and type this expression:
@{formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd')} → produces "2024-01-16" (the date the trigger was scheduled to fire)
Why trigger().scheduledTime and not utcNow()?
trigger().scheduledTime is the time ADF scheduled this trigger to fire – always exactly midnight on the right date. utcNow() is the actual clock time when the pipeline runs, which could be 12:00:03 AM – and in UTC that might be a different date than your local time. Always use trigger().scheduledTime in trigger parameters.
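You can demonstrate the off-by-one-day hazard with Python's zoneinfo: midnight IST is still the previous evening in UTC, so a UTC clock read at trigger time lands on the wrong date.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The trigger is scheduled for midnight IST on Jan 16
scheduled = datetime(2024, 1, 16, 0, 0, tzinfo=ZoneInfo("Asia/Kolkata"))

# What a UTC clock (like utcNow()) sees at that same instant: 18:30 the day before
same_instant_utc = scheduled.astimezone(ZoneInfo("UTC"))

print(scheduled.strftime("%Y-%m-%d"))         # 2024-01-16  <- the run you want
print(same_instant_utc.strftime("%Y-%m-%d"))  # 2024-01-15  <- off by one day
```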
Trigger scheduled for 2024-01-16 00:00 IST
→ trigger().scheduledTime = 2024-01-16T00:00:00
→ formatDateTime result = "2024-01-16" ✅ always correct

Trigger Run Parameters – run_date showing the @{formatDateTime(trigger().scheduledTime,'yyyy-MM-dd')} expression
For store_ids: type the array directly: ["ST001","ST002","ST003","ST004","ST005","ST006","ST007","ST008","ST009","ST010"]
Trigger Run Parameters fully filled – both run_date expression and store_ids array
Click "OK".
Step 18 – Publish Everything
Click "Publish all". The panel shows all 4 new items → click "Publish".
Publish panel – showing pipeline, 2 datasets, and trigger all listed
Successfully published – notification in top right corner
Step 19 – Manually Trigger a Run Right Now
You do not need to wait until midnight to test the trigger. On the pipeline canvas → "Add trigger" → "Trigger now".
'Trigger now' option in the Add trigger dropdown
In the Run Parameters dialog, enter a date you have files for:
Trigger now dialog – run_date and store_ids filled in
Click "OK" → go to Monitor → Pipeline runs to watch it execute.
Monitor → Pipeline runs – pl_copy_store_sales_by_date showing In Progress
Pipeline run completed – status Succeeded, run_date visible in parameters, duration shown
Step 20 – View the Trigger in Monitor
Click Monitor → Trigger runs in the left submenu.
Monitor → Trigger runs – trigger_daily_midnight listed with its next scheduled run time and Active status
The trigger is now live. Every night at midnight IST it fires automatically, passes today's date as run_date, copies all 10 store files into raw/sales/date=YYYY-MM-DD/, and nobody needs to press anything.
Before and After
Before (Projects 01–02):
- Ran only when you pressed Debug
- File names were static – same every day
- No way to reprocess a past date
- No date organization in ADLS

After (Project 03):
- Triggers automatically every night at midnight
- File names built from the run_date parameter
- Backfill any past date anytime
- ADLS organized into date=YYYY-MM-DD/ partitions
All Expressions Used in This Project
| Expression | Where Used |
|---|---|
| 2024-01-15 | run_date parameter default (plain static – no expression allowed here) |
| date=@{pipeline().parameters.run_date} | Set Variable activity – builds the folder name |
| @pipeline().parameters.store_ids | ForEach Items – the list to loop through |
| @variables('run_date_folder') | Dataset property – passes the folder to the dataset |
| store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv | Dataset property – builds the full file name |
| store_sales/@{dataset().run_date_folder} | Source dataset Directory field |
| sales/@{dataset().run_date_folder} | Sink dataset Directory field |
| @{formatDateTime(trigger().scheduledTime,'yyyy-MM-dd')} | Trigger parameter – passes the correct date nightly |
What Was Added in Project 03
| Item | Name | What It Does |
|---|---|---|
| Dataset | ds_src_blob_dated_store_sales | Source with 2 parameters: run_date_folder + file_name |
| Dataset | ds_sink_adls_dated_sales | Sink with 2 parameters: run_date_folder + file_name |
| Pipeline | pl_copy_store_sales_by_date | Set Variable → ForEach → Copy, driven by run_date |
| Parameter | run_date (String) | Date to process – controls file names and folder |
| Parameter | store_ids (Array) | List of store IDs to loop through |
| Variable | run_date_folder (String) | Computed folder name like date=2024-01-15 |
| Activity | set_run_date_folder | Set Variable – builds the date= folder name |
| Activity | ForEach_store_ids | Loops through store IDs |
| Activity | copy_dated_store_file | Copies one store file per iteration |
| Trigger | trigger_daily_midnight | Fires every night at midnight, passes today as run_date |
Key Concepts Reference
| Concept | What It Is | Why It Matters |
|---|---|---|
| run_date parameter | Date passed into the pipeline from outside | Enables backfill, reprocessing, and idempotency |
| Idempotency | Running the same date twice gives the same result | Production pipelines must be safe to rerun |
| formatDateTime() | ADF function that formats a date into a string | Builds file names and folder paths from dates |
| String interpolation | Embedding @{expressions} inside a text string | Build dynamic strings like file names |
| Set Variable activity | Computes and stores a value during the pipeline run | Avoids repeating the same expression everywhere |
| @variables('name') | Reads a variable value you set earlier | Use one computed value in multiple places |
| trigger().scheduledTime | The time the trigger was scheduled to fire | Safe, predictable way to get the date for a run |
| Hive-style partitioning | Folder naming like date=YYYY-MM-DD | Industry standard – analytics tools scan only what they need |
| Schedule trigger | Runs a pipeline on a fixed schedule | Automates nightly runs with zero human involvement |
| Backfill | Running the pipeline for a past date | Fix failed runs without affecting other dates |
Common Mistakes
Using an expression as a parameter default value
Fix: Parameter defaults must be plain static text – write 2024-01-15, not @{formatDateTime(...)}
Using utcNow() in trigger parameters instead of trigger().scheduledTime
Fix: scheduledTime is always the correct scheduled date. utcNow() can be a different date due to timezone offset.
Wrong date format in formatDateTime
Fix: Use 'yyyyMMdd' (no dashes) for file names. Use 'yyyy-MM-dd' (with dashes) for folder names and run_date.
Not connecting Set Variable → ForEach with an arrow
Fix: Without the arrow they run in parallel. ForEach starts before the variable is set – folder name is empty.
Trigger created but never fires – forgot to publish
Fix: Always Publish all after adding or changing a trigger. Triggers only activate after publishing.
So far we have only worked with files you manually uploaded to Blob Storage. In the real world, data often lives on public internet URLs β government portals, supplier servers, weather APIs, open datasets.
In Project 04 you will build a pipeline that downloads a CSV file directly from a public HTTPS URL – no manual upload needed. ADF fetches the file from the internet and drops it straight into ADLS. Same FreshMart scenario. Zero manual work.
🎯 Key Takeaways
- ✅ Pipeline parameter defaults must be plain static text – expressions like @{formatDateTime(...)} are not allowed there
- ✅ run_date as a parameter enables idempotency – rerun any past date safely without affecting other dates
- ✅ Set Variable activity runs before ForEach – always connect them with an arrow to enforce the order
- ✅ @variables('run_date_folder') reads the computed folder name – one computation, used everywhere
- ✅ String interpolation: store_@{item()}_sales_@{formatDateTime(run_date,'yyyyMMdd')}.csv builds file names at runtime
- ✅ trigger().scheduledTime is the safe way to get the date in trigger parameters – not utcNow()
- ✅ Hive-style partitioning (date=YYYY-MM-DD) is the industry standard – analytics tools understand it natively
- ✅ After publishing the trigger, use "Trigger now" to test immediately without waiting for midnight