Project 05: Organize Files Automatically With Date Stamps
Stop overwriting files silently. Build a pipeline that checks whether a file exists before copying, date-stamps the output, cleans the landing zone automatically, and logs what was missing - a complete production file-management workflow.
Azure DE - Zero to Advanced
05 of 25
Beginner+
90–120 min
Real-World Problem
FreshMart's data lake is growing. After Projects 01–04, files are landing in ADLS, but new problems are showing up.
Problem 1: Files get overwritten silently
Every morning, cities.csv is downloaded and saved to raw/external/cities/cities.csv. But what happened to yesterday's file? Overwritten. Gone. No history. Three months later the data team asks what external price data looked like in January, and the answer is: we don't know.
Problem 2: Nobody knows if the source file actually arrived
The pipeline runs at midnight. What if the supplier server was down? ADF will try to copy a file that doesn't exist and fail with a cryptic error. Better behavior: check first and skip gracefully if the file is missing - don't crash the whole pipeline.
Problem 3: The landing zone is filling up with processed files
Files land in landing/store_sales/ and stay there forever after being copied to raw/. The landing zone is a staging area, not permanent storage. Processed files should be deleted after a successful copy.
The solution - a production-grade file-management pattern:
Step 1: Check whether the file exists in the landing zone (Get Metadata + If Condition)
Step 2: If the file exists → copy it to ADLS with a date stamp in the name
Step 3: After a successful copy → delete the file from the landing zone
Step 4: If the file does not exist → log a warning and continue gracefully
Concepts You Must Understand First
What is the Get Metadata Activity?
Get Metadata reads information about a file or folder - without reading the file contents. Think of it like checking a package label before opening it.
Get Metadata can return:
exists - true or false
size - file size in bytes
lastModified - when the file was last changed
itemName - the file name
itemType - "File" or "Folder"
childItems - list of files inside a folder
What is the If Condition Activity?
ADF's decision maker. It evaluates a TRUE/FALSE expression and runs different activities for each outcome.
IF (file exists AND size > 0)
THEN → Copy the file + Delete the original
ELSE → Log a warning message, skip this file
In ADF:
If Condition Activity
├── Expression: @and(activity('get_metadata_store_file').output.exists, greater(...size, 0))
├── True branch: Copy Activity → Delete Activity
└── False branch: Set Variable Activity (log "file not found")
What is the Delete Activity?
Deletes a file or folder from a storage location. After successfully copying a file from landing → ADLS, we use Delete to remove the original.
String Interpolation for Date-Stamped File Names
Original: store_ST001_sales.csv
Stamped: store_ST001_sales_20240115.csv
Expression:
@{replace(item(), '.csv', '')}_@{formatDateTime(pipeline().parameters.run_date, 'yyyyMMdd')}.csv
Breaking it down:
@{replace(item(), '.csv', '')} - removes .csv → "store_ST001_sales"
_ - literal underscore
@{formatDateTime(..., 'yyyyMMdd')} - formats the date as "20240115"
.csv - adds .csv back
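As a sanity check, the same transformation can be mirrored in plain Python. This is a local simulation only - ADF evaluates its own expression language at runtime, and the function name here is ours:

```python
from datetime import date

def stamped_name(file_name: str, run_date: date) -> str:
    # Mirrors: @{replace(item(), '.csv', '')}_@{formatDateTime(..., 'yyyyMMdd')}.csv
    base = file_name.replace(".csv", "")   # replace(item(), '.csv', '')
    stamp = run_date.strftime("%Y%m%d")    # formatDateTime(..., 'yyyyMMdd')
    return f"{base}_{stamp}.csv"

print(stamped_name("store_ST001_sales.csv", date(2024, 1, 15)))
# → store_ST001_sales_20240115.csv
```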
Result: store_ST001_sales_20240115.csv
What We Are Building - Visualized
PIPELINE: pl_file_management_daily
FOR EACH store file:
  │
  ├──▶ GET METADATA
  │      "Does store_ST001_sales_20240115.csv exist in landing?"
  │
  ├──▶ IF CONDITION
  │      "output.exists = true AND size > 0?"
  │
  ├──▶ TRUE BRANCH:
  │      │
  │      ├──▶ COPY ACTIVITY
  │      │      landing/store_sales/date=2024-01-15/store_ST001_sales_20240115.csv
  │      │      → raw/sales/date=2024-01-15/store_ST001_sales_20240115.csv
  │      │
  │      └──▶ DELETE ACTIVITY
  │             Delete: landing/store_sales/date=2024-01-15/store_ST001_sales_20240115.csv
  │
  └──▶ FALSE BRANCH:
         SET VARIABLE: missing_files += "ST001 missing for 2024-01-15 | "
         (pipeline continues - does not crash)
Step by Step Overview
PHASE 1 - Prepare (10 min)
Step 1: Upload test files to landing zone
PHASE 2 - Create New Datasets (15 min)
Step 2: Confirm existing source dataset (reuse from Project 03)
Step 3: Create Delete activity dataset
PHASE 3 - Build the Pipeline (60 min)
Step 4: Create pipeline with parameters and variables
Step 5: Add Set Variable - build run_date_folder
Step 6: Add ForEach activity
Step 7: Inside ForEach - add Get Metadata activity
Step 8: Inside ForEach - add If Condition activity
Step 9: Inside True branch - add Copy activity
Step 10: Inside True branch - add Delete activity
Step 11: Inside False branch - add Set Variable (log missing files)
Step 12: Add final summary log activity to main canvas
Step 13: Validate
Step 14: Debug - file exists scenario
Step 15: Debug - all files present scenario
Step 16: Publish
Phase 1 - Prepare
Step 1 - Upload Test Files to Landing Zone
We need files in the landing zone to test both scenarios - exists AND missing.
Azure Portal → Storage → stfreshmartdev → Containers → landing → store_sales/
Click "+ Add Directory"
Directory name: date=2024-01-15
(Screenshot: Add Directory dialog - date=2024-01-15 entered)
Click into date=2024-01-15 → click "Upload"
Upload only 5 of the 10 store files (ST001–ST005). This lets us test the missing scenario for ST006–ST010.
store_ST001_sales_20240115.csv
store_ST002_sales_20240115.csv
store_ST003_sales_20240115.csv
store_ST004_sales_20240115.csv
store_ST005_sales_20240115.csv
(Screenshot: Upload dialog - 5 files selected, ready to upload)
(Screenshot: landing/store_sales/date=2024-01-15/ showing exactly 5 files, ST001 through ST005)
Phase 2 - Create New Datasets
Step 2 - Reuse Existing Source Dataset
The Get Metadata activity will reuse ds_src_blob_dated_store_sales from Project 03 - it already has run_date_folder and file_name parameters. No new source dataset needed.
(Screenshot: Author → Datasets - ds_src_blob_dated_store_sales already exists from Project 03)
Step 3 - Create Delete Activity Dataset
In ADF Studio → Author → Datasets → "+" → "New dataset"
Search "Azure Blob Storage" → select → "Continue"
Select "DelimitedText" → "Continue"
Name: ds_delete_blob_landing
Linked service: ls_blob_freshmart_landing
File path: (leave all empty)
First row as header: ✅ Yes
Import schema: None
Click "OK" → "Parameters" tab → add TWO parameters:
Parameter 1:
Name: run_date_folder
Type: String
Parameter 2:
Name: file_name
Type: String
(Screenshot: ds_delete_blob_landing Parameters tab - run_date_folder and file_name parameters)
Click the "Connection" tab:
Container: landing
Directory → "Add dynamic content": store_sales/@{dataset().run_date_folder}
File → "Add dynamic content": @dataset().file_name
(Screenshot: ds_delete_blob_landing Connection tab - landing/store_sales/@{dataset().run_date_folder}/@dataset().file_name)
Click Save
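If you open the dataset's JSON view (the code button in ADF Studio), the result should look roughly like the sketch below. This is a hand-written approximation for orientation - the exact property layout ADF emits can vary slightly between Studio versions:

```json
{
  "name": "ds_delete_blob_landing",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "ls_blob_freshmart_landing",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "run_date_folder": { "type": "string" },
      "file_name": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "landing",
        "folderPath": {
          "value": "store_sales/@{dataset().run_date_folder}",
          "type": "Expression"
        },
        "fileName": {
          "value": "@dataset().file_name",
          "type": "Expression"
        }
      },
      "firstRowAsHeader": true
    }
  }
}
```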
Phase 3 - Build the Pipeline
Step 4 - Create Pipeline With Parameters and Variables
In ADF Studio → Author → "+" → "New pipeline"
Name: pl_file_management_daily
Description: Checks file existence, copies with date stamp, deletes from landing zone
Click the empty canvas → "Parameters" tab → add TWO parameters:
Parameter 1:
Name: run_date
Type: String
Default: 2024-01-15
Parameter 2:
Name: store_ids
Type: Array
Default: ["ST001","ST002","ST003","ST004","ST005","ST006","ST007","ST008","ST009","ST010"]
(Screenshot: Pipeline Parameters tab - run_date and store_ids parameters)
Click "Variables" tab β add THREE variables:
Variable 1:
Name: run_date_folder
Type: String
Variable 2:
Name: missing_files
Type: String
Variable 3:
Name: final_log
Type: StringADF does not allow a Set Variable activity to read and write the same variable in one step β it calls this a "self-reference" and throws an error.
missing_files is written to inside the ForEach (appending each missing store).final_log is written to after the ForEach (reads missing_files and builds the summary).Two different variables = no self-reference = no error.
Variables tab β three variables: run_date_folder, missing_files, final_log
Step 5 - Add Set Variable: Build run_date_folder
From the left panel → "General" → drag "Set variable" onto the canvas
Click it → bottom panel:
General tab:
Name: set_run_date_folder
Description: Formats run_date into a Hive-partition folder name
Click the "Variables" tab:
Name: run_date_folder
Value (Add dynamic content): date=@{pipeline().parameters.run_date}
(Screenshot: Set variable activity - name 'set_run_date_folder', value showing date=@{pipeline().parameters.run_date})
Step 6 - Add ForEach Activity
From the left panel → "Iteration & conditionals" → drag "ForEach" onto the canvas
Connect: hover over set_run_date_folder → drag the green arrow → connect to ForEach
(Screenshot: canvas - set_run_date_folder connected to ForEach with a green success arrow)
Click the ForEach → bottom panel:
General tab:
Name: ForEach_stores
Description: Loops through each store ID to check and copy files
Settings tab:
Sequential: ✅ Checked - IMPORTANT
Items: @pipeline().parameters.store_ids
(Screenshot: ForEach Settings tab - Sequential CHECKED, Items showing @pipeline().parameters.store_ids)
Why Sequential? Variables in ADF are shared across the pipeline. If two iterations run simultaneously and both try to update missing_files at the same time, one write overwrites the other and entries get lost. This is a race condition. Sequential solves it - each iteration waits its turn before writing.
Parallel risk: iterations 4 and 6 both find missing files simultaneously.
Both try to write to missing_files → one write is lost ❌
Sequential: iteration 4 runs → writes → finishes;
iteration 6 then runs → writes → finishes.
All updates preserved ✅
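The lost-update problem can be reproduced deterministically in a few lines of Python. This is a local simulation of ADF's read-modify-write Set Variable behavior, not ADF itself:

```python
# Parallel: both iterations read missing_files BEFORE either writes.
missing_files = ""
read_4 = missing_files                  # iteration 4 reads ""
read_6 = missing_files                  # iteration 6 reads "" at the same time
missing_files = read_4 + "ST004 | "     # iteration 4 writes
missing_files = read_6 + "ST006 | "     # iteration 6 overwrites -> ST004 entry lost
print(missing_files)                    # ST006 |

# Sequential: each iteration reads the latest value before writing.
missing_files = ""
for store_id in ["ST004", "ST006"]:
    missing_files = missing_files + f"{store_id} | "
print(missing_files)                    # ST004 | ST006 |
```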
Step 7 - Inside ForEach: Add Get Metadata Activity
Click the "+" button inside the ForEach box to enter the inner canvas
(Screenshot: ForEach box - '+' button inside, about to enter the inner canvas)
From the left panel → "General" → drag "Get Metadata" onto the inner canvas
(Screenshot: Get Metadata activity placed on the ForEach inner canvas)
Click Get Metadata → bottom panel:
General tab:
Name: get_metadata_store_file
Description: Checks if today's store file exists in the landing zone
Click the "Dataset" tab:
Dataset: ds_src_blob_dated_store_sales
run_date_folder → Add dynamic content: @variables('run_date_folder')
file_name → Add dynamic content: store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
(Screenshot: dynamic content editor - full dated file-name expression with @item() and formatDateTime)
Click the "Field list" tab → click "+ New" twice:
Field 1: exists
Field 2: size
(Screenshot: Field list tab - 'exists' and 'size' added as fields to retrieve)
Why both? exists tells you whether the file is there. size tells you whether it is empty (0 bytes). In production you check both - a 0-byte file exists but contains no data.
Step 8 - Inside ForEach: Add If Condition Activity
From the left panel → "Iteration & conditionals" → drag "If Condition" onto the inner canvas
Connect the green arrow from get_metadata_store_file → connect to If Condition
(Screenshot: inner canvas - get_metadata_store_file connected to If Condition with a green arrow)
Click If Condition → bottom panel:
General tab:
Name: if_file_exists
Description: Checks if the file exists AND has content (size > 0)
Click the "Activities" tab → Expression field → "Add dynamic content":
@and(activity('get_metadata_store_file').output.exists, greater(activity('get_metadata_store_file').output.size, 0))
(Screenshot: dynamic content editor - the full @and() expression with exists and greater() checks)
Breaking it down:
@and( ... , ... )
→ returns true only if BOTH conditions are true
activity('get_metadata_store_file').output.exists
→ true if the file exists, false if not
→ activity('name') reads the output of a previous activity by name
greater(activity('get_metadata_store_file').output.size, 0)
→ true if the file size is greater than 0 (not empty)
Combined:
→ true if the file EXISTS and is NOT EMPTY ✅
→ false if the file is missing OR is 0 bytes ❌
(Screenshot: If Condition Activities tab - expression showing the @and() check with exists and size)
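The same truth table can be checked with a tiny Python analogue of the expression, fed with dictionaries shaped like Get Metadata output (local simulation; the function name is ours):

```python
def if_file_exists(metadata: dict) -> bool:
    # Mirrors: @and(activity(...).output.exists, greater(activity(...).output.size, 0))
    return metadata.get("exists", False) and metadata.get("size", 0) > 0

print(if_file_exists({"exists": True,  "size": 1842}))  # True  -> TRUE branch
print(if_file_exists({"exists": True,  "size": 0}))     # False -> 0-byte file
print(if_file_exists({"exists": False}))                # False -> missing file
```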
Step 9 - Inside True Branch: Add Copy Activity
Click the pencil icon next to "True" to enter the True branch canvas
(Screenshot: If Condition Activities tab - True and False sections, pencil icon next to True highlighted)
Drag "Copy data" onto the True branch canvas
(Screenshot: Copy data activity placed on the True branch canvas)
General tab:
Name: copy_to_adls_with_datestamp
Description: Copies the store file from landing to ADLS raw/sales/ with a date stamp
Source tab:
Source dataset: ds_src_blob_dated_store_sales
run_date_folder → @variables('run_date_folder')
file_name → store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
(Screenshot: Source tab - both dataset properties filled, file_name showing the dated expression)
Sink tab:
Sink dataset: ds_sink_adls_dated_sales
run_date_folder → @variables('run_date_folder')
file_name → same dated expression as the Source: store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
(Screenshot: Sink tab - both dataset properties filled, matching the source)
Step 10 - Inside True Branch: Add Delete Activity
Still on the True branch canvas → from the left panel → "General" → drag "Delete"
Connect the green arrow from copy_to_adls_with_datestamp → connect to Delete
(Screenshot: True branch canvas - Copy activity connected to Delete activity with a green success arrow)
General tab:
Name: delete_from_landing
Description: Removes the processed file from the landing zone after a successful copy
Click the "Dataset" tab:
Dataset: ds_delete_blob_landing
run_date_folder → @variables('run_date_folder')
file_name → the dated file-name expression: store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
(Screenshot: Delete activity Dataset tab - ds_delete_blob_landing selected, both properties filled)
Click the "Logging settings" tab (optional but recommended):
Enable logging: ✅ Yes
Linked service: ls_adls_freshmart
Log folder path: logs/delete_activity/
(Screenshot: Delete activity Logging tab - enabled, log path set to logs/delete_activity/)
(Screenshot: complete True branch canvas - Copy → Delete connected with a green arrow)
Step 11 - Inside False Branch: Add Set Variable
Click the back arrow → return to the If Condition Activities tab → click the pencil next to "False"
(Screenshot: If Condition - False section pencil icon highlighted)
Drag "Set variable" onto the False branch canvas
(Screenshot: Set variable activity placed on the False branch canvas)
General tab:
Name: log_missing_file
Description: Appends the missing store ID to the missing_files variable for monitoring
Variables tab:
Name: missing_files
Value (Add dynamic content): @{variables('missing_files')}@{item()} missing for @{pipeline().parameters.run_date} | 
After ST006 and ST007 are missing: "ST006 missing for 2024-01-15 | ST007 missing for 2024-01-15 | "
(Screenshot: Set variable - Variables tab with the append expression for missing_files)
Breaking it down:
@{variables('missing_files')} - the current value (whatever was already logged)
@{item()} - the current store ID, e.g. "ST006"
" missing for " - literal text
@{pipeline().parameters.run_date} - "2024-01-15"
" | " - separator between entries
Each iteration APPENDS - previous entries are preserved ✅
(Screenshot: complete False branch canvas - log_missing_file Set variable activity alone)
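The append behavior across a full sequential run can be simulated locally. This pure-Python sketch reproduces the log you will see later in the debug output (assuming ST006–ST010 are the missing stores):

```python
# Local simulation of the False-branch append across a sequential ForEach run.
run_date = "2024-01-15"
missing_files = ""  # pipeline variable starts empty

for store_id in ["ST006", "ST007", "ST008", "ST009", "ST010"]:
    # Mirrors: @{variables('missing_files')}@{item()} missing for @{...run_date} | 
    missing_files = f"{missing_files}{store_id} missing for {run_date} | "

final_log = f"Pipeline run complete. Missing files: {missing_files}"
print(final_log)
```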
Step 12 - Add Final Summary Log to Main Canvas
Click the back arrow until you are on the main pipeline canvas.
From the left panel → drag one more "Set variable" onto the main canvas (outside the ForEach)
Connect the green arrow from ForEach_stores → connect to this new Set variable
General tab:
Name: output_missing_files_log
Description: Final log of all missing files for this pipeline run
Variables tab:
Name: final_log (write to final_log, NOT missing_files)
Value: Pipeline run complete. Missing files: @{variables('missing_files')}
Why final_log and not missing_files? ADF throws a "self-reference" error if a Set Variable activity reads and writes the same variable. missing_files is appended to inside the ForEach; here we read it and write the result to final_log - two different variables, no error.
(Screenshot: output_missing_files_log - Name shows 'final_log', value reads from missing_files)
(Screenshot: main canvas - complete pipeline: set_run_date_folder → ForEach_stores → output_missing_files_log)
MAIN CANVAS:
[set_run_date_folder] ──▶ [ForEach_stores] ──▶ [output_missing_files_log]
INSIDE ForEach_stores:
[get_metadata_store_file] ──▶ [if_file_exists]
        │
        ├── TRUE:  [copy_to_adls_with_datestamp] ──▶ [delete_from_landing]
        │
        └── FALSE: [log_missing_file]
Step 13 - Validate
Click "Validate" in the top toolbar
Validation successful β no errors found
"Activity not found" β The If Condition expression uses
activity('get_metadata_store_file') β the name in quotes must exactly match the activity's General tab name. Case-sensitive."Variable is read-only inside parallel ForEach" β ForEach Sequential must be CHECKED ON.
"Delete activity dataset not configured" β True branch β Delete β Dataset tab β confirm
ds_delete_blob_landing is selected with both properties filled.Step 14 β Debug: File Exists Scenario
Click "Debug"
run_date: 2024-01-15
store_ids: ["ST001","ST002","ST003","ST004","ST005","ST006","ST007","ST008","ST009","ST010"]Debug parameter dialog β run_date 2024-01-15, full store_ids array
Click "OK" β pipeline runs sequentially through all 10 stores.
ST001βST005 go through TRUE branch. ST006βST010 go through FALSE branch.
Pipeline running β set_run_date_folder green, ForEach running with sequential progress
Pipeline completed β all activities green
Click π glasses icon on ForEach in Output tab:
ForEach iteration list β 10 rows, showing each store ID, which branch ran (TRUE/FALSE), and duration
Verify in ADLS:
raw/sales/date=2024-01-15/
βββ store_ST001_sales_20240115.csv β
βββ store_ST002_sales_20240115.csv β
βββ store_ST003_sales_20240115.csv β
βββ store_ST004_sales_20240115.csv β
βββ store_ST005_sales_20240115.csv β
Only 5 files β because only 5 existed in landing. Correct behavior.raw/sales/date=2024-01-15/ β exactly 5 files, matching the stores uploaded
Verify landing zone is cleaned:
landing/store_sales/date=2024-01-15/
(empty β all 5 processed files deleted) β
landing/store_sales/date=2024-01-15/ β empty folder, files deleted after copying
Check the missing files log β ADF Monitor β pipeline run β click output_missing_files_log β Output:
Pipeline run complete. Missing files: ST006 missing for 2024-01-15 | ST007 missing for 2024-01-15 | ST008 missing for 2024-01-15 | ST009 missing for 2024-01-15 | ST010 missing for 2024-01-15 |output_missing_files_log activity output β showing the missing stores listed in final_log value
Step 15 - Debug: All Files Present Scenario
Upload the remaining 5 store files (ST006–ST010) to landing/store_sales/date=2024-01-15/
(Screenshot: landing/store_sales/date=2024-01-15/ - all 10 files now uploaded)
Run Debug again with the same parameters. This time all 10 stores go through the TRUE branch.
(Screenshot: ForEach iteration list - all 10 rows showing the TRUE branch ran, all green)
(Screenshot: raw/sales/date=2024-01-15/ - all 10 files present)
(Screenshot: landing/store_sales/date=2024-01-15/ - empty, all files deleted)
(Screenshot: output_missing_files_log output - 'Pipeline run complete. Missing files: ' with an empty list, meaning none missing)
Step 16 - Publish
Click "Publish all"
Publishing:
pl_file_management_daily (new)
ds_delete_blob_landing (new)
(Screenshot: Publish panel - listing the new pipeline and dataset)
(Screenshot: successfully published notification)
What You Built - Summary
BEFORE:
Pipelines copied blindly - crashed if a file was missing
Files accumulated in the landing zone forever
No way to know which files were missing on any given day
No history - files overwritten daily
AFTER:
Pipeline checks file existence BEFORE attempting to copy
Landing zone cleaned automatically after each successful copy
Missing files logged - you know exactly what did not arrive
Files are date-stamped - full history preserved in ADLS
Pipeline never crashes on missing files - it handles them gracefully
New ADF Activities Learned
| Activity | Purpose | Key Output |
|---|---|---|
| Get Metadata | Read file/folder properties | .output.exists, .output.size, .output.lastModified |
| If Condition | Branch based on true/false | Runs True OR False activities |
| Delete | Remove a file from storage | File is permanently removed |
| Set Variable (append) | Build a running log | Concatenates text across iterations |
New Expressions Learned
| Expression | What It Does |
|---|---|
| @and(condition1, condition2) | True only when BOTH conditions are true |
| activity('name').output.fieldname | Read the output of a previous activity |
| greater(value, number) | True if value is greater than number |
| @{replace(string, 'find', 'replace')} | Replace text within a string |
| @{variables('name')} followed by new text | Append text to a variable (string concatenation) |
Key Concepts to Remember
| Concept | What It Is | Why It Matters |
|---|---|---|
| Get Metadata | Reads file properties without reading the file | Check existence before copying → prevent crashes |
| If Condition | Branches pipeline based on true/false | Handle missing files gracefully |
| Delete Activity | Permanently removes file from storage | Clean landing zone after successful processing |
| True branch | Activities that run when condition is true | The happy path |
| False branch | Activities that run when condition is false | The error handling path |
| Sequential ForEach | One iteration at a time | Required when writing to shared variables |
| Race condition | Two iterations updating same variable at once | Why parallel ForEach breaks variable updates |
| String concatenation | Appending text to a variable each iteration | Build running logs across loop iterations |
| activity('name').output | Read another activity's result | Core pattern for connecting activity results |
| @and() | Both conditions must be true | Safer than just checking exists alone |
Common Mistakes in This Project
| Mistake | Fix |
|---|---|
| Activity name in expression does not match actual name | The expression uses activity('get_metadata_store_file'); if the activity is named GetMetadata1 the expression fails. Names are case-sensitive. |
| ForEach set to Parallel when writing to a variable | Set ForEach Sequential = ON whenever activities inside write to pipeline variables. Parallel causes race conditions. |
| Delete activity runs even when Copy failed | Make sure Delete is connected to Copy with a success arrow (green), not an always arrow (grey). Click the arrow to verify it shows "On success". |
| Get Metadata field list not configured | Get Metadata → Field list tab → you must explicitly add "exists" as a field. Without this, output.exists returns null. |
| Activities placed in wrong branch | Click the Activities tab on the If Condition → verify Copy is in the True branch and log_missing_file is in the False branch. |
| Self-reference error on missing_files variable | Use a second variable (final_log) for the summary. missing_files is written inside ForEach, final_log reads it after. Never read and write the same variable in one Set Variable activity. |
Tier 1 Complete - What You Have Built So Far
PROJECT 01: Copy a single file - ADF basics, linked services, datasets
PROJECT 02: Copy multiple files with ForEach - loops, arrays, parallel execution
PROJECT 03: Date-parameterized pipeline + trigger - parameters, dynamic expressions, scheduling
PROJECT 04: Download from a public HTTPS URL - HTTP linked service, internet data sources
PROJECT 05: File management with validation - Get Metadata, If Condition, Delete, error handling
Key Takeaways
- ✅ Always use Get Metadata to check file existence before copying - prevents cryptic pipeline crashes
- ✅ If Condition branches your pipeline into a happy path (True) and an error path (False)
- ✅ Connect Delete to Copy with a success arrow - if Copy fails, Delete must NOT run
- ✅ Sequential ForEach is required when activities inside write to shared pipeline variables
- ✅ Use a separate variable for summaries - ADF blocks self-reference (read + write of the same variable in one step)
- ✅ Date-stamping files in ADLS preserves history - overwriting destroys it
- ✅ The landing zone is temporary - clean it after processing so it stays lean
What's Coming in Project 06 - Tier 2 Begins
So far, all data sources have been files - CSVs sitting somewhere waiting to be copied. In the real world, a huge portion of data comes from REST APIs - services you query with an HTTP request that return structured JSON.
In Project 06, FreshMart integrates with a live public REST API:
- Call a real API endpoint and receive a JSON response
- Extract data and land it in ADLS as a clean CSV
- Handle REST API pagination - when results come in pages
- API key authentication
- Parse nested JSON structures