
Project 05 — Organize Files Automatically With Date Stamps

Stop overwriting files silently. Build a pipeline that checks if a file exists before copying, date-stamps the output, cleans the landing zone automatically, and logs what was missing — a complete production file management workflow.

March 2026
Series:    Azure DE — Zero to Advanced
Project:   05 of 25
Level:     Beginner+
Time:      90–120 min
Builds on: Projects 01–04 — same resource group, same storage account, same ADF

🏒 Real World Problem

FreshMart's data lake is growing. After Projects 01–04, files are landing in ADLS — but new problems are showing up.

Problem 1 — Files get overwritten silently

Every morning, cities.csv is downloaded and saved to raw/external/cities/cities.csv. But what happened to yesterday's file? Overwritten. Gone. No history. Three months later the data team asks what the external price data looked like in January — and the answer is: we don't know.

Problem 2 — Nobody knows if the source file actually arrived

The pipeline runs at midnight. What if the supplier server was down? ADF will try to copy a file that doesn't exist and fail with a cryptic error. Better behavior: check first, skip gracefully if missing — don't crash the whole pipeline.

Problem 3 — The landing zone is filling up with processed files

Files land in landing/store_sales/ and stay there forever after being copied to raw/. The landing zone is a staging area, not permanent storage. Processed files should be deleted after a successful copy.

The solution — a production-grade file management pattern:

Step 1: Check if the file exists in the landing zone (Get Metadata + If Condition)
Step 2: If the file exists → copy it to ADLS with a date stamp in the name
Step 3: After a successful copy → delete the file from the landing zone
Step 4: If the file does not exist → log a warning and continue gracefully
📌 Real World Example
This is the same pattern production pipelines follow at companies like Flipkart, Zomato, and HDFC Bank.

🧠 Concepts You Must Understand First

What is the Get Metadata Activity?

Get Metadata reads information about a file or folder — without reading the file contents. Think of it like checking a package label before opening it.

Get Metadata can return:
  exists          → true or false
  size            → file size in bytes
  lastModified    → when the file was last changed
  itemName        → the file name
  itemType        → "File" or "Folder"
  childItems      → list of files inside a folder

What is the If Condition Activity?

ADF's decision maker. It evaluates a TRUE/FALSE expression and runs different activities for each outcome.

IF (file exists AND size > 0)
  THEN → Copy the file + Delete original
  ELSE → Log a warning message, skip this file

In ADF:
  If Condition Activity
    ├── Expression:   @and(activity('get_metadata_store_file').output.exists, greater(...size, 0))
    ├── True branch:  Copy Activity → Delete Activity
    └── False branch: Set Variable Activity (log "file not found")

What is the Delete Activity?

Deletes a file or folder from a storage location. After successfully copying a file from landing → ADLS, we use Delete to remove the original.

⚠️ Important
Delete only runs AFTER Copy succeeds. If the copy fails, Delete never runs and the original file is safe. Always connect Delete to Copy with a success arrow (green), not an always arrow (grey).

String Interpolation for Date-Stamped File Names

Building a dated file name
Original:   store_ST001_sales.csv
Stamped:    store_ST001_sales_20240115.csv

Expression:
@{replace(item(), '.csv', '')}_@{formatDateTime(pipeline().parameters.run_date, 'yyyyMMdd')}.csv

Breaking it down:
  @{replace(item(), '.csv', '')}       → removes .csv → "store_ST001_sales"
  _                                    → literal underscore
  @{formatDateTime(..., 'yyyyMMdd')}   → "20240115"
  .csv                                 → adds .csv back

Result: store_ST001_sales_20240115.csv
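For comparison, the same transformation written as a small Python function (the function name is illustrative; ADF evaluates its own expression language, this just mirrors the replace + formatDateTime steps):

```python
from datetime import datetime

def stamped_name(original, run_date):
    """Mirror of @{replace(item(), '.csv', '')}_@{formatDateTime(..., 'yyyyMMdd')}.csv"""
    stem = original.replace(".csv", "")                           # strip the extension
    stamp = datetime.strptime(run_date, "%Y-%m-%d").strftime("%Y%m%d")  # 2024-01-15 -> 20240115
    return f"{stem}_{stamp}.csv"                                  # re-attach .csv

# stamped_name("store_ST001_sales.csv", "2024-01-15") -> "store_ST001_sales_20240115.csv"
```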

What We Are Building — Visualized

PIPELINE: pl_file_management_daily

FOR EACH store file:
  │
  ├─► GET METADATA
  │   "Does store_ST001_sales_20240115.csv exist in landing?"
  │
  ├─► IF CONDITION
  │   "output.exists = true AND size > 0?"
  │
  ├─► TRUE BRANCH:
  │   │
  │   ├─► COPY ACTIVITY
  │   │   landing/store_sales/date=2024-01-15/store_ST001_sales_20240115.csv
  │   │   → raw/sales/date=2024-01-15/store_ST001_sales_20240115.csv
  │   │
  │   └─► DELETE ACTIVITY
  │       Delete: landing/store_sales/date=2024-01-15/store_ST001_sales_20240115.csv
  │
  └─► FALSE BRANCH:
      SET VARIABLE: missing_files += "ST001 missing for 2024-01-15 | "
      (pipeline continues — does not crash)

📋 Step by Step Overview

PHASE 1 — Prepare (10 min)
  Step 1:  Upload test files to landing zone

PHASE 2 — Create New Datasets (15 min)
  Step 2:  Confirm existing source dataset (reuse from Project 03)
  Step 3:  Create Delete activity dataset

PHASE 3 — Build the Pipeline (60 min)
  Step 4:  Create pipeline with parameters and variables
  Step 5:  Add Set Variable — build run_date_folder
  Step 6:  Add ForEach activity
  Step 7:  Inside ForEach — add Get Metadata activity
  Step 8:  Inside ForEach — add If Condition activity
  Step 9:  Inside True branch — add Copy activity
  Step 10: Inside True branch — add Delete activity
  Step 11: Inside False branch — add Set Variable (log missing files)
  Step 12: Add final summary log activity to main canvas
  Step 13: Validate
  Step 14: Debug — file exists scenario
  Step 15: Debug — all files present scenario
  Step 16: Publish

Phase 1 — Prepare

Step 1 — Upload Test Files to Landing Zone

We need files in the landing zone to test both scenarios — exists AND missing.

Azure Portal → Storage → stfreshmartdev → Containers → landing → store_sales/

Click "+ Add Directory"

Directory name: date=2024-01-15
📸 SCREENSHOT

Add Directory dialog — date=2024-01-15 entered

Click into date=2024-01-15 → click "Upload"

Upload only 5 of the 10 store files (ST001–ST005). This lets us test the missing scenario for ST006–ST010.

store_ST001_sales_20240115.csv
store_ST002_sales_20240115.csv
store_ST003_sales_20240115.csv
store_ST004_sales_20240115.csv
store_ST005_sales_20240115.csv
📸 SCREENSHOT

Upload dialog — 5 files selected, ready to upload

📸 SCREENSHOT

landing/store_sales/date=2024-01-15/ — showing exactly 5 files (ST001 through ST005)

Phase 2 — Create New Datasets

Step 2 — Reuse Existing Source Dataset

The Get Metadata activity will reuse ds_src_blob_dated_store_sales from Project 03 — it already has run_date_folder and file_name parameters. No new source dataset needed.

📸 SCREENSHOT

Author → Datasets — ds_src_blob_dated_store_sales already exists from Project 03

Step 3 — Create Delete Activity Dataset

In ADF Studio → Author → Datasets → "+" → "New dataset"

Search "Azure Blob Storage" → select → "Continue"

Select "DelimitedText" → "Continue"

Name:            ds_delete_blob_landing
Linked service:  ls_blob_freshmart_landing
File path:       (leave all empty)
First row as header: ✅ Yes
Import schema:   None

Click "OK" → "Parameters" tab → add TWO parameters:

Parameter 1:
  Name:    run_date_folder
  Type:    String

Parameter 2:
  Name:    file_name
  Type:    String
📸 SCREENSHOT

ds_delete_blob_landing Parameters tab — run_date_folder and file_name parameters

Click the "Connection" tab:

Container: landing

Directory → "Add dynamic content":

Directory
store_sales/@{dataset().run_date_folder}

File → "Add dynamic content":

File
@dataset().file_name
📸 SCREENSHOT

ds_delete_blob_landing Connection tab — landing/store_sales/@{dataset().run_date_folder}/@dataset().file_name

Click 💾 Save

Phase 3 — Build the Pipeline

Step 4 — Create Pipeline With Parameters and Variables

In ADF Studio → Author → "+" → "New pipeline"

Name:        pl_file_management_daily
Description: Checks file existence, copies with date stamp, deletes from landing zone

Click the empty canvas → "Parameters" tab → add TWO parameters:

Parameter 1:
  Name:    run_date
  Type:    String
  Default: 2024-01-15

Parameter 2:
  Name:    store_ids
  Type:    Array
  Default: ["ST001","ST002","ST003","ST004","ST005","ST006","ST007","ST008","ST009","ST010"]
📸 SCREENSHOT

Pipeline Parameters tab — run_date and store_ids parameters

Click the "Variables" tab → add THREE variables:

Variable 1:
  Name:    run_date_folder
  Type:    String

Variable 2:
  Name:    missing_files
  Type:    String

Variable 3:
  Name:    final_log
  Type:    String
⚠️ Important
Why three variables?
run_date_folder simply holds the formatted folder name, set once before the loop. The other two exist because ADF does not allow a Set Variable activity to read and write the same variable in one step — it calls this a "self-reference" and throws an error.

missing_files is written to inside the ForEach (appending each missing store).
final_log is written to after the ForEach (reads missing_files and builds the summary).
Two different variables = no self-reference = no error.
📸 SCREENSHOT

Variables tab — three variables: run_date_folder, missing_files, final_log

Step 5 — Add Set Variable: Build run_date_folder

From the left panel → "General" → drag "Set variable" onto the canvas

Click it → bottom panel:

General tab:
  Name:        set_run_date_folder
  Description: Formats run_date into Hive partition folder name

Click the "Variables" tab:

Name:   run_date_folder
Value:  (Add dynamic content)
Value expression
date=@{pipeline().parameters.run_date}
📸 SCREENSHOT

Set variable activity — name 'set_run_date_folder', value showing date=@{pipeline().parameters.run_date}

Step 6 — Add ForEach Activity

From the left panel → "Iteration & conditionals" → drag "ForEach" onto the canvas

Connect: hover over set_run_date_folder → drag the green arrow → connect to ForEach

📸 SCREENSHOT

Canvas — set_run_date_folder connected to ForEach with green success arrow

Click the ForEach → bottom panel:

General tab:
  Name:        ForEach_stores
  Description: Loops through each store ID to check and copy files

Settings tab:
  Sequential:   ☑ Checked   ← IMPORTANT
  Items:        @pipeline().parameters.store_ids
📸 SCREENSHOT

ForEach Settings tab — Sequential CHECKED, Items showing @pipeline().parameters.store_ids

⚠️ Important
Why Sequential and not Parallel?
Variables in ADF are shared across the pipeline. If two iterations run simultaneously and both try to update missing_files at the same time, one write overwrites the other — entries get lost. This is a race condition. Sequential solves it — each iteration waits its turn before writing.

Parallel risk:   Iterations 4 and 6 both find missing files simultaneously
                 Both try to write to missing_files → one write lost ❌

Sequential:      Iteration 4 runs → writes → finishes
                 Iteration 6 runs → writes → finishes
                 All updates preserved ✅
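The lost update can be simulated in a few lines of Python, no real threads required — the point is only that when two iterations read the variable before either writes back, one entry disappears (names and values are illustrative):

```python
def append_entry(current, store_id):
    # Same shape as the ADF append expression: old value + new entry
    return f"{current}{store_id} missing for 2024-01-15 | "

# Parallel-style: both iterations snapshot the variable BEFORE either writes back
missing_files = ""
snapshot_4 = missing_files          # iteration 4 reads ""
snapshot_6 = missing_files          # iteration 6 also reads ""
missing_files = append_entry(snapshot_4, "ST004")   # iteration 4 writes
missing_files = append_entry(snapshot_6, "ST006")   # iteration 6 overwrites: ST004 lost
parallel_result = missing_files

# Sequential: each iteration reads the value AFTER the previous write
missing_files = ""
for store_id in ("ST004", "ST006"):
    missing_files = append_entry(missing_files, store_id)
sequential_result = missing_files

# parallel_result   -> "ST006 missing for 2024-01-15 | "            (ST004 lost)
# sequential_result -> "ST004 missing for 2024-01-15 | ST006 missing for 2024-01-15 | "
```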

Step 7 — Inside ForEach: Add Get Metadata Activity

Click the "+" button inside the ForEach box to enter the inner canvas

📸 SCREENSHOT

ForEach box — '+' button inside, about to enter inner canvas

From the left panel → "General" → drag "Get Metadata" onto the inner canvas

📸 SCREENSHOT

Get Metadata activity placed on ForEach inner canvas

Click Get Metadata → bottom panel:

General tab:
  Name:        get_metadata_store_file
  Description: Checks if today's store file exists in landing zone

Click the "Dataset" tab:

Dataset:  ds_src_blob_dated_store_sales

run_date_folder → Add dynamic content:

run_date_folder
@variables('run_date_folder')

file_name → Add dynamic content:

file_name
store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
📸 SCREENSHOT

Dynamic content editor — full dated file name expression with @item() and formatDateTime

Click the "Field list" tab → click "+ New" twice:

Field 1:  exists
Field 2:  size
📸 SCREENSHOT

Field list tab — 'exists' and 'size' added as fields to retrieve

💡 Note
exists tells you if the file is there. size tells you if it is empty (0 bytes). In production you check both — a 0-byte file exists but contains no data.

Step 8 — Inside ForEach: Add If Condition Activity

From the left panel → "Iteration & conditionals" → drag "If Condition" onto the inner canvas

Connect the green arrow from get_metadata_store_file → connect to If Condition

📸 SCREENSHOT

Inner canvas — get_metadata_store_file connected to If Condition with green arrow

Click If Condition → bottom panel:

General tab:
  Name:        if_file_exists
  Description: Checks if file exists AND has content (size > 0)

Click the "Activities" tab → Expression field → "Add dynamic content":

If Condition expression
@and( activity('get_metadata_store_file').output.exists, greater(activity('get_metadata_store_file').output.size, 0) )
📸 SCREENSHOT

Dynamic content editor — the full @and() expression with exists and greater() checks

Breaking down the expression
@and( ... , ... )
  → Returns true only if BOTH conditions are true

activity('get_metadata_store_file').output.exists
  → true if the file exists, false if not
  → activity('name') reads the output of a previous activity by name

greater(activity('get_metadata_store_file').output.size, 0)
  → true if the file size is greater than 0 (not empty)

Combined:
  → true if the file EXISTS and is NOT EMPTY ✅
  → false if the file is missing OR is 0 bytes ❌
📸 SCREENSHOT

If Condition Activities tab — expression showing the @and() check with exists and size
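The expression is just a two-part boolean predicate. A Python sketch of the same check, applied to a dictionary shaped like Get Metadata's output (the helper name is made up; `.get` defaults stand in for fields that may be absent):

```python
def should_copy(metadata_output):
    """Mirror of @and(output.exists, greater(output.size, 0))."""
    return bool(metadata_output.get("exists")) and metadata_output.get("size", 0) > 0

# should_copy({"exists": True,  "size": 2048}) -> True   (file present, has data)
# should_copy({"exists": True,  "size": 0})    -> False  (0-byte file)
# should_copy({"exists": False})               -> False  (missing file)
```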

Step 9 — Inside True Branch: Add Copy Activity

Click the pencil icon next to "True" to enter the True branch canvas

📸 SCREENSHOT

If Condition Activities tab — True and False sections, pencil icon next to True highlighted

Drag "Copy data" onto the True branch canvas

📸 SCREENSHOT

Copy data activity placed on the True branch canvas

General tab:
  Name:        copy_to_adls_with_datestamp
  Description: Copies store file from landing to ADLS raw/sales/ with date stamp

Source tab:

Source dataset:  ds_src_blob_dated_store_sales

run_date_folder → @variables('run_date_folder')

file_name →

file_name
store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
📸 SCREENSHOT

Source tab — both dataset properties filled, file_name showing the dated expression

Sink tab:

Sink dataset:  ds_sink_adls_dated_sales

run_date_folder → @variables('run_date_folder')

file_name → same dated expression as Source

file_name
store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
📸 SCREENSHOT

Sink tab — both dataset properties filled matching source

Step 10 — Inside True Branch: Add Delete Activity

Still on the True branch canvas → from the left panel → "General" → drag "Delete"

Connect the green arrow from copy_to_adls_with_datestamp → connect to Delete

📸 SCREENSHOT

True branch canvas — Copy activity connected to Delete activity with green success arrow

General tab:
  Name:        delete_from_landing
  Description: Removes processed file from landing zone after successful copy

Click the "Dataset" tab:

Dataset:  ds_delete_blob_landing

run_date_folder → @variables('run_date_folder')

file_name → dated file name expression

file_name
store_@{item()}_sales_@{formatDateTime(pipeline().parameters.run_date,'yyyyMMdd')}.csv
📸 SCREENSHOT

Delete activity Dataset tab — ds_delete_blob_landing selected, both properties filled

Click the "Logging settings" tab (optional but recommended):

Enable logging:    ✅ Yes
Linked service:    ls_adls_freshmart
Log folder path:   logs/delete_activity/
📸 SCREENSHOT

Delete activity Logging tab — enabled, log path set to logs/delete_activity/

🎯 Pro Tip
Enable logging for Delete. Deletion is irreversible — if a file gets deleted that shouldn't have been, the log tells you exactly what was deleted, when, and by which run. Your safety net.
📸 SCREENSHOT

Complete True branch canvas — Copy → Delete connected with green arrow

Step 11 — Inside False Branch: Add Set Variable

Click the back arrow → return to the If Condition Activities tab → click the pencil next to "False"

📸 SCREENSHOT

If Condition — False section pencil icon highlighted

Drag "Set variable" onto the False branch canvas

📸 SCREENSHOT

Set variable activity placed on False branch canvas

General tab:
  Name:        log_missing_file
  Description: Appends missing store ID to missing_files variable for monitoring

Variables tab:

Name:   missing_files
Value:  (Add dynamic content)
Append expression
@{variables('missing_files')}@{item()} missing for @{pipeline().parameters.run_date} |

→ After ST006 and ST007 are missing: "ST006 missing for 2024-01-15 | ST007 missing for 2024-01-15 | "

📸 SCREENSHOT

Set variable — Variables tab with the append expression for missing_files

How string concatenation works here
@{variables('missing_files')}        → current value (whatever was already logged)
@{item()}                            → current store ID, e.g. "ST006"
" missing for "                      → literal text
@{pipeline().parameters.run_date}    → "2024-01-15"
" | "                                → separator between entries

Each iteration APPENDS — previous entries are preserved ✅
📸 SCREENSHOT

Complete False branch canvas — log_missing_file Set variable activity alone
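The append pattern in plain Python, mirroring the expression one loop pass at a time (store IDs and date are the ST006/ST007 example values):

```python
missing_files = ""
run_date = "2024-01-15"

for store_id in ("ST006", "ST007"):
    # Same shape as the ADF value:
    # @{variables('missing_files')}@{item()} missing for @{pipeline().parameters.run_date} |
    missing_files = f"{missing_files}{store_id} missing for {run_date} | "

# missing_files -> "ST006 missing for 2024-01-15 | ST007 missing for 2024-01-15 | "
```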

Step 12 — Add Final Summary Log to Main Canvas

Click the back arrow until you are on the main pipeline canvas.

From the left panel → drag one more "Set variable" onto the main canvas (outside the ForEach)

Connect the green arrow from ForEach_stores → connect to this new Set variable

General tab:
  Name:        output_missing_files_log
  Description: Final log of all missing files for this pipeline run

Variables tab:

Name:   final_log       ← write to final_log, NOT missing_files
Value:  Pipeline run complete. Missing files: @{variables('missing_files')}
⚠️ Important
Write to final_log, not missing_files.
ADF throws a "self-reference" error if a Set Variable activity reads and writes the same variable. missing_files is appended to inside the ForEach. Here we read it and write the result to final_log — two different variables, no error.
📸 SCREENSHOT

output_missing_files_log — Name shows 'final_log', value reads from missing_files

📸 SCREENSHOT

Main canvas — complete pipeline: set_run_date_folder → ForEach_stores → output_missing_files_log

Full pipeline structure
MAIN CANVAS:
  [set_run_date_folder] ──► [ForEach_stores] ──► [output_missing_files_log]

INSIDE ForEach_stores:
  [get_metadata_store_file] ──► [if_file_exists]
                                    │
                                    ├── TRUE:  [copy_to_adls_with_datestamp] ──► [delete_from_landing]
                                    │
                                    └── FALSE: [log_missing_file]

Step 13 — Validate

Click "Validate" in the top toolbar

📸 SCREENSHOT

Validation successful — no errors found

⚠️ Important
Common validation errors:

"Activity not found" → The If Condition expression uses activity('get_metadata_store_file') — the name in quotes must exactly match the activity's General tab name. Case-sensitive.

"Variable is read-only inside parallel ForEach" → ForEach Sequential must be CHECKED ON.

"Delete activity dataset not configured" → True branch → Delete → Dataset tab → confirm ds_delete_blob_landing is selected with both properties filled.

Step 14 — Debug: File Exists Scenario

Click "Debug"

run_date:   2024-01-15
store_ids:  ["ST001","ST002","ST003","ST004","ST005","ST006","ST007","ST008","ST009","ST010"]
📸 SCREENSHOT

Debug parameter dialog — run_date 2024-01-15, full store_ids array

Click "OK" — the pipeline runs sequentially through all 10 stores.

ST001–ST005 go through the TRUE branch. ST006–ST010 go through the FALSE branch.

📸 SCREENSHOT

Pipeline running — set_run_date_folder green, ForEach running with sequential progress

📸 SCREENSHOT

Pipeline completed — all activities green

Click the 👓 glasses icon on the ForEach in the Output tab:

📸 SCREENSHOT

ForEach iteration list — 10 rows, showing each store ID, which branch ran (TRUE/FALSE), and duration

Verify in ADLS:

raw/sales/date=2024-01-15/
  ├── store_ST001_sales_20240115.csv  ✅
  ├── store_ST002_sales_20240115.csv  ✅
  ├── store_ST003_sales_20240115.csv  ✅
  ├── store_ST004_sales_20240115.csv  ✅
  └── store_ST005_sales_20240115.csv  ✅

Only 5 files — because only 5 existed in landing. Correct behavior.
📸 SCREENSHOT

raw/sales/date=2024-01-15/ — exactly 5 files, matching the stores uploaded

Verify the landing zone is cleaned:

landing/store_sales/date=2024-01-15/
  (empty — all 5 processed files deleted) ✅
📸 SCREENSHOT

landing/store_sales/date=2024-01-15/ — empty folder, files deleted after copying

Check the missing files log — ADF Monitor → pipeline run → click output_missing_files_log → Output:

Pipeline run complete. Missing files: ST006 missing for 2024-01-15 | ST007 missing for 2024-01-15 | ST008 missing for 2024-01-15 | ST009 missing for 2024-01-15 | ST010 missing for 2024-01-15 |
📸 SCREENSHOT

output_missing_files_log activity output — showing the missing stores listed in final_log value
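Because entries are separated by " | ", this log is easy to parse downstream, for example to drive an alert. A hedged Python sketch (the helper name and sample log value here are illustrative, not part of the pipeline):

```python
log_value = ("Pipeline run complete. Missing files: "
             "ST006 missing for 2024-01-15 | ST007 missing for 2024-01-15 | ")

def missing_store_ids(log):
    """Recover store IDs from the ' | '-separated missing-files log."""
    _, _, tail = log.partition("Missing files: ")      # drop the fixed prefix
    entries = [e.strip() for e in tail.split("|") if e.strip()]
    return [e.split()[0] for e in entries]             # first token is the store ID

# missing_store_ids(log_value) -> ["ST006", "ST007"]
```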

Step 15 — Debug: All Files Present Scenario

Upload the remaining 5 store files (ST006–ST010) to landing/store_sales/date=2024-01-15/

📸 SCREENSHOT

landing/store_sales/date=2024-01-15/ — all 10 files now uploaded

Run Debug again with the same parameters. This time all 10 stores go through the TRUE branch.

📸 SCREENSHOT

ForEach iteration list — all 10 rows showing TRUE branch ran, all green

📸 SCREENSHOT

raw/sales/date=2024-01-15/ — all 10 files present

📸 SCREENSHOT

landing/store_sales/date=2024-01-15/ — empty, all files deleted

📸 SCREENSHOT

output_missing_files_log output — 'Pipeline run complete. Missing files: ' (empty — none missing)

Step 16 — Publish

Click "Publish all"

Publishing:
  pl_file_management_daily   (new)
  ds_delete_blob_landing     (new)
📸 SCREENSHOT

Publish panel — listing new pipeline and dataset

📸 SCREENSHOT

Successfully published notification

🎯 What You Built — Summary

BEFORE:
  Pipelines copied blindly — crashed if a file was missing
  Files accumulated in the landing zone forever
  No way to know which files were missing on any given day
  No history — files overwritten daily

AFTER:
  Pipeline checks file existence BEFORE attempting to copy
  Landing zone cleaned automatically after successful copy
  Missing files logged — you know exactly what did not arrive
  Files are date-stamped — full history preserved in ADLS
  Pipeline never crashes on missing files — handles them gracefully

New ADF Activities Learned

Activity              | Purpose                     | Key Output
Get Metadata          | Read file/folder properties | .output.exists, .output.size, .output.lastModified
If Condition          | Branch based on true/false  | Runs True OR False activities
Delete                | Remove a file from storage  | File is permanently removed
Set Variable (append) | Build a running log         | Concatenates text across iterations

New Expressions Learned

Expression                            | What It Does
@and(condition1, condition2)          | True only when BOTH conditions are true
activity('name').output.fieldname     | Read the output of a previous activity
greater(value, number)                | True if value is greater than number
@{replace(string, 'find', 'replace')} | Replace text within a string
@{variables('name')}@{item()}         | Append text to a variable (existing value + new text)

🧠 Key Concepts to Remember

Concept                 | What It Is                                        | Why It Matters
Get Metadata            | Reads file properties without reading the file    | Check existence before copying — prevent crashes
If Condition            | Branches the pipeline based on true/false         | Handle missing files gracefully
Delete Activity         | Permanently removes a file from storage           | Clean the landing zone after successful processing
True branch             | Activities that run when the condition is true    | The happy path
False branch            | Activities that run when the condition is false   | The error handling path
Sequential ForEach      | One iteration at a time                           | Required when writing to shared variables
Race condition          | Two iterations updating the same variable at once | Why parallel ForEach breaks variable updates
String concatenation    | Appending text to a variable each iteration       | Build running logs across loop iterations
activity('name').output | Read another activity's result                    | Core pattern for connecting activity results
@and()                  | Both conditions must be true                      | Safer than checking exists alone

⚠️ Common Mistakes in This Project

Mistake: Activity name in expression does not match actual name
Fix:     The expression uses activity('get_metadata_store_file') — if the activity is named GetMetadata1 the expression fails. Names are case-sensitive.

Mistake: ForEach set to Parallel when writing to a variable
Fix:     Set ForEach Sequential = ON whenever activities inside write to pipeline variables. Parallel causes race conditions.

Mistake: Delete activity runs even when Copy failed
Fix:     Make sure Delete is connected to Copy with a success arrow (green), not an always arrow (grey). Click the arrow to verify it shows "On success".

Mistake: Get Metadata field list not configured
Fix:     Get Metadata → Field list tab → must explicitly add "exists" as a field. Without this, output.exists returns null.

Mistake: Activities placed in the wrong branch
Fix:     Click the Activities tab on the If Condition → verify Copy is in the True branch and log_missing_file is in the False branch.

Mistake: Self-reference error on the missing_files variable
Fix:     Use a second variable (final_log) for the summary. missing_files is written inside the ForEach; final_log reads it after. Never read and write the same variable in one Set Variable activity.

πŸ† Tier 1 Complete β€” What You Have Built So Far

PROJECT 01:  Copy a single file                     β†’ ADF basics, linked services, datasets
PROJECT 02:  Copy multiple files with ForEach       β†’ Loops, arrays, parallel execution
PROJECT 03:  Date-parameterized pipeline + trigger  β†’ Parameters, dynamic expressions, scheduling
PROJECT 04:  Download from public HTTPS URL         β†’ HTTP linked service, internet data sources
PROJECT 05:  File management with validation        β†’ Get Metadata, If Condition, Delete, error handling
🎯 Pro Tip
You now understand the complete ADF activity toolkit for file-based pipelines. Every data engineer at every company uses these exact patterns.

🎯 Key Takeaways

  • βœ“Always use Get Metadata to check file existence before copying β€” prevents cryptic pipeline crashes
  • βœ“If Condition branches your pipeline into happy path (True) and error path (False)
  • βœ“Connect Delete to Copy with a success arrow β€” if Copy fails, Delete must NOT run
  • βœ“Sequential ForEach is required when activities inside write to shared pipeline variables
  • βœ“Use a separate variable for summaries β€” ADF blocks self-reference (read + write same variable in one step)
  • βœ“Date-stamping files in ADLS preserves history β€” overwriting destroys it
  • βœ“The landing zone is temporary β€” clean it after processing so it stays lean

πŸš€ What's Coming in Project 06 β€” Tier 2 Begins

So far, all data sources have been files β€” CSVs sitting somewhere waiting to be copied. In the real world, a huge portion of data comes from REST APIs β€” services you query with an HTTP request and get back structured JSON.

In Project 06, FreshMart integrates with a live public REST API:

  • Call a real API endpoint and receive a JSON response
  • Extract data and land it in ADLS as a clean CSV
  • Handle REST API pagination β€” when results come in pages
  • API key authentication
  • Parse nested JSON structures