Project 01 — Copy a CSV File to Azure Data Lake
Build your first Azure data pipeline from scratch. Copy a CSV file from a landing zone into ADLS Gen2 using Azure Data Factory — the foundational pattern behind every data engineering pipeline in Azure.
What you will build: a pipeline that automatically copies a CSV file from your local computer into Azure cloud storage.
Real World Problem
The Company: FreshMart — a grocery chain with 10 stores across India.
Every day, each store manager exports a file called daily_sales.csv from their billing software and saves it on their computer. That file just sits there. The data team at FreshMart HQ has zero visibility into what is happening across stores. They cannot answer basic questions like:
- Which store sold the most today?
- Which product is running out of stock?
- Is store revenue growing or shrinking week over week?
The root problem: Data is trapped on individual laptops. There is no central place to store it, no way to query it, no way to analyse it.
What we are going to do: Take that CSV file sitting on a laptop → upload it to Azure cloud storage automatically using a pipeline. This is the very first step of every data engineering project in the world — get the data off the source and into a central location.
Simple. One file. One destination. One pipeline. But this exact pattern is the foundation of every data engineering project you will ever build.
Concepts You Must Understand First
Before touching Azure, read this section completely — it will make every step feel logical instead of confusing.
What is Azure?
Azure is Microsoft's cloud platform. Instead of buying physical servers and hard drives — you rent them from Microsoft and pay only for what you use.
Think of it like electricity. You do not build a power plant to get electricity at home. You plug into the grid and pay a monthly bill. Azure is the same — you plug into Microsoft's infrastructure and pay for what you consume. For data engineering, Azure gives you cloud storage, compute, pipelines, and dashboards.
What is Azure Data Factory (ADF)?
ADF is a data pipeline orchestration tool. It answers the question: how do I move data from A to B automatically?
Think of ADF like a courier service. You tell it: pick up from this location (source), deliver to that location (destination), run every day at midnight, and alert me if delivery fails. That is exactly what ADF does with data.
ADF has four building blocks:
- Linked Service — the connection details to reach a data source; like saving a contact in your phone (name + number).
- Dataset — points to the specific file or table you want; like telling the courier: pick up the red box.
- Activity — one single action (copy, transform, run code); like one step in the delivery process.
- Pipeline — a container that holds one or more activities; the complete delivery workflow from start to finish.
In this project we will create 2 Linked Services, 2 Datasets, 1 Activity, and 1 Pipeline.
What is ADLS Gen2?
ADLS stands for Azure Data Lake Storage Generation 2. It is cloud storage optimized for data analytics workloads — think of it as a massive intelligent hard drive in the cloud that can store any file type, never runs out of space, and is directly connected to every Azure analytics service.
Step by Step Overview
1. Create an Azure free account
2. Create a Resource Group
3. Create the CSV file on your computer
4. Create ADLS Gen2 storage account
5. Create container and folder structure
6. Create Azure Data Factory
7. Open ADF Studio
8. Create Linked Service — Blob (source)
9. Create Linked Service — ADLS Gen2 (destination)
10. Create Dataset — source CSV
11. Create Dataset — destination ADLS
12. Create Pipeline + Copy Activity
13. Configure Source and Sink
14. Debug (test run)
15. Publish
16. Verify file in ADLS
Step 1 — Create an Azure Account
Go to https://azure.microsoft.com/en-in/free and click "Start free".
You will get $200 free credit valid for 30 days, 12 months of popular services free, and always-free services including limited ADF and storage. You will need a Microsoft account, a phone number for verification, and a credit card for identity verification only — you will NOT be charged for free tier usage.
Azure free account signup page — showing the 'Start free' button and the $200 credit offer
After signing up → go to https://portal.azure.com. You will land on the Azure Portal homepage — your control centre for everything Azure.
Azure Portal homepage after first login — showing the top search bar, left sidebar, and the main dashboard area
What you are looking at
- Top search bar — search for any Azure service by name; use this constantly
- Left sidebar — your recently used services
- Main area — your dashboard; empty now, fills up as you create resources
- Top right — your account name, subscription, notifications
Step 2 — Create a Resource Group
A Resource Group is a logical folder that holds all Azure resources belonging to one project. Instead of having ADF, ADLS, and Databricks scattered randomly — you put them all in one Resource Group called rg-freshmart-dev.
Why use a Resource Group?
- See all costs in one place: how much is this entire project costing me?
- Delete everything at once: done with the project? Delete the group and everything inside is gone.
- Set permissions once: give a teammate access to the group and they get access to everything inside.
In the Azure Portal search bar → type "Resource groups" → click it.
Azure Portal — typing 'Resource groups' in the search bar, showing the suggestion dropdown
Click "+ Create" (top left button).
Resource groups page — showing the '+ Create' button at top left
Fill in the form:
- Subscription: your free-trial subscription
- Resource group name: rg-freshmart-dev
- Region: East US 2
Naming convention: rg = resource group, freshmart = project name, dev = environment. Professional teams always follow naming conventions — get in the habit now; all 25 projects will use this same pattern.
Resource group creation form — all three fields filled in exactly as shown above
Click "Review + Create" → then "Create". Wait 5 seconds — you will see a notification: "Resource group created".
Resource group successfully created — showing the overview page with name 'rg-freshmart-dev' and region 'East US 2'
Step 3 — Create the Sample CSV File
This represents the daily_sales.csv that FreshMart store managers export from their billing software every day. Open Notepad (Windows) or TextEdit (Mac) and paste exactly this:
```csv
order_id,store_id,product_name,category,quantity,unit_price,order_date
ORD001,ST001,Basmati Rice 5kg,Grocery,3,299.00,2024-01-15
ORD002,ST001,Sunflower Oil 1L,Grocery,5,145.00,2024-01-15
ORD003,ST001,Samsung TV 43inch,Electronics,1,32000.00,2024-01-15
ORD004,ST002,Amul Butter 500g,Dairy,8,240.00,2024-01-15
ORD005,ST002,Basmati Rice 5kg,Grocery,2,299.00,2024-01-15
ORD006,ST002,Colgate Toothpaste,Personal Care,10,89.00,2024-01-15
ORD007,ST003,Lays Chips Family Pack,Snacks,15,99.00,2024-01-15
ORD008,ST003,Sony Headphones,Electronics,2,4500.00,2024-01-15
ORD009,ST003,Amul Milk 1L,Dairy,20,62.00,2024-01-15
ORD010,ST001,Dove Soap 100g,Personal Care,6,65.00,2024-01-15
```
Save the file as daily_sales.csv somewhere easy to find — your Desktop is fine.
Notepad with the CSV content pasted in — showing the file before saving
Save As dialog — showing filename 'daily_sales.csv' being saved to Desktop
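If you prefer scripting to Notepad, here is a small optional Python sketch that writes the same file. The output filename and location are your choice — this version writes to the current working directory.

```python
# Optional alternative to Notepad: generate daily_sales.csv with Python.
# Writes the same 10 sample FreshMart rows used throughout this project.
import csv

HEADER = ["order_id", "store_id", "product_name", "category",
          "quantity", "unit_price", "order_date"]

ROWS = [
    ["ORD001", "ST001", "Basmati Rice 5kg", "Grocery", 3, "299.00", "2024-01-15"],
    ["ORD002", "ST001", "Sunflower Oil 1L", "Grocery", 5, "145.00", "2024-01-15"],
    ["ORD003", "ST001", "Samsung TV 43inch", "Electronics", 1, "32000.00", "2024-01-15"],
    ["ORD004", "ST002", "Amul Butter 500g", "Dairy", 8, "240.00", "2024-01-15"],
    ["ORD005", "ST002", "Basmati Rice 5kg", "Grocery", 2, "299.00", "2024-01-15"],
    ["ORD006", "ST002", "Colgate Toothpaste", "Personal Care", 10, "89.00", "2024-01-15"],
    ["ORD007", "ST003", "Lays Chips Family Pack", "Snacks", 15, "99.00", "2024-01-15"],
    ["ORD008", "ST003", "Sony Headphones", "Electronics", 2, "4500.00", "2024-01-15"],
    ["ORD009", "ST003", "Amul Milk 1L", "Dairy", 20, "62.00", "2024-01-15"],
    ["ORD010", "ST001", "Dove Soap 100g", "Personal Care", 6, "65.00", "2024-01-15"],
]

# newline="" prevents the csv module from writing blank lines on Windows
with open("daily_sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(HEADER)
    writer.writerows(ROWS)
```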
Step 4 — Create ADLS Gen2 Storage Account
In the Azure Portal search bar → type "Storage accounts" → click it → click "+ Create".
Storage accounts page — showing the '+ Create' button at top left
Basics Tab
Fill in: Subscription (your free subscription), Resource group rg-freshmart-dev, Storage account name stfreshmartdev, Region East US 2. Leave the remaining options at their defaults.
Storage account creation — Basics tab completely filled in with all values as shown above
Advanced Tab — The Most Important Checkbox
Click the "Advanced" tab → find the section called "Data Lake Storage Gen2" → enable Hierarchical namespace.
Advanced tab — showing the 'Hierarchical namespace' checkbox being checked under the Data Lake Storage Gen2 section
Leave all other tabs as default. Click "Review" → "Create". Deployment takes about 30–60 seconds.
Deployment complete — showing 'Your deployment is complete' with the resource name stfreshmartdev
Click "Go to resource".
Storage account overview page — highlighting the 'Azure Data Lake Storage Gen2' label and the name stfreshmartdev
Step 5 — Create Container and Folder Structure
A container is like a top-level folder inside a storage account. We will create one container called raw — this is where all raw, unprocessed data lands. In later projects we will also have processed and curated containers following the Medallion Architecture.
On the storage account page → left sidebar → click "Containers" → "+ Container".
New container dialog — name 'raw' entered, Private selected
Container 'raw' now visible in the containers list
Click on the "raw" container → click "+ Add Directory" → name it sales.
raw container showing the 'sales' directory created inside it
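For reference, the fully qualified path of the file we are about to land here, written in ADLS Gen2's abfss:// URI scheme (the form later used by tools such as Databricks and Synapse):

```text
abfss://<container>@<storage-account>.dfs.core.windows.net/<directory>/<file>
abfss://raw@stfreshmartdev.dfs.core.windows.net/sales/daily_sales.csv
```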
Tip: as FreshMart grows, we will add sibling directories like raw/products/, raw/customers/, and raw/inventory/. Starting with a clean hierarchy now saves massive headaches later.
Step 6 — Create Azure Data Factory
In the Azure Portal search bar → type "Data factories" → click it → click "+ Create".
Data Factory creation form — all fields filled in as shown above
Click the "Git configuration" tab → check "Configure Git later". Git integration is important for production but adds complexity for beginners — we will cover it in a later project.
Git configuration tab — 'Configure Git later' checkbox checked
Click "Review + create" → "Create". Deployment takes 1–2 minutes.
ADF deployment complete — showing 'adf-freshmart-dev' successfully created
Step 7 — Open ADF Studio
Click "Go to resource" → on the ADF overview page → click "Launch studio". ADF Studio opens in a new browser tab — this is where you will spend 90% of your time.
ADF overview page — showing the 'Launch studio' button in the centre
ADF Studio homepage — label each section: (1) Left sidebar icons, (2) Main canvas area, (3) Top toolbar
ADF Studio Layout
The left sidebar icons are the ones you will use constantly: Home, Author (pencil icon — build pipelines and datasets), Monitor (bar chart icon — view run history), and Manage (toolbox icon — linked services and triggers).
Step 8 — Create Linked Service for Source (Blob Storage)
ADF lives in the cloud and cannot directly reach into your laptop. The solution: we first upload the CSV to a landing container in the same storage account, then ADF copies it from there to the raw/sales/ destination.
Think of it like a courier — the courier cannot teleport to your home. You drop the package at a collection point, then the courier picks it up and delivers it.
Upload the CSV to a landing container
Go to Azure Portal → Storage account stfreshmartdev → Containers → "+ Container".
Creating 'landing' container — name and private access level set
Click on the "landing" container → click "Upload" → select your daily_sales.csv from your Desktop → click "Upload".
landing container after upload — daily_sales.csv visible with file size and last modified date
Create the Linked Service
Go back to ADF Studio → click "Manage" (toolbox icon in left sidebar) → click "Linked services" → click "+ New".
Linked services page — empty list and '+ New' button visible
In the search box → type "Azure Blob" → select "Azure Blob Storage" → click "Continue".
New linked service panel — 'Azure Blob Storage' selected in search results
Click "Test connection" at the bottom. You should see a green ✅ "Connection successful".
Green 'Connection successful' message at the bottom of the linked service form
Click "Create".
Linked services list — ls_blob_freshmart_landing now visible
Step 9 — Create Linked Service for Destination (ADLS Gen2)
Still in Manage → Linked services → click "+ New" → search "Azure Data Lake Storage Gen2" → select it → "Continue".
New linked service — 'Azure Data Lake Storage Gen2' selected
Name it ls_adls_freshmart and point it at the same storage account, stfreshmartdev. We still create two separate linked services because one represents Blob Storage behaviour (the landing container) and one represents ADLS Gen2 behaviour (the raw container, with hierarchical namespace). In real projects these would typically be completely different accounts.
Click "Test connection" → ✅ green → click "Create".
Linked services list — now showing both: ls_blob_freshmart_landing and ls_adls_freshmart
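Everything you just clicked together is stored by ADF as JSON (visible via the code view or an ARM template export). A rough sketch of the blob linked service — the connection string is a placeholder and exact property names vary by authentication method, so treat this as illustrative:

```json
{
  "name": "ls_blob_freshmart_landing",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=stfreshmartdev;AccountKey=<key>"
    }
  }
}
```

The ADLS Gen2 linked service looks similar but uses the type "AzureBlobFS" and a URL of the form https://stfreshmartdev.dfs.core.windows.net.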
Step 10 — Create Source Dataset
A Dataset tells ADF which specific file to work with. The Linked Service is how to connect to the storage — the Dataset is what file specifically to read or write.
Click "Author" (pencil icon in left sidebar) → click "+" next to "Datasets" → "New dataset".
Author tab — showing Datasets section with the '+' button highlighted
Search "Azure Blob Storage" → select it → "Continue". Select format: "DelimitedText" (this means CSV) → "Continue".
Format selection — DelimitedText/CSV selected
Dataset properties form — all fields filled in exactly as shown
Click "OK" → then click the "Preview data" tab at the bottom. You should see all 10 rows of FreshMart sales data — this confirms ADF can read your file.
Dataset preview — showing all 10 rows of CSV data in a clean table format
Click 💾 Save (or Ctrl+S).
Dataset saved — ds_src_blob_daily_sales visible in the Datasets list on the left
Step 11 — Create Destination Dataset
Click "+" next to "Datasets" → "New dataset" → search "Azure Data Lake Storage Gen2" → select it → "Continue". Select format: "DelimitedText" → "Continue".
Destination dataset form — all fields showing raw/sales/daily_sales.csv path
Click "OK" → 💾 Save.
Datasets list — now showing both ds_src_blob_daily_sales and ds_sink_adls_raw_sales
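As with linked services, each dataset is JSON under the hood. A simplified sketch of the source dataset — the sink dataset mirrors it with an ADLS Gen2 location pointing at raw/sales/ — with property names to be treated as approximate:

```json
{
  "name": "ds_src_blob_daily_sales",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "ls_blob_freshmart_landing",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "landing",
        "fileName": "daily_sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```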
Step 12 — Create the Pipeline
Click "+" next to "Pipelines" → "New pipeline". A blank canvas opens.
Author tab — Pipelines section with '+' button highlighted
In the Properties panel on the right, set the name to pl_copy_daily_sales_csv and add a short description.
Pipeline canvas — Properties panel on right showing the name and description filled in
Empty pipeline canvas — labelling each area: top toolbar, left activities panel, centre canvas, bottom properties panel
Step 13 — Add and Configure Copy Activity
The Copy Activity does one thing: read data from a source dataset and write it to a sink (destination) dataset. Source = where data comes FROM. Sink = where data goes TO. (Sink is standard data engineering terminology — like a kitchen sink where water flows into.)
In the left activities panel → expand "Move & transform" → drag "Copy data" onto the canvas.
Dragging 'Copy data' activity from the left panel onto the canvas
Copy data activity placed on the canvas — a blue box with 'Copy data' label
Click the Copy activity box to select it. The bottom panel shows configuration tabs. Set the following in each tab:
General Tab
General tab — activity name and description filled in
Source Tab
Source tab — ds_src_blob_daily_sales selected as source dataset
Sink Tab
Sink tab — ds_sink_adls_raw_sales selected as sink dataset
Mapping Tab
Click "Mapping" → click "Import schemas". ADF will automatically detect all columns and map them 1:1.
Mapping tab — all columns auto-mapped with arrows between source and destination
Click 💾 Save.
Saved pipeline — pl_copy_daily_sales_csv visible in the Pipelines list on the left
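The whole pipeline, too, is just JSON behind the canvas. A simplified sketch — the activity name copy_daily_sales is illustrative (yours is whatever you typed in the General tab), and exact properties may differ slightly by ADF version:

```json
{
  "name": "pl_copy_daily_sales_csv",
  "properties": {
    "activities": [
      {
        "name": "copy_daily_sales",
        "type": "Copy",
        "inputs": [
          { "referenceName": "ds_src_blob_daily_sales", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "ds_sink_adls_raw_sales", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```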
Step 14 — Debug (Test Run)
Before scheduling or publishing anything, always run a Debug first. Debug runs the pipeline immediately using your current draft — no effect on production.
Click "Debug" in the top toolbar. No parameters for this pipeline — click "OK".
Top toolbar — Debug button highlighted with cursor pointing to it
Watch the canvas. The Copy activity will show:
Pipeline running — Copy activity showing yellow/spinning status
Pipeline succeeded — Copy activity showing green checkmark
At the bottom → "Output" tab → click the 👓 glasses icon next to the run to see details.
Copy activity run details — showing files read: 1, files written: 1, data read and written amounts
Step 15 — Publish
In ADF, everything you build exists as a draft until published. Debug runs work on drafts. But triggers and scheduled runs only use published pipelines.
Click "Publish all" in the top toolbar. A panel shows everything that will be published — all 5 items we created. Click "Publish".
Publish panel — showing all 5 items: pipeline, 2 datasets, 2 linked services
'Successfully published' notification in the top right corner
Step 16 — Verify File in ADLS
Go to Azure Portal → Storage accounts → stfreshmartdev → Containers → raw → sales.
sales folder contents — showing daily_sales.csv file with file size and last modified timestamp
Click on daily_sales.csv → click "Edit" to preview its contents.
File preview in Azure Portal — showing all 10 rows of FreshMart sales data confirming the copy was successful
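With the data finally in a central location, the data team can start answering the questions from the problem statement. A minimal local sketch — it assumes a copy of daily_sales.csv on disk; in later projects this kind of aggregation moves into Databricks or Synapse:

```python
# Sketch: total revenue per store from one day's daily_sales.csv.
# Answers "which store sold the most today?" from the problem statement.
import csv
from collections import defaultdict

def revenue_by_store(csv_path):
    """Return {store_id: total revenue} for one day's sales file."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["store_id"]] += int(row["quantity"]) * float(row["unit_price"])
    return dict(totals)

# Example usage (once daily_sales.csv is present):
#   totals = revenue_by_store("daily_sales.csv")
#   best = max(totals, key=totals.get)  # store with the highest revenue
```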
Step 17 — Check the Monitor Tab
Go back to ADF Studio → click "Monitor" (bar chart icon in the left sidebar). In production, this is where you check every morning that all pipelines ran successfully overnight.
Monitor tab — showing the pipeline run for pl_copy_daily_sales_csv with status 'Succeeded', duration, and timestamp
Click on the pipeline run to see full details.
Pipeline run detail — showing the copy activity, its duration, rows copied, and data volume
Resources Created — Summary
| Resource | Name | Purpose |
|---|---|---|
| Resource Group | rg-freshmart-dev | Container for all project resources |
| Storage Account | stfreshmartdev | Holds all data (landing + raw) |
| Container | landing | Staging area for uploaded files |
| Container | raw | Destination — Bronze layer |
| Data Factory | adf-freshmart-dev | Pipeline orchestration |
| Linked Service | ls_blob_freshmart_landing | Connection to landing container |
| Linked Service | ls_adls_freshmart | Connection to ADLS Gen2 raw container |
| Dataset | ds_src_blob_daily_sales | Points to landing/daily_sales.csv |
| Dataset | ds_sink_adls_raw_sales | Points to raw/sales/daily_sales.csv |
| Pipeline | pl_copy_daily_sales_csv | Copies the file end-to-end |
Key Concepts Reference
| Concept | What It Is | Analogy |
|---|---|---|
| Resource Group | Logical folder for Azure resources | Project folder on your computer |
| ADLS Gen2 | Cloud data lake with real folder hierarchy | Massive intelligent hard drive |
| Hierarchical Namespace | What makes ADLS Gen2 different from Blob | Real folders vs simulated ones |
| ADF | Visual pipeline orchestration tool | Automated courier service |
| Linked Service | Saved connection to a data source | Contact saved in your phone |
| Dataset | Pointer to a specific file or table | Address written on a package |
| Copy Activity | Single action that copies data | The delivery truck |
| Pipeline | Container of one or more activities | The full delivery workflow |
| Source | Where data comes FROM | Pickup location |
| Sink | Where data goes TO | Delivery destination |
| Debug | Test run on a draft pipeline | Proofreading before sending |
| Publish | Make pipeline live and schedulable | Clicking Send |
| Monitor | View all pipeline run logs | Delivery tracking dashboard |
Common Mistakes
Forgetting to enable Hierarchical Namespace
Fix: Delete the storage account and recreate it — cannot be enabled after creation
Not testing connection before saving Linked Service
Fix: Always click "Test connection" and wait for the green tick before saving
Forgetting to Publish after building
Fix: Always click "Publish all" after making changes — triggers use published version only
Wrong file path in dataset
Fix: Double-check container name, directory name, and filename in the dataset settings
What's Next
Right now our pipeline copies one specific file. But FreshMart has 10 stores — that means 10 files: store_ST001_sales.csv through store_ST010_sales.csv.
In Project 02, you will learn to use the ForEach activity to loop through all 10 files and copy them in one pipeline run — instead of creating 10 separate Copy activities. Same resources. Same storage account. Same ADF. Just smarter.
🎯 Key Takeaways
- ✓ADLS Gen2 requires the Hierarchical Namespace checkbox to be enabled at creation time — this cannot be changed later
- ✓ADF cannot reach your local laptop — upload files to a landing container first, then ADF copies from there
- ✓Always test connections on Linked Services before saving — catch errors early
- ✓Debug runs use your draft pipeline — Publish to make the pipeline available to triggers and schedules
- ✓Source = where data comes FROM. Sink = where data goes TO. These are standard data engineering terms
- ✓The Monitor tab is your daily health check — every pipeline run is logged with status, duration, and row counts
- ✓Resource Groups let you see all project costs together and delete everything with one click when done
Resume bullets this project series builds toward:
- Built end-to-end Medallion Architecture batch pipeline on Azure: ADLS Gen2 → Databricks PySpark → Synapse Analytics
- Implemented data quality framework validating 5,000 daily records — removing nulls, duplicates, and invalid values in the Silver layer
- Orchestrated multi-step Azure Data Factory pipeline with chained Databricks Notebook activities on a daily schedule trigger
Coming up later in the series:
- Medallion Architecture: Bronze, Silver, Gold — the pattern behind every modern data lake.
- Incremental loading: how to process only new data — watermarks, upserts, and change data capture.
- Data quality checks: what to check, when to check it, and what to do when checks fail.