
Project 01 — Copy a CSV File to Azure Data Lake

Build your first Azure data pipeline from scratch. Copy a CSV file from a local landing zone into ADLS Gen2 using Azure Data Factory — the foundational pattern behind every data engineering pipeline in Azure.

March 2026
Series: Azure Data Engineering — Zero to Advanced
Project: 01 of 25
Level: Absolute Beginner
Time: 60–90 minutes
What you will build

A pipeline that automatically copies a CSV file from your local computer into Azure cloud storage — the foundational pattern of every data engineering project in the world.

Real World Problem

The Company: FreshMart — a grocery chain with 10 stores across India.

Every day, each store manager exports a file called daily_sales.csv from their billing software and saves it on their computer. That file just sits there. The data team at FreshMart HQ has zero visibility into what is happening across stores. They cannot answer basic questions like:

  • Which store sold the most today?
  • Which product is running out of stock?
  • Is store revenue growing or shrinking week over week?

The root problem: Data is trapped on individual laptops. There is no central place to store it, no way to query it, no way to analyse it.

What we are going to do: Take that CSV file sitting on a laptop → upload it to Azure cloud storage automatically using a pipeline. This is the very first step of every data engineering project in the world — get the data off the source and into a central location.
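Once that data lands in one place, the questions above become one-liners. As a taste of what the data team will eventually be able to do, here is plain Python over a slice of the sample data used in this project (no Azure required; `revenue_by_store` is an illustrative helper, not part of the pipeline):

```python
import csv
import io
from collections import defaultdict

# A few rows from the daily_sales.csv created later in this project
SAMPLE = """order_id,store_id,product_name,category,quantity,unit_price,order_date
ORD001,ST001,Basmati Rice 5kg,Grocery,3,299.00,2024-01-15
ORD003,ST001,Samsung TV 43inch,Electronics,1,32000.00,2024-01-15
ORD004,ST002,Amul Butter 500g,Dairy,8,240.00,2024-01-15
ORD008,ST003,Sony Headphones,Electronics,2,4500.00,2024-01-15
"""

def revenue_by_store(csv_text: str) -> dict:
    """Total revenue (quantity x unit_price) per store_id."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["store_id"]] += int(row["quantity"]) * float(row["unit_price"])
    return dict(totals)

totals = revenue_by_store(SAMPLE)
top_store = max(totals, key=totals.get)
print(top_store, totals[top_store])  # prints: ST001 32897.0
```

"Which store sold the most today?" is trivial once the files are centralized; that is the payoff this pipeline is building toward.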

What we are building
💻
Your Computer
daily_sales.csv
⚙️
ADF Pipeline
pl_copy_sales_csv
🏦
ADLS Gen2
raw/sales/daily_sales.csv

Simple. One file. One destination. One pipeline. But this exact pattern is the foundation of every data engineering project you will ever build.

Concepts You Must Understand First

Before touching Azure, read this section completely — it will make every step feel logical instead of confusing.

What is Azure?

Azure is Microsoft's cloud platform. Instead of buying physical servers and hard drives — you rent them from Microsoft and pay only for what you use.

Think of it like electricity. You do not build a power plant to get electricity at home. You plug into the grid and pay a monthly bill. Azure is the same — you plug into Microsoft's infrastructure and pay for what you consume. For data engineering, Azure gives you cloud storage, compute, pipelines, and dashboards.

What is Azure Data Factory (ADF)?

ADF is a data pipeline orchestration tool. It answers the question: how do I move data from A to B automatically?

Think of ADF like a courier service. You tell it: pick up from this location (source), deliver to that location (destination), run every day at midnight, and alert me if delivery fails. That is exactly what ADF does with data.

Linked Service

The connection details to reach a data source — like saving a contact in your phone (name + number).

Dataset

Points to the specific file or table you want — like telling the courier: pick up the red box.

Activity

One single action — copy, transform, run code. Like one step in the delivery process.

Pipeline

A container that holds one or more activities — the complete delivery workflow from start to finish.

In this project we will create 2 Linked Services, 2 Datasets, 1 Activity, and 1 Pipeline.

What is ADLS Gen2?

ADLS stands for Azure Data Lake Storage Generation 2. It is cloud storage optimized for data analytics workloads — think of it as a massive intelligent hard drive in the cloud that can store any file type, never runs out of space, and is directly connected to every Azure analytics service.

⚠️ Critical — Enable Hierarchical Namespace
ADLS Gen2 is created by enabling one checkbox — Hierarchical Namespace — during storage account creation. This cannot be enabled after creation. If you forget it, you must delete the storage account and start over. We will cover this in detail in Step 4.

Step by Step Overview

PHASE 0 — Prepare (15 min)
  • 01. Create an Azure free account
  • 02. Create a Resource Group
  • 03. Create the CSV file on your computer
PHASE 1 — Storage (15 min)
  • 04. Create ADLS Gen2 storage account
  • 05. Create container and folder structure
PHASE 2 — ADF Setup (10 min)
  • 06. Create Azure Data Factory
  • 07. Open ADF Studio
PHASE 3 — Build the Pipeline (30 min)
  • 08. Create Linked Service — Blob (source)
  • 09. Create Linked Service — ADLS Gen2 (destination)
  • 10. Create Dataset — source CSV
  • 11. Create Dataset — destination ADLS
  • 12. Create Pipeline + Copy Activity
  • 13. Configure Source and Sink
  • 14. Debug (test run)
  • 15. Publish
  • 16. Verify file in ADLS
PHASE 0 — PREPARE

Step 1 — Create an Azure Account

Go to https://azure.microsoft.com/en-in/free and click "Start free".

You will get $200 free credit valid for 30 days, 12 months of popular services free, and always-free services including limited ADF and storage. You will need a Microsoft account, a phone number for verification, and a credit card for identity only — you will NOT be charged for free tier usage.

📸SCREENSHOT

Azure free account signup page — showing the 'Start free' button and the $200 credit offer

After signing up → go to https://portal.azure.com. You will land on the Azure Portal homepage — your control centre for everything Azure.

📸SCREENSHOT

Azure Portal homepage after first login — showing the top search bar, left sidebar, and the main dashboard area

What you are looking at

  • Top search bar: search for any Azure service by name — use this constantly
  • Left sidebar: your recently used services
  • Main area: your dashboard — empty now, fills up as you create resources
  • Top right: your account name, subscription, notifications

Step 2 — Create a Resource Group

A Resource Group is a logical folder that holds all Azure resources belonging to one project. Instead of having ADF, ADLS, and Databricks scattered randomly — you put them all in one Resource Group called rg-freshmart-dev.

Why use a Resource Group?

  • See all costs in one place: "How much is this entire project costing me?"
  • Delete everything at once: done with the project? Delete the group and everything inside is gone.
  • Set permissions once: give a teammate access to the group and they get access to everything inside.

In the Azure Portal search bar → type "Resource groups" → click it.

📸SCREENSHOT

Azure Portal — typing 'Resource groups' in the search bar, showing the suggestion dropdown

Click "+ Create" (top left button).

📸SCREENSHOT

Resource groups page — showing the '+ Create' button at top left

Fill in the form exactly as shown:

Subscription: your subscription name — e.g. "Azure subscription 1"
Resource group: rg-freshmart-dev
Region: East US 2
🎯 Naming Convention
We use: rg = resource group, freshmart = project name, dev = environment. Professional teams always follow naming conventions. Get in the habit now — all 25 projects will use this same pattern.
📸SCREENSHOT

Resource group creation form — all three fields filled in exactly as shown above

Click "Review + Create" → then "Create". Wait 5 seconds — you will see a notification: "Resource group created".

📸SCREENSHOT

Resource group successfully created — showing the overview page with name 'rg-freshmart-dev' and region 'East US 2'

Step 3 — Create the Sample CSV File

This represents the daily_sales.csv that FreshMart store managers export from their billing software every day. Open Notepad (Windows) or TextEdit (Mac) and paste exactly this:

daily_sales.csv
order_id,store_id,product_name,category,quantity,unit_price,order_date
ORD001,ST001,Basmati Rice 5kg,Grocery,3,299.00,2024-01-15
ORD002,ST001,Sunflower Oil 1L,Grocery,5,145.00,2024-01-15
ORD003,ST001,Samsung TV 43inch,Electronics,1,32000.00,2024-01-15
ORD004,ST002,Amul Butter 500g,Dairy,8,240.00,2024-01-15
ORD005,ST002,Basmati Rice 5kg,Grocery,2,299.00,2024-01-15
ORD006,ST002,Colgate Toothpaste,Personal Care,10,89.00,2024-01-15
ORD007,ST003,Lays Chips Family Pack,Snacks,15,99.00,2024-01-15
ORD008,ST003,Sony Headphones,Electronics,2,4500.00,2024-01-15
ORD009,ST003,Amul Milk 1L,Dairy,20,62.00,2024-01-15
ORD010,ST001,Dove Soap 100g,Personal Care,6,65.00,2024-01-15

Save the file as daily_sales.csv somewhere easy to find — your Desktop is fine.
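If you prefer scripting to Notepad, the same file can be generated with a few lines of Python (the elided rows are the remaining lines from the listing above; this script is just a convenience, not part of the pipeline):

```python
import csv
from pathlib import Path

# Columns and rows match the daily_sales.csv listing above.
# Values are kept as strings so the file matches the listing byte-for-byte.
HEADER = ["order_id", "store_id", "product_name", "category",
          "quantity", "unit_price", "order_date"]
ROWS = [
    ["ORD001", "ST001", "Basmati Rice 5kg", "Grocery", "3", "299.00", "2024-01-15"],
    ["ORD002", "ST001", "Sunflower Oil 1L", "Grocery", "5", "145.00", "2024-01-15"],
    # ... remaining rows from the listing above
]

def write_sales_csv(path: Path) -> int:
    """Write daily_sales.csv to the given path; returns the data-row count."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return len(ROWS)

write_sales_csv(Path("daily_sales.csv"))
```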

📸SCREENSHOT

Notepad with the CSV content pasted in — showing the file before saving

📸SCREENSHOT

Save As dialog — showing filename 'daily_sales.csv' being saved to Desktop

PHASE 1 — STORAGE

Step 4 — Create ADLS Gen2 Storage Account

In the Azure Portal search bar → type "Storage accounts" → click it → click "+ Create".

📸SCREENSHOT

Storage accounts page — showing the '+ Create' button at top left

Basics Tab

Resource group: rg-freshmart-dev
Storage account name: stfreshmartdev (no hyphens! storage account names allow only lowercase letters and numbers)
Region: East US 2
Performance: Standard
Redundancy: Locally-redundant storage (LRS)
📸SCREENSHOT

Storage account creation — Basics tab completely filled in with all values as shown above

Advanced Tab — The Most Important Checkbox

Click the "Advanced" tab → find the section called "Data Lake Storage Gen2" → enable Hierarchical namespace.

⚠️ Do Not Skip This
This single checkbox converts a regular Azure Blob Storage account into ADLS Gen2. Without it you get simulated folders instead of real directories, and analytics engines that rely on directory operations (renames, permissions) perform dramatically worse. You cannot enable this after creation. If you forget, delete the storage account and recreate it.
📸SCREENSHOT

Advanced tab — showing the 'Hierarchical namespace' checkbox being checked under the Data Lake Storage Gen2 section

Leave all other tabs as default. Click "Review" → "Create". Deployment takes about 30–60 seconds.

📸SCREENSHOT

Deployment complete — showing 'Your deployment is complete' with the resource name stfreshmartdev

Click "Go to resource".

📸SCREENSHOT

Storage account overview page — highlighting the 'Azure Data Lake Storage Gen2' label and the name stfreshmartdev

Step 5 — Create Container and Folder Structure

A container is like a top-level folder inside a storage account. We will create one container called raw — this is where all raw, unprocessed data lands. In later projects we will also have processed and curated containers following the Medallion Architecture.

On the storage account page → left sidebar → click "Containers" → "+ Container".

Name: raw
Public access: Private (no anonymous access)
📸SCREENSHOT

New container dialog — name 'raw' entered, Private selected

📸SCREENSHOT

Container 'raw' now visible in the containers list

Click on the "raw" container → click "+ Add Directory" → name it sales.

📸SCREENSHOT

raw container showing the 'sales' directory created inside it

🎯 Why This Folder Structure?
We are starting organized. In future projects we will add raw/products/, raw/customers/, raw/inventory/ as FreshMart grows. Starting with a clean hierarchy now saves massive headaches later.
PHASE 2 — ADF SETUP

Step 6 — Create Azure Data Factory

In the Azure Portal search bar → type "Data factories" → click it → click "+ Create".

Resource group: rg-freshmart-dev
Name: adf-freshmart-dev
Region: East US 2
Version: V2
📸SCREENSHOT

Data Factory creation form — all fields filled in as shown above

Click the "Git configuration" tab → check "Configure Git later". Git integration is important for production but adds complexity for beginners — we will cover it in a later project.

📸SCREENSHOT

Git configuration tab — 'Configure Git later' checkbox checked

Click "Review + create" → "Create". Deployment takes 1–2 minutes.

📸SCREENSHOT

ADF deployment complete — showing 'adf-freshmart-dev' successfully created

Step 7 — Open ADF Studio

Click "Go to resource" → on the ADF overview page → click "Launch studio". ADF Studio opens in a new browser tab — this is where you will spend 90% of your time.

📸SCREENSHOT

ADF overview page — showing the 'Launch studio' button in the centre

📸SCREENSHOT

ADF Studio homepage — label each section: (1) Left sidebar icons, (2) Main canvas area, (3) Top toolbar

ADF Studio Layout

  • 🏠 Home: welcome page
  • ✏️ Author: where you BUILD pipelines, datasets, linked services
  • 📊 Monitor: where you SEE pipeline runs, success/failure logs
  • 🔧 Manage: where you set up linked services and integration runtimes
PHASE 3 — BUILD THE PIPELINE

Step 8 — Create Linked Service for Source (Blob Storage)

ADF lives in the cloud and cannot directly reach into your laptop. The solution: we first upload the CSV to a landing container in the same storage account, then ADF copies it from there to the raw/sales/ destination.

Think of it like a courier — the courier cannot teleport to your home. You drop the package at a collection point, then the courier picks it up and delivers it.

Upload the CSV to a landing container

Go to Azure Portal → Storage account stfreshmartdev → Containers → "+ Container".

Name: landing
Public access: Private
📸SCREENSHOT

Creating 'landing' container — name and private access level set

Click on the "landing" container → click "Upload" → select your daily_sales.csv from your Desktop → click "Upload".

📸SCREENSHOT

landing container after upload — daily_sales.csv visible with file size and last modified date

Create the Linked Service

Go back to ADF Studio → click "Manage" (toolbox icon in left sidebar) → click "Linked services" → click "+ New".

📸SCREENSHOT

Linked services page — empty list and '+ New' button visible

In the search box → type "Azure Blob" → select "Azure Blob Storage" → click "Continue".

📸SCREENSHOT

New linked service panel — 'Azure Blob Storage' selected in search results

Name: ls_blob_freshmart_landing
Connect via: AutoResolveIntegrationRuntime
Authentication: Account key
Storage account: stfreshmartdev

Click "Test connection" at the bottom. You should see a green ✅ "Connection successful".

📸SCREENSHOT

Green 'Connection successful' message at the bottom of the linked service form

Click "Create".

📸SCREENSHOT

Linked services list — ls_blob_freshmart_landing now visible

Step 9 — Create Linked Service for Destination (ADLS Gen2)

Still in Manage → Linked services → click "+ New" → search "Azure Data Lake Storage Gen2" → select it → "Continue".

📸SCREENSHOT

New linked service — 'Azure Data Lake Storage Gen2' selected

Name: ls_adls_freshmart
Connect via: AutoResolveIntegrationRuntime
Authentication: Account key
Storage account: stfreshmartdev
💡 Same Storage Account, Two Linked Services?
Yes — in this project both source and destination live in stfreshmartdev. We still create two separate linked services because one represents Blob Storage behaviour (landing) and one represents ADLS Gen2 behaviour (raw, with hierarchical namespace). In real projects these will be completely different accounts.

Click "Test connection" → ✅ green → click "Create".

📸SCREENSHOT

Linked services list — now showing both: ls_blob_freshmart_landing and ls_adls_freshmart

Step 10 — Create Source Dataset

A Dataset tells ADF which specific file to work with. The Linked Service is how to connect to the storage — the Dataset is what file specifically to read or write.
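Under the hood, ADF Studio stores every dataset as a JSON definition (you can inspect it later via the {} code view in the Studio). Below is a rough sketch of what the source dataset you are about to build looks like, written as a Python dict; the property names are close approximations, so check the real code view in your factory rather than copying this verbatim:

```python
# Approximate JSON definition ADF stores for the source dataset.
# Property names are illustrative -- inspect the {} code view in
# ADF Studio for the authoritative version.
src_dataset = {
    "name": "ds_src_blob_daily_sales",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "ls_blob_freshmart_landing",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "landing",
                "fileName": "daily_sales.csv",
            },
            "firstRowAsHeader": True,
            "columnDelimiter": ",",
        },
    },
}
```

Notice how the dataset only holds the *what* (container, file name, format); the *how to connect* lives entirely in the referenced linked service.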

Click "Author" (pencil icon in left sidebar) → click "+" next to "Datasets" → "New dataset".

📸SCREENSHOT

Author tab — showing Datasets section with the '+' button highlighted

Search "Azure Blob Storage" → select it → "Continue". Select format: "DelimitedText" (this means CSV) → "Continue".

📸SCREENSHOT

Format selection — DelimitedText/CSV selected

Name: ds_src_blob_daily_sales
Linked service: ls_blob_freshmart_landing
Container: landing
File: daily_sales.csv
First row as header: ✅ Yes
Import schema: From connection/store
📸SCREENSHOT

Dataset properties form — all fields filled in exactly as shown

Click "OK" → then click the "Preview data" tab at the bottom. You should see all 10 rows of FreshMart sales data — this confirms ADF can read your file.

📸SCREENSHOT

Dataset preview — showing all 10 rows of CSV data in a clean table format

Click 💾 Save (or Ctrl+S).

📸SCREENSHOT

Dataset saved — ds_src_blob_daily_sales visible in the Datasets list on the left

Step 11 — Create Destination Dataset

Click "+" next to "Datasets" → "New dataset" → search "Azure Data Lake Storage Gen2" → select it → "Continue". Select format: "DelimitedText" → "Continue".

Name: ds_sink_adls_raw_sales
Linked service: ls_adls_freshmart
Container: raw
Directory: sales
File: daily_sales.csv
First row as header: ✅ Yes
Import schema: None (the destination does not need a schema)
📸SCREENSHOT

Destination dataset form — all fields showing raw/sales/daily_sales.csv path

Click "OK" → 💾 Save.

📸SCREENSHOT

Datasets list — now showing both ds_src_blob_daily_sales and ds_sink_adls_raw_sales

Step 12 — Create the Pipeline

Click "+" next to "Pipelines" → "New pipeline". A blank canvas opens.

📸SCREENSHOT

Author tab — Pipelines section with '+' button highlighted

In the Properties panel on the right, set the name and description:

Name: pl_copy_daily_sales_csv
Description: Copies daily_sales.csv from landing zone to ADLS raw/sales/
📸SCREENSHOT

Pipeline canvas — Properties panel on right showing the name and description filled in

📸SCREENSHOT

Empty pipeline canvas — labelling each area: top toolbar, left activities panel, centre canvas, bottom properties panel

Step 13 — Add and Configure Copy Activity

The Copy Activity does one thing: read data from a source dataset and write it to a sink (destination) dataset. Source = where data comes FROM. Sink = where data goes TO. (Sink is standard data engineering terminology — like a kitchen sink where water flows into.)
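Like datasets, the whole pipeline is stored as JSON behind the canvas. A simplified sketch of its shape, written as a Python dict (property names are approximations; the {} code view in ADF Studio is the authoritative version):

```python
# Rough shape of the pipeline JSON behind the canvas -- illustrative
# only; open the {} code view in ADF Studio for the real definition.
pipeline = {
    "name": "pl_copy_daily_sales_csv",
    "properties": {
        "activities": [
            {
                "name": "copy_daily_sales_to_adls",
                "type": "Copy",
                # Source dataset(s) the activity reads FROM
                "inputs": [{"referenceName": "ds_src_blob_daily_sales",
                            "type": "DatasetReference"}],
                # Sink dataset(s) the activity writes TO
                "outputs": [{"referenceName": "ds_sink_adls_raw_sales",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}
```

The nesting mirrors the concepts from earlier: one pipeline containing one activity, which references one input dataset and one output dataset.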

In the left activities panel → expand "Move & transform" → drag "Copy data" onto the canvas.

📸SCREENSHOT

Dragging 'Copy data' activity from the left panel onto the canvas

📸SCREENSHOT

Copy data activity placed on the canvas — a blue box with 'Copy data' label

Click the Copy activity box to select it. The bottom panel shows configuration tabs. Set the following in each tab:

General Tab

Name: copy_daily_sales_to_adls
Description: Reads daily_sales.csv from landing and writes to raw/sales/
📸SCREENSHOT

General tab — activity name and description filled in

Source Tab

Source dataset: ds_src_blob_daily_sales
File path type: File path in dataset
📸SCREENSHOT

Source tab — ds_src_blob_daily_sales selected as source dataset

Sink Tab

Sink dataset: ds_sink_adls_raw_sales
Copy behavior: PreserveHierarchy
📸SCREENSHOT

Sink tab — ds_sink_adls_raw_sales selected as sink dataset

Mapping Tab

Click "Mapping" → click "Import schemas". ADF will automatically detect all columns and map them 1:1.

📸SCREENSHOT

Mapping tab — all columns auto-mapped with arrows between source and destination

Click 💾 Save.

📸SCREENSHOT

Saved pipeline — pl_copy_daily_sales_csv visible in the Pipelines list on the left

Step 14 — Debug (Test Run)

Before scheduling or publishing anything, always run a Debug first. Debug runs the pipeline immediately using your current draft — no effect on production.

Click "Debug" in the top toolbar. No parameters for this pipeline — click "OK".

📸SCREENSHOT

Top toolbar — Debug button highlighted with cursor pointing to it

Watch the canvas. The Copy activity will show one of three states:

  • 🟡 Yellow: Running
  • 🟢 Green: Success
  • 🔴 Red: Failed
📸SCREENSHOT

Pipeline running — Copy activity showing yellow/spinning status

📸SCREENSHOT

Pipeline succeeded — Copy activity showing green checkmark

At the bottom → "Output" tab → click the 👓 glasses icon next to the run to see details.

📸SCREENSHOT

Copy activity run details — showing files read: 1, files written: 1, data read and written amounts

Step 15 — Publish

In ADF, everything you build exists as a draft until published. Debug runs work on drafts. But triggers and scheduled runs only use published pipelines.

🎯 Draft vs Published
Think of it like a Google Doc — you are editing a draft. Publishing is clicking "Share" so others and scheduled triggers can see and use the final version. Always publish after making changes.

Click "Publish all" in the top toolbar. A panel shows everything that will be published — all 5 items we created. Click "Publish".

📸SCREENSHOT

Publish panel — showing all 5 items: pipeline, 2 datasets, 2 linked services

📸SCREENSHOT

'Successfully published' notification in the top right corner

Step 16 — Verify File in ADLS

Go to Azure Portal → Storage accounts → stfreshmartdev → Containers → raw → sales.

📸SCREENSHOT

sales folder contents — showing daily_sales.csv file with file size and last modified timestamp

Click on daily_sales.csv → click "Edit" to preview its contents.

📸SCREENSHOT

File preview in Azure Portal — showing all 10 rows of FreshMart sales data confirming the copy was successful

🎉 You did it
You just completed your first Azure Data Engineering pipeline. A CSV file that was sitting on a laptop is now safely stored in Azure Data Lake — accessible to Databricks, Synapse, and Power BI — copied by an automated pipeline that you built from scratch.

Step 17 — Check the Monitor Tab

Go back to ADF Studio → click "Monitor" (bar chart icon in the left sidebar). In production, this is where you check every morning that all pipelines ran successfully overnight.

📸SCREENSHOT

Monitor tab — showing the pipeline run for pl_copy_daily_sales_csv with status 'Succeeded', duration, and timestamp

Click on the pipeline run to see full details.

📸SCREENSHOT

Pipeline run detail — showing the copy activity, its duration, rows copied, and data volume

Resources Created — Summary

Resource | Name | Purpose
Resource Group | rg-freshmart-dev | Container for all project resources
Storage Account | stfreshmartdev | Holds all data (landing + raw)
Container | landing | Staging area for uploaded files
Container | raw | Destination — Bronze layer
Data Factory | adf-freshmart-dev | Pipeline orchestration
Linked Service | ls_blob_freshmart_landing | Connection to landing container
Linked Service | ls_adls_freshmart | Connection to ADLS Gen2 raw container
Dataset | ds_src_blob_daily_sales | Points to landing/daily_sales.csv
Dataset | ds_sink_adls_raw_sales | Points to raw/sales/daily_sales.csv
Pipeline | pl_copy_daily_sales_csv | Copies the file end-to-end

Key Concepts Reference

Concept | What It Is | Analogy
Resource Group | Logical folder for Azure resources | Project folder on your computer
ADLS Gen2 | Cloud data lake with real folder hierarchy | Massive intelligent hard drive
Hierarchical Namespace | What makes ADLS Gen2 different from Blob | Real folders vs simulated ones
ADF | Visual pipeline orchestration tool | Automated courier service
Linked Service | Saved connection to a data source | Contact saved in your phone
Dataset | Pointer to a specific file or table | Address written on a package
Copy Activity | Single action that copies data | The delivery truck
Pipeline | Container of one or more activities | The full delivery workflow
Source | Where data comes FROM | Pickup location
Sink | Where data goes TO | Delivery destination
Debug | Test run on a draft pipeline | Proofreading before sending
Publish | Make pipeline live and schedulable | Clicking Send
Monitor | View all pipeline run logs | Delivery tracking dashboard

Common Mistakes

⚠️

Forgetting to enable Hierarchical Namespace

Fix: Delete the storage account and recreate it — cannot be enabled after creation

⚠️

Not testing connection before saving Linked Service

Fix: Always click "Test connection" and wait for the green tick before saving

⚠️

Forgetting to Publish after building

Fix: Always click "Publish all" after making changes — triggers use published version only

⚠️

Wrong file path in dataset

Fix: Double-check container name, directory name, and filename in the dataset settings

What is coming in Project 02

Right now our pipeline copies one specific file. But FreshMart has 10 stores — that means 10 files: store_ST001_sales.csv through store_ST010_sales.csv.

In Project 02, you will learn to use the ForEach activity to loop through all 10 files and copy them in one pipeline run — instead of creating 10 separate Copy activities. Same resources. Same storage account. Same ADF. Just smarter.
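As a tiny preview of that naming pattern, here is what the ForEach activity will iterate over, generated locally in Python (the filenames are the hypothetical per-store files described above):

```python
# The ten per-store files Project 02 will loop over with a ForEach
# activity -- generated here only to show the naming pattern.
store_files = [f"store_ST{n:03d}_sales.csv" for n in range(1, 11)]
print(store_files[0], store_files[-1])  # prints: store_ST001_sales.csv store_ST010_sales.csv
```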

🎯 Key Takeaways

  • ADLS Gen2 requires the Hierarchical Namespace checkbox to be enabled at creation time — this cannot be changed later
  • ADF cannot reach your local laptop — upload files to a landing container first, then ADF copies from there
  • Always test connections on Linked Services before saving — catch errors early
  • Debug runs use your draft pipeline — Publish to make the pipeline available to triggers and schedules
  • Source = where data comes FROM. Sink = where data goes TO. These are standard data engineering terms
  • The Monitor tab is your daily health check — every pipeline run is logged with status, duration, and row counts
  • Resource Groups let you see all project costs together and delete everything with one click when done
📄 Resume Bullet Points
Copy these directly to your resume — tailored from this lesson

Built an Azure Data Factory pipeline that copies CSV sales data from a Blob Storage landing zone into ADLS Gen2

Provisioned ADLS Gen2 storage with hierarchical namespace enabled and an organized container and folder structure for raw data

Configured linked services, datasets, and a Copy activity with schema mapping, validated via debug runs before publishing
