Project 04 — Download Files From a Public HTTPS URL
Stop uploading files manually. Build a pipeline that goes directly to a public internet URL, downloads the CSV, and saves it to ADLS — automatically, every morning. No analyst involvement required.
Azure DE — Zero to Advanced
04 of 25
Beginner
60–75 min
🏢 Real World Problem
FreshMart's data team now works with an external supplier — AgriPrice India — a government-backed organization that publishes daily wholesale vegetable and fruit prices across Indian cities.
FreshMart's category managers need this data to:
- Compare what they paid suppliers vs the wholesale market price
- Spot overpriced procurement deals
- Adjust store pricing based on commodity fluctuations
AgriPrice India publishes their data every morning as a CSV file on their public website. Right now, FreshMart's analyst manually opens the browser every morning, downloads the file, and uploads it to Azure. 20 minutes every single day. If the analyst is on leave, it doesn't happen at all.
BEFORE:
Analyst → opens browser → downloads file → uploads to Azure → 20 min/day
AFTER:
ADF Pipeline → hits the URL → saves to ADLS → 0 human minutes

🧠 Concepts You Must Understand First
What is an HTTP Linked Service?
In Projects 01–03, our source was always Azure Blob Storage — files sitting inside Azure. But data sources in the real world are everywhere: government portals, supplier servers, financial data providers, open datasets. All accessible via HTTP or HTTPS — the same protocol your browser uses.
An HTTP Linked Service tells ADF:
HTTP Linked Service: https://people.sc.fsu.edu
↑ the base server address
HTTP Dataset: /~jburkardt/data/csv/cities.csv
↑ the specific file path on that server
Together they form: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
        ↑ the full URL ADF fetches

HTTP File Download vs REST API
| HTTP (File Download) | REST API (Data Service) |
|---|---|
| Downloads a static file | Returns structured data (JSON/XML) |
| URL points to a file | URL is an endpoint that processes requests |
| Response is the file itself | Response is data (often paginated) |
| Usually no authentication | Usually requires an API key or OAuth token |
| Example: CSV on government website | Example: Twitter API, Weather API |
| This project → | Project 06 → |
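The distinction can be sketched in a few lines of Python: a file URL returns the file's bytes verbatim, while an API endpoint returns structured data you must parse. The JSON payload below is made up purely for illustration:

```python
import json

# An HTTP file download returns the file's contents verbatim:
file_response = b"LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State\n41,5,59,N,80,39,0,W,Youngstown,OH\n"
lines = file_response.decode().splitlines()   # usable as-is, no parsing protocol needed

# A REST API returns structured data — typically JSON (hypothetical payload):
api_response = '{"commodity": "Tomato", "city": "Pune", "wholesale_price": 41.5}'
record = json.loads(api_response)             # must be parsed before use
```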
Public Datasets We Are Using
We will use two stable, public CSVs hosted by Florida State University — available for 10+ years:
URL: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
41,5,59,N,80,39,0,W,Youngstown,OH
42,52,48,N,97,23,23,W,Yankton,SD
...

URL: https://people.sc.fsu.edu/~jburkardt/data/csv/grades.csv

What We Are Building
INTERNET (Public URLs)                  ADF Pipeline                  ADLS Gen2

https://fsu.edu/.../cities.csv ──►  Copy Activity          ──►  raw/
https://fsu.edu/.../grades.csv ──►  Copy Activity          ──►  └── external/
        (via HTTP Linked Service)                                   ├── cities/
                                                                    │   └── cities.csv
      pl_download_external_data                                     └── grades/
                                                                        └── grades.csv

📋 Step by Step Overview
PHASE 1 — Understand the Source URLs (5 min)
Step 1: Open and inspect the public CSV URLs in your browser
PHASE 2 — Create HTTP Linked Service (15 min)
Step 2: Create Linked Service for the FSU data server
Step 3: Test the connection
PHASE 3 — Create Datasets (15 min)
Step 4: Create parameterized HTTP source dataset
Step 5: Create ADLS sink dataset with dynamic path
PHASE 4 — Build the Pipeline (30 min)
Step 6: Create new pipeline
Step 7: Add parameters — source_relative_url, destination_folder, file_name
Step 8: Add Copy activity
Step 9: Configure HTTP Source
Step 10: Configure ADLS Sink
Step 11: Debug and verify
Step 12: Add second Copy activity for parallel download
Step 13: Debug parallel run
Step 14: Publish

Phase 1 — Understand the Source URLs
Step 1 — Open and Inspect the URLs in Your Browser
Before building anything in ADF, always verify the source URL works.
Open your browser and go to:
https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv

Your browser should display or download the CSV. You should see data starting with:
LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
41,5,59,N,80,39,0,W,Youngstown,OH

Browser showing the cities.csv file content at the FSU URL — raw CSV text visible
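If you want to sanity-check the header programmatically rather than eyeball it in the browser, a short Python snippet using the sample rows shown above does the same job:

```python
import csv
import io

# First two lines of cities.csv, as seen in the browser:
sample = (
    "LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State\n"
    "41,5,59,N,80,39,0,W,Youngstown,OH\n"
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)       # ['LatD', 'LatM', ..., 'City', 'State']
first_row = next(reader)    # ['41', '5', ..., 'Youngstown', 'OH']
```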
Now open the second URL:
https://people.sc.fsu.edu/~jburkardt/data/csv/grades.csv

Browser showing the grades.csv file content — raw CSV text visible
Notice the split between base URL and file path:
Base URL: https://people.sc.fsu.edu
File path: /~jburkardt/data/csv/cities.csv
File path: /~jburkardt/data/csv/grades.csv

This separation matters — the Linked Service stores the base URL, the Dataset stores the file path.
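You can mimic how the full URL is assembled with Python's `urljoin` — the Linked Service supplies the base, the Dataset supplies the relative path. This is a sketch of the idea, not ADF's actual internals:

```python
from urllib.parse import urljoin

base_url = "https://people.sc.fsu.edu"         # stored in the Linked Service
relative = "/~jburkardt/data/csv/cities.csv"   # stored in the Dataset

full_url = urljoin(base_url, relative)
# -> https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
```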
Phase 2 — Create HTTP Linked Service
Step 2 — Create the HTTP Linked Service
In ADF Studio → Manage (toolbox icon) → Linked services → "+ New"
In the search box → type "HTTP" → select "HTTP" → click "Continue"
New linked service search — 'HTTP' typed, HTTP connector highlighted in results
HTTP linked service form — blank, ready to fill
Fill in the form exactly:
Name: ls_http_public_data
Description: HTTP connection to public data sources on the internet
Connect via: AutoResolveIntegrationRuntime
Base URL: https://people.sc.fsu.edu
Authentication type: Anonymous

Authentication type: Anonymous means no login required. In Project 06, when we connect to a REST API, you will see how to add API keys and Bearer tokens here.
HTTP linked service form fully filled — name, base URL, anonymous authentication all set
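In HTTP terms, Anonymous simply means the request carries no credentials. A small `urllib` sketch shows the difference — the API endpoint and token below are made up; real authentication is covered in Project 06:

```python
from urllib.request import Request

# Anonymous: the request has no Authorization header — fine for public files.
anon = Request("https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv")

# Authenticated (what a REST API typically needs; hypothetical endpoint/token):
authed = Request(
    "https://api.example.com/prices",
    headers={"Authorization": "Bearer <your-token>"},
)
# anon.get_header("Authorization") -> None
```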
Step 3 — Test the Connection
Click "Test connection" at the bottom of the form
Test connection button being clicked
You should see:
✅ Connection successful

Green 'Connection successful' message
"Could not connect to host" → Check the base URL spelling — no trailing slash, no file path
"SSL certificate error" → Try changing https:// to http://
"Timeout" → Azure's outbound internet may be blocked in your subscription — check network policies
Click "Create"
Linked services list — ls_http_public_data now visible alongside previous blob and ADLS linked services
Phase 3 — Create Datasets
Step 4 — Create Parameterized HTTP Source Dataset
This dataset will be reusable — instead of creating one dataset per URL, we create one dataset with a relative_url parameter. Different pipelines can pass different file paths to the same dataset.
In ADF Studio → Author → Datasets → "+" → "New dataset"
Search "HTTP" → select "HTTP" → click "Continue"
New dataset — HTTP connector selected
Select format: "DelimitedText" (CSV) → "Continue"
Format selection — DelimitedText selected
Name: ds_src_http_csv
Linked service: ls_http_public_data
Relative URL: (leave empty — we will make this dynamic)
Request method: GET
First row as header: ✅ Yes
Import schema: None

HTTP dataset form — name filled, linked service selected, relative URL left empty
Click "OK"
Now add a parameter. Click "Parameters" tab → "+ New"
Name: relative_url
Type: String
Default: (empty)

Dataset Parameters tab — relative_url parameter of type String
Click "Connection" tab → click inside "Relative URL" field → click "Add dynamic content"
In the expression editor → click relative_url under Parameters
@dataset().relative_url

Dynamic content editor — @dataset().relative_url expression
Click "OK"
Connection tab — Relative URL showing @dataset().relative_url dynamic expression
Linked Service:
https://people.sc.fsu.edu

Dataset relative URL:
/~jburkardt/data/csv/cities.csv

ADF fetches:
https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv

Click 💾 Save
Step 5 — Create ADLS Sink Dataset With Dynamic Path
In ADF Studio → Datasets → "+" → "New dataset"
Search "Azure Data Lake Storage Gen2" → select → "Continue"
Select "DelimitedText" → "Continue"
Name: ds_sink_adls_external
Linked service: ls_adls_freshmart
File path: (leave all empty)
First row as header: ✅ Yes
Import schema: None

Click "OK"
Click "Parameters" tab → add TWO parameters:
Parameter 1:
Name: destination_folder
Type: String
Parameter 2:
Name: file_name
Type: String

Dataset Parameters tab — destination_folder and file_name parameters listed
Click "Connection" tab → configure each field:
Container:
raw

Directory → "Add dynamic content":
external/@{dataset().destination_folder}

File → "Add dynamic content":
@dataset().file_name

Why external/ as a top-level folder? It keeps internet-sourced data separate from internal store sales data.

raw/
├── sales/     ← internal store sales (Projects 01–03)
└── external/  ← data downloaded from internet (this project)
    ├── cities/
    └── grades/

Sink dataset Connection tab — raw/external/@{dataset().destination_folder}/@dataset().file_name
Click 💾 Save
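The sink's dynamic path is plain string interpolation. A hypothetical Python equivalent of what the container, directory, and file expressions produce together:

```python
def sink_path(destination_folder: str, file_name: str) -> str:
    """Mirror of the sink dataset's path: container 'raw',
    directory 'external/@{dataset().destination_folder}',
    file '@dataset().file_name'."""
    return f"raw/external/{destination_folder}/{file_name}"

sink_path("cities", "cities.csv")   # -> raw/external/cities/cities.csv
sink_path("grades", "grades.csv")   # -> raw/external/grades/grades.csv
```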
Phase 4 — Build the Pipeline
Step 6 — Create New Pipeline
In ADF Studio → Author → "+" next to Pipelines → "New pipeline"
Name: pl_download_external_data
Description: Downloads CSV files from public HTTPS URLs into ADLS raw/external/

New blank pipeline — name in Properties panel on the right
Step 7 — Add Pipeline Parameters
Click empty canvas → "Parameters" tab at the bottom → add THREE parameters:
Parameter 1:
Name: source_relative_url
Type: String
Default: /~jburkardt/data/csv/cities.csv
Parameter 2:
Name: destination_folder
Type: String
Default: cities
Parameter 3:
Name: file_name
Type: String
Default: cities.csv

Pipeline Parameters tab — all three parameters listed with their defaults

Run 1 → downloads cities.csv to raw/external/cities/cities.csv
Run 2 → downloads grades.csv to raw/external/grades/grades.csv

Step 8 — Add Copy Activity
From left activities panel → "Move & transform" → drag "Copy data" onto the canvas
Copy data activity dragged onto the pipeline canvas
Click the Copy activity → bottom panel → General tab:
Name: copy_http_to_adls
Description: Downloads file from HTTP source URL and saves to ADLS raw/external/

General tab — name and description filled
Step 9 — Configure HTTP Source
Click "Source" tab
Source dataset: ds_src_http_csv

The Dataset properties section appears with the relative_url field.
Click inside relative_url → "Add dynamic content" → click source_relative_url under Parameters:
@pipeline().parameters.source_relative_url

Dynamic content editor — @pipeline().parameters.source_relative_url expression
Click "OK"
Additional Source settings:
Request method: GET
Additional headers: (leave empty)
Pagination rules: (leave empty)

Source tab fully configured — ds_src_http_csv selected, relative_url showing pipeline parameter expression
Step 10 — Configure ADLS Sink
Click "Sink" tab
Sink dataset: ds_sink_adls_external

Two dataset properties appear. Fill both:
destination_folder → "Add dynamic content":
@pipeline().parameters.destination_folder

file_name → "Add dynamic content":
@pipeline().parameters.file_name

Sink tab — both destination_folder and file_name properties filled with pipeline parameter expressions
Step 11 — Debug and Verify
Click "Validate" in the top toolbar
Validation successful — no errors message
Click "Debug" → the parameter dialog appears pre-filled with your defaults:
source_relative_url: /~jburkardt/data/csv/cities.csv
destination_folder: cities
file_name: cities.csv

Debug parameter dialog — all three parameters showing their default values
Click "OK"
Pipeline running — copy_http_to_adls activity showing yellow/running status
Pipeline succeeded — copy_http_to_adls activity showing green checkmark
Click the 👓 glasses icon next to the run in the Output tab:
Data read: X bytes
Data written: X bytes
Files read: 1
Files written: 1
Source: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
Destination: raw/external/cities/cities.csv

Copy activity run details — showing source URL and destination path, files read and written = 1
Verify in ADLS: Azure Portal → Storage → stfreshmartdev → Containers → raw
raw/
├── sales/     ← from Projects 01–03
└── external/  ← NEW — created by this pipeline
    └── cities/
        └── cities.csv ✅

raw container — showing both 'sales' and 'external' folders
raw/external/cities/ — cities.csv visible with file size
cities.csv preview in Azure Portal — CSV data with LatD, LatM, City, State columns
Step 12 — Download the Second File (grades.csv)
Run the same pipeline again with different parameters — no code changes.
Click "Debug" again → change all three parameters:
source_relative_url: /~jburkardt/data/csv/grades.csv
destination_folder: grades
file_name: grades.csv

Debug dialog — all three parameters changed to grades values
Click "OK" → wait for green ✅
Pipeline succeeded again — same pipeline, different parameters
raw/external/
├── cities/
│   └── cities.csv ✅ (from first run)
└── grades/
    └── grades.csv ✅ (from second run)

raw/external/ — both cities and grades folders visible
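Conceptually, the two debug runs are the same "function" called with different arguments. A toy Python sketch of the idea (this is illustrative, not ADF's expression engine):

```python
# The same pipeline, run twice with different parameter sets:
base_url = "https://people.sc.fsu.edu"
runs = [
    {"source_relative_url": "/~jburkardt/data/csv/cities.csv",
     "destination_folder": "cities", "file_name": "cities.csv"},
    {"source_relative_url": "/~jburkardt/data/csv/grades.csv",
     "destination_folder": "grades", "file_name": "grades.csv"},
]

for p in runs:
    source = base_url + p["source_relative_url"]
    destination = f"raw/external/{p['destination_folder']}/{p['file_name']}"
    print(source, "->", destination)
```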
Step 13 — Add a Second Copy Activity for Parallel Download
Right now you run the pipeline twice manually. In production you want both files in a single run. Let's redesign the pipeline to download both files simultaneously.
First — update pipeline parameters to handle two files. Click empty canvas → Parameters tab → replace with 6 parameters:
file1_url: /~jburkardt/data/csv/cities.csv
file1_folder: cities
file1_name: cities.csv
file2_url: /~jburkardt/data/csv/grades.csv
file2_folder: grades
file2_name: grades.csv

Pipeline Parameters tab — six parameters, three for each file
Rename the existing Copy activity → General tab: copy_file1_http_to_adls
Update its Source dataset property:
@pipeline().parameters.file1_url

Update its Sink dataset properties:
@pipeline().parameters.file1_folder
@pipeline().parameters.file1_name

First copy activity — Source showing file1_url parameter, Sink showing file1_folder and file1_name
Drag another "Copy data" onto the canvas — place it beside the first — do NOT connect them with an arrow
Canvas with TWO Copy activities side by side — no arrow connecting them, they run in parallel
Click the second Copy activity → General tab:
Name: copy_file2_http_to_adls
Description: Downloads grades.csv from HTTP source

Source tab → ds_src_http_csv → relative_url:
@pipeline().parameters.file2_url

Second Copy activity Source tab — file2_url parameter used

Sink tab → ds_sink_adls_external
@pipeline().parameters.file2_folder
@pipeline().parameters.file2_name

Second Copy activity Sink tab — file2_folder and file2_name parameters used
Step 14 — Debug the Parallel Pipeline and Publish
Click "Validate" → should pass with no errors
Validation successful
Click "Debug" → all 6 parameters show their defaults → click "OK"
Debug dialog — all 6 parameters visible with default values pre-filled
Both copy activities running simultaneously — both showing yellow/spinning at the same time
Both activities completed — both showing green checkmarks
Both ran in parallel. Total time ≈ the slower activity, not the sum:
copy_file1_http_to_adls: ~5 seconds
copy_file2_http_to_adls: ~4 seconds
Total pipeline time: ~5 seconds (parallel!)
If sequential: ~9 seconds
Time saved: ~4 seconds

Output tab — both activities showing individual durations, total pipeline duration is the max not the sum
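The timing claim is easy to demonstrate: independent tasks running concurrently finish in roughly the time of the slowest one. A sketch with simulated transfer times, where sleeps stand in for the two Copy activities:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def copy_activity(name: str, seconds: float) -> str:
    time.sleep(seconds)        # simulated transfer time
    return name

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    done = list(pool.map(copy_activity,
                         ["cities.csv", "grades.csv"], [0.5, 0.4]))
elapsed = time.perf_counter() - start
# elapsed ≈ 0.5 s (the slower task), well under the 0.9 s a sequential run would take
```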
raw/external/
├── cities/
│   └── cities.csv ✅
└── grades/
    └── grades.csv ✅

raw/external/ folder — both cities and grades folders side by side
Click "Publish all"
Publishing:
pl_download_external_data (new)
ds_src_http_csv (new)
ds_sink_adls_external (new)
ls_http_public_data (new)

Publish panel — all new items listed
Successfully published notification
🎯 What You Built — Summary
BEFORE:
Analyst manually downloads file from browser every morning
Uploads to Azure manually — 20 minutes per day
If analyst is on leave — data is missing
AFTER:
ADF fetches directly from URL — zero human involvement
Two files downloaded in parallel in a single pipeline run
Same pipeline handles any public CSV by changing parameters
raw/external/ folder organized for all internet-sourced data

Full ADF Resource Inventory After Project 04
| Resource Type | Name | Purpose |
|---|---|---|
| Linked Service | ls_blob_freshmart_landing | Azure Blob (landing zone) |
| Linked Service | ls_adls_freshmart | ADLS Gen2 (raw/processed/curated) |
| Linked Service | ls_http_public_data | Public internet HTTP/HTTPS |
| Dataset | ds_src_blob_daily_sales | Single static CSV from Blob |
| Dataset | ds_src_blob_store_sales | Parameterized store files from Blob |
| Dataset | ds_src_blob_dated_store_sales | Date-parameterized store files |
| Dataset | ds_src_http_csv | Public HTTP CSV — relative URL is dynamic |
| Dataset | ds_sink_adls_raw_sales | ADLS sink for sales data |
| Dataset | ds_sink_adls_dated_sales | ADLS sink with date partitioning |
| Dataset | ds_sink_adls_external | ADLS sink for external downloads |
| Pipeline | pl_copy_daily_sales_csv | Project 01 — single file |
| Pipeline | pl_copy_all_store_sales | Project 02 — ForEach loop |
| Pipeline | pl_copy_store_sales_by_date | Project 03 — date parameterized |
| Pipeline | pl_download_external_data | Project 04 — HTTP download |
🧠 Key Concepts to Remember
| Concept | What It Is | When You Use It |
|---|---|---|
| HTTP Linked Service | Connection to an internet server | Downloading files from public URLs |
| Base URL | Root domain in HTTP linked service | Set once, shared by all datasets using that server |
| Relative URL | File path after the base URL | Specific file on the server |
| Anonymous auth | No login needed | Public datasets with no access control |
| Parallel activities | Two activities with no arrow between them | When tasks are independent, run simultaneously |
| Sequential activities | Two activities connected with arrow | When task B needs task A to finish first |
| Pipeline time (parallel) | Max of all parallel activity durations | Faster than sequential for independent tasks |
| external/ folder | Separate folder for internet-sourced data | Keep internal and external data organized |
⚠️ Common Mistakes in This Project
| Mistake | Fix |
|---|---|
| Full URL in linked service base URL | Base URL is domain only — no file path. Wrong: https://fsu.edu/file.csv Right: https://fsu.edu |
| Missing leading slash in relative URL | Relative URL must start with / — Wrong: ~jburkardt/data.csv Right: /~jburkardt/data.csv |
| Connecting parallel activities with arrow | Delete the arrow — activities without arrows run in parallel automatically |
| Using POST instead of GET | For file downloads always use GET as the request method |
| HTTPS vs HTTP mismatch | Make sure the base URL protocol matches the actual server — use https:// for secure sites |
🎯 Key Takeaways
- ✓The HTTP Linked Service stores only the base URL — the file path goes in the Dataset as a dynamic parameter
- ✓Always verify public URLs work in your browser BEFORE building the ADF pipeline
- ✓Anonymous authentication is used for public datasets — no login required
- ✓Activities without an arrow between them run in parallel — no arrow needed, no configuration required
- ✓Parallel pipeline time = the slowest activity, not the sum of all activities
- ✓The external/ folder keeps internet-sourced data organized separately from internal store data
- ✓One parameterized pipeline can download any file from any URL on the same server — just change the parameters
🚀 What's Coming in Project 05
So far our files land in ADLS with generic names like cities.csv. In production, you need to know when a file was downloaded — was it today's data or last week's?
In Project 05 you will learn to:
- Automatically add today's date to downloaded file names: cities_20240115.csv
- Organize files into date folders automatically: raw/external/cities/date=2024-01-15/
- Rename and move files after copying using the Get Metadata and Delete activities
- Combine everything from Projects 01–04 into one clean, organized pipeline