
Project 04 — Download Files From a Public HTTPS URL

Stop uploading files manually. Build a pipeline that goes directly to a public internet URL, downloads the CSV, and saves it to ADLS — automatically, every morning. No analyst involvement required.

Published:  March 2026
Series:     Azure DE — Zero to Advanced
Project:    04 of 25
Level:      Beginner
Time:       60–75 min
Builds on: Projects 01, 02, 03 — same resource group, same storage account, same ADF

🏢 Real World Problem

FreshMart's data team now works with an external supplier — AgriPrice India — a government-backed organization that publishes daily wholesale vegetable and fruit prices across Indian cities.

FreshMart's category managers need this data to:

  • Compare what they paid suppliers vs the wholesale market price
  • Spot overpriced procurement deals
  • Adjust store pricing based on commodity fluctuations

AgriPrice India publishes their data every morning as a CSV file on their public website. Right now, FreshMart's analyst manually opens the browser every morning, downloads the file, and uploads it to Azure. 20 minutes every single day. If the analyst is on leave, it doesn't happen at all.

BEFORE:
  Analyst → opens browser → downloads file → uploads to Azure → 20 min/day

AFTER:
  ADF Pipeline → hits the URL → saves to ADLS → 0 human minutes

🧠 Concepts You Must Understand First

What is an HTTP Linked Service?

In Projects 01–03, our source was always Azure Blob Storage — files sitting inside Azure. But data sources in the real world are everywhere: government portals, supplier servers, financial data providers, open datasets. All accessible via HTTP or HTTPS — the same protocol your browser uses.

An HTTP Linked Service tells ADF:

💡 Note
"I want to connect to this base URL on the internet. Here are the connection details."

How base URL + relative URL combine

HTTP Linked Service:   https://people.sc.fsu.edu
                       ↑ the base server address

HTTP Dataset:          /~jburkardt/data/csv/cities.csv
                       ↑ the specific file path on that server

Together they form:    https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
                       ↑ the full URL ADF fetches
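This join behaves like a standard URL join. A quick way to sanity-check it locally (plain Python, not part of ADF):

```python
from urllib.parse import urljoin

base_url = "https://people.sc.fsu.edu"            # stored in the HTTP Linked Service
relative_url = "/~jburkardt/data/csv/cities.csv"  # stored in the HTTP Dataset

# urljoin mirrors how ADF combines the two parts into the full URL it fetches
full_url = urljoin(base_url, relative_url)
print(full_url)  # https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
```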

HTTP File Download vs REST API

HTTP (File Download)                      REST API (Data Service)
Downloads a static file                   Returns structured data (JSON/XML)
URL points to a file                      URL is an endpoint that processes requests
Response is the file itself               Response is data (often paginated)
Usually no authentication                 Usually requires API key or OAuth token
Example: CSV on a government website      Example: Twitter API, Weather API
This project →                            Project 06 →

Public Datasets We Are Using

We will use two stable, public CSVs hosted by Florida State University — available for 10+ years:

Primary dataset
URL: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv

LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
41,5,59,N,80,39,0,W,Youngstown,OH
42,52,48,N,97,23,23,W,Yankton,SD
...

Secondary dataset
URL: https://people.sc.fsu.edu/~jburkardt/data/csv/grades.csv

💡 Note
Government URLs change; a university dataset URL is stable. We treat this as AgriPrice India's data for the scenario — the pipeline pattern is identical regardless of URL.

What We Are Building

INTERNET (Public URLs)                ADF Pipeline               ADLS Gen2

https://fsu.edu/.../cities.csv  ──►   HTTP Linked Service  ──►   raw/
                                       Copy Activity              └── external/
https://fsu.edu/.../grades.csv  ──►   Copy Activity                  ├── cities/
                                                                      │   └── cities.csv
                                       pl_download_external_data      └── grades/
                                                                          └── grades.csv

📋 Step by Step Overview

PHASE 1 — Understand the Source URLs (5 min)
  Step 1: Open and inspect the public CSV URLs in your browser

PHASE 2 — Create HTTP Linked Service (15 min)
  Step 2: Create Linked Service for the FSU data server
  Step 3: Test the connection

PHASE 3 — Create Datasets (15 min)
  Step 4: Create parameterized HTTP source dataset
  Step 5: Create ADLS sink dataset with dynamic path

PHASE 4 — Build the Pipeline (30 min)
  Step 6:  Create new pipeline
  Step 7:  Add parameters — source_relative_url, destination_folder, file_name
  Step 8:  Add Copy activity
  Step 9:  Configure HTTP Source
  Step 10: Configure ADLS Sink
  Step 11: Debug and verify
  Step 12: Add second Copy activity for parallel download
  Step 13: Debug parallel run
  Step 14: Publish

Phase 1 — Understand the Source URLs

Step 1 — Open and Inspect the URLs in Your Browser

Before building anything in ADF, always verify the source URL works.

Open your browser and go to:

https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv

Your browser should display or download the CSV. You should see data starting with:

LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
41,5,59,N,80,39,0,W,Youngstown,OH
📸SCREENSHOT

Browser showing the cities.csv file content at the FSU URL — raw CSV text visible

Now open the second URL:

https://people.sc.fsu.edu/~jburkardt/data/csv/grades.csv
📸SCREENSHOT

Browser showing the grades.csv file content — raw CSV text visible

💡 Note
If a URL is broken, you want to know before spending 30 minutes building ADF pipelines. This 30-second check saves hours of debugging.

Notice the split between base URL and file path:

Base URL:  https://people.sc.fsu.edu
File path: /~jburkardt/data/csv/cities.csv
File path: /~jburkardt/data/csv/grades.csv

This separation matters — the Linked Service stores the base URL, the Dataset stores the file path.
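If you want to script this split, here is a small Python sketch (a local check, not part of ADF) that separates any full URL into the two halves:

```python
from urllib.parse import urlsplit

def split_for_adf(full_url: str) -> tuple[str, str]:
    """Split a full URL into (Linked Service base URL, Dataset relative URL)."""
    parts = urlsplit(full_url)
    base_url = f"{parts.scheme}://{parts.netloc}"  # goes in the Linked Service
    relative_url = parts.path                      # goes in the Dataset
    return base_url, relative_url

print(split_for_adf("https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv"))
```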

Phase 2 — Create HTTP Linked Service

Step 2 — Create the HTTP Linked Service

In ADF Studio → Manage (toolbox icon) → Linked services → "+ New"

In the search box → type "HTTP" → select "HTTP" → click "Continue"

📸SCREENSHOT

New linked service search — 'HTTP' typed, HTTP connector highlighted in results

📸SCREENSHOT

HTTP linked service form — blank, ready to fill

Fill in the form exactly:

Name:                    ls_http_public_data
Description:             HTTP connection to public data sources on the internet
Connect via:             AutoResolveIntegrationRuntime
Base URL:                https://people.sc.fsu.edu
Authentication type:     Anonymous
💡 Note
Base URL is the root domain only — no file path, no trailing slash.
Authentication type: Anonymous means no login required. In Project 06 when we connect to a REST API, you will see how to add API keys and Bearer tokens here.
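For reference, the linked service definition ADF generates should look roughly like the JSON below (visible via the {} code view in ADF Studio — exact fields may vary slightly by ADF version):

```json
{
  "name": "ls_http_public_data",
  "properties": {
    "type": "HttpServer",
    "description": "HTTP connection to public data sources on the internet",
    "typeProperties": {
      "url": "https://people.sc.fsu.edu",
      "enableServerCertificateValidation": true,
      "authenticationType": "Anonymous"
    }
  }
}
```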
📸SCREENSHOT

HTTP linked service form fully filled — name, base URL, anonymous authentication all set

Step 3 — Test the Connection

Click "Test connection" at the bottom of the form

📸SCREENSHOT

Test connection button being clicked

You should see:

✅ Connection successful
📸SCREENSHOT

Green 'Connection successful' message

⚠️ Important
If the connection test fails:
"Could not connect to host" → Check the base URL spelling — no trailing slash, no file path
"SSL certificate error" → The server's certificate may be invalid or expired — verify the URL; as a last resort for public test data, try http://
"Timeout" → Azure's outbound internet may be blocked in your subscription — check network policies

Click "Create"

📸SCREENSHOT

Linked services list — ls_http_public_data now visible alongside previous blob and ADLS linked services

Phase 3 — Create Datasets

Step 4 — Create Parameterized HTTP Source Dataset

This dataset will be reusable — instead of creating one dataset per URL, we create one dataset with a relative_url parameter. Different pipelines can pass different file paths to the same dataset.

In ADF Studio → Author → Datasets → "+" → "New dataset"

Search "HTTP" → select "HTTP" → click "Continue"

📸SCREENSHOT

New dataset — HTTP connector selected

Select format: "DelimitedText" (CSV) → "Continue"

📸SCREENSHOT

Format selection — DelimitedText selected

Name:            ds_src_http_csv
Linked service:  ls_http_public_data
Relative URL:    (leave empty — we will make this dynamic)
Request method:  GET
First row as header: ✅ Yes
Import schema:   None
📸SCREENSHOT

HTTP dataset form — name filled, linked service selected, relative URL left empty

Click "OK"

Now add a parameter. Click "Parameters" tab → "+ New"

Name:    relative_url
Type:    String
Default: (empty)
📸SCREENSHOT

Dataset Parameters tab — relative_url parameter of type String

Click "Connection" tab → click inside "Relative URL" field → click "Add dynamic content"

In the expression editor → click relative_url under Parameters

Expression
@dataset().relative_url
📸SCREENSHOT

Dynamic content editor — @dataset().relative_url expression

Click "OK"

📸SCREENSHOT

Connection tab — Relative URL showing @dataset().relative_url dynamic expression

💡 Note
ADF joins the base URL and relative URL automatically.
Linked Service: https://people.sc.fsu.edu
Dataset relative URL: /~jburkardt/data/csv/cities.csv
ADF fetches: https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv

Click 💾 Save
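Behind the GUI, the dataset definition (the {} code view) should look roughly like this JSON — exact fields may vary by ADF version:

```json
{
  "name": "ds_src_http_csv",
  "properties": {
    "linkedServiceName": {
      "referenceName": "ls_http_public_data",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "relative_url": { "type": "string" }
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": {
          "value": "@dataset().relative_url",
          "type": "Expression"
        }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```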

Step 5 — Create ADLS Sink Dataset With Dynamic Path

In ADF Studio → Author → Datasets → "+" → "New dataset"

Search "Azure Data Lake Storage Gen2" → select → "Continue"

Select "DelimitedText" → "Continue"

Name:            ds_sink_adls_external
Linked service:  ls_adls_freshmart
File path:       (leave all empty)
First row as header: ✅ Yes
Import schema:   None

Click "OK"

Click "Parameters" tab → add TWO parameters:

Parameter 1:
  Name:    destination_folder
  Type:    String

Parameter 2:
  Name:    file_name
  Type:    String
📸SCREENSHOT

Dataset Parameters tab — destination_folder and file_name parameters listed

Click "Connection" tab → configure each field:

Container:

raw

Directory → "Add dynamic content":

Directory expression
external/@{dataset().destination_folder}

File → "Add dynamic content":

File expression
@dataset().file_name
💡 Note
Why external/ as a top-level folder?
Keeps internet-sourced data separate from internal store sales data.
raw/
├── sales/          ← internal store sales (Projects 01–03)
└── external/       ← data downloaded from internet (this project)
    ├── cities/
    └── grades/
📸SCREENSHOT

Sink dataset Connection tab — raw/external/@{dataset().destination_folder}/@dataset().file_name

Click 💾 Save
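The sink expressions evaluate to a simple path concatenation. A local Python sketch (illustrative only) of what the container/directory/file settings produce:

```python
def sink_path(destination_folder: str, file_name: str) -> str:
    """Mirror of the sink dataset settings: container 'raw',
    directory external/@{dataset().destination_folder}, file @dataset().file_name."""
    container = "raw"
    directory = f"external/{destination_folder}"
    return f"{container}/{directory}/{file_name}"

print(sink_path("cities", "cities.csv"))  # raw/external/cities/cities.csv
```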

Phase 4 — Build the Pipeline

Step 6 — Create New Pipeline

In ADF Studio → Author"+" next to Pipelines → "New pipeline"

Name:        pl_download_external_data
Description: Downloads CSV files from public HTTPS URLs into ADLS raw/external/
📸SCREENSHOT

New blank pipeline — name in Properties panel on the right

Step 7 — Add Pipeline Parameters

Click the empty canvas → "Parameters" tab at the bottom → add THREE parameters:

Parameter 1:
  Name:    source_relative_url
  Type:    String
  Default: /~jburkardt/data/csv/cities.csv

Parameter 2:
  Name:    destination_folder
  Type:    String
  Default: cities

Parameter 3:
  Name:    file_name
  Type:    String
  Default: cities.csv
📸SCREENSHOT

Pipeline Parameters tab — all three parameters listed with their defaults

💡 Note
One pipeline, infinite files. Controlled entirely by parameters:
Run 1 → downloads cities.csv to raw/external/cities/cities.csv
Run 2 → downloads grades.csv to raw/external/grades/grades.csv
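To make the pattern concrete, here is a local Python sketch of what the pipeline does (urllib standing in for the Copy activity, a local folder standing in for ADLS — illustrative only, not ADF code):

```python
import urllib.request
from pathlib import Path

def download_external_file(base_url, source_relative_url, destination_folder,
                           file_name, root="raw/external", fetch=None):
    """Local equivalent of pl_download_external_data:
    fetch base_url + source_relative_url, save under root/destination_folder/file_name."""
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    data = fetch(base_url + source_relative_url)
    target = Path(root) / destination_folder / file_name
    target.parent.mkdir(parents=True, exist_ok=True)  # create folders on first run
    target.write_bytes(data)
    return target

# Run 1: cities parameters → raw/external/cities/cities.csv
# Run 2: grades parameters → raw/external/grades/grades.csv — same function, new arguments
```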

Step 8 — Add Copy Activity

From left activities panel → "Move & transform" → drag "Copy data" onto the canvas

📸SCREENSHOT

Copy data activity dragged onto the pipeline canvas

Click the Copy activity → bottom panel → General tab:

Name:        copy_http_to_adls
Description: Downloads file from HTTP source URL and saves to ADLS raw/external/
📸SCREENSHOT

General tab — name and description filled

Step 9 — Configure HTTP Source

Click "Source" tab

Source dataset:  ds_src_http_csv

The Dataset properties section appears with the relative_url field.

Click inside relative_url → "Add dynamic content" → click source_relative_url under Parameters:

relative_url expression
@pipeline().parameters.source_relative_url
📸SCREENSHOT

Dynamic content editor — @pipeline().parameters.source_relative_url expression

Click "OK"

Additional Source settings:
  Request method:         GET
  Additional headers:     (leave empty)
  Pagination rules:       (leave empty)
📸SCREENSHOT

Source tab fully configured — ds_src_http_csv selected, relative_url showing pipeline parameter expression

Step 10 — Configure ADLS Sink

Click "Sink" tab

Sink dataset:  ds_sink_adls_external

Two dataset properties appear. Fill both:

destination_folder → "Add dynamic content":

destination_folder
@pipeline().parameters.destination_folder

file_name → "Add dynamic content":

file_name
@pipeline().parameters.file_name
📸SCREENSHOT

Sink tab — both destination_folder and file_name properties filled with pipeline parameter expressions

Step 11 — Debug and Verify

Click "Validate" in the top toolbar

📸SCREENSHOT

Validation successful — no errors message

Click "Debug" → the parameter dialog appears pre-filled with your defaults:

source_relative_url:  /~jburkardt/data/csv/cities.csv
destination_folder:   cities
file_name:            cities.csv
📸SCREENSHOT

Debug parameter dialog — all three parameters showing their default values

Click "OK"

📸SCREENSHOT

Pipeline running — copy_http_to_adls activity showing yellow/running status

📸SCREENSHOT

Pipeline succeeded — copy_http_to_adls activity showing green checkmark

Click the 👓 glasses icon next to the run in the Output tab:

Data read:      X bytes
Data written:   X bytes
Files read:     1
Files written:  1
Source:         https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv
Destination:    raw/external/cities/cities.csv
📸SCREENSHOT

Copy activity run details — showing source URL and destination path, files read and written = 1

Verify in ADLS: Azure Portal → Storage → stfreshmartdev → Containers → raw

raw/
├── sales/          ← from Projects 01–03
└── external/       ← NEW — created by this pipeline
    └── cities/
        └── cities.csv  ✅
📸SCREENSHOT

raw container — showing both 'sales' and 'external' folders

📸SCREENSHOT

raw/external/cities/ — cities.csv visible with file size

📸SCREENSHOT

cities.csv preview in Azure Portal — CSV data with LatD, LatM, City, State columns

Step 12 — Download the Second File (grades.csv)

Run the same pipeline again with different parameters — no code changes.

Click "Debug" again → change all three parameters:

source_relative_url:  /~jburkardt/data/csv/grades.csv
destination_folder:   grades
file_name:            grades.csv
📸SCREENSHOT

Debug dialog — all three parameters changed to grades values

Click "OK" → wait for green ✅

📸SCREENSHOT

Pipeline succeeded again — same pipeline, different parameters

raw/external/
├── cities/
│   └── cities.csv   ✅  (from first run)
└── grades/
    └── grades.csv   ✅  (from second run)
📸SCREENSHOT

raw/external/ — both cities and grades folders visible

🎯 Pro Tip
This is the power of parameterized pipelines. One pipeline downloaded two completely different files from two different URLs into two different folders — zero changes to the pipeline itself.

Step 13 — Add a Second Copy Activity for Parallel Download

Right now you run the pipeline twice manually. In production you want both files in a single run. Let's redesign the pipeline to download both files simultaneously.

First — update pipeline parameters to handle two files. Click empty canvas → Parameters tab → replace with 6 parameters:

file1_url:          /~jburkardt/data/csv/cities.csv
file1_folder:       cities
file1_name:         cities.csv
file2_url:          /~jburkardt/data/csv/grades.csv
file2_folder:       grades
file2_name:         grades.csv
📸SCREENSHOT

Pipeline Parameters tab — six parameters, three for each file

Rename the existing Copy activity → General tab: copy_file1_http_to_adls

Update its Source dataset property:

relative_url
@pipeline().parameters.file1_url

Update its Sink dataset properties:

destination_folder
@pipeline().parameters.file1_folder
file_name
@pipeline().parameters.file1_name
📸SCREENSHOT

First copy activity — Source showing file1_url parameter, Sink showing file1_folder and file1_name

Drag another "Copy data" onto the canvas — place it beside the first — do NOT connect them with an arrow

📸SCREENSHOT

Canvas with TWO Copy activities side by side — no arrow connecting them, they run in parallel

⚠️ Important
No arrow = parallel. Activities without an arrow connecting them run at the same time. An arrow creates a dependency — meaning sequential execution. Do NOT connect them.

Click the second Copy activity → General tab:

Name:        copy_file2_http_to_adls
Description: Downloads grades.csv from HTTP source

Source tab → ds_src_http_csv → relative_url:

relative_url
@pipeline().parameters.file2_url
📸SCREENSHOT

Second Copy activity Source tab — file2_url parameter used

Sink tab → ds_sink_adls_external:

destination_folder
@pipeline().parameters.file2_folder
file_name
@pipeline().parameters.file2_name
📸SCREENSHOT

Second Copy activity Sink tab — file2_folder and file2_name parameters used

Step 14 — Debug the Parallel Pipeline and Publish

Click "Validate" → should pass with no errors

📸SCREENSHOT

Validation successful

Click "Debug" → all 6 parameters show their defaults → click "OK"

📸SCREENSHOT

Debug dialog — all 6 parameters visible with default values pre-filled

📸SCREENSHOT

Both copy activities running simultaneously — both showing yellow/spinning at the same time

📸SCREENSHOT

Both activities completed — both showing green checkmarks

Both ran in parallel. Total time ≈ the slower activity, not the sum:

copy_file1_http_to_adls:  ~5 seconds
copy_file2_http_to_adls:  ~4 seconds
Total pipeline time:       ~5 seconds (parallel!)

If sequential:             ~9 seconds
Time saved:                ~4 seconds
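The same max-vs-sum behavior can be demonstrated locally with Python threads (sleep times simulating the two copy activities — illustrative, not ADF code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def copy_activity(name: str, seconds: float) -> str:
    time.sleep(seconds)  # simulate the copy duration
    return name

activities = [("copy_file1", 0.5), ("copy_file2", 0.4)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:  # no arrow = both start together
    done = list(pool.map(lambda a: copy_activity(*a), activities))
elapsed = time.perf_counter() - start

print(done, f"{elapsed:.1f}s")  # finishes in ~0.5s (the max), not 0.9s (the sum)
```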
📸SCREENSHOT

Output tab — both activities showing individual durations, total pipeline duration is the max not the sum

raw/external/
├── cities/
│   └── cities.csv   ✅
└── grades/
    └── grades.csv   ✅
📸SCREENSHOT

raw/external/ folder — both cities and grades folders side by side

Click "Publish all"

Publishing:
  pl_download_external_data   (new)
  ds_src_http_csv             (new)
  ds_sink_adls_external       (new)
  ls_http_public_data         (new)
📸SCREENSHOT

Publish panel — all new items listed

📸SCREENSHOT

Successfully published notification

🎯 What You Built — Summary

BEFORE:
  Analyst manually downloads file from browser every morning
  Uploads to Azure manually — 20 minutes per day
  If analyst is on leave — data is missing

AFTER:
  ADF fetches directly from URL — zero human involvement
  Two files downloaded in parallel in a single pipeline run
  Same pipeline handles any public CSV by changing parameters
  raw/external/ folder organized for all internet-sourced data

Full ADF Resource Inventory After Project 04

Resource Type    Name                            Purpose
Linked Service   ls_blob_freshmart_landing       Azure Blob (landing zone)
Linked Service   ls_adls_freshmart               ADLS Gen2 (raw/processed/curated)
Linked Service   ls_http_public_data             Public internet HTTP/HTTPS
Dataset          ds_src_blob_daily_sales         Single static CSV from Blob
Dataset          ds_src_blob_store_sales         Parameterized store files from Blob
Dataset          ds_src_blob_dated_store_sales   Date-parameterized store files
Dataset          ds_src_http_csv                 Public HTTP CSV — relative URL is dynamic
Dataset          ds_sink_adls_raw_sales          ADLS sink for sales data
Dataset          ds_sink_adls_dated_sales        ADLS sink with date partitioning
Dataset          ds_sink_adls_external           ADLS sink for external downloads
Pipeline         pl_copy_daily_sales_csv         Project 01 — single file
Pipeline         pl_copy_all_store_sales         Project 02 — ForEach loop
Pipeline         pl_copy_store_sales_by_date     Project 03 — date parameterized
Pipeline         pl_download_external_data       Project 04 — HTTP download

🧠 Key Concepts to Remember

Concept                    What It Is                                    When You Use It
HTTP Linked Service        Connection to an internet server              Downloading files from public URLs
Base URL                   Root domain in the HTTP linked service        Set once, shared by all datasets using that server
Relative URL               File path after the base URL                  Points at a specific file on the server
Anonymous auth             No login needed                               Public datasets with no access control
Parallel activities        Two activities with no arrow between them     Independent tasks that can run simultaneously
Sequential activities      Two activities connected with an arrow        When task B needs task A to finish first
Pipeline time (parallel)   Max of all parallel activity durations        Faster than sequential for independent tasks
external/ folder           Separate folder for internet-sourced data     Keeps internal and external data organized

⚠️ Common Mistakes in This Project

Mistake                                        Fix
Full URL in linked service base URL            Base URL is the domain only — no file path. Wrong: https://fsu.edu/file.csv  Right: https://fsu.edu
Missing leading slash in relative URL          Relative URL must start with /. Wrong: ~jburkardt/data.csv  Right: /~jburkardt/data.csv
Connecting parallel activities with an arrow   Delete the arrow — activities without arrows run in parallel automatically
Using POST instead of GET                      For file downloads, always use GET as the request method
HTTPS vs HTTP mismatch                         Make sure the base URL protocol matches the actual server — use https:// for secure sites

🎯 Key Takeaways

  • The HTTP Linked Service stores only the base URL — the file path goes in the Dataset as a dynamic parameter
  • Always verify public URLs work in your browser BEFORE building the ADF pipeline
  • Anonymous authentication is used for public datasets — no login required
  • Activities without an arrow between them run in parallel — no arrow needed, no configuration required
  • Parallel pipeline time = the slowest activity, not the sum of all activities
  • The external/ folder keeps internet-sourced data organized separately from internal store data
  • One parameterized pipeline can download any file from any URL on the same server — just change the parameters

🚀 What's Coming in Project 05

So far our files land in ADLS with generic names like cities.csv. In production, you need to know when a file was downloaded — was it today's data or last week's?

In Project 05 you will learn to:

  • Automatically add today's date to downloaded file names: cities_20240115.csv
  • Organize files into date folders automatically: raw/external/cities/date=2024-01-15/
  • Rename and move files after copying using the Get Metadata and Delete activities
  • Combine everything from Projects 01–04 into one clean, organized pipeline