Data Collection — APIs, SQL, Files and Scraping
Where ML data actually comes from and how to pull it reliably. REST APIs with pagination, SQL queries at scale, Parquet pipelines, and scraping — all with production-grade error handling.
Nobody hands you a clean CSV. Data has to be pulled, negotiated with, and earned.
Every ML tutorial starts with a dataset already loaded — iris.csv, MNIST, Titanic. The real world does not. At Swiggy, the order data lives in a PostgreSQL database behind an internal API. At Razorpay, transaction records are in a Redshift warehouse partitioned by date. At Zepto, inventory data is a stream of events in Kafka. At a startup, it might be a Google Sheet someone exports manually.
Before you can train a model, you have to collect the data. This means making HTTP requests to APIs, running SQL queries, reading from cloud storage, and sometimes scraping a website when there is no API. Each source has its own format, its own failure modes, its own rate limits, and its own quirks.
This module covers every major data source an ML engineer encounters — with real error handling, pagination, retry logic, and performance patterns that make the difference between a pipeline that works once and one that runs reliably every day.
What this module covers:
REST APIs — pulling data over HTTP
A REST API is the most common way to get data from any modern service. You send an HTTP request — GET, POST, PUT, DELETE — to a URL. The server returns JSON. You parse it into a DataFrame or dictionary. The Python requests library handles this in 3 lines. The hard parts are authentication, pagination, rate limiting, and handling failures gracefully.
Basic GET request — the foundation
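A minimal sketch of the pattern. The endpoint URL and the `results` envelope key are placeholders — real APIs wrap their record lists under different keys, so check the response shape first:

```python
import pandas as pd
import requests

def fetch_json(url, params=None, timeout=30):
    """GET a URL and return the parsed JSON body.

    raise_for_status() turns 4xx/5xx responses into exceptions
    instead of letting bad payloads flow silently downstream.
    """
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()

def records_to_frame(payload, key="results"):
    """Flatten the record list out of a typical {"results": [...]} envelope."""
    return pd.json_normalize(payload[key])

# Usage (hypothetical endpoint):
# payload = fetch_json("https://api.example.com/v1/orders", params={"limit": 100})
# df = records_to_frame(payload)
```

`json_normalize` flattens nested objects into dotted column names (`user.name`), which saves a manual unpacking pass on typical API payloads.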
Authentication — API keys, Bearer tokens, OAuth
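The two header-based schemes look like this. The header name `X-API-Key` varies by provider — check the API docs; a full OAuth flow additionally exchanges a client ID and secret for a short-lived access token, which is then sent in the same `Authorization: Bearer` header:

```python
import os
import requests

# Never hard-code credentials — read them from the environment.
API_KEY = os.environ.get("MY_API_KEY", "placeholder-key")

# Style 1: API key in a custom header (header name varies by provider)
key_session = requests.Session()
key_session.headers.update({"X-API-Key": API_KEY})

# Style 2: Bearer token (OAuth access tokens use this same header)
token_session = requests.Session()
token_session.headers.update({"Authorization": f"Bearer {API_KEY}"})
```

Setting the header on a `Session` means every request through that session is authenticated, and the TCP connection is reused across calls.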
Retry logic — handle transient failures automatically
APIs fail. Networks drop. Servers restart. A data collection pipeline that crashes on the first 503 response is not production-ready. You need automatic retry with exponential backoff — wait longer after each failure to avoid hammering a struggling server. The requests library's HTTPAdapter with Retry handles this cleanly.
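A sketch of that setup — a `Session` whose mounted adapter retries 429s and 5xxs with exponential backoff (urllib3's `Retry` also honours the `Retry-After` header by default):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=5, backoff_factor=1.0):
    """Session that retries transient failures (429, 5xx) automatically,
    with exponentially growing waits between attempts."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],  # only retry idempotent requests
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# Usage: session = make_retrying_session()
#        session.get("https://api.example.com/v1/orders", timeout=30)
```

Restricting `allowed_methods` to GET matters: retrying a POST can create duplicate records if the first attempt actually succeeded server-side.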
Pagination — fetching all pages of a large dataset
Most APIs don't return all records at once — they paginate. You get page 1 (100 records), then request page 2, then page 3, until there are no more pages. There are three pagination styles in the wild, and you'll encounter all of them.
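The loop logic is the same shape regardless of style. Here are offset/limit and cursor-based loops with the page-fetching function injected, so the pagination logic stays independent of any particular API:

```python
def fetch_all_offset(fetch_page, limit=100):
    """Offset/limit style: keep requesting until a page comes back short.
    fetch_page(offset, limit) -> list of records."""
    records, offset = [], 0
    while True:
        page = fetch_page(offset, limit)
        records.extend(page)
        if len(page) < limit:  # short page means we've reached the end
            return records
        offset += limit

def fetch_all_cursor(fetch_page):
    """Cursor style: each response tells you where the next page starts.
    fetch_page(cursor) -> (records, next_cursor); next_cursor is None at the end."""
    records, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            return records
```

Link-header pagination (the third style) is simpler still: follow the `rel="next"` URL from the `Link` response header until it disappears.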
SQL — querying databases for ML data
Most company data lives in a relational database — PostgreSQL, MySQL, SQLite, or a cloud warehouse like BigQuery, Redshift, or Snowflake. For ML, you typically need to write a SQL query that joins multiple tables, filters by date range, and aggregates features — then load the result into a Pandas DataFrame. SQLAlchemy is the standard Python library for database connections, and it works with every database.
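A sketch with a hypothetical `orders` table — the point is the `text()` query with bind parameters (`:start`, `:end`) rather than string formatting, which both prevents SQL injection and handles quoting for you:

```python
import pandas as pd
from sqlalchemy import create_engine, text

def load_orders(engine, start_date, end_date):
    """Load orders in a date range into a DataFrame.

    Bind params (:start, :end) are filled by the driver —
    never f-string Python values into the SQL text itself.
    """
    query = text("""
        SELECT id, amount, created_at
        FROM orders
        WHERE created_at BETWEEN :start AND :end
    """)
    return pd.read_sql(query, engine,
                       params={"start": start_date, "end": end_date})

# Usage: engine = create_engine("postgresql://user:pass@host:5432/analytics")
#        df = load_orders(engine, "2024-01-01", "2024-01-31")
```

The same function works unchanged against PostgreSQL, MySQL, or SQLite — only the `create_engine()` URL differs.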
Chunked reading — large tables that don't fit in RAM
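Passing `chunksize` to `pd.read_sql()` returns an iterator of DataFrames instead of one giant frame. A sketch that aggregates incrementally (table and column names hypothetical):

```python
import pandas as pd

def total_amount_chunked(conn, query, chunksize=50_000):
    """Stream a large result set chunk by chunk, keeping a running total,
    so peak memory is one chunk rather than the whole table."""
    total, rows = 0.0, 0
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        total += chunk["amount"].sum()
        rows += len(chunk)
    return total, rows
```

Each chunk is an ordinary DataFrame, so feature transforms, filtering, or appends to a Parquet file can all happen inside the loop.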
Cloud warehouses — BigQuery, Redshift, Snowflake
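All three speak SQLAlchemy through dialect packages, so the query code above carries over — only the connection URL changes. These URLs are illustrative (account, host, and database names are placeholders) and assume the dialect packages `sqlalchemy-redshift`, `snowflake-sqlalchemy`, and `sqlalchemy-bigquery` are installed:

```python
from sqlalchemy.engine import make_url

# Placeholder connection URLs — swap in real hosts and credentials:
REDSHIFT_URL = "redshift+psycopg2://user:secret@cluster.abc123.ap-south-1.redshift.amazonaws.com:5439/analytics"
SNOWFLAKE_URL = "snowflake://user:secret@my_account/analytics/public?warehouse=COMPUTE_WH"
BIGQUERY_URL = "bigquery://my-project/my_dataset"

# Then create_engine(REDSHIFT_URL) and pd.read_sql(...) exactly as before.
```

Warehouses bill by data scanned, so select only the columns you need and filter on the partition column (usually the date) in every query.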
Reading files — local, S3, GCS and Azure Blob
In many companies, data is deposited into cloud storage as files — CSV exports from operational databases, Parquet files from data pipelines, JSON dumps from event systems. Cloud storage (S3, GCS, Azure Blob) is cheap, scalable, and Python can read from it almost as easily as from local disk.
Web scraping — extracting data from HTML pages
Some data sources have no API — competitor pricing, job listings, salary data, product reviews, public datasets published as web tables. Web scraping extracts structured data from HTML. Always check the site's robots.txt and Terms of Service before scraping. Scrape politely — add delays, use session caching, and never scrape faster than a human would browse.
BeautifulSoup — static HTML pages
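A sketch against an inline HTML snippet — the table id, row structure, and item names are invented for illustration; a real scraper would fetch the page with a `requests.Session` first:

```python
from bs4 import BeautifulSoup

html = """
<table id="prices">
  <tr><th>item</th><th>price</th></tr>
  <tr><td>atta 5kg</td><td>230</td></tr>
  <tr><td>milk 1l</td><td>62</td></tr>
</table>
"""

def parse_price_table(html):
    """Extract {item, price} records from a simple HTML price table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("#prices tr")[1:]  # skip the header row
    return [
        {"item": cells[0].text, "price": int(cells[1].text)}
        for cells in (row.find_all("td") for row in rows)
    ]
```

Parsing from a saved HTML string like this is also how you should develop a scraper: fetch each page once, cache it to disk, and iterate on the parsing logic offline.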
Playwright — JavaScript-rendered dynamic pages
Many modern sites render content with JavaScript — the HTML you get from requests.get() is just a shell with no data. You need a real browser. Playwright controls a real Chromium browser from Python, waits for JavaScript to load, then extracts the rendered HTML.
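A sketch of that flow with Playwright's sync API. The selector to wait for is site-specific — pick an element that only exists once the data has rendered. The import is inside the function so the module loads even where Playwright isn't installed:

```python
def fetch_rendered_html(url, wait_selector, timeout_ms=15_000):
    """Load a JS-rendered page in headless Chromium and return the final HTML.

    wait_for_selector() blocks until the data-bearing element appears,
    so content() captures the page after JavaScript has populated it.
    """
    from playwright.sync_api import sync_playwright  # assumed installed

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        page.wait_for_selector(wait_selector, timeout=timeout_ms)
        html = page.content()
        browser.close()
        return html

# Usage: html = fetch_rendered_html("https://example.com/listings", ".listing-card")
#        then parse `html` with BeautifulSoup exactly as for a static page
```

Playwright is slower and heavier than `requests`, so use it only when a plain GET genuinely comes back empty — and keep the same politeness rules: delays between pages, and cache what you fetch.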
Kafka — reading streaming event data for ML
High-throughput ML systems — fraud detection, real-time recommendations, delivery ETA prediction — often need to consume data as it streams in, not from batch queries. Apache Kafka is the standard event streaming platform. For ML, you typically read from a Kafka topic, process events, extract features, and either update a model or score against one.
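A sketch using the `kafka-python` client. The topic name, broker address, and order-event schema are all hypothetical; the feature-extraction step is split out as a pure function so it can be developed and tested without a broker:

```python
import json

def extract_features(event):
    """Turn one raw order event (hypothetical schema) into a flat feature dict."""
    return {
        "order_id": event["order_id"],
        "amount": float(event["amount"]),
        "n_items": len(event.get("items", [])),
    }

def consume_order_features(topic="orders", servers="localhost:9092"):
    """Yield feature dicts from a Kafka topic, one per event (runs forever)."""
    from kafka import KafkaConsumer  # kafka-python, assumed installed

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",      # start from the beginning on first run
        group_id="ml-feature-extractor",   # offsets tracked per consumer group
    )
    for message in consumer:
        yield extract_features(message.value)
```

The `group_id` is what makes restarts safe: Kafka remembers the group's last committed offset, so a restarted consumer resumes where it left off instead of reprocessing the whole topic.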
A reusable data collection pipeline
Production data collection is not one-off scripts — it's a pipeline that runs on a schedule, handles failures, logs progress, and stores results in a consistent location. Here's the structure every ML data pipeline follows.
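A minimal skeleton of that structure — retries with exponential backoff, logging, and a checkpoint file so a rerun skips batches already collected. The file layout and naming are illustrative, not a fixed convention:

```python
import json
import logging
import time
from pathlib import Path

class CollectionPipeline:
    """Retry, log, checkpoint, and write each batch to a consistent location."""

    def __init__(self, name, out_dir="data", max_retries=3):
        self.name = name
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.checkpoint = self.out_dir / f"{name}.checkpoint"
        self.log = logging.getLogger(name)
        self.max_retries = max_retries

    def done(self, batch_id):
        return self.checkpoint.exists() and batch_id in self.checkpoint.read_text().split()

    def mark_done(self, batch_id):
        with self.checkpoint.open("a") as f:
            f.write(batch_id + "\n")

    def run(self, batch_id, collect):
        """collect() -> list of records. Returns the output path, or None if skipped."""
        if self.done(batch_id):
            self.log.info("skipping %s (already checkpointed)", batch_id)
            return None
        for attempt in range(self.max_retries):
            try:
                records = collect()
                out = self.out_dir / f"{self.name}_{batch_id}.json"
                out.write_text(json.dumps(records))
                self.mark_done(batch_id)
                return out
            except Exception:
                self.log.exception("attempt %d failed", attempt + 1)
                if attempt < self.max_retries - 1:
                    time.sleep(min(2 ** attempt, 60))  # exponential backoff, capped
        raise RuntimeError(f"{self.name}/{batch_id} failed after {self.max_retries} attempts")
```

The checkpoint makes the pipeline idempotent: a scheduler can safely re-trigger yesterday's job after a crash, and only the batches that never completed get collected again.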
Every common data collection error — explained and fixed
You can now pull data from any source a real ML project will encounter.
APIs with pagination and retry logic. SQL databases with chunked reading and parameterised queries. Cloud storage on S3 and GCS. Dynamic web pages with Playwright. Kafka event streams. A reusable pipeline class that handles failures, logging, and checkpointing. These are the building blocks every ML data engineer uses.
Module 16 moves to data cleaning and validation — the step that comes after collection. Raw data from any of these sources will have nulls, wrong types, duplicate records, schema drift, and outliers. Cleaning it systematically — with validation rules that catch problems before they reach the model — is what separates a reliable pipeline from a fragile one.
Schema validation, duplicate detection, type coercion, outlier handling, and building validation rules that run automatically every time new data arrives.
🎯 Key Takeaways
- ✓ Always set a timeout on every HTTP request — requests.get(url, timeout=30). A missing timeout can block a pipeline indefinitely. Use a requests.Session for repeated calls to the same host — it reuses the TCP connection and stores headers.
- ✓ Retry logic is not optional for production pipelines. Use exponential backoff: delay = min(base * 2^attempt, max_delay). Always respect the Retry-After header on HTTP 429 responses — it tells you exactly how long to wait.
- ✓ Three pagination styles exist: offset/limit (increment page number), cursor-based (use next_cursor from response), and link header (follow rel="next" URL). Always check the API docs to identify which style before writing pagination code.
- ✓ Use SQLAlchemy for database connections — it works with every database using the same interface. Always use parameterised queries (text() with bind params) — never format Python variables directly into SQL strings.
- ✓ For large SQL tables, use chunksize in pd.read_sql() to process in batches. This prevents MemoryError on multi-million row tables and lets you apply transformations incrementally.
- ✓ BeautifulSoup works for static HTML. Playwright is required when content is rendered by JavaScript. Always add delays between requests, check robots.txt, and cache scraped HTML locally during development to avoid re-scraping.
- ✓ Wrap every production data collection job in a class with logging, retry tracking, checkpointing, and consistent output format. A pipeline that silently fails and produces empty output is worse than one that fails loudly.