Pandas DataFrames
Load, clean, transform and explore real datasets. Every Pandas operation ML projects actually use — with Swiggy and Razorpay examples throughout.
Real data never arrives as a clean NumPy array. It arrives as a mess.
NumPy is perfect for numerical computation — arrays of floats, matrix multiplications, vectorised operations. But real ML datasets are not pure numbers. They're a mix of dates, categories, free text, IDs, and numbers all in the same table. Some columns are missing values. Some have wrong types. Some need to be joined to other tables. Some need to be grouped, aggregated, and reshaped before a model can touch them.
This is Pandas' job. It provides the DataFrame — a table with named columns, mixed types, and an enormous API for loading, cleaning, exploring, transforming, and exporting tabular data. Every single ML project starts in Pandas before the data ever reaches sklearn or PyTorch.
The running dataset in this module is a simulated Swiggy order table — 10,000 rows with order IDs, restaurant names, distances, delivery times, ratings, and some intentional data quality issues. By the end of this module you'll have cleaned it, explored it, engineered features from it, and prepared it for a model.
What this module covers:
Series and DataFrame — the building blocks
Pandas has two primary objects. A Series is a one-dimensional labelled array — like a single column of a spreadsheet. A DataFrame is a two-dimensional labelled table — like a full spreadsheet. Every DataFrame is a collection of Series sharing the same index.
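A minimal sketch of the two objects, using made-up order rows (the column names here are illustrative, not the module's actual dataset):

```python
import pandas as pd

# A Series: a 1D labelled array, like a single spreadsheet column
ratings = pd.Series([4.2, 3.8, 4.5], index=["o1", "o2", "o3"], name="rating")

# A DataFrame: a 2D labelled table, built here from a dict of columns
orders = pd.DataFrame({
    "restaurant": ["Dominos", "KFC", "Biryani House"],
    "rating": [4.2, 3.8, 4.5],
})

# Pulling out one column of a DataFrame gives you back a Series
col = orders["rating"]
```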
Reading from CSV, JSON, SQL, Excel and Parquet
Most ML datasets come from files or databases. Pandas can read almost any format. The options you pass to these read functions directly determine data quality — getting them right saves hours of cleaning later.
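A sketch of the idea that read-time options determine data quality. The CSV content is simulated in memory here; in practice you would pass a file path instead of the `StringIO` object:

```python
import io
import pandas as pd

# Simulated CSV text standing in for a real file on disk
raw = io.StringIO(
    "order_id,order_time,distance_km,rating\n"
    "1001,2024-03-01 12:30,2.5,4.2\n"
    "1002,2024-03-01 13:05,,NA\n"
)

df = pd.read_csv(
    raw,
    parse_dates=["order_time"],   # parse dates at load time, not afterwards
    dtype={"order_id": "int64"},  # pin types you already know
    na_values=["NA"],             # treat the string "NA" as missing
)
```

Getting `parse_dates`, `dtype`, and `na_values` right at load time is exactly the "saves hours of cleaning later" point: the DataFrame arrives with correct types and properly flagged missing values.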
The first five commands on every new dataset
Before touching a dataset you should always run the same five exploration commands. They take 30 seconds and reveal 90% of the data quality problems you'll spend hours debugging later if you skip them.
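The five commands, run against a tiny made-up table (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "restaurant": ["Dominos", "KFC", "Dominos", None],
    "delivery_min": [32, 41, 28, 55],
})

df.info()                                  # 1. dtypes, non-null counts, memory
head = df.head()                           # 2. first rows: eyeball the data
stats = df.describe()                      # 3. numeric summary statistics
missing = df.isnull().sum()                # 4. missing values per column
counts = df["restaurant"].value_counts()   # 5. category frequencies, per column
```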
Selecting rows and columns — loc vs iloc vs direct access
Pandas has three ways to select data. Direct column access with df['col'] for columns. .loc for label-based selection (use column names and index labels). .iloc for integer position-based selection (use numbers like NumPy). Mixing these up is the most common Pandas mistake beginners make.
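A sketch of the three access styles side by side, including the chained-indexing mistake to avoid (data is made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"restaurant": ["Dominos", "KFC", "Subway"], "rating": [4.2, 3.8, 4.5]},
    index=["a", "b", "c"],
)

col = df["rating"]                 # direct column access
by_label = df.loc["a", "rating"]   # label-based
by_pos = df.iloc[0, 1]             # integer-position-based, like NumPy

label_slice = df.loc["a":"b"]      # 2 rows: the end label "b" IS included
pos_slice = df.iloc[0:2]           # 2 rows: end position 2 is NOT included

# Assignment: one .loc call with row mask AND column name.
# df.loc[mask]["rating"] = ... is chained indexing and silently fails.
df.loc[df["rating"] < 4.0, "rating"] = 4.0
```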
Handling missing values — detect, understand, decide, fix
Missing data is in every real dataset. The question is never "is there missing data?" but "why is it missing and what should I do about it?" There are three reasons data goes missing, and each calls for a different treatment.
MCAR (Missing Completely At Random). The missingness has nothing to do with the data. A sensor randomly dropped readings, or a survey respondent accidentally skipped a question. Safe to drop or impute with the column mean/median.
MAR (Missing At Random). Missingness depends on other observed variables but not on the missing value itself. More careful imputation is needed: use information from correlated columns.
MNAR (Missing Not At Random). Missingness depends on the missing value itself. Dangerous: simple imputation introduces bias. Requires domain knowledge and careful handling.
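A sketch of the detect-then-fix workflow for the simplest (MCAR-style) case, on made-up rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "distance_km": [2.5, np.nan, 6.0, 3.1],
    "delivery_min": [28, 35, np.nan, 30],
})

# Detect: count missing values per column
missing = df.isnull().sum()

# Fix, option 1: impute with the column median (robust to outliers)
df["distance_km"] = df["distance_km"].fillna(df["distance_km"].median())

# Fix, option 2: drop rows that are still incomplete
clean = df.dropna()
```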
apply, map and vectorised operations — transform any column
Transforming columns is the core of feature engineering. Pandas gives you three mechanisms: vectorised operations (fastest — use whenever possible), .map() for element-wise transformation of a Series, and .apply() for row-wise or column-wise operations on a DataFrame. Use them in that order of preference — vectorised operations are 100× faster than apply loops.
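The three mechanisms in order of preference, sketched on made-up columns:

```python
import pandas as pd

df = pd.DataFrame({"distance_km": [2.5, 6.0, 3.1], "fee": [20, 45, 25]})

# 1) Vectorised: operates on whole columns at once, fastest
df["fee_per_km"] = df["fee"] / df["distance_km"]

# 2) .map: element-wise transformation of a single Series
df["zone"] = df["distance_km"].map(lambda d: "near" if d < 4 else "far")

# 3) .apply with axis=1: row-wise, slowest, use only when you need whole rows
df["label"] = df.apply(lambda row: f"{row['zone']}:{row['fee']}", axis=1)
```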
GroupBy — the most powerful Pandas operation
GroupBy splits the DataFrame into groups based on one or more columns, applies a function to each group, and combines the results. This is the core of almost all exploratory data analysis and feature engineering. It answers questions like "what is the average delivery time per restaurant?" or "which city has the highest fraud rate?" — the questions you answer before deciding what features to build.
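A sketch of split-apply-combine answering the "average delivery time per restaurant" question, plus `.transform()` for feature engineering (data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "restaurant": ["Dominos", "KFC", "Dominos", "KFC"],
    "delivery_min": [30, 45, 26, 39],
})

# Split by restaurant, apply mean, combine into one result per group
avg = df.groupby("restaurant")["delivery_min"].mean()

# Named aggregations give clean output column names
summary = df.groupby("restaurant").agg(
    avg_min=("delivery_min", "mean"),
    orders=("delivery_min", "count"),
)

# .transform broadcasts the group statistic back to every row: shape unchanged
df["rest_avg"] = df.groupby("restaurant")["delivery_min"].transform("mean")
```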
Merge and join — combine data from multiple sources
Real ML projects always involve multiple tables. Orders table. Customers table. Restaurants table. Weather data. All need to be joined together before you can train a model. Pandas merge is SQL JOIN — if you know SQL joins, this is identical. If you don't, the examples below will make it clear immediately.
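A sketch of joining an orders table to a restaurants table on a shared key (tables and key names are made up):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "rest_id": [10, 11, 10]})
restaurants = pd.DataFrame({"rest_id": [10, 11], "name": ["Dominos", "KFC"]})

# Equivalent to SQL: SELECT * FROM orders LEFT JOIN restaurants USING (rest_id)
merged = orders.merge(restaurants, on="rest_id", how="left")
```

`how` takes `"left"`, `"right"`, `"inner"`, or `"outer"`, mapping directly onto the SQL join types.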
String operations — the .str accessor
Real datasets are full of string columns — restaurant names, addresses, product descriptions, customer comments. Before feeding them to a model you need to clean and extract information from them. The .str accessor applies string methods to every element of a Series in one vectorised call.
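A sketch of vectorised string cleaning and extraction on messy restaurant names (made up):

```python
import pandas as pd

names = pd.Series(["  Dominos Pizza ", "KFC - Koramangala", "kfc - HSR"])

clean = names.str.strip().str.lower()    # whitespace and case, whole column at once
brand = clean.str.split(" - ").str[0]    # extract the text before the dash
has_kfc = clean.str.contains("kfc")      # boolean feature column
```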
DateTime features — extract time-based signals for ML
Time columns are one of the richest sources of features in ML. Hour of day, day of week, month, whether it's a holiday, days since last event — these consistently improve models for delivery time, demand forecasting, fraud detection, and anything with temporal patterns. Pandas makes extracting them trivial.
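A sketch of the `.dt` accessor plus sine/cosine encoding for the cyclical hour feature (timestamps are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_time": pd.to_datetime([
    "2024-03-01 12:30", "2024-03-02 19:45", "2024-03-03 23:10",
])})

# .dt extracts a time-based feature from the whole column in one call
df["hour"] = df["order_time"].dt.hour
df["day_of_week"] = df["order_time"].dt.dayofweek          # Monday = 0
df["is_weekend"] = df["order_time"].dt.dayofweek >= 5

# Sine/cosine encoding: hour 23 and hour 0 become numerically adjacent
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```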
From DataFrame to NumPy — prepare data for model training
After all the loading, cleaning, and feature engineering, the final step is converting the DataFrame to NumPy arrays that sklearn, PyTorch, or XGBoost can consume. This bridge is where most beginners make the mistakes that silently corrupt model training — leaking the test set into the training pipeline, not handling categoricals correctly, or fitting scalers on the wrong data.
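A sketch of the leakage-safe version of that bridge, done here with plain pandas/NumPy standardisation rather than an sklearn scaler (data and split are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "distance_km": [2.0, 4.0, 6.0, 8.0],
    "delivery_min": [20, 30, 40, 50],
})

# Split FIRST; the scaler must never see test rows
train, test = df.iloc[:3], df.iloc[3:]

# Fit (compute statistics) on train only
mean, std = train["distance_km"].mean(), train["distance_km"].std()

# Transform both splits with the SAME train statistics, then hand off as NumPy
X_train = ((train[["distance_km"]] - mean) / std).to_numpy()
X_test = ((test[["distance_km"]] - mean) / std).to_numpy()
```

Computing `mean` and `std` on the full `df` instead of `train` is the silent leakage bug: test information shapes the training inputs and inflates evaluation metrics.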
Every common Pandas error — explained and fixed
You can now take any real dataset from raw file to model-ready array.
The programming ecosystem section is complete. Python, NumPy, and Pandas — the three tools every ML engineer uses every day, and that every ML library is built on top of. Every algorithm in the Classical ML section assumes you can load data, explore it, clean it, engineer features, and convert it to a NumPy array. You can do all of that now.
Module 11 begins the Data Engineering section with data collection — pulling data from REST APIs, SQL databases, file systems, and web scraping. In production ML, the data you get from your company's systems is never as clean as the datasets in tutorials. The next section closes that gap.
Where ML data actually comes from and how to pull it reliably — REST APIs, SQL queries, Parquet files, and web scraping.
🎯 Key Takeaways
- ✓ A Series is a 1D labelled array (one column). A DataFrame is a 2D labelled table (multiple columns sharing one index). Every DataFrame is a dict of Series.
- ✓ Always run df.info(), df.head(), df.describe(), df.isnull().sum(), and df['col'].value_counts() on every new dataset before touching it. These five commands reveal 90% of data quality issues.
- ✓ .loc selects by label (column names, index labels) — end label IS included. .iloc selects by integer position (like NumPy) — end position NOT included. Never chain them: df.loc[mask]['col'] = val is always wrong — use df.loc[mask, 'col'] = val.
- ✓ Missing data has three types: MCAR (safe to impute with statistics), MAR (use correlated columns), MNAR (dangerous — requires domain knowledge). Always check whether missingness is random before choosing an imputation strategy.
- ✓ GroupBy is split-apply-combine. Use .agg() for multiple aggregations, named aggregations for clean output, .transform() to add group statistics back to every row (essential for feature engineering without changing DataFrame shape).
- ✓ Always fit scalers and encoders on training data only, then transform both train and test. Fitting on the full dataset leaks test information into training — a silent bug that inflates evaluation metrics. Use sklearn Pipeline to prevent leakage automatically.
- ✓ For time columns use the .dt accessor to extract hour, day_of_week, month, is_weekend etc. Use sine/cosine encoding for cyclical features (hour, day of week) so that hour 23 and hour 0 are recognised as numerically adjacent.
Discussion
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.