Feature Engineering
Transform raw columns into powerful model inputs. Log transforms, interaction features, target encoding, cyclical encodings, embeddings, and the techniques that consistently beat model tuning.
The model doesn't see your data. It sees the numbers you give it. Make those numbers count.
A linear regression predicting delivery time from distance_km will give you one set of numbers. The same linear regression predicting from log(distance_km) will give you significantly better numbers — because delivery time grows sub-linearly with distance (the first kilometre adds more time per km than the fifth kilometre). Same model. Different representation. Better result.
This is the core idea of feature engineering: transforming raw columns into representations that better match the mathematical assumptions of the model. Tree-based models (Random Forest, XGBoost) are robust to raw features but still benefit from interaction features and target encoding. Linear models need transformations to handle skew and non-linearity. Neural networks benefit from normalisation and embedding representations for categoricals.
In competitive ML (Kaggle, production systems), feature engineering is consistently the highest-leverage activity. The top solution in most Kaggle competitions uses a standard model on engineered features — not a novel architecture on raw data. This module teaches every major technique with working code on the Swiggy dataset.
What this module covers:
Load the clean Swiggy dataset
Log, sqrt, Box-Cox and scaling — fix skewed distributions
Most real-world numeric features are right-skewed — a few very large values drag the mean far above the median. Linear models handle such features poorly: the handful of extreme values dominates the fit while most of the data is squeezed into a narrow range. When a feature is heavily skewed, a log transform makes the distribution more symmetric and often produces a dramatically better linear model.
The intuition: delivery time does not increase linearly with distance. Going from 1km to 2km adds more time than going from 9km to 10km (because acceleration, traffic signals, and restaurant location all make short distances disproportionately slow). log(distance) captures this diminishing relationship much better than raw distance.
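A minimal sketch of the transform-and-verify workflow (the synthetic distances here are a stand-in for the real `distance_km` column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed distances -- a stand-in for the real distance_km column
rng = np.random.default_rng(0)
distances = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=1000), name="distance_km")

log_dist = np.log1p(distances)  # log(1 + x): defined at x = 0, unlike np.log

# Always verify the transform actually reduced skewness before trusting it
print(f"raw skew: {distances.skew():.2f}, log skew: {log_dist.skew():.2f}")
```

The `skew()` check matters: if the raw skew was already near zero, the log transform can make things worse, not better.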
Products, ratios and differences — capture combined effects
An interaction feature combines two existing features into one that captures their joint effect. Distance and traffic separately each explain some variance in delivery time. But distance × traffic captures the combined effect — a long distance in high traffic is much worse than either alone. Linear models cannot discover this relationship without an explicit interaction term. Tree models can, but having it explicit speeds up and improves learning.
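In pandas an interaction is one line. The column names here (`distance_km`, `traffic_level` as an ordinal 1–3 score) are assumptions for illustration:

```python
import pandas as pd

# Toy rows; traffic_level is an assumed ordinal score (1 = light, 3 = heavy)
df = pd.DataFrame({
    "distance_km":   [2.0, 8.0, 8.0],
    "traffic_level": [1,   1,   3],
})

# Product: a long trip in heavy traffic scores far higher than either factor alone
df["dist_x_traffic"] = df["distance_km"] * df["traffic_level"]
print(df["dist_x_traffic"].tolist())  # [2.0, 8.0, 24.0]
```

Ratios (order value per kilometre) and differences (actual minus expected prep time) follow the same pattern.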
Binning — discretise continuous variables into categories
Binning converts a continuous variable into discrete buckets. This sounds like losing information — and sometimes it is. But for linear models, binning can capture non-linear step-function relationships that a linear term cannot. For tree models, binning pre-computes splits the tree would find anyway, sometimes speeding up training significantly on continuous features with many distinct values.
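For example, `pd.cut` with hand-picked edges (the edges below are illustrative, not taken from the dataset; `pd.qcut` would give equal-frequency bins instead):

```python
import pandas as pd

distances = pd.Series([0.8, 2.5, 4.0, 7.5, 12.0], name="distance_km")

# Fixed, hand-picked edges -- illustrative values, not tuned on the Swiggy data
dist_band = pd.cut(distances,
                   bins=[0, 2, 5, 10, float("inf")],
                   labels=["short", "medium", "long", "very_long"])
print(dist_band.tolist())  # ['short', 'medium', 'medium', 'long', 'very_long']
```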
Datetime feature engineering — extract every signal from a timestamp
A raw timestamp is useless to an ML model. But the features you extract from it — hour of day, day of week, whether it's a holiday, days since the last order — are often among the most predictive features in the whole dataset. Delivery time varies dramatically by hour. Restaurant prep time varies by day of week. Order value varies by time slot. The timestamp encodes all of this, but only if you extract it.
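A sketch of the common extractions, including a cyclical encoding so that hour 23 and hour 0 end up close together (the `order_time` column name is an assumption):

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({"order_time": pd.to_datetime(
    ["2024-03-01 12:30", "2024-03-02 19:45", "2024-03-03 08:10"])})

t = orders["order_time"].dt
orders["hour"]       = t.hour
orders["dayofweek"]  = t.dayofweek                     # Monday = 0
orders["is_weekend"] = (t.dayofweek >= 5).astype(int)  # Saturday/Sunday

# Cyclical encoding: maps the hour onto a circle, so 23:00 and 00:00 are
# neighbours instead of 23 units apart
orders["hour_sin"] = np.sin(2 * np.pi * orders["hour"] / 24)
orders["hour_cos"] = np.cos(2 * np.pi * orders["hour"] / 24)
```

"Days since the last order" needs a groupby over the customer's history, but the per-timestamp extractions above are pure column operations.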
Encoding strategies — from one-hot to target encoding
Categorical columns cannot go into an ML model as strings. They must be converted to numbers. There are many ways to do this, and the choice matters significantly for model performance. One-hot encoding is safest but creates sparse high-dimensional representations. Target encoding is compact and informative but requires careful implementation to avoid leakage.
One-hot encoding — safe, sparse, standard
Target encoding — the most powerful, most dangerous technique
Target encoding replaces each category value with the mean of the target variable for that category. "Pizza Hut" becomes 36.4 (its mean delivery time in the training set). This is extremely informative and produces compact, powerful features. It is also the most dangerous encoding technique — naive implementation directly leaks the target into the features, causing massive overfitting that looks great in cross-validation but collapses in production.
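One way to implement the safe cross-fold scheme (a sketch, not the module's exact code; the smoothing constant and fold count are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cv(train, cat_col, target_col, n_splits=5, smoothing=10.0):
    """Leakage-safe target encoding: each row is encoded with category means
    computed on the *other* folds, smoothed toward the global mean."""
    global_mean = train[target_col].mean()
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(train):
        fold = train.iloc[fit_idx]
        stats = fold.groupby(cat_col)[target_col].agg(["mean", "count"])
        # Rare categories are pulled toward the global mean
        smooth = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                  / (stats["count"] + smoothing))
        # Categories unseen in the fitting folds fall back to the global mean
        vals = train[cat_col].iloc[enc_idx].map(smooth).fillna(global_mean)
        encoded.iloc[enc_idx] = vals.to_numpy()
    return encoded
```

At inference time you would encode new rows with statistics computed on the full training set, since the test rows contributed nothing to them.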
Frequency encoding — fast, leakage-safe, surprisingly effective
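The idea in two lines: replace each category with how often it appears in the training data. No target is involved, so there is nothing to leak (toy data for illustration):

```python
import pandas as pd

train = pd.DataFrame({"restaurant": ["A", "A", "A", "B", "B", "C"]})

# Each category becomes its share of training rows
freq = train["restaurant"].value_counts(normalize=True)
train["restaurant_freq"] = train["restaurant"].map(freq)
```

Popular restaurants get high values, rare ones low values — often a useful proxy for volume and operational maturity.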
Aggregate features — group statistics that make each row context-aware
A single order's distance of 5km tells the model less than knowing that this order is 2km longer than the average order to this restaurant. Aggregate features add context by computing statistics within groups — per restaurant, per city, per time slot — and attaching them to each row. These are among the most consistently powerful features across all ML problems.
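The standard pattern is groupby–aggregate–merge, then a difference feature on top (column names assumed for illustration; in practice compute the statistics on the training split only):

```python
import pandas as pd

orders = pd.DataFrame({
    "restaurant":  ["A", "A", "B", "B", "B"],
    "distance_km": [3.0, 7.0, 4.0, 4.0, 10.0],
})

# Per-restaurant statistics -- compute these on the training split only
rest_stats = (orders.groupby("restaurant")["distance_km"]
                    .agg(rest_mean_dist="mean", rest_max_dist="max")
                    .reset_index())
orders = orders.merge(rest_stats, on="restaurant", how="left")

# Context feature: how far this order deviates from the restaurant's norm
orders["dist_vs_rest_mean"] = orders["distance_km"] - orders["rest_mean_dist"]
```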
Feature selection — remove what doesn't help
More features is not always better. Irrelevant features add noise, slow down training, and can hurt generalisation. Feature selection identifies which features contribute meaningful signal and removes those that don't. There are three families of methods, each with different tradeoffs.
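As one example from the filter family (each feature scored independently of the others), on synthetic data where only two of five columns carry signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=500)  # only columns 0 and 2 matter

# Filter method: univariate F-scores against the target, keep the k best
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(selector.get_support(indices=True))
```

Wrapper methods (e.g. recursive feature elimination) and embedded methods (e.g. Lasso coefficients) trade more compute for scores that account for feature interactions.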
Feature leakage — the most dangerous mistake in ML
Feature leakage occurs when information about the target variable leaks into the features during training. The model learns a shortcut — it can "predict" the target because the feature contains the answer, not because it has learned the underlying pattern. Evaluation metrics look impossibly good. Then the model ships to production where the future isn't available, and performance collapses.
Leakage is not always obvious. The most common forms are subtle and require discipline to prevent consistently.
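The structural defence is a scikit-learn `Pipeline`: every preprocessing step is fit on the training split only, automatically. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The Pipeline fits the scaler on X_train only; at predict time X_test is
# transformed with the training statistics, never refit -- no leakage by construction
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
print(f"test R^2: {model.score(X_test, y_test):.3f}")
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting is the classic subtle version of this mistake: the test rows' mean and variance leak into training.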
Feature stores — reuse features across models
In a production ML system, the same features are used by multiple models. The "restaurant average delivery time" feature might be used by the ETA prediction model, the fraud model, and the restaurant ranking model. Computing it three times independently wastes compute and introduces inconsistencies. A feature store computes features once, stores them, and serves them to any model that needs them.
Every common feature engineering error — explained and fixed
The data engineering section is complete. The data is ready. Now we build models.
Four modules. Collect data from APIs, SQL, files, and streams. Clean it — remove duplicates, fix types, handle outliers, validate schemas. Engineer features — log transforms, interactions, aggregates, target encoding. The result: a clean, feature-rich DataFrame ready for any ML algorithm.
Module 18 begins the Classical Machine Learning section with linear regression — the oldest, most interpretable, and still one of the most useful algorithms in production ML. Understanding linear regression deeply — not just calling LinearRegression().fit() — reveals the mathematical foundations that every subsequent algorithm (logistic regression, SVMs, neural networks) builds on.
Ordinary least squares, gradient descent, regularisation (Ridge, Lasso, ElasticNet), and how to diagnose and fix every failure mode — all on the Swiggy dataset.
🎯 Key Takeaways
- ✓ Feature engineering consistently outperforms model tuning. The same Ridge regression on well-engineered features beats a Random Forest on raw features in many real problems. Invest in features before investing in model complexity.
- ✓ Log-transform right-skewed positive columns (distances, prices, counts) before feeding to linear models. np.log1p(x) handles x=0 safely. Verify the transformation reduced skewness before assuming it helped.
- ✓ Interaction features (distance × traffic, prep × traffic) capture joint effects that linear models cannot discover on their own. Always try the physically meaningful interactions first before exhaustive polynomial expansion.
- ✓ Target encoding is powerful but dangerous. Never compute it on the full dataset. Always use cross-fold encoding: for each training row, compute the category mean using all other folds. Smooth rare categories toward the global mean.
- ✓ Aggregate features (per-restaurant average delivery time, per-city late rate) make each row context-aware and are among the most consistently powerful features. Always compute them on training data only and apply via merge.
- ✓ Leakage is the most dangerous mistake in ML. It makes evaluation metrics look great while the production model is broken. The rule: fit() only on X_train, transform() on everything. Use sklearn Pipeline to make leakage structurally impossible.
- ✓ Feature stores prevent duplicate computation and inconsistency. Define features once, compute them centrally, serve them to any model. Even a simple Parquet-based store prevents the "which version of this feature was used?" debugging nightmare.