Encoding Categorical Features
One-hot encoding, ordinal encoding, target encoding — what each one does to your data, which algorithms need which, and when each is the right choice.
ML models only understand numbers. Most real data is not numbers.
The Swiggy orders dataset has a restaurant column containing "Pizza Hut", "Biryani Blues", "McDonald's". The city column has "Bangalore", "Mumbai", "Delhi". The time_slot column has "breakfast", "lunch", "evening", "dinner". Not a single number in sight — and every ML algorithm from linear regression to XGBoost to neural networks requires a matrix of numbers.
Encoding is the process of converting categorical columns into numeric representations. But the encoding strategy you choose changes what the model can learn. A naive approach — replace "Pizza Hut" with 1, "Biryani Blues" with 2, "McDonald's" with 3 — implies an ordering: McDonald's is three times bigger than Pizza Hut. The model will believe this. Choosing the wrong encoding produces a model that silently learns the wrong relationships.
This module teaches every encoding strategy used in production ML, what each one tells the model, and the exact situations where each is the right choice.
The dataset used throughout this module
Why naive integer encoding teaches the model lies
The first instinct when seeing string columns is to replace each unique value with an integer. Pizza Hut → 0, Biryani Blues → 1, McDonald's → 2. This is called label encoding or integer encoding. For ordinal features (where the order genuinely matters) it is correct. For nominal features (where order is meaningless) it is wrong — and the model will quietly learn the wrong thing.
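A minimal sketch of the trap, using a hypothetical toy column in the spirit of the Swiggy dataset. `pd.factorize` assigns integers in order of first appearance — arbitrary codes the model will treat as magnitudes:

```python
import pandas as pd

# Hypothetical toy column standing in for the Swiggy restaurant column
orders = pd.DataFrame({"restaurant": ["Pizza Hut", "Biryani Blues", "McDonald's", "Pizza Hut"]})

# Naive label encoding: each unique string becomes an arbitrary integer
codes, uniques = pd.factorize(orders["restaurant"])
orders["restaurant_code"] = codes
print(orders)
# The codes 0, 1, 2 now imply McDonald's > Biryani Blues > Pizza Hut —
# an ordering that does not exist in the real data, but that a linear
# model will faithfully learn from.
```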
Ordinal encoding — for features with a real natural order
Ordinal encoding IS the right choice when categories have a meaningful order and the steps between levels are roughly equal. "poor < average < good < excellent" is a real ordering. "cold < warm < hot" is a real ordering. "bronze < silver < gold" is a real ordering. These should be encoded as 0, 1, 2, 3 — the integers preserve the relationship the model should learn.
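A short sketch with sklearn's `OrdinalEncoder`. The explicit `categories=` list is what guarantees the 0/1/2/3 codes follow the real ordering rather than alphabetical order (the data here is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings = pd.DataFrame({"rating": ["good", "poor", "excellent", "average"]})

# Without categories=, sklearn would sort alphabetically:
# average=0, excellent=1, good=2, poor=3 — a meaningless order.
encoder = OrdinalEncoder(categories=[["poor", "average", "good", "excellent"]])
encoded = encoder.fit_transform(ratings[["rating"]])
print(encoded.ravel())  # [2. 0. 3. 1.]
```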
One-hot encoding — the right choice for nominal categories
One-hot encoding creates one binary column per category. For a restaurant column with 8 values, it creates 8 new columns — each 1 if the row is that restaurant, 0 otherwise. The model sees 8 independent binary features instead of one ambiguous integer. No false ordering. No false proximity. Each restaurant gets its own weight.
The dummy variable trap — and why drop="first" matters
If you create one binary column per category, any one column is perfectly predictable from the others — if all other restaurant columns are 0, the order must be from the remaining restaurant. This creates perfect multicollinearity, which destabilises linear models. The fix: drop one column. With 8 restaurants, you only need 7 columns. The dropped category becomes the baseline — its effect is captured by the intercept.
When one-hot encoding becomes a problem
One-hot encoding with a low-cardinality column (8 restaurants, 6 cities) is fine. But a column with 500 unique values creates 500 new columns — most of which are 0 for any given row. The feature matrix becomes sparse and large, model training slows significantly, and rare categories get very few training examples. For high-cardinality columns (>20–50 unique values) use frequency encoding or target encoding instead.
Frequency encoding — replace category with its prevalence
Frequency encoding replaces each category value with how often it appears in the training set as a proportion. "Pizza Hut" appears in 13% of training orders → encoded as 0.13. "Biryani Blues" appears in 11% → encoded as 0.11. This captures the idea that common categories are different from rare ones without requiring hundreds of new columns or using the target label. It is completely leakage-free and works well for high-cardinality columns.
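Frequency encoding is simple enough to sketch in plain pandas (the counts below are contrived to match the 13% example). The key discipline is the same as any encoder: learn the frequencies on training data, then reuse them at inference:

```python
import pandas as pd

# Contrived training column: 13% Pizza Hut, 11% Biryani Blues
train = pd.Series(["Pizza Hut"] * 13 + ["Biryani Blues"] * 11 + ["McDonald's"] * 76)

freq = train.value_counts(normalize=True)  # learned on train only
encoded_train = train.map(freq)

# At inference, reuse the stored frequencies; an unseen category maps
# to NaN, which we fill with 0 (a reasonable "never seen" default).
test = pd.Series(["Pizza Hut", "New Restaurant"])
encoded_test = test.map(freq).fillna(0.0)
print(encoded_test.tolist())  # [0.13, 0.0]
```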
Target encoding — encode with the label, safely
Target encoding replaces each category with the mean of the target variable for that category. "Pizza Hut" → 34.2 (its mean delivery time in training). "Biryani Blues" → 41.7. This is extremely informative — the model gets a direct signal of what each restaurant typically produces. For high-cardinality columns with strong target relationships, target encoding often outperforms one-hot encoding.
The danger: if you compute the target mean on the full dataset and use it as a training feature, each training row's encoded value contains information from its own target — the model can partially memorise training labels rather than learning generalisable patterns. Evaluation metrics look excellent. Production performance collapses.
The two safe approaches: cross-fold encoding, where each row's encoding is computed from the target means of the other folds so no row ever sees its own label, and sklearn's TargetEncoder, which performs this cross-fitting internally during fit_transform.
Smoothing — handling rare categories
Without smoothing, a restaurant with only 3 training orders gets encoded as the mean of those 3 orders — an extremely noisy estimate that will overfit. Smoothing blends the category mean with the global mean, weighted by how many samples the category has. A category with 500 samples gets almost entirely its own mean. A category with 3 samples gets almost entirely the global mean.
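The blend can be written as `(n · category_mean + m · global_mean) / (n + m)`, where `n` is the category's sample count and `m` controls how strongly rare categories are pulled toward the global mean. A hand-rolled sketch (the helper name and `m=10` default are illustrative, not a library API):

```python
import pandas as pd

def smoothed_target_mean(df, cat_col, target_col, m=10.0):
    """Blend each category's target mean with the global mean.

    A category with n samples gets weight n/(n+m) on its own mean and
    m/(n+m) on the global mean. Illustrative helper, not a library function.
    """
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df = pd.DataFrame({
    "restaurant": ["A"] * 500 + ["B"] * 3,
    "delivery_min": [30.0] * 500 + [60.0] * 3,
})
enc = smoothed_target_mean(df, "restaurant", "delivery_min")
# "A" (500 samples) keeps essentially its own mean of 30;
# "B" (3 samples) is pulled strongly toward the global mean (~30.2).
print(enc)
```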
Binary encoding — compact representation for many categories
Binary encoding is a middle ground between one-hot and target encoding. It assigns each category an integer (like ordinal encoding) then converts that integer to binary and spreads the bits across columns. 100 categories → only 7 columns (because 2⁷ = 128 > 100) instead of 100. It preserves more information than a single integer while being far more compact than one-hot encoding.
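A minimal hand-rolled sketch of the idea (in practice a library such as category_encoders provides a production `BinaryEncoder`; the helper and column names here are illustrative):

```python
import math
import pandas as pd

def binary_encode(series):
    """Assign each category an integer, then spread its bits across columns."""
    codes, uniques = pd.factorize(series)  # category -> arbitrary integer
    n_bits = max(1, math.ceil(math.log2(len(uniques))))
    cols = {
        f"{series.name}_bit{i}": (codes >> i) & 1  # i-th bit of each code
        for i in range(n_bits)
    }
    return pd.DataFrame(cols)

cities = pd.Series([f"city_{i}" for i in range(100)], name="city")
encoded = binary_encode(cities)
print(encoded.shape)  # 100 categories fit in 7 bit-columns (2**7 = 128)
```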
Encoding decision framework
Handling unknown categories at inference time
In production, the model will always eventually encounter a category value it has never seen — a new restaurant partner, a new city expansion, a new product category. Each encoder handles this differently, and not handling it correctly causes crashes or silent wrong predictions.
Every common encoding error — explained and fixed
You can now handle every type of column a real dataset throws at you.
Numeric columns: clean and scale with StandardScaler or RobustScaler. Ordinal columns: OrdinalEncoder with explicit category order. Nominal low-cardinality: OneHotEncoder with drop="first". Nominal high-cardinality: frequency encoding or TargetEncoder. All of it inside a Pipeline that prevents leakage automatically.
Module 19 is the capstone of the Data Engineering section — Feature Engineering and the sklearn Pipeline. It combines everything from Modules 12–15 into a single reusable preprocessing and modelling pipeline, adds interaction features and transformations, and shows the complete workflow from raw DataFrame to trained model ready for cross-validation.
Create new features, combine all transformers, and build one reusable pipeline that preprocesses and models together.
🎯 Key Takeaways
- ✓Never use naive integer encoding for nominal categories — it implies a false ordering and magnitude that linear models will learn. "Burger King = 7" does not make Burger King greater than "Pizza Hut = 0" in any real sense.
- ✓OrdinalEncoder is correct when categories have a real natural order (poor < average < good < excellent). Always specify the order explicitly with the categories= parameter — never rely on alphabetical order.
- ✓OneHotEncoder with drop="first" is the safe default for nominal categories in linear models. drop="first" removes one column per feature to avoid perfect multicollinearity. handle_unknown="ignore" prevents crashes on unseen categories at inference.
- ✓Target encoding replaces each category with the mean target value for that category — extremely informative but dangerous. Always use cross-fold encoding or sklearn TargetEncoder. Never compute target means on the full dataset before cross-validation.
- ✓Frequency encoding replaces each category with its prevalence in training data. Completely leakage-free, works for any cardinality, and is a strong baseline for high-cardinality columns when you want to avoid target leakage.
- ✓Tree-based models (Random Forest, XGBoost, LightGBM) do not require one-hot encoding — they can use label-encoded integers because they find optimal splits regardless of ordering. OneHotEncoder is primarily needed for linear models, SVMs, and neural networks.
- ✓Always put encoders inside a sklearn Pipeline or ColumnTransformer. This ensures fit() runs only on training data and transform() applies stored statistics to test data — preventing leakage and ensuring consistent column counts between train and test.