Scikit-learn Interface
The API every sklearn algorithm shares. fit, transform, predict, Pipeline, ColumnTransformer — understand the interface once and every algorithm becomes obvious.
sklearn has 200+ algorithms. They all work the same way. Learn the pattern once — use any algorithm forever.
Imagine you joined Swiggy's data team on day one. Your lead says: "Try a few different models on this delivery time dataset — linear regression, random forest, maybe a gradient boosted tree. See which one performs best." In any other ML library, each algorithm has a completely different API. Different function names, different parameter conventions, different ways to get predictions. You would spend hours reading documentation for each one.
sklearn solved this problem with a unified interface. Every single algorithm — whether it is a simple linear regression or a complex gradient boosting ensemble — follows the exact same pattern: create the model, call .fit() to train it, call .predict() to use it. Switching from one algorithm to another is literally changing one word in your code and nothing else.
This module teaches you that pattern thoroughly. Once you understand it, you can use any of sklearn's 200+ algorithms without reading the docs for each one. You will also learn Pipeline and ColumnTransformer — the two tools that turn a messy sequence of preprocessing steps into a clean, production-ready, leakage-proof workflow.
Think of sklearn like a set of standardised power tools from the same brand. A drill, a sander, and a circular saw all look different and do different things. But they all have the same battery pack, the same on/off button location, and the same safety mechanism. Once you know how to use one tool in the set, picking up a new one takes two minutes — not two hours.
sklearn's "battery pack" is the estimator interface: every model is an object, .fit() trains it, .predict() uses it, .transform() processes data with it. Same pattern, every time.
Three methods — every sklearn object has these
Every sklearn object — whether it is a model, a scaler, an encoder, or an imputer — is built around three methods. Understanding what each one does and when to call it is the entire sklearn interface.
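Here is the pattern on a tiny made-up dataset, one transformer and one model side by side:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# A transformer: .fit() learns column statistics, .transform() applies them
scaler = StandardScaler()
scaler.fit(X)                  # learns mean_ and scale_ from X
X_scaled = scaler.transform(X)

# An estimator: .fit() learns from X and y, .predict() uses what it learned
model = LinearRegression()
model.fit(X_scaled, y)
preds = model.predict(X_scaled)
```

Notice that the model never sees raw X; it sees the scaler's output. That handoff is exactly what Pipeline automates later in this module.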
Not all sklearn objects do the same thing — here is the map
sklearn objects fall into three types. All three share the .fit() method. But what they do with it — and what methods they expose — differs. Knowing which type you are working with prevents a lot of confusion.
Estimators (models): learn from labelled data. fit() takes both X (features) and y (labels); predict() makes predictions on new X.
Transformers (preprocessors): learn statistics from X and transform X. fit() does NOT use y; transform() changes the shape or values of X.
Supervised transformers: transform X, but use y during fit() to make the transformation smarter. TargetEncoder is the main example.
Useful attributes after fit() — what every trained object stores
After calling .fit(), sklearn objects expose attributes whose names end in an underscore (_) that let you inspect what was learned. This underscore convention is universal across all of sklearn — if a variable name ends in _, it was set during .fit().
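With a tiny dataset built around the line y = 2x + 1, the learned attributes are easy to verify by eye:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 2x + 1

scaler = StandardScaler().fit(X)
print(scaler.mean_, scaler.scale_)    # column mean and std, learned during fit

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # recovers slope 2 and intercept 1
```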
Switching algorithms by changing one word — this is the entire point
The reason sklearn uses a unified interface is so you can compare multiple algorithms with almost zero extra code. The preprocessing stays identical. The evaluation stays identical. Only the model object changes. This is how data scientists actually work — they run several algorithms and pick the one that performs best.
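A sketch of that workflow on synthetic data: the loop body is identical for every model, and only the object in the list changes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Swapping algorithms = swapping the object; fit/predict never change
results = {}
for model in [LinearRegression(), RandomForestRegressor(random_state=0)]:
    model.fit(X_train, y_train)
    results[type(model).__name__] = r2_score(y_test, model.predict(X_test))
print(results)
```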
Pipeline — chain preprocessing and modelling into one object
Every ML workflow has multiple steps: impute missing values, scale numeric features, encode categorical features, then train the model. Without Pipeline you write these as separate steps, manually tracking which scaler was fit on which data — and inevitably making the leakage mistake (fitting on the full dataset instead of just the training fold).
Pipeline chains all steps into one object. When you call pipeline.fit(X_train, y_train), it fits each step on the training data automatically. When you call pipeline.predict(X_test), it applies each step's stored statistics — never refitting. Data leakage becomes structurally impossible.
A Pipeline is like an assembly line in a factory. Raw materials (data) enter at one end. Each station performs one operation — wash, cut, assemble, paint. The finished product (predictions) comes out at the other end. The assembly line has a fixed order. Each station knows exactly what state the material is in when it arrives.
Without Pipeline you are doing each factory step manually and carrying the half-finished product between stations yourself — error-prone, slow, and easy to do in the wrong order.
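A minimal assembly line with toy data — impute, scale, then model, all as one object:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_test = np.array([[3.0], [np.nan]])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # station 1: fill missing values
    ("scale", StandardScaler()),                 # station 2: scale
    ("model", LinearRegression()),               # station 3: predict
])

pipe.fit(X_train, y_train)    # each step fits on training data only
preds = pipe.predict(X_test)  # each step applies stored statistics, never refits
```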
Accessing individual steps inside a fitted Pipeline
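After fitting, each station is reachable by the name you gave it, with its underscore attributes intact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

# named_steps gives access to each fitted step by name
print(pipe.named_steps["scale"].mean_)   # the scaler's learned mean
print(pipe.named_steps["model"].coef_)   # the model's learned coefficients
# index access also works: pipe["model"] or pipe[-1]
```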
ColumnTransformer — apply different transformations to different columns
Real datasets always have mixed column types. Numeric columns need scaling. Categorical columns need encoding. Text columns need tokenisation. ColumnTransformer lets you define a different transformation for each group of columns and applies them all in parallel, then concatenates the results into one matrix.
You define named transformers as a list of tuples: (name, transformer, columns). Each transformer processes its assigned columns independently. The results are concatenated horizontally into one output matrix.
remainder='drop' (default) — columns not listed are dropped. remainder='passthrough' — unlisted columns pass through unchanged.
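A sketch with made-up delivery-style columns (the column names are hypothetical): one scaled numeric column plus three one-hot city columns come out as a single four-column matrix, and the unlisted order_id is dropped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "distance_km": [1.2, 3.4, 5.6, 2.1],
    "city": ["BLR", "DEL", "BLR", "MUM"],
    "order_id": [101, 102, 103, 104],  # not listed, so remainder='drop' removes it
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["distance_km"]),  # (name, transformer, columns)
    ("cat", OneHotEncoder(), ["city"]),
])

X = pre.fit_transform(df)
# 1 scaled column + 3 one-hot columns, concatenated horizontally
print(X.shape)
```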
cross_val_score and GridSearchCV — the evaluation and tuning tools
A single train/test split gives you one estimate of model performance. It might be lucky or unlucky depending on which samples ended up in each set. Cross-validation runs the train/test split multiple times with different splits and averages the results — giving a much more reliable performance estimate. GridSearchCV combines cross-validation with hyperparameter search.
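Both tools accept a Pipeline directly, which keeps every fold leakage-free. The step__parameter double-underscore syntax targets parameters inside the Pipeline:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# 5-fold cross-validation: five fit/score rounds, averaged for a stable estimate
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# GridSearchCV: cross-validation plus hyperparameter search in one object.
# "model__alpha" means: the alpha parameter of the step named "model".
grid = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```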
Every common sklearn interface error — explained and fixed
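Two of the most frequent ones, reproduced and fixed in a few lines:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

y = np.array([2.0, 4.0, 6.0])

# Error 1: "Expected 2D array, got 1D array instead."
# sklearn wants X shaped (n_samples, n_features); for a single feature,
# .reshape(-1, 1) turns a 1-D array into the expected 2-D column.
X = np.array([1.0, 2.0, 3.0]).reshape(-1, 1)

# Error 2: NotFittedError — calling .predict() before .fit().
model = LinearRegression()
caught = False
try:
    model.predict(X)
except NotFittedError:
    caught = True  # the fix is simply: fit first, then predict
model.fit(X, y)
preds = model.predict(X)
```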
You now speak sklearn fluently. The next section puts it to work on real data.
fit, predict, transform, Pipeline, ColumnTransformer, cross_val_score, GridSearchCV — these are the seven tools you will use in every single ML project for the rest of your career. You now know all of them.
Section 4 — Data Engineering for ML — begins next. It starts with the messiest part of every real ML project: getting the data in the first place. REST APIs, SQL databases, Parquet files, web scraping — where ML data actually comes from and how to pull it reliably with Python.
🎯 Key Takeaways
- ✓ sklearn has one unified interface shared by all 200+ algorithms. Three methods cover everything: .fit() learns from data, .transform() applies learned transformations, .predict() makes predictions. Learn this pattern once — use any algorithm.
- ✓ .fit() must only be called on training data. Never on test data. Calling fit on test data leaks information and makes evaluation metrics optimistically wrong. This is the single most important rule in all of sklearn.
- ✓ There are three types of sklearn objects: Estimators (models with fit+predict), Transformers (preprocessors with fit+transform), and objects that are both. After fit(), all learned values are stored as underscore attributes: scaler.mean_, model.coef_, encoder.categories_.
- ✓ Pipeline chains multiple steps into one object. It enforces correct fit/transform order automatically, prevents leakage in cross-validation (each fold refits the entire pipeline on its training portion), and lets you swap models by changing one word.
- ✓ ColumnTransformer applies different transformations to different column groups in parallel. Numeric columns get scaling, categorical get encoding, ordinal get ordinal encoding — all in one object that sklearn treats as a single transformer.
- ✓ GridSearchCV and RandomizedSearchCV find optimal hyperparameters. Always pass them a Pipeline — never a bare model with preprocessing done separately outside the search. Use double underscore syntax to target parameters inside Pipeline steps: model__alpha, preprocessor__num__scaler__with_mean.