
Scikit-learn Interface

The API every sklearn algorithm shares. fit, transform, predict, Pipeline, ColumnTransformer — understand the interface once and every algorithm becomes obvious.

35–40 min · March 2026
Section 03 · Programming Ecosystem
Python · 5 topics
Before any code — what problem does sklearn solve?

sklearn has 200+ algorithms. They all work the same way. Learn the pattern once — use any algorithm forever.

Imagine you joined Swiggy's data team on day one. Your lead says: "Try a few different models on this delivery time dataset — linear regression, random forest, maybe a gradient boosted tree. See which one performs best." In any other ML library, each algorithm has a completely different API. Different function names, different parameter conventions, different ways to get predictions. You would spend hours reading documentation for each one.

sklearn solved this problem with a unified interface. Every single algorithm — whether it is a simple linear regression or a complex gradient boosting ensemble — follows the exact same pattern: create the model, call .fit() to train it, call .predict() to use it. Switching from one algorithm to another is literally changing one word in your code and nothing else.

This module teaches you that pattern thoroughly. Once you understand it, you can use any of sklearn's 200+ algorithms without reading the docs for each one. You will also learn Pipeline and ColumnTransformer — the two tools that turn a messy sequence of preprocessing steps into a clean, production-ready, leakage-proof workflow.

🧠 Analogy — read this first

Think of sklearn like a set of standardised power tools from the same brand. A drill, a sander, and a circular saw all look different and do different things. But they all have the same battery pack, the same on/off button location, and the same safety mechanism. Once you know how to use one tool in the set, picking up a new one takes two minutes — not two hours.

sklearn's "battery pack" is the estimator interface: every model is an object, .fit() trains it, .predict() uses it, .transform() processes data with it. Same pattern, every time.
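The battery-pack idea in miniature (toy numbers, not the Swiggy dataset): the exact same fit-then-apply calls drive a preprocessor and a model alike.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])          # toy relationship: y = 2x

# Same pattern on a preprocessor...
scaler = StandardScaler()
scaler.fit(X)                           # learn (mean, std)
X_scaled = scaler.transform(X)          # apply

# ...and on a model
model = LinearRegression()
model.fit(X_scaled, y)                  # learn (weights)
pred = model.predict(scaler.transform([[4.0]]))   # apply
print(round(pred[0], 1))                # → 8.0
```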

🎯 Pro Tip
This module is intentionally practical. You will not just read about the API — you will use it on the Swiggy delivery time dataset with four different algorithms, switching between them by changing one line each time. By the end, switching algorithms will feel completely natural.
The core interface

Three methods — every sklearn object has these

Every sklearn object — whether it is a model, a scaler, an encoder, or an imputer — is built around three methods. Understanding what each one does and when to call it is the entire sklearn interface.

The three-method pattern — the same for every sklearn object
1
.fit(X_train)

The learning step. Show the object your training data. It computes and stores statistics — the mean and std for a scaler, the split thresholds for a tree, the weights for a regression. The call returns the fitted object itself (which is why StandardScaler().fit(X) works as a one-liner). The object is now "trained".

⚠ Rule: ONLY called on training data. Never on test data. Ever.
2
.transform(X) OR .predict(X)

The application step. Apply the learned statistics to new data. transform() is for preprocessing (scaling, encoding). predict() is for models (output a class or number). Uses stored stats from fit() — does NOT learn anything new.

⚠ Rule: Called on BOTH training and test data using the same stored stats.
3
.fit_transform(X_train)

Shortcut: fit() then transform() in one call. Slightly faster because some objects can optimise the combined operation. Only use on training data — calling fit_transform on test data is the classic data leakage mistake.

⚠ Rule: Convenient shortcut — but ONLY for training data.
python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 1000

# Swiggy delivery time dataset
distance  = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic   = np.random.randint(1, 11, n).astype(float)
prep_time = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery  = (8.6 + 7.3*distance + 0.8*prep_time + 1.5*traffic
             + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep_time])
y = delivery

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ── Step 1: Create the object ──────────────────────────────────────────
# Just instantiation — nothing is learned yet
scaler = StandardScaler()
model  = LinearRegression()

# ── Step 2: fit() — the learning step ─────────────────────────────────
# ONLY on training data
scaler.fit(X_train)      # learns mean and std of X_train
# model.fit() happens further below, after scaling — it needs the SCALED features

# What did scaler learn and store?
print("What StandardScaler learned from X_train:")
print(f"  mean_:   {scaler.mean_.round(2)}")   # stored mean per feature
print(f"  scale_:  {scaler.scale_.round(2)}")  # stored std per feature

# ── Step 3: transform() — apply learned stats ────────────────────────
# Uses STORED stats — does NOT refit on test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)   # same mean/std as training

# ── fit_transform() shortcut — ONLY for training ─────────────────────
scaler2        = StandardScaler()
X_train_scaled = scaler2.fit_transform(X_train)   # fit + transform in one call
X_test_scaled  = scaler2.transform(X_test)        # transform only — no refitting!

# ── WRONG: calling fit_transform on test (leakage!) ──────────────────
# X_test_WRONG = scaler2.fit_transform(X_test)   # NEVER DO THIS
# This would compute NEW mean/std from test data — leaking test stats into your pipeline

# ── Train and predict with the model ─────────────────────────────────
model.fit(X_train_scaled, y_train)

y_pred_train = model.predict(X_train_scaled)
y_pred_test  = model.predict(X_test_scaled)

from sklearn.metrics import mean_absolute_error
print(f"\nLinear Regression:")
print(f"  Train MAE: {mean_absolute_error(y_train, y_pred_train):.2f} min")
print(f"  Test MAE:  {mean_absolute_error(y_test,  y_pred_test):.2f} min")
Types of sklearn objects

Not all sklearn objects do the same thing — here is the map

sklearn objects fall into three types. All three share the .fit() method. But what they do with it — and what methods they expose — differs. Knowing which type you are working with prevents a lot of confusion.

Estimator (models)
.fit(X, y) · .predict(X) · .score(X, y)

Learns from labelled data. Takes both X (features) and y (labels) in fit(). Makes predictions on new X.

Examples: LinearRegression, LogisticRegression, RandomForestClassifier, DecisionTreeClassifier
Transformer (preprocessors)
.fit(X) · .transform(X) · .fit_transform(X)

Learns statistics from X and transforms X. Does NOT use y during fit(). Changes the shape or values of X.

Examples: StandardScaler, MinMaxScaler, OneHotEncoder, SimpleImputer, PCA
Transformer-Estimator (both)
.fit(X, y) · .transform(X)

Preprocesses using information from y: the label is consulted during fit() to make the transformation smarter. TargetEncoder replaces each category with the mean target value for that category; LinearDiscriminantAnalysis additionally offers .predict().

Examples: TargetEncoder, LinearDiscriminantAnalysis

Useful attributes after fit() — what every trained object stores

After calling .fit(), sklearn objects expose attributes (ending in underscore _) that let you inspect what was learned. This underscore convention is universal across all of sklearn — if a variable name ends in _ it was set during .fit().

python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import numpy as np

np.random.seed(42)
n = 500
X_num = np.column_stack([
    np.abs(np.random.normal(4, 2, n)),    # distance
    np.random.randint(1, 11, n).astype(float),  # traffic
])
y_reg = 20 + 5*X_num[:,0] + 2*X_num[:,1] + np.random.randn(n)*3
y_cls = (y_reg > 35).astype(int)

# ── StandardScaler attributes after fit() ─────────────────────────────
scaler = StandardScaler().fit(X_num)
print("StandardScaler attributes:")
print(f"  .mean_    (learned mean):    {scaler.mean_.round(3)}")
print(f"  .scale_   (learned std):     {scaler.scale_.round(3)}")
print(f"  .var_     (learned variance):{scaler.var_.round(3)}")
print(f"  .n_features_in_ (n features):{scaler.n_features_in_}")
print(f"  .n_samples_seen_:            {scaler.n_samples_seen_}")

# ── LinearRegression attributes after fit() ───────────────────────────
X_sc = scaler.transform(X_num)
lr   = LinearRegression().fit(X_sc, y_reg)
print("\nLinearRegression attributes:")
print(f"  .coef_      (learned weights): {lr.coef_.round(3)}")
print(f"  .intercept_ (learned bias):    {lr.intercept_:.3f}")
print(f"  .n_features_in_:               {lr.n_features_in_}")

# ── DecisionTreeClassifier attributes after fit() ─────────────────────
dt = DecisionTreeClassifier(max_depth=3).fit(X_sc, y_cls)
print("\nDecisionTreeClassifier attributes:")
print(f"  .n_features_in_:       {dt.n_features_in_}")
print(f"  .n_classes_:           {dt.n_classes_}")
print(f"  .classes_:             {dt.classes_}")
print(f"  .feature_importances_: {dt.feature_importances_.round(3)}")
print(f"  .max_features_:        {dt.max_features_}")

# ── OneHotEncoder attributes after fit() ─────────────────────────────
restaurants = np.array(['Pizza Hut','KFC','Dominos','Pizza Hut','KFC']).reshape(-1,1)
ohe = OneHotEncoder(sparse_output=False).fit(restaurants)
print("\nOneHotEncoder attributes:")
print(f"  .categories_:      {ohe.categories_}")
print(f"  .get_feature_names_out(): {ohe.get_feature_names_out(['restaurant'])}")
The power of the unified interface

Switching algorithms by changing one word — this is the entire point

The reason sklearn uses a unified interface is so you can compare multiple algorithms with almost zero extra code. The preprocessing stays identical. The evaluation stays identical. Only the model object changes. This is how data scientists actually work — they run several algorithms and pick the one that performs best.

python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 2000
distance  = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic   = np.random.randint(1, 11, n).astype(float)
prep      = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
value     = np.abs(np.random.normal(350, 150, n)).clip(50, 1200)
delivery  = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
             + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep, value])
y = delivery

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler     = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ── Every algorithm has the same interface ────────────────────────────
# Change only the model object — everything else stays the same
models = {
    'Linear Regression':      LinearRegression(),
    'Ridge Regression':       Ridge(alpha=1.0),
    'Lasso Regression':       Lasso(alpha=0.1),
    'Decision Tree':          DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest':          RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting':      GradientBoostingRegressor(n_estimators=100, random_state=42),
    'K-Nearest Neighbours':   KNeighborsRegressor(n_neighbors=10),
}

print(f"{'Algorithm':<25} {'Train MAE':>10} {'Test MAE':>10} {'CV MAE (5-fold)':>16}")
print("─" * 68)

for name, model in models.items():
    # ── SAME THREE LINES for every single algorithm ─────────────────
    model.fit(X_train_sc, y_train)                    # 1. train
    y_pred = model.predict(X_test_sc)                 # 2. predict
    test_mae = mean_absolute_error(y_test, y_pred)    # 3. evaluate
    # ────────────────────────────────────────────────────────────────

    train_mae = mean_absolute_error(y_train, model.predict(X_train_sc))
    cv_scores = cross_val_score(model, X_train_sc, y_train,
                                 cv=5, scoring='neg_mean_absolute_error')
    cv_mae = -cv_scores.mean()

    print(f"  {name:<23} {train_mae:>10.2f} {test_mae:>10.2f} {cv_mae:>16.2f}")

# ── score() method — quick accuracy check ────────────────────────────
# For regressors: returns R² score
# For classifiers: returns accuracy
print(f"\nR² scores (model.score()):")
for name, model in list(models.items())[:3]:
    r2 = model.score(X_test_sc, y_test)
    print(f"  {name:<25}: R² = {r2:.4f}")
The most important sklearn tool

Pipeline — chain preprocessing and modelling into one object

Every ML workflow has multiple steps: impute missing values, scale numeric features, encode categorical features, then train the model. Without Pipeline you write these as separate steps, manually tracking which scaler was fit on which data — and inevitably making the leakage mistake (fitting on the full dataset instead of just the training fold).

Pipeline chains all steps into one object. When you call pipeline.fit(X_train), it fits each step on the training data automatically. When you call pipeline.predict(X_test), it applies each step's stored statistics — never refitting. Data leakage becomes structurally impossible.

🧠 Analogy — read this first

A Pipeline is like an assembly line in a factory. Raw materials (data) enter at one end. Each station performs one operation — wash, cut, assemble, paint. The finished product (predictions) comes out at the other end. The assembly line has a fixed order. Each station knows exactly what state the material is in when it arrives.

Without Pipeline you are doing each factory step manually and carrying the half-finished product between stations yourself — error-prone, slow, and easy to do in the wrong order.

Pipeline flow — raw data in, predictions out
Raw X
Step 1
Imputer
.fit_transform()
Step 2
Scaler
.fit_transform()
Step 3
Model
.fit()
Predictions

During fit(): each step calls fit_transform() on the training data in sequence. During predict(): each step calls transform() (not fit!) — using the stored statistics from training.

python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
n = 2000
restaurants = ['Pizza Hut','KFC','Dominos','Biryani Blues',"McDonald's",'Subway']
cities      = ['Bangalore','Mumbai','Delhi','Hyderabad','Pune']

df = pd.DataFrame({
    'distance_km':    np.abs(np.random.normal(4, 2, n)).clip(0.5, 15),
    'traffic_score':  np.random.randint(1, 11, n).astype(float),
    'restaurant_prep': np.abs(np.random.normal(15, 5, n)).clip(5, 35),
    'order_value':    np.abs(np.random.normal(350, 150, n)).clip(50, 1200),
    'restaurant':     np.random.choice(restaurants, n),
    'city':           np.random.choice(cities, n),
})
y = (8.6 + 7.3*df['distance_km'] + 0.8*df['restaurant_prep']
     + 1.5*df['traffic_score'] + np.random.normal(0, 4, n)).clip(10, 120)

# Introduce some missing values
df.loc[np.random.choice(n, 80, replace=False),  'restaurant_prep'] = np.nan
df.loc[np.random.choice(n, 40, replace=False),  'traffic_score']   = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42
)

NUM_COLS = ['distance_km', 'traffic_score', 'restaurant_prep', 'order_value']
CAT_COLS = ['restaurant', 'city']

# ── Step 1: Build sub-pipelines for each column type ──────────────────
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),   # fills NaN with median
    ('scaler',  StandardScaler()),                    # standardise to mean=0, std=1
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),   # fills NaN with mode
    ('encoder', OneHotEncoder(handle_unknown='ignore',      # unknown → all zeros
                               sparse_output=False,
                               drop='first')),
])

# ── Step 2: ColumnTransformer — apply different pipelines to different columns
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline,    NUM_COLS),
    ('cat', categorical_pipeline, CAT_COLS),
])

# ── Step 3: Full pipeline — preprocessor + model ──────────────────────
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        Ridge(alpha=1.0)),
])

# ── The pipeline is now ONE sklearn object ─────────────────────────────
# fit() trains the entire chain on training data only
pipeline.fit(X_train, y_train)

# predict() applies the entire chain to new data
y_pred = pipeline.predict(X_test)
print(f"Pipeline (Ridge) Test MAE: {mean_absolute_error(y_test, y_pred):.2f} min")

# ── Cross-validation with pipeline — leakage-proof ────────────────────
# The pipeline is refit from scratch in each fold
# Scaler/encoder statistics are never contaminated by validation data
cv_scores = cross_val_score(
    pipeline, df, y, cv=5,
    scoring='neg_mean_absolute_error'
)
print(f"5-fold CV MAE: {-cv_scores.mean():.2f} ± {cv_scores.std():.2f} min")

# ── Switching the model — change ONE word ─────────────────────────────
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)),
])
pipeline_rf.fit(X_train, y_train)
y_pred_rf = pipeline_rf.predict(X_test)
print(f"Pipeline (RF)   Test MAE: {mean_absolute_error(y_test, y_pred_rf):.2f} min")

Accessing individual steps inside a fitted Pipeline

python
# After fitting, you can inspect any step inside the pipeline

# Access by name
fitted_scaler  = (pipeline.named_steps['preprocessor']
                  .named_transformers_['num']
                  .named_steps['scaler'])
fitted_encoder = (pipeline.named_steps['preprocessor']
                  .named_transformers_['cat']
                  .named_steps['encoder'])
fitted_model   = pipeline.named_steps['model']

print("Stored scaler mean (from training data):")
print(f"  {fitted_scaler.mean_.round(2)}")

print("\nStored encoder categories:")
print(f"  {fitted_encoder.categories_}")

print("\nModel coefficients:")
print(f"  {fitted_model.coef_.round(3)}")

# set_params() — change hyperparameters without rebuilding the pipeline
# Uses double underscore __ to navigate into nested steps
pipeline.set_params(model__alpha=10.0)   # change Ridge alpha
pipeline.fit(X_train, y_train)
y_pred_new = pipeline.predict(X_test)
print(f"\nAfter changing alpha to 10: MAE={mean_absolute_error(y_test, y_pred_new):.2f}")
Handling mixed data types

ColumnTransformer — apply different transformations to different columns

Real datasets always have mixed column types. Numeric columns need scaling. Categorical columns need encoding. Text columns need tokenisation. ColumnTransformer lets you define a different transformation for each group of columns and applies them all in parallel, then concatenates the results into one matrix.

ColumnTransformer — the building block of every production pipeline

You define named transformers as a list of tuples: (name, transformer, columns). Each transformer processes its assigned columns independently. The results are concatenated horizontally into one output matrix.

ColumnTransformer([
('name_1', transformer_1, column_list_1),
('name_2', transformer_2, column_list_2),
('name_3', transformer_3, column_list_3),
])

remainder='drop' (default) — columns not listed are dropped. remainder='passthrough' — unlisted columns pass through unchanged.
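A quick sketch of the difference, using a toy two-column frame (the column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'distance_km': [1.0, 4.0, 8.0],
    'is_weekend':  [0.0, 1.0, 0.0],   # deliberately NOT listed below
})

# remainder='drop' (the default): is_weekend silently disappears
ct_drop = ColumnTransformer([('num', StandardScaler(), ['distance_km'])])
print(ct_drop.fit_transform(df).shape)   # → (3, 1)

# remainder='passthrough': is_weekend is kept, untouched
ct_pass = ColumnTransformer([('num', StandardScaler(), ['distance_km'])],
                            remainder='passthrough')
print(ct_pass.fit_transform(df).shape)   # → (3, 2)
```

The default drop behaviour is a common source of "where did my column go?" bugs — be explicit about remainder when in doubt.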

python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                    OneHotEncoder, OrdinalEncoder)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

np.random.seed(42)
n = 500

df = pd.DataFrame({
    # Numeric — need scaling
    'distance_km':    np.abs(np.random.normal(4, 2, n)).clip(0.5, 15),
    'order_value':    np.abs(np.random.normal(350, 150, n)).clip(50, 1200),
    # Ordinal categorical — has a natural order
    'traffic_level':  np.random.choice(['low','medium','high'], n),
    # Nominal categorical — no order
    'city':           np.random.choice(['Bangalore','Mumbai','Delhi'], n),
    'restaurant':     np.random.choice(['KFC','Dominos','Pizza Hut'], n),
    # Binary
    'is_weekend':     np.random.randint(0, 2, n).astype(float),
})

# ── Build a ColumnTransformer for mixed types ─────────────────────────
ct = ColumnTransformer([
    # Numeric columns → StandardScaler (after imputing median for NaN)
    ('num_standard', Pipeline([
        ('imp',   SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), ['distance_km', 'order_value']),

    # Ordinal → OrdinalEncoder with explicit order
    ('ordinal', OrdinalEncoder(
        categories=[['low', 'medium', 'high']],
        handle_unknown='use_encoded_value', unknown_value=-1,
    ), ['traffic_level']),

    # Nominal → OneHotEncoder
    ('nominal', OneHotEncoder(
        sparse_output=False, handle_unknown='ignore', drop='first'
    ), ['city', 'restaurant']),

    # Binary → pass through as-is (no transformation needed)
    ('binary', 'passthrough', ['is_weekend']),
])

# fit on training data
ct.fit(df)
X_transformed = ct.transform(df)

print(f"Original shape:    {df.shape}")
print(f"Transformed shape: {X_transformed.shape}")

# What columns does each transformer produce?
print(f"\nOutput feature names:")
for name in ct.get_feature_names_out():
    print(f"  {name}")

# ── make_column_selector — select columns by dtype automatically ───────
from sklearn.compose import make_column_selector

# Instead of listing column names manually, select by dtype
ct_auto = ColumnTransformer([
    ('num', StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     make_column_selector(dtype_include=object)),
])

# This auto-detects numeric and categorical columns
df_typed = df.copy()
df_typed['city']       = df_typed['city'].astype('category')
df_typed['restaurant'] = df_typed['restaurant'].astype('category')
# make_column_selector with dtype_include='category' would pick these up
Evaluating and tuning properly

cross_val_score and GridSearchCV — the evaluation and tuning tools

A single train/test split gives you one estimate of model performance. It might be lucky or unlucky depending on which samples ended up in each set. Cross-validation runs the train/test split multiple times with different splits and averages the results — giving a much more reliable performance estimate. GridSearchCV combines cross-validation with hyperparameter search.

python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold,
    GridSearchCV, RandomizedSearchCV,
)
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
n = 1500
df = pd.DataFrame({
    'distance_km':    np.abs(np.random.normal(4, 2, n)).clip(0.5, 15),
    'traffic_score':  np.random.randint(1, 11, n).astype(float),
    'restaurant_prep': np.abs(np.random.normal(15, 5, n)).clip(5, 35),
    'order_value':    np.abs(np.random.normal(350, 150, n)).clip(50, 1200),
    'city':           np.random.choice(['Bangalore','Mumbai','Delhi','Hyderabad'], n),
    'restaurant':     np.random.choice(['KFC','Dominos','Pizza Hut','Subway'], n),
})
y = (8.6 + 7.3*df['distance_km'] + 0.8*df['restaurant_prep']
     + 1.5*df['traffic_score'] + np.random.normal(0, 4, n)).clip(10, 120)

NUM_COLS = ['distance_km', 'traffic_score', 'restaurant_prep', 'order_value']
CAT_COLS = ['city', 'restaurant']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='median')),
                      ('sc',  StandardScaler())]), NUM_COLS),
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore',
                           drop='first'), CAT_COLS),
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', Ridge(alpha=1.0)),
])

# ── cross_val_score — most common evaluation ──────────────────────────
scores = cross_val_score(
    pipeline, df, y,
    cv=5,                                    # 5 folds
    scoring='neg_mean_absolute_error',       # negated MAE: sklearn always maximises scores
    n_jobs=-1,                               # parallel across all CPU cores
)
print(f"5-fold CV MAE: {-scores.mean():.2f} ± {scores.std():.2f} min")
print(f"Per-fold scores: {(-scores).round(2)}")

# ── GridSearchCV — exhaustive hyperparameter search ───────────────────
# Syntax: step_name__parameter_name (double underscore)
param_grid = {
    'model__alpha': [0.01, 0.1, 1.0, 10.0, 100.0],
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    verbose=0,
)
grid_search.fit(df, y)

print(f"\nGridSearchCV results:")
print(f"  Best alpha:  {grid_search.best_params_}")
print(f"  Best CV MAE: {-grid_search.best_score_:.2f} min")
print(f"  Best model:  {grid_search.best_estimator_}")

# ── RandomizedSearchCV — faster for large search spaces ───────────────
from scipy.stats import loguniform

pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42, n_jobs=-1)),
])

param_dist = {
    'model__n_estimators':     [50, 100, 200, 300],
    'model__max_depth':        [None, 5, 10, 20],
    'model__min_samples_leaf': [1, 5, 10, 20],
    'model__max_features':     ['sqrt', 'log2', 0.3],
}

random_search = RandomizedSearchCV(
    pipeline_rf,
    param_dist,
    n_iter=20,          # try 20 random combinations (not all 192)
    cv=3,
    scoring='neg_mean_absolute_error',
    random_state=42,
    n_jobs=-1,
)
random_search.fit(df, y)

print(f"\nRandomizedSearchCV results:")
print(f"  Best params: {random_search.best_params_}")
print(f"  Best CV MAE: {-random_search.best_score_:.2f} min")
Errors you will hit

Every common sklearn interface error — explained and fixed

NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' before 'transform'.
Why it happens

You called .transform() or .predict() on an object that has never had .fit() called on it. Common causes: you created a new scaler object but forgot to call .fit(); you saved a model with joblib and loaded it into a new variable but called predict on the wrong variable; or you put transform before fit in your code.

Fix

Always call .fit(X_train) before .transform() or .predict(). Use a Pipeline to make the order enforced automatically — Pipeline.fit() calls each step's fit in order, making it impossible to call transform before fit. Check your variable names: scaler_new = StandardScaler() creates an unfitted object, scaler_fitted.transform() uses the fitted one.
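A minimal reproduction of the error and its fix (toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
try:
    scaler.transform(X)              # transform before fit
except NotFittedError as e:
    print(type(e).__name__)          # → NotFittedError

scaler.fit(X)                        # fit first...
print(scaler.transform(X).shape)     # ...then transform works → (3, 1)
```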

ValueError: X has 6 features, but StandardScaler is expecting 4 features as input.
Why it happens

The number of columns in your test data does not match what was seen during fit(). Most common cause: you added or removed a column between fitting and transforming. Also caused by pd.get_dummies() producing different columns on train and test when categories differ, or when you manually select columns inconsistently.

Fix

Always use the same column selection for fit and transform. Store the column list as a variable: FEATURE_COLS = ['col1', 'col2']. Then use X_train[FEATURE_COLS] and X_test[FEATURE_COLS] consistently. Use ColumnTransformer inside a Pipeline which handles column selection automatically and consistently.
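A sketch of the FEATURE_COLS habit (toy frames; debug_flag is a made-up stray column):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The test frame picked up an extra column the scaler never saw during fit
train = pd.DataFrame({'distance_km': [1.0, 2.0, 3.0], 'traffic': [3.0, 5.0, 7.0]})
test  = pd.DataFrame({'distance_km': [1.5], 'traffic': [4.0], 'debug_flag': [9.0]})

FEATURE_COLS = ['distance_km', 'traffic']     # single source of truth

scaler = StandardScaler().fit(train[FEATURE_COLS])
out = scaler.transform(test[FEATURE_COLS])    # same columns, same order — no ValueError
print(out.shape)                              # → (1, 2)
```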

DataConversionWarning: A column-vector y was passed when a 1d array was expected
Why it happens

You passed y as a 2D array (shape n×1) instead of a 1D array (shape n,). This happens when you use df[['target']] (double brackets → DataFrame) instead of df['target'] (single brackets → Series), or when you do y.reshape(-1, 1) and forget to reverse it.

Fix

Use single brackets for the target column: y = df['target'] not y = df[['target']]. If you have a 2D y, flatten it: y = y.values.ravel() or y = y.squeeze(). Check y.shape — it should be (n,) not (n, 1).
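The shapes side by side, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'target': [1.0, 2.0, 3.0]})

y_2d = df[['target']].values   # double brackets → DataFrame → shape (3, 1)
y_1d = df['target'].values     # single brackets → Series → shape (3,)

print(y_2d.shape, y_1d.shape)  # → (3, 1) (3,)
print(y_2d.ravel().shape)      # flatten an existing 2-D y → (3,)
```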

TypeError: All intermediate steps should be transformers — 'model' (type LinearRegression) does not
Why it happens

You put the model in the middle of a Pipeline instead of at the end. sklearn Pipeline requires that all steps except the last one are transformers (have .transform()). Only the final step can be a pure estimator (model). If you accidentally added the model as step 2 of 3, this error appears.

Fix

Models must always be the last step in a Pipeline. Order: Pipeline([('imputer', imputer), ('scaler', scaler), ('model', model)]). If you want one model's predictions to feed another model (stacking), use StackingRegressor or StackingClassifier rather than placing a model mid-pipeline.

What comes next

You now speak sklearn fluently. The next section puts it to work on real data.

fit, predict, transform, Pipeline, ColumnTransformer, cross_val_score, GridSearchCV — these are the seven tools you will use in every single ML project for the rest of your career. You now know all of them.

Section 4 — Data Engineering for ML — begins next. It starts with the messiest part of every real ML project: getting the data in the first place. REST APIs, SQL databases, Parquet files, web scraping — where ML data actually comes from and how to pull it reliably with Python.

Next — Module 15 · Data Engineering for ML
Data Collection — APIs, SQL, Files and Scraping

Where ML data actually comes from. Pull from REST APIs, query databases, read Parquet files, and scrape web data — all with production-grade Python.


🎯 Key Takeaways

  • sklearn has one unified interface shared by all 200+ algorithms. Three methods cover everything: .fit() learns from data, .transform() applies learned transformations, .predict() makes predictions. Learn this pattern once — use any algorithm.
  • .fit() must only be called on training data. Never on test data. Calling fit on test data leaks information and makes evaluation metrics optimistically wrong. This is the single most important rule in all of sklearn.
  • There are three types of sklearn objects: Estimators (models with fit+predict), Transformers (preprocessors with fit+transform), and objects that are both. After fit(), all learned values are stored as underscore attributes: scaler.mean_, model.coef_, encoder.categories_.
  • Pipeline chains multiple steps into one object. It enforces correct fit/transform order automatically, prevents leakage in cross-validation (each fold refits the entire pipeline on its training portion), and lets you swap models by changing one word.
  • ColumnTransformer applies different transformations to different column groups in parallel. Numeric columns get scaling, categorical get encoding, ordinal get ordinal encoding — all in one object that sklearn treats as a single transformer.
  • GridSearchCV and RandomizedSearchCV find optimal hyperparameters. Always pass a Pipeline to these — never raw data with a separate preprocessing step. Use double underscore syntax to target parameters inside Pipeline steps: model__alpha, preprocessor__num__scaler__with_mean.