PCA — Dimensionality Reduction
Turn 100 features into 10 without losing most of the information. Explained variance, scree plots, reconstruction error, and when PCA helps and when it hurts.
You have 200 features describing each customer. Half of them carry redundant information. PCA finds the few directions that capture most of what matters and throws away the rest.
Flipkart's customer dataset has 200 features: order frequency, average spend, category preferences, browsing time, session length, search terms, device type, payment method, return rate, review history, and 190 more. Many of these carry overlapping information. Customers who spend more also tend to buy more often. Customers who browse mobile also tend to use the app. These correlations mean you have 200 columns but far fewer independent dimensions of variation.
Training a model on 200 correlated features causes several problems. It is slow. The model may overfit — too many features for the amount of signal. Distance-based algorithms (KNN, K-Means) suffer from the curse of dimensionality. And visualising the data to understand its structure is impossible in 200 dimensions.
PCA (Principal Component Analysis) solves all of these at once. It finds the directions in the 200-dimensional space along which the data varies the most. These directions — the principal components — are ordered by how much variance they capture. You keep the top k and discard the rest. The result: a dataset with k dimensions instead of 200, where those k dimensions capture most of the meaningful variation in the original data.
You are photographing a 3D sculpture to put on a website. You can only take one photo. From most angles you capture the full shape — height, width, and some sense of depth. From a few bad angles the sculpture looks like a flat line. The best angle is the one that shows the most variation — where the sculpture looks most different from one end to the other.
PCA finds the "best angle" to project high-dimensional data onto a lower-dimensional space — the projection that preserves the most variation. The first principal component is the direction of maximum variance. The second is the direction of maximum remaining variance perpendicular to the first. And so on.
PCA in four steps — from raw data to reduced dimensions
PCA is one of the few ML algorithms you can fully understand mathematically without an advanced background. There are four steps, each with a clear purpose: standardise the features, compute the covariance matrix, find its eigenvectors and eigenvalues, and project the data onto the top k eigenvectors.
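The four steps can be sketched directly in NumPy. This is a minimal illustration on synthetic data, not a production implementation — scikit-learn's PCA does the same work (via SVD) with better numerics:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.3, size=500)  # a correlated pair

# Step 1: standardise -- zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardised features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition -- eigenvectors are the directions,
# eigenvalues the variance along each; sort descending by variance
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project the data onto the top-k eigenvectors
k = 2
X_reduced = X_std @ eigvecs[:, :k]

print(X_reduced.shape)          # (500, 2)
print(eigvals / eigvals.sum())  # explained variance ratios
```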
Explained variance — the only number that matters when choosing k
Every eigenvalue tells you how much variance one principal component captures. Divide each eigenvalue by the total variance (sum of all eigenvalues) to get the fraction of information that component represents. These fractions are the explained variance ratios. Summing the top k gives you the total variance retained after reducing to k dimensions.
The standard rule of thumb: choose k such that the cumulative explained variance is at least 95%. You retain 95% of the variance in the original data while discarding the remaining dimensions. The 5% you lose is typically noise.
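In scikit-learn the cumulative check takes a few lines, and passing a float to `n_components` makes sklearn pick the smallest k that clears the threshold. The dataset here is a synthetic stand-in: 200 correlated features built from 10 underlying factors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
factors = rng.normal(size=(1000, 10))
# 200 features, each a mix of the 10 factors plus a little noise
X = factors @ rng.normal(size=(10, 200)) + rng.normal(scale=0.1, size=(1000, 200))

X_std = StandardScaler().fit_transform(X)

# n_components=0.95 -> keep the smallest k with >= 95% cumulative variance
pca = PCA(n_components=0.95).fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative[-1])
```

With only 10 real dimensions of variation hiding in 200 columns, PCA compresses by roughly 20x while keeping 95%+ of the variance.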
The scree plot shows variance per component. The dashed green line is the 95% cumulative threshold. The red line shows where you cut — keep components to the left, discard to the right.
Component loadings — what each principal component actually means
After running PCA you have k new dimensions. But what do they mean? Each principal component is a linear combination of the original features. The loadings are the coefficients — how much each original feature contributes to each component. A large positive loading on "order_frequency" and "avg_spend" for PC1 means PC1 measures purchasing intensity. A large loading on "app_sessions" and "pages_per_session" for PC2 means PC2 measures browsing engagement. Naming the components makes PCA results communicable to stakeholders.
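A loadings table is just `components_` with labels attached. The feature names and the two underlying behaviours below are illustrative, mirroring the example above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
spend = rng.normal(size=n)    # latent "purchasing intensity"
engage = rng.normal(size=n)   # latent "browsing engagement"

X = pd.DataFrame({
    "order_frequency":   spend + rng.normal(scale=0.2, size=n),
    "avg_spend":         spend + rng.normal(scale=0.2, size=n),
    "app_sessions":      engage + rng.normal(scale=0.2, size=n),
    "pages_per_session": engage + rng.normal(scale=0.2, size=n),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Rows = components, columns = original features: the loadings
loadings = pd.DataFrame(pca.components_, columns=X.columns,
                        index=["PC1", "PC2"])
print(loadings.round(2))
```

Reading the table row by row, the features with the largest absolute loadings tell you what each component measures.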
PCA inside a Pipeline — the right way to use it before a model
PCA is most commonly used as a preprocessing step before training a model. Reducing from 200 features to 20 before KNN eliminates the curse of dimensionality. Reducing to 50 before logistic regression removes correlated features that cause numerical instability. The key rule: PCA must be fitted on training data only, then applied to test data. Like StandardScaler, fitting PCA on the full dataset leaks test information. Use a Pipeline.
Fitting PCA on the full dataset before splitting is data leakage — the same as fitting a scaler on all data before splitting. The principal components are computed using test set statistics. Always use Pipeline so PCA is refit on the training fold in each cross-validation split, just like every other transformer.
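A minimal sketch of the leak-free pattern, using a synthetic classification dataset and KNN as the downstream model:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=10, random_state=0)

# Scaler and PCA are refit on the training fold of every split -- no leakage
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Passing the whole pipeline to `cross_val_score` is what guarantees the principal components are computed from training data only in each fold.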
Reconstruction error — quantifying the information you discarded
PCA is reversible up to the information you discarded. You can reconstruct an approximation of the original data from the reduced representation. The reconstruction error — the difference between the original and the reconstructed data — tells you exactly how much information was lost. Low reconstruction error means the discarded components were mostly noise. High error means you discarded signal.
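The round trip is `transform` followed by `inverse_transform`. A sketch on synthetic low-rank data (3 true dimensions embedded in 20 noisy features) shows the error staying near the noise floor:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
signal = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20))
X = signal + rng.normal(scale=0.05, size=(300, 20))  # rank-3 signal + noise

pca = PCA(n_components=3).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Mean squared reconstruction error -- the variance you discarded
mse = np.mean((X - X_reconstructed) ** 2)
print(mse)
```

Here the discarded 17 components hold almost nothing but noise, so the reconstruction error is tiny; discarding a signal-bearing component would make it jump.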
When to use PCA — and three situations where it makes things worse
PCA is a powerful tool but it is not appropriate for every problem. Using it blindly on every dataset is a common mistake. The right question before applying PCA: are there actually correlated features in this data that PCA can compress? And does my downstream algorithm benefit from reduced dimensions?
Day-one task — compress Flipkart customer features for segmentation
Every common PCA error — explained and fixed
Classical ML is complete. Section 6 — Evaluation — begins next.
You have now covered every major algorithm in the Classical ML section: linear models, trees, ensembles, instance-based methods, probabilistic models, boosting, unsupervised clustering, and dimensionality reduction. Thirteen modules. Every algorithm a working data scientist reaches for on a typical project.
Section 6 — Model Evaluation — is next. It answers the question every algorithm module assumed you knew: how do you actually know if your model is good? Accuracy is almost never the right metric. Precision, recall, F1, ROC-AUC, PR-AUC, calibration curves, confusion matrices, and the business cost of each type of error. This section makes every model you build defensible to a stakeholder.
🎯 Key Takeaways
- ✓ PCA finds the directions of maximum variance in high-dimensional data (principal components) and projects the data onto the top k of them. The result is a lower-dimensional representation that preserves most of the information. It is built on the eigendecomposition of the covariance matrix from Module 06.
- ✓ Always standardise before PCA. Features with large absolute values (like price in rupees) dominate the covariance matrix and hijack the principal components. StandardScaler before PCA is mandatory, not optional.
- ✓ Explained variance ratio is the key output. Each component captures a fraction of the total variance. Sum the fractions cumulatively and stop at 95% — that is how many components to keep. Use PCA(n_components=0.95) to let sklearn pick k automatically.
- ✓ PCA must be fit inside each cross-validation fold. Fitting on the full dataset before splitting leaks test set information into the covariance matrix. Always use Pipeline([("scaler", StandardScaler()), ("pca", PCA()), ("model", model)]) and pass the whole pipeline to cross_val_score.
- ✓ Do not use PCA before tree models (Random Forest, XGBoost, LightGBM). They handle correlated features natively and do not benefit from dimensionality reduction. PCA genuinely helps distance-based algorithms (KNN, K-Means, SVM) and linear models with correlated features.
- ✓ For sparse data (text, one-hot encoded features), use TruncatedSVD instead of PCA. Standard PCA centers the data first, destroying sparsity and requiring dense matrix storage. TruncatedSVD skips centering and works directly on sparse representations.