Experiment Tracking with MLflow and Weights & Biases
Log every run, compare experiments, version models, register artifacts. Never lose a good experiment again.
Three weeks ago you trained a model that got 94% accuracy. Today you cannot reproduce it. You do not remember the learning rate, the data version, or which features you included. Experiment tracking means this never happens.
Every ML project goes through dozens of experiments — different models, different hyperparameters, different feature sets, different data slices. Without tracking, all of this knowledge lives in your head and in notebook filenames like model_final_v3_actually_final.ipynb. When the model degrades in production six months later, you cannot reproduce the best version. When a new team member joins, the entire experiment history is lost.
Experiment tracking tools solve this by automatically recording every run: the hyperparameters, metrics at every epoch, code version, data version, environment, and output artifacts. Two runs can be compared side by side. The best model can be registered and promoted to production with a full audit trail. Every Indian ML team of more than two people needs this.
A chef's recipe book vs cooking from memory. A chef who cooks from memory might produce excellent dishes — but cannot replicate them exactly next week, cannot scale the recipe for 200 people, and cannot hand the recipe to a junior chef. A chef who writes down every recipe with precise measurements can reproduce any dish, compare two versions of the same dish scientifically, and build on past experiments. Experiment tracking is the recipe book for ML.
The discipline of logging experiments also forces clarity of thought. When you must decide what to log before running an experiment, you think more carefully about what you are trying to learn. Untracked experiments are usually under-thought experiments.
Parameters, metrics, artifacts, and tags — the four things every run must record
MLflow — self-hosted experiment tracking with model registry
MLflow is four tools in one: Tracking (log experiments), Projects (reproducible code packaging), Models (standard model format), and Registry (model versioning and promotion). For most teams the Tracking and Registry components are what matter. By default MLflow logs to local files under ./mlruns — no server or cloud account required. For production: run the MLflow server backed by PostgreSQL and S3.
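The production setup described above comes down to one server command. This is a sketch: the connection string, hostnames, and bucket name are placeholders for your own infrastructure.

```shell
# Hypothetical production MLflow server: PostgreSQL holds runs and the
# model registry, S3 holds artifacts. All names below are placeholders.
mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri postgresql://mlflow:mlflow@db.internal:5432/mlflow \
  --default-artifact-root s3://my-mlflow-artifacts/

# Training jobs and notebooks then point at the server once:
export MLFLOW_TRACKING_URI=http://mlflow.internal:5000
```

With the tracking URI set, the same `mlflow.log_*` calls that wrote to local files now write to the shared server, so the whole team sees one experiment history.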
MLflow Model Registry — version, stage, and promote models safely
The Model Registry is where experiments become deployable artifacts. Every registered model has a version number, a stage (None, Staging, Production, or Archived), and full metadata including which run produced it. Promotion from Staging to Production requires explicit action — this is the deployment gate. The inference service always loads the Production-stage model by name, never by run ID.
Weights & Biases — richer visualisations and collaboration for deep learning
W&B excels where MLflow is weaker: visualising training curves, logging images and audio, comparing runs interactively in a web UI, and team collaboration. The free tier is generous enough for most individual ML engineers. Setup is one line of code — call wandb.init() and W&B captures your config, console output, system metrics, and everything you pass to wandb.log().
Experiment tracking conventions — what to standardise across the team
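One way to enforce team conventions is a thin validator that refuses to start a run without the required fields. This is a stdlib-only sketch: the required keys are a team convention (not an MLflow rule), and a real wrapper would call this before mlflow.start_run() and pass the stamped tags to mlflow.set_tags().

```python
import subprocess
import getpass

# Team convention, not an MLflow rule: adjust to your own standards.
REQUIRED_PARAMS = {"model_type", "dataset_version", "feature_set"}
REQUIRED_TAGS = {"team", "purpose"}

def validate_run(params: dict, tags: dict) -> dict:
    """Reject a run before it starts if required fields are missing,
    and stamp it with the git commit and user automatically."""
    missing = (REQUIRED_PARAMS - params.keys()) | (REQUIRED_TAGS - tags.keys())
    if missing:
        raise ValueError(f"Run rejected, missing fields: {sorted(missing)}")
    stamped = dict(tags)
    stamped["run_by"] = getpass.getuser()
    try:
        stamped["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        stamped["git_commit"] = "unknown"  # not inside a git repo
    return stamped
```

Because the check runs before the tracking call, unidentifiable experiments never reach the server in the first place.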
Every common experiment tracking mistake — explained and fixed
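The reproducibility failure has a mechanical fix: seed every random number generator the run touches, because each library keeps its own. A sketch below; the NumPy and PyTorch branches are guarded so it runs even where those libraries are absent.

```python
import os
import random

def set_all_seeds(seed: int) -> None:
    """Seed every RNG the training run might touch. Seeding one
    library does not seed the others."""
    random.seed(seed)                        # Python stdlib
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)                 # NumPy
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)              # PyTorch CPU
        torch.cuda.manual_seed_all(seed)     # all CUDA devices
    except ImportError:
        pass

set_all_seeds(42)
first = random.random()
set_all_seeds(42)
assert random.random() == first  # same seed, same draw
```

Log the seed itself as a parameter alongside the rest of the run, so that "reproduce the 94% model" starts from a recorded number rather than a guess.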
You can track every experiment. Next: wrap your model in an API and ship it to production.
Experiment tracking gives you a registered model artifact. Module 71 takes that artifact and deploys it — wrapping the model in a FastAPI REST endpoint, containerising it with Docker, and scaling it with Kubernetes. The full deployment path from a pkl file to a production API serving thousands of requests per minute.
Wrap your model in a FastAPI endpoint, containerise with Docker, scale with Kubernetes. Full working deployment of the Swiggy delivery model.
🎯 Key Takeaways
- ✓Experiment tracking automatically records every run: parameters (inputs — hyperparameters, data version, feature set), metrics (outputs — MAE, AUC, training time), artifacts (files — model.pkl, plots, confusion matrices), and tags (labels — team, purpose, ticket). These four categories together make any experiment exactly reproducible.
- ✓MLflow is four tools: Tracking (log runs), Projects (reproducible packaging), Models (standard format), Registry (versioning and promotion). The Tracking and Registry components are what most teams need. Self-host with a PostgreSQL backend and S3 artifact store for production. Free and open source.
- ✓The Model Registry has four stages: None (freshly registered), Staging (under review), Production (serving live traffic), Archived (superseded). The inference service always loads by name and stage — never by run_id. Promotion from Staging to Production is an explicit gate that creates an audit trail.
- ✓W&B excels for deep learning: richer learning curve charts, first-class image/audio logging, built-in hyperparameter sweep agent (Bayesian optimisation across N runs), team reports, and alerts. The free tier covers most individual engineers. One line to start: wandb.init(project="...", config={...}).
- ✓Standardise experiment logging across the team with a shared wrapper class that validates required params and tags before a run starts. Required at minimum: model_type, dataset_version, feature_set, team, purpose. Add git_commit and run_by automatically. Rejected runs cannot pollute the tracking server with unidentifiable experiments.
- ✓Four common failures: runs look identical (enforce naming convention and required tags), artifact store fills up (use S3, set retention policy, gate log_model() on quality threshold), W&B runs stuck as crashed (use context manager or try/finally for wandb.finish()), cannot reproduce (log and set all random seeds — NumPy, PyTorch, Python random, and CUDA each independently).
Discussion
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.