DVC — Data Version Control
Version datasets like code. DVC pipelines, remote storage, experiment tracking, and the full DVC + Git workflow for reproducible ML projects.
Your model code is in Git. Your training data is on someone's laptop, or in an S3 bucket with no version history, or in a folder called data_final_v3_use_this. DVC fixes this.
Git tracks code beautifully — every change, every author, every commit. But Git breaks for large files. A 2GB training CSV committed to Git bloats the repository, slows every clone, and makes every checkout painful. More importantly, Git does not understand that a CSV file and the Python script that produced it are connected — if the script changes, Git does not know the CSV is now stale.
DVC (Data Version Control) adds data and model versioning on top of Git. It stores large files in remote storage (S3, GCS, Azure Blob) and keeps tiny pointer files in Git — a .dvc file that is just a hash and a path. When you git checkout an old branch, DVC knows which version of the data that branch used and pulls it from remote storage. Every model in your history has a corresponding dataset version, a feature pipeline version, and a code version. Reproduce any past experiment with two commands: git checkout, then dvc checkout.
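Concretely, the pointer is a few lines of YAML. A hypothetical data.csv tracked with dvc add produces something like this (the hash and size here are made-up illustrations; exact fields vary by DVC version):

```yaml
# data.csv.dvc — committed to Git in place of the 2GB file
outs:
- md5: 3f79bb7b435b05321651daefd374cdc6   # content hash identifying this data version
  size: 2147483648                        # the real file stays in the cache / remote
  path: data.csv
```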
Git is like a library catalogue — it tracks which books exist and where they are filed, but the books themselves are stored on shelves. DVC is the cataloguing system for your ML datasets — it tracks which version of your data exists and stores a reference in Git, while the actual data lives in a remote storage warehouse (S3). When you need the book (data), you check the catalogue (Git + DVC), find the shelf (S3 path), and retrieve it. The catalogue is tiny. The warehouse can hold terabytes.
The .dvc pointer file committed to Git is typically 200 bytes. The actual dataset it references can be 200GB. Git stores the pointer. S3 stores the data. DVC coordinates between them so that git checkout followed by dvc checkout brings back the right data version every time.
Cache, remote, .dvc files — the three pieces that make versioning work
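To see why the cache deduplicates so well, here is a minimal shell sketch of the content-addressable idea behind it — plain coreutils, not DVC itself: each file lands at a path derived from its MD5 hash, with the first two hex characters as a directory (the demo directory name is invented).

```shell
# Mimic DVC's content-addressable cache layout — illustrative only.
echo "label,value" > data.csv
hash=$(md5sum data.csv | cut -d' ' -f1)

# First two hex chars become a directory, the rest the filename.
mkdir -p .dvc-cache-demo/"${hash:0:2}"
cp data.csv .dvc-cache-demo/"${hash:0:2}/${hash:2}"

# Identical content always hashes to the same path, so re-adding
# an unchanged file costs no extra storage.
find .dvc-cache-demo -type f
```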
DVC pipelines — define stages, dependencies, and outputs so reruns are smart
Tracking individual files is useful but DVC pipelines go further. A pipeline defines each processing stage — what inputs it depends on, what command it runs, what outputs it produces. DVC tracks all of these and only reruns a stage when its inputs have changed. If your feature engineering script has not changed and the raw data has not changed, dvc repro skips that stage entirely. The entire ML workflow becomes a reproducible, incremental build system — like Make but for data.
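As a sketch of the shape such a pipeline takes — stage names, script paths, and the parameter key below are illustrative, not from this module's code:

```yaml
# dvc.yaml — an illustrative two-stage pipeline
stages:
  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py       # if neither dep changes, dvc repro skips this stage
      - data/raw.csv
    outs:
      - data/features.csv      # tracked by DVC, stored in the cache
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features.csv      # output of featurize — wires the stages together
    params:
      - train.learning_rate    # read from params.yaml; a change triggers a rerun
    metrics:
      - metrics/train.json:
          cache: false         # small JSON, committed to Git rather than cached
```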
Complete DVC pipeline — four Python scripts driven by params.yaml
dvc exp — run, compare, and select the best experiment without leaving the terminal
DVC experiments extend the pipeline with a lightweight experiment-tracking layer. Run an experiment with modified parameters without creating a new Git commit — DVC saves each run as a lightweight Git reference rather than a commit on your branch. After running several experiments, compare them in a table, pick the best one, and promote it to a full Git commit. This integrates with MLflow and W&B (Module 70) for richer visualisations while keeping the experiment lineage in Git.
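A typical loop looks like this (assumes an initialized DVC repo; the parameter key and the auto-generated experiment name are illustrative — copy the real name from the dvc exp show table):

```shell
# Try two learning rates without polluting Git history
dvc exp run --set-param train.learning_rate=0.01
dvc exp run --set-param train.learning_rate=0.001

dvc exp show                  # table of all experiments with params and metrics
dvc exp apply exp-a1b2c       # bring the winning run's files into the workspace

# Promote the winner to a normal commit
git add dvc.lock params.yaml metrics/
git commit -m "Promote best experiment"
```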
The complete Git + DVC daily workflow for an ML team
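Condensed into commands, a day in that workflow looks roughly like this (branch name and paths are illustrative; assumes a DVC repo with a configured remote):

```shell
git pull && dvc pull                        # sync latest code and data
git checkout -b experiment/new-features     # isolate the experiment on a branch
# ...edit code and params.yaml...
dvc repro                                   # rerun only the stages whose deps changed
dvc metrics diff main                       # compare metrics against main
git add dvc.lock params.yaml metrics/ src/
git commit -m "Experiment: new features"
dvc push                                    # upload new data/model versions to remote
git push -u origin experiment/new-features  # CI runs dvc pull + dvc repro on the PR
```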
Every common DVC mistake — explained and fixed
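One of the most frequent mistakes — running dvc add on source files — can be blocked up front with a .dvcignore in the repo root (the *.py pattern follows this module's recommendation; patterns use .gitignore syntax):

```
# .dvcignore — keep dvc add away from code; code belongs to Git
*.py
```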
Data is versioned. Next: design any ML system from first principles.
Module 75 is the final module of the MLOps section and one of the most practically valuable in the entire track — ML System Design. Given a real-world ML problem (build Swiggy's delivery time prediction system from scratch, or Razorpay's fraud detection system), how do you design the full architecture? Data collection, feature engineering, model selection, serving infrastructure, monitoring, and the tradeoffs at each decision. This is what senior ML engineering interviews test and what every ML architect does on day one of a new project.
Design any ML system from scratch. The framework, tradeoffs, capacity estimation, and how to present it in an interview.
🎯 Key Takeaways
- ✓ DVC adds data and model versioning on top of Git. It stores large files in S3/GCS and keeps tiny .dvc pointer files (200 bytes containing the MD5 hash) in Git. git checkout an old branch, then dvc checkout restores the exact data that branch used. Every model in your history has a corresponding dataset version, feature pipeline version, and code version.
- ✓ Three storage locations work together: Git stores .dvc pointer files and dvc.yaml pipeline definitions (kilobytes), local .dvc/cache stores content-addressable data by MD5 hash (gigabytes), remote S3/GCS stores the shared team copy (same structure as local cache). dvc push uploads local cache to remote. dvc pull downloads from remote to local cache and workspace.
- ✓ DVC pipelines (dvc.yaml) define stages with commands, deps (inputs that trigger reruns), outs (outputs tracked by DVC), params (hyperparameters from params.yaml), and metrics (small JSON files committed to Git). dvc repro only reruns stages where deps have changed — tracked in dvc.lock which must be committed to Git.
- ✓ dvc exp run --set-param key=value runs an experiment with modified hyperparameters without creating a Git commit. dvc exp show compares all experiments in a table. dvc metrics diff HEAD~1 shows metric changes versus the previous commit. The best experiment is promoted with dvc exp apply then committed normally.
- ✓ Never run dvc add on code files (.py, .yaml) — only on data files and model artifacts. Add *.py to .dvcignore to prevent accidental tracking. Always commit dvc.lock to Git — without it, DVC cannot detect what has changed and reruns everything. Commit metrics/ and plots/ files to Git (cache: false in dvc.yaml) so metrics are visible in git log and GitHub.
- ✓ The complete team workflow: git pull && dvc pull (get latest), git checkout -b experiment/name (branch), edit code + params, dvc repro (run changed stages), dvc metrics diff main (compare to main), git add dvc.lock params.yaml metrics/ src/ && git commit, dvc push (upload data), git push. CI/CD runs dvc pull + dvc repro + metric assertions on every PR.