Python for Machine Learning
Not Python 101. Python the way ML engineers actually write it — vectorised, readable, and production-ready from day one.
Python for ML is not the same as Python for web development.
You might already know Python. You know how to write a for loop, define a function, use a dictionary. That's enough to follow tutorials. But when you join a data team and open a real ML codebase, you'll see things that look nothing like the tutorials: list comprehensions chained three levels deep, classes with __call__ and __repr__, generator functions yielding batches, decorators wrapping training loops, context managers handling GPU memory, type hints on every function signature.
This module bridges that gap. Every pattern here was pulled from real production ML codebases — sklearn, PyTorch, HuggingFace Transformers, and actual data team repositories. If you know basic Python already, skim the early sections and slow down at generators, OOP for ML, and the sklearn interface. If you're new to Python, read everything.
What this module covers:
Data types and control flow — the ML-relevant subset
Python has many data types. ML uses five of them constantly: integers, floats, strings, lists, and dictionaries. Booleans are a subtype of integers — True is 1, False is 0, which means you can do arithmetic on boolean masks. Understanding this makes NumPy indexing intuitive.
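The boolean-arithmetic point above can be sketched in plain Python (the same idea is what makes NumPy boolean masks work; the variable names here are illustrative):

```python
# True is 1 and False is 0, so booleans support arithmetic.
predictions = [1, 0, 1, 1, 0]
labels      = [1, 0, 0, 1, 1]

# Summing booleans counts the True values.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(correct, accuracy)  # 3 0.6
```

In NumPy the same trick becomes `(predictions == labels).mean()`, which is why knowing that bool is a subtype of int pays off immediately.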
Control flow — what ML code actually looks like
Functions — every pattern ML code actually uses
ML code uses functions heavily — not just simple ones, but functions with default arguments, keyword arguments, variable-length arguments, and closures. Knowing these patterns is the difference between reading a HuggingFace model config and being confused by one.
Default arguments and keyword arguments
*args and **kwargs — flexible interfaces
Lambda functions and closures — functional patterns
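The patterns above can be sketched together in the style of an ML config API. The names (`build_optimizer`, `make_scheduler`) are illustrative, not from any real library:

```python
def build_optimizer(lr=0.01, momentum=0.9, **kwargs):
    """Keyword arguments with defaults; **kwargs absorbs extra options."""
    config = {"lr": lr, "momentum": momentum}
    config.update(kwargs)
    return config

def make_scheduler(base_lr):
    """A closure: the inner function remembers base_lr after we return it."""
    def lr_at(epoch):
        return base_lr * (0.5 ** epoch)   # halve the learning rate each epoch
    return lr_at

opt = build_optimizer(lr=0.1, weight_decay=1e-4)
print(opt)        # {'lr': 0.1, 'momentum': 0.9, 'weight_decay': 0.0001}
sched = make_scheduler(opt["lr"])
print(sched(2))   # 0.025
```

This `**kwargs` shape is exactly how library config objects accept options they don't list explicitly, which is why unfamiliar keyword arguments in a HuggingFace config don't crash the call.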
List, dict and set comprehensions — write less, do more
Comprehensions are one of Python's most powerful features and one of the things that makes ML code look dense to beginners. They are usually faster than the equivalent explicit loop, and once you internalise the pattern they are also easier to read. You'll never go back to multi-line loops for simple transformations.
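All three comprehension forms, on a toy feature table (the data is made up for illustration):

```python
rows = [
    {"name": "a", "value": 3.0,  "label": 1},
    {"name": "b", "value": -1.5, "label": 0},
    {"name": "c", "value": 2.5,  "label": 1},
]

values    = [r["value"] for r in rows]                      # list comprehension
positives = [r["name"] for r in rows if r["label"] == 1]    # with a filter
by_name   = {r["name"]: r["value"] for r in rows}           # dict comprehension
labels    = {r["label"] for r in rows}                      # set comprehension

print(values)       # [3.0, -1.5, 2.5]
print(positives)    # ['a', 'c']
print(by_name["b"]) # -1.5
print(labels)       # {0, 1}
```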
OOP for ML — classes you will write and classes you will use
Every sklearn estimator, every PyTorch module, every HuggingFace model is a class. Understanding OOP is not optional for serious ML work. You need to know how to use existing classes (instantiate, call methods, access attributes) and how to write your own custom transformers and datasets. This section covers both.
Class basics — the sklearn estimator as the template
Custom sklearn transformers — fit into any pipeline
Inheritance and special methods
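The sklearn estimator convention can be sketched without importing sklearn at all, so the pattern itself is visible. This is a minimal standard-scaler-like class, not sklearn's actual implementation:

```python
# __init__ stores hyperparameters, fit() learns from data and returns self,
# transform() applies what was learned. Learned attributes end in _.
class SimpleScaler:
    def __init__(self, with_mean=True):
        self.with_mean = with_mean          # hyperparameter: stored, not used yet

    def fit(self, X):
        n = len(X)
        self.mean_ = sum(X) / n             # learned parameter -> trailing _
        var = sum((x - self.mean_) ** 2 for x in X) / n
        self.scale_ = var ** 0.5 or 1.0     # avoid dividing by zero
        return self                         # returning self enables chaining

    def transform(self, X):
        shift = self.mean_ if self.with_mean else 0.0
        return [(x - shift) / self.scale_ for x in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)

scaled = SimpleScaler().fit_transform([1.0, 2.0, 3.0])
print(scaled)   # roughly [-1.2247, 0.0, 1.2247]
```

A class shaped like this (inheriting from sklearn's `BaseEstimator` and `TransformerMixin`) drops straight into a real `Pipeline`.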
Generators — how ML data loading actually works
A generator is a function that yields values one at a time instead of returning them all at once. This is how every production ML data loader works — loading one batch at a time instead of the entire dataset into memory. A 100GB image dataset doesn't fit in RAM, but a generator that yields one batch at a time never needs it to.
PyTorch's DataLoader, TensorFlow's tf.data.Dataset, and HuggingFace's dataset iterators are all generators under the hood. Understanding generators makes all of these click immediately.
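A minimal sketch of the batching idea. Real loaders add shuffling, workers, and collation, but the core shape is just this:

```python
def batch_generator(data, batch_size):
    """Yield successive batches from data (the last one may be smaller)."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

samples = list(range(10))          # stand-in for a dataset too big for RAM
for batch in batch_generator(samples, batch_size=4):
    print(batch)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]
```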
The iterator protocol — how Python's for loop works
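What a for loop actually does: call `iter()` to get an iterator, then call `next()` until `StopIteration` is raised. A small sketch (the `Countdown` class is illustrative):

```python
data = [10, 20, 30]
it = iter(data)            # for loops start by calling iter()
print(next(it))            # 10
print(next(it))            # 20
print(next(it))            # 30
# next(it) now raises StopIteration, which is how the loop knows to stop.

# A class supports for loops by implementing __iter__ and __next__.
class Countdown:
    def __init__(self, start):
        self.current = start
    def __iter__(self):
        return self
    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1

print(list(Countdown(3)))  # [3, 2, 1]
```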
Decorators — wrapping functions with logging, timing and caching
A decorator is a function that takes a function and returns a modified version of it. They appear everywhere in ML code: @torch.no_grad() on evaluation loops, @staticmethod on utility methods, @property on computed attributes, @lru_cache on expensive computations. Understanding them makes these usages obvious instead of magical.
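A sketch of a timing decorator, the same shape as `@lru_cache` or `@torch.no_grad()`: a function that takes a function and returns a wrapper. `timer` and `train_step` are illustrative names, not library APIs:

```python
import functools
import time

def timer(func):
    @functools.wraps(func)             # keep func's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timer
def train_step(n):
    return sum(i * i for i in range(n))

train_step(100_000)              # prints a line like "train_step took 0.0123s"
print(train_step.__name__)       # train_step, thanks to functools.wraps
```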
Error handling in ML pipelines — failures happen, handle them gracefully
Production ML pipelines fail. An API returns a 503. A CSV file has unexpected nulls. A batch contains all the same class. A GPU runs out of memory mid-epoch. Code that doesn't handle failures fails silently or crashes with an unhelpful message. Code that handles them correctly logs the error, saves state, and either recovers or fails loudly with a useful message.
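The "retry the transient failures, fail loudly after the last attempt" pattern can be sketched like this. `retry` and `flaky_fetch` are illustrative; the simulated 503 stands in for a real service error:

```python
import time

def retry(func, attempts=3, delay=0.0):
    """Call func, retrying on transient errors; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ConnectionError as err:
            print(f"attempt {attempt} failed: {err}")
            if attempt == attempts:
                raise                       # fail loudly with the real error
            time.sleep(delay)

calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("503 from feature store")  # pretend transient error
    return [0.1, 0.2, 0.3]

batch = retry(flaky_fetch, attempts=5)
print(batch)          # [0.1, 0.2, 0.3], on the third attempt
```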
Reading and writing ML artifacts — models, data, configs
ML projects constantly read and write files: training data, trained models, preprocessing stats, experiment configs, evaluation results. Knowing the right format for each type of artifact — and the gotchas of each — saves hours of debugging.
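Two of the most common artifact formats, sketched: JSON for configs and metrics (human-readable, language-neutral), pickle for fitted Python objects (Python-only, and only safe to load from sources you trust). The file names and values are illustrative:

```python
import json
import os
import pickle
import tempfile

config = {"model": "logreg", "lr": 0.01, "features": ["age", "income"]}
stats = {"mean": 3.2, "std": 1.1}          # e.g. fitted preprocessing stats

tmpdir = tempfile.mkdtemp()

# JSON: anything another tool or a human should be able to read.
config_path = os.path.join(tmpdir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Pickle: arbitrary Python objects, e.g. a fitted scaler. Note the "b" modes.
stats_path = os.path.join(tmpdir, "stats.pkl")
with open(stats_path, "wb") as f:
    pickle.dump(stats, f)

with open(config_path) as f:
    loaded_config = json.load(f)
with open(stats_path, "rb") as f:
    loaded_stats = pickle.load(f)
print(loaded_config["lr"], loaded_stats == stats)   # 0.01 True
```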
Type hints, docstrings and virtual environments
Type hints are not required in Python but they are expected in professional ML codebases. They document what a function expects, enable IDE autocompletion and static analysis, and catch type errors before runtime. HuggingFace, sklearn, and PyTorch all use type hints extensively. Writing them yourself is a signal that your code is production-ready.
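A sketch of what hinted code looks like. The function is illustrative; the hints change nothing at runtime but document the contract and feed IDEs and type checkers:

```python
def train_test_split_sizes(
    n_samples: int,
    test_fraction: float = 0.2,
) -> tuple[int, int]:
    """Return (n_train, n_test) for a dataset of n_samples rows."""
    n_test = int(n_samples * test_fraction)
    return n_samples - n_test, n_test

print(train_test_split_sizes(100))        # (80, 20)
print(train_test_split_sizes(7, 0.5))     # (4, 3)
```

The built-in `tuple[int, int]` syntax needs Python 3.9+; older code spells it `Tuple[int, int]` from the `typing` module.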
Virtual environments — reproducible ML projects
Profiling slow code — find the bottleneck before optimising
The first rule of optimisation: don't guess where the bottleneck is, measure it. Python has two simple tools for this — timeit for measuring small snippets, and cProfile for finding which functions in a full pipeline are slow. Most ML slowdowns are in one of three places: data loading, data preprocessing, or a Python loop that should be a NumPy operation.
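Measuring instead of guessing, sketched with `timeit` on a Python loop versus the built-in `sum`, which runs the loop in C. Absolute times depend on your machine; the point is the comparison, not the numbers:

```python
import timeit

def slow_sum(n):
    total = 0
    for i in range(n):      # a Python-level loop over elements
        total += i
    return total

loop_time = timeit.timeit(lambda: slow_sum(10_000), number=200)
builtin_time = timeit.timeit(lambda: sum(range(10_000)), number=200)
print(f"python loop: {loop_time:.4f}s, built-in sum: {builtin_time:.4f}s")
# The built-in is typically several times faster; the exact ratio varies.
```

For a whole pipeline, `python -m cProfile -s cumtime script.py` gives the per-function breakdown instead.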
Every Python error common in ML code — explained and fixed
You now write Python the way ML engineers write it.
The patterns in this module — comprehensions, generators, OOP, decorators, proper error handling — are what separate ML code that only works in a notebook from ML code that works in production. Every module from here uses these patterns naturally.
Module 09 moves to NumPy — the numerical foundation of all ML in Python. With the Python patterns you've just learned, every NumPy operation will make immediate sense: arrays are just vectors and matrices (Module 03), broadcasting is just the rule from Module 03 applied in code, and vectorisation is just avoiding Python loops in favour of matrix operations (Module 04).
The backbone of all numerical ML — arrays, indexing, slicing, broadcasting, and vectorised operations that replace every for loop.
🎯 Key Takeaways
- ✓ ML Python is not beginner Python. The patterns that matter: comprehensions instead of loops, generators for data loading, OOP for custom estimators, decorators for logging and caching, type hints for readability.
- ✓ Default mutable arguments (def f(x=[])) are a classic Python bug — the same list object is reused across calls. Always use None as the default and create the mutable object inside the function.
- ✓ The sklearn estimator interface is the universal pattern: __init__ stores hyperparameters, fit() learns from data and returns self, predict()/transform() applies to new data. Learned parameters get a trailing underscore (model.weights_).
- ✓ Generators yield one value at a time instead of building the full list. This is how every ML data loader works — one batch at a time, never the full dataset in memory. def my_gen(): yield value is all you need.
- ✓ Decorators are functions that wrap other functions. @timer, @retry, @lru_cache, @torch.no_grad() — all follow the same pattern: a function that takes a function and returns a modified function.
- ✓ For performance: measure before optimising. timeit for small snippets, cProfile for full pipelines. Most ML slowdowns come from Python loops over array elements — replace them with vectorised NumPy operations for speedups that are often 100–1000×.
- ✓ Virtual environments are non-negotiable for ML projects. One environment per project, requirements.txt with pinned versions, always activate before installing or running code.