Naive Bayes — Probabilistic Text Classification
Bayes theorem applied to classification. Why the naive independence assumption works surprisingly well for spam filters and document classification.
A new email arrives. It contains the words "free", "win", "cash", "claim". How do you know it is spam before reading it fully?
You have seen thousands of emails before. From that experience you know: the word "free" appears in 80% of spam emails but only 5% of legitimate ones. "Win" appears in 70% of spam but 2% of legitimate. "Meeting" appears in 0.1% of spam but 40% of legitimate.
When a new email arrives, you look at the words it contains and ask: given these words, what is the probability this email is spam? You combine the evidence from each word to get an overall probability. If the probability of spam is above 50% you classify it as spam. That is the entire Naive Bayes algorithm.
The "naive" part is an assumption: we treat each word as independent. The presence of "free" and the presence of "cash" in the same email are treated as if they provide completely separate, unrelated evidence. In reality these words are correlated — spam emails often contain both. The assumption is wrong. But it simplifies the math enormously and somehow still works very well in practice.
A doctor diagnosing a patient. The patient has three symptoms: fever, cough, and fatigue. The doctor looks up: how common is fever in patients with flu? How common is cough? How common is fatigue? The doctor combines all three answers — treating each symptom as independent evidence — to reach a diagnosis.
In reality fever, cough, and fatigue are not independent — they often come together in flu. But treating them as independent gives a good enough estimate of "how likely is this flu vs cold vs allergies?" That is the naive assumption, and it works because the errors in each direction often cancel out.
Bayes theorem — update your belief when you see evidence
Bayes theorem (from Module 08) says: the probability of a hypothesis given evidence equals the probability of the evidence given the hypothesis, times the prior probability of the hypothesis, divided by the probability of the evidence. Written in plain English:
"How likely is it that this email is spam, given the words I see?" equals "how likely are these words in a spam email?" times "how common is spam overall?" divided by "how likely are these words in any email?"
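As a minimal sketch, here is that calculation for a single word, using the toy frequencies quoted earlier for "free" and an assumed (made-up) prior that 20% of all email is spam:

```python
# P(spam | "free") via Bayes theorem, using the toy rates from the text.
# The 20% spam prior is an illustrative assumption, not a real statistic.
p_free_given_spam = 0.80   # "free" appears in 80% of spam emails
p_free_given_ham = 0.05    # ...but in only 5% of legitimate ones
p_spam = 0.20              # assumed prior probability of spam
p_ham = 1 - p_spam

# Denominator: P("free") across all email (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.16 / 0.20 = 0.8
```

Seeing "free" alone lifts the spam probability from the 20% prior to 80% — one word, a four-fold update of belief.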
The naive extension — combining multiple features
An email has many words, not just one. To combine evidence from all words we use the naive independence assumption: the probability of seeing all the words together in a spam email equals the product of their individual probabilities. This is the "naive" assumption — words are treated as independent of each other.
We compute this for every class and pick the class with the highest value.
In practice: use log probabilities to avoid numerical underflow from multiplying many small numbers.
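A quick sketch of why that matters (the per-word likelihood of 0.01 is made up): multiplying a few hundred small probabilities underflows to exactly 0.0 in floating point, while summing their logs stays perfectly representable.

```python
import math

word_probs = [0.01] * 200   # 200 words, each with likelihood 0.01 (illustrative)

# Naive product: 0.01 ** 200 = 1e-400, far below the smallest positive float
product = 1.0
for p in word_probs:
    product *= p
print(product)              # underflows to 0.0

# Log-space sum: same information, no underflow
log_sum = sum(math.log(p) for p in word_probs)
print(log_sum)              # ~ -921.0
```

Comparing classes by their summed log probabilities gives the same winner as comparing products, because log is monotonic.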
Three variants — one for each type of feature
"Naive Bayes" is not one algorithm — it is a family. The difference between variants is only in how they model P(feature | class) — the likelihood of seeing each feature value in each class. The right choice depends on what type of features you have.
Laplace smoothing — why a zero probability destroys everything
Imagine a word that appears in test data but never appeared in any spam email in training. Without smoothing, its probability given spam is exactly 0. When you multiply all word probabilities together — which is what Naive Bayes does — a single zero makes the entire product zero. One unseen word makes it impossible to classify the email as spam, no matter how many other spam indicators it contains.
Laplace smoothing (also called additive smoothing) fixes this by adding a small count to every word in the vocabulary — including words that never appeared in a given class. Adding 1 to every word count (alpha=1) ensures no per-class word probability is ever exactly zero, so a single unseen word can no longer wipe out the whole product.
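A minimal sketch of the smoothed estimate, with made-up toy counts and a three-word vocabulary: add alpha to each word's count and alpha times the vocabulary size to the denominator.

```python
# Additive (Laplace) smoothing for P(word | spam), alpha = 1.
# Toy counts: "meeting" was never seen in spam during training (illustrative).
alpha = 1.0
vocab_size = 3
spam_word_counts = {"prize": 40, "cash": 60, "meeting": 0}
total_spam_words = sum(spam_word_counts.values())  # 100

def p_word_given_spam(word):
    # (count + alpha) / (total + alpha * |V|) — never exactly zero
    return (spam_word_counts[word] + alpha) / (total_spam_words + alpha * vocab_size)

print(p_word_given_spam("meeting"))  # 1/103 ≈ 0.0097, small but non-zero
```

The unseen word now contributes a small penalty instead of annihilating the product.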
GaussianNB — Naive Bayes for continuous features
When features are continuous numbers — like delivery distance, order value, or customer age — you cannot count occurrences. GaussianNB assumes each feature follows a Gaussian (normal) distribution within each class. During training it learns the mean and variance of each feature for each class. During prediction it computes how likely the observed feature value is given each class's Gaussian distribution.
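A minimal sketch of what GaussianNB learns under the hood, using made-up delivery distances for two classes: a mean and variance per class, then a Gaussian density evaluated at the new observation.

```python
import math

# Toy delivery distances in km (illustrative, not real data)
late = [8.0, 9.5, 11.0, 10.5]      # orders that arrived late
on_time = [2.0, 3.5, 2.5, 4.0]     # orders that arrived on time

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def gaussian_pdf(x, m, v):
    # Likelihood of x under a normal distribution with mean m, variance v
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

m_late, v_late = mean_var(late)
m_on, v_on = mean_var(on_time)

x = 9.0  # a new order's distance
# Compare class likelihoods (class priors assumed equal for simplicity)
print(gaussian_pdf(x, m_late, v_late) > gaussian_pdf(x, m_on, v_on))  # True
```

A 9 km order sits near the "late" mean of 9.75 km and far from the "on time" mean of 3.0 km, so the late-class density dominates.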
Day-one task — build a Swiggy review sentiment classifier
Your first week at Swiggy's data team. The product manager asks: "Can you automatically classify customer reviews as positive or negative so we can route negative ones to customer support immediately?" 250,000 reviews per month. You need something fast, accurate enough, and deployable by end of week. Naive Bayes is the right answer.
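A sketch of what that deliverable could look like in scikit-learn — the four reviews below are invented stand-ins for the real labelled data you would actually train on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews standing in for the real labelled training set
reviews = [
    "food arrived hot and fresh, loved it",
    "great packaging, quick delivery",
    "order was late and the food was cold",
    "terrible experience, wrong items delivered",
]
labels = ["positive", "positive", "negative", "negative"]

# Word counts -> MultinomialNB with Laplace smoothing (alpha=1.0)
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(reviews, labels)

print(clf.predict(["delivery was late and cold"]))
```

Training is a single counting pass and prediction is a handful of additions in log space, which is what makes a 250,000-review monthly volume comfortable on modest hardware.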
Every common Naive Bayes error — explained and fixed
You have now covered every major classical ML algorithm. Next: ensemble methods that combine them.
Linear Regression, Logistic Regression, Decision Trees, SVM, KNN, Naive Bayes — six algorithms, six different philosophies. Linear regression fits a line. Logistic regression finds a probability boundary. Decision trees grow a flowchart. SVMs maximise a margin. KNN asks its neighbours. Naive Bayes applies Bayes theorem. Each has a domain where it wins.
Module 28 — Random Forest — combines hundreds of decision trees through a technique called bagging. Each tree is trained on a random subset of data with a random subset of features. Their predictions are averaged. The result consistently beats any single tree on almost every tabular dataset. It is one of the first algorithms you should reach for in production.
Bagging, random feature subsets, out-of-bag evaluation, and why Random Forest beats a single tree on almost every real dataset.
🎯 Key Takeaways
- ✓ Naive Bayes uses Bayes theorem to compute the probability of each class given the input features. It picks the class with the highest posterior probability. The "naive" part is treating each feature as independent — wrong in theory, works well in practice.
- ✓ Three variants for three feature types: MultinomialNB for word counts and text (most common), BernoulliNB for binary presence/absence features especially in short texts, GaussianNB for continuous numeric features.
- ✓ Laplace smoothing (alpha parameter) is essential. Without it, a single word that never appeared in training causes the entire probability to become zero. Alpha=1.0 is standard. Tune it with cross-validation — alpha=0.1 often outperforms the default on text.
- ✓ Naive Bayes is one of the fastest ML algorithms — training is a single pass to count frequencies. Prediction is a few multiplications. For high-volume real-time classification (spam, sentiment, support ticket routing) it is often the most practical choice.
- ✓ The independence assumption makes Naive Bayes probabilities overconfident — predictions cluster near 0 and 1. When you need calibrated probabilities, post-process with CalibratedClassifierCV(method="isotonic").
- ✓ Naive Bayes genuinely wins for text classification with small datasets, real-time requirements, or high-dimensional sparse features. For tabular numeric data with strong feature correlations, Logistic Regression or Random Forest almost always outperforms it.