Naive Bayes — Probabilistic Text Classification
Bayes theorem applied to classification. Why the naive independence assumption works surprisingly well for spam filters and document classification.
A new email arrives. It contains the words "free", "win", "cash", "claim". How do you know it is spam before reading it fully?
You have seen thousands of emails before. From that experience you know: the word "free" appears in 80% of spam emails but only 5% of legitimate ones. "Win" appears in 70% of spam but 2% of legitimate. "Meeting" appears in 0.1% of spam but 40% of legitimate.
When a new email arrives, you look at the words it contains and ask: given these words, what is the probability this email is spam? You combine the evidence from each word to get an overall probability. If the probability of spam is above 50% you classify it as spam. That is the entire Naive Bayes algorithm.
The "naive" part is an assumption: we treat each word as independent. The presence of "free" and the presence of "cash" in the same email are treated as if they provide completely separate, unrelated evidence. In reality these words are correlated — spam emails often contain both. The assumption is wrong. But it simplifies the math enormously and somehow still works very well in practice.
A doctor diagnosing a patient. The patient has three symptoms: fever, cough, and fatigue. The doctor looks up: how common is fever in patients with flu? How common is cough? How common is fatigue? The doctor combines all three answers — treating each symptom as independent evidence — to reach a diagnosis.
In reality fever, cough, and fatigue are not independent — they often come together in flu. But treating them as independent gives a good enough estimate of "how likely is this flu vs cold vs allergies?" That is the naive assumption, and it works because the errors in each direction often cancel out.
Bayes theorem — update your belief when you see evidence
Bayes theorem (from Module 08) says: the probability of a hypothesis given evidence equals the probability of the evidence given the hypothesis, times the prior probability of the hypothesis, divided by the probability of the evidence. Written in plain English:
"How likely is it that this email is spam, given the words I see?" equals "how likely are these words in a spam email?" times "how common is spam overall?" divided by "how likely are these words in any email?"
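As a minimal sketch, here is that calculation for a single word, using the toy frequencies quoted earlier for "free" and an assumed (made-up) prior that 20% of all email is spam:

```python
# P(spam | "free") via Bayes theorem, using the toy rates from the text.
# The 20% spam prior is an illustrative assumption, not a real statistic.
p_free_given_spam = 0.80   # "free" appears in 80% of spam emails
p_free_given_ham = 0.05    # ...but in only 5% of legitimate ones
p_spam = 0.20              # assumed prior probability of spam
p_ham = 1 - p_spam

# Denominator: P("free") across all email (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.16 / 0.20 = 0.8
```

Seeing "free" alone lifts the spam probability from the 20% prior to 80% — one word, a four-fold update of belief.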
The naive extension — combining multiple features
An email has many words, not just one. To combine evidence from all words we use the naive independence assumption: the probability of seeing all the words together in a spam email equals the product of their individual probabilities. This is the "naive" assumption — words are treated as independent of each other.
We compute this for every class and pick the class with the highest value.
In practice: use log probabilities to avoid numerical underflow from multiplying many small numbers.
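A quick sketch of why that matters (the per-word likelihood of 0.01 is made up): multiplying a few hundred small probabilities underflows to exactly 0.0 in floating point, while summing their logs stays perfectly representable.

```python
import math

word_probs = [0.01] * 200   # 200 words, each with likelihood 0.01 (illustrative)

# Naive product: 0.01 ** 200 = 1e-400, far below the smallest positive float
product = 1.0
for p in word_probs:
    product *= p
print(product)              # underflows to 0.0

# Log-space sum: same information, no underflow
log_sum = sum(math.log(p) for p in word_probs)
print(log_sum)              # ~ -921.0
```

Comparing classes by their summed log probabilities gives the same winner as comparing products, because log is monotonic.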
Three variants — one for each type of feature
"Naive Bayes" is not one algorithm — it is a family. The difference between variants is only in how they model P(feature | class) — the likelihood of seeing each feature value in each class. The right choice depends on what type of features you have.
Laplace smoothing — why a zero probability destroys everything
Imagine a word that appears in test data but never appeared in any spam email in training. Without smoothing, its probability given spam is exactly 0. When you multiply all word probabilities together — which is what Naive Bayes does — a single zero makes the entire product zero. One unseen word makes it impossible to classify the email as spam, no matter how many other spam indicators it contains.
Laplace smoothing (also called additive smoothing) fixes this by adding a small count to every word in the vocabulary — including words that never appeared in a given class. Adding 1 to every word count (alpha=1) ensures no per-class word probability is ever exactly zero, so a single unseen word can no longer wipe out the whole product.
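A minimal sketch of the smoothed estimate, with made-up toy counts and a three-word vocabulary: add alpha to each word's count and alpha times the vocabulary size to the denominator.

```python
# Additive (Laplace) smoothing for P(word | spam), alpha = 1.
# Toy counts: "meeting" was never seen in spam during training (illustrative).
alpha = 1.0
vocab_size = 3
spam_word_counts = {"prize": 40, "cash": 60, "meeting": 0}
total_spam_words = sum(spam_word_counts.values())  # 100

def p_word_given_spam(word):
    # (count + alpha) / (total + alpha * |V|) — never exactly zero
    return (spam_word_counts[word] + alpha) / (total_spam_words + alpha * vocab_size)

print(p_word_given_spam("meeting"))  # 1/103 ≈ 0.0097, small but non-zero
```

The unseen word now contributes a small penalty instead of annihilating the product.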
GaussianNB — Naive Bayes for continuous features
When features are continuous numbers — like delivery distance, order value, or customer age — you cannot count occurrences. GaussianNB assumes each feature follows a Gaussian (normal) distribution within each class. During training it learns the mean and variance of each feature for each class. During prediction it computes how likely the observed feature value is given each class's Gaussian distribution.
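A minimal sketch of what GaussianNB learns under the hood, using made-up delivery distances for two classes: a mean and variance per class, then a Gaussian density evaluated at the new observation.

```python
import math

# Toy delivery distances in km (illustrative, not real data)
late = [8.0, 9.5, 11.0, 10.5]      # orders that arrived late
on_time = [2.0, 3.5, 2.5, 4.0]     # orders that arrived on time

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def gaussian_pdf(x, m, v):
    # Likelihood of x under a normal distribution with mean m, variance v
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

m_late, v_late = mean_var(late)
m_on, v_on = mean_var(on_time)

x = 9.0  # a new order's distance
# Compare class likelihoods (class priors assumed equal for simplicity)
print(gaussian_pdf(x, m_late, v_late) > gaussian_pdf(x, m_on, v_on))  # True
```

A 9 km order sits near the "late" mean of 9.75 km and far from the "on time" mean of 3.0 km, so the late-class density dominates.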
Day-one task — build a Swiggy review sentiment classifier
Your first week at Swiggy's data team. The product manager asks: "Can you automatically classify customer reviews as positive or negative so we can route negative ones to customer support immediately?" 250,000 reviews per month. You need something fast, accurate enough, and deployable by end of week. Naive Bayes is the right answer.
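A sketch of what that deliverable could look like in scikit-learn — the four reviews below are invented stand-ins for the real labelled data you would actually train on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews standing in for the real labelled training set
reviews = [
    "food arrived hot and fresh, loved it",
    "great packaging, quick delivery",
    "order was late and the food was cold",
    "terrible experience, wrong items delivered",
]
labels = ["positive", "positive", "negative", "negative"]

# Word counts -> MultinomialNB with Laplace smoothing (alpha=1.0)
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(reviews, labels)

print(clf.predict(["delivery was late and cold"]))
```

Training is a single counting pass and prediction is a handful of additions in log space, which is what makes a 250,000-review monthly volume comfortable on modest hardware.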
Every common Naive Bayes error — explained and fixed
You have now covered every major classical ML algorithm. Next: ensemble methods that combine them.
Linear Regression, Logistic Regression, Decision Trees, SVM, KNN, Naive Bayes — six algorithms, six different philosophies. Linear regression fits a line. Logistic regression finds a probability boundary. Decision trees grow a flowchart. SVMs maximise a margin. KNN asks its neighbours. Naive Bayes applies Bayes theorem. Each has a domain where it wins.
Module 28 — Random Forest — combines hundreds of decision trees through a technique called bagging. Each tree is trained on a random subset of data with a random subset of features. Their predictions are averaged. The result consistently beats any single tree on almost every tabular dataset. It is one of the first algorithms you should reach for in production.
Bagging, random feature subsets, out-of-bag evaluation, and why Random Forest beats a single tree on almost every real dataset.
🎯 Key Takeaways
- ✓ Naive Bayes uses Bayes theorem to compute the probability of each class given the input features. It picks the class with the highest posterior probability. The "naive" part is treating each feature as independent — wrong in theory, works well in practice.
- ✓ Three variants for three feature types: MultinomialNB for word counts and text (most common), BernoulliNB for binary presence/absence features especially in short texts, GaussianNB for continuous numeric features.
- ✓ Laplace smoothing (alpha parameter) is essential. Without it, a single word that never appeared in training causes the entire probability to become zero. Alpha=1.0 is standard. Tune it with cross-validation — alpha=0.1 often outperforms the default on text.
- ✓ Naive Bayes is one of the fastest ML algorithms — training is a single pass to count frequencies. Prediction is a few multiplications. For high-volume real-time classification (spam, sentiment, support ticket routing) it is often the most practical choice.
- ✓ The independence assumption makes Naive Bayes probabilities overconfident — predictions cluster near 0 and 1. When you need calibrated probabilities, post-process with CalibratedClassifierCV(method="isotonic").
- ✓ Naive Bayes genuinely wins for text classification with small datasets, real-time requirements, or high-dimensional sparse features. For tabular numeric data with strong feature correlations, Logistic Regression or Random Forest almost always outperforms it.