62. Naive Bayes: Fast, Simple, Surprisingly Effective
Your email spam filter makes a decision in milliseconds. Thousands of words, instant classification. Most of the algorithms we've covered so far would struggle with that. KNN needs to compute distances across thousands of features. SVMs slow down in high dimensions. Even tree-based models take time. Naive Bayes does it in one pass: it counts words, multiplies probabilities, and picks the class with the highest probability. Done. It has been doing this since the 1990s and it still works.

In this post:

- What Bayes theorem is in plain words, not symbols
- Why the naive assumption works even when it is wrong
- The three variants: Gaussian, Multinomial, Bernoulli
- Building a text classifier from scratch
- When Naive Bayes wins and when it loses
- Full working code with scikit-learn

Bayes Theorem in Plain Words

You want to know: given that this email contains the word "casino", what is the probability it is spam? That's a conditional probability, written as:

P(spam | word = "casino")

Bayes theorem says you can calculate this using things you already know from training data:

P(spam | casino) = P(casino | spam) * P(spam) / P(casino)

In words:

- P(casino | spam): how often the word "casino" appears in spam emails. You know this from training data.
- P(spam): what fraction of all emails are spam. You know this too.
- P(casino): how often "casino" appears in any email. Also known.

So you can calculate the probability that an email is spam, given that it contains "casino", using nothing but counts from your training data.

For classification you don't even need the denominator P(casino), because it is the same for every class. You just compare:

P(spam | casino) vs P(not spam | casino)

Whichever is bigger wins.

The Naive Assumption

Real emails have many words. You need:

P(spam | word1, word2, word3, ..., word1000)

Estimating the joint probability of all those words together is practically impossible; you would never have enough data. The naive assumption: treat every word as independent of every other word, given the class. Pretend that seeing "casino" tells you nothing about whether "free" also appears.

P(spam | word1, word2, ..., wordN) ∝ P(word1 | spam) * P(word2 | spam) * ... * P(wordN | spam) * P(spam)

Now you just multiply individual word probabilities, and those you can estimate easily from training data.

Is this assumption true? Absolutely not. Words in emails are not independent at all; "free" and "money" tend to appear together in spam. Does it work anyway? Yes, shockingly well. The reason is that even a wrong independence assumption usually leads to the right class comparison: the relative ordering of the class probabilities tends to be preserved even when the absolute probabilities are off.
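To make the counting concrete, here is a minimal from-scratch sketch of that decision rule. The tiny dataset and every count in it are invented purely for illustration:

```python
import math
from collections import Counter

# Toy training data: (text, label) where 1 = spam, 0 = not spam.
# These emails are made up for illustration only.
train = [
    ("free casino money", 1),
    ("casino chips free", 1),
    ("win a free prize", 1),
    ("project meeting tomorrow", 0),
    ("team lunch on friday", 0),
    ("review the project report", 0),
]

word_counts = {0: Counter(), 1: Counter()}   # word counts per class
class_counts = Counter()                     # how many emails per class
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label, alpha=1.0):
    """log P(class) + sum of log P(word | class), with Laplace smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        score += math.log((word_counts[label][word] + alpha) /
                          (total + alpha * len(vocab)))
    return score

email = "free casino offer"
prediction = max([0, 1], key=lambda c: log_score(email, c))
print("spam" if prediction == 1 else "not spam")   # -> spam
```

Adding logs instead of multiplying raw probabilities avoids numerical underflow when an email has hundreds of words; scikit-learn does the same thing internally.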
The Three Variants

Different variants handle different types of features:

- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes

Gaussian Naive Bayes

Gaussian NB handles continuous numeric features by assuming each feature follows a normal distribution within each class. Here it is on the iris dataset:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Output:

```
Gaussian NB Accuracy: 0.967

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
```

What the model actually learned: for each feature and each class, it calculated the mean and variance. At prediction time, it checks how likely each feature value is under each class's distribution.

```python
# What GaussianNB learned
import pandas as pd

print("Class means for each feature:")
means_df = pd.DataFrame(
    gnb.theta_,
    columns=iris.feature_names,
    index=iris.target_names
)
print(means_df.round(2))
```

Output:

```
Class means for each feature:
            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
setosa                   5.01              3.43               1.46              0.25
versicolor               5.94              2.77               4.26              1.33
virginica                6.59              2.97               5.55              2.03
```

These means tell the whole story. Virginica has the longest petals, setosa the shortest. When a new flower comes in, the model checks which class's distribution it fits best.
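You can reproduce that "fits best" check by hand: for each class, sum the log Gaussian density of every feature (the naive part) using the learned means and variances, then add the log class prior. A rough sketch, assuming the gnb model from above and a recent scikit-learn where the per-class variances are stored in var_ (older releases called this attribute sigma_):

```python
import numpy as np

# Score one test flower by hand and compare with GaussianNB.predict.
x = X_test[0]

log_scores = []
for k in range(len(gnb.classes_)):
    mean, var = gnb.theta_[k], gnb.var_[k]
    # Sum of per-feature log Gaussian densities: the independence assumption
    log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    log_scores.append(np.log(gnb.class_prior_[k]) + log_likelihood)

print("manual pick: ", iris.target_names[np.argmax(log_scores)])
print("sklearn pick:", iris.target_names[gnb.predict([x])[0]])
```

scikit-learn also adds a tiny variance-smoothing term, so the raw scores can differ in the last decimals, but the chosen class should match.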
Multinomial Naive Bayes: Text Classification

This is where Naive Bayes really shines. Let's build a spam classifier.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np

# Simple spam dataset
emails = [
    # Spam
    ("Get rich quick! Free money! Click here now!", 1),
    ("You won a prize! Claim your free casino chips!", 1),
    ("Cheap meds online! No prescription needed!", 1),
    ("URGENT: Your account needs verification. Click now!", 1),
    ("Make money from home! Easy income guaranteed!", 1),
    ("Free Viagra! Cialis! Lowest prices online!", 1),
    ("Congratulations you have been selected for a prize!", 1),
    ("Win big today! Limited time casino offer!", 1),
    # Not spam
    ("Hey, are we still meeting for lunch tomorrow?", 0),
    ("The quarterly report is ready for your review.", 0),
    ("Can you send me the project files?", 0),
    ("Meeting rescheduled to 3pm on Thursday.", 0),
    ("Your order has been shipped. Track it here.", 0),
    ("Thanks for the presentation today, great work!", 0),
    ("Please review the attached document and let me know.", 0),
    ("Team lunch is on Friday at noon, see you there!", 0),
]

texts, labels = zip(*emails)
texts = list(texts)
labels = list(labels)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Convert text to word counts
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train Multinomial NB
mnb = MultinomialNB(alpha=1.0)  # alpha=1 is Laplace smoothing
mnb.fit(X_train_counts, y_train)

y_pred = mnb.predict(X_test_counts)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=['not spam', 'spam']))
```

This is the most interesting part. You can see exactly which words push toward spam and which toward not-spam.

```python
# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Log probabilities for each class
log_probs = mnb.feature_log_prob_  # shape: (n_classes, n_features)

spam_log_probs = log_probs[1]
notspam_log_probs = log_probs[0]

# Words most associated with spam
spam_word_scores = pd.DataFrame({
    'Word': feature_names,
    'Spam prob': np.exp(spam_log_probs),
    'Ham prob': np.exp(notspam_log_probs),
    'Diff': spam_log_probs - notspam_log_probs
}).sort_values('Diff', ascending=False)

print("Top words that scream SPAM:")
print(spam_word_scores.head(10)[['Word', 'Diff']].to_string(index=False))

print("\nTop words that scream NOT SPAM:")
print(spam_word_scores.tail(10)[['Word', 'Diff']].to_string(index=False))
```

Now try it on emails the model has never seen:

```python
new_emails = [
    "Free money! You won! Click here!",
    "Can we schedule a call for next week?",
    "Exclusive casino offer just for you, free chips!",
    "The project deadline has been moved to Friday.",
]

new_counts = vectorizer.transform(new_emails)
predictions = mnb.predict(new_counts)
probabilities = mnb.predict_proba(new_counts)

for email, pred, proba in zip(new_emails, predictions, probabilities):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"[{label}] (confidence: {max(proba):.1%})")
    print(f"  '{email[:60]}...'" if len(email) > 60 else f"  '{email}'")
    print()
```

Output:

```
[SPAM] (confidence: 99.8%)
  'Free money! You won! Click here!'

[NOT SPAM] (confidence: 94.2%)
  'Can we schedule a call for next week?'

[SPAM] (confidence: 99.1%)
  'Exclusive casino offer just for you, free chips!'

[NOT SPAM] (confidence: 91.7%)
  'The project deadline has been moved to Friday.'
```
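Under the hood, the prediction really is the naive product done in log space: multiply each word count by its log probability under a class, sum, and add the class log prior. A small sketch that recomputes those scores for the new emails and checks them against predict (it assumes the vectorizer, mnb, and new_emails objects defined above):

```python
import numpy as np

# score(class) = log P(class) + sum over words of count * log P(word | class)
counts = vectorizer.transform(new_emails)                 # sparse word-count matrix
scores = counts @ mnb.feature_log_prob_.T + mnb.class_log_prior_

manual_pred = np.asarray(scores).argmax(axis=1)
print("manual: ", manual_pred)
print("sklearn:", mnb.predict(counts))   # should pick the same classes
```

Exponentiating and normalizing those scores per row is exactly what predict_proba reports as the confidence values above.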
TF-IDF Instead of Raw Counts

Raw word counts give too much weight to common words. TF-IDF (Term Frequency-Inverse Document Frequency) adjusts for this: words that appear in many documents get lower weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline with TF-IDF
tfidf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('nb', MultinomialNB(alpha=0.1))
])

tfidf_pipeline.fit(X_train, y_train)
y_pred_tfidf = tfidf_pipeline.predict(X_test)
print(f"TF-IDF + Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_tfidf):.3f}")
```

Bernoulli Naive Bayes

When you only care whether a word appears at all, not how many times:

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# BernoulliNB works with binary features (word present or not)
bin_vectorizer = CountVectorizer(binary=True, stop_words='english')
X_train_bin = bin_vectorizer.fit_transform(X_train)
X_test_bin = bin_vectorizer.transform(X_test)

bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_bin, y_train)
y_pred_b = bnb.predict(X_test_bin)
print(f"Bernoulli NB Accuracy: {accuracy_score(y_test, y_pred_b):.3f}")
```

When to use which:

- Gaussian NB: continuous numeric features
- Multinomial NB: word counts, TF-IDF, frequency data
- Bernoulli NB: binary features, short text, word presence/absence

Laplace Smoothing

What if a word appears in test data but never appeared in training? Its probability would be 0, and 0 multiplied by anything is 0: the whole prediction collapses. Laplace smoothing fixes this by adding a small count to every word, even unseen ones.

```python
# alpha controls smoothing
# alpha=1.0 is classic Laplace smoothing
# alpha=0.1 is lighter smoothing - better when you have lots of data
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]:
    mnb_a = MultinomialNB(alpha=alpha)
    mnb_a.fit(X_train_counts, y_train)
    acc = accuracy_score(y_test, mnb_a.predict(X_test_counts))
    print(f"alpha={alpha:<5} accuracy={acc:.3f}")
```

alpha=1.0 is the safe default. On larger datasets, try smaller values like 0.1.

A Bigger Test: 20 Newsgroups

Let's test on a real text classification benchmark: the 20 Newsgroups dataset. To keep things fast, we'll use 4 of its 20 categories.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import time

# Load 4 categories for speed
categories = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']
train_data = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers'))
test_data = fetch_20newsgroups(subset='test', categories=categories,
                               remove=('headers', 'footers'))

print(f"Training documents: {len(train_data.data)}")
print(f"Testing documents: {len(test_data.data)}")
print(f"Categories: {categories}")

# Build and train pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('nb', MultinomialNB(alpha=0.1))
])

start = time.time()
pipeline.fit(train_data.data, train_data.target)
train_time = time.time() - start

start = time.time()
y_pred = pipeline.predict(test_data.data)
predict_time = time.time() - start

acc = accuracy_score(test_data.target, y_pred)
print(f"\nAccuracy: {acc:.3f}")
print(f"Train time: {train_time:.3f}s")
print(f"Predict time: {predict_time:.3f}s")
```

Output:

```
Training documents: 2169
Testing documents: 1444
Categories: ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']

Accuracy: 0.941
Train time: 0.043s
Predict time: 0.008s
```

94.1% accuracy. Trained in 0.04 seconds. Predicted 1444 documents in 0.008 seconds. That speed is the whole point. Neural networks will get higher accuracy on text, but if you need something fast, interpretable, and good enough, Naive Bayes is hard to beat.
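One refinement before moving on: instead of guessing alpha, you can let cross-validation pick it. A minimal sketch using GridSearchCV on the same pipeline; the grid values here are illustrative, not tuned recommendations:

```python
from sklearn.model_selection import GridSearchCV

# Search over smoothing strength and TF-IDF n-gram range with 5-fold CV.
param_grid = {
    'nb__alpha': [0.01, 0.05, 0.1, 0.5, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
}
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
search.fit(train_data.data, train_data.target)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```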
When Naive Bayes Wins and When It Loses

Use Naive Bayes when:

- Text classification: spam, sentiment, topic classification
- The dataset is small. NB needs very little data to work well.
- You need fast training and prediction at scale
- You want a quick, solid baseline before trying complex models
- Features are truly or mostly independent (rare, but it happens)

Skip Naive Bayes when:

- Features are strongly correlated. The naive assumption causes big problems.
- You need very high accuracy and have enough data for complex models
- You have numeric features with complex non-linear relationships
- You need the probability estimates to be accurate, not just the class ranking

Here is a quick comparison of all three variants on one numeric dataset (breast cancer):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler  # MultinomialNB needs non-negative input

data = load_breast_cancer()
X_bc, y_bc = data.data, data.target

# MinMaxScaler for MultinomialNB (needs non-negative features)
X_scaled = MinMaxScaler().fit_transform(X_bc)

models = {
    'GaussianNB': (GaussianNB(), X_bc),
    'MultinomialNB': (MultinomialNB(alpha=1), X_scaled),
    'BernoulliNB': (BernoulliNB(alpha=1), X_bc),
}

print(f"{'Model':<18} {'CV Mean':<10} {'CV Std'}")
print("-" * 38)
for name, (model, X_use) in models.items():
    scores = cross_val_score(model, X_use, y_bc, cv=5)
    print(f"{name:<18} {scores.mean():<10.3f} {scores.std():.3f}")
```

Output:

```
Model              CV Mean    CV Std
--------------------------------------
GaussianNB         0.939      0.020
MultinomialNB      0.898      0.022
BernoulliNB        0.627      0.033
```

GaussianNB wins on numeric data, as expected. MultinomialNB is mediocre on numeric data but excellent on text. BernoulliNB struggles here because it binarizes every feature (at zero by default), which throws away most of the information in continuous values.

Cheat Sheet

| Task | Code |
| --- | --- |
| Numeric features | `GaussianNB()` |
| Word counts / TF-IDF | `MultinomialNB(alpha=1.0)` |
| Binary features | `BernoulliNB(alpha=1.0)` |
| Text vectorization | `TfidfVectorizer(stop_words='english')` |
| Full text pipeline | `Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])` |
| Get probabilities | `.predict_proba(X)` |
| See word probabilities | `np.exp(model.feature_log_prob_)` |
| Tune smoothing | try alpha values from 0.01 to 5.0 |

Practice:

- Level 1:
- Level 2:
- Level 3: Classify movie reviews (the movie_reviews corpus from NLTK works). Compare CountVectorizer vs TfidfVectorizer with MultinomialNB. Which gives better accuracy? Try tuning alpha with cross-validation.

Further reading:

- Scikit-learn: Naive Bayes
- Scikit-learn: Text feature extraction
- StatQuest: Naive Bayes (YouTube)
- 20 Newsgroups dataset
- Bayes theorem visualized

Next up, Post 63: Confusion Matrix: What Your Model Got Wrong and Why. TP, TN, FP, FN explained properly with real examples. The one tool that tells you exactly where your model is failing.
