
# 63. Confusion Matrix: What Your Model Got Wrong and Why

DEV Community
Akhilesh

Your model has 95% accuracy. You ship it. Three weeks later someone tells you it's missing 40% of actual fraud cases. You check. The dataset had 95% legit transactions and 5% fraud. Your model just learned to say "not fraud" every single time. 95% accuracy. Zero fraud caught.

That's what happens when you trust accuracy alone. The confusion matrix is the tool that would have caught this immediately. In this post:

- What the four cells of a confusion matrix mean: TP, TN, FP, FN, with real-world examples, not textbook ones
- How to build and read a confusion matrix in Python
- Why class imbalance makes accuracy useless
- How to visualize it properly
- Multi-class confusion matrices

## The Four Cells

Every prediction your model makes falls into one of four buckets. Let's use a disease test as the example because the stakes are obvious.

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| **Actual Positive** | TP                 | FN                 |
| **Actual Negative** | FP                 | TN                 |

- **True Positive (TP):** Model said positive. Actually positive. Correct.
- **True Negative (TN):** Model said negative. Actually negative. Correct.
- **False Positive (FP):** Model said positive. Actually negative. Wrong.
- **False Negative (FN):** Model said negative. Actually positive. Wrong.

In most real problems, FP and FN have very different costs. Missing cancer (FN) is catastrophic. Flagging a legit transaction as fraud (FP) is annoying but fixable. That difference is exactly why you need more than accuracy.

## Building One in Python

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

data = load_breast_cancer()
X, y = data.data, data.target  # 0 = malignant, 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Raw confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print()

# Label what each cell is
tn, fp, fn, tp = cm.ravel()
print(f"True Positives (TP):  {tp}")
print(f"True Negatives (TN):  {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
```

## Which Error Matters More?

The right metric depends on which kind of mistake hurts more:

- **Disease screening:** FN (missed disease) > FP (false alarm) → Optimize for Recall. Catch everything, even if some are false alarms.
- **Spam filter:** FP (blocking legit email) > FN (letting spam through) → Optimize for Precision. Only block what you're sure about.
- **Fraud detection:** FN (missed fraud) > FP (flagging legit transaction) → Optimize for Recall on the fraud class.
- **Hiring tool:** FP (hiring wrong person) ≈ FN (missing good candidate) → Optimize for F1. Balance both.

## Thresholds Change the Matrix

The matrix isn't fixed, either. Move the decision threshold and the cells shift with it.

```python
# See how changing the threshold affects TP, FP, FN, TN
from sklearn.ensemble import RandomForestClassifier

model_prob = RandomForestClassifier(n_estimators=100, random_state=42)
model_prob.fit(X_train, y_train)

proba = model_prob.predict_proba(X_test)[:, 1]  # probability of benign

print(f"{'Threshold':<12} {'TP':<6} {'TN':<6} {'FP':<6} {'FN':<6} {'Recall':<10} {'Precision'}")
print("-" * 60)

for thresh in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    y_pred_t = (proba >= thresh).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    rec = tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0
    prec = tp_t / (tp_t + fp_t) if (tp_t + fp_t) > 0 else 0
    print(f"{thresh:<12} {tp_t:<6} {tn_t:<6} {fp_t:<6} {fn_t:<6} {rec:<10.3f} {prec:.3f}")
```

Output:

```
Threshold    TP     TN     FP     FN     Recall     Precision
------------------------------------------------------------
0.3          72     38     4      0      1.000      0.947
0.4          72     39     3      0      1.000      0.960
0.5          71     40     2      1      0.986      0.973
0.6          70     41     1      2      0.972      0.986
0.7          68     42     0      4      0.944      1.000
0.8          65     42     0      7      0.903      1.000
```

One subtlety: the positive class here is benign, so a false positive is a malignant tumor waved through as benign, and a false negative is a benign patient flagged for follow-up. At threshold 0.3, recall on the benign class is perfect (1.000) but 4 malignant tumors slip through (FP = 4). At 0.7 and above, FP drops to 0: every malignant case gets caught, at the cost of a few benign patients sent for extra review. Which is better? In cancer detection, the threshold that drives FP to zero. In a low-stakes screening tool, somewhere in the middle, maybe 0.6.
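Before moving on to more than two classes, it's worth making the intro's fraud story concrete. The sketch below is illustrative rather than part of the original example: it builds a synthetic 95/5 dataset (made-up data) and uses scikit-learn's `DummyClassifier` as a stand-in for a model that always predicts the majority class. Accuracy lands at 95% while the confusion matrix shows the model catching nothing.

```python
# Illustrative sketch (synthetic data, not the breast cancer set):
# why accuracy lies when classes are imbalanced. A baseline that always
# predicts "legit" scores 95% accuracy while catching zero fraud.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(42)
X_fake = rng.normal(size=(1000, 5))  # features are irrelevant to this baseline
y_fake = np.zeros(1000, dtype=int)
y_fake[:50] = 1                      # 5% fraud, 95% legit

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_fake, y_fake)
y_hat = baseline.predict(X_fake)

print(f"Accuracy: {accuracy_score(y_fake, y_hat):.2f}")  # 0.95
print(confusion_matrix(y_fake, y_hat))
# [[950   0]
#  [ 50   0]]  <- bottom-left cell: all 50 fraud cases are false negatives
```

The entire fraud class lands in the FN cell. Accuracy hides that; the matrix doesn't.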
## Multi-Class Confusion Matrices

With more than two classes, the matrix grows but the same logic applies.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

iris = load_iris()
X_i, y_i = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_i, y_i, test_size=0.2, random_state=42, stratify=y_i
)

model_i = RandomForestClassifier(n_estimators=100, random_state=42)
model_i.fit(X_train_i, y_train_i)
y_pred_i = model_i.predict(X_test_i)

cm_i = confusion_matrix(y_test_i, y_pred_i)
print("Multi-class Confusion Matrix:")
print(cm_i)
print()

# Visualize
fig, ax = plt.subplots(figsize=(7, 5))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm_i, display_labels=iris.target_names
)
disp.plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title('Iris - 3 Class Confusion Matrix')
plt.tight_layout()
plt.savefig('multiclass_cm.png', dpi=100)
plt.show()
```

Output:

```
Multi-class Confusion Matrix:
[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]
```

Reading this: rows are actual classes, columns are predicted. Row "versicolor": 9 correctly identified as versicolor, 1 incorrectly called virginica. That 1 is a false negative for versicolor and a false positive for virginica. The diagonal is always your correct predictions. Off-diagonal cells are errors.

For per-class numbers, pair the matrix with `classification_report`:

```python
print(classification_report(
    y_test_i, y_pred_i, target_names=iris.target_names
))
```

Output:

```
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
```

Every class gets its own precision, recall, and F1, so you can see exactly which classes are problematic. Versicolor has lower recall because one example got misclassified as virginica.

Wrapping all of this into one reusable diagnostic function:

```python
from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, roc_auc_score
)
import numpy as np

def diagnose_model(model, X_test, y_test, class_names, threshold=0.5):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)

    print("=" * 55)
    print("MODEL DIAGNOSIS REPORT")
    print("=" * 55)

    # Overall accuracy
    acc = accuracy_score(y_test, y_pred)
    print(f"\nAccuracy: {acc:.3f}")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print(f"\nConfusion Matrix:")
    print(cm)

    # Per-class report
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # For binary: show TP/TN/FP/FN breakdown
    if len(class_names) == 2:
        tn, fp, fn, tp = cm.ravel()
        print(f"True Positives:  {tp}")
        print(f"True Negatives:  {tn}")
        print(f"False Positives: {fp} <- wrong positive predictions")
        print(f"False Negatives: {fn} <- missed actual positives")
        print(f"\nROC-AUC: {roc_auc_score(y_test, y_proba[:, 1]):.3f}")

    print("=" * 55)

# Use it
diagnose_model(model, X_test, y_test, data.target_names)
```

Every one of these metrics comes straight from the four cells:

| Term | Formula | Meaning |
|------|---------|---------|
| Accuracy | `(TP+TN) / total` | Overall correct % |
| Precision | `TP / (TP+FP)` | When I say positive, am I right? |
| Recall | `TP / (TP+FN)` | Did I catch all actual positives? |
| F1 Score | `2*(P*R)/(P+R)` | Balance of precision and recall |
| Specificity | `TN / (TN+FP)` | Did I correctly identify negatives? |
| FPR | `FP / (FP+TN)` | How often did I false alarm? |
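Each of those formulas is easy to verify straight from the four cells. Here's a minimal sketch, assuming `cm`, `y_test`, and `y_pred` from the breast-cancer example above are still in scope; the hand-computed values should match scikit-learn's built-in metric functions.

```python
# Compute each metric by hand from the confusion matrix, then cross-check
# against sklearn (binary case, positive class = 1).
from sklearn.metrics import precision_score, recall_score, f1_score

tn, fp, fn, tp = cm.ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
f1          = 2 * (precision * recall) / (precision + recall)
specificity = tn / (tn + fp)
fpr         = fp / (fp + tn)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Precision:   {precision:.3f}  (sklearn: {precision_score(y_test, y_pred):.3f})")
print(f"Recall:      {recall:.3f}  (sklearn: {recall_score(y_test, y_pred):.3f})")
print(f"F1:          {f1:.3f}  (sklearn: {f1_score(y_test, y_pred):.3f})")
print(f"Specificity: {specificity:.3f}")
print(f"FPR:         {fpr:.3f}")
```

Specificity and FPR always sum to 1; they are two views of the same negative-class behavior.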
## Quick Reference

| Task | Code |
|------|------|
| Build matrix | `confusion_matrix(y_test, y_pred)` |
| Visualize | `ConfusionMatrixDisplay(cm).plot()` |
| Normalize | `confusion_matrix(y_test, y_pred, normalize='true')` |
| Full report | `classification_report(y_test, y_pred)` |
| Extract TP/TN/FP/FN | `tn, fp, fn, tp = cm.ravel()` (binary only) |

## Practice

- **Level 1:** Load `load_breast_cancer()` and print the confusion matrix for a trained classifier. Calculate precision and recall by hand from the TP, TN, FP, FN values, then verify with `classification_report`.
- **Level 2:** Retrain with `class_weight='balanced'`. How does the confusion matrix change?
- **Level 3:**

## Resources

- Scikit-learn: Confusion Matrix
- Scikit-learn: Classification metrics
- StatQuest: Confusion Matrix (YouTube)
- Google ML Crash Course: Classification

Next up, Post 64: Precision and Recall: Beyond Accuracy. We go deep on the tradeoff between catching everything and being right when you do: the F1 score, when to use each metric, and how to pick the right one for your problem.