# 67. DBSCAN: Clustering That Handles Messy Data
Last post, K-Means failed on crescent-shaped data. It cut across the natural curves instead of following them. You also had to tell it K upfront. And one outlier could drag a centroid completely off course.

DBSCAN fixes all three problems. It finds clusters based on density, not distance to a centroid. It discovers K automatically. It labels outliers explicitly instead of forcing them into a cluster. Different idea. Different use cases. Worth knowing.

In this post:

- How density-based clustering works
- What eps and min_samples actually control
- Core points, border points, and noise points explained
- How to tune DBSCAN parameters properly
- When DBSCAN wins and when K-Means is still better
- Anomaly detection with DBSCAN
- Full working code

## How density-based clustering works

K-Means asks: what's the nearest centroid? DBSCAN asks: how many neighbors does this point have within radius eps?

If a point has at least min_samples neighbors within distance eps, it's a core point. Core points form the dense heart of a cluster. Points that are within eps of a core point but don't have enough neighbors themselves are border points; they sit on the edge of a cluster. Points that are not within eps of any core point are noise. DBSCAN labels them -1. They don't belong to any cluster.

## Core points, border points, and noise points explained

- Core point: has >= min_samples neighbors within eps
- Border point: within eps of a core point, but has fewer than min_samples neighbors of its own
- Noise point: not within eps of any core point; labeled -1

A code sketch that separates the three types is in the bonus snippets at the end of this post.

## What eps and min_samples actually control

eps sets the neighborhood radius. Increase it and neighborhoods overlap more: fewer, larger clusters and less noise. Shrink it and the data fragments: more clusters, and more points labeled -1. min_samples sets how dense a neighborhood must be to count as a cluster core. Raise it and DBSCAN gets stricter: fewer clusters, more noise.

Two practical notes. First, always scale your features: eps is a distance, and a feature measured in thousands will drown out one measured in fractions. Second, scikit-learn's defaults (eps=0.5, min_samples=5) are only starting points. A common rule of thumb is min_samples around twice the number of features, then tune eps with a k-distance graph (sketched in a bonus snippet at the end of this post).

## How to tune DBSCAN parameters properly

There is no single right eps. A practical approach: scan a range of values and watch three numbers — how many clusters you get, the silhouette score, and, when you have ground-truth labels, the adjusted Rand index. On the Iris data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_iris = StandardScaler().fit_transform(iris.data)
y_iris = iris.target

print("eps    k   silhouette  ARI")
for eps in [0.3, 0.4, 0.5, 0.6, 0.8, 1.0]:
    lbl = DBSCAN(eps=eps, min_samples=5).fit_predict(X_iris)
    n_k = len(set(lbl)) - (1 if -1 in lbl else 0)
    # silhouette needs at least 2 clusters
    sil = silhouette_score(X_iris, lbl) if n_k >= 2 else 0
    ari = adjusted_rand_score(y_iris, lbl)
    print(f"{eps:<6} {n_k:<3} {sil:<11.3f} {ari:.3f}")
```

## When DBSCAN wins and when K-Means is still better

DBSCAN wins when clusters have arbitrary shapes (moons, circles, spirals), when you don't know K in advance, and when the data contains outliers you want flagged rather than absorbed into the nearest cluster. K-Means is still the better tool when clusters are roughly spherical and similar in size, when you need speed on very large datasets, and when every point must end up in a cluster. DBSCAN's main weakness is varying density: a single eps can't fit a dense cluster and a sparse one at the same time. HDBSCAN (linked below) addresses exactly that.

## Anomaly detection with DBSCAN

The -1 label is a feature, not a nuisance. Because DBSCAN explicitly marks points that don't belong to any dense region, you get an anomaly detector for free: fit it, then inspect everything where labels == -1. A sketch is in the bonus snippets at the end of this post.

## Full working code

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def run_dbscan(X, eps=0.5, min_samples=5, plot=True):
    # DBSCAN is distance-based, so scale features first
    X_scaled = StandardScaler().fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)

    # -1 is noise, not a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"Clusters found: {n_clusters}")
    print(f"Noise points:   {n_noise}")

    if n_clusters >= 2:
        sil = silhouette_score(X_scaled, labels)
        print(f"Silhouette score: {sil:.3f}")

    if plot and X.shape[1] == 2:
        plt.figure(figsize=(7, 5))
        unique_labels = set(labels)
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
        for label, color in zip(sorted(unique_labels), colors):
            if label == -1:
                plt.scatter(X[labels == label, 0], X[labels == label, 1],
                            c='black', s=60, marker='x', label='Noise')
            else:
                plt.scatter(X[labels == label, 0], X[labels == label, 1],
                            color=color, s=30, alpha=0.7, label=f'Cluster {label}')
        plt.title(f'DBSCAN Result: {n_clusters} clusters')
        plt.legend()
        plt.savefig('dbscan_result.png', dpi=100)
        plt.show()

    return labels


X_demo, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
labels_demo = run_dbscan(X_demo, min_samples=5)
```

## Cheat sheet

| Task | Code |
|---|---|
| Basic DBSCAN | `DBSCAN(eps=0.5, min_samples=5).fit_predict(X)` |
| Always scale first | `StandardScaler().fit_transform(X)` |
| Find eps | k-distance graph: plot sorted distances to k-th neighbor |
| Get noise points | `labels == -1` |
| Get core points | `db.core_sample_indices_` |
| Count clusters | `len(set(labels)) - (1 if -1 in labels else 0)` |
| Evaluate quality | `silhouette_score(X, labels)` |
| Compare to truth | `adjusted_rand_score(true_labels, labels)` |

## Practice

- Level 1: Cluster `make_circles(noise=0.05)` with DBSCAN, then with K-Means on the same data. Plot both results side by side. Which one correctly identifies the two circles?
- Level 2:
- Level 3:

## Further reading

- Scikit-learn: DBSCAN
- Scikit-learn: Clustering comparison
- Original DBSCAN paper (1996)
- StatQuest: DBSCAN (YouTube)
- HDBSCAN for varying density clusters

Next up, Post 68: PCA: Shrinking Data Without Losing Information. Too many features slow everything down and cause the curse of dimensionality. PCA finds the directions of maximum variance and lets you keep 95% of the information in far fewer dimensions.
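## Bonus: separating core, border, and noise points in code

The definitions from earlier map directly onto scikit-learn's fitted attributes. This is a minimal sketch: `core_sample_indices_` and `labels_` are real DBSCAN attributes, while the mask logic and the `make_moons` setup are my own illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Core points: indices scikit-learn stores after fitting
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points: labeled -1
noise_mask = db.labels_ == -1

# Border points: in a cluster, but not dense enough to be core
border_mask = ~core_mask & ~noise_mask

print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")
```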
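## Bonus: picking eps with a k-distance graph

The tuning section and the cheat sheet mention the k-distance graph without showing code, so here is a minimal sketch. The helper name `k_distance_plot` is hypothetical; the underlying `NearestNeighbors` API is real scikit-learn.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def k_distance_plot(X, k=5):
    # Hypothetical helper: plot each point's distance to its k-th nearest
    # neighbor, sorted ascending. k + 1 because each point is returned
    # as its own nearest neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    k_dist = np.sort(distances[:, -1])
    plt.plot(k_dist)
    plt.xlabel('Points sorted by k-distance')
    plt.ylabel(f'Distance to {k}-th nearest neighbor')
    plt.title('k-distance graph: the elbow is an eps candidate')
    plt.show()
    return k_dist


X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
k_distance_plot(StandardScaler().fit_transform(X), k=5)
```

Read eps off the y-axis where the curve bends sharply upward: points to the right of that elbow sit in noticeably sparser regions than the rest of the data.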
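## Bonus: using the noise labels as an anomaly detector

A minimal sketch of the anomaly-detection idea: fit DBSCAN, then treat everything labeled -1 as a candidate outlier. The injected-outlier dataset and the eps value here are illustrative assumptions, not tuned results.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Illustrative data: three dense blobs plus a handful of injected outliers
rng = np.random.default_rng(42)
X_normal, _ = make_blobs(n_samples=290, centers=3, cluster_std=0.6,
                         random_state=42)
X_outliers = rng.uniform(low=-12, high=12, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Everything DBSCAN refused to assign to a cluster is a candidate anomaly
anomalies = X[labels == -1]
print(f"Flagged {len(anomalies)} of {len(X)} points as anomalies")
```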
