Notifications

No notifications

/Phase 4

Clustering & Unsupervised ML

Clustering & Unsupervised ML — Discovering Hidden Patterns in Data

Unsupervised learning finds structure in unlabeled data — no target variable to predict. The algorithm discovers natural groupings, anomalies, and hidden dimensions on its own. This is invaluable when you don't know what you're looking for.

K-Means Clustering

The most popular clustering algorithm, K-Means partitions data into K clusters by minimizing within-cluster variance.

Algorithm Steps:

1. Choose K (number of clusters)
2. Randomly initialize K centroids
3. Assign each point to the nearest centroid
4. Recalculate centroids as the mean of assigned points
5. Repeat steps 3-4 until convergence

Choosing K — The Elbow Method & Silhouette Score

MethodHow It WorksBest K
Elbow MethodPlot inertia (within-cluster SSE) vs KWhere curve "bends"
Silhouette ScoreMeasures cluster cohesion vs separation (−1 to 1)Highest average score

Hierarchical Clustering

Builds a dendrogram (tree) showing nested cluster relationships:

  • Agglomerative (bottom-up): Start with each point as its own cluster, merge closest pairs
  • Divisive (top-down): Start with one cluster, split recursively

DBSCAN — Density-Based Clustering

Groups points in dense regions and marks sparse points as noise. Unlike K-Means, it:

  • Doesn't require specifying K
  • Can find arbitrarily shaped clusters
  • Automatically identifies outliers
Key parameters: eps (neighborhood radius), min_samples (minimum points for a dense region).

PCA — Dimensionality Reduction

Principal Component Analysis reduces high-dimensional data to 2-3 dimensions for visualization while preserving maximum variance. Essential for plotting clusters from datasets with many features.

Real-World Use Cases

ApplicationIndustryTechnique
Customer segmentationRetail/MarketingK-Means, Hierarchical
Anomaly detectionFinance/SecurityDBSCAN
Image compressionTechK-Means on pixel colors
Document groupingNLPK-Means + TF-IDF

On this page

Detailed Theory

Clustering is unsupervised ML: no labels, just data. The goal is to discover natural groupings — customer segments, anomaly clusters, document topics. Unlike classification, there's no "right answer"; the analyst's job is to pick a sensible algorithm, validate the clusters, and translate them into business segments stakeholders can act on.

What Clustering Actually Is

Given points in feature space, clustering algorithms group points so that:

  • Points in the same cluster are close to each other.
  • Points in different clusters are far apart.
"Close" depends on the distance metric, and "clusters" depend on the algorithm. There's no universal answer — different methods give different shapes.

The Three Algorithms That Cover 90% of Cases

AlgorithmShapeNeed K?Handles noise?
K-Meanssphericalyesno
Hierarchical (Agglomerative)any (via linkage)optional (cut tree)no
DBSCANarbitraryno (eps + min_samples)yes ("noise" label)

Learn these three. Reach for GMM / HDBSCAN / Spectral once they aren't enough.

Beginner Mistakes to Skip

1. Forgetting to scale. Income (50000) overwhelms age (30). StandardScaler first, *always*. 2. Picking K by eye. Use the elbow method + silhouette score, not vibes. 3. Clustering raw categorical data. K-Means assumes Euclidean distance — one-hot encode or use K-Modes / Gower distance. 4. Trusting K-Means for non-spherical clusters. It draws sphere-ish boundaries; use DBSCAN for crescents/blobs. 5. Reading too much into clusters. They're a *hypothesis*, not truth. Validate with profile tables and business sense. 6. High-dimensional clustering without reducing first. Distances become meaningless past ~50 dims; reduce with PCA / UMAP.

Intermediate: Distance Metrics

MetricWhen to use
Euclideancontinuous numeric data (default)
Manhattangrid-like, sparse, robust to outliers
Cosinetext / TF-IDF / embeddings
Jaccardbinary set membership
Gowermixed numeric + categorical

The metric is more important than the algorithm for getting useful clusters.

Intermediate: K-Means in Detail

Lloyd's algorithm:

1. Pick K random centroids. 2. Assign each point to the nearest centroid. 3. Move each centroid to the mean of its assigned points. 4. Repeat until assignments stop changing.

Use KMeans(n_clusters=K, n_init=10, random_state=42) — n_init runs the algorithm 10 times with different seeds and keeps the best (avoids bad local minima). Use K-Means++ initialisation (default) for faster convergence.

Intermediate: Choosing K — Elbow + Silhouette

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertia, sil = [], [] for k in range(2, 11): km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) inertia.append(km.inertia_) sil.append(silhouette_score(X, km.labels_))

  • Elbow: plot inertia vs K, look for the bend. Subjective.
  • Silhouette: numeric score in [-1, 1]. Pick K with the highest score. Objective.
  • Combine both — silhouette breaks ties when the elbow is fuzzy.

Intermediate: Silhouette Coefficient

For each point i:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) = avg distance to own cluster, b(i) = avg distance to nearest other cluster.

  • ~+1 — well-clustered.
  • ~0 — on a boundary.
  • < 0 — likely in the wrong cluster.
Mean silhouette is the single best one-number summary of clustering quality.

Intermediate: Hierarchical (Agglomerative) Clustering

Start with each point as its own cluster, repeatedly merge the closest pair until one remains. Output is a dendrogram — cut it at any height to choose K.

Linkage choices:

  • single — nearest pair (chains, sensitive to noise)
  • complete — farthest pair (compact)
  • average — mean of all pairs
  • ward — minimises within-cluster variance (best default)
Great for small datasets (<10k) and when you want to *see* the cluster structure.

Advanced: DBSCAN

Density-based: a cluster is a dense region of points; sparse regions are *noise*.

Two knobs:

  • eps — neighbourhood radius. Tune with a k-distance plot: sort distances to k-th NN, look for the knee.
  • min_samples — minimum points to form a dense region. Higher → denser, larger clusters.
Strengths: finds arbitrary shapes, no K, labels outliers as -1. Weaknesses: struggles with varying density (use HDBSCAN instead).

Advanced: Validating Clusters Internally vs Externally

  • Internal (no ground truth): silhouette, Davies-Bouldin, Calinski-Harabasz.
  • External (with labels): Adjusted Rand Index, Normalised Mutual Information.
  • Stability: cluster on bootstrap samples — do points keep landing together?
  • Business validity: build a cluster profile table (mean of each feature per cluster, count) and confirm clusters tell a coherent story.

Advanced: High-Dimensional & Mixed Data

  • Dimensionality reduction first — PCA (linear), UMAP / t-SNE (non-linear). UMAP often *creates* the clusters DBSCAN then formalises.
  • Mixed data — use K-Prototypes (numeric + categorical) or Gower distance with hierarchical.
  • Text/embeddings — cosine distance + spherical K-Means or HDBSCAN on UMAP-reduced embeddings.

Advanced: Customer Segmentation Pipeline

Production-grade segmentation:

1. Engineer RFM features (Recency, Frequency, Monetary) plus behaviour signals. 2. Log-transform skewed monetary values; clip outliers. 3. StandardScaler → cluster (start K-Means K=4..6). 4. Build a profile table; name clusters ("VIP whales", "churning casuals"). 5. Persist labels back to the warehouse keyed by user_id. 6. Refresh quarterly; check stability — if 70% of users keep the same label, segments are robust.

Practice Path

1. Take a customer dataset, scale features, run K-Means for K=2..10, plot elbow + silhouette, pick K. 2. Profile the chosen clusters: mean of each feature per cluster + business-friendly names. 3. Re-run with DBSCAN; compare which points are flagged as noise vs which K-Means cluster they joined. 4. Reduce a high-dim embedding to 2D with UMAP and visualise the clusters.