Unsupervised learning finds structure in unlabeled data — no target variable to predict. The algorithm discovers natural groupings, anomalies, and hidden dimensions on its own. This is invaluable when you don't know what you're looking for.
The most popular clustering algorithm, K-Means partitions data into K clusters by minimizing within-cluster variance.
Algorithm Steps:
1. Choose K (number of clusters)
2. Randomly initialize K centroids
3. Assign each point to the nearest centroid
4. Recalculate centroids as the mean of assigned points
5. Repeat steps 3-4 until convergence

| Method | How It Works | Best K |
| Elbow Method | Plot inertia (within-cluster SSE) vs K | Where curve "bends" |
| Silhouette Score | Measures cluster cohesion vs separation (−1 to 1) | Highest average score |
Hierarchical clustering builds a dendrogram (tree) showing nested cluster relationships.
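DBSCAN groups points in dense regions and marks sparse points as noise. Unlike K-Means, it does not need K specified in advance, can find clusters of arbitrary shape, and labels outliers as noise instead of forcing them into a cluster.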
DBSCAN's two parameters are eps (neighborhood radius) and min_samples (minimum points for a dense region).

Principal Component Analysis (PCA) projects high-dimensional data onto a few directions of maximum variance, typically 2-3 for visualization. It is essential for plotting clusters from datasets with many features.
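A minimal sketch of the PCA step, assuming scikit-learn is available. The data here is synthetic (random numbers with one deliberately correlated feature), standing in for a real feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 features (illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # make two features correlated

# Scale first so no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (200, 2)
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```

If the two ratios sum to a small fraction of 1, the 2D plot is hiding most of the structure and should be read with caution.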
| Application | Industry | Technique |
| Customer segmentation | Retail/Marketing | K-Means, Hierarchical |
| Anomaly detection | Finance/Security | DBSCAN |
| Image compression | Tech | K-Means on pixel colors |
| Document grouping | NLP | K-Means + TF-IDF |
Clustering is unsupervised ML: no labels, just data. The goal is to discover natural groupings — customer segments, anomaly clusters, document topics. Unlike classification, there's no "right answer"; the analyst's job is to pick a sensible algorithm, validate the clusters, and translate them into business segments stakeholders can act on.
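Given points in feature space, clustering algorithms group points so that points in the same cluster are similar to each other and points in different clusters are dissimilar.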
| Algorithm | Shape | Need K? | Handles noise? |
| K-Means | spherical | yes | no |
| Hierarchical (Agglomerative) | any (via linkage) | optional (cut tree) | no |
| DBSCAN | arbitrary | no (eps + min_samples) | yes ("noise" label) |
Learn these three. Reach for GMM / HDBSCAN / Spectral once they aren't enough.
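To make the comparison concrete, here is a hedged sketch that runs all three algorithms on scikit-learn's two-crescent toy dataset and scores each against the known crescent membership with the adjusted Rand index. The parameter values (eps=0.3, min_samples=5, single linkage) are illustrative choices for this toy data, not general recommendations.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved crescents: a classic non-spherical case.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=42),
    "Agglomerative (single linkage)": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # ARI is 1.0 for a perfect match with the true crescents, about 0 for random labels.
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {scores[name]:.2f}")
```

On this data K-Means tends to cut straight across both crescents, while the density-based approach follows their shape.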
1. Forgetting to scale. Income (50000) overwhelms age (30). StandardScaler first, *always*.
2. Picking K by eye. Use the elbow method + silhouette score, not vibes.
3. Clustering raw categorical data. K-Means assumes Euclidean distance — one-hot encode or use K-Modes / Gower distance.
4. Trusting K-Means for non-spherical clusters. It draws sphere-ish boundaries; use DBSCAN for crescents, rings, and other irregular shapes.
5. Reading too much into clusters. They're a *hypothesis*, not truth. Validate with profile tables and business sense.
6. High-dimensional clustering without reducing first. Pairwise distances become less informative as dimensions grow (as a rough rule of thumb, past ~50 dims); reduce with PCA / UMAP first.
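The first pitfall is easy to see with numbers. The sketch below uses three made-up customers (income, age) and compares Euclidean distances before and after StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [income, age]
X = np.array([
    [50_000, 30],
    [52_000, 65],   # similar income, very different age
    [90_000, 31],   # very different income, similar age
], dtype=float)

def dist(a, b):
    return float(np.linalg.norm(a - b))

# Raw units: the 35-year age gap barely registers next to the income gap.
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])
print(raw_01, raw_02)   # about 2000.3 and 40000.0

# After scaling, both features contribute on comparable terms.
X_scaled = StandardScaler().fit_transform(X)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])
print(scaled_01, scaled_02)
```

In raw units customer 0 looks twenty times closer to customer 1 than to customer 2; after scaling the two distances are of similar size.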
| Metric | When to use |
| Euclidean | continuous numeric data (default) |
| Manhattan | grid-like, sparse, robust to outliers |
| Cosine | text / TF-IDF / embeddings |
| Jaccard | binary set membership |
| Gower | mixed numeric + categorical |
The metric is more important than the algorithm for getting useful clusters.
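For reference, four of the metrics in the table can be written in a few lines of plain Python. This is a minimal sketch for intuition (the function names are made up here); in practice a library such as SciPy or scikit-learn would normally be used. Gower distance is left out because it needs per-column type handling.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: depends on direction only, not on vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| for two sets.
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # v points the same way as u but is twice as long
print(euclidean(u, v))         # sqrt(14), about 3.742
print(manhattan(u, v))         # 6.0
print(cosine_distance(u, v))   # approximately 0: same direction
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

The same pair of vectors is far apart under Euclidean distance and essentially identical under cosine distance, which is why the metric choice changes which points end up clustered together.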
Lloyd's algorithm:
1. Pick K random centroids.
2. Assign each point to the nearest centroid.
3. Move each centroid to the mean of its assigned points.
4. Repeat until assignments stop changing.
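The four steps above can be sketched directly in NumPy. This is an illustrative implementation, not what scikit-learn runs internally; the function name lloyd_kmeans and the choice to initialise from K random data points are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
print(np.round(centroids, 1))
```

The result depends on the random start, which is the reason for running several initialisations and keeping the best one.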
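Use KMeans(n_clusters=K, n_init=10, random_state=42): n_init=10 runs the algorithm 10 times from different starting centroids and keeps the run with the lowest inertia, which guards against bad local minima. The default init="k-means++" spreads the starting centroids apart and usually converges faster than purely random initialisation.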
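from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is assumed to be an already-scaled feature matrix (n_samples, n_features).
inertia, sil = [], []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia.append(km.inertia_)                   # within-cluster SSE, for the elbow plot
    sil.append(silhouette_score(X, km.labels_))   # mean silhouette, higher is better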
For each point i:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) = avg distance to the other points in its own cluster, b(i) = avg distance to the points in the nearest other cluster.
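A small worked example may help here. The sketch below computes s(i) by hand on five made-up 1-D points and checks it against scikit-learn's silhouette_samples; the helper name silhouette_of is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny 1-D example with two obvious clusters.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])

def silhouette_of(i):
    d = np.linalg.norm(X - X[i], axis=1)   # distance from point i to every point
    own = labels == labels[i]
    own[i] = False                         # exclude the point itself
    a = d[own].mean()                      # a(i): mean distance within own cluster
    b = min(d[labels == c].mean()          # b(i): mean distance to the nearest other cluster
            for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)

# Point 0: a = (1 + 2) / 2 = 1.5, b = (10 + 11) / 2 = 10.5
print(silhouette_of(0))                    # 9 / 10.5, about 0.857
print(silhouette_samples(X, labels)[0])    # same value from scikit-learn
```

Averaging s(i) over all points gives the silhouette score used to compare values of K.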
Start with each point as its own cluster, repeatedly merge the closest pair until one remains. Output is a dendrogram — cut it at any height to choose K.
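As a hedged illustration of building the tree and cutting it, the sketch below uses SciPy's hierarchy module on three synthetic blobs. Ward linkage and the blob positions are assumptions chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2)),
               rng.normal([0, 3], 0.3, size=(20, 2))])

# Build the merge tree bottom-up; each row of Z records one merge.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 clusters remain.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Alternatively, cut at a merge-distance threshold and let the cluster count fall out.
labels_h = fcluster(Z, t=5.0, criterion="distance")

print(Z.shape)              # (59, 4): n - 1 merges for n = 60 points
print(len(set(labels_k)))   # 3
```

Passing Z to scipy.cluster.hierarchy.dendrogram draws the tree itself, which is often the easiest way to decide where to cut.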
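Linkage choices: ward (merge the pair that least increases within-cluster variance), complete (distance between the farthest points of two clusters), average (mean pairwise distance), single (distance between the closest points).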
Density-based: a cluster is a dense region of points; sparse regions are *noise*.
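Two knobs: eps (neighborhood radius) and min_samples (minimum points for a dense region). Noise points are labelled -1. Weaknesses: struggles with varying density (use HDBSCAN instead).

Production-grade segmentation: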
1. Engineer RFM features (Recency, Frequency, Monetary) plus behaviour signals.
2. Log-transform skewed monetary values; clip outliers.
3. StandardScaler → cluster (start K-Means K=4..6).
4. Build a profile table; name clusters ("VIP whales", "churning casuals").
5. Persist labels back to the warehouse keyed by user_id.
6. Refresh quarterly and check stability: if at least 70% of users keep the same label, treat the segments as robust.
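Steps 2 to 4 of this workflow might look roughly like the sketch below. The RFM table is synthetic, every column name and value is made up for illustration, and K=4 is only a starting point.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an RFM table (hypothetical columns and values).
rng = np.random.default_rng(7)
n = 500
rfm = pd.DataFrame({
    "user_id": np.arange(n),
    "recency_days": rng.integers(1, 365, size=n),
    "frequency": rng.poisson(5, size=n) + 1,
    "monetary": rng.lognormal(mean=4.0, sigma=1.0, size=n),
})

features = rfm[["recency_days", "frequency", "monetary"]].copy()
# Step 2: clip extreme spenders at the 99th percentile, then log-transform.
cap = features["monetary"].quantile(0.99)
features["monetary"] = np.log1p(features["monetary"].clip(upper=cap))

# Step 3: scale, then cluster.
X = StandardScaler().fit_transform(features)
rfm["segment"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Step 4: profile table, one row per segment, in original units.
profile = rfm.groupby("segment")[["recency_days", "frequency", "monetary"]].mean()
profile["n_users"] = rfm["segment"].value_counts().sort_index()
print(profile.round(1))
```

The profile table is what gets named and shared with stakeholders; the user_id and segment columns are what would be written back to the warehouse.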
1. Take a customer dataset, scale features, run K-Means for K=2..10, plot elbow + silhouette, pick K.
2. Profile the chosen clusters: mean of each feature per cluster + business-friendly names.
3. Re-run with DBSCAN; compare which points are flagged as noise vs which K-Means cluster they joined.
4. Reduce a high-dim embedding to 2D with UMAP and visualise the clusters.