Clustering Notebook
Clustering is unsupervised learning: there is no target column to predict. The algorithm groups rows by similarity, and your job is to decide whether the groups mean anything. The classic data analysis use case is customer segmentation: RFM Analysis segments customers with hand-written rules, while K-means lets the data draw the boundaries instead.
This notebook covers K-means end to end with scikit-learn. The supervised workflow (train/test split, scaling, pipelines) is in Machine Learning Notebook; choosing between model families is in ML Model Selection Guide.
All examples use the parking dataset — an rfm DataFrame with one row per customer: recency_days, frequency, monetary.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
How K-means Works (One Paragraph)
You pick k, the number of clusters. The algorithm places k center points, assigns every row to its nearest center, moves each center to the mean of its assigned rows, and repeats until nothing moves. The result: k groups where rows are close to their own center and far from the others. Everything in this notebook follows from one fact — "close" means Euclidean distance, which is why scaling matters so much.
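The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name kmeans_sketch, the random-row initialization, and the toy data are all assumptions made for the example (scikit-learn defaults to k-means++ initialization).

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-means loop: assign rows to the nearest center, move centers, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (an assumption for this sketch)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every row to every center: shape (n_rows, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its rows; keep the old center if a cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # nothing moved, so the assignment is stable
        centers = new_centers
    return labels, centers

# Toy data: two well-separated blobs in 2D (synthetic, for illustration only)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans_sketch(X_toy, k=2)
print(np.round(centers, 1))
```

On data this cleanly separated, the two centers should settle near the two blob means within a handful of iterations.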
The Workflow
1. Select numeric features that describe the behavior you care about
2. Scale them (mandatory)
3. Choose k with the elbow + silhouette checks
4. Fit, assign labels
5. Profile the clusters — do they mean anything?
Step 1 — Features
features = ['recency_days', 'frequency', 'monetary']
X = rfm[features].copy()
X.describe() # check ranges — they will be wildly different
Use numeric features that describe behavior. Leave out IDs (meaningless distance), and leave out one-hot categorical dummies unless you must — distance on 0/1 columns mixes poorly with continuous ones (see Common Mistakes).
K-means is sensitive to outliers — a single plate with 100× the spend drags a centroid toward it. Clip the tails first:
for col in features:
    # Clip each feature to its 1st-99th percentile range
    lower, upper = X[col].quantile([0.01, 0.99])
    X[col] = X[col].clip(lower, upper)
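To see why clipping matters, here is a small sketch with made-up spend values: 99 ordinary customers and one extreme one. A centroid is just a mean, so the effect on the mean is the effect on the centroid. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical spend values: 99 ordinary customers and one extreme outlier
spend = pd.Series([100.0] * 99 + [10_000.0])

# Same 1st/99th percentile clip as above
lower, upper = spend.quantile([0.01, 0.99])
clipped = spend.clip(lower, upper)

print('mean before clipping:', spend.mean())
print('mean after clipping: ', round(clipped.mean(), 2))
```

The single outlier roughly doubles the unclipped mean; after clipping, the mean sits close to the typical customer again.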
Step 2 — Scale (Mandatory)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Without scaling, monetary (range 0–50,000) dominates the distance and frequency (range 1–40) contributes almost nothing, so the clusters become "spend tiers" no matter what else is in the data. After StandardScaler every feature has mean 0 and standard deviation 1, so each one carries comparable weight in the distance.
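A quick way to check this claim is to split the squared Euclidean distance between two customers into per-feature shares. The two customers and the per-feature standard deviations below are invented for illustration. Dividing by the standard deviation is enough here, because the mean that StandardScaler subtracts cancels out in a difference.

```python
import numpy as np

# Two hypothetical customers: [recency_days, frequency, monetary]
a = np.array([10.0, 2.0, 5000.0])
b = np.array([12.0, 38.0, 5600.0])

# Share of the squared Euclidean distance contributed by each feature, raw units
raw_sq = (a - b) ** 2
raw_share = raw_sq / raw_sq.sum()

# Same computation after dividing by assumed per-feature standard deviations (made-up values)
assumed_sd = np.array([30.0, 8.0, 9000.0])
scaled_sq = ((a - b) / assumed_sd) ** 2
scaled_share = scaled_sq / scaled_sq.sum()

for name, r, s in zip(['recency_days', 'frequency', 'monetary'], raw_share, scaled_share):
    print(f"{name:>12}: raw {r:.1%}, scaled {s:.1%}")
```

These two customers differ enormously in frequency, yet in raw units almost all of the distance comes from monetary. After scaling, the frequency gap is what separates them.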
Step 3 — Choose k
There is no "correct" k — there are useful and useless ones. Two diagnostics, run together:
inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias[k] = km.inertia_                            # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X_scaled, labels)  # mean silhouette over all rows
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(list(inertias.keys()), list(inertias.values()), marker='o')
axes[0].set_title('Elbow: inertia by k')
axes[1].plot(list(silhouettes.keys()), list(silhouettes.values()), marker='o')
axes[1].set_title('Silhouette score by k')
plt.tight_layout()
plt.show()
| Diagnostic | What it measures | How to read it |
|---|---|---|
| Inertia (elbow) | Sum of squared distances from rows to their assigned centers | Falls as k grows; pick the k where the curve bends and extra clusters stop paying |
| Silhouette | How much closer rows are to their own cluster than the next one (−1 to 1) | Higher is better; below ~0.25 means weak structure |
The two often disagree by one — break the tie with the business: 4 segments the team can act on beat 6 segments nobody can name. If silhouette is low at every k, the data may simply not have clusters — that is a finding, not a failure.
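As a rough illustration of "low at every k", the sketch below compares the best silhouette score on three tight synthetic blobs with the best score on uniform noise. The data, the helper name best_silhouette, and the range of k are assumptions for the example. Exact scores depend on the data and its dimensionality, so treat the comparison as relative rather than as a fixed threshold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration: three tight blobs vs. uniform noise
X_blobs, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                        cluster_std=0.5, random_state=0)
X_noise = np.random.default_rng(0).uniform(0, 1, size=(300, 2))

def best_silhouette(X, k_values=range(2, 7)):
    """Return (k, score) for the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]

blob_k, blob_score = best_silhouette(X_blobs)
noise_k, noise_score = best_silhouette(X_noise)
print(f'blobs: best k={blob_k}, silhouette={blob_score:.2f}')
print(f'noise: best k={noise_k}, silhouette={noise_score:.2f}')
```

K-means will always return k groups, even on noise; the silhouette comparison is what tells you whether those groups are well separated.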
Step 4 — Fit and Label
k = 4
km = KMeans(n_clusters=k, n_init='auto', random_state=42)
rfm['cluster'] = km.fit_predict(X_scaled)
rfm['cluster'].value_counts() # cluster sizes — a 1% cluster is usually outliers
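random_state=42 makes the run reproducible: K-means starts from random centers, and different starts can land on different solutions. Note that with the default k-means++ initialization, n_init='auto' runs a single start; pass an integer such as n_init=10 if you want sklearn to run several starts and keep the best one (lowest inertia).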
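The effect of the seed can be checked directly. The sketch below uses a synthetic stand-in for X_scaled (an assumption for the example) and an explicit n_init so the number of starts does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for X_scaled (illustration only)
X_demo = np.random.default_rng(0).normal(size=(500, 3))

# Same random_state: identical labels on every rerun
labels_a = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)
print('identical reruns:', np.array_equal(labels_a, labels_b))

# Different seeds with a single start each can land on different solutions
single_start_inertias = [
    KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X_demo).inertia_
    for seed in range(5)
]
print('single-start inertias:', np.round(single_start_inertias, 1))
```

If the single-start inertias differ noticeably from seed to seed, that is a sign the data has several competing solutions and a larger n_init is worth the extra compute.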
Step 5 — Profile the Clusters
The fit is the easy part. The value is in the profile table:
profile = rfm.groupby('cluster').agg(
    customers     = ('cluster', 'size'),
    avg_recency   = ('recency_days', 'mean'),
    avg_frequency = ('frequency', 'mean'),
    avg_monetary  = ('monetary', 'mean'),
).round(1).sort_values('avg_monetary', ascending=False)
print(profile)
Read each row and name it in business language:
| cluster | customers | avg_recency | avg_frequency | avg_monetary | Name |
|---|---|---|---|---|---|
| 2 | 180 | 6.2 | 28.4 | 18,400 | Heavy regulars |
| 0 | 640 | 21.5 | 9.1 | 5,200 | Steady commuters |
| 3 | 410 | 12.8 | 3.2 | 1,100 | Occasional visitors |
| 1 | 270 | 71.3 | 2.1 | 900 | Lapsed |
If you cannot give a cluster a one-line name, it is probably not a real segment — try a different k or different features.
To see the centers in original units (instead of scaled ones):
centers = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=features
).round(1)
Visualize
# 2 features at a time, colored by cluster
fig, ax = plt.subplots()
sns.scatterplot(data=rfm, x='frequency', y='monetary',
                hue='cluster', palette='Set2', alpha=0.6, ax=ax)
ax.set_title('Clusters: frequency vs monetary')
plt.tight_layout()
plt.show()
With more than 3 features, project to 2D with PCA just for plotting:
from sklearn.decomposition import PCA
coords = PCA(n_components=2).fit_transform(X_scaled)
sns.scatterplot(x=coords[:, 0], y=coords[:, 1], hue=rfm['cluster'],
                palette='Set2', alpha=0.6)
K-means vs Rule-Based RFM
| | RFM scores (rules) | K-means |
|---|---|---|
| Boundaries | You define them (NTILE buckets) | Data defines them |
| Explainability | Trivial — "R≥4 and F≥4" | Needs the profile table |
| Stability over reruns | High | Centers shift as data changes |
| Finds unexpected segments | No | Yes — its main advantage |
A solid pattern: run K-means to discover the structure, then translate the discovered boundaries into simple rules for production — rules are easier to explain, monitor, and keep stable. Both approaches are inputs to the actions table in RFM Analysis Notebook.
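One possible way to do that translation is to fit a shallow decision tree that predicts the cluster label from the original, unscaled features, then read its splits as candidate rules. This is a sketch of one approach, not a method prescribed by this notebook. The data is synthetic, and max_depth=3 is an arbitrary choice that trades accuracy for readability.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic RFM-like data (illustration only, not the notebook's dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'recency_days': rng.integers(1, 90, 600),
    'frequency': rng.integers(1, 40, 600),
    'monetary': rng.gamma(2.0, 2000.0, 600),
})
features = ['recency_days', 'frequency', 'monetary']

# Cluster on scaled features, as in the main workflow
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(demo[features])
)

# A shallow tree on the ORIGINAL units approximates the cluster boundaries as if/else rules
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(demo[features], labels)
rules = export_text(tree, feature_names=features)
agreement = tree.score(demo[features], labels)

print(rules)
print('share of rows where the rules reproduce the K-means label:', round(agreement, 3))
```

The agreement score shows how much is lost by replacing the clusters with rules. If it is low, the segments may not be expressible as simple thresholds.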
When K-means Is the Wrong Tool
K-means assumes clusters are roughly round, similar in size, and that every row belongs somewhere. When that fails:
| Situation | Better option |
|---|---|
| Mostly categorical features | K-modes / k-prototypes, or rethink the features |
| Irregular shapes, noise points that belong nowhere | DBSCAN (density-based; finds outliers as "no cluster") |
| You want a tree of nested segments | Hierarchical clustering (dendrogram) |
| Clusters of very different densities | HDBSCAN or Gaussian Mixture Models (plain DBSCAN uses a single density threshold) |
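Apart from K-modes / k-prototypes, which live in the separate kmodes package, these are available in scikit-learn. DBSCAN and hierarchical clustering offer fit_predict but no predict for new rows. The profiling steps here transfer directly, and the silhouette check works on any set of labels.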
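As a small example of the "noise points that belong nowhere" case, the sketch below runs DBSCAN on two dense synthetic blobs plus three isolated points. The data and the eps and min_samples values are hand-picked assumptions for this toy example, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense blobs plus a few isolated points (illustration only)
rng = np.random.default_rng(0)
blob_a = rng.normal(0, 0.2, (100, 2))
blob_b = rng.normal(5, 0.2, (100, 2))
strays = np.array([[2.5, 2.5], [10.0, -5.0], [-6.0, 8.0]])
X_demo = np.vstack([blob_a, blob_b, strays])

# eps and min_samples are hand-picked for this toy data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print('clusters found:', n_clusters)
print('rows labeled as noise (-1):', n_noise)
```

Unlike K-means, DBSCAN is not told how many clusters to find, and rows labeled -1 are left unassigned instead of being forced into the nearest group.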
Common Mistakes
1. Skipping Scaling
The feature with the largest range silently becomes the only feature. Always apply StandardScaler (or MinMaxScaler) before K-means, because it is distance-based, like KNN.
2. Clustering on IDs or Codes
station_code as a number tells K-means station 9 is "close to" station 8. Numeric-looking labels are still labels. Drop them or engineer real behavior features from them.
3. Picking k by Gut
"Let's do 5" bakes an arbitrary choice into every downstream decision. Run elbow + silhouette, then choose — the plots take four lines of code.
4. Shipping Unprofiled Clusters
Cluster 0/1/2/3 means nothing to anyone. The profile table with sizes, means, and names is the deliverable; the labels are an intermediate product.
5. Forgetting random_state
Reruns land on different (equally valid) solutions, segment counts shift, and the marketing list changes between Tuesday and Wednesday. Pin it.
6. Treating Clusters as Truth
Clusters are a compression of the features you chose — different features, different "truth". Validate that the segments behave differently on a metric outside the clustering features (e.g. retention next month) before building strategy on them.
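A minimal version of that check is a groupby on a metric that was not used for clustering. The frame below is entirely hypothetical (the column retained_next_month and its values are invented) and only shows the shape of the comparison.

```python
import pandas as pd

# Hypothetical frame: cluster labels plus a metric that was NOT a clustering feature
check = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'retained_next_month': [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# Cluster size and retention rate per cluster
retention = check.groupby('cluster')['retained_next_month'].agg(['size', 'mean'])
spread = retention['mean'].max() - retention['mean'].min()

print(retention)
print('retention spread across clusters:', round(spread, 2))
```

If the outside metric looks the same in every cluster, the segments describe the features you chose but may not matter for the business question. On real data, also check that the differences are larger than what cluster size alone would produce by chance.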
Upstream of this: the feature prep lives in Pandas Notebook and the scaling concepts in Machine Learning Notebook. Downstream: segment-level actions and tracking, as in RFM Analysis Notebook.