K-means Clustering for Customer Segmentation

2026-06-13 Data Analysis 5 min read

Python Machine Learning

Clustering is unsupervised learning: there is no target column to predict. The algorithm groups rows by similarity, and your job is to decide whether the groups mean anything. The classic data-analysis use case is customer segmentation. RFM Segmentation with SQL & Python segments customers with hand-written rules; K-means lets the data draw the boundaries instead.

This notebook covers K-means end to end with scikit-learn. The supervised workflow (train/test split, scaling, pipelines) is in scikit-learn Workflow for Data Analysts; choosing between model families is in How to Choose the Right Machine Learning Model.

All examples use the parking dataset — an rfm DataFrame with one row per customer: recency_days, frequency, monetary.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
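If the parking dataset is not at hand, a synthetic stand-in is enough to run the snippets in this notebook. The sketch below invents four customer groups; the group sizes, the distribution parameters, the customer_id column, and the make_synthetic_rfm name are all assumptions for illustration, not properties of the real data.

```python
import numpy as np
import pandas as pd

def make_synthetic_rfm(seed: int = 0) -> pd.DataFrame:
    """Build a made-up rfm table: one row per customer, three behavior columns."""
    rng = np.random.default_rng(seed)
    # (n customers, mean recency in days, mean visits, mean spend): invented values
    groups = [(200, 5, 30, 18000), (600, 20, 9, 5000),
              (400, 12, 3, 1000), (300, 70, 2, 900)]
    parts = []
    for n, rec, freq, mon in groups:
        parts.append(pd.DataFrame({
            'recency_days': rng.exponential(rec, n).round().clip(0, 365),
            'frequency':    rng.poisson(freq, n).clip(1, None),
            'monetary':     rng.gamma(4.0, mon / 4.0, n).round(2),
        }))
    rfm = pd.concat(parts, ignore_index=True)
    rfm.insert(0, 'customer_id', np.arange(1, len(rfm) + 1))
    return rfm

rfm = make_synthetic_rfm()
print(rfm.describe().round(1))
```

Because the groups are drawn from overlapping distributions, K-means will not necessarily recover exactly these four groups, which is closer to what real data looks like.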

How K-means Works (One Paragraph)

You pick k, the number of clusters. The algorithm places k center points, assigns every row to its nearest center, moves each center to the mean of its assigned rows, and repeats until nothing moves. The result: k groups where rows are close to their own center and far from the others. Everything in this notebook follows from one fact — "close" means Euclidean distance, which is why scaling matters so much.
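The loop described above can be sketched in a few lines of NumPy. This is a teaching version under simple assumptions (random rows as starting centers, a fixed iteration cap), not scikit-learn's implementation, which uses k-means++ initialization and optimized code; kmeans_sketch and the toy data are names made up here.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest center, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (simple init, not k-means++)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Euclidean distance from every row to every center: shape (n_rows, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its rows; leave it in place if it has none
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # nothing moved: converged
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_toy, centers_toy = kmeans_sketch(X_toy, k=2)
print(centers_toy.round(2))
```

The only place distance enters is the np.linalg.norm line, which is why a feature with a large numeric range can dominate the assignment step.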

The Workflow

1. Select numeric features that describe the behavior you care about
2. Scale them (mandatory)
3. Choose k with the elbow + silhouette checks
4. Fit, assign labels
5. Profile the clusters — do they mean anything?

Step 1 — Features

features = ['recency_days', 'frequency', 'monetary']
X = rfm[features].copy()

X.describe()    # check ranges — they will be wildly different

Use numeric features that describe behavior. Leave out IDs (distance between them is meaningless; see Common Mistakes), and leave out one-hot categorical dummies unless you must: distance on 0/1 columns mixes poorly with continuous ones (see When K-means Is the Wrong Tool).

K-means is sensitive to outliers: a single customer with 100× the typical spend drags a centroid toward it. Clip the tails first:

for col in features:
    lower, upper = X[col].quantile([0.01, 0.99])
    X[col] = X[col].clip(lower, upper)

Step 2 — Scale (Mandatory)

scaler   = StandardScaler()
X_scaled = scaler.fit_transform(X)

Without scaling, monetary (range 0–50,000) dominates the distance and frequency (range 1–40) contributes nothing — the clusters become "spend tiers" no matter what else is in the data. After StandardScaler every feature has mean 0 and SD 1, so each one votes equally.
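A small numeric check makes the point concrete. The five customers below are invented; A and B differ hugely in frequency but barely in spend, while A and C differ only in spend.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy customers: columns are recency_days, frequency, monetary (made-up values)
X_demo = np.array([
    [10.0,  2.0,  5000.0],   # A: rare visitor
    [10.0, 38.0,  5200.0],   # B: very frequent visitor, similar spend to A
    [10.0,  2.0,  9000.0],   # C: rare visitor like A, higher spend
    [60.0, 20.0,   500.0],
    [30.0, 10.0, 20000.0],
])

def dist(M, i, j):
    """Euclidean distance between rows i and j of M."""
    return float(np.linalg.norm(M[i] - M[j]))

print('raw    A-B:', round(dist(X_demo, 0, 1), 1),
      ' A-C:', round(dist(X_demo, 0, 2), 1))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
print('scaled A-B:', round(dist(X_demo_scaled, 0, 1), 2),
      ' A-C:', round(dist(X_demo_scaled, 0, 2), 2))
```

On the raw values A looks much closer to B than to C because only monetary counts; after scaling the order flips, and the 36-visit gap between A and B becomes the larger distance.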

Step 3 — Choose k

There is no "correct" k — there are useful and useless ones. Two diagnostics, run together:

inertias, silhouettes = {}, {}

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias[k]    = km.inertia_
    silhouettes[k] = silhouette_score(X_scaled, labels)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(list(inertias.keys()), list(inertias.values()), marker='o')
axes[0].set_title('Elbow: inertia by k')
axes[1].plot(list(silhouettes.keys()), list(silhouettes.values()), marker='o')
axes[1].set_title('Silhouette score by k')
plt.tight_layout()
plt.show()

Diagnostic	What it measures	How to read it
Inertia (elbow)	Sum of squared distances from rows to their nearest center	Falls as k grows; pick the k where the curve bends and extra clusters stop paying
Silhouette	How much closer rows are to their own cluster than the next one (−1 to 1)	Higher is better; below ~0.25 means weak structure

The two often disagree by one — break the tie with the business: 4 segments the team can act on beat 6 segments nobody can name. If silhouette is low at every k, the data may simply not have clusters — that is a finding, not a failure.
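Reading the two plots side by side is easier with the numbers in one table. The sketch below is one possible way to do that, run on synthetic blobs standing in for X_scaled; the blob centers, the inertia_drop_pct column, and the use of n_init=10 are assumptions made for this demo.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: 4 well-separated blobs in 3 dimensions
X_blobs, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0, 0], [6, 0, 0], [0, 6, 0], [0, 0, 6]],
    cluster_std=0.8,
    random_state=0,
)

rows = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_blobs)
    rows.append({'k': k, 'inertia': km.inertia_,
                 'silhouette': silhouette_score(X_blobs, labels)})

diag = pd.DataFrame(rows).set_index('k')
# Percent drop in inertia from k-1 to k: a big drop followed by small ones marks the elbow
diag['inertia_drop_pct'] = (-diag['inertia'].pct_change() * 100).round(1)
print(diag.round(3))
print('best silhouette at k =', diag['silhouette'].idxmax())
```

On real customer data the signal is rarely this clean, so treat the table as a shortlist of candidate k values rather than an automatic answer.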

Step 4 — Fit and Label

k  = 4
km = KMeans(n_clusters=k, n_init='auto', random_state=42)
rfm['cluster'] = km.fit_predict(X_scaled)

rfm['cluster'].value_counts()    # cluster sizes — a 1% cluster is usually outliers

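random_state=42 makes the run reproducible: K-means starts from random centers, and different starts can land on different solutions. Note that n_init='auto' runs only one start when the default k-means++ initialization is used (ten with init='random'); pass an integer such as n_init=10 to have sklearn run several starts and keep the one with the lowest inertia.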
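How much the starting centers matter can be checked directly. The sketch below uses synthetic blobs and init='random' so the variation between starts is easier to see; the data and parameter values are assumptions for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, centers=6, cluster_std=1.5, random_state=3)

# One random start per seed: the final inertia can differ from seed to seed
single = [
    KMeans(n_clusters=6, init='random', n_init=1, random_state=s).fit(X_demo).inertia_
    for s in range(10)
]
# Ten starts in one fit: scikit-learn keeps the run with the lowest inertia
multi = KMeans(n_clusters=6, init='random', n_init=10, random_state=0).fit(X_demo).inertia_

print('single starts: min', round(min(single), 1), 'max', round(max(single), 1))
print('best of 10   :', round(multi, 1))

# Same seed, same data: the labels should match exactly
a = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X_demo)
b = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X_demo)
print('same labels with the same seed:', np.array_equal(a, b))
```

If the single-start inertias are all nearly identical, the data has one dominant solution and extra starts buy little; a wide spread is a reason to raise n_init.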

Step 5 — Profile the Clusters

The fit is the easy part. The value is in the profile table:

profile = rfm.groupby('cluster').agg(
    customers     = ('cluster',      'size'),
    avg_recency   = ('recency_days', 'mean'),
    avg_frequency = ('frequency',    'mean'),
    avg_monetary  = ('monetary',     'mean'),
).round(1).sort_values('avg_monetary', ascending=False)

print(profile)

Read each row and name it in business language:

cluster	customers	avg_recency	avg_frequency	avg_monetary	Name
2	180	6.2	28.4	18,400	Heavy regulars
0	640	21.5	9.1	5,200	Steady commuters
3	410	12.8	3.2	1,100	Occasional visitors
1	270	71.3	2.1	900	Lapsed

If you cannot give a cluster a one-line name, it is probably not a real segment — try a different k or different features.
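Once the names are settled, they can be attached to each customer row so downstream reports never show raw cluster ids. A minimal sketch, using the names from the example profile table and an invented five-row table; segment_names, rfm_demo, and the segment column are hypothetical, and your own cluster ids will differ from run to run.

```python
import pandas as pd

# Toy rows standing in for the labeled rfm table (values invented)
rfm_demo = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'cluster':     [2, 0, 3, 1, 0],
})

# Cluster id -> business name, copied from the example profile table
segment_names = {
    2: 'Heavy regulars',
    0: 'Steady commuters',
    3: 'Occasional visitors',
    1: 'Lapsed',
}

rfm_demo['segment'] = rfm_demo['cluster'].map(segment_names)

# Guard against a cluster id that has no name yet (map would leave NaN)
missing = rfm_demo['segment'].isna().sum()
print(rfm_demo)
print('rows without a segment name:', missing)
```

Keeping the mapping in one dictionary also makes it obvious what has to be revisited whenever the model is refit and the ids change.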

To see the centers in original units (instead of scaled ones):

centers = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=features
).round(1)

Visualize

# 2 features at a time, colored by cluster
fig, ax = plt.subplots()
sns.scatterplot(data=rfm, x='frequency', y='monetary',
                hue='cluster', palette='Set2', alpha=0.6, ax=ax)
ax.set_title('Clusters: frequency vs monetary')
plt.tight_layout()
plt.show()

With more than 3 features, project to 2D with PCA just for plotting:

from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(X_scaled)
sns.scatterplot(x=coords[:, 0], y=coords[:, 1], hue=rfm['cluster'],
                palette='Set2', alpha=0.6)
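A 2D projection can hide separation that exists in the other dimensions, so it helps to check how much variance the two components keep before reading too much into the plot. A small sketch on synthetic five-feature data (the data is a stand-in, not the rfm table):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for X_scaled with 5 features (synthetic, for illustration only)
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_raw)

pca = PCA(n_components=2).fit(X_demo)
coords_demo = pca.transform(X_demo)

# Share of total variance captured by each of the two plotted axes
retained = pca.explained_variance_ratio_.sum()
print('per component:', pca.explained_variance_ratio_.round(3))
print(f'variance kept by the 2D projection: {retained:.0%}')
```

If the retained share is low, overlapping blobs in the scatterplot do not necessarily mean the clusters overlap in the full feature space.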

K-means vs Rule-Based RFM

	RFM scores (rules)	K-means
Boundaries	You define them (NTILE buckets)	Data defines them
Explainability	Trivial — "R≥4 and F≥4"	Needs the profile table
Stability over reruns	High	Centers shift as data changes
Finds unexpected segments	No	Yes — its main advantage

A solid pattern: run K-means to discover the structure, then translate the discovered boundaries into simple rules for production — rules are easier to explain, monitor, and keep stable. Both approaches are inputs to the actions table in RFM Segmentation with SQL & Python.
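One way to do that translation, offered here as a suggestion rather than the method the RFM article uses, is to fit a shallow decision tree that predicts the K-means label from the unscaled features and read its thresholds as rules. The data below is synthetic and the depth limit is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Synthetic two-group data in place of the real rfm table
demo = pd.DataFrame({
    'recency_days': np.r_[rng.normal(8, 2, 200), rng.normal(70, 10, 200)],
    'frequency':    np.r_[rng.normal(25, 4, 200), rng.normal(3, 1, 200)],
    'monetary':     np.r_[rng.normal(15000, 2000, 200), rng.normal(1000, 200, 200)],
})
features = ['recency_days', 'frequency', 'monetary']

# Cluster on scaled features, as in the main workflow
demo_scaled = StandardScaler().fit_transform(demo[features])
demo['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(demo_scaled)

# Fit the tree on the ORIGINAL units so the thresholds read as business rules
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(demo[features], demo['cluster'])

agreement = tree.score(demo[features], demo['cluster'])
print(export_text(tree, feature_names=features))
print('agreement with K-means labels:', round(agreement, 3))
```

The agreement score shows how much is lost by replacing the clusters with rules; if it is low, the segments are not separable by simple thresholds and a deeper tree or different features may be needed.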

When K-means Is the Wrong Tool

K-means assumes clusters are roughly round, similar in size, and that every row belongs somewhere. When that fails:

Situation	Better option
Mostly categorical features	K-modes / k-prototypes, or rethink the features
Irregular shapes, noise points that belong nowhere	DBSCAN (density-based; finds outliers as "no cluster")
You want a tree of nested segments	Hierarchical clustering (dendrogram)
Clusters of very different densities	HDBSCAN (adapts to varying density) or Gaussian Mixture Models

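DBSCAN, hierarchical clustering (AgglomerativeClustering), and Gaussian Mixture Models are in scikit-learn; K-modes and k-prototypes come from the separate kmodes package. The interfaces differ slightly (DBSCAN and AgglomerativeClustering offer fit_predict but no predict for new rows), but the scaling and profiling steps here transfer directly.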
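As an illustration of the density-based option, the sketch below runs DBSCAN on two interleaved half-moons plus three far-away points, a shape K-means would cut with a straight boundary. The data is synthetic, and eps and min_samples are guesses tuned to this toy set, not general defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons plus a few isolated points appended at the end
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
far_points = np.array([[3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
X_demo = StandardScaler().fit_transform(np.vstack([X_moons, far_points]))

# eps: neighborhood radius; min_samples: neighbors needed to form a dense region
db_labels = DBSCAN(eps=0.25, min_samples=5).fit_predict(X_demo)

# DBSCAN marks rows that belong to no dense region with the label -1
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print('clusters found:', n_clusters)
print('rows labeled as noise (-1):', int((db_labels == -1).sum()))
```

Note that there is no k to choose here; the number of clusters falls out of eps and min_samples, so those two parameters need the same kind of scrutiny that k gets in K-means.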

Common Mistakes

1. Skipping Scaling

The biggest-range feature silently becomes the only feature. Always apply StandardScaler (or MinMaxScaler) before K-means; it is distance-based, like KNN.

2. Clustering on IDs or Codes

station_code as a number tells K-means station 9 is "close to" station 8. Numeric-looking labels are still labels. Drop them or engineer real behavior features from them.

3. Picking k by Gut

"Let's do 5" bakes an arbitrary choice into every downstream decision. Run elbow + silhouette, then choose — the plots take four lines of code.

4. Shipping Unprofiled Clusters

Cluster 0/1/2/3 means nothing to anyone. The profile table with sizes, means, and names is the deliverable; the labels are an intermediate product.

5. Forgetting random_state

Reruns land on different (equally valid) solutions, segment counts shift, and the marketing list changes between Tuesday and Wednesday. Pin it.

6. Treating Clusters as Truth

Clusters are a compression of the features you chose — different features, different "truth". Validate that the segments behave differently on a metric outside the clustering features (e.g. retention next month) before building strategy on them.
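A validation of this kind can be as simple as a groupby on the outside metric. In the sketch below, retained_next_month is a hypothetical outcome column that was not used for clustering, and both the labels and the retention rates are invented so the demo has a pattern to find; the chi-square test is one reasonable check, not the only one.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
# Hypothetical labeled table with random cluster ids 0..3
demo = pd.DataFrame({'cluster': rng.integers(0, 4, size=2000)})
# Invented retention rates per cluster, only to give the demo a signal
true_rates = {0: 0.70, 1: 0.15, 2: 0.90, 3: 0.45}
demo['retained_next_month'] = rng.random(len(demo)) < demo['cluster'].map(true_rates)

# Retention by segment: do the clusters differ on a metric they were not built from?
check = demo.groupby('cluster').agg(
    customers      = ('retained_next_month', 'size'),
    retention_rate = ('retained_next_month', 'mean'),
).round(3)
print(check)

# Chi-square test of independence between cluster and retention
table = pd.crosstab(demo['cluster'], demo['retained_next_month'])
chi2, p_value, dof, expected = chi2_contingency(table)
print('p-value:', p_value)
```

If the retention rates are nearly identical across clusters, the segments may describe past behavior well and still be useless for the decision you care about.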

Upstream of this: the feature prep lives in Pandas Cheat Sheet for Data Analysts and the scaling concepts in scikit-learn Workflow for Data Analysts. Downstream: segment-level actions and tracking, as in RFM Segmentation with SQL & Python.
