RFM Analysis Notebook
RFM is the most practical customer segmentation framework in DA work. It uses three metrics from transaction history to classify customers by behaviour — no survey, no assumptions, just what customers actually did.
This notebook covers two versions: - License plate version — for parking data without a membership system - Member ID version — for businesses with an app or registered accounts (more accurate)
All examples use the parking system tables: parking_history, payment_history, and member_vehicles (the app membership table).
See SQL Patterns for the building blocks used here, and Pandas Notebook for the Python data manipulation layer.
What is RFM?
| Metric | Question | Parking definition |
|---|---|---|
| Recency (R) | How recently did this customer transact? | Days since last paid parking session |
| Frequency (F) | How often do they transact? | Number of paid sessions in the analysis window |
| Monetary (M) | How much have they spent? | Total amount paid in the analysis window |
A customer who parked yesterday (R=high), parks 3 times a week (F=high), and spends consistently (M=high) is your most valuable segment. A customer who used to park regularly but disappeared three months ago is a churn risk worth targeting.
Version 1 — License Plate Based
Use this when there is no membership system. Each license plate is treated as one customer.
Step 1 — Calculate Raw RFM Metrics in SQL
-- analysis window: last 12 months
-- reference date: today (or a fixed reporting date)
WITH paid_sessions AS (
SELECT
ph.license_plate,
ph.parking_id,
pay.paid_time,
pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
)
SELECT
license_plate,
DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
COUNT(*) AS frequency,
SUM(amount_received) AS monetary
FROM paid_sessions
GROUP BY license_plate;
Step 2 — Score Each Metric (1 to 5)
Assign a score of 1–5 to each metric using NTILE(5). Score 5 = best.
- Recency: lower days = better → score 5 for the most recent customers
- Frequency: higher count = better → score 5 for the most frequent customers
- Monetary: higher spend = better → score 5 for the highest spenders
WITH paid_sessions AS (
SELECT ph.license_plate, pay.paid_time, pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
),
rfm_raw AS (
SELECT
license_plate,
DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
COUNT(*) AS frequency,
SUM(amount_received) AS monetary
FROM paid_sessions
GROUP BY license_plate
)
SELECT
license_plate,
recency_days,
frequency,
monetary,
-- recency: fewer days = better = score 5
6 - NTILE(5) OVER (ORDER BY recency_days ASC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM rfm_raw;
Step 3 — Assign Segments
WITH rfm_scored AS (
-- ... (previous CTE) ...
)
SELECT
license_plate,
r_score, f_score, m_score,
CONCAT(r_score, f_score, m_score) AS rfm_code,
CASE
WHEN r_score >= 4 AND f_score >= 4 THEN 'Champions'
WHEN r_score >= 3 AND f_score >= 3 THEN 'Loyal'
WHEN r_score >= 4 AND f_score <= 2 THEN 'New Customers'
WHEN r_score <= 2 AND f_score >= 3 THEN 'At Risk'
WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost'
ELSE 'Potential Loyalists'
END AS segment
FROM rfm_scored;
Version 2 — Member ID Based (APP Users)
When customers register via an app, one person can have multiple license plates. Use member_id as the customer key — it follows the person, not the vehicle.
Assumed table: member_vehicles(member_id, license_plate) — links one member to one or more plates.
Step 1 — Join to Member Table
WITH member_sessions AS (
SELECT
mv.member_id,
pay.paid_time,
pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
JOIN member_vehicles mv ON ph.license_plate = mv.license_plate
WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
),
rfm_raw AS (
SELECT
member_id,
DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
COUNT(*) AS frequency,
SUM(amount_received) AS monetary
FROM member_sessions
GROUP BY member_id
)
SELECT
member_id,
recency_days,
frequency,
monetary,
6 - NTILE(5) OVER (ORDER BY recency_days ASC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM rfm_raw;
Compared to the license plate version: - A member who changed cars is still recognised as the same customer - All parking sessions across different stations are attributed to one person - You can send push notifications directly to the member account after segmentation
Python Implementation
Use Python when you need to iterate on scoring logic, visualise distributions, or feed segments into downstream analysis.
import pandas as pd
import numpy as np
from datetime import date
# load raw data (already joined in SQL or loaded as CSV)
df = pd.read_csv('parking_payments.csv', parse_dates=['paid_time'])
reference_date = pd.Timestamp(date.today())
analysis_start = reference_date - pd.DateOffset(months=12)
df = df[df['paid_time'] >= analysis_start]
Calculate Raw RFM Metrics
rfm = df.groupby('member_id').agg(
last_visit = ('paid_time', 'max'),
frequency = ('parking_id', 'count'),
monetary = ('amount_received', 'sum')
).reset_index()
rfm['recency_days'] = (reference_date - rfm['last_visit']).dt.days
rfm = rfm.drop(columns='last_visit')
Score Each Metric
rfm['r_score'] = pd.qcut(rfm['recency_days'], q=5,
labels=[5, 4, 3, 2, 1]) # lower days = better = score 5
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), q=5,
labels=[1, 2, 3, 4, 5])
rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), q=5,
labels=[1, 2, 3, 4, 5])
rfm[['r_score', 'f_score', 'm_score']] = rfm[['r_score', 'f_score', 'm_score']].astype(int)
rfm['rfm_score'] = rfm['r_score'] + rfm['f_score'] + rfm['m_score']
pd.qcut divides data into equal-sized buckets. Use rank(method='first') to break ties before passing to qcut — without it, duplicate values cause unequal bucket sizes.
Assign Segments
def assign_segment(row):
r, f = row['r_score'], row['f_score']
if r >= 4 and f >= 4:
return 'Champions'
elif r >= 3 and f >= 3:
return 'Loyal'
elif r >= 4 and f <= 2:
return 'New Customers'
elif r <= 2 and f >= 3:
return 'At Risk'
elif r <= 2 and f <= 2:
return 'Lost'
else:
return 'Potential Loyalists'
rfm['segment'] = rfm.apply(assign_segment, axis=1)
Summarise Segments
summary = rfm.groupby('segment').agg(
customer_count = ('member_id', 'count'),
avg_recency = ('recency_days', 'mean'),
avg_frequency = ('frequency', 'mean'),
avg_monetary = ('monetary', 'mean'),
total_revenue = ('monetary', 'sum')
).round(1).sort_values('total_revenue', ascending=False)
print(summary)
Customer Segments and Business Actions
| Segment | Profile | Action |
|---|---|---|
| Champions | Parked recently, frequently, spent most | Reward with loyalty points or priority space |
| Loyal | Regular visitors, consistent spend | Promote monthly pass to lock in revenue |
| Potential Loyalists | Recent, moderate frequency | Send welcome offer to build habit |
| New Customers | Recent first-timers, low frequency | Onboarding push — guide second and third visit |
| At Risk | Used to be frequent, haven't been back | Win-back campaign: limited discount |
| Lost | Low recency, low frequency | Low-cost reactivation or write off |
Tracking Segments Over Time
Run RFM monthly and compare how customers move between segments — this is more useful than a one-time snapshot.
# add a reporting month column to each run
rfm['report_month'] = pd.Timestamp(date.today()).to_period('M')
# save results
rfm.to_csv(f"rfm_{date.today().strftime('%Y%m')}.csv", index=False)
-- compare segment size month over month
SELECT
report_month,
segment,
COUNT(*) AS customer_count
FROM rfm_history
GROUP BY report_month, segment
ORDER BY report_month, segment;
Key movements to monitor: - Champions → At Risk: early churn signal - New Customers → Potential Loyalists: onboarding is working - At Risk → Lost: win-back campaign is too late
Common Mistakes
1. Including Sessions with No Payment
Always join parking_history to payment_history before calculating RFM. Sessions that ended without payment distort recency and frequency.
2. Choosing the Wrong Analysis Window
Too short (1 month) misses seasonal customers. Too long (3 years) dilutes recent behaviour. 12 months is the standard starting point — adjust if your business has strong seasonality.
3. Using Equal Score Boundaries Instead of Percentiles
Setting R score 5 = "parked within 7 days" is fragile — it depends on your data's distribution and breaks when traffic changes. NTILE and pd.qcut adapt to the actual data distribution automatically.
4. Treating All Segments as Equally Important
Segments with high monetary value deserve more investment. Always weight business actions by revenue impact, not just customer count.