RFM Analysis Notebook

RFM is the most practical customer segmentation framework in DA work. It uses three metrics from transaction history to classify customers by behaviour — no survey, no assumptions, just what customers actually did.

This notebook covers two versions: - License plate version — for parking data without a membership system - Member ID version — for businesses with an app or registered accounts (more accurate)

All examples use the parking system tables: parking_history, payment_history, and member_vehicles (the app membership table).

See SQL Patterns for the building blocks used here, and Pandas Notebook for the Python data manipulation layer.


What is RFM?

Metric Question Parking definition
Recency (R) How recently did this customer transact? Days since last paid parking session
Frequency (F) How often do they transact? Number of paid sessions in the analysis window
Monetary (M) How much have they spent? Total amount paid in the analysis window

A customer who parked yesterday (R=high), parks 3 times a week (F=high), and spends consistently (M=high) is your most valuable segment. A customer who used to park regularly but disappeared three months ago is a churn risk worth targeting.


Version 1 — License Plate Based

Use this when there is no membership system. Each license plate is treated as one customer.

Step 1 — Calculate Raw RFM Metrics in SQL

-- analysis window: last 12 months
-- reference date: today (or a fixed reporting date)
WITH paid_sessions AS (
    SELECT
        ph.license_plate,
        ph.parking_id,
        pay.paid_time,
        pay.amount_received
    FROM parking_history ph
    JOIN payment_history pay ON ph.parking_id = pay.parking_id
    WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
)
SELECT
    license_plate,
    DATEDIFF(CURDATE(), MAX(paid_time))   AS recency_days,
    COUNT(*)                              AS frequency,
    SUM(amount_received)                  AS monetary
FROM paid_sessions
GROUP BY license_plate;

Step 2 — Score Each Metric (1 to 5)

Assign a score of 1–5 to each metric using NTILE(5). Score 5 = best.

  • Recency: lower days = better → score 5 for the most recent customers
  • Frequency: higher count = better → score 5 for the most frequent customers
  • Monetary: higher spend = better → score 5 for the highest spenders
WITH paid_sessions AS (
    SELECT ph.license_plate, pay.paid_time, pay.amount_received
    FROM parking_history ph
    JOIN payment_history pay ON ph.parking_id = pay.parking_id
    WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
),
rfm_raw AS (
    SELECT
        license_plate,
        DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
        COUNT(*)                            AS frequency,
        SUM(amount_received)                AS monetary
    FROM paid_sessions
    GROUP BY license_plate
)
SELECT
    license_plate,
    recency_days,
    frequency,
    monetary,
    -- recency: fewer days = better = score 5
    6 - NTILE(5) OVER (ORDER BY recency_days ASC)  AS r_score,
    NTILE(5) OVER (ORDER BY frequency ASC)          AS f_score,
    NTILE(5) OVER (ORDER BY monetary ASC)           AS m_score
FROM rfm_raw;

Step 3 — Assign Segments

WITH rfm_scored AS (
    -- ... (previous CTE) ...
)
SELECT
    license_plate,
    r_score, f_score, m_score,
    CONCAT(r_score, f_score, m_score) AS rfm_code,
    CASE
        WHEN r_score >= 4 AND f_score >= 4             THEN 'Champions'
        WHEN r_score >= 3 AND f_score >= 3             THEN 'Loyal'
        WHEN r_score >= 4 AND f_score <= 2             THEN 'New Customers'
        WHEN r_score <= 2 AND f_score >= 3             THEN 'At Risk'
        WHEN r_score <= 2 AND f_score <= 2             THEN 'Lost'
        ELSE                                                'Potential Loyalists'
    END AS segment
FROM rfm_scored;

Version 2 — Member ID Based (APP Users)

When customers register via an app, one person can have multiple license plates. Use member_id as the customer key — it follows the person, not the vehicle.

Assumed table: member_vehicles(member_id, license_plate) — links one member to one or more plates.

Step 1 — Join to Member Table

WITH member_sessions AS (
    SELECT
        mv.member_id,
        pay.paid_time,
        pay.amount_received
    FROM parking_history ph
    JOIN payment_history pay   ON ph.parking_id   = pay.parking_id
    JOIN member_vehicles mv    ON ph.license_plate = mv.license_plate
    WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
),
rfm_raw AS (
    SELECT
        member_id,
        DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
        COUNT(*)                            AS frequency,
        SUM(amount_received)                AS monetary
    FROM member_sessions
    GROUP BY member_id
)
SELECT
    member_id,
    recency_days,
    frequency,
    monetary,
    6 - NTILE(5) OVER (ORDER BY recency_days ASC)  AS r_score,
    NTILE(5) OVER (ORDER BY frequency ASC)          AS f_score,
    NTILE(5) OVER (ORDER BY monetary ASC)           AS m_score
FROM rfm_raw;

Compared to the license plate version: - A member who changed cars is still recognised as the same customer - All parking sessions across different stations are attributed to one person - You can send push notifications directly to the member account after segmentation


Python Implementation

Use Python when you need to iterate on scoring logic, visualise distributions, or feed segments into downstream analysis.

import pandas as pd
import numpy as np
from datetime import date

# load raw data (already joined in SQL or loaded as CSV)
df = pd.read_csv('parking_payments.csv', parse_dates=['paid_time'])

reference_date = pd.Timestamp(date.today())
analysis_start = reference_date - pd.DateOffset(months=12)
df = df[df['paid_time'] >= analysis_start]

Calculate Raw RFM Metrics

rfm = df.groupby('member_id').agg(
    last_visit  = ('paid_time',       'max'),
    frequency   = ('parking_id',      'count'),
    monetary    = ('amount_received',  'sum')
).reset_index()

rfm['recency_days'] = (reference_date - rfm['last_visit']).dt.days
rfm = rfm.drop(columns='last_visit')

Score Each Metric

rfm['r_score'] = pd.qcut(rfm['recency_days'], q=5,
                          labels=[5, 4, 3, 2, 1])   # lower days = better = score 5
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), q=5,
                          labels=[1, 2, 3, 4, 5])
rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), q=5,
                          labels=[1, 2, 3, 4, 5])

rfm[['r_score', 'f_score', 'm_score']] = rfm[['r_score', 'f_score', 'm_score']].astype(int)
rfm['rfm_score'] = rfm['r_score'] + rfm['f_score'] + rfm['m_score']

pd.qcut divides data into equal-sized buckets. Use rank(method='first') to break ties before passing to qcut — without it, duplicate values cause unequal bucket sizes.

Assign Segments

def assign_segment(row):
    r, f = row['r_score'], row['f_score']
    if r >= 4 and f >= 4:
        return 'Champions'
    elif r >= 3 and f >= 3:
        return 'Loyal'
    elif r >= 4 and f <= 2:
        return 'New Customers'
    elif r <= 2 and f >= 3:
        return 'At Risk'
    elif r <= 2 and f <= 2:
        return 'Lost'
    else:
        return 'Potential Loyalists'

rfm['segment'] = rfm.apply(assign_segment, axis=1)

Summarise Segments

summary = rfm.groupby('segment').agg(
    customer_count = ('member_id',     'count'),
    avg_recency    = ('recency_days',  'mean'),
    avg_frequency  = ('frequency',     'mean'),
    avg_monetary   = ('monetary',      'mean'),
    total_revenue  = ('monetary',      'sum')
).round(1).sort_values('total_revenue', ascending=False)

print(summary)

Customer Segments and Business Actions

Segment Profile Action
Champions Parked recently, frequently, spent most Reward with loyalty points or priority space
Loyal Regular visitors, consistent spend Promote monthly pass to lock in revenue
Potential Loyalists Recent, moderate frequency Send welcome offer to build habit
New Customers Recent first-timers, low frequency Onboarding push — guide second and third visit
At Risk Used to be frequent, haven't been back Win-back campaign: limited discount
Lost Low recency, low frequency Low-cost reactivation or write off

Tracking Segments Over Time

Run RFM monthly and compare how customers move between segments — this is more useful than a one-time snapshot.

# add a reporting month column to each run
rfm['report_month'] = pd.Timestamp(date.today()).to_period('M')

# save results
rfm.to_csv(f"rfm_{date.today().strftime('%Y%m')}.csv", index=False)
-- compare segment size month over month
SELECT
    report_month,
    segment,
    COUNT(*) AS customer_count
FROM rfm_history
GROUP BY report_month, segment
ORDER BY report_month, segment;

Key movements to monitor: - Champions → At Risk: early churn signal - New Customers → Potential Loyalists: onboarding is working - At Risk → Lost: win-back campaign is too late


Common Mistakes

1. Including Sessions with No Payment

Always join parking_history to payment_history before calculating RFM. Sessions that ended without payment distort recency and frequency.

2. Choosing the Wrong Analysis Window

Too short (1 month) misses seasonal customers. Too long (3 years) dilutes recent behaviour. 12 months is the standard starting point — adjust if your business has strong seasonality.

3. Using Equal Score Boundaries Instead of Percentiles

Setting R score 5 = "parked within 7 days" is fragile — it depends on your data's distribution and breaks when traffic changes. NTILE and pd.qcut adapt to the actual data distribution automatically.

4. Treating All Segments as Equally Important

Segments with high monetary value deserve more investment. Always weight business actions by revenue impact, not just customer count.