RFM Segmentation with SQL & Python

2026-04-25 Data Analysis 4 min read

SQL Python

RFM is the most practical way to group customers by behavior. It uses three numbers from each customer's transactions to sort them into segments. No survey, no guessing — just what they actually did.

This notebook covers two versions: - License plate version — for parking data with no membership system - Member ID version — for businesses with an app or registered accounts (more accurate)

All examples use parking system tables: parking_history, payment_history, and member_vehicles (the app membership table).

See SQL Patterns for Analysts & Interviews for the building blocks used here, and Pandas Cheat Sheet for Data Analysts for the Python data work.

What is RFM?

Metric	Question	Parking definition
Recency (R)	How recently did this customer pay?	Days since last paid parking session
Frequency (F)	How often do they pay?	Number of paid sessions in the analysis window
Monetary (M)	How much have they spent?	Total amount paid in the analysis window

A customer who parked yesterday (R=high), parks 3 times a week (F=high), and spends a lot (M=high) is your most valuable group. A customer who used to park often but stopped three months ago is a churn risk worth targeting.

Version 1 — License Plate Based

Use this when there is no membership system. Each license plate counts as one customer.

Step 1 — Calculate Raw RFM Metrics in SQL

-- analysis window: last 12 months
-- reference date: today (or a fixed reporting date)
WITH paid_sessions AS (
    SELECT
        ph.license_plate,
        ph.parking_id,
        pay.paid_time,
        pay.amount_received
    FROM parking_history ph
    JOIN payment_history pay ON ph.parking_id = pay.parking_id
    WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
)
SELECT
    license_plate,
    DATEDIFF(CURDATE(), MAX(paid_time))   AS recency_days,
    COUNT(*)                              AS frequency,
    SUM(amount_received)                  AS monetary
FROM paid_sessions
GROUP BY license_plate;

Step 2 — Score Each Metric (1 to 5)

Give each metric a score from 1 to 5 with NTILE(5). Score 5 = best.

Recency: fewer days = better → score 5 for the most recent customers
Frequency: higher count = better → score 5 for the most frequent customers
Monetary: higher spend = better → score 5 for the highest spenders

WITH paid_sessions AS (
    SELECT ph.license_plate, pay.paid_time, pay.amount_received
    FROM parking_history ph
    JOIN payment_history pay ON ph.parking_id = pay.parking_id
    WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
),
rfm_raw AS (
    SELECT
        license_plate,
        DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
        COUNT(*)                            AS frequency,
        SUM(amount_received)                AS monetary
    FROM paid_sessions
    GROUP BY license_plate
)
SELECT
    license_plate,
    recency_days,
    frequency,
    monetary,
    -- recency: fewer days = better = score 5
    6 - NTILE(5) OVER (ORDER BY recency_days ASC)  AS r_score,
    NTILE(5) OVER (ORDER BY frequency ASC)          AS f_score,
    NTILE(5) OVER (ORDER BY monetary ASC)           AS m_score
FROM rfm_raw;

Step 3 — Assign Segments

WITH rfm_scored AS (
    -- ... (previous CTE) ...
)
SELECT
    license_plate,
    r_score, f_score, m_score,
    CONCAT(r_score, f_score, m_score) AS rfm_code,
    CASE
        WHEN r_score >= 4 AND f_score >= 4             THEN 'Champions'
        WHEN r_score >= 3 AND f_score >= 3             THEN 'Loyal'
        WHEN r_score >= 4 AND f_score <= 2             THEN 'New Customers'
        WHEN r_score <= 2 AND f_score >= 3             THEN 'At Risk'
        WHEN r_score <= 2 AND f_score <= 2             THEN 'Lost'
        ELSE                                                'Potential Loyalists'
    END AS segment
FROM rfm_scored;

Version 2 — Member ID Based (APP Users)

When customers sign up through an app, one person can own several license plates. Use member_id as the customer key — it follows the person, not the car.

Assumed table: member_vehicles(member_id, license_plate) — links one member to one or more plates.

Step 1 — Join to Member Table

WITH member_sessions AS (
    SELECT
        mv.member_id,
        pay.paid_time,
        pay.amount_received
    FROM parking_history ph
    JOIN payment_history pay   ON ph.parking_id   = pay.parking_id
    JOIN member_vehicles mv    ON ph.license_plate = mv.license_plate
    WHERE pay.paid_time >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
),
rfm_raw AS (
    SELECT
        member_id,
        DATEDIFF(CURDATE(), MAX(paid_time)) AS recency_days,
        COUNT(*)                            AS frequency,
        SUM(amount_received)                AS monetary
    FROM member_sessions
    GROUP BY member_id
)
SELECT
    member_id,
    recency_days,
    frequency,
    monetary,
    6 - NTILE(5) OVER (ORDER BY recency_days ASC)  AS r_score,
    NTILE(5) OVER (ORDER BY frequency ASC)          AS f_score,
    NTILE(5) OVER (ORDER BY monetary ASC)           AS m_score
FROM rfm_raw;

Compared to the license plate version: - A member who changed cars is still seen as the same customer - All parking sessions across stations are linked to one person - You can send push notifications to the member account after segmentation

Python Implementation

Use Python when you need to try out different scoring rules, draw distribution charts, or feed segments into the next step.

import pandas as pd
import numpy as np
from datetime import date

# load raw data (already joined in SQL or loaded as CSV)
df = pd.read_csv('parking_payments.csv', parse_dates=['paid_time'])

reference_date = pd.Timestamp(date.today())
analysis_start = reference_date - pd.DateOffset(months=12)
df = df[df['paid_time'] >= analysis_start]

Calculate Raw RFM Metrics

rfm = df.groupby('member_id').agg(
    last_visit  = ('paid_time',       'max'),
    frequency   = ('parking_id',      'count'),
    monetary    = ('amount_received',  'sum')
).reset_index()

rfm['recency_days'] = (reference_date - rfm['last_visit']).dt.days
rfm = rfm.drop(columns='last_visit')

Score Each Metric

rfm['r_score'] = pd.qcut(rfm['recency_days'].rank(method='first'), q=5,
                          labels=[5, 4, 3, 2, 1])   # fewer days = better = score 5
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), q=5,
                          labels=[1, 2, 3, 4, 5])
rfm['m_score'] = pd.qcut(rfm['monetary'].rank(method='first'), q=5,
                          labels=[1, 2, 3, 4, 5])

rfm[['r_score', 'f_score', 'm_score']] = rfm[['r_score', 'f_score', 'm_score']].astype(int)
rfm['rfm_score'] = rfm['r_score'] + rfm['f_score'] + rfm['m_score']

pd.qcut cuts the data into equal-sized buckets. Always pass rank(method='first') instead of the raw values — repeated values create duplicate bin edges, which makes qcut raise an error or produce uneven buckets. This applies to all three scores.

Assign Segments

def assign_segment(row):
    r, f = row['r_score'], row['f_score']
    if r >= 4 and f >= 4:
        return 'Champions'
    elif r >= 3 and f >= 3:
        return 'Loyal'
    elif r >= 4 and f <= 2:
        return 'New Customers'
    elif r <= 2 and f >= 3:
        return 'At Risk'
    elif r <= 2 and f <= 2:
        return 'Lost'
    else:
        return 'Potential Loyalists'

rfm['segment'] = rfm.apply(assign_segment, axis=1)

Summarise Segments

summary = rfm.groupby('segment').agg(
    customer_count = ('member_id',     'count'),
    avg_recency    = ('recency_days',  'mean'),
    avg_frequency  = ('frequency',     'mean'),
    avg_monetary   = ('monetary',      'mean'),
    total_revenue  = ('monetary',      'sum')
).round(1).sort_values('total_revenue', ascending=False)

print(summary)

Customer Segments and Business Actions

Segment	Profile	Action
Champions	Parked recently, often, spent the most	Reward with loyalty points or priority space
Loyal	Regular visitors, steady spend	Promote a monthly pass to lock in revenue
Potential Loyalists	Recent, medium frequency	Send a welcome offer to build the habit
New Customers	Recent first-timers, low frequency	Onboarding push — help them come back a second and third time
At Risk	Used to come often, haven't been back	Win-back campaign: a limited discount
Lost	Low recency, low frequency	Cheap reactivation try, or write off

Tracking Segments Over Time

Run RFM every month and watch how customers move between segments. This is more useful than a one-time snapshot.

# add a reporting month column to each run
rfm['report_month'] = pd.Timestamp(date.today()).to_period('M')

# save results
rfm.to_csv(f"rfm_{date.today().strftime('%Y%m')}.csv", index=False)

-- compare segment size month over month
SELECT
    report_month,
    segment,
    COUNT(*) AS customer_count
FROM rfm_history
GROUP BY report_month, segment
ORDER BY report_month, segment;

Movements to watch: - Champions → At Risk: an early sign of churn - New Customers → Potential Loyalists: your onboarding is working - At Risk → Lost: your win-back campaign was too late

Common Mistakes

1. Including Sessions with No Payment

Always join parking_history to payment_history before calculating RFM. Sessions that ended without payment will mess up recency and frequency.

2. Choosing the Wrong Analysis Window

Too short (1 month) misses seasonal customers. Too long (3 years) hides recent changes in behavior. 12 months is the standard starting point. Adjust it if your business has a strong seasonal pattern.

3. Using Fixed Score Boundaries Instead of Percentiles

Setting R score 5 = "parked within 7 days" is fragile. It depends on your current data and breaks when traffic changes. NTILE and pd.qcut follow the actual data distribution, so they adjust on their own.

4. Treating All Segments as Equally Important

Segments with high monetary value deserve more attention. Always weigh business actions by revenue impact, not just by customer count.

RFM draws the segment boundaries with rules you define. To let the data draw them instead — and to discover segments the rules miss — run K-means on the same R/F/M columns: see K-means Clustering for Customer Segmentation.

← Back to Blog