Advanced SQL: Windows, CTEs & Tuning

2026-04-25 Data Analysis 5 min read

SQL

This notebook picks up where SQL Cheat Sheet for Data Analysts left off. The basics (SELECT, JOINs, GROUP BY, CTEs, window functions) are assumed. Each section here adds a technique that solves a new kind of problem.

All examples use the same parking system tables: parking_history, payment_history, management_table, and charging_records. Syntax is MySQL 8 unless a section says otherwise — for PostgreSQL swaps see the SQL Cheat Sheet for Data Analysts.

Window Function Frame Clause

The basic window functions in the SQL cheat sheet (ROW_NUMBER, LAG, SUM OVER) use a default frame. The frame clause lets you control exactly which rows go into each window calculation.

function() OVER (
    PARTITION BY <col>
    ORDER BY <col>
    ROWS BETWEEN <start> AND <end>
)

Frame Boundaries Reference

Frame	Meaning
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`	First row of the partition up to the current row → running total
`ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`	3-row trailing window → moving average
`ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING`	Centered 3-row window
`ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`	The whole partition
`RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW`	Last 7 days by value — today + the 6 days before it (not row count)

Examples

-- 3-month trailing moving average of revenue
SELECT station_code,
       DATE_FORMAT(paid_time, '%Y-%m')       AS month,
       SUM(amount_received)                  AS revenue,
       ROUND(AVG(SUM(amount_received)) OVER (
           PARTITION BY station_code
           ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ), 0)                                 AS moving_avg_3m
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');

-- 7-day rolling sum (by date value, not by row position)
SELECT DATE(paid_time)      AS day,
       SUM(amount_received) AS daily_rev,
       SUM(SUM(amount_received)) OVER (
           ORDER BY DATE(paid_time)
           RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
       )                    AS rolling_7d
FROM payment_history
GROUP BY DATE(paid_time);

ROWS BETWEEN counts rows. RANGE BETWEEN uses the actual value of the ORDER BY column. Use RANGE BETWEEN INTERVAL for calendar-based windows when the data may have gaps.

Without a frame clause, SUM() OVER (ORDER BY ...) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — note RANGE, not ROWS. When the ORDER BY column has ties, RANGE includes all the tied rows at once and the running total jumps in steps. Always write the frame on purpose so the intent is clear — usually ROWS is what you mean.

Advanced Window Functions

NTILE — Divide into Buckets

-- put each station into a revenue quartile
SELECT station_code,
       SUM(amount_received)                               AS revenue,
       NTILE(4) OVER (ORDER BY SUM(amount_received) DESC) AS quartile
FROM payment_history
GROUP BY station_code;

NTILE(n) cuts the rows into n equal-sized buckets. Quartile 1 = top 25%. Useful for tiering customers or stations when you don't know the exact percentile cut-offs in advance.

PERCENT_RANK — Relative Position (0 to 1)

SELECT station_code,
       SUM(amount_received)                                         AS revenue,
       ROUND(PERCENT_RANK() OVER (ORDER BY SUM(amount_received)), 3) AS pct_rank
FROM payment_history
GROUP BY station_code;

PERCENT_RANK() returns a value between 0 and 1: how many rows are below this row, as a fraction. A station at 0.80 beats 80% of all stations.

CUME_DIST — Cumulative Distribution

SELECT station_code,
       SUM(amount_received)                                        AS revenue,
       ROUND(CUME_DIST() OVER (ORDER BY SUM(amount_received)), 3) AS cume_dist
FROM payment_history
GROUP BY station_code;

CUME_DIST() returns the share of rows at or below the current row. Unlike PERCENT_RANK, it never returns 0 — even the lowest row has a value greater than 0.

Function	Range	Lowest row	Highest row
`PERCENT_RANK`	0 to 1	0.000	1.000
`CUME_DIST`	0+ to 1	1/n	1.000
`NTILE(4)`	1 to 4	4 (bottom bucket)	1 (top bucket)

FIRST_VALUE / LAST_VALUE

-- compare each station's monthly revenue to the best month in its history
SELECT station_code,
       DATE_FORMAT(paid_time, '%Y-%m')        AS month,
       SUM(amount_received)                   AS revenue,
       MAX(SUM(amount_received)) OVER (
           PARTITION BY station_code
           ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       )                                      AS all_time_max
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');

FIRST_VALUE and LAST_VALUE return the first or last value in a window frame. LAST_VALUE needs UNBOUNDED FOLLOWING in the frame. Without it, the frame ends at the current row by default.

Recursive CTEs

A recursive CTE calls itself to handle hierarchical or sequential data. The most common use is to build a date series without a stored calendar table.

WITH RECURSIVE <cte_name> AS (
    -- anchor: the starting row(s)
    <initial_select>
    UNION ALL
    -- recursive part: join back to cte_name to add the next row
    <recursive_select>
)
SELECT * FROM <cte_name>;

Generate a Date Series

WITH RECURSIVE date_series AS (
    SELECT DATE('2025-01-01') AS dt
    UNION ALL
    SELECT DATE_ADD(dt, INTERVAL 1 DAY)
    FROM date_series
    WHERE dt < '2025-12-31'
)
SELECT * FROM date_series;

Fill Gaps in Daily Revenue

WITH RECURSIVE date_series AS (
    SELECT DATE('2025-01-01') AS dt
    UNION ALL
    SELECT DATE_ADD(dt, INTERVAL 1 DAY)
    FROM date_series
    WHERE dt < '2025-12-31'
),
daily_revenue AS (
    SELECT DATE(paid_time) AS day, SUM(amount_received) AS revenue
    FROM payment_history
    WHERE paid_time BETWEEN '2025-01-01' AND '2025-12-31'
    GROUP BY DATE(paid_time)
)
SELECT ds.dt                            AS day,
       COALESCE(dr.revenue, 0)          AS revenue
FROM date_series ds
LEFT JOIN daily_revenue dr ON ds.dt = dr.day
ORDER BY ds.dt;

Without a date series, GROUP BY DATE(paid_time) only returns days that have records. The LEFT JOIN fills the gaps with 0.

Walk a Hierarchy (Org Chart / Category Tree)

WITH RECURSIVE org AS (
    -- anchor: top-level managers (no manager_id)
    SELECT employee_id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive: join each employee to their manager row in org
    SELECT e.employee_id, e.name, e.manager_id, org.level + 1
    FROM employees e
    JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org ORDER BY level, name;

level tracks how deep you are — level 1 = CEO, level 2 = direct reports, and so on. Add WHERE level <= 3 to limit the depth.

Advanced Grouping

ROLLUP — Subtotals and Grand Total

WITH ROLLUP adds summary rows on its own at each level of the GROUP BY hierarchy.

SELECT city,
       station_code,
       SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;

The result includes: - Revenue per city + station_code (normal rows) - Revenue per city subtotal (station_code = NULL) - Grand total (city = NULL, station_code = NULL)

Use GROUPING() to tell subtotal NULLs apart from real NULLs:

SELECT CASE WHEN GROUPING(city) = 1 THEN 'All Cities' ELSE city END AS city,
       CASE WHEN GROUPING(station_code) = 1 THEN 'Subtotal' ELSE station_code END AS station,
       SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;

GROUPING SETS — Custom Aggregation Levels

PostgreSQL / SQL Server / BigQuery syntax. MySQL does not support GROUPING SETS or CUBE — it only has WITH ROLLUP. Emulate with several GROUP BY queries joined by UNION ALL.

GROUPING SETS runs several GROUP BY combinations in one query — more flexible than ROLLUP.

SELECT city, payment_method, SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY GROUPING SETS (
    (city, payment_method),   -- by city + payment method
    (city),                   -- by city only
    ()                        -- grand total
);

Same as three separate GROUP BY queries joined with UNION ALL — but in one pass.

CUBE — All Combinations

CUBE builds every possible combination of the GROUP BY columns. Useful for multi-dimension reporting.

SELECT city, payment_method, YEAR(paid_time) AS year,
       SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY CUBE (city, payment_method, YEAR(paid_time));

With 3 columns, CUBE produces 2³ = 8 grouping combinations. Use GROUPING() to label the NULL rows.

Advanced String Functions

GROUP_CONCAT — Aggregate Strings

-- list every license plate per station as a comma-separated string
SELECT station_code,
       GROUP_CONCAT(DISTINCT license_plate ORDER BY license_plate SEPARATOR ', ') AS plates
FROM parking_history
GROUP BY station_code;

GROUP_CONCAT collapses many rows into one string within a group. SEPARATOR sets the delimiter (default is comma).

-- raise the length limit so big groups are not cut off
SET SESSION group_concat_max_len = 100000;

REGEXP — Pattern Matching

-- license plates that start with a letter and 2 digits
SELECT license_plate
FROM parking_history
WHERE license_plate REGEXP '^[A-Z][0-9]{2}';

-- plates that have exactly 3 uppercase letters, a dash, and 4 digits
WHERE license_plate REGEXP '^[A-Z]{3}-[0-9]{4}$'

Pattern	Matches
`^`	Start of the string
`$`	End of the string
`[A-Z]`	Any uppercase letter
`[0-9]`	Any digit
`{n}`	Exactly n repeats
`.`	Any single character
`*`	Zero or more of the previous character
`+`	One or more of the previous character

For the full pattern language — groups, alternation, greedy vs lazy, plus the Python and Pandas sides — see Regex for Analysts: Python, Pandas & SQL.

REPLACE, LOCATE, LPAD, RPAD

SELECT REPLACE(license_plate, '-', '')           AS plate_no_dash,   -- remove dashes
       LOCATE('-', license_plate)                AS dash_position,    -- position of '-'
       LPAD(station_code, 6, '0')               AS padded_code,      -- '12' → '000012'
       RPAD(station_name, 20, ' ')              AS padded_name       -- right-pad with spaces
FROM management_table;

Advanced Date Functions

SELECT entry_time,
       TIMESTAMPDIFF(MINUTE, entry_time, exit_time) AS duration_mins,   -- exact difference
       TIMESTAMPDIFF(MONTH,  '2024-01-01', entry_time) AS months_since,
       EXTRACT(QUARTER FROM entry_time)               AS quarter,        -- 1–4
       EXTRACT(WEEK    FROM entry_time)               AS week_of_year,
       LAST_DAY(entry_time)                           AS last_of_month,  -- last day of the month
       DATE_FORMAT(entry_time, '%W')                  AS weekday_name,   -- 'Monday', etc.
       DAYOFWEEK(entry_time)                          AS day_num         -- 1=Sun, 7=Sat
FROM parking_history;

First Day of the Month / Year

-- first day of the month
DATE_FORMAT(entry_time, '%Y-%m-01')

-- first day of the year
DATE_FORMAT(entry_time, '%Y-01-01')

Quarter Boundaries

SELECT CONCAT(YEAR(paid_time), ' Q', QUARTER(paid_time)) AS quarter,
       SUM(amount_received)                               AS revenue
FROM payment_history
GROUP BY YEAR(paid_time), QUARTER(paid_time)
ORDER BY 1;

Views

A view is a saved SELECT query that acts like a table. Use it to wrap a complex query you run often, or to give other people a simpler view of a table.

CREATE VIEW station_monthly_revenue AS
SELECT mt.station_code,
       mt.station_name,
       mt.city,
       DATE_FORMAT(pay.paid_time, '%Y-%m') AS month,
       SUM(pay.amount_received)            AS revenue,
       COUNT(*)                            AS transactions
FROM payment_history pay
JOIN management_table mt ON pay.station_code = mt.station_code
GROUP BY mt.station_code, mt.station_name, mt.city,
         DATE_FORMAT(pay.paid_time, '%Y-%m');

Query it just like a table:

SELECT * FROM station_monthly_revenue WHERE city = 'Taipei' AND month = '2025-06';

Update or drop:

CREATE OR REPLACE VIEW station_monthly_revenue AS <new_query>;
DROP VIEW station_monthly_revenue;

Views do not store data. They run the underlying query each time. For speed on large datasets, use a materialized view or a real summary table instead.

Temporary Tables

Temporary tables exist only for the current session and are dropped on their own when the connection closes. Use them when a CTE would be referenced many times, or when you need to build a result step by step.

-- create a temp table
CREATE TEMPORARY TABLE tmp_station_stats AS
SELECT station_code,
       COUNT(*)           AS total_visits,
       SUM(amount_received) AS total_revenue
FROM payment_history
GROUP BY station_code;

-- use it like a normal table
SELECT mt.station_name, t.total_revenue
FROM tmp_station_stats t
JOIN management_table mt ON t.station_code = mt.station_code
WHERE t.total_visits > 500;

-- drop it on purpose when you are done (optional — auto-dropped at session end)
DROP TEMPORARY TABLE tmp_station_stats;

	CTE	Temp Table
Scope	One query	The whole session
Reuse	Same query only	Several queries
Indexable	No	Yes
Best for	Readable single queries	Multi-step pipelines

Query Performance & EXPLAIN

EXPLAIN shows how MySQL runs a query — which indexes it uses, how many rows it scans, and where the slow points are.

EXPLAIN SELECT ph.license_plate, pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
WHERE ph.entry_time >= '2025-01-01';

Key Columns in EXPLAIN Output

Column	What to look for
`type`	`const` / `ref` / `range` = fast; `ALL` = full table scan (slow)
`key`	Which index is used. NULL = no index used
`rows`	Estimated rows scanned — lower is better
`Extra`	`Using filesort` or `Using temporary` = a possible bottleneck

Use EXPLAIN ANALYZE for Real Runtimes (MySQL 8+)

EXPLAIN ANALYZE SELECT ...;

Shows real rows, loops, and time per step — more reliable than the estimated rows column.

Index Basics

-- create an index on a column you filter on often
CREATE INDEX idx_entry_time ON parking_history(entry_time);

-- composite index: useful when you filter on both columns together
CREATE INDEX idx_station_time ON parking_history(station_code, entry_time);

Index tips for analysts: - Index columns used in WHERE, JOIN ON, and ORDER BY. - A composite index (a, b) helps queries that filter on a alone or on a AND b, but not on b alone. - Too many indexes slow down writes — only add the ones you actually use.

← Back to Blog