SQL Advanced Notebook

This notebook picks up where SQL Notes left off. The basics (SELECT, JOINs, GROUP BY, CTEs, window functions) are assumed knowledge. Every section here adds a technique that unlocks a new class of problem.

All examples use the same parking system tables: parking_history, payment_history, management_table, and charging_records.


Window Function Frame Clause

The basic window functions in SQL Notes (ROW_NUMBER, LAG, SUM OVER) use a default frame. The frame clause lets you control exactly which rows are included in each window calculation.

function() OVER (
    PARTITION BY <col>
    ORDER BY <col>
    ROWS BETWEEN <start> AND <end>
)

Frame Boundaries Reference

Frame Meaning
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW First row of partition up to current → running total
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW 3-row trailing window → moving average
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING Centered 3-row window
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING Entire partition
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW Last 7 days by value (not row count)

Examples

-- 3-month trailing moving average of revenue
SELECT station_code,
       DATE_FORMAT(paid_time, '%Y-%m')       AS month,
       SUM(amount_received)                  AS revenue,
       ROUND(AVG(SUM(amount_received)) OVER (
           PARTITION BY station_code
           ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ), 0)                                 AS moving_avg_3m
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');
-- 7-day rolling sum (by date value, not row position)
SELECT DATE(paid_time)      AS day,
       SUM(amount_received) AS daily_rev,
       SUM(SUM(amount_received)) OVER (
           ORDER BY DATE(paid_time)
           RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
       )                    AS rolling_7d
FROM payment_history
GROUP BY DATE(paid_time);

ROWS BETWEEN counts rows; RANGE BETWEEN uses the ORDER BY column's actual value. Use RANGE BETWEEN INTERVAL for calendar-based windows where data may have gaps.

Without a frame clause, SUM() OVER (ORDER BY ...) defaults to UNBOUNDED PRECEDING AND CURRENT ROW (running total). Always write it explicitly to make the intent clear.


Advanced Window Functions

NTILE — Divide into Buckets

-- assign each station to a revenue quartile
SELECT station_code,
       SUM(amount_received)                               AS revenue,
       NTILE(4) OVER (ORDER BY SUM(amount_received) DESC) AS quartile
FROM payment_history
GROUP BY station_code;

NTILE(n) splits rows into n equal-sized buckets. Quartile 1 = top 25%. Useful for tiering customers or stations without knowing the exact percentile thresholds in advance.

PERCENT_RANK — Relative Position (0 to 1)

SELECT station_code,
       SUM(amount_received)                                         AS revenue,
       ROUND(PERCENT_RANK() OVER (ORDER BY SUM(amount_received)), 3) AS pct_rank
FROM payment_history
GROUP BY station_code;

PERCENT_RANK() returns a value between 0 and 1: how many rows fall below this row, as a fraction. A station at 0.80 outperforms 80% of all stations.

CUME_DIST — Cumulative Distribution

SELECT station_code,
       SUM(amount_received)                                        AS revenue,
       ROUND(CUME_DIST() OVER (ORDER BY SUM(amount_received)), 3) AS cume_dist
FROM payment_history
GROUP BY station_code;

CUME_DIST() returns the fraction of rows at or below the current row. Unlike PERCENT_RANK, it never returns 0 — the lowest row still has a value > 0.

Function Range Lowest row Highest row
PERCENT_RANK 0 to 1 0.000 1.000
CUME_DIST 0+ to 1 1/n 1.000
NTILE(4) 1 to 4 4 (bottom bucket) 1 (top bucket)

FIRST_VALUE / LAST_VALUE

-- compare each station's monthly revenue to the best month in its history
SELECT station_code,
       DATE_FORMAT(paid_time, '%Y-%m')        AS month,
       SUM(amount_received)                   AS revenue,
       MAX(SUM(amount_received)) OVER (
           PARTITION BY station_code
           ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       )                                      AS all_time_max
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');

FIRST_VALUE and LAST_VALUE return the first or last value within a window frame. LAST_VALUE requires UNBOUNDED FOLLOWING in the frame — without it, the frame ends at the current row by default.


Recursive CTEs

A recursive CTE calls itself to process hierarchical or sequential data — most commonly to generate a date series without a stored calendar table.

WITH RECURSIVE <cte_name> AS (
    -- anchor: the starting row(s)
    <initial_select>
    UNION ALL
    -- recursive part: join back to cte_name to add the next row
    <recursive_select>
)
SELECT * FROM <cte_name>;

Generate a Date Series

WITH RECURSIVE date_series AS (
    SELECT DATE('2025-01-01') AS dt
    UNION ALL
    SELECT DATE_ADD(dt, INTERVAL 1 DAY)
    FROM date_series
    WHERE dt < '2025-12-31'
)
SELECT * FROM date_series;

Fill Gaps in Daily Revenue

WITH RECURSIVE date_series AS (
    SELECT DATE('2025-01-01') AS dt
    UNION ALL
    SELECT DATE_ADD(dt, INTERVAL 1 DAY)
    FROM date_series
    WHERE dt < '2025-12-31'
),
daily_revenue AS (
    SELECT DATE(paid_time) AS day, SUM(amount_received) AS revenue
    FROM payment_history
    WHERE paid_time BETWEEN '2025-01-01' AND '2025-12-31'
    GROUP BY DATE(paid_time)
)
SELECT ds.dt                            AS day,
       COALESCE(dr.revenue, 0)          AS revenue
FROM date_series ds
LEFT JOIN daily_revenue dr ON ds.dt = dr.day
ORDER BY ds.dt;

Without a date series, GROUP BY DATE(paid_time) only returns days that have records. The LEFT JOIN fills the gaps with 0.

Traverse a Hierarchy (Org Chart / Category Tree)

WITH RECURSIVE org AS (
    -- anchor: top-level managers (no manager_id)
    SELECT employee_id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive: join each employee to their manager row in org
    SELECT e.employee_id, e.name, e.manager_id, org.level + 1
    FROM employees e
    JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org ORDER BY level, name;

level tracks depth — level 1 = CEO, level 2 = direct reports, and so on. Add WHERE level <= 3 to limit depth.


Advanced Grouping

ROLLUP — Subtotals and Grand Total

WITH ROLLUP adds summary rows automatically at each level of the GROUP BY hierarchy.

SELECT city,
       station_code,
       SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;

Result includes: - Revenue per city + station_code (normal rows) - Revenue per city subtotal (station_code = NULL) - Grand total (city = NULL, station_code = NULL)

Distinguish subtotal NULLs from real NULLs using GROUPING():

SELECT CASE WHEN GROUPING(city) = 1 THEN 'All Cities' ELSE city END AS city,
       CASE WHEN GROUPING(station_code) = 1 THEN 'Subtotal' ELSE station_code END AS station,
       SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;

GROUPING SETS — Custom Aggregation Levels

GROUPING SETS runs multiple GROUP BY combinations in one query — more flexible than ROLLUP.

SELECT city, payment_method, SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY GROUPING SETS (
    (city, payment_method),   -- by city + payment method
    (city),                   -- by city only
    ()                        -- grand total
);

Equivalent to three separate GROUP BY queries combined with UNION ALL — but in one pass.

CUBE — All Combinations

CUBE generates every possible combination of the GROUP BY columns — useful for multidimensional reporting.

SELECT city, payment_method, YEAR(paid_time) AS year,
       SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY CUBE (city, payment_method, YEAR(paid_time));

With 3 columns, CUBE produces 2³ = 8 grouping combinations. Use GROUPING() to label the NULL rows.


Advanced String Functions

GROUP_CONCAT — Aggregate Strings

-- list all license plates per station as a comma-separated string
SELECT station_code,
       GROUP_CONCAT(DISTINCT license_plate ORDER BY license_plate SEPARATOR ', ') AS plates
FROM parking_history
GROUP BY station_code;

GROUP_CONCAT collapses multiple rows into one string within a group. SEPARATOR sets the delimiter (default is comma).

-- limit length to avoid truncation on large groups
SET SESSION group_concat_max_len = 100000;

REGEXP — Pattern Matching

-- license plates that start with a letter followed by 2 digits
SELECT license_plate
FROM parking_history
WHERE license_plate REGEXP '^[A-Z][0-9]{2}';

-- plates that contain exactly 3 uppercase letters followed by a dash and 4 digits
WHERE license_plate REGEXP '^[A-Z]{3}-[0-9]{4}$'
Pattern Matches
^ Start of string
$ End of string
[A-Z] Any uppercase letter
[0-9] Any digit
{n} Exactly n repetitions
. Any single character
* Zero or more of the preceding
+ One or more of the preceding

REPLACE, LOCATE, LPAD, RPAD

SELECT REPLACE(license_plate, '-', '')           AS plate_no_dash,   -- remove dashes
       LOCATE('-', license_plate)                AS dash_position,    -- position of '-'
       LPAD(station_code, 6, '0')               AS padded_code,      -- '12' → '000012'
       RPAD(station_name, 20, ' ')              AS padded_name       -- right-pad with spaces
FROM management_table;

Advanced Date Functions

SELECT entry_time,
       TIMESTAMPDIFF(MINUTE, entry_time, exit_time) AS duration_mins,   -- precise diff
       TIMESTAMPDIFF(MONTH,  '2024-01-01', entry_time) AS months_since,
       EXTRACT(QUARTER FROM entry_time)               AS quarter,        -- 1–4
       EXTRACT(WEEK    FROM entry_time)               AS week_of_year,
       LAST_DAY(entry_time)                           AS last_of_month,  -- last day of month
       DATE_FORMAT(entry_time, '%W')                  AS weekday_name,   -- 'Monday', etc.
       DAYOFWEEK(entry_time)                          AS day_num         -- 1=Sun, 7=Sat
FROM parking_history;

First Day of the Month / Year

-- first day of the month
DATE_FORMAT(entry_time, '%Y-%m-01')

-- first day of the year
DATE_FORMAT(entry_time, '%Y-01-01')

Quarter Boundaries

SELECT CONCAT(YEAR(paid_time), ' Q', QUARTER(paid_time)) AS quarter,
       SUM(amount_received)                               AS revenue
FROM payment_history
GROUP BY YEAR(paid_time), QUARTER(paid_time)
ORDER BY 1;

Views

A view is a saved SELECT query that behaves like a table. Use it to encapsulate a complex query used repeatedly, or to expose a simplified interface to a table.

CREATE VIEW station_monthly_revenue AS
SELECT mt.station_code,
       mt.station_name,
       mt.city,
       DATE_FORMAT(pay.paid_time, '%Y-%m') AS month,
       SUM(pay.amount_received)            AS revenue,
       COUNT(*)                            AS transactions
FROM payment_history pay
JOIN management_table mt ON pay.station_code = mt.station_code
GROUP BY mt.station_code, mt.station_name, mt.city,
         DATE_FORMAT(pay.paid_time, '%Y-%m');

Query it like a table:

SELECT * FROM station_monthly_revenue WHERE city = 'Taipei' AND month = '2025-06';

Update or drop:

CREATE OR REPLACE VIEW station_monthly_revenue AS <new_query>;
DROP VIEW station_monthly_revenue;

Views do not store data — they run the underlying query each time. For performance on large datasets, use a materialized view or a physical summary table instead.


Temporary Tables

Temporary tables exist only for the current session and are dropped automatically when the connection closes. Use them when a CTE would be referenced many times, or when you need to build a result incrementally.

-- create a temp table
CREATE TEMPORARY TABLE tmp_station_stats AS
SELECT station_code,
       COUNT(*)           AS total_visits,
       SUM(amount_received) AS total_revenue
FROM payment_history
GROUP BY station_code;

-- use it like a regular table
SELECT mt.station_name, t.total_revenue
FROM tmp_station_stats t
JOIN management_table mt ON t.station_code = mt.station_code
WHERE t.total_visits > 500;

-- drop explicitly when done (optional — auto-dropped at session end)
DROP TEMPORARY TABLE tmp_station_stats;
CTE Temp Table
Scope Single query Entire session
Re-use Same query only Multiple queries
Indexable No Yes
Best for Readable single queries Multi-step pipelines

Query Performance & EXPLAIN

EXPLAIN shows how MySQL executes a query — which indexes it uses, how many rows it scans, and where the bottlenecks are.

EXPLAIN SELECT ph.license_plate, pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
WHERE ph.entry_time >= '2025-01-01';

Key Columns in EXPLAIN Output

Column What to look for
type const / ref / range = fast; ALL = full table scan (slow)
key Which index is used. NULL = no index used
rows Estimated rows scanned — lower is better
Extra Using filesort or Using temporary = potential bottleneck

Use EXPLAIN ANALYZE for Actual Runtimes (MySQL 8+)

EXPLAIN ANALYZE SELECT ...;

Shows actual rows, loops, and time per step — more reliable than the estimated rows column.

Index Basics

-- create an index on a frequently filtered column
CREATE INDEX idx_entry_time ON parking_history(entry_time);

-- composite index: useful when filtering by both columns together
CREATE INDEX idx_station_time ON parking_history(station_code, entry_time);

Index guidelines for analysts: - Index columns used in WHERE, JOIN ON, and ORDER BY. - A composite index (a, b) helps queries that filter on a alone or a AND b, but not b alone. - Too many indexes slow down writes — only add what you actually use.