SQL Advanced Notebook
This notebook picks up where SQL Notes left off. The basics (SELECT, JOINs, GROUP BY, CTEs, window functions) are assumed knowledge. Every section here adds a technique that unlocks a new class of problem.
All examples use the same parking system tables: parking_history, payment_history, management_table, and charging_records.
Window Function Frame Clause
The basic window functions in SQL Notes (ROW_NUMBER, LAG, SUM OVER) use a default frame. The frame clause lets you control exactly which rows are included in each window calculation.
function() OVER (
PARTITION BY <col>
ORDER BY <col>
ROWS BETWEEN <start> AND <end>
)
Frame Boundaries Reference
| Frame | Meaning |
|---|---|
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
First row of partition up to current → running total |
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW |
3-row trailing window → moving average |
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING |
Centered 3-row window |
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
Entire partition |
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW |
Last 7 days by value (not row count) |
Examples
-- 3-month trailing moving average of revenue
SELECT station_code,
DATE_FORMAT(paid_time, '%Y-%m') AS month,
SUM(amount_received) AS revenue,
ROUND(AVG(SUM(amount_received)) OVER (
PARTITION BY station_code
ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 0) AS moving_avg_3m
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');
-- 7-day rolling sum (by date value, not row position)
SELECT DATE(paid_time) AS day,
SUM(amount_received) AS daily_rev,
SUM(SUM(amount_received)) OVER (
ORDER BY DATE(paid_time)
RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
) AS rolling_7d
FROM payment_history
GROUP BY DATE(paid_time);
ROWS BETWEEN counts rows; RANGE BETWEEN uses the ORDER BY column's actual value. Use RANGE BETWEEN INTERVAL for calendar-based windows where data may have gaps.
Without a frame clause, SUM() OVER (ORDER BY ...) defaults to UNBOUNDED PRECEDING AND CURRENT ROW (running total). Always write it explicitly to make the intent clear.
Advanced Window Functions
NTILE — Divide into Buckets
-- assign each station to a revenue quartile
SELECT station_code,
SUM(amount_received) AS revenue,
NTILE(4) OVER (ORDER BY SUM(amount_received) DESC) AS quartile
FROM payment_history
GROUP BY station_code;
NTILE(n) splits rows into n equal-sized buckets. Quartile 1 = top 25%. Useful for tiering customers or stations without knowing the exact percentile thresholds in advance.
PERCENT_RANK — Relative Position (0 to 1)
SELECT station_code,
SUM(amount_received) AS revenue,
ROUND(PERCENT_RANK() OVER (ORDER BY SUM(amount_received)), 3) AS pct_rank
FROM payment_history
GROUP BY station_code;
PERCENT_RANK() returns a value between 0 and 1: how many rows fall below this row, as a fraction. A station at 0.80 outperforms 80% of all stations.
CUME_DIST — Cumulative Distribution
SELECT station_code,
SUM(amount_received) AS revenue,
ROUND(CUME_DIST() OVER (ORDER BY SUM(amount_received)), 3) AS cume_dist
FROM payment_history
GROUP BY station_code;
CUME_DIST() returns the fraction of rows at or below the current row. Unlike PERCENT_RANK, it never returns 0 — the lowest row still has a value > 0.
| Function | Range | Lowest row | Highest row |
|---|---|---|---|
PERCENT_RANK |
0 to 1 | 0.000 | 1.000 |
CUME_DIST |
0+ to 1 | 1/n | 1.000 |
NTILE(4) |
1 to 4 | 4 (bottom bucket) | 1 (top bucket) |
FIRST_VALUE / LAST_VALUE
-- compare each station's monthly revenue to the best month in its history
SELECT station_code,
DATE_FORMAT(paid_time, '%Y-%m') AS month,
SUM(amount_received) AS revenue,
MAX(SUM(amount_received)) OVER (
PARTITION BY station_code
ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS all_time_max
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');
FIRST_VALUE and LAST_VALUE return the first or last value within a window frame. LAST_VALUE requires UNBOUNDED FOLLOWING in the frame — without it, the frame ends at the current row by default.
Recursive CTEs
A recursive CTE calls itself to process hierarchical or sequential data — most commonly to generate a date series without a stored calendar table.
WITH RECURSIVE <cte_name> AS (
-- anchor: the starting row(s)
<initial_select>
UNION ALL
-- recursive part: join back to cte_name to add the next row
<recursive_select>
)
SELECT * FROM <cte_name>;
Generate a Date Series
WITH RECURSIVE date_series AS (
SELECT DATE('2025-01-01') AS dt
UNION ALL
SELECT DATE_ADD(dt, INTERVAL 1 DAY)
FROM date_series
WHERE dt < '2025-12-31'
)
SELECT * FROM date_series;
Fill Gaps in Daily Revenue
WITH RECURSIVE date_series AS (
SELECT DATE('2025-01-01') AS dt
UNION ALL
SELECT DATE_ADD(dt, INTERVAL 1 DAY)
FROM date_series
WHERE dt < '2025-12-31'
),
daily_revenue AS (
SELECT DATE(paid_time) AS day, SUM(amount_received) AS revenue
FROM payment_history
WHERE paid_time BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY DATE(paid_time)
)
SELECT ds.dt AS day,
COALESCE(dr.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN daily_revenue dr ON ds.dt = dr.day
ORDER BY ds.dt;
Without a date series, GROUP BY DATE(paid_time) only returns days that have records. The LEFT JOIN fills the gaps with 0.
Traverse a Hierarchy (Org Chart / Category Tree)
WITH RECURSIVE org AS (
-- anchor: top-level managers (no manager_id)
SELECT employee_id, name, manager_id, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- recursive: join each employee to their manager row in org
SELECT e.employee_id, e.name, e.manager_id, org.level + 1
FROM employees e
JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org ORDER BY level, name;
level tracks depth — level 1 = CEO, level 2 = direct reports, and so on. Add WHERE level <= 3 to limit depth.
Advanced Grouping
ROLLUP — Subtotals and Grand Total
WITH ROLLUP adds summary rows automatically at each level of the GROUP BY hierarchy.
SELECT city,
station_code,
SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;
Result includes:
- Revenue per city + station_code (normal rows)
- Revenue per city subtotal (station_code = NULL)
- Grand total (city = NULL, station_code = NULL)
Distinguish subtotal NULLs from real NULLs using GROUPING():
SELECT CASE WHEN GROUPING(city) = 1 THEN 'All Cities' ELSE city END AS city,
CASE WHEN GROUPING(station_code) = 1 THEN 'Subtotal' ELSE station_code END AS station,
SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;
GROUPING SETS — Custom Aggregation Levels
GROUPING SETS runs multiple GROUP BY combinations in one query — more flexible than ROLLUP.
SELECT city, payment_method, SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY GROUPING SETS (
(city, payment_method), -- by city + payment method
(city), -- by city only
() -- grand total
);
Equivalent to three separate GROUP BY queries combined with UNION ALL — but in one pass.
CUBE — All Combinations
CUBE generates every possible combination of the GROUP BY columns — useful for multidimensional reporting.
SELECT city, payment_method, YEAR(paid_time) AS year,
SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY CUBE (city, payment_method, YEAR(paid_time));
With 3 columns, CUBE produces 2³ = 8 grouping combinations. Use GROUPING() to label the NULL rows.
Advanced String Functions
GROUP_CONCAT — Aggregate Strings
-- list all license plates per station as a comma-separated string
SELECT station_code,
GROUP_CONCAT(DISTINCT license_plate ORDER BY license_plate SEPARATOR ', ') AS plates
FROM parking_history
GROUP BY station_code;
GROUP_CONCAT collapses multiple rows into one string within a group. SEPARATOR sets the delimiter (default is comma).
-- limit length to avoid truncation on large groups
SET SESSION group_concat_max_len = 100000;
REGEXP — Pattern Matching
-- license plates that start with a letter followed by 2 digits
SELECT license_plate
FROM parking_history
WHERE license_plate REGEXP '^[A-Z][0-9]{2}';
-- plates that contain exactly 3 uppercase letters followed by a dash and 4 digits
WHERE license_plate REGEXP '^[A-Z]{3}-[0-9]{4}$'
| Pattern | Matches |
|---|---|
^ |
Start of string |
$ |
End of string |
[A-Z] |
Any uppercase letter |
[0-9] |
Any digit |
{n} |
Exactly n repetitions |
. |
Any single character |
* |
Zero or more of the preceding |
+ |
One or more of the preceding |
REPLACE, LOCATE, LPAD, RPAD
SELECT REPLACE(license_plate, '-', '') AS plate_no_dash, -- remove dashes
LOCATE('-', license_plate) AS dash_position, -- position of '-'
LPAD(station_code, 6, '0') AS padded_code, -- '12' → '000012'
RPAD(station_name, 20, ' ') AS padded_name -- right-pad with spaces
FROM management_table;
Advanced Date Functions
SELECT entry_time,
TIMESTAMPDIFF(MINUTE, entry_time, exit_time) AS duration_mins, -- precise diff
TIMESTAMPDIFF(MONTH, '2024-01-01', entry_time) AS months_since,
EXTRACT(QUARTER FROM entry_time) AS quarter, -- 1–4
EXTRACT(WEEK FROM entry_time) AS week_of_year,
LAST_DAY(entry_time) AS last_of_month, -- last day of month
DATE_FORMAT(entry_time, '%W') AS weekday_name, -- 'Monday', etc.
DAYOFWEEK(entry_time) AS day_num -- 1=Sun, 7=Sat
FROM parking_history;
First Day of the Month / Year
-- first day of the month
DATE_FORMAT(entry_time, '%Y-%m-01')
-- first day of the year
DATE_FORMAT(entry_time, '%Y-01-01')
Quarter Boundaries
SELECT CONCAT(YEAR(paid_time), ' Q', QUARTER(paid_time)) AS quarter,
SUM(amount_received) AS revenue
FROM payment_history
GROUP BY YEAR(paid_time), QUARTER(paid_time)
ORDER BY 1;
Views
A view is a saved SELECT query that behaves like a table. Use it to encapsulate a complex query used repeatedly, or to expose a simplified interface to a table.
CREATE VIEW station_monthly_revenue AS
SELECT mt.station_code,
mt.station_name,
mt.city,
DATE_FORMAT(pay.paid_time, '%Y-%m') AS month,
SUM(pay.amount_received) AS revenue,
COUNT(*) AS transactions
FROM payment_history pay
JOIN management_table mt ON pay.station_code = mt.station_code
GROUP BY mt.station_code, mt.station_name, mt.city,
DATE_FORMAT(pay.paid_time, '%Y-%m');
Query it like a table:
SELECT * FROM station_monthly_revenue WHERE city = 'Taipei' AND month = '2025-06';
Update or drop:
CREATE OR REPLACE VIEW station_monthly_revenue AS <new_query>;
DROP VIEW station_monthly_revenue;
Views do not store data — they run the underlying query each time. For performance on large datasets, use a materialized view or a physical summary table instead.
Temporary Tables
Temporary tables exist only for the current session and are dropped automatically when the connection closes. Use them when a CTE would be referenced many times, or when you need to build a result incrementally.
-- create a temp table
CREATE TEMPORARY TABLE tmp_station_stats AS
SELECT station_code,
COUNT(*) AS total_visits,
SUM(amount_received) AS total_revenue
FROM payment_history
GROUP BY station_code;
-- use it like a regular table
SELECT mt.station_name, t.total_revenue
FROM tmp_station_stats t
JOIN management_table mt ON t.station_code = mt.station_code
WHERE t.total_visits > 500;
-- drop explicitly when done (optional — auto-dropped at session end)
DROP TEMPORARY TABLE tmp_station_stats;
| CTE | Temp Table | |
|---|---|---|
| Scope | Single query | Entire session |
| Re-use | Same query only | Multiple queries |
| Indexable | No | Yes |
| Best for | Readable single queries | Multi-step pipelines |
Query Performance & EXPLAIN
EXPLAIN shows how MySQL executes a query — which indexes it uses, how many rows it scans, and where the bottlenecks are.
EXPLAIN SELECT ph.license_plate, pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
WHERE ph.entry_time >= '2025-01-01';
Key Columns in EXPLAIN Output
| Column | What to look for |
|---|---|
type |
const / ref / range = fast; ALL = full table scan (slow) |
key |
Which index is used. NULL = no index used |
rows |
Estimated rows scanned — lower is better |
Extra |
Using filesort or Using temporary = potential bottleneck |
Use EXPLAIN ANALYZE for Actual Runtimes (MySQL 8+)
EXPLAIN ANALYZE SELECT ...;
Shows actual rows, loops, and time per step — more reliable than the estimated rows column.
Index Basics
-- create an index on a frequently filtered column
CREATE INDEX idx_entry_time ON parking_history(entry_time);
-- composite index: useful when filtering by both columns together
CREATE INDEX idx_station_time ON parking_history(station_code, entry_time);
Index guidelines for analysts:
- Index columns used in WHERE, JOIN ON, and ORDER BY.
- A composite index (a, b) helps queries that filter on a alone or a AND b, but not b alone.
- Too many indexes slow down writes — only add what you actually use.