SQL Advanced Notebook
This notebook picks up where SQL Notes left off. The basics (SELECT, JOINs, GROUP BY, CTEs, window functions) are assumed. Each section here adds a technique that solves a new kind of problem.
All examples use the same parking system tables: parking_history, payment_history, management_table, and charging_records.
Window Function Frame Clause
The basic window functions in SQL Notes (ROW_NUMBER, LAG, SUM OVER) use a default frame. The frame clause lets you control exactly which rows go into each window calculation.
function() OVER (
PARTITION BY <col>
ORDER BY <col>
ROWS BETWEEN <start> AND <end>
)
Frame Boundaries Reference
| Frame | Meaning |
|---|---|
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
First row of the partition up to the current row → running total |
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW |
3-row trailing window → moving average |
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING |
Centered 3-row window |
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
The whole partition |
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW |
Last 7 days by value (not row count) |
Examples
-- 3-month trailing moving average of revenue
SELECT station_code,
DATE_FORMAT(paid_time, '%Y-%m') AS month,
SUM(amount_received) AS revenue,
ROUND(AVG(SUM(amount_received)) OVER (
PARTITION BY station_code
ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 0) AS moving_avg_3m
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');
-- 7-day rolling sum (by date value, not by row position)
SELECT DATE(paid_time) AS day,
SUM(amount_received) AS daily_rev,
SUM(SUM(amount_received)) OVER (
ORDER BY DATE(paid_time)
RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
) AS rolling_7d
FROM payment_history
GROUP BY DATE(paid_time);
ROWS BETWEEN counts rows. RANGE BETWEEN uses the actual value of the ORDER BY column. Use RANGE BETWEEN INTERVAL for calendar-based windows when the data may have gaps.
Without a frame clause, SUM() OVER (ORDER BY ...) defaults to UNBOUNDED PRECEDING AND CURRENT ROW (running total). Always write the frame on purpose so the intent is clear.
Advanced Window Functions
NTILE — Divide into Buckets
-- put each station into a revenue quartile
SELECT station_code,
SUM(amount_received) AS revenue,
NTILE(4) OVER (ORDER BY SUM(amount_received) DESC) AS quartile
FROM payment_history
GROUP BY station_code;
NTILE(n) cuts the rows into n equal-sized buckets. Quartile 1 = top 25%. Useful for tiering customers or stations when you don't know the exact percentile cut-offs in advance.
PERCENT_RANK — Relative Position (0 to 1)
SELECT station_code,
SUM(amount_received) AS revenue,
ROUND(PERCENT_RANK() OVER (ORDER BY SUM(amount_received)), 3) AS pct_rank
FROM payment_history
GROUP BY station_code;
PERCENT_RANK() returns a value between 0 and 1: how many rows are below this row, as a fraction. A station at 0.80 beats 80% of all stations.
CUME_DIST — Cumulative Distribution
SELECT station_code,
SUM(amount_received) AS revenue,
ROUND(CUME_DIST() OVER (ORDER BY SUM(amount_received)), 3) AS cume_dist
FROM payment_history
GROUP BY station_code;
CUME_DIST() returns the share of rows at or below the current row. Unlike PERCENT_RANK, it never returns 0 — even the lowest row has a value greater than 0.
| Function | Range | Lowest row | Highest row |
|---|---|---|---|
PERCENT_RANK |
0 to 1 | 0.000 | 1.000 |
CUME_DIST |
0+ to 1 | 1/n | 1.000 |
NTILE(4) |
1 to 4 | 4 (bottom bucket) | 1 (top bucket) |
FIRST_VALUE / LAST_VALUE
-- compare each station's monthly revenue to the best month in its history
SELECT station_code,
DATE_FORMAT(paid_time, '%Y-%m') AS month,
SUM(amount_received) AS revenue,
MAX(SUM(amount_received)) OVER (
PARTITION BY station_code
ORDER BY DATE_FORMAT(paid_time, '%Y-%m')
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS all_time_max
FROM payment_history
GROUP BY station_code, DATE_FORMAT(paid_time, '%Y-%m');
FIRST_VALUE and LAST_VALUE return the first or last value in a window frame. LAST_VALUE needs UNBOUNDED FOLLOWING in the frame. Without it, the frame ends at the current row by default.
Recursive CTEs
A recursive CTE calls itself to handle hierarchical or sequential data. The most common use is to build a date series without a stored calendar table.
WITH RECURSIVE <cte_name> AS (
-- anchor: the starting row(s)
<initial_select>
UNION ALL
-- recursive part: join back to cte_name to add the next row
<recursive_select>
)
SELECT * FROM <cte_name>;
Generate a Date Series
WITH RECURSIVE date_series AS (
SELECT DATE('2025-01-01') AS dt
UNION ALL
SELECT DATE_ADD(dt, INTERVAL 1 DAY)
FROM date_series
WHERE dt < '2025-12-31'
)
SELECT * FROM date_series;
Fill Gaps in Daily Revenue
WITH RECURSIVE date_series AS (
SELECT DATE('2025-01-01') AS dt
UNION ALL
SELECT DATE_ADD(dt, INTERVAL 1 DAY)
FROM date_series
WHERE dt < '2025-12-31'
),
daily_revenue AS (
SELECT DATE(paid_time) AS day, SUM(amount_received) AS revenue
FROM payment_history
WHERE paid_time BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY DATE(paid_time)
)
SELECT ds.dt AS day,
COALESCE(dr.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN daily_revenue dr ON ds.dt = dr.day
ORDER BY ds.dt;
Without a date series, GROUP BY DATE(paid_time) only returns days that have records. The LEFT JOIN fills the gaps with 0.
Walk a Hierarchy (Org Chart / Category Tree)
WITH RECURSIVE org AS (
-- anchor: top-level managers (no manager_id)
SELECT employee_id, name, manager_id, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- recursive: join each employee to their manager row in org
SELECT e.employee_id, e.name, e.manager_id, org.level + 1
FROM employees e
JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org ORDER BY level, name;
level tracks how deep you are — level 1 = CEO, level 2 = direct reports, and so on. Add WHERE level <= 3 to limit the depth.
Advanced Grouping
ROLLUP — Subtotals and Grand Total
WITH ROLLUP adds summary rows on its own at each level of the GROUP BY hierarchy.
SELECT city,
station_code,
SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;
The result includes:
- Revenue per city + station_code (normal rows)
- Revenue per city subtotal (station_code = NULL)
- Grand total (city = NULL, station_code = NULL)
Use GROUPING() to tell subtotal NULLs apart from real NULLs:
SELECT CASE WHEN GROUPING(city) = 1 THEN 'All Cities' ELSE city END AS city,
CASE WHEN GROUPING(station_code) = 1 THEN 'Subtotal' ELSE station_code END AS station,
SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY city, station_code WITH ROLLUP;
GROUPING SETS — Custom Aggregation Levels
GROUPING SETS runs several GROUP BY combinations in one query — more flexible than ROLLUP.
SELECT city, payment_method, SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY GROUPING SETS (
(city, payment_method), -- by city + payment method
(city), -- by city only
() -- grand total
);
Same as three separate GROUP BY queries joined with UNION ALL — but in one pass.
CUBE — All Combinations
CUBE builds every possible combination of the GROUP BY columns. Useful for multi-dimension reporting.
SELECT city, payment_method, YEAR(paid_time) AS year,
SUM(amount_received) AS revenue
FROM payment_history py
JOIN management_table mt ON py.station_code = mt.station_code
GROUP BY CUBE (city, payment_method, YEAR(paid_time));
With 3 columns, CUBE produces 2³ = 8 grouping combinations. Use GROUPING() to label the NULL rows.
Advanced String Functions
GROUP_CONCAT — Aggregate Strings
-- list every license plate per station as a comma-separated string
SELECT station_code,
GROUP_CONCAT(DISTINCT license_plate ORDER BY license_plate SEPARATOR ', ') AS plates
FROM parking_history
GROUP BY station_code;
GROUP_CONCAT collapses many rows into one string within a group. SEPARATOR sets the delimiter (default is comma).
-- raise the length limit so big groups are not cut off
SET SESSION group_concat_max_len = 100000;
REGEXP — Pattern Matching
-- license plates that start with a letter and 2 digits
SELECT license_plate
FROM parking_history
WHERE license_plate REGEXP '^[A-Z][0-9]{2}';
-- plates that have exactly 3 uppercase letters, a dash, and 4 digits
WHERE license_plate REGEXP '^[A-Z]{3}-[0-9]{4}$'
| Pattern | Matches |
|---|---|
^ |
Start of the string |
$ |
End of the string |
[A-Z] |
Any uppercase letter |
[0-9] |
Any digit |
{n} |
Exactly n repeats |
. |
Any single character |
* |
Zero or more of the previous character |
+ |
One or more of the previous character |
REPLACE, LOCATE, LPAD, RPAD
SELECT REPLACE(license_plate, '-', '') AS plate_no_dash, -- remove dashes
LOCATE('-', license_plate) AS dash_position, -- position of '-'
LPAD(station_code, 6, '0') AS padded_code, -- '12' → '000012'
RPAD(station_name, 20, ' ') AS padded_name -- right-pad with spaces
FROM management_table;
Advanced Date Functions
SELECT entry_time,
TIMESTAMPDIFF(MINUTE, entry_time, exit_time) AS duration_mins, -- exact difference
TIMESTAMPDIFF(MONTH, '2024-01-01', entry_time) AS months_since,
EXTRACT(QUARTER FROM entry_time) AS quarter, -- 1–4
EXTRACT(WEEK FROM entry_time) AS week_of_year,
LAST_DAY(entry_time) AS last_of_month, -- last day of the month
DATE_FORMAT(entry_time, '%W') AS weekday_name, -- 'Monday', etc.
DAYOFWEEK(entry_time) AS day_num -- 1=Sun, 7=Sat
FROM parking_history;
First Day of the Month / Year
-- first day of the month
DATE_FORMAT(entry_time, '%Y-%m-01')
-- first day of the year
DATE_FORMAT(entry_time, '%Y-01-01')
Quarter Boundaries
SELECT CONCAT(YEAR(paid_time), ' Q', QUARTER(paid_time)) AS quarter,
SUM(amount_received) AS revenue
FROM payment_history
GROUP BY YEAR(paid_time), QUARTER(paid_time)
ORDER BY 1;
Views
A view is a saved SELECT query that acts like a table. Use it to wrap a complex query you run often, or to give other people a simpler view of a table.
CREATE VIEW station_monthly_revenue AS
SELECT mt.station_code,
mt.station_name,
mt.city,
DATE_FORMAT(pay.paid_time, '%Y-%m') AS month,
SUM(pay.amount_received) AS revenue,
COUNT(*) AS transactions
FROM payment_history pay
JOIN management_table mt ON pay.station_code = mt.station_code
GROUP BY mt.station_code, mt.station_name, mt.city,
DATE_FORMAT(pay.paid_time, '%Y-%m');
Query it just like a table:
SELECT * FROM station_monthly_revenue WHERE city = 'Taipei' AND month = '2025-06';
Update or drop:
CREATE OR REPLACE VIEW station_monthly_revenue AS <new_query>;
DROP VIEW station_monthly_revenue;
Views do not store data. They run the underlying query each time. For speed on large datasets, use a materialized view or a real summary table instead.
Temporary Tables
Temporary tables exist only for the current session and are dropped on their own when the connection closes. Use them when a CTE would be referenced many times, or when you need to build a result step by step.
-- create a temp table
CREATE TEMPORARY TABLE tmp_station_stats AS
SELECT station_code,
COUNT(*) AS total_visits,
SUM(amount_received) AS total_revenue
FROM payment_history
GROUP BY station_code;
-- use it like a normal table
SELECT mt.station_name, t.total_revenue
FROM tmp_station_stats t
JOIN management_table mt ON t.station_code = mt.station_code
WHERE t.total_visits > 500;
-- drop it on purpose when you are done (optional — auto-dropped at session end)
DROP TEMPORARY TABLE tmp_station_stats;
| CTE | Temp Table | |
|---|---|---|
| Scope | One query | The whole session |
| Reuse | Same query only | Several queries |
| Indexable | No | Yes |
| Best for | Readable single queries | Multi-step pipelines |
Query Performance & EXPLAIN
EXPLAIN shows how MySQL runs a query — which indexes it uses, how many rows it scans, and where the slow points are.
EXPLAIN SELECT ph.license_plate, pay.amount_received
FROM parking_history ph
JOIN payment_history pay ON ph.parking_id = pay.parking_id
WHERE ph.entry_time >= '2025-01-01';
Key Columns in EXPLAIN Output
| Column | What to look for |
|---|---|
type |
const / ref / range = fast; ALL = full table scan (slow) |
key |
Which index is used. NULL = no index used |
rows |
Estimated rows scanned — lower is better |
Extra |
Using filesort or Using temporary = a possible bottleneck |
Use EXPLAIN ANALYZE for Real Runtimes (MySQL 8+)
EXPLAIN ANALYZE SELECT ...;
Shows real rows, loops, and time per step — more reliable than the estimated rows column.
Index Basics
-- create an index on a column you filter on often
CREATE INDEX idx_entry_time ON parking_history(entry_time);
-- composite index: useful when you filter on both columns together
CREATE INDEX idx_station_time ON parking_history(station_code, entry_time);
Index tips for analysts:
- Index columns used in WHERE, JOIN ON, and ORDER BY.
- A composite index (a, b) helps queries that filter on a alone or on a AND b, but not on b alone.
- Too many indexes slow down writes — only add the ones you actually use.