Time Series Notebook

Most business data is a time series: revenue per day, sessions per week, kWh per month. This notebook covers the DA layer of time series work — describing trend and seasonality, building honest comparisons, simple forecasting, and flagging anomalies. It stops before heavy modeling (ARIMA, Prophet); for most reporting questions you will not need them.

The Pandas groundwork (resample, rolling, shift) is introduced in Pandas Notebook; this notebook builds on it. For whether a change is signal or noise, see Statistics Notebook.

All examples use daily parking revenue built from payment_df.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing

Prepare the Series

daily = (
    payment_df
    .set_index('paid_time')
    .resample('D')['amount']
    .sum()
)

Two preparation rules before any analysis:

1. A sorted DatetimeIndex. Resample, rolling windows, and decomposition all assume it. set_index('paid_time') plus resample handles it, provided paid_time is already a datetime column (convert with pd.to_datetime if not). If you built the series another way, run daily = daily.sort_index() first.

2. An explicit decision about missing days. resample('D').sum() fills calendar gaps with 0 — correct for revenue (no payments = zero revenue). But for an average-type metric, a missing day is "no data", not "zero":

avg_amount = payment_df.set_index('paid_time').resample('D')['amount'].mean()
# days with no rows are NaN — keep them NaN, or fill deliberately:
avg_amount = avg_amount.interpolate()       # only if a smooth estimate is acceptable

Filling "no data" with 0 silently drags down every rolling average that touches it — the most common silent bug in time series reporting.
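To see how much one silent zero costs, here is a minimal sketch on made-up data. The flat 10.0 average ticket, the dates, and the choice of min_periods=6 are illustrative assumptions, not values from payment_df.

```python
import numpy as np
import pandas as pd

# Hypothetical average-ticket series: a flat 10.0 per day for two weeks
idx = pd.date_range("2024-01-01", periods=14, freq="D")
avg_ticket = pd.Series(10.0, index=idx)
avg_ticket.iloc[9] = np.nan            # one day with no payments at all

zero_filled = avg_ticket.fillna(0)     # the silent bug: "no data" becomes 0

# 7-day rolling mean on both versions; min_periods=6 lets the honest
# version tolerate one missing day instead of returning NaN
ma_zero = zero_filled.rolling(7).mean()
ma_nan = avg_ticket.rolling(7, min_periods=6).mean()

print(f"zero-filled: {ma_zero.iloc[-1]:.2f}   NaN kept: {ma_nan.iloc[-1]:.2f}")
```

Nothing about the underlying metric changed, yet the zero-filled average drops for every window that contains the missing day.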

Look at It First

fig, ax = plt.subplots(figsize=(12, 4))
daily.plot(ax=ax)
ax.set_title('Daily Revenue')
plt.tight_layout()
plt.show()

Before computing anything, read the plot for the four ingredients: trend (long-term direction), seasonality (repeating weekly/yearly pattern), events (one-off spikes and dips), and noise. Everything below is a tool for separating them.

Rolling Statistics

daily_df = daily.to_frame('revenue')
daily_df['ma_7']  = daily_df['revenue'].rolling(7).mean()    # weekly smoothing
daily_df['ma_28'] = daily_df['revenue'].rolling(28).mean()   # monthly trend

fig, ax = plt.subplots(figsize=(12, 4))
daily_df['revenue'].plot(ax=ax, alpha=0.4, label='daily')
daily_df['ma_7'].plot(ax=ax, label='7-day MA')
daily_df['ma_28'].plot(ax=ax, label='28-day MA')
ax.legend()
plt.tight_layout()
plt.show()

A 7-day window is the workhorse for daily business data: it contains every weekday exactly once, so day-of-week seasonality cancels out and what remains is trend. Plot the smoothed line with the raw one, never instead of it — see Visualization Selection Guide.

rolling(7) needs 7 values before it produces output (the first 6 are NaN). Add min_periods=1 only when a partial-window average is genuinely acceptable.
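The cancellation claim is easy to check on a synthetic series that contains nothing but a weekly pattern. The weekday values below are invented for illustration; the 5-day window is included only as a contrast.

```python
import numpy as np
import pandas as pd

# Eight identical weeks starting on a Monday: pure seasonality, no trend, no noise
idx = pd.date_range("2024-01-01", periods=56, freq="D")
weekday_effect = np.array([100, 100, 100, 100, 120, 160, 140], dtype=float)  # Mon..Sun
revenue = pd.Series(np.tile(weekday_effect, 8), index=idx)

ma_7 = revenue.rolling(7).mean()   # every window holds each weekday exactly once
ma_5 = revenue.rolling(5).mean()   # window mix changes from day to day

print("NaN at start of ma_7:", ma_7.isna().sum())
print("spread of ma_7:", ma_7.dropna().std())
print("spread of ma_5:", ma_5.dropna().std())
```

The 7-day line is flat because the weekly pattern averages out; the 5-day line still wobbles with the weekday mix, which would be misread as movement in the trend.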

Seasonality Profiles

Day-of-week and month profiles answer "what is normal?" — the baseline every comparison needs.

# day-of-week profile
dow = daily.groupby(daily.index.dayofweek).mean()
dow.index = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# month profile (needs 1+ full years; several are better, since a single year cannot separate seasonality from trend)
monthly_profile = daily.groupby(daily.index.month).mean()

Practical consequences: compare Monday to last Monday, not to Sunday; compare this June to last June, not to May. The same idea at month grain — comparing a 30-day month to a 31-day month favors the longer one; compare daily averages instead.

# month-over-month, fairly: average per day, not total
# ('ME' = month end, pandas 2.2+; use 'M' on older versions)
monthly_avg = daily.resample('ME').mean()
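A quick sketch of the calendar effect, assuming a hypothetical series with exactly 1,000 of revenue every day so that any change between months is an artifact of month length. Grouping by to_period("M") is used here only to keep the example independent of the resample alias.

```python
import pandas as pd

# Hypothetical flat revenue across April (30 days) and May (31 days)
idx = pd.date_range("2024-04-01", "2024-05-31", freq="D")
flat_daily = pd.Series(1000.0, index=idx)

by_month = flat_daily.groupby(flat_daily.index.to_period("M"))
totals = by_month.sum()       # 30 days vs 31 days of the same daily revenue
per_day = by_month.mean()     # daily average, independent of month length

print(f"MoM change in totals:  {totals.pct_change().iloc[-1] * 100:+.1f}%")
print(f"MoM change in per-day: {per_day.pct_change().iloc[-1] * 100:+.1f}%")
```

The totals show "growth" of about 3% that is purely the extra day; the per-day averages show no change.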

Decomposition — Trend, Seasonality, Residual

seasonal_decompose splits the series into three parts you can inspect separately:

result = seasonal_decompose(daily, model='additive', period=7)   # 7 = weekly cycle
result.plot()
plt.tight_layout()
plt.show()

result.trend       # the smoothed direction
result.seasonal    # the repeating weekly pattern
result.resid       # what's left — events and noise live here
  • model='additive' when the seasonal swing is roughly constant in absolute terms; 'multiplicative' when the swing grows with the level (seasonality looks like ±20% rather than ±$5,000).
  • period is the cycle length in observations: 7 for daily data with a weekly cycle, 12 for monthly data with a yearly cycle.
  • The residual panel is the most useful one for a DA: a spike there is not explained by trend or the weekly pattern, so it is a candidate event worth investigating, which is exactly what Step 1 of Metrics & Diagnosis Guide needs.
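To make the three components less of a black box, the sketch below hand-rolls a simplified additive decomposition with pandas only. It follows the classical recipe (centered moving average, then per-weekday means of the detrended series) and is not claimed to match the statsmodels implementation step for step. The synthetic series and the injected event on day 40 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend + weekly pattern + noise + one injected event
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=84, freq="D")      # 12 full weeks
level = np.linspace(1000, 1200, 84)
weekly = np.tile([-50, -50, -50, -50, 0, 120, 80], 12)       # Mon..Sun
series = pd.Series(level + weekly + rng.normal(0, 10, 84), index=idx)
series.iloc[40] += 400                                       # the one-off event

# 1. Trend: centered 7-day moving average (NaN for 3 days at each end)
trend = series.rolling(7, center=True).mean()

# 2. Seasonal: average detrended value per weekday, centered to sum to zero
detrended = series - trend
dow = idx.dayofweek.to_numpy()
seasonal_by_dow = detrended.groupby(dow).mean()
seasonal_by_dow = seasonal_by_dow - seasonal_by_dow.mean()
seasonal = pd.Series(seasonal_by_dow.to_numpy()[dow], index=idx)

# 3. Residual: whatever trend and weekly pattern do not explain
resid = series - trend - seasonal

print("largest residual on:", resid.abs().idxmax().date())
```

The largest residual lands on the injected event, even though that day is also a Saturday, the normal weekly peak in this toy series.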

Lags and Autocorrelation

daily_df['lag_1'] = daily_df['revenue'].shift(1)     # yesterday
daily_df['lag_7'] = daily_df['revenue'].shift(7)     # same day last week
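daily_df['wow_pct'] = (daily_df['revenue'] / daily_df['lag_7'] - 1) * 100   # inf or NaN where lag_7 is 0

daily.autocorr(lag=7)     # correlation with 7 days ago; high relative to lags 1-6 = strong weekly cycle

Autocorrelation quantifies what the seasonality profile shows: autocorr(7) near 0.8 means same-day-last-week explains most of the variance in today's value (0.8² = 0.64). Check it against neighboring lags, because a strong trend raises autocorrelation at every lag, not just lag 7. That is also why lag-7 is the right baseline for daily comparisons, and it is the basis of the seasonal naive forecast below.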
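One way to read autocorrelation is as a profile across lags rather than a single number. The sketch below computes it for lags 1 to 14 on a synthetic series; the weekday values and noise level are assumptions chosen to mimic a strong weekly cycle.

```python
import numpy as np
import pandas as pd

# Synthetic series: fixed weekly pattern plus a little noise, 26 full weeks
rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=182, freq="D")
weekly = np.tile([100, 100, 100, 100, 120, 160, 140], 26)
revenue = pd.Series(weekly + rng.normal(0, 5, 182), index=idx)

# Autocorrelation at each lag from 1 to 14 days
acf = pd.Series({lag: revenue.autocorr(lag=lag) for lag in range(1, 15)})
print(acf.round(2))
```

Peaks at lags 7 and 14 with clearly lower values in between are the signature of a weekly cycle. A series dominated by trend would instead show high values at every lag.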

Simple Forecasting

Climb this ladder only as far as the question requires:

Level 1 — Seasonal Naive (the Baseline)

"Next Monday = last Monday."

forecast_snaive = daily.shift(7)

Trivial — and surprisingly hard to beat on stable weekly-cycle data. Every fancier model must outperform this to justify itself, same logic as the dummy baseline in Machine Learning Notebook.
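daily.shift(7) gives the in-sample version: each observed day predicted from the same weekday one week earlier. To forecast past the end of the data, the last observed week has to be repeated forward. The helper below is a minimal sketch of that; the name seasonal_naive and its parameters are hypothetical, and it assumes a gap-free daily index.

```python
import numpy as np
import pandas as pd

def seasonal_naive(history: pd.Series, horizon: int, period: int = 7) -> pd.Series:
    """Repeat the last `period` observed values forward for `horizon` days."""
    last_cycle = history.iloc[-period:].to_numpy()
    reps = -(-horizon // period)                       # ceiling division
    values = np.tile(last_cycle, reps)[:horizon]
    future_idx = pd.date_range(
        history.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D"
    )
    return pd.Series(values, index=future_idx)

# Toy history: the values 0..13 over two weeks, so the last week is 7..13
idx = pd.date_range("2024-01-01", periods=14, freq="D")
history = pd.Series(np.arange(14, dtype=float), index=idx)

fc = seasonal_naive(history, horizon=10)
print(fc)
```

For horizons longer than one week the same seven values repeat, which is what makes this a fair multi-step baseline: it never looks at data after the end of the history.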

Level 2 — Holt-Winters (Trend + Seasonality)

Exponential smoothing with explicit trend and seasonal components — the most model you usually need for short-horizon business forecasts:

train = daily[:-28]
test  = daily[-28:]                  # hold out the last 4 weeks

model = ExponentialSmoothing(
    train,
    trend='add',
    seasonal='add',                  # 'mul' if the weekly swing scales with the level
    seasonal_periods=7,
).fit()

forecast = model.forecast(28)

Evaluate — Split by Time, Never Randomly

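mae  = (test - forecast).abs().mean()
# MAPE is undefined on zero-revenue days, so mask them out instead of dividing by 0
mape = ((test - forecast).abs() / test.where(test != 0)).mean() * 100

# the baseline to beat: same day last week, built from train only.
# daily.shift(7)[-28:] would use actual test-period values for days 8-28,
# giving the baseline a 7-day horizon against the model's 28-day one.
snaive = pd.Series(np.tile(train[-7:].to_numpy(), 4), index=test.index)
mae_baseline = (test - snaive).abs().mean()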

print(f"Holt-Winters MAE: {mae:,.0f}   Seasonal naive MAE: {mae_baseline:,.0f}")

The test set must be the most recent block — a random split lets the model train on the future and produces scores you can never reproduce in real life. This is the time series version of data leakage.
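The split and the two error metrics can be wrapped in small helpers so every model in a report is scored the same way. The function names below are hypothetical, and skipping zero-actual days in MAPE is one possible convention, stated here as an assumption rather than a standard.

```python
import numpy as np
import pandas as pd

def time_split(series: pd.Series, test_days: int):
    """Hold out the most recent `test_days` observations; never shuffle."""
    return series.iloc[:-test_days], series.iloc[-test_days:]

def mae(actual: pd.Series, predicted: pd.Series) -> float:
    """Mean absolute error in the units of the series."""
    return (actual - predicted).abs().mean()

def mape(actual: pd.Series, predicted: pd.Series) -> float:
    """MAPE in percent; days where actual == 0 are skipped (ratio undefined)."""
    nonzero = actual != 0
    return ((actual - predicted).abs()[nonzero] / actual[nonzero]).mean() * 100

# Toy data with zero-revenue days, plus made-up predictions for the holdout
idx = pd.date_range("2024-01-01", periods=10, freq="D")
actual = pd.Series([100, 200, 0, 400, 100, 200, 0, 400, 100, 200],
                   dtype=float, index=idx)
train, test = time_split(actual, test_days=4)
predicted = pd.Series([0.0, 440.0, 90.0, 200.0], index=test.index)

print(f"MAE: {mae(test, predicted):.1f}   MAPE: {mape(test, predicted):.1f}%")
```

Every training date precedes every test date, which is the property a random split destroys.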

fig, ax = plt.subplots(figsize=(12, 4))
train[-90:].plot(ax=ax, label='train')
test.plot(ax=ax, label='actual')
forecast.plot(ax=ax, linestyle='--', label='forecast')
ax.legend()
plt.tight_layout()
plt.show()

Level 3 — When You Need More

ARIMA/SARIMA (statsmodels) and Prophet cover what Holt-Winters cannot: richer autocorrelation structure (SARIMA, with one seasonal period), multiple seasonalities and holiday effects (Prophet), and longer horizons. They are worth learning when forecasting becomes the job rather than a report section. The evaluation discipline (time-based split, beat the seasonal naive) carries over unchanged.

Anomaly Detection — Rolling Bands

Flag days that fall outside what recent history says is normal:

window = 28
roll_mean = daily.rolling(window).mean().shift(1)   # shift(1): today is judged
roll_std  = daily.rolling(window).std().shift(1)    # by PAST days only

upper = roll_mean + 3 * roll_std
lower = roll_mean - 3 * roll_std

anomalies = daily[(daily > upper) | (daily < lower)]

The shift(1) matters: without it, today's own (possibly anomalous) value is inside the window judging it, where it pulls the mean toward itself and inflates the standard deviation, which can hide the anomaly. If deviations were roughly normal, ±3 SD would flag about 0.3% of days by chance (around one per year); tightening to ±2 raises that to about 5% (one or two per month), so use it only if you prefer sensitive alerts and can tolerate false alarms. The trade-off is Type I vs Type II error from Statistics Notebook. For metrics with a strong weekly cycle, run the bands on the decomposition residual instead of the raw series, so a normal busy Saturday doesn't alert.
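The effect of shift(1) is clearest on a borderline spike. The sketch below uses a deterministic toy series (alternating 990 and 1010, with one day set to 1035) so the result does not depend on random noise; the function name band_flags and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Deterministic toy series: alternates 990 / 1010, with one moderate spike
idx = pd.date_range("2024-01-01", periods=120, freq="D")
base = np.where(np.arange(120) % 2 == 0, 990.0, 1010.0)
series = pd.Series(base, index=idx)
series.iloc[100] = 1035.0              # about 3.4 SD above recent history

def band_flags(s: pd.Series, window: int = 28, k: float = 3.0,
               exclude_today: bool = True) -> pd.Series:
    """Flag values outside mean ± k * SD of a rolling window."""
    mean = s.rolling(window).mean()
    std = s.rolling(window).std()
    if exclude_today:
        # judge each day against the window that ended the day before
        mean, std = mean.shift(1), std.shift(1)
    return (s > mean + k * std) | (s < mean - k * std)

with_shift = band_flags(series, exclude_today=True)
without_shift = band_flags(series, exclude_today=False)

print("flagged with shift:   ", list(series[with_shift].index.date))
print("flagged without shift:", list(series[without_shift].index.date))
```

A very large spike would be caught either way; the shift decides the moderate cases, where the spike's own contribution to the window is enough to widen the band past it.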

Common Mistakes

1. Random Train/Test Split on Time Data

Shuffled splits leak the future into training. Split by time: train on the past, test on the most recent block. The same applies to cross-validation: use expanding-window CV (sklearn.model_selection.TimeSeriesSplit), not KFold. Even unshuffled, KFold trains on folds that come after the test fold.
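To show what "expanding window" means in terms of row positions, here is a hand-rolled sketch in plain NumPy. It is an illustration of the idea, not a substitute for a library splitter; the function name and its parameters are hypothetical.

```python
import numpy as np

def expanding_window_splits(n_obs: int, n_folds: int, test_size: int):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    for fold in range(n_folds):
        test_end = n_obs - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough observations for this many folds")
        # train on everything before the test block; the block moves forward each fold
        yield np.arange(0, test_start), np.arange(test_start, test_end)

splits = list(expanding_window_splits(n_obs=100, n_folds=3, test_size=14))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}   test {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains on a longer history and tests on the block immediately after it, so no fold ever sees data from its own future.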

2. Zero-Filling "No Data"

resample('D').sum() makes missing days 0 — right for totals, wrong for averages and rates. Decide per metric; NaN that stays visible beats a silent wrong zero.

3. Comparing Partial Periods

Month-to-date vs last full month is the classic false alarm: ten days of revenue will always look like a collapse next to a full month. Compare equal, complete windows. On a dashboard that must show month-to-date, divide by days elapsed, not days in month.
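A sketch of the false alarm and its fix, assuming a hypothetical flat 1,000 per day and a report date of June 10. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical flat revenue: all of May plus the first 10 days of June
idx = pd.date_range("2024-05-01", "2024-06-10", freq="D")
revenue = pd.Series(1000.0, index=idx)
as_of = idx[-1]

month_start = as_of.replace(day=1)
mtd = revenue.loc[month_start:as_of]                 # month-to-date
days_elapsed = len(mtd)

prev_start = month_start - pd.DateOffset(months=1)
prev_full = revenue.loc[prev_start:month_start - pd.Timedelta(days=1)]
prev_same_window = revenue.loc[
    prev_start:prev_start + pd.Timedelta(days=days_elapsed - 1)
]

naive_change = mtd.sum() / prev_full.sum() - 1         # 10 days vs 31 days
fair_change = mtd.sum() / prev_same_window.sum() - 1   # 10 days vs 10 days

print(f"MTD vs last full month:  {naive_change:+.1%}")
print(f"MTD vs same days last month: {fair_change:+.1%}")
```

Equal-length windows can still differ in weekday mix (one may contain an extra weekend), so for metrics with a strong weekly cycle a comparison of the last N complete weeks is often the safer choice.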

4. Ignoring the Weekly Cycle

"Revenue fell 30% vs yesterday" — yesterday was Saturday. Daily business metrics almost always need same-weekday comparison (lag-7) or a 7-day MA before any conclusion.

5. Forecasting Without a Baseline

A model with 12% MAPE sounds fine until the seasonal naive scores 11%. Always report the naive baseline next to the model.

6. Over-Smoothing

A 90-day MA on daily data erases the events you are paid to notice. Match the window to the question: 7 days to remove weekday noise, 28 for monthly trend — and keep the raw series on the plot.

7. Trusting Decomposition Near the Edges

The trend line from seasonal_decompose is a centered moving average, so it is NaN for the first and last period // 2 observations (3 days at each end for period=7), or a linear extrapolation if you set extrapolate_trend. That is exactly where you care most. For "what is the trend right now", use a trailing MA or the Holt-Winters level instead.
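The edge problem can be reproduced with pandas alone by comparing a centered and a trailing 7-day average on a toy series. The series below (the values 0 to 27) is an assumption chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Toy series: a clean upward line over four weeks
idx = pd.date_range("2024-01-01", periods=28, freq="D")
series = pd.Series(np.arange(28, dtype=float), index=idx)

centered = series.rolling(7, center=True).mean()   # needs 3 future days
trailing = series.rolling(7).mean()                # uses past days only

print(pd.DataFrame({"centered": centered, "trailing": trailing}).tail(5))
```

The centered average has no value for the last three days, while the trailing average does. The cost is lag: the trailing value on a given day equals the centered value from three days earlier, so it describes the recent past rather than today.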


This pairs with Metrics & Diagnosis Guide — decomposition and rolling bands answer its Step 1 ("shape the drop in time") with code, and the seasonality profiles are what make its window comparisons fair.