Statistics Notebook

Statistics is the reasoning layer beneath every analysis. SQL and Python tell you how to compute a number; statistics tells you what that number means and whether you can trust it. This notebook covers the concepts that appear most often in DA work and interviews.

All examples use Python with NumPy, Pandas, and SciPy. The dataset is the parking system: payment_df with columns amount, parking_type, station_code, payment_method, entry_time, plus duration_mins, city, and the A/B-test columns group and converted used in later sections.

import numpy as np
import pandas as pd
from scipy import stats

See Pandas Notebook for data manipulation and ML Model Selection Guide for where statistics meets machine learning.


Descriptive Statistics

Descriptive statistics summarise a dataset with a small set of numbers. They are always the first step before any analysis.

Central Tendency — Where is the centre?

Measure | Formula                  | When to use
Mean    | Sum ÷ count              | Symmetric data, no extreme outliers
Median  | Middle value when sorted | Skewed data or outliers present
Mode    | Most frequent value      | Categorical data

amounts = payment_df['amount']

amounts.mean()      # arithmetic average
amounts.median()    # middle value
amounts.mode()[0]   # most frequent value

Mean vs Median — when it matters:

# example: 9 transactions of 100, one of 10000
example = np.array([100]*9 + [10000])
print(example.mean())    # 1090 — pulled up by the outlier
print(np.median(example))  # 100 — unaffected

Use the median when reporting income, revenue per transaction, or any metric where outliers exist. Mean is misleading when the distribution has a long tail.

Spread — How spread out is the data?

Measure                 | Meaning
Range                   | Max − Min
Variance                | Average squared deviation from the mean
Standard deviation (SD) | √Variance — same unit as the data
IQR                     | Q3 − Q1 — middle 50% of the data

amounts.std()                           # standard deviation
amounts.var()                           # variance
amounts.quantile(0.75) - amounts.quantile(0.25)  # IQR
amounts.describe()                      # all at once

Standard deviation tells you the typical distance from the mean. An amount SD of 200 means most transactions are within ±200 of the average.

IQR is more robust than SD when outliers are present — it ignores the top and bottom 25%.
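A quick check on synthetic numbers (not the parking data): one extreme value inflates the SD many times over, while the IQR barely moves.

```python
import pandas as pd

clean = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 180, 190])
dirty = pd.concat([clean, pd.Series([10000])], ignore_index=True)

def iqr(s):
    """Interquartile range: Q3 - Q1."""
    return s.quantile(0.75) - s.quantile(0.25)

# the single outlier multiplies the SD many times over...
print(f"SD:  {clean.std():.1f} -> {dirty.std():.1f}")
# ...but shifts the IQR only slightly
print(f"IQR: {iqr(clean):.1f} -> {iqr(dirty):.1f}")
```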

Percentiles and Quartiles

amounts.quantile(0.25)    # Q1 — 25th percentile
amounts.quantile(0.50)    # Q2 — median
amounts.quantile(0.75)    # Q3 — 75th percentile
amounts.quantile(0.90)    # 90th percentile

np.percentile(amounts, [25, 50, 75, 90])   # all at once

Percentiles answer: "what value is X% of the data below?" A transaction at the 90th percentile means 90% of all transactions are lower.

Skewness

amounts.skew()

Value | Shape                          | Implication
≈ 0   | Symmetric                      | Mean ≈ Median
> 0   | Right-skewed (long right tail) | Mean > Median — outliers pulling up
< 0   | Left-skewed (long left tail)   | Mean < Median

Transaction amounts are almost always right-skewed — most transactions are small, a few are very large.
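To see this with numbers, a sketch on synthetic data, using an exponential sample as a stand-in for right-skewed transaction amounts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# exponential: most values small, a few very large (long right tail)
right_skewed = pd.Series(rng.exponential(scale=100, size=10_000))
symmetric    = pd.Series(rng.normal(loc=100, scale=15, size=10_000))

print(f"exponential skew: {right_skewed.skew():.2f}")   # clearly positive
print(f"normal skew:      {symmetric.skew():.2f}")      # near zero
print(f"mean > median:    {right_skewed.mean() > right_skewed.median()}")
```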


Distributions

Normal Distribution

The normal (Gaussian) distribution is the most important distribution in statistics. Many natural quantities approximate it, and statistical tests are often built on the assumption of normality.

# check if a sample looks normal
from scipy.stats import shapiro

stat, p = shapiro(amounts.sample(50))   # Shapiro-Wilk test (works well for n < 2000)
print(f"p-value: {p:.4f}")
# p > 0.05 → fail to reject normality (data looks normal)
# p < 0.05 → reject normality (data is not normal)

Properties of the normal distribution:
  • 68% of data falls within 1 SD of the mean
  • 95% within 2 SD
  • 99.7% within 3 SD

mean  = amounts.mean()
sd    = amounts.std()

within_1sd = amounts.between(mean - sd,   mean + sd).mean()    # ≈ 0.68
within_2sd = amounts.between(mean - 2*sd, mean + 2*sd).mean()  # ≈ 0.95

Central Limit Theorem (CLT)

The CLT states: the distribution of sample means approaches a normal distribution as sample size grows, regardless of the original distribution's shape.

# transaction amounts are right-skewed
# but the mean of 50 random samples will be approximately normal

sample_means = [payment_df['amount'].sample(50).mean() for _ in range(1000)]
pd.Series(sample_means).hist(bins=30)

Why this matters: most hypothesis tests (t-tests, etc.) assume normality of the sampling distribution, not the raw data. The CLT justifies using these tests even on skewed data as long as the sample is large enough (usually n ≥ 30).
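A self-contained sketch of the CLT on synthetic data (an exponential population standing in for skewed amounts): the raw values are heavily skewed, but means of samples of 50 are close to symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = rng.exponential(scale=100, size=100_000)   # heavily right-skewed

# take 2000 random samples of 50 and record each sample's mean
sample_means = pd.Series(
    [rng.choice(population, size=50).mean() for _ in range(2000)]
)

print(f"population skew:  {pd.Series(population).skew():.2f}")   # strongly positive
print(f"sample-mean skew: {sample_means.skew():.2f}")            # much closer to 0
```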

Outlier Detection

# Method 1: IQR rule (common default)
Q1 = amounts.quantile(0.25)
Q3 = amounts.quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = payment_df[(amounts < lower) | (amounts > upper)]

# Method 2: Z-score (assumes normal distribution)
z_scores = np.abs(stats.zscore(amounts))
outliers_z = payment_df[z_scores > 3]   # more than 3 SD from mean

Use the IQR method by default — it works on skewed data. Use Z-scores only if the data is roughly normal.


Hypothesis Testing

Hypothesis testing is a framework for deciding whether an observed difference is real or just random noise.

The Framework

1. State the null hypothesis (H₀): "there is no difference / no effect"
2. State the alternative hypothesis (H₁): "there is a difference"
3. Choose a significance level (α): typically 0.05
4. Compute the test statistic and p-value
5. Decision: if p < α, reject H₀

p-value: the probability of observing a result at least as extreme as yours, assuming H₀ is true. A small p-value means the data is unlikely under H₀.

Common misunderstanding: p < 0.05 does not prove H₁ is true. It only means the data is inconsistent with H₀ at the chosen significance level.
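The definition can be checked by simulation, a sketch with synthetic data where H₀ is true by construction, so p < 0.05 should occur in roughly 5% of tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_tests = 1000
false_positives = 0
for _ in range(n_tests):
    # both groups drawn from the SAME distribution: any "effect" is pure noise
    a = rng.normal(loc=300, scale=50, size=40)
    b = rng.normal(loc=300, scale=50, size=40)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / n_tests:.3f}")   # ≈ 0.05
```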

Type I and Type II Errors

                  | H₀ is actually true               | H₀ is actually false
Reject H₀         | Type I error (false positive) — α | Correct (power)
Fail to reject H₀ | Correct                           | Type II error (false negative) — β
  • Type I error (α): claiming an effect exists when it does not. Controlled by setting α = 0.05.
  • Type II error (β): missing a real effect. Controlled by increasing sample size.

t-test — Compare Means

One-sample t-test: Is the mean equal to a known value?

# Is the average transaction amount different from 300?
stat, p = stats.ttest_1samp(payment_df['amount'], popmean=300)
print(f"t={stat:.3f}, p={p:.4f}")

Two-sample t-test: Are two group means different?

hourly  = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']

stat, p = stats.ttest_ind(hourly, monthly)
print(f"t={stat:.3f}, p={p:.4f}")

if p < 0.05:
    print("Significant difference in average amount between hourly and monthly")
else:
    print("No significant difference detected")

ttest_ind assumes equal variances by default. Add equal_var=False (Welch's t-test) when the two groups have different variances — this is generally the safer default.

stat, p = stats.ttest_ind(hourly, monthly, equal_var=False)

ANOVA — Compare Means Across 3 or More Groups

A t-test compares two groups. ANOVA (Analysis of Variance) extends this to three or more groups in one test.

hourly   = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly  = payment_df[payment_df['parking_type'] == 'monthly']['amount']
daily    = payment_df[payment_df['parking_type'] == 'daily']['amount']

f_stat, p = stats.f_oneway(hourly, monthly, daily)
print(f"F={f_stat:.3f}, p={p:.4f}")

If p < 0.05: at least one group's mean is significantly different from the others. ANOVA does not tell you which groups differ — run a post-hoc test for that.

Post-hoc Test — Which Groups Are Different?

from statsmodels.stats.multicomp import pairwise_tukeyhsd

result = pairwise_tukeyhsd(
    endog=payment_df['amount'],
    groups=payment_df['parking_type'],
    alpha=0.05
)
print(result)

Tukey's HSD compares all pairs of groups while controlling for multiple comparisons. The output shows which specific pairs are significantly different.

Scenario                        | Test
2 groups, normal data           | t-test
2 groups, non-normal or small n | Mann-Whitney U
3+ groups, normal data          | One-way ANOVA → Tukey HSD
3+ groups, non-normal           | Kruskal-Wallis (non-parametric ANOVA)

# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
stat, p = stats.kruskal(hourly, monthly, daily)
print(f"H={stat:.3f}, p={p:.4f}")

Chi-square Test — Compare Proportions / Category Distributions

Use the chi-square test when both variables are categorical.

# Is payment method distribution different between cities?
contingency = pd.crosstab(payment_df['city'], payment_df['payment_method'])

chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p:.4f}, degrees of freedom={dof}")

If p < 0.05: the distribution of payment methods differs significantly across cities.

Mann-Whitney U Test — Non-parametric Alternative to t-test

Use when the data is not normally distributed and the sample is small.

stat, p = stats.mannwhitneyu(hourly, monthly, alternative='two-sided')
print(f"U={stat:.1f}, p={p:.4f}")

Mann-Whitney tests whether one group tends to have higher values than the other, without assuming normality. For large samples (n > 30), the t-test is usually robust enough due to the CLT.


Confidence Intervals

A confidence interval (CI) gives a range for the true population parameter, rather than a single point estimate. It is more informative than a p-value alone — it tells you both whether an effect is significant and how large it is.

"The average transaction amount is 350. But how confident are we? A 95% CI of [340, 360] says we are quite precise. A CI of [200, 500] says the estimate is very uncertain."

95% Confidence Interval for a Mean

import scipy.stats as stats

sample = payment_df['amount']
n      = len(sample)
mean   = sample.mean()
se     = stats.sem(sample)   # standard error = SD / √n

ci_low, ci_high = stats.t.interval(
    confidence=0.95,
    df=n - 1,
    loc=mean,
    scale=se
)
print(f"Mean: {mean:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")

Interpretation: if we repeated this sampling process 100 times, approximately 95 of the resulting intervals would contain the true population mean.

Common misinterpretation: a 95% CI does NOT mean "there is a 95% probability the true mean is in this range." The true mean is fixed — it is either in the interval or it is not.
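The correct interpretation can be verified by simulation on synthetic data with a known true mean: roughly 95% of intervals built this way cover the true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean = 300

n_repeats = 1000
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(loc=true_mean, scale=50, size=40)
    lo, hi = stats.t.interval(0.95, df=len(sample) - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    if lo <= true_mean <= hi:
        covered += 1

print(f"Coverage: {covered / n_repeats:.3f}")   # ≈ 0.95
```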

CI for the Difference Between Two Means

hourly  = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']

# Welch's t-test (on scipy >= 1.10, result.confidence_interval() gives the CI directly)
result = stats.ttest_ind(hourly, monthly, equal_var=False)

diff    = hourly.mean() - monthly.mean()
se_diff = np.sqrt(hourly.sem()**2 + monthly.sem()**2)
df      = min(len(hourly), len(monthly)) - 1   # conservative lower bound on the Welch df

ci_low, ci_high = stats.t.interval(0.95, df=df, loc=diff, scale=se_diff)
print(f"Difference: {diff:.2f}")
print(f"95% CI for difference: [{ci_low:.2f}, {ci_high:.2f}]")

If the CI for the difference does not include 0, the difference is statistically significant at α = 0.05 — consistent with p < 0.05.

CI Width and Sample Size

# larger sample → narrower CI (more precise estimate)
for n in [30, 100, 500, 1000]:
    sample = payment_df['amount'].sample(n)
    se     = stats.sem(sample)
    ci     = stats.t.interval(0.95, df=n-1, loc=sample.mean(), scale=se)
    print(f"n={n:5d}  CI width: {ci[1]-ci[0]:.1f}")

CI width shrinks with √n — quadrupling sample size halves the CI width.
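The √n relationship can be sketched directly on synthetic data: quadrupling n from 100 to 400 should roughly halve the width.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
population = rng.normal(loc=300, scale=80, size=100_000)

def ci_width(n):
    """Width of the 95% CI for the mean of a random sample of size n."""
    sample = rng.choice(population, size=n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    return hi - lo

w100, w400 = ci_width(100), ci_width(400)
print(f"n=100 width: {w100:.1f}")
print(f"n=400 width: {w400:.1f}")
print(f"ratio: {w100 / w400:.2f}")   # roughly 2
```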

p-value vs Confidence Interval

                 | p-value                | Confidence Interval
What it answers  | Is the effect real?    | How large is the effect?
Output           | Single number (0–1)    | Range [lower, upper]
Threshold        | p < 0.05 = significant | CI excludes 0 = significant
More informative | No                     | Yes — includes direction and magnitude

Prefer reporting confidence intervals over p-values alone when presenting findings to stakeholders.


Correlation

Correlation measures the strength and direction of the linear relationship between two numeric variables.

Pearson Correlation

# correlation between amount and duration
r, p = stats.pearsonr(payment_df['amount'], payment_df['duration_mins'])
print(f"r={r:.3f}, p={p:.4f}")

# full correlation matrix
payment_df[['amount', 'duration_mins']].corr()

r value    | Interpretation
0.9 to 1.0 | Very strong positive
0.7 to 0.9 | Strong positive
0.4 to 0.7 | Moderate positive
0.1 to 0.4 | Weak positive
≈ 0        | No linear relationship
Negative   | Inverse relationship

Pearson assumes both variables are numeric and the relationship is linear.

Spearman Correlation — Non-parametric

r_spearman, p = stats.spearmanr(payment_df['amount'], payment_df['duration_mins'])

Spearman measures rank correlation — use it when the relationship is monotonic but not necessarily linear, or when the data contains outliers.
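A sketch of the difference on synthetic data: y = eˣ is perfectly monotonic in x but far from linear, so Spearman scores 1 while Pearson does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=500)
y = np.exp(x)                       # monotonic but strongly non-linear

r_pearson, _  = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)

print(f"Pearson:  {r_pearson:.3f}")    # well below 1: penalises the curvature
print(f"Spearman: {r_spearman:.3f}")   # 1.000: ranks agree perfectly
```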

Correlation vs Causation

Correlation does not imply causation. Two variables can be correlated because:
1. A causes B
2. B causes A
3. A third variable C causes both
4. Pure coincidence

Example: ice cream sales and drowning rates are correlated — both increase in summer. The cause is hot weather, not ice cream.

Always ask: what is the mechanism? Is there a plausible causal path? Can we run an experiment to test it?


Simple Linear Regression

Regression quantifies how much one variable changes when another changes by one unit.

from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(
    payment_df['duration_mins'],
    payment_df['amount']
)

print(f"slope:     {slope:.3f}")      # amount increases by this much per minute
print(f"intercept: {intercept:.3f}")  # amount when duration = 0
print(f"R²:        {r_value**2:.3f}") # % of variance explained
print(f"p-value:   {p_value:.4f}")    # is the slope significantly different from 0?

Interpreting the Output

  • Slope: for every additional minute of parking, amount increases by slope units.
  • Intercept: the predicted amount when duration is 0 (baseline charge).
  • R² (R-squared): percentage of the variance in amount explained by duration. R² = 0.40 means duration explains 40% of the variation; the remaining 60% comes from other factors.
  • p-value: tests whether the slope is significantly different from 0. p < 0.05 means duration has a statistically significant relationship with amount.

For multivariate regression (multiple predictors), use scikit-learn — see Machine Learning Notebook.


A/B Testing

An A/B test is a controlled experiment that isolates the effect of one change. It is the standard method for testing product changes, pricing, UI design, and promotions.

Workflow

1. Define the metric (conversion rate, average amount, retention)
2. Calculate required sample size before running
3. Randomly assign users to control (A) and treatment (B)
4. Run the test until the sample size is reached
5. Analyse with a two-sample t-test or z-test
6. Report both statistical significance and practical significance

Sample Size Calculation

Run this before the experiment — do not stop early when you see a significant result.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=0.2,   # minimum detectable effect (Cohen's d)
    alpha=0.05,        # significance level
    power=0.80         # 1 - β, probability of detecting a real effect
)
print(f"Required sample per group: {int(np.ceil(n))}")

Effect size (Cohen's d): the minimum difference you care about, expressed in standard deviations. Use 0.2 (small), 0.5 (medium), or 0.8 (large) as benchmarks if you do not have a specific business requirement.

Run the Test

# control group: original pricing (group A)
# treatment group: new pricing (group B)
group_a = payment_df[payment_df['group'] == 'control']['amount']
group_b = payment_df[payment_df['group'] == 'treatment']['amount']

# two-sample t-test
stat, p = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Group A mean: {group_a.mean():.2f}")
print(f"Group B mean: {group_b.mean():.2f}")
print(f"Difference:   {group_b.mean() - group_a.mean():.2f}")
print(f"p-value:      {p:.4f}")
print("Significant" if p < 0.05 else "Not significant")

Statistical vs Practical Significance

A result can be statistically significant but practically irrelevant.

# practical significance: effect size (Cohen's d)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)   # simple average — assumes similar group sizes
cohen_d    = (group_b.mean() - group_a.mean()) / pooled_std

print(f"Cohen's d: {cohen_d:.3f}")
# < 0.2 = negligible, 0.2–0.5 = small, 0.5–0.8 = medium, > 0.8 = large

Example: a new button colour increases click rate from 10.00% to 10.01%. With 1 million users, p < 0.001 — statistically significant. But is a 0.01-percentage-point lift worth the engineering cost? That is a business decision, not a statistics question.

Always report both: p-value (is it real?) and effect size or absolute difference (does it matter?).


Common Statistical Mistakes

1. Using Mean on Skewed Data

Report median for income, transaction amounts, and response times. Use mean only when the distribution is symmetric.

2. Correlation ≠ Causation

Two correlated variables may both be driven by a third factor. Always propose a causal mechanism before claiming one variable causes another.

3. p-Hacking (Multiple Testing)

Running 20 tests and reporting the one that gives p < 0.05 is not a valid finding — at α = 0.05, on average 1 in 20 tests comes out significant by pure chance. Apply a Bonferroni correction when running multiple comparisons:

# Bonferroni correction: divide α by the number of tests
n_tests     = 10
alpha_adj   = 0.05 / n_tests   # 0.005
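statsmodels can also apply the correction to a whole batch of results at once — a sketch with hypothetical p-values from 10 separate tests:

```python
from statsmodels.stats.multitest import multipletests

# hypothetical p-values from 10 separate tests
p_values = [0.001, 0.008, 0.020, 0.041, 0.049,
            0.120, 0.300, 0.450, 0.700, 0.900]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                         method='bonferroni')
print(reject)       # only tests whose adjusted p stays below alpha survive
print(p_adjusted)   # each p multiplied by the number of tests (capped at 1)
```

Note that several raw p-values below 0.05 no longer survive once the correction is applied.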

4. Stopping an A/B Test Early

If you stop when the result looks good, you inflate the false positive rate. Decide the sample size in advance and do not peek until it is reached.
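A simulation sketch of why peeking is dangerous (synthetic data, H₀ true by construction): checking after every batch and stopping at the first p < 0.05 pushes the false positive rate far above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def peeked_result(max_n=200, step=10):
    """Return True if a peeking analyst would declare a (false) winner."""
    a = rng.normal(loc=300, scale=50, size=max_n)   # H0 true: identical groups
    b = rng.normal(loc=300, scale=50, size=max_n)
    for n in range(step, max_n + 1, step):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:
            return True      # stopped early at the first "significant" peek
    return False

n_sims = 500
rate = sum(peeked_result() for _ in range(n_sims)) / n_sims
print(f"False positive rate with peeking: {rate:.3f}")   # well above 0.05
```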

5. Simpson's Paradox

A trend that appears in several groups can disappear or reverse when the groups are combined.

# example: overall conversion rate looks lower for Group B
# but Group B is better in every individual city
# because Group B happened to have more traffic from low-conversion cities

pd.crosstab(payment_df['city'], payment_df['group'],
            values=payment_df['converted'], aggfunc='mean')

Always segment your data and check if the aggregate result holds within subgroups.
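A minimal worked example with made-up counts (hypothetical cities and conversions): group B wins inside every city yet loses in the aggregate, because its traffic is concentrated in the low-conversion city.

```python
import pandas as pd

# hypothetical counts: B wins in BOTH cities, A wins overall
data = pd.DataFrame({
    'city':      ['high_conv', 'high_conv', 'low_conv', 'low_conv'],
    'group':     ['A',         'B',         'A',        'B'],
    'users':     [900,         100,         100,        900],
    'converted': [90,          11,          1,          10],
})
data['rate'] = data['converted'] / data['users']

overall = data.groupby('group')[['converted', 'users']].sum()
overall['rate'] = overall['converted'] / overall['users']

print(data)      # per city: B beats A in both high_conv and low_conv
print(overall)   # aggregate: A (0.091) beats B (0.021)
```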

6. Survivorship Bias

Analysing only successful or existing records leads to wrong conclusions.

Example: analysing only completed parking sessions misses all vehicles that left without paying — skewing average amount upward.


Quick Reference — Which Test to Use

Question                                                | Test
Is the mean equal to a known value?                     | One-sample t-test
Are two group means different?                          | Two-sample t-test (Welch's)
Are two group means different (non-normal)?             | Mann-Whitney U
Are 3+ group means different?                           | One-way ANOVA → Tukey HSD
Are 3+ group means different (non-normal)?              | Kruskal-Wallis
Are category distributions different across groups?     | Chi-square test
How precise is my estimate?                             | Confidence interval
Is the relationship between two numerics linear?        | Pearson correlation + linregress
Is the relationship monotonic (not necessarily linear)? | Spearman correlation
Did a product change have a real effect?                | A/B test → two-sample t-test + CI