Statistics Notebook

Statistics is the reasoning layer under every analysis. SQL and Python show you how to compute a number. Statistics tells you what the number means and whether you can trust it. This notebook covers the ideas that show up most often in DA work and interviews.

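All examples use Python with NumPy, Pandas, and SciPy; the post-hoc and sample-size examples also need statsmodels. The dataset is the parking system: payment_df with columns amount, parking_type, station_code, payment_method, entry_time. Some later examples also assume columns that are not in this list (city, duration_mins, group, converted).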

import numpy as np
import pandas as pd
from scipy import stats

See Pandas Notebook for data manipulation and ML Model Selection Guide for where statistics meets machine learning.


Descriptive Statistics

Descriptive statistics summarize a dataset with a small set of numbers. They are always the first step before any analysis.

Central Tendency — Where is the center?

Measure | Formula | When to use
Mean | Sum ÷ count | Symmetric data, no extreme outliers
Median | The middle value when sorted | Skewed data or when there are outliers
Mode | The most frequent value | Category data

amounts = payment_df['amount']

amounts.mean()      # average
amounts.median()    # middle value
amounts.mode()[0]   # most frequent value

Mean vs Median — when it matters:

# example: 9 transactions of 100, one of 10000
example = np.array([100]*9 + [10000])
print(example.mean())    # 1090 — pulled up by the outlier
print(np.median(example))  # 100 — not affected

Use the median when you report income, revenue per transaction, or any number where outliers exist. The mean is misleading when the distribution has a long tail.

Spread — How spread out is the data?

Measure | Meaning
Range | Max − Min
Variance | Average of the squared distances from the mean
Standard deviation (SD) | √Variance, in the same unit as the data
IQR | Q3 − Q1, the spread of the middle 50% of the data

amounts.std()                           # standard deviation
amounts.var()                           # variance
amounts.quantile(0.75) - amounts.quantile(0.25)  # IQR
amounts.describe()                      # count, mean, std, min, quartiles, max in one call

The standard deviation tells you the typical distance from the mean. For roughly normal data, an SD of 200 means about 68% of transactions are within ±200 of the average; for skewed data the share can differ.

The IQR is more reliable than the SD when there are outliers. It ignores the top and bottom 25%.
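
To make the comparison concrete, here is a minimal sketch on made-up numbers (the values in base are hypothetical, not taken from payment_df). It adds one extreme transaction and shows how each measure of spread reacts.

```python
import numpy as np

# Hypothetical data: nine ordinary transactions, then the same nine plus one extreme value
base = np.array([80, 90, 95, 100, 100, 105, 110, 120, 130], dtype=float)
with_outlier = np.append(base, 10_000.0)

def iqr(x):
    """Interquartile range: Q3 minus Q1."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

sd_base, sd_out = base.std(ddof=1), with_outlier.std(ddof=1)
iqr_base, iqr_out = iqr(base), iqr(with_outlier)

print(f"SD : {sd_base:.1f} -> {sd_out:.1f}")    # jumps by orders of magnitude
print(f"IQR: {iqr_base:.1f} -> {iqr_out:.1f}")  # moves only slightly
```

A single outlier dominates the SD because distances are squared, while the IQR only looks at the middle half of the sorted data.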

Percentiles and Quartiles

amounts.quantile(0.25)    # Q1 — 25th percentile
amounts.quantile(0.50)    # Q2 — median
amounts.quantile(0.75)    # Q3 — 75th percentile
amounts.quantile(0.90)    # 90th percentile

np.percentile(amounts, [25, 50, 75, 90])   # all at once

Percentiles answer this: "below what value does X% of the data fall?" A transaction at the 90th percentile is larger than about 90% of all transactions.

Skewness

amounts.skew()

Value | Shape | What it means
≈ 0 | Symmetric | Mean ≈ Median
> 0 | Right-skewed (long right tail) | Mean > Median; outliers pull the mean up
< 0 | Left-skewed (long left tail) | Mean < Median

Transaction amounts are almost always right-skewed — most transactions are small, a few are very large.
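
The sketch below compares a symmetric sample with a right-skewed one. Both are synthetic (a normal and a lognormal draw chosen only for illustration), so the exact numbers say nothing about the real parking data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-ins, not real transactions
symmetric = pd.Series(rng.normal(loc=300, scale=50, size=5_000))
right_skewed = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5_000))

for name, s in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name:13s} mean={s.mean():8.1f}  median={s.median():8.1f}  skew={s.skew():5.2f}")
```

In the skewed sample the mean sits well above the median, which is the pattern to expect in amount.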


Distributions

Normal Distribution

The normal (Gaussian) distribution is the most important distribution in statistics. Many real-world numbers look like it, and many statistical tests are built on the assumption that the data is normal.

# check if a sample looks normal
from scipy.stats import shapiro

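stat, p = shapiro(amounts.sample(50))   # Shapiro-Wilk test; SciPy warns the p-value may be inaccurate for n > 5000
print(f"p-value: {p:.4f}")
# p > 0.05 → fail to reject normality (the sample is consistent with a normal distribution)
# p < 0.05 → reject normality (the sample is unlikely to come from a normal distribution)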

Properties of the normal distribution:

  • 68% of the data is within 1 SD of the mean
  • 95% is within 2 SD
  • 99.7% is within 3 SD

mean  = amounts.mean()
sd    = amounts.std()

within_1sd = amounts.between(mean - sd,   mean + sd).mean()    # ≈ 0.68 only if amounts is roughly normal
within_2sd = amounts.between(mean - 2*sd, mean + 2*sd).mean()  # ≈ 0.95 only if amounts is roughly normal
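
The 68 / 95 / 99.7 figures are not data-dependent; they follow from the normal cumulative distribution function. A short check using scipy.stats.norm:

```python
from scipy import stats

# share of a normal distribution within k standard deviations of the mean
shares = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

If the shares computed on real data are far from these values, the data is probably not normal.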

Central Limit Theorem (CLT)

The CLT says: the distribution of sample means gets close to a normal distribution as the sample size grows, no matter what the original distribution looks like.

# transaction amounts are right-skewed
# but the means of many random samples of size 50 will be roughly normal

sample_means = [payment_df['amount'].sample(50).mean() for _ in range(1000)]
pd.Series(sample_means).hist(bins=30)

Why this matters: most hypothesis tests (t-tests, etc.) assume the sampling distribution is normal, not the raw data. The CLT lets you use these tests on skewed data, as long as the sample is big enough (n ≥ 30 is a common rule of thumb; heavily skewed data can need more).
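
One way to see the effect of sample size is to measure the skewness of the sample means directly. This sketch uses a synthetic exponential population as a stand-in for skewed amounts (an assumption for illustration), and the helper skew_of_sample_means is a made-up name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.exponential(scale=300, size=100_000)   # synthetic, strongly right-skewed

def skew_of_sample_means(n, reps=2_000):
    """Draw `reps` samples of size n and return the skewness of their means."""
    means = rng.choice(population, size=(reps, n)).mean(axis=1)
    return stats.skew(means)

raw_skew = stats.skew(population)
skews = {n: skew_of_sample_means(n) for n in (5, 30, 200)}

print(f"raw data skew: {raw_skew:.2f}")
for n, s in skews.items():
    print(f"n={n:3d}  skew of sample means: {s:.2f}")
```

The skewness of the means falls toward 0 as n grows, which is the CLT at work. At n = 30 some skew is usually still visible for a population this lopsided.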

Outlier Detection

# Method 1: IQR rule (the common default)
Q1 = amounts.quantile(0.25)
Q3 = amounts.quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = payment_df[(amounts < lower) | (amounts > upper)]

# Method 2: Z-score (assumes a normal distribution)
z_scores = np.abs(stats.zscore(amounts))
outliers_z = payment_df[z_scores > 3]   # more than 3 SD from the mean

Use the IQR method by default. It works on skewed data. Use Z-scores only when the data is roughly normal.


Hypothesis Testing

Hypothesis testing is a framework for deciding whether a difference you see is real or just random noise.

The Framework

1. State the null hypothesis (H₀): "there is no difference / no effect"
2. State the alternative hypothesis (H₁): "there is a difference"
3. Pick a significance level (α): usually 0.05
4. Compute the test statistic and the p-value
5. Decision: if p < α, reject H₀

p-value: the chance of seeing a result at least as extreme as yours, assuming H₀ is true. A small p-value means the data is unlikely under H₀.

Common misunderstanding: p < 0.05 does not prove H₁ is true. It only means the data does not fit H₀ at the chosen significance level.
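
The definition of a p-value can be made literal with a permutation test: shuffle the group labels many times (which is what H₀ implies is allowed) and count how often the shuffled difference is at least as extreme as the observed one. The data below is synthetic, and this is a sketch of the idea rather than a replacement for the tests later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical samples; group_b is drawn with a higher true mean
group_a = rng.normal(loc=300, scale=50, size=40)
group_b = rng.normal(loc=330, scale=50, size=40)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_perm = 5_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)     # under H0 the labels are interchangeable
    perm_diffs[i] = shuffled[40:].mean() - shuffled[:40].mean()

# share of shuffles at least as extreme as the observed difference (two-sided)
p_value = (np.abs(perm_diffs) >= abs(observed)).mean()
print(f"observed difference: {observed:.1f}, permutation p-value: {p_value:.4f}")
```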

Type I and Type II Errors

Decision | H₀ is actually true | H₀ is actually false
Reject H₀ | Type I error (false positive), probability α | Correct (power)
Fail to reject H₀ | Correct | Type II error (false negative), probability β

  • Type I error (α): saying there is an effect when there is not. Controlled by setting α = 0.05.
  • Type II error (β): missing a real effect. Lowered by using a bigger sample size.
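
Both error rates can be estimated by simulation. The sketch below runs many synthetic experiments with Welch's t-test; the effect size of 0.3 SD and the sample sizes are arbitrary choices for illustration, and rejection_rate is a made-up helper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_rate(true_diff, n, reps=2_000, alpha=0.05):
    """Share of simulated experiments in which Welch's t-test rejects H0."""
    a = rng.normal(0.0, 1.0, size=(reps, n))
    b = rng.normal(true_diff, 1.0, size=(reps, n))
    p = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    return (p < alpha).mean()

type_1_rate = rejection_rate(true_diff=0.0, n=50)    # H0 true: should land near alpha
power_small = rejection_rate(true_diff=0.3, n=50)    # H0 false, small sample
power_large = rejection_rate(true_diff=0.3, n=400)   # H0 false, larger sample

print(f"Type I rate (no effect): {type_1_rate:.3f}")
print(f"Power at n=50:  {power_small:.3f}  (Type II rate {1 - power_small:.3f})")
print(f"Power at n=400: {power_large:.3f}  (Type II rate {1 - power_large:.3f})")
```

The Type I rate stays near α regardless of n, while the Type II rate drops as the sample grows.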

t-test — Compare Means

One-sample t-test: Is the mean equal to a known value?

# Is the average transaction amount different from 300?
stat, p = stats.ttest_1samp(payment_df['amount'], popmean=300)
print(f"t={stat:.3f}, p={p:.4f}")

Two-sample t-test: Are two group means different?

hourly  = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']

stat, p = stats.ttest_ind(hourly, monthly)
print(f"t={stat:.3f}, p={p:.4f}")

if p < 0.05:
    print("Significant difference in average amount between hourly and monthly")
else:
    print("No significant difference detected")

ttest_ind assumes equal variance by default. Add equal_var=False (Welch's t-test) when the two groups have different variances. Welch's is the safer default in general.

stat, p = stats.ttest_ind(hourly, monthly, equal_var=False)

ANOVA — Compare Means Across 3 or More Groups

A t-test compares two groups. ANOVA (Analysis of Variance) extends this to three or more groups in one test.

hourly   = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly  = payment_df[payment_df['parking_type'] == 'monthly']['amount']
daily    = payment_df[payment_df['parking_type'] == 'daily']['amount']

f_stat, p = stats.f_oneway(hourly, monthly, daily)
print(f"F={f_stat:.3f}, p={p:.4f}")

If p < 0.05: at least one group's mean is different from the others. ANOVA does not tell you which groups differ — run a post-hoc test for that.

Post-hoc Test — Which Groups Are Different?

from statsmodels.stats.multicomp import pairwise_tukeyhsd

result = pairwise_tukeyhsd(
    endog=payment_df['amount'],
    groups=payment_df['parking_type'],
    alpha=0.05
)
print(result)

Tukey's HSD compares every pair of groups while controlling for the multiple-comparison problem. The output shows which specific pairs are different.

Scenario | Test
2 groups, normal data | t-test
2 groups, non-normal or small n | Mann-Whitney U
3+ groups, normal data | One-way ANOVA → Tukey HSD
3+ groups, non-normal | Kruskal-Wallis (non-parametric ANOVA)

# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
stat, p = stats.kruskal(hourly, monthly, daily)
print(f"H={stat:.3f}, p={p:.4f}")

Chi-square Test — Compare Proportions / Category Distributions

Use the chi-square test when both variables are categories.

# Is the payment method distribution different across cities?
contingency = pd.crosstab(payment_df['city'], payment_df['payment_method'])

chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p:.4f}, degrees of freedom={dof}")

If p < 0.05: the distribution of payment methods is different across cities.
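
To show what the test compares, this sketch rebuilds the expected counts and the chi-square statistic by hand on a small table. The counts and the cash / card / app labels are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = two cities, columns = cash / card / app
observed = np.array([[120, 200,  80],
                     [ 90, 260, 150]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# expected count under independence = row total * column total / grand total
manual_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print(f"chi2={chi2:.3f} (manual {manual_chi2:.3f}), p={p:.4g}, dof={dof}")
```

The statistic grows when the observed counts sit far from what independence would predict. For 2×2 tables SciPy applies a continuity correction by default, so a hand calculation like this one would not match exactly in that case.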

Mann-Whitney U Test — Non-parametric Alternative to t-test

Use when the data is not normal or the sample is small.

stat, p = stats.mannwhitneyu(hourly, monthly, alternative='two-sided')
print(f"U={stat:.1f}, p={p:.4f}")

Mann-Whitney checks whether one group tends to have higher values than the other, with no assumption about normality. For large samples (n > 30), the t-test is usually robust enough thanks to the CLT.


Confidence Intervals

A confidence interval (CI) gives a range for the true population value, instead of a single number. It is more useful than a p-value alone, because it tells you both whether an effect is real and how big it is.

"The average transaction amount is 350. But how sure are we? A 95% CI of [340, 360] is precise. A CI of [200, 500] means the estimate is shaky."

95% Confidence Interval for a Mean

import scipy.stats as stats

sample = payment_df['amount']
n      = len(sample)
mean   = sample.mean()
se     = stats.sem(sample)   # standard error = SD / √n

ci_low, ci_high = stats.t.interval(
    confidence=0.95,
    df=n - 1,
    loc=mean,
    scale=se
)
print(f"Mean: {mean:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")

How to read it: if you repeated the sampling 100 times, about 95 of the resulting intervals would contain the true population mean.

Common misreading: a 95% CI does NOT mean "there is a 95% chance the true mean is in this range." The true mean is fixed — it is either in the interval or it is not.
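
The "repeat the sampling" reading can be checked by simulation. In this sketch the true mean is known because the population is synthetic (a normal distribution with mean 350 and SD 120, both chosen only for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, n, reps = 350.0, 50, 2_000

# 2,000 independent samples of size 50 from the same synthetic population
samples = rng.normal(loc=true_mean, scale=120.0, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)

lower, upper = means - t_crit * ses, means + t_crit * ses
coverage = ((lower <= true_mean) & (true_mean <= upper)).mean()
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains 350 or it does not; the 95% describes how often the procedure succeeds across repetitions.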

CI for the Difference Between Two Means

hourly  = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']

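# Welch's t-test gives the p-value; the CI is built by hand below
result = stats.ttest_ind(hourly, monthly, equal_var=False)

diff    = hourly.mean() - monthly.mean()
v1, v2  = hourly.sem()**2, monthly.sem()**2   # squared standard errors
se_diff = np.sqrt(v1 + v2)
# Welch-Satterthwaite degrees of freedom, so the CI agrees with the test above
df      = (v1 + v2)**2 / (v1**2 / (len(hourly) - 1) + v2**2 / (len(monthly) - 1))

ci_low, ci_high = stats.t.interval(0.95, df=df, loc=diff, scale=se_diff)
print(f"Difference: {diff:.2f}")
print(f"95% CI for the difference: [{ci_low:.2f}, {ci_high:.2f}]")

If the CI for the difference does not include 0, the difference is significant at α = 0.05. This matches p < 0.05 from Welch's t-test because both use the same standard error and degrees of freedom.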

CI Width and Sample Size

# bigger sample → narrower CI (more precise estimate)
for n in [30, 100, 500, 1000]:
    sample = payment_df['amount'].sample(n)
    se     = stats.sem(sample)
    ci     = stats.t.interval(0.95, df=n-1, loc=sample.mean(), scale=se)
    print(f"n={n:5d}  CI width: {ci[1]-ci[0]:.1f}")

The CI width shrinks in proportion to 1/√n. Quadrupling the sample size cuts the CI width roughly in half.
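
The arithmetic behind that rule: the width of a t-based interval is 2 · t* · SD / √n. The sketch below holds the SD fixed at a hypothetical 200 so that only n changes; ci_width is a made-up helper name.

```python
import numpy as np
from scipy import stats

sd = 200.0  # hypothetical sample SD, held fixed to isolate the effect of n

def ci_width(n, sd=sd, confidence=0.95):
    """Width of a t-based confidence interval for the mean."""
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return 2 * t_crit * sd / np.sqrt(n)

widths = {n: ci_width(n) for n in (100, 400, 1600)}
ratio = widths[100] / widths[400]

for n, w in widths.items():
    print(f"n={n:5d}  CI width: {w:.1f}")
print(f"width(100) / width(400) = {ratio:.3f}")
```

The ratio comes out slightly above 2 because the critical value t* also shrinks a little as n grows.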

p-value vs Confidence Interval

Aspect | p-value | Confidence Interval
What it answers | Is the effect real? | How big is the effect?
Output | One number (0–1) | A range [lower, upper]
Threshold | p < 0.05 = significant | CI excludes 0 = significant
More useful | No | Yes, it tells you direction and size

When you present results to non-technical people, prefer reporting the confidence interval over the p-value alone.


Correlation

Correlation measures how strong and in which direction two numeric variables move together.

Pearson Correlation

# correlation between amount and duration
r, p = stats.pearsonr(payment_df['amount'], payment_df['duration_mins'])
print(f"r={r:.3f}, p={p:.4f}")

# full correlation matrix
payment_df[['amount', 'duration_mins']].corr()

r value | Meaning
0.9 to 1.0 | Very strong positive
0.7 to 0.9 | Strong positive
0.4 to 0.7 | Medium positive
0.1 to 0.4 | Weak positive
≈ 0 | No straight-line link
Negative | Inverse link

Pearson assumes both variables are numeric and the link is a straight line.

Spearman Correlation — Non-parametric

r_spearman, p = stats.spearmanr(payment_df['amount'], payment_df['duration_mins'])

Spearman measures rank correlation. Use it when the link goes in one direction but is not necessarily a straight line, or when the data has outliers.
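
A small synthetic example shows where the two coefficients disagree. The curves below are chosen only to make the point: one always increases but is strongly curved, the other is U-shaped.

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 200)
y_monotonic = np.exp(x)        # always increasing, but far from a straight line
y_u_shaped = (x - 5) ** 2      # a clear pattern, but not in one direction

r_p, _ = stats.pearsonr(x, y_monotonic)
r_s, _ = stats.spearmanr(x, y_monotonic)
r_u, _ = stats.pearsonr(x, y_u_shaped)

print(f"monotonic curve: Pearson={r_p:.2f}, Spearman={r_s:.2f}")
print(f"U-shaped curve:  Pearson={r_u:.2f}")
```

Spearman reaches 1 on the monotonic curve because the ranks line up perfectly, while Pearson is noticeably lower. On the U-shaped curve Pearson is about 0 even though the two variables are clearly related, so a scatter plot is still worth a look.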

Correlation vs Causation

Correlation does not mean causation. Two variables can move together because:

1. A causes B
2. B causes A
3. A third variable C causes both
4. Pure coincidence

Example: ice cream sales and drowning rates move together — both go up in summer. The cause is hot weather, not ice cream.
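
The third-variable case can be simulated. In this sketch temperature drives both series and neither affects the other; all coefficients are invented for illustration. Removing the straight-line effect of temperature from each series (the residuals helper is a made-up name) makes the correlation disappear.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 2_000
temperature = rng.normal(25, 7, size=n)                      # the confounder C
ice_cream   = 10 * temperature + rng.normal(0, 40, size=n)   # driven by C only
drownings   = 0.5 * temperature + rng.normal(0, 3, size=n)   # driven by C only

r_raw, _ = stats.pearsonr(ice_cream, drownings)

def residuals(y, x):
    """Remove the straight-line effect of x from y."""
    fit = stats.linregress(x, y)
    return y - (fit.intercept + fit.slope * x)

r_adjusted, _ = stats.pearsonr(residuals(ice_cream, temperature),
                               residuals(drownings, temperature))

print(f"raw correlation:                {r_raw:.2f}")
print(f"after adjusting for temperature: {r_adjusted:.2f}")
```

This only works when the confounder is known and measured, which is why a randomized experiment is the stronger tool.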

Always ask: what is the mechanism? Is there a believable cause-and-effect path? Can you run an experiment to test it?


Simple Linear Regression

Regression measures how much one variable changes when another changes by one unit.

from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(
    payment_df['duration_mins'],
    payment_df['amount']
)

print(f"slope:     {slope:.3f}")      # amount goes up by this much per minute
print(f"intercept: {intercept:.3f}")  # amount when duration = 0
print(f"R²:        {r_value**2:.3f}") # share of variance explained
print(f"p-value:   {p_value:.4f}")    # is the slope different from 0?

Reading the Output

  • Slope: for each extra minute of parking, the amount goes up by slope units.
  • Intercept: the predicted amount when duration is 0 (the base charge).
  • R² (R-squared): the share of variance in amount that duration explains. R² = 0.40 means duration explains 40%; the other 60% comes from other things.
  • p-value: tests whether the slope is different from 0. p < 0.05 means the slope is statistically distinguishable from 0, which is evidence of a straight-line link between duration and amount.
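
As a sketch of how these pieces are used together, the example below fits a line to synthetic data generated from a hypothetical tariff (base charge 20, plus 0.5 per minute, plus noise) and then makes a prediction. The tariff numbers are assumptions, not the real parking prices.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(5)
# Hypothetical pricing: base charge of 20 plus 0.5 per minute, with random noise
duration = rng.uniform(10, 480, size=500)
amount = 20 + 0.5 * duration + rng.normal(0, 25, size=500)

fit = linregress(duration, amount)
predicted_120 = fit.intercept + fit.slope * 120   # predicted amount for a 120-minute stay
r_squared = fit.rvalue ** 2

print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.2f}, R²={r_squared:.3f}")
print(f"predicted amount for 120 minutes: {predicted_120:.2f}")
```

The fitted slope and intercept land close to the values used to generate the data. Predictions are only trustworthy inside the range of durations the model has seen.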

For multiple regression (several predictors), use scikit-learn — see Machine Learning Notebook.


A/B Testing

An A/B test is a controlled experiment that isolates the effect of one change. It is the standard way to test product changes, pricing, UI design, and promotions.

Workflow

1. Define the metric (conversion rate, average amount, retention)
2. Calculate the sample size you need before you start
3. Randomly put users into control (A) and treatment (B)
4. Run the test until you reach the sample size
5. Analyze with a two-sample t-test or z-test
6. Report both statistical significance and practical significance

Sample Size Calculation

Run this before the experiment. Do not stop early when you see a significant result.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=0.2,   # smallest effect you care about (Cohen's d)
    alpha=0.05,        # significance level
    power=0.80         # 1 - β, the chance you catch a real effect
)
print(f"Required sample per group: {int(np.ceil(n))}")

Effect size (Cohen's d): the smallest difference you care about, in standard deviations. Use 0.2 (small), 0.5 (medium), or 0.8 (large) as a rough guide if you do not have a specific business need.
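
When there is a specific business need, Cohen's d is the smallest lift worth detecting divided by the SD of the metric. The sketch below uses the normal-approximation formula n ≈ 2·(z₁₋α/₂ + z_power)² / d² instead of the t-based solver, so it can come out slightly lower than TTestIndPower. The 10-unit lift and SD of 120 are hypothetical, and n_per_group is a made-up helper.

```python
import numpy as np
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided, two-sample test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2))

sizes = {es: n_per_group(es) for es in (0.2, 0.5, 0.8)}
print(sizes)   # {0.2: 393, 0.5: 63, 0.8: 25}

# Hypothetical business need: detect a 10-unit lift when the SD of amounts is 120
d = 10 / 120
n_business = n_per_group(d)
print(f"d={d:.3f} -> about {n_business} users per group")
```

Halving the effect size roughly quadruples the required sample, which is why very small lifts are expensive to detect.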

Run the Test

# control group: original pricing (group A)
# treatment group: new pricing (group B)
group_a = payment_df[payment_df['group'] == 'control']['amount']
group_b = payment_df[payment_df['group'] == 'treatment']['amount']

# two-sample t-test
stat, p = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Group A mean: {group_a.mean():.2f}")
print(f"Group B mean: {group_b.mean():.2f}")
print(f"Difference:   {group_b.mean() - group_a.mean():.2f}")
print(f"p-value:      {p:.4f}")
print("Significant" if p < 0.05 else "Not significant")

Statistical vs Practical Significance

A result can be statistically significant but not matter in practice.

# practical significance: effect size (Cohen's d)
# pooled SD as the root mean of the two variances (reasonable when group sizes are similar)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)
cohen_d    = (group_b.mean() - group_a.mean()) / pooled_std

print(f"Cohen's d: {cohen_d:.3f}")
# read the absolute value: < 0.2 = tiny, 0.2–0.5 = small, 0.5–0.8 = medium, > 0.8 = large

Example: a new button color raises the click rate from 10.0% to 10.1%. With 1 million users per group, p ≈ 0.02, which is statistically significant at α = 0.05. But is a 0.1-percentage-point lift worth the engineering cost? That is a business question, not a statistics question.

Always report both: the p-value (is it real?) and the effect size or absolute difference (does it matter?).
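
For a conversion-style metric, reporting both numbers can look like the sketch below. It uses a two-proportion z-test written out by hand, and the click counts are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: clicks out of users, per group
clicks_a, users_a = 100_000, 1_000_000   # 10.0%
clicks_b, users_b = 101_000, 1_000_000   # 10.1%

rate_a, rate_b = clicks_a / users_a, clicks_b / users_b
lift = rate_b - rate_a

# pooled standard error under H0 (both groups share one click rate)
pooled = (clicks_a + clicks_b) / (users_a + users_b)
se = np.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
z = lift / se
p_value = 2 * stats.norm.sf(abs(z))      # two-sided p-value

print(f"absolute lift: {lift * 100:.2f} percentage points")
print(f"z={z:.2f}, p={p_value:.4f}")
```

With samples this large a lift of a tenth of a percentage point clears the 0.05 threshold, so the p-value alone says little about whether the change is worth shipping.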


Common Statistical Mistakes

1. Using the Mean on Skewed Data

Report the median for income, transaction amounts, and response times. Use the mean only when the distribution is symmetric.

2. Correlation ≠ Causation

Two correlated variables may both be driven by a third one. Always propose a believable cause-and-effect path before you say one variable causes another.

3. p-Hacking (Multiple Testing)

Running 20 tests and reporting the one with p < 0.05 is not a real finding. Even when no real effect exists, about 1 in 20 tests will come out significant by luck at α = 0.05. Apply a Bonferroni correction when you run many comparisons:

# Bonferroni correction: divide α by the number of tests
n_tests     = 10
alpha_adj   = 0.05 / n_tests   # 0.005
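
The size of the problem is easy to compute. Assuming the tests are independent and every H₀ is true, the chance of at least one false positive in 20 tests is 1 − 0.95²⁰. The sketch below works that out and confirms it by simulation, using the fact that p-values are uniform on [0, 1] under H₀.

```python
import numpy as np

alpha, n_tests = 0.05, 20

# chance of at least one false positive across independent tests when every H0 is true
fwer_uncorrected = 1 - (1 - alpha) ** n_tests
fwer_bonferroni = 1 - (1 - alpha / n_tests) ** n_tests

# simulation check: 10,000 "studies" of 20 tests each, all with no real effect
rng = np.random.default_rng(0)
p_values = rng.uniform(size=(10_000, n_tests))
sim_uncorrected = (p_values < alpha).any(axis=1).mean()
sim_bonferroni = (p_values < alpha / n_tests).any(axis=1).mean()

print(f"uncorrected: {fwer_uncorrected:.3f} (simulated {sim_uncorrected:.3f})")
print(f"Bonferroni:  {fwer_bonferroni:.3f} (simulated {sim_bonferroni:.3f})")
```

Without a correction, roughly two studies in three would report at least one spurious "finding"; with Bonferroni the rate stays just under 5%.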

4. Stopping an A/B Test Early

If you stop when the result looks good, your false positive rate goes up. Set the sample size in advance and do not peek before you reach it.
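
A simulation makes the cost of peeking visible. In this sketch both groups are drawn from the same synthetic distribution, so any significant result is a false positive. The peeking schedule (every 50 users per group, up to 500) is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps, n_max = 1_000, 500
checkpoints = range(50, n_max + 1, 50)   # peek after every 50 users per group

# both groups come from the same distribution, so every "win" is a false positive
a = rng.normal(size=(reps, n_max))
b = rng.normal(size=(reps, n_max))

stopped_early = np.zeros(reps, dtype=bool)
for n in checkpoints:
    p = stats.ttest_ind(a[:, :n], b[:, :n], axis=1, equal_var=False).pvalue
    stopped_early |= p < 0.05            # would have stopped and declared a winner

p_final = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
rate_peeking = stopped_early.mean()
rate_fixed = (p_final < 0.05).mean()

print(f"false positive rate, stop at first p < 0.05: {rate_peeking:.3f}")
print(f"false positive rate, fixed sample size:      {rate_fixed:.3f}")
```

Testing once at the planned sample size keeps the rate near 5%, while stopping at the first significant peek pushes it several times higher.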

5. Simpson's Paradox

A trend that shows up in several groups can disappear or reverse when the groups are combined.

# example: the overall conversion rate looks lower for Group B
# but Group B is better in every individual city
# because Group B happened to get more traffic from low-conversion cities

pd.crosstab(payment_df['city'], payment_df['group'],
            values=payment_df['converted'], aggfunc='mean')

Always split your data and check whether the overall result still holds inside each subgroup.
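
A concrete version of the paradox, with hypothetical counts in which Group B gets most of its traffic from the low-conversion city:

```python
import pandas as pd

# Hypothetical counts, invented to show the reversal
data = pd.DataFrame({
    "city":      ["North", "North", "South", "South"],
    "group":     ["A", "B", "A", "B"],
    "users":     [800, 200, 200, 800],
    "converted": [240, 70, 20, 96],
})

data["rate"] = data["converted"] / data["users"]
by_city = data.pivot(index="city", columns="group", values="rate")

totals = data.groupby("group")[["users", "converted"]].sum()
overall = totals["converted"] / totals["users"]

print(by_city)    # B is ahead in both cities: 0.35 vs 0.30 and 0.12 vs 0.10
print(overall)    # yet A is ahead overall: 0.26 vs 0.166
```

The reversal comes entirely from the traffic mix, not from the treatment, which is why the per-city view is the one to trust here.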

6. Survivorship Bias

If you only look at successful or surviving records, you reach the wrong conclusion.

Example: looking only at completed parking sessions misses every car that left without paying. Your average amount looks higher than it really is.


Quick Reference — Which Test to Use

Question | Test
Is the mean equal to a known value? | One-sample t-test
Are two group means different? | Two-sample t-test (Welch's)
Are two group means different (non-normal)? | Mann-Whitney U
Are 3+ group means different? | One-way ANOVA → Tukey HSD
Are 3+ group means different (non-normal)? | Kruskal-Wallis
Are category distributions different across groups? | Chi-square test
How precise is my estimate? | Confidence interval
How strong is the straight-line link between two numeric variables? | Pearson correlation + linregress
Is the link in one direction (not necessarily straight)? | Spearman correlation
Did a product change have a real effect? | A/B test → two-sample t-test + CI