Statistics Notebook
Statistics is the reasoning layer beneath every analysis. SQL and Python tell you how to compute a number; statistics tells you what that number means and whether you can trust it. This notebook covers the concepts that appear most often in data analyst (DA) work and interviews.
All examples use Python with NumPy, Pandas, and SciPy. The dataset is the parking system: payment_df with columns amount, parking_type, station_code, payment_method, entry_time, duration_mins, city, and (for the A/B-testing section) group and converted.
import numpy as np
import pandas as pd
from scipy import stats
See Pandas Notebook for data manipulation and ML Model Selection Guide for where statistics meets machine learning.
Descriptive Statistics
Descriptive statistics summarise a dataset with a small set of numbers. They are always the first step before any analysis.
Central Tendency — Where is the centre?
| Measure | Formula | When to use |
|---|---|---|
| Mean | Sum ÷ count | Symmetric data, no extreme outliers |
| Median | Middle value when sorted | Skewed data or outliers present |
| Mode | Most frequent value | Categorical data |
amounts = payment_df['amount']
amounts.mean() # arithmetic average
amounts.median() # middle value
amounts.mode()[0] # most frequent value
Mean vs Median — when it matters:
# example: 9 transactions of 100, one of 10000
example = np.array([100]*9 + [10000])
print(example.mean()) # 1090 — pulled up by the outlier
print(np.median(example)) # 100 — unaffected
Use the median when reporting income, revenue per transaction, or any metric where outliers exist. Mean is misleading when the distribution has a long tail.
Spread — How spread out is the data?
| Measure | Meaning |
|---|---|
| Range | Max − Min |
| Variance | Average squared deviation from the mean |
| Standard deviation (SD) | √Variance — same unit as the data |
| IQR | Q3 − Q1 — middle 50% of the data |
amounts.std() # standard deviation
amounts.var() # variance
amounts.quantile(0.75) - amounts.quantile(0.25) # IQR
amounts.describe() # all at once
Standard deviation tells you the typical distance from the mean. An amount SD of 200 means most transactions are within ±200 of the average.
IQR is more robust than SD when outliers are present — it ignores the top and bottom 25%.
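To see that robustness concretely, here is a small sketch with made-up transaction values: a single extreme outlier explodes the SD but barely moves the IQR.

```python
import pandas as pd

# nine typical transactions plus one extreme outlier
s = pd.Series([100, 110, 95, 105, 100, 98, 102, 107, 99, 10000])

sd = s.std()                               # exploded by the single outlier
iqr = s.quantile(0.75) - s.quantile(0.25)  # barely moves

print(f"SD:  {sd:.1f}")
print(f"IQR: {iqr:.2f}")
```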
Percentiles and Quartiles
amounts.quantile(0.25) # Q1 — 25th percentile
amounts.quantile(0.50) # Q2 — median
amounts.quantile(0.75) # Q3 — 75th percentile
amounts.quantile(0.90) # 90th percentile
np.percentile(amounts, [25, 50, 75, 90]) # all at once
Percentiles answer: "below what value does X% of the data fall?" A transaction at the 90th percentile is higher than 90% of all transactions.
Skewness
amounts.skew()
| Value | Shape | Implication |
|---|---|---|
| ≈ 0 | Symmetric | Mean ≈ Median |
| > 0 | Right-skewed (long right tail) | Mean > Median — outliers pulling up |
| < 0 | Left-skewed (long left tail) | Mean < Median |
Transaction amounts are almost always right-skewed — most transactions are small, a few are very large.
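A quick simulated illustration (lognormal draws standing in for real transaction amounts): skewness comes out clearly positive, and the mean sits well above the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# lognormal draws mimic transaction amounts: many small, a few very large
amounts_sim = pd.Series(rng.lognormal(mean=4, sigma=1, size=10_000))

print(f"skew:   {amounts_sim.skew():.2f}")    # clearly positive
print(f"mean:   {amounts_sim.mean():.1f}")
print(f"median: {amounts_sim.median():.1f}")  # well below the mean
```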
Distributions
Normal Distribution
The normal (Gaussian) distribution is the most important distribution in statistics. Many natural quantities approximate it, and statistical tests are often built on the assumption of normality.
# check if a sample looks normal
from scipy.stats import shapiro
stat, p = shapiro(amounts.sample(50)) # Shapiro-Wilk test (works well for n < 2000)
print(f"p-value: {p:.4f}")
# p > 0.05 → fail to reject normality (data looks normal)
# p < 0.05 → reject normality (data is not normal)
Properties of the normal distribution:
- 68% of data falls within 1 SD of the mean
- 95% within 2 SD
- 99.7% within 3 SD
mean = amounts.mean()
sd = amounts.std()
within_1sd = amounts.between(mean - sd, mean + sd).mean() # ≈ 0.68
within_2sd = amounts.between(mean - 2*sd, mean + 2*sd).mean() # ≈ 0.95
Central Limit Theorem (CLT)
The CLT states: the distribution of sample means approaches a normal distribution as sample size grows, regardless of the original distribution's shape.
# transaction amounts are right-skewed
# but the mean of 50 random samples will be approximately normal
sample_means = [payment_df['amount'].sample(50).mean() for _ in range(1000)]
pd.Series(sample_means).hist(bins=30)
Why this matters: most hypothesis tests (t-tests, etc.) assume normality of the sampling distribution, not the raw data. The CLT justifies using these tests even on skewed data as long as the sample is large enough (usually n ≥ 30).
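The same effect can be measured directly with simulated data rather than payment_df: the raw exponential data is heavily skewed, but the distribution of its sample means is nearly symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.Series(rng.exponential(scale=100, size=50_000))  # heavily right-skewed

# distribution of the means of 2000 random samples of n = 50
means = pd.Series([raw.sample(50, random_state=i).mean() for i in range(2_000)])

print(f"skew of raw data:     {raw.skew():.2f}")    # near 2 for an exponential
print(f"skew of sample means: {means.skew():.2f}")  # close to 0
```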
Outlier Detection
# Method 1: IQR rule (common default)
Q1 = amounts.quantile(0.25)
Q3 = amounts.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = payment_df[(amounts < lower) | (amounts > upper)]
# Method 2: Z-score (assumes normal distribution)
z_scores = np.abs(stats.zscore(amounts))
outliers_z = payment_df[z_scores > 3] # more than 3 SD from mean
Use the IQR method by default — it works on skewed data. Use Z-scores only if the data is roughly normal.
Hypothesis Testing
Hypothesis testing is a framework for deciding whether an observed difference is real or just random noise.
The Framework
1. State the null hypothesis (H₀): "there is no difference / no effect"
2. State the alternative hypothesis (H₁): "there is a difference"
3. Choose a significance level (α): typically 0.05
4. Compute the test statistic and p-value
5. Decision: if p < α, reject H₀
p-value: the probability of observing a result at least as extreme as yours, assuming H₀ is true. A small p-value means the data is unlikely under H₀.
Common misunderstanding: p < 0.05 does not prove H₁ is true. It only means the data is inconsistent with H₀ at the chosen significance level.
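One way to internalise the definition is to simulate a world where H₀ is true: two groups drawn from the same distribution still produce p < 0.05 about 5% of the time. A self-contained sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# both groups come from the SAME distribution, so H0 is true by construction
n_trials = 2_000
false_positives = 0
for _ in range(n_trials):
    a = rng.normal(300, 50, size=100)
    b = rng.normal(300, 50, size=100)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"false positive rate: {false_positives / n_trials:.3f}")  # close to 0.05
```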
Type I and Type II Errors
| | H₀ is actually true | H₀ is actually false |
|---|---|---|
| Reject H₀ | Type I error (false positive) — α | Correct (power) |
| Fail to reject H₀ | Correct | Type II error (false negative) — β |
- Type I error (α): claiming an effect exists when it does not. Controlled by setting α = 0.05.
- Type II error (β): missing a real effect. Controlled by increasing sample size.
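To see how sample size drives β down, statsmodels can compute power (1 − β) directly. The effect size of 0.3 below is an illustrative choice, not derived from the parking data.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# power = probability of detecting a real effect of d = 0.3 (illustrative)
for n in [20, 50, 100, 200]:
    pw = analysis.power(effect_size=0.3, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} per group -> power = {pw:.2f}")
```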
t-test — Compare Means
One-sample t-test: Is the mean equal to a known value?
# Is the average transaction amount different from 300?
stat, p = stats.ttest_1samp(payment_df['amount'], popmean=300)
print(f"t={stat:.3f}, p={p:.4f}")
Two-sample t-test: Are two group means different?
hourly = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']
stat, p = stats.ttest_ind(hourly, monthly)
print(f"t={stat:.3f}, p={p:.4f}")
if p < 0.05:
print("Significant difference in average amount between hourly and monthly")
else:
print("No significant difference detected")
ttest_ind assumes equal variances by default. Add equal_var=False (Welch's t-test) when the two groups have different variances — this is generally the safer default.
stat, p = stats.ttest_ind(hourly, monthly, equal_var=False)
ANOVA — Compare Means Across 3 or More Groups
A t-test compares two groups. ANOVA (Analysis of Variance) extends this to three or more groups in one test.
hourly = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']
daily = payment_df[payment_df['parking_type'] == 'daily']['amount']
f_stat, p = stats.f_oneway(hourly, monthly, daily)
print(f"F={f_stat:.3f}, p={p:.4f}")
If p < 0.05: at least one group's mean is significantly different from the others. ANOVA does not tell you which groups differ — run a post-hoc test for that.
Post-hoc Test — Which Groups Are Different?
from statsmodels.stats.multicomp import pairwise_tukeyhsd
result = pairwise_tukeyhsd(
endog=payment_df['amount'],
groups=payment_df['parking_type'],
alpha=0.05
)
print(result)
Tukey's HSD compares all pairs of groups while controlling for multiple comparisons. The output shows which specific pairs are significantly different.
| Scenario | Test |
|---|---|
| 2 groups, normal data | t-test |
| 2 groups, non-normal or small n | Mann-Whitney U |
| 3+ groups, normal data | One-way ANOVA → Tukey HSD |
| 3+ groups, non-normal | Kruskal-Wallis (non-parametric ANOVA) |
# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
stat, p = stats.kruskal(hourly, monthly, daily)
print(f"H={stat:.3f}, p={p:.4f}")
Chi-square Test — Compare Proportions / Category Distributions
Use the chi-square test when both variables are categorical.
# Is payment method distribution different between cities?
contingency = pd.crosstab(payment_df['city'], payment_df['payment_method'])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p:.4f}, degrees of freedom={dof}")
If p < 0.05: the distribution of payment methods differs significantly across cities.
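One practical caveat worth checking: the chi-square approximation is generally considered reliable only when every expected cell count is at least 5. A sketch with a made-up contingency table (the city labels are hypothetical):

```python
import pandas as pd
from scipy import stats

# made-up contingency table of city x payment method counts
contingency = pd.DataFrame(
    {'card': [120, 80], 'cash': [60, 90], 'app': [40, 30]},
    index=['CityA', 'CityB']
)
chi2, p, dof, expected = stats.chi2_contingency(contingency)

# rule of thumb: the chi-square approximation is reliable
# only when every expected cell count is at least 5
print("all expected counts >= 5:", (expected >= 5).all())
```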
Mann-Whitney U Test — Non-parametric Alternative to t-test
Use when the data is not normally distributed and the sample is small.
stat, p = stats.mannwhitneyu(hourly, monthly, alternative='two-sided')
print(f"U={stat:.1f}, p={p:.4f}")
Mann-Whitney tests whether one group tends to have higher values than the other, without assuming normality. For large samples (n > 30), the t-test is usually robust enough due to the CLT.
Confidence Intervals
A confidence interval (CI) gives a range for the true population parameter, rather than a single point estimate. It is more informative than a p-value alone — it tells you both whether an effect is significant and how large it is.
"The average transaction amount is 350. But how confident are we? A 95% CI of [340, 360] says we are quite precise. A CI of [200, 500] says the estimate is very uncertain."
95% Confidence Interval for a Mean
import scipy.stats as stats
sample = payment_df['amount']
n = len(sample)
mean = sample.mean()
se = stats.sem(sample) # standard error = SD / √n
ci_low, ci_high = stats.t.interval(
confidence=0.95,
df=n - 1,
loc=mean,
scale=se
)
print(f"Mean: {mean:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
Interpretation: if we repeated this sampling process 100 times, approximately 95 of the resulting intervals would contain the true population mean.
Common misinterpretation: a 95% CI does NOT mean "there is a 95% probability the true mean is in this range." The true mean is fixed — it is either in the interval or it is not.
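The correct interpretation can be verified by simulation: repeat the sampling process many times against a known (simulated) true mean and count how often the interval captures it. A self-contained sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean = 350           # known because we simulate the population
n_trials = 1_000
covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, 100, size=50)
    lo, hi = stats.t.interval(0.95, df=49, loc=sample.mean(),
                              scale=stats.sem(sample))
    if lo <= true_mean <= hi:
        covered += 1

print(f"intervals containing the true mean: {covered / n_trials:.3f}")  # ~0.95
```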
CI for the Difference Between Two Means
hourly = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']
# Welch t-test, plus a manually computed 95% CI for the difference
result = stats.ttest_ind(hourly, monthly, equal_var=False)
diff = hourly.mean() - monthly.mean()
se_diff = np.sqrt(hourly.sem()**2 + monthly.sem()**2)
df = min(len(hourly), len(monthly)) - 1  # conservative approximation of the Welch degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, df=df, loc=diff, scale=se_diff)
print(f"Difference: {diff:.2f}")
print(f"95% CI for difference: [{ci_low:.2f}, {ci_high:.2f}]")
If the CI for the difference does not include 0, the difference is statistically significant at α = 0.05 — consistent with p < 0.05.
CI Width and Sample Size
# larger sample → narrower CI (more precise estimate)
for n in [30, 100, 500, 1000]:
sample = payment_df['amount'].sample(n)
se = stats.sem(sample)
ci = stats.t.interval(0.95, df=n-1, loc=sample.mean(), scale=se)
print(f"n={n:5d} CI width: {ci[1]-ci[0]:.1f}")
CI width shrinks with √n — quadrupling sample size halves the CI width.
p-value vs Confidence Interval
| | p-value | Confidence Interval |
|---|---|---|
| What it answers | Is the effect real? | How large is the effect? |
| Output | Single number (0–1) | Range [lower, upper] |
| Threshold | p < 0.05 = significant | CI excludes 0 = significant |
| More informative | No | Yes — includes direction and magnitude |
Prefer reporting confidence intervals over p-values alone when presenting findings to stakeholders.
Correlation
Correlation measures the strength and direction of the linear relationship between two numeric variables.
Pearson Correlation
# correlation between amount and duration
r, p = stats.pearsonr(payment_df['amount'], payment_df['duration_mins'])
print(f"r={r:.3f}, p={p:.4f}")
# full correlation matrix
payment_df[['amount', 'duration_mins']].corr()
| r value | Interpretation |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.4 to 0.7 | Moderate positive |
| 0.1 to 0.4 | Weak positive |
| ≈ 0 | No linear relationship |
| Negative | Inverse relationship |
Pearson assumes both variables are numeric and the relationship is linear.
Spearman Correlation — Non-parametric
r_spearman, p = stats.spearmanr(payment_df['amount'], payment_df['duration_mins'])
Spearman measures rank correlation — use it when the relationship is monotonic but not necessarily linear, or when the data contains outliers.
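A small simulated comparison of the two: a single extreme outlier drags the Pearson coefficient far from 1, while Spearman, which only sees ranks, stays close to 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.arange(50, dtype=float)
y = 2 * x + rng.normal(0, 1, size=50)  # near-perfect linear relationship
y[-1] = -500                           # one extreme outlier

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson:  {r_pearson:.2f}")    # dragged far below 1 by the outlier
print(f"Spearman: {r_spearman:.2f}")   # ranks barely change, stays near 1
```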
Correlation vs Causation
Correlation does not imply causation. Two variables can be correlated because:
1. A causes B
2. B causes A
3. A third variable C causes both
4. Pure coincidence
Example: ice cream sales and drowning rates are correlated — both increase in summer. The cause is hot weather, not ice cream.
Always ask: what is the mechanism? Is there a plausible causal path? Can we run an experiment to test it?
Simple Linear Regression
Regression quantifies how much one variable changes when another changes by one unit.
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(
payment_df['duration_mins'],
payment_df['amount']
)
print(f"slope: {slope:.3f}") # amount increases by this much per minute
print(f"intercept: {intercept:.3f}") # amount when duration = 0
print(f"R²: {r_value**2:.3f}") # % of variance explained
print(f"p-value: {p_value:.4f}") # is the slope significantly different from 0?
Interpreting the Output
- Slope: for every additional minute of parking, amount increases by slope units.
- Intercept: the predicted amount when duration is 0 (baseline charge).
- R² (R-squared): percentage of the variance in amount explained by duration. R² = 0.40 means duration explains 40% of the variation; the remaining 60% comes from other factors.
- p-value: tests whether the slope is significantly different from 0. p < 0.05 means duration has a statistically significant relationship with amount.
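Once fitted, the slope and intercept give a prediction equation: predicted amount = intercept + slope × duration. A self-contained sketch with simulated sessions (the 50-unit base charge and rate of 3 per minute are made up):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(5)
# simulated sessions: a made-up 50-unit base charge plus 3 per minute, with noise
duration = rng.uniform(10, 120, size=200)
amount = 50 + 3 * duration + rng.normal(0, 10, size=200)

fit = linregress(duration, amount)

# prediction equation: amount = intercept + slope * duration
predicted = fit.intercept + fit.slope * 60
print(f"predicted amount for a 60-minute session: {predicted:.1f}")  # near 230
```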
For multivariate regression (multiple predictors), use scikit-learn — see Machine Learning Notebook.
A/B Testing
An A/B test is a controlled experiment that isolates the effect of one change. It is the standard method for testing product changes, pricing, UI design, and promotions.
Workflow
1. Define the metric (conversion rate, average amount, retention)
2. Calculate required sample size before running
3. Randomly assign users to control (A) and treatment (B)
4. Run the test until the sample size is reached
5. Analyse with a two-sample t-test or z-test
6. Report both statistical significance and practical significance
Sample Size Calculation
Run this before the experiment — do not stop early when you see a significant result.
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
n = analysis.solve_power(
effect_size=0.2, # minimum detectable effect (Cohen's d)
alpha=0.05, # significance level
power=0.80 # 1 - β, probability of detecting a real effect
)
print(f"Required sample per group: {int(np.ceil(n))}")
Effect size (Cohen's d): the minimum difference you care about, expressed in standard deviations. Use 0.2 (small), 0.5 (medium), or 0.8 (large) as benchmarks if you do not have a specific business requirement.
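When you do have a business requirement, convert it to Cohen's d by dividing the minimum detectable difference by the metric's SD. The numbers below (a 15-unit lift, a historical SD of 150) are hypothetical:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# hypothetical requirement: detect a 15-unit lift in average amount,
# given a historical SD of roughly 150
mde = 15
sd = 150
d = mde / sd  # Cohen's d = 0.1

n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"required sample per group: {int(np.ceil(n))}")
```

Small effects require very large samples: at d = 0.1 you need well over a thousand users per group.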
Run the Test
# control group: original pricing (group A)
# treatment group: new pricing (group B)
group_a = payment_df[payment_df['group'] == 'control']['amount']
group_b = payment_df[payment_df['group'] == 'treatment']['amount']
# two-sample t-test
stat, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Group A mean: {group_a.mean():.2f}")
print(f"Group B mean: {group_b.mean():.2f}")
print(f"Difference: {group_b.mean() - group_a.mean():.2f}")
print(f"p-value: {p:.4f}")
print("Significant" if p < 0.05 else "Not significant")
Statistical vs Practical Significance
A result can be statistically significant but practically irrelevant.
# practical significance: effect size (Cohen's d)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)  # simple pooling; assumes similar group sizes
cohen_d = (group_b.mean() - group_a.mean()) / pooled_std
print(f"Cohen's d: {cohen_d:.3f}")
# < 0.2 = negligible, 0.2–0.5 = small, 0.5–0.8 = medium, > 0.8 = large
Example: a new button colour increases click rate from 10.00% to 10.01%. With 1 million users, p < 0.001 — statistically significant. But is a 0.01 percentage-point lift worth the engineering cost? That is a business decision, not a statistics question.
Always report both: p-value (is it real?) and effect size or absolute difference (does it matter?).
Common Statistical Mistakes
1. Using Mean on Skewed Data
Report median for income, transaction amounts, and response times. Use mean only when the distribution is symmetric.
2. Correlation ≠ Causation
Two correlated variables may both be driven by a third factor. Always propose a causal mechanism before claiming one variable causes another.
3. p-Hacking (Multiple Testing)
Running 20 tests and reporting the one that gives p < 0.05 is not a valid finding — 1 in 20 tests will be significant by chance at α = 0.05. Apply a Bonferroni correction when running multiple comparisons:
# Bonferroni correction: divide α by the number of tests
n_tests = 10
alpha_adj = 0.05 / n_tests # 0.005
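For convenience, statsmodels applies the same correction (and less conservative alternatives such as Holm) to a whole list of p-values at once. The p-values below are made up:

```python
from statsmodels.stats.multitest import multipletests

# made-up p-values from 10 separate tests
p_values = [0.001, 0.004, 0.012, 0.03, 0.04, 0.08, 0.20, 0.45, 0.70, 0.95]

reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(reject)  # True only where the corrected p-value stays below 0.05
```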
4. Stopping an A/B Test Early
If you stop when the result looks good, you inflate the false positive rate. Decide the sample size in advance and do not peek until it is reached.
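A simulation makes the inflation visible: with H₀ true by construction, checking the test at ten interim points pushes the false positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# H0 is true by construction, but we peek after every 20 users per group
# and stop as soon as p < 0.05
n_experiments = 500
stopped_early = 0
for _ in range(n_experiments):
    a = rng.normal(300, 50, size=200)
    b = rng.normal(300, 50, size=200)
    for n in range(20, 201, 20):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:
            stopped_early += 1
            break

rate = stopped_early / n_experiments
print(f"false positive rate with peeking: {rate:.2f}")  # well above 0.05
```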
5. Simpson's Paradox
A trend that appears in several groups can disappear or reverse when the groups are combined.
# example: overall conversion rate looks lower for Group B
# but Group B is better in every individual city
# because Group B happened to have more traffic from low-conversion cities
pd.crosstab(payment_df['city'], payment_df['group'],
values=payment_df['converted'], aggfunc='mean')
Always segment your data and check if the aggregate result holds within subgroups.
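Here is a minimal synthetic example of the paradox (the city and group labels are made up): B wins inside every city yet loses in the aggregate, because B's traffic is concentrated in the low-conversion city.

```python
import pandas as pd

# B beats A inside each city, but A wins overall, because B's traffic
# is concentrated in the low-conversion city
df = pd.DataFrame({
    'city':  ['High'] * 120 + ['Low'] * 120,
    'group': ['A'] * 100 + ['B'] * 20 + ['A'] * 20 + ['B'] * 100,
    'converted': ([1] * 30 + [0] * 70     # A in High city: 30%
                  + [1] * 7 + [0] * 13    # B in High city: 35%
                  + [1] * 2 + [0] * 18    # A in Low city: 10%
                  + [1] * 12 + [0] * 88), # B in Low city: 12%
})

overall = df.groupby('group')['converted'].mean()
by_city = df.groupby(['city', 'group'])['converted'].mean()
print(overall)   # A looks better overall
print(by_city)   # B is better in both cities
```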
6. Survivorship Bias
Analysing only successful or existing records leads to wrong conclusions.
Example: analysing only completed parking sessions misses all vehicles that left without paying — skewing average amount upward.
Quick Reference — Which Test to Use
| Question | Test |
|---|---|
| Is the mean equal to a known value? | One-sample t-test |
| Are two group means different? | Two-sample t-test (Welch's) |
| Are two group means different (non-normal)? | Mann-Whitney U |
| Are 3+ group means different? | One-way ANOVA → Tukey HSD |
| Are 3+ group means different (non-normal)? | Kruskal-Wallis |
| Are category distributions different across groups? | Chi-square test |
| How precise is my estimate? | Confidence interval |
| Is the relationship between two numerics linear? | Pearson correlation + linregress |
| Is the relationship monotonic (not necessarily linear)? | Spearman correlation |
| Did a product change have a real effect? | A/B test → two-sample t-test + CI |