Statistics Notebook
Statistics is the reasoning layer under every analysis. SQL and Python show you how to compute a number. Statistics tells you what the number means and whether you can trust it. This notebook covers the ideas that show up most often in DA work and interviews.
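All examples use Python with NumPy, Pandas, and SciPy; a few also use statsmodels. The dataset is the parking system: payment_df with columns amount, parking_type, station_code, payment_method, entry_time. Some examples assume extra columns (city, duration_mins, group, converted).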
import numpy as np
import pandas as pd
from scipy import stats
See Pandas Notebook for data manipulation and ML Model Selection Guide for where statistics meets machine learning.
Descriptive Statistics
Descriptive statistics summarize a dataset with a small set of numbers. They are always the first step before any analysis.
Central Tendency — Where is the center?
| Measure | Formula | When to use |
|---|---|---|
| Mean | Sum ÷ count | Symmetric data, no extreme outliers |
| Median | The middle value when sorted | Skewed data or when there are outliers |
| Mode | The most frequent value | Category data |
amounts = payment_df['amount']
amounts.mean() # average
amounts.median() # middle value
amounts.mode()[0] # most frequent value
Mean vs Median — when it matters:
# example: 9 transactions of 100, one of 10000
example = np.array([100]*9 + [10000])
print(example.mean()) # 1090 — pulled up by the outlier
print(np.median(example)) # 100 — not affected
Use the median when you report income, revenue per transaction, or any number where outliers exist. The mean is misleading when the distribution has a long tail.
Spread — How spread out is the data?
| Measure | Meaning |
|---|---|
| Range | Max − Min |
| Variance | Average of the squared distances from the mean |
| Standard deviation (SD) | √Variance — same unit as the data |
| IQR | Q3 − Q1 — the middle 50% of the data |
amounts.std() # sample standard deviation (pandas uses ddof=1)
amounts.var() # sample variance (ddof=1)
amounts.quantile(0.75) - amounts.quantile(0.25) # IQR
amounts.describe() # count, mean, std, min, quartiles, max in one call
The standard deviation tells you the typical distance from the mean. If the data is roughly normal, an SD of 200 means about two-thirds of transactions are within ±200 of the average.
The IQR is more reliable than the SD when there are outliers. It ignores the top and bottom 25%.
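One way to see this robustness is to add a single extreme value to a small sample and compare how far each measure moves. The sketch below uses made-up numbers, not payment_df.

```python
import numpy as np

# Hypothetical toy amounts (not from payment_df): ten ordinary values
base = np.arange(50, 150, 10)            # 50, 60, ..., 140
with_outlier = np.append(base, 10_000)   # same data plus one extreme value

def sd_and_iqr(values):
    """Return (sample SD, IQR) for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return values.std(ddof=1), q3 - q1

sd_base, iqr_base = sd_and_iqr(base)
sd_out, iqr_out = sd_and_iqr(with_outlier)

print(f"SD : {sd_base:.1f} -> {sd_out:.1f}")
print(f"IQR: {iqr_base:.1f} -> {iqr_out:.1f}")
```

The SD grows by orders of magnitude while the IQR barely changes, which is why the IQR is the safer spread measure for transaction-style data.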
Percentiles and Quartiles
amounts.quantile(0.25) # Q1 — 25th percentile
amounts.quantile(0.50) # Q2 — median
amounts.quantile(0.75) # Q3 — 75th percentile
amounts.quantile(0.90) # 90th percentile
np.percentile(amounts, [25, 50, 75, 90]) # all at once
Percentiles answer this: "below what value does X% of the data fall?" A transaction at the 90th percentile is larger than about 90% of all transactions.
Skewness
amounts.skew()
| Value | Shape | What it means |
|---|---|---|
| ≈ 0 | Symmetric | Mean ≈ Median |
| > 0 | Right-skewed (long right tail) | Mean > Median — outliers pulling the mean up |
| < 0 | Left-skewed (long left tail) | Mean < Median |
Transaction amounts are almost always right-skewed — most transactions are small, a few are very large.
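A small simulation shows the link between skewness and the gap between mean and median. The log-normal distribution used here is only an assumed stand-in for transaction amounts; its parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for transaction amounts: log-normal values are
# strictly positive and right-skewed (parameters chosen for illustration)
amounts_sim = pd.Series(rng.lognormal(mean=5.0, sigma=0.8, size=10_000))

skewness = amounts_sim.skew()
mean_val = amounts_sim.mean()
median_val = amounts_sim.median()

print(f"skewness: {skewness:.2f}")
print(f"mean:     {mean_val:.1f}")
print(f"median:   {median_val:.1f}")
```

A positive skewness together with a mean above the median is the typical signature of a long right tail.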
Distributions
Normal Distribution
The normal (Gaussian) distribution is the most important distribution in statistics. Many real-world numbers look like it, and many statistical tests are built on the assumption that the data is normal.
# check if a sample looks normal
from scipy.stats import shapiro
stat, p = shapiro(amounts.sample(50)) # Shapiro-Wilk test (SciPy notes the p-value may be inaccurate for n > 5000)
print(f"p-value: {p:.4f}")
# p > 0.05 → fail to reject normality (data looks normal)
# p < 0.05 → reject normality (data is not normal)
Properties of the normal distribution:
- 68% of the data is within 1 SD of the mean
- 95% within 2 SD
- 99.7% within 3 SD
mean = amounts.mean()
sd = amounts.std()
within_1sd = amounts.between(mean - sd, mean + sd).mean() # ≈ 0.68 if the data is normal
within_2sd = amounts.between(mean - 2*sd, mean + 2*sd).mean() # ≈ 0.95 if the data is normal
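The 68 / 95 / 99.7 figures come straight from the normal cumulative distribution function. A short check with scipy.stats.norm (the shares dictionary is just an illustrative name):

```python
from scipy import stats

# Share of a normal distribution that lies within k SDs of the mean
shares = {}
for k in (1, 2, 3):
    shares[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {shares[k]:.4f}")
```

Skewed data such as transaction amounts will usually not match these shares, so a large mismatch is itself a quick hint that the data is not normal.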
Central Limit Theorem (CLT)
The CLT says: the distribution of sample means gets close to a normal distribution as the sample size grows, no matter what the original distribution looks like.
# transaction amounts are right-skewed
# but the means of many random samples of size 50 will be roughly normal
sample_means = [payment_df['amount'].sample(50).mean() for _ in range(1000)]
pd.Series(sample_means).hist(bins=30)
Why this matters: most hypothesis tests (t-tests, etc.) assume the sampling distribution is normal, not the raw data. The CLT lets you use these tests on skewed data, as long as the sample is big enough. A common rule of thumb is n ≥ 30, though heavily skewed data can need more.
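The CLT also says how tightly the sample means cluster: their standard deviation (the standard error) is close to σ / √n. The sketch below checks both claims on a synthetic exponential population, which is an assumption made for illustration and not the parking data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_repeats = 50, 5_000

# Hypothetical skewed population: exponential with mean 1 and SD 1
samples = rng.exponential(scale=1.0, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

raw_skew = stats.skew(samples.ravel())    # skewness of the raw values
means_skew = stats.skew(sample_means)     # skewness of the sample means

observed_se = sample_means.std(ddof=1)    # spread of the sample means
expected_se = 1.0 / np.sqrt(n)            # sigma / sqrt(n)

print(f"skew of raw data:     {raw_skew:.2f}")
print(f"skew of sample means: {means_skew:.2f}")
print(f"SE observed vs sigma/sqrt(n): {observed_se:.4f} vs {expected_se:.4f}")
```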
Outlier Detection
# Method 1: IQR rule (the common default)
Q1 = amounts.quantile(0.25)
Q3 = amounts.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = payment_df[(amounts < lower) | (amounts > upper)]
# Method 2: Z-score (assumes a normal distribution)
z_scores = np.abs(stats.zscore(amounts))
outliers_z = payment_df[z_scores > 3] # more than 3 SD from the mean
Use the IQR method by default. It works on skewed data. Use Z-scores only when the data is roughly normal.
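One reason Z-scores struggle on skewed data is masking: extreme values inflate the mean and SD that the Z-score is built from, so they can hide themselves. A minimal sketch on made-up values (not payment_df):

```python
import numpy as np
from scipy import stats

# Hypothetical toy amounts: ten ordinary values plus two extreme ones
values = np.array([50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
                   10_000, 12_000])

# IQR rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule with the usual threshold of 3
z_flags = np.abs(stats.zscore(values)) > 3

print("IQR rule flags:", values[iqr_flags])
print("Z-score flags: ", values[z_flags])
```

Here the IQR rule flags both extreme values, while the Z-score rule flags neither because the two outliers pull the SD up far enough to keep their own scores under 3.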
Hypothesis Testing
Hypothesis testing is a framework for deciding whether a difference you see is real or just random noise.
The Framework
1. State the null hypothesis (H₀): "there is no difference / no effect"
2. State the alternative hypothesis (H₁): "there is a difference"
3. Pick a significance level (α): usually 0.05
4. Compute the test statistic and the p-value
5. Decision: if p < α, reject H₀
p-value: the chance of seeing a result at least as extreme as yours, assuming H₀ is true. A small p-value means the data is unlikely under H₀.
Common misunderstanding: p < 0.05 does not prove H₁ is true. It only means data this extreme would be unlikely if H₀ were true.
Type I and Type II Errors
| | H₀ is actually true | H₀ is actually false |
|---|---|---|
| Reject H₀ | Type I error (false positive) — α | Correct (power) |
| Fail to reject H₀ | Correct | Type II error (false negative) — β |
- Type I error (α): saying there is an effect when there is not. Controlled by setting α = 0.05.
- Type II error (β): missing a real effect. Lowered by using a bigger sample size.
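The meaning of α can be checked by simulation: when H₀ is true by construction, roughly α of all tests should still come out significant. The group sizes and distribution below are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_per_group, alpha = 2_000, 50, 0.05

# Both groups come from the same distribution, so H0 is true by construction
group_a = rng.normal(loc=100, scale=20, size=(n_experiments, n_per_group))
group_b = rng.normal(loc=100, scale=20, size=(n_experiments, n_per_group))

# One t-test per row (axis=1 runs all experiments at once)
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
false_positive_rate = (p_values < alpha).mean()

print(f"Share of experiments with p < {alpha}: {false_positive_rate:.3f}")
```

Every significant result in this simulation is a Type I error, and their share should sit near 0.05.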
t-test — Compare Means
One-sample t-test: Is the mean equal to a known value?
# Is the average transaction amount different from 300?
stat, p = stats.ttest_1samp(payment_df['amount'], popmean=300)
print(f"t={stat:.3f}, p={p:.4f}")
Two-sample t-test: Are two group means different?
hourly = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']
stat, p = stats.ttest_ind(hourly, monthly)
print(f"t={stat:.3f}, p={p:.4f}")
if p < 0.05:
    print("Significant difference in average amount between hourly and monthly")
else:
    print("No significant difference detected")
ttest_ind assumes equal variance by default. Add equal_var=False (Welch's t-test) when the two groups have different variances. Welch's is the safer default in general.
stat, p = stats.ttest_ind(hourly, monthly, equal_var=False)
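Welch's version differs from the default only in the standard error: each group keeps its own variance instead of sharing a pooled one. A sketch on synthetic groups (hypothetical sizes and spreads) that rebuilds the Welch statistic by hand:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical groups with different sizes and different spreads
a = rng.normal(loc=100, scale=10, size=40)
b = rng.normal(loc=105, scale=30, size=120)

# Welch's t statistic: each group contributes its own variance / n
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
t_student, p_student = stats.ttest_ind(a, b)  # pooled-variance default

print(f"manual Welch t: {t_manual:.3f}")
print(f"SciPy Welch t:  {t_welch:.3f} (p={p_welch:.4f})")
print(f"pooled t:       {t_student:.3f} (p={p_student:.4f})")
```

When group sizes and variances both differ, the pooled and Welch results diverge, which is the case where the default can mislead.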
ANOVA — Compare Means Across 3 or More Groups
A t-test compares two groups. ANOVA (Analysis of Variance) extends this to three or more groups in one test.
hourly = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']
daily = payment_df[payment_df['parking_type'] == 'daily']['amount']
f_stat, p = stats.f_oneway(hourly, monthly, daily)
print(f"F={f_stat:.3f}, p={p:.4f}")
If p < 0.05: at least one group's mean is different from the others. ANOVA does not tell you which groups differ — run a post-hoc test for that.
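The F statistic is the ratio of between-group variance to within-group variance. The sketch below rebuilds it by hand on three synthetic groups (hypothetical values, not payment_df) and compares it with f_oneway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical groups with slightly different means
groups = [
    rng.normal(loc=100, scale=15, size=40),
    rng.normal(loc=105, scale=15, size=40),
    rng.normal(loc=120, scale=15, size=40),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p = stats.f_oneway(*groups)

print(f"manual F: {f_manual:.3f}")
print(f"SciPy F:  {f_scipy:.3f}, p={p:.4f}")
```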
Post-hoc Test — Which Groups Are Different?
from statsmodels.stats.multicomp import pairwise_tukeyhsd
result = pairwise_tukeyhsd(
    endog=payment_df['amount'],
    groups=payment_df['parking_type'],
    alpha=0.05
)
print(result)
Tukey's HSD compares every pair of groups while controlling for the multiple-comparison problem. The output shows which specific pairs are different.
| Scenario | Test |
|---|---|
| 2 groups, normal data | t-test |
| 2 groups, non-normal or small n | Mann-Whitney U |
| 3+ groups, normal data | One-way ANOVA → Tukey HSD |
| 3+ groups, non-normal | Kruskal-Wallis (non-parametric ANOVA) |
# Kruskal-Wallis: non-parametric alternative to one-way ANOVA
stat, p = stats.kruskal(hourly, monthly, daily)
print(f"H={stat:.3f}, p={p:.4f}")
Chi-square Test — Compare Proportions / Category Distributions
Use the chi-square test when both variables are categorical.
# Is the payment method distribution different across cities?
contingency = pd.crosstab(payment_df['city'], payment_df['payment_method'])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p:.4f}, degrees of freedom={dof}")
If p < 0.05: the distribution of payment methods is different across cities.
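The test compares the observed counts with the counts expected if the two variables were independent. A sketch with a hypothetical 2×3 table (two cities, three payment methods; all counts invented) that rebuilds the expected counts and the statistic by hand:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = two cities, columns = cash / card / app
observed = np.array([[120, 60, 20],
                     [ 80, 90, 30]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Expected count under independence = row total * column total / grand total
manual_expected = (
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
)
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print(f"chi2={chi2:.3f}, p={p:.4f}, dof={dof}")
print(manual_expected)
```

A common rule of thumb is to check that the expected counts are not too small (often quoted as at least 5 per cell) before trusting the p-value.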
Mann-Whitney U Test — Non-parametric Alternative to t-test
Use when the data is not normal and the sample is small.
stat, p = stats.mannwhitneyu(hourly, monthly, alternative='two-sided')
print(f"U={stat:.1f}, p={p:.4f}")
Mann-Whitney checks whether one group tends to have higher values than the other, with no assumption about normality. For large samples (n > 30), the t-test is usually robust enough thanks to the CLT.
Confidence Intervals
A confidence interval (CI) gives a range for the true population value, instead of a single number. It is more useful than a p-value alone, because it tells you both whether an effect is real and how big it is.
"The average transaction amount is 350. But how sure are we? A 95% CI of [340, 360] is precise. A CI of [200, 500] means the estimate is shaky."
95% Confidence Interval for a Mean
import scipy.stats as stats
sample = payment_df['amount']
n = len(sample)
mean = sample.mean()
se = stats.sem(sample) # standard error = SD / √n
ci_low, ci_high = stats.t.interval(
    confidence=0.95,
    df=n - 1,
    loc=mean,
    scale=se
)
print(f"Mean: {mean:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
How to read it: if you repeated the sampling 100 times, about 95 of the resulting intervals would contain the true population mean.
Common misreading: a 95% CI does NOT mean "there is a 95% chance the true mean is in this range." The true mean is fixed — it is either in the interval or it is not.
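The repeated-sampling reading can be simulated directly: draw many samples from a population with a known mean and count how often the interval covers it. The population below (normal, mean 350, SD 80) is an assumption for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, n, n_repeats = 350.0, 40, 2_000

# Hypothetical population with a known mean, sampled many times
samples = rng.normal(loc=true_mean, scale=80.0, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)

# Does each 95% interval contain the true mean?
covered = (means - t_crit * ses <= true_mean) & (true_mean <= means + t_crit * ses)
coverage = covered.mean()

print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The 95% describes the procedure across many samples; any single interval either contains the true mean or does not.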
CI for the Difference Between Two Means
hourly = payment_df[payment_df['parking_type'] == 'hourly']['amount']
monthly = payment_df[payment_df['parking_type'] == 'monthly']['amount']
# run Welch's t-test for the p-value, then build the CI by hand
result = stats.ttest_ind(hourly, monthly, equal_var=False)
diff = hourly.mean() - monthly.mean()
se_diff = np.sqrt(hourly.sem()**2 + monthly.sem()**2)
df = min(len(hourly), len(monthly)) - 1 # conservative degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, df=df, loc=diff, scale=se_diff)
print(f"Difference: {diff:.2f}")
print(f"95% CI for the difference: [{ci_low:.2f}, {ci_high:.2f}]")
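If the CI for the difference does not include 0, the difference is significant at α = 0.05. This matches p < 0.05 from the same test when the CI uses the same degrees of freedom; the conservative df = min(n₁, n₂) − 1 used above makes the interval slightly wider.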
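For an interval that lines up exactly with Welch's t-test, the degrees of freedom can come from the Welch-Satterthwaite formula instead of min(n₁, n₂) − 1. A sketch on synthetic groups (hypothetical data; df_welch is an illustrative name):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical groups with unequal sizes and variances
a = rng.normal(loc=120, scale=15, size=60)
b = rng.normal(loc=100, scale=40, size=200)

# Squared standard error of each group mean
va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
se_diff = np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom
df_welch = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

diff = a.mean() - b.mean()
t_crit = stats.t.ppf(0.975, df=df_welch)
ci_low, ci_high = diff - t_crit * se_diff, diff + t_crit * se_diff

p = stats.ttest_ind(a, b, equal_var=False).pvalue
print(f"df={df_welch:.1f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}], p={p:.4g}")
```

With these degrees of freedom, the interval excludes 0 exactly when the Welch p-value is below 0.05.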
CI Width and Sample Size
# bigger sample → narrower CI (more precise estimate)
for n in [30, 100, 500, 1000]:
    sample = payment_df['amount'].sample(n)
    se = stats.sem(sample)
    ci = stats.t.interval(0.95, df=n-1, loc=sample.mean(), scale=se)
    print(f"n={n:5d} CI width: {ci[1]-ci[0]:.1f}")
The CI width shrinks in proportion to 1/√n. Quadrupling the sample size cuts the CI width roughly in half.
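The halving can be checked without any data by holding the sample SD fixed and changing only n. The ci_width helper and the SD of 200 are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def ci_width(sd, n, confidence=0.95):
    """Width of a t-based CI for a mean, given the sample SD and size."""
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return 2 * t_crit * sd / np.sqrt(n)

sd = 200.0  # hypothetical sample SD, held fixed to isolate the effect of n
ratio = ci_width(sd, 100) / ci_width(sd, 400)

print(f"width at n=100: {ci_width(sd, 100):.1f}")
print(f"width at n=400: {ci_width(sd, 400):.1f}")
print(f"ratio: {ratio:.3f}")
```

The ratio comes out slightly above 2 because the t critical value also shrinks a little as n grows.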
p-value vs Confidence Interval
| | p-value | Confidence Interval |
|---|---|---|
| What it answers | Is the effect real? | How big is the effect? |
| Output | One number (0–1) | A range [lower, upper] |
| Threshold | p < 0.05 = significant | CI excludes 0 = significant |
| More useful | No | Yes — tells you direction and size |
When you present results to non-technical people, prefer reporting the confidence interval over the p-value alone.
Correlation
Correlation measures how strong and in which direction two numeric variables move together.
Pearson Correlation
# correlation between amount and duration
r, p = stats.pearsonr(payment_df['amount'], payment_df['duration_mins'])
print(f"r={r:.3f}, p={p:.4f}")
# full correlation matrix
payment_df[['amount', 'duration_mins']].corr()
| r value | Meaning |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.4 to 0.7 | Medium positive |
| 0.1 to 0.4 | Weak positive |
| ≈ 0 | No straight-line link |
| Negative | Inverse link |
Pearson assumes both variables are numeric and the link is a straight line.
Spearman Correlation — Non-parametric
r_spearman, p = stats.spearmanr(payment_df['amount'], payment_df['duration_mins'])
Spearman measures rank correlation. Use it when the link goes in one direction but is not necessarily a straight line, or when the data has outliers.
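The difference between the two coefficients shows up clearly on a relationship that always rises but is strongly curved. The exponential curve below is a made-up example.

```python
import numpy as np
from scipy import stats

# Hypothetical monotonic but strongly curved relationship
x = np.linspace(0, 10, 50)
y = np.exp(x)

r_pearson, _ = stats.pearsonr(x, y)
rho_spearman, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {r_pearson:.3f}")
print(f"Spearman rho: {rho_spearman:.3f}")
```

Spearman reports a perfect rank relationship because y never decreases as x grows, while Pearson is pulled down by the curvature.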
Correlation vs Causation
Correlation does not mean causation. Two variables can move together because:
1. A causes B
2. B causes A
3. A third variable C causes both
4. Pure coincidence
Example: ice cream sales and drowning rates move together — both go up in summer. The cause is hot weather, not ice cream.
Always ask: what is the mechanism? Is there a believable cause-and-effect path? Can you run an experiment to test it?
Simple Linear Regression
Regression measures how much one variable changes when another changes by one unit.
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(
    payment_df['duration_mins'],
    payment_df['amount']
)
print(f"slope: {slope:.3f}") # amount goes up by this much per minute
print(f"intercept: {intercept:.3f}") # amount when duration = 0
print(f"R²: {r_value**2:.3f}") # share of variance explained
print(f"p-value: {p_value:.4f}") # is the slope different from 0?
Reading the Output
- Slope: for each extra minute of parking, the amount goes up by slope units.
- Intercept: the predicted amount when duration is 0 (the base charge).
- R² (R-squared): the share of variance in amount that duration explains. R² = 0.40 means duration explains 40%; the other 60% comes from other things.
- p-value: tests whether the slope is different from 0. p < 0.05 means the data shows evidence of a straight-line link between duration and amount.
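These readings can be tried on synthetic data where the true pricing rule is known. The rule below (20 base charge plus 0.5 per minute, with noise) is invented for illustration and is not the real parking tariff.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical pricing: 20 base charge + 0.5 per minute, plus random noise
duration = rng.uniform(10, 600, size=500)
amount = 20 + 0.5 * duration + rng.normal(0, 10, size=500)

fit = stats.linregress(duration, amount)

# Use the fitted line to predict the amount for a 120-minute stay
predicted_120 = fit.intercept + fit.slope * 120

# In simple linear regression, R-squared is the squared Pearson correlation
r, _ = stats.pearsonr(duration, amount)

print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.2f}")
print(f"R^2={fit.rvalue**2:.3f}, Pearson r^2={r**2:.3f}")
print(f"predicted amount for 120 minutes: {predicted_120:.1f}")
```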
For multiple regression (several predictors), use scikit-learn; see Machine Learning Notebook.
A/B Testing
An A/B test is a controlled experiment that isolates the effect of one change. It is the standard way to test product changes, pricing, UI design, and promotions.
Workflow
1. Define the metric (conversion rate, average amount, retention)
2. Calculate the sample size you need before you start
3. Randomly put users into control (A) and treatment (B)
4. Run the test until you reach the sample size
5. Analyze with a two-sample t-test or z-test
6. Report both statistical significance and practical significance
Sample Size Calculation
Run this before the experiment. Do not stop early when you see a significant result.
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=0.2, # smallest effect you care about (Cohen's d)
    alpha=0.05, # significance level
    power=0.80 # 1 - β, the chance you catch a real effect
)
print(f"Required sample per group: {int(np.ceil(n))}")
Effect size (Cohen's d): the smallest difference you care about, in standard deviations. Use 0.2 (small), 0.5 (medium), or 0.8 (large) as a rough guide if you do not have a specific business need.
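The required sample size can also be approximated by hand with the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d². The n_per_group helper below is a hypothetical sketch using only SciPy; it should land close to the statsmodels result but is not a replacement for it.

```python
import math
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for alpha
    z_beta = stats.norm.ppf(power)            # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

n_small = n_per_group(0.2)    # small effect
n_medium = n_per_group(0.5)   # medium effect

print(f"d=0.2 -> about {n_small} per group")
print(f"d=0.5 -> about {n_medium} per group")
```

Because the effect size is squared in the denominator, halving the effect you want to detect roughly quadruples the sample you need.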
Run the Test
# control group: original pricing (group A)
# treatment group: new pricing (group B)
group_a = payment_df[payment_df['group'] == 'control']['amount']
group_b = payment_df[payment_df['group'] == 'treatment']['amount']
# two-sample t-test
stat, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Group A mean: {group_a.mean():.2f}")
print(f"Group B mean: {group_b.mean():.2f}")
print(f"Difference: {group_b.mean() - group_a.mean():.2f}")
print(f"p-value: {p:.4f}")
print("Significant" if p < 0.05 else "Not significant")
Statistical vs Practical Significance
A result can be statistically significant but not matter in practice.
# practical significance: effect size (Cohen's d)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2) # simple pooled SD, assumes similar group sizes
cohen_d = (group_b.mean() - group_a.mean()) / pooled_std
print(f"Cohen's d: {cohen_d:.3f}")
# < 0.2 = tiny, 0.2–0.5 = small, 0.5–0.8 = medium, > 0.8 = large
Example: a new button color raises the click rate from 10.0% to 10.2%. With 1 million users per group, p < 0.001, so the result is statistically significant. But is a 0.2 percentage point lift worth the engineering cost? That is a business question, not a statistics question.
Always report both: the p-value (is it real?) and the effect size or absolute difference (does it matter?).
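For click or conversion rates, a two-proportion z-test gives both numbers in a few lines. The counts below are hypothetical (1,000,000 users per group, 10.0% versus 10.2% click rate), and the test is written out by hand with SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical experiment: 1,000,000 users per group
n_a = n_b = 1_000_000
clicks_a, clicks_b = 100_000, 102_000   # 10.0% vs 10.2% click rate

rate_a, rate_b = clicks_a / n_a, clicks_b / n_b

# Pooled rate under H0 (no difference) and the resulting standard error
pooled = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

z = (rate_b - rate_a) / se
p_value = 2 * stats.norm.sf(abs(z))     # two-sided p-value
abs_lift = rate_b - rate_a

print(f"z={z:.2f}, p={p_value:.2g}")
print(f"absolute lift: {abs_lift * 100:.2f} percentage points")
```

The p-value is far below 0.001, yet the lift is only 0.2 percentage points, so whether it matters depends on what a click is worth.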
Common Statistical Mistakes
1. Using the Mean on Skewed Data
Report the median for income, transaction amounts, and response times. Use the mean only when the distribution is symmetric.
2. Correlation ≠ Causation
Two correlated variables may both be driven by a third one. Always propose a believable cause-and-effect path before you say one variable causes another.
3. p-Hacking (Multiple Testing)
Running 20 tests and reporting the one with p < 0.05 is not a real finding. When there is no real effect, about 1 in 20 tests will still come out significant by luck at α = 0.05. Apply a Bonferroni correction when you run many comparisons:
# Bonferroni correction: divide α by the number of tests
n_tests = 10
alpha_adj = 0.05 / n_tests # 0.005
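The size of the problem follows from simple arithmetic: with m independent tests and no real effects, the chance of at least one false positive is 1 − (1 − α)^m. A sketch for 20 tests (independence is an assumption):

```python
alpha = 0.05
n_tests = 20

# Chance of at least one false positive when every H0 is true
# and the tests are independent
fwer_uncorrected = 1 - (1 - alpha) ** n_tests

# Same calculation with the Bonferroni-adjusted threshold alpha / n_tests
fwer_bonferroni = 1 - (1 - alpha / n_tests) ** n_tests

print(f"uncorrected: {fwer_uncorrected:.2f}")
print(f"Bonferroni:  {fwer_bonferroni:.3f}")
```

Without correction the chance of at least one spurious "finding" is about 64%; with Bonferroni it stays just under 5%.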
4. Stopping an A/B Test Early
If you stop when the result looks good, your false positive rate goes up. Set the sample size in advance and do not peek before you reach it.
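The inflation is easy to simulate with A/A tests, where both groups come from the same distribution and every significant result is a false positive. The look schedule (a check after every 100 users per group) is an arbitrary assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n_experiments, n_max, alpha = 2_000, 1_000, 0.05
looks = range(100, n_max + 1, 100)   # peek after every 100 users per group

# A/A test: both groups share one distribution, so any "win" is a false positive
a = rng.normal(size=(n_experiments, n_max))
b = rng.normal(size=(n_experiments, n_max))

stopped_early = np.zeros(n_experiments, dtype=bool)
for n in looks:
    p = stats.ttest_ind(a[:, :n], b[:, :n], axis=1).pvalue
    stopped_early |= p < alpha       # would have stopped at this look

peeking_rate = stopped_early.mean()
fixed_rate = (p < alpha).mean()      # p from the final look = planned sample size

print(f"false positives with peeking:   {peeking_rate:.3f}")
print(f"false positives, single look:   {fixed_rate:.3f}")
```

A single test at the planned sample size keeps the false positive rate near α, while stopping at the first significant look pushes it well above that.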
5. Simpson's Paradox
A trend that shows up in several groups can disappear or reverse when the groups are combined.
# example: the overall conversion rate looks lower for Group B
# but Group B is better in every individual city
# because Group B happened to get more traffic from low-conversion cities
pd.crosstab(payment_df['city'], payment_df['group'],
            values=payment_df['converted'], aggfunc='mean')
Always split your data and check whether the overall result still holds inside each subgroup.
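A small table built by hand makes the reversal concrete. All counts below are invented so that Group B wins in each city but loses overall.

```python
import pandas as pd

# Hypothetical counts constructed to show the reversal:
# Group B gets most of its traffic from the low-conversion city Y
data = pd.DataFrame({
    "city":        ["X", "X", "Y", "Y"],
    "group":       ["A", "B", "A", "B"],
    "users":       [800, 200, 200, 800],
    "conversions": [240,  70,  20,  96],
})

# Conversion rate per city and group
data["rate"] = data["conversions"] / data["users"]
by_city = data.pivot(index="city", columns="group", values="rate")

# Conversion rate per group with the cities combined
totals = data.groupby("group")[["users", "conversions"]].sum()
overall = totals["conversions"] / totals["users"]

print(by_city)
print(overall)
```

Group B converts better in both cities, yet its combined rate is lower because the traffic mix differs between the groups.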
6. Survivorship Bias
If you only look at successful or surviving records, you reach the wrong conclusion.
Example: looking only at completed parking sessions misses every car that left without paying. Your average amount looks higher than it really is.
Quick Reference — Which Test to Use
| Question | Test |
|---|---|
| Is the mean equal to a known value? | One-sample t-test |
| Are two group means different? | Two-sample t-test (Welch's) |
| Are two group means different (non-normal)? | Mann-Whitney U |
| Are 3+ group means different? | One-way ANOVA → Tukey HSD |
| Are 3+ group means different (non-normal)? | Kruskal-Wallis |
| Are category distributions different across groups? | Chi-square test |
| How precise is my estimate? | Confidence interval |
| How strong is the straight-line link between two numeric variables? | Pearson correlation + linregress |
| How strong is a one-direction link that is not necessarily straight? | Spearman correlation |
| Did a product change have a real effect? | A/B test → two-sample t-test + CI |