Pandas Notebook
Pandas is the primary Python library for data manipulation and analysis. This notebook covers the core operations and common DA workflows — from loading raw data to aggregating and reshaping it for reporting.
Many operations here have direct SQL equivalents. Where relevant, the SQL counterpart is noted. See SQL Notes and SQL Patterns for the SQL side.
All examples use a parking system dataset: parking_df, payment_df, and station_df.
import pandas as pd
import numpy as np
Reading & Inspecting Data
Read Files
df = pd.read_csv('parking_history.csv')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
df = pd.read_csv('data.csv', parse_dates=['entry_time', 'exit_time'])
Always use parse_dates for date columns — they come in as strings by default.
Inspect a DataFrame
df.head(10) # first 10 rows (default 5)
df.tail(5) # last 5 rows
df.shape # (rows, columns)
df.dtypes # data type per column
df.info() # dtypes + non-null counts + memory usage
df.describe() # count, mean, std, min, max for numeric columns
df.describe(include='all') # include string columns too
Column Summary
df['parking_type'].value_counts() # frequency count
df['parking_type'].value_counts(normalize=True) # proportions (0–1)
df['license_plate'].nunique() # number of unique values
df.isnull().sum() # null count per column
df.duplicated().sum() # number of duplicate rows
Selecting & Filtering
Select Columns
df['station_code'] # single column → Series
df[['station_code', 'entry_time']] # multiple columns → DataFrame
Filter Rows (Boolean Indexing)
df[df['parking_type'] == 'hourly']
df[df['amount'] > 500]
df[df['exit_time'].isnull()] # WHERE exit_time IS NULL
Combine Conditions
# AND
df[(df['parking_type'] == 'hourly') & (df['amount'] > 500)]
# OR
df[(df['parking_type'] == 'hourly') | (df['parking_type'] == 'monthly')]
# NOT
df[~(df['parking_type'] == 'monthly')]
Always wrap each condition in () when using & or | — Python's operator precedence will cause bugs otherwise.
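A quick illustration of why the parentheses matter, on a tiny hypothetical frame (the column names mirror parking_df but the data is made up):

```python
import pandas as pd

# tiny hypothetical frame for illustration
demo = pd.DataFrame({'parking_type': ['hourly', 'monthly', 'hourly'],
                     'amount': [600, 700, 100]})

# Correct: each condition wrapped in parentheses
ok = demo[(demo['parking_type'] == 'hourly') & (demo['amount'] > 500)]

# Without parentheses, & binds tighter than ==, so Python tries to
# evaluate 'hourly' & demo['amount'] first and raises a TypeError
precedence_bug = None
try:
    demo[demo['parking_type'] == 'hourly' & demo['amount'] > 500]
except TypeError as e:
    precedence_bug = e

print(len(ok))              # 1 — only the 600 TWD hourly row survives
print(type(precedence_bug)) # <class 'TypeError'>
```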
.loc and .iloc
df.loc[0:4, 'station_code':'entry_time'] # label-based slicing
df.iloc[0:5, 0:3] # position-based slicing
df.loc[df['amount'] > 500, 'license_plate'] # filter rows + select column
.loc uses column names and index labels; .iloc uses integer positions.
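One subtle difference worth seeing once: with the same 0:2 slice, .loc is inclusive of the end label while .iloc excludes the end position. A minimal sketch on a hypothetical frame with a default RangeIndex:

```python
import pandas as pd

# hypothetical frame with a default RangeIndex (0, 1, 2, ...)
demo = pd.DataFrame({'station_code': ['A1', 'A2', 'B1', 'B2', 'C1'],
                     'amount': [100, 200, 300, 400, 500]})

# .loc slices by LABEL and includes both ends: labels 0..2 -> 3 rows
by_label = demo.loc[0:2, 'station_code']

# .iloc slices by POSITION and excludes the end: positions 0..1 -> 2 rows
by_position = demo.iloc[0:2, 0]

print(len(by_label), len(by_position))  # 3 2
```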
.query() (Readable Alternative to Boolean Indexing)
df.query("parking_type == 'hourly' and amount > 500")
df.query("exit_time.isnull()", engine='python')
limit = 500
df.query("amount > @limit") # reference a Python variable with @
.isin() and .between()
df[df['payment_method'].isin(['Credit_Card', 'Mobile_Pay'])] # IN (...)
df[df['amount'].between(100, 500)] # BETWEEN 100 AND 500
Handling Missing Values
Detect
df.isnull().sum() # null count per column
df.isnull().sum() / len(df) * 100 # null percentage per column
df[df['exit_time'].isnull()] # rows where exit_time is missing
Drop
df.dropna() # drop rows with ANY null
df.dropna(subset=['exit_time']) # drop only where exit_time is null
df.dropna(thresh=3) # keep rows with at least 3 non-null values
Fill
df['exit_time'].fillna('Still Parked') # note: filling a datetime column with a string turns it into object dtype
df['amount'].fillna(0)
df['amount'].fillna(df['amount'].mean()) # fill with column mean
df['amount'].ffill() # forward fill — carry last value forward
df['amount'].bfill() # backward fill
All of these return a new Series/DataFrame by default — assign the result back (e.g. df['amount'] = df['amount'].fillna(0)) to keep the change. Add inplace=True only when you intentionally want to modify the original.
Sorting & Ranking
Sort
df.sort_values('entry_time') # ascending
df.sort_values('amount', ascending=False) # descending
df.sort_values(['station_code', 'amount'], ascending=[True, False]) # multi-column
Rank
df['rank'] = df['amount'].rank(ascending=False, method='dense')
| method | SQL equivalent |
|---|---|
| 'min' | RANK() |
| 'dense' | DENSE_RANK() |
| 'first' | ROW_NUMBER() |
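The three methods only differ when values tie. A small sketch with hypothetical amounts and one tie makes the mapping to the SQL functions concrete:

```python
import pandas as pd

# hypothetical amounts with a tie at 500
s = pd.Series([700, 500, 500, 300])

print(s.rank(ascending=False, method='min').tolist())    # [1.0, 2.0, 2.0, 4.0] -> RANK()
print(s.rank(ascending=False, method='dense').tolist())  # [1.0, 2.0, 2.0, 3.0] -> DENSE_RANK()
print(s.rank(ascending=False, method='first').tolist())  # [1.0, 2.0, 3.0, 4.0] -> ROW_NUMBER()
```

Note that 'min' skips rank 3 after the tie (like RANK()), 'dense' does not, and 'first' breaks the tie by order of appearance.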
Aggregation & GroupBy
Basic GroupBy
df.groupby('station_code')['amount'].sum()
df.groupby('station_code')['amount'].mean()
df.groupby('station_code').size() # row count per group (COUNT(*))
Multiple Aggregations (Named)
df.groupby('station_code').agg(
total_revenue = ('amount', 'sum'),
avg_amount = ('amount', 'mean'),
max_amount = ('amount', 'max'),
visit_count = ('parking_id', 'count')
)
Named aggregation (col, func) gives clean column names without extra renaming afterward.
GroupBy Multiple Columns
df.groupby(['station_code', 'payment_method']).agg(
total = ('amount', 'sum'),
count = ('amount', 'count')
).reset_index()
.reset_index() converts group keys from the index back to regular columns — almost always what you want.
Custom Aggregation
df.groupby('station_code').agg(
revenue_range = ('amount', lambda x: x.max() - x.min())
)
Pivot Table (SQL Pivot Equivalent)
pd.pivot_table(
df,
values='amount',
index='station_code',
columns='payment_method',
aggfunc='sum',
fill_value=0
)
index = GROUP BY, columns = the column whose values become new column headers, aggfunc = the aggregation function.
Cross-tabulation (Count Distribution)
pd.crosstab(df['station_code'], df['payment_method'])
pd.crosstab(df['station_code'], df['payment_method'], normalize='index') # row %
Merging & Joining
merge() — SQL JOIN Equivalent
# INNER JOIN
pd.merge(parking_df, payment_df, on='parking_id')
# LEFT JOIN
pd.merge(parking_df, payment_df, on='parking_id', how='left')
# Different key names in each table
pd.merge(parking_df, station_df, left_on='station_code', right_on='code')
| how | SQL equivalent |
|---|---|
| 'inner' | INNER JOIN |
| 'left' | LEFT JOIN |
| 'right' | RIGHT JOIN |
| 'outer' | FULL OUTER JOIN |
concat() — Stack DataFrames (UNION ALL)
pd.concat([df_2024, df_2025]) # stack rows
pd.concat([df_2024, df_2025], ignore_index=True) # reset index after stacking
pd.concat([df_a, df_b], axis=1) # stack columns side by side
Anti Join — Records with No Match
merged = pd.merge(parking_df, payment_df,
on='parking_id', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
indicator=True adds a _merge column. 'left_only' rows had no match in the right table — the anti join result.
String Operations
df['license_plate'].str.upper()
df['license_plate'].str.lower()
df['license_plate'].str.strip() # remove leading/trailing spaces
df['license_plate'].str.len()
df['station_name'].str.contains('North') # returns boolean Series
df['station_name'].str.startswith('ST')
df['license_plate'].str.replace('-', '')
df['license_plate'].str.split('-', expand=True) # split into multiple columns
df['license_plate'].str[0:3] # slice first 3 characters
Use .str accessor methods directly inside boolean filters:
df[df['station_name'].str.contains('Xinyi')]
df[df['license_plate'].str.startswith('E')]
Date & Time Operations
Convert to Datetime
df['entry_time'] = pd.to_datetime(df['entry_time'])
df['entry_time'] = pd.to_datetime(df['entry_time'], format='%Y-%m-%d %H:%M:%S')
Always convert explicitly — date columns almost always arrive as strings.
Extract Components
df['year'] = df['entry_time'].dt.year
df['month'] = df['entry_time'].dt.month
df['day'] = df['entry_time'].dt.day
df['hour'] = df['entry_time'].dt.hour
df['weekday'] = df['entry_time'].dt.day_name() # 'Monday', 'Tuesday' ...
df['date'] = df['entry_time'].dt.date # strip time component
df['ym'] = df['entry_time'].dt.to_period('M') # '2025-06' period object
Date Arithmetic
df['duration_mins'] = (df['exit_time'] - df['entry_time']).dt.total_seconds() / 60
df['days_since'] = (pd.Timestamp.today() - df['entry_time']).dt.days
Filter by Date
df[df['entry_time'] >= '2025-01-01']
df[df['entry_time'].dt.year == 2025]
df[df['entry_time'].dt.month.isin([1, 2, 3])] # Q1 only
Type Conversion
df['amount'] = df['amount'].astype(float)
df['is_electric'] = df['is_electric'].astype(bool)
df['station_code'] = df['station_code'].astype(str)
pd.to_numeric(df['amount'], errors='coerce') # non-numeric → NaN (safe)
errors='coerce' is the safe default — converts failures to NaN instead of raising an error. Equivalent to SQL CAST with null fallback.
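To see the coercion in action, here is a sketch with a hypothetical column containing one non-numeric entry:

```python
import pandas as pd

# hypothetical amount column with one unparseable value
raw = pd.Series(['120', '80.5', 'free', '300'])

clean = pd.to_numeric(raw, errors='coerce')
print(clean.tolist())        # [120.0, 80.5, nan, 300.0]
print(clean.isnull().sum())  # 1 — the unparseable 'free' became NaN
```

Checking clean.isnull().sum() after coercion is a cheap way to count how many values failed to convert.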
Common DA Workflows
1. Data Cleaning Pipeline
df = pd.read_csv('parking_history.csv', parse_dates=['entry_time', 'exit_time'])
# inspect
print(df.shape)
print(df.isnull().sum())
print(df.duplicated().sum())
# remove exact duplicates
df = df.drop_duplicates()
# fix types
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
# normalize strings
df['license_plate'] = df['license_plate'].str.strip().str.upper()
# flag or drop rows with critical nulls
df = df.dropna(subset=['license_plate', 'entry_time'])
2. EDA Workflow
# overview
df.info()
df.describe()
# categorical distributions
for col in ['station_code', 'parking_type', 'payment_method']:
print(f"\n{col}:")
print(df[col].value_counts())
# which columns have nulls?
df.isnull().sum()[df.isnull().sum() > 0]
# numeric correlations
df.select_dtypes(include='number').corr()
3. Monthly Trend with Resample
monthly = (
df.set_index('entry_time')
.resample('ME')['amount']
.agg(total='sum', count='count')
.reset_index()
)
.resample('ME') is the Pandas equivalent of GROUP BY DATE_FORMAT(date, '%Y-%m').
Common frequency aliases: 'D' = day, 'W' = week, 'ME' = month end, 'QE' = quarter end, 'YE' = year end.
4. Window Functions — Rolling & Shift
monthly['rolling_3m'] = monthly['total'].rolling(3).mean() # 3-month moving average
monthly['prev_month'] = monthly['total'].shift(1) # LAG(total, 1)
monthly['next_month'] = monthly['total'].shift(-1) # LEAD(total, 1)
monthly['mom_pct'] = monthly['total'].pct_change() * 100 # MoM % change
monthly['running_total'] = monthly['total'].cumsum() # running total
5. De-duplication — Keep Latest per Group
df_latest = (
df.sort_values('entry_time', ascending=False)
.drop_duplicates(subset='license_plate', keep='first')
)
Sort descending so keep='first' retains the latest row per group. SQL equivalent: ROW_NUMBER() OVER (PARTITION BY license_plate ORDER BY entry_time DESC) = 1.
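The sort-then-drop pattern is easy to verify on a hypothetical three-row frame where one plate appears twice:

```python
import pandas as pd

# hypothetical repeat visits: ABC-123 parked twice, XYZ-999 once
demo = pd.DataFrame({
    'license_plate': ['ABC-123', 'ABC-123', 'XYZ-999'],
    'entry_time': pd.to_datetime(['2025-06-01', '2025-06-15', '2025-06-10']),
    'amount': [100, 250, 80],
})

# newest rows first, then keep the first occurrence of each plate
latest = (
    demo.sort_values('entry_time', ascending=False)
        .drop_duplicates(subset='license_plate', keep='first')
)

# ABC-123 keeps its June 15 visit (amount 250); XYZ-999 keeps its only row
print(latest.sort_values('license_plate')['amount'].tolist())  # [250, 80]
```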
6. Top N per Group
top3 = (
df.sort_values('amount', ascending=False)
.groupby('station_code', group_keys=False)
.head(3)
)
SQL equivalent: ROW_NUMBER() OVER (PARTITION BY station_code ORDER BY amount DESC) <= 3.
7. Conditional Column — CASE WHEN Equivalent
# pd.cut for numeric bins
df['duration_category'] = pd.cut(
df['duration_mins'],
bins=[0, 120, 480, float('inf')],
labels=['Short-term', 'Medium', 'Long-term']
)
# np.select for complex multi-condition logic
conditions = [
df['amount'] > 1000,
df['amount'].between(500, 1000),
]
choices = ['High', 'Medium']
df['spend_tier'] = np.select(conditions, choices, default='Low')
pd.cut — best for numeric ranges (BETWEEN style). np.select — best for multiple independent conditions (searched CASE WHEN).
8. Period-over-Period Comparison
monthly = (
df.groupby(df['entry_time'].dt.to_period('M'))['amount']
.sum()
.reset_index()
)
monthly.columns = ['month', 'revenue']
monthly['prev_revenue'] = monthly['revenue'].shift(1)
monthly['mom_change'] = monthly['revenue'] - monthly['prev_revenue']
monthly['mom_pct'] = monthly['revenue'].pct_change().mul(100).round(1)
9. Anti Join — Find Unmatched Records
merged = pd.merge(
parking_df, payment_df,
on='parking_id', how='left', indicator=True
)
unpaid = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
10. Conditional Aggregation
summary = df.groupby('station_code').agg(
total_visits = ('parking_id', 'count'),
hourly_count = ('parking_type', lambda x: (x == 'hourly').sum()),
credit_revenue = ('amount', lambda x: x[df.loc[x.index, 'payment_method'] == 'Credit_Card'].sum()),
avg_amount = ('amount', 'mean'),
)
Cleaner alternative — compute columns first, then aggregate:
df['is_hourly'] = df['parking_type'] == 'hourly'
df['credit_amount'] = df['amount'].where(df['payment_method'] == 'Credit_Card', 0)
summary = df.groupby('station_code').agg(
total_visits = ('parking_id', 'count'),
hourly_count = ('is_hourly', 'sum'),
credit_revenue = ('credit_amount', 'sum'),
avg_amount = ('amount', 'mean'),
)
Create boolean/conditional columns first, then aggregate them — cleaner and easier to debug than lambda inside agg.
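A sketch of the column-first approach on a tiny hypothetical slice of parking_df, showing that a plain 'sum' over the prepared columns produces the conditional counts and revenue:

```python
import pandas as pd

# tiny hypothetical slice of parking_df
demo = pd.DataFrame({
    'station_code': ['S1', 'S1', 'S2'],
    'parking_type': ['hourly', 'monthly', 'hourly'],
    'payment_method': ['Credit_Card', 'Mobile_Pay', 'Credit_Card'],
    'amount': [100, 300, 50],
})

# build the conditional columns first...
demo['is_hourly'] = demo['parking_type'] == 'hourly'
demo['credit_amount'] = demo['amount'].where(demo['payment_method'] == 'Credit_Card', 0)

# ...then aggregate with plain named sums (summing booleans counts the Trues)
summary = demo.groupby('station_code').agg(
    hourly_count=('is_hourly', 'sum'),
    credit_revenue=('credit_amount', 'sum'),
)
print(summary)
# S1: hourly_count 1, credit_revenue 100; S2: hourly_count 1, credit_revenue 50
```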