Pandas Notebook
Pandas is the main Python library for cleaning and shaping data. This notebook covers the core operations and the workflows you will see most often in DA work — from loading raw data to grouping and reshaping it for a report.
Many of these have a direct SQL version. Where it helps, the SQL version is shown next to it. See SQL Notes and SQL Patterns for the SQL side. Once your data is ready, see Matplotlib / Seaborn Notebook for charts and Machine Learning Notebook for models.
If you are coming from Excel, the Excel Notebook has an Excel vs Pandas quick reference at the end.
For the NumPy layer that Pandas is built on, see NumPy Notebook.
All examples use a parking system dataset: parking_df, payment_df, and station_df.
import pandas as pd
import numpy as np
Reading & Inspecting Data
Read Files
df = pd.read_csv('parking_history.csv')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
df = pd.read_csv('data.csv', parse_dates=['entry_time', 'exit_time'])
Always use parse_dates for date columns. They come in as strings by default.
Inspect a DataFrame
df.head(10) # first 10 rows (default 5)
df.tail(5) # last 5 rows
df.shape # (rows, columns)
df.dtypes # data type per column
df.info() # dtypes + non-null counts + memory usage
df.describe() # count, mean, std, min, max for numeric columns
df.describe(include='all') # include string columns too
Column Summary
df['parking_type'].value_counts() # how many of each value
df['parking_type'].value_counts(normalize=True) # share (0–1)
df['license_plate'].nunique() # number of unique values
df.isnull().sum() # nulls per column
df.duplicated().sum() # number of duplicate rows
Selecting & Filtering
Select Columns
df['station_code'] # one column → Series
df[['station_code', 'entry_time']] # several columns → DataFrame
Filter Rows (Boolean Indexing)
df[df['parking_type'] == 'hourly']
df[df['amount'] > 500]
df[df['exit_time'].isnull()] # WHERE exit_time IS NULL
Combine Conditions
# AND
df[(df['parking_type'] == 'hourly') & (df['amount'] > 500)]
# OR
df[(df['parking_type'] == 'hourly') | (df['parking_type'] == 'monthly')]
# NOT
df[~(df['parking_type'] == 'monthly')]
Always wrap each condition in () when you use & or |. Without the brackets, Python's operator order will cause bugs.
.loc and .iloc
df.loc[0:4, 'station_code':'entry_time'] # slice by labels
df.iloc[0:5, 0:3] # slice by integer position
df.loc[df['amount'] > 500, 'license_plate'] # filter rows + pick column
.locuses column names and index labels..ilocuses integer positions.
.query() (Easier to Read Than Boolean Indexing)
df.query("parking_type == 'hourly' and amount > 500")
df.query("exit_time.isnull()", engine='python')
limit = 500
df.query("amount > @limit") # use a Python variable with @
.isin() and .between()
df[df['payment_method'].isin(['Credit_Card', 'Mobile_Pay'])] # IN (...)
df[df['amount'].between(100, 500)] # BETWEEN 100 AND 500
Handling Missing Values
Detect
df.isnull().sum() # null count per column
df.isnull().sum() / len(df) * 100 # null percentage per column
df[df['exit_time'].isnull()] # rows where exit_time is missing
Drop
df.dropna() # drop rows with ANY null
df.dropna(subset=['exit_time']) # drop only where exit_time is null
df.dropna(thresh=3) # keep rows with at least 3 non-null values
Fill
df['exit_time'].fillna('Still Parked')
df['amount'].fillna(0)
df['amount'].fillna(df['amount'].mean()) # fill with the column mean
df['amount'].ffill() # forward fill — carry the last value forward
df['amount'].bfill() # backward fill
All of these return a new DataFrame by default. Add inplace=True only when you really mean to change the original.
Sorting & Ranking
Sort
df.sort_values('entry_time') # ascending
df.sort_values('amount', ascending=False) # descending
df.sort_values(['station_code', 'amount'], ascending=[True, False]) # several columns
Rank
df['rank'] = df['amount'].rank(ascending=False, method='dense')
| method | SQL equivalent |
|---|---|
'min' |
RANK() |
'dense' |
DENSE_RANK() |
'first' |
ROW_NUMBER() |
Aggregation & GroupBy
Basic GroupBy
df.groupby('station_code')['amount'].sum()
df.groupby('station_code')['amount'].mean()
df.groupby('station_code').size() # row count per group (COUNT(*))
Several Aggregations (Named)
df.groupby('station_code').agg(
total_revenue = ('amount', 'sum'),
avg_amount = ('amount', 'mean'),
max_amount = ('amount', 'max'),
visit_count = ('parking_id', 'count')
)
Named aggregation (col, func) gives you clean column names without an extra rename step.
GroupBy on Several Columns
df.groupby(['station_code', 'payment_method']).agg(
total = ('amount', 'sum'),
count = ('amount', 'count')
).reset_index()
.reset_index() turns the group keys back into normal columns — almost always what you want.
Custom Aggregation
df.groupby('station_code').agg(
revenue_range = ('amount', lambda x: x.max() - x.min())
)
Pivot Table (SQL Pivot Equivalent)
pd.pivot_table(
df,
values='amount',
index='station_code',
columns='payment_method',
aggfunc='sum',
fill_value=0
)
index = GROUP BY, columns = the column whose values become new column headers, aggfunc = the aggregate function.
Cross-tabulation (Count Distribution)
pd.crosstab(df['station_code'], df['payment_method'])
pd.crosstab(df['station_code'], df['payment_method'], normalize='index') # row %
Merging & Joining
merge() — SQL JOIN Equivalent
# INNER JOIN
pd.merge(parking_df, payment_df, on='parking_id')
# LEFT JOIN
pd.merge(parking_df, payment_df, on='parking_id', how='left')
# Different key names in each table
pd.merge(parking_df, station_df, left_on='station_code', right_on='code')
how |
SQL equivalent |
|---|---|
'inner' |
INNER JOIN |
'left' |
LEFT JOIN |
'right' |
RIGHT JOIN |
'outer' |
FULL OUTER JOIN |
concat() — Stack DataFrames (UNION ALL)
pd.concat([df_2024, df_2025]) # stack rows
pd.concat([df_2024, df_2025], ignore_index=True) # reset the index after stacking
pd.concat([df_a, df_b], axis=1) # stack columns side by side
Anti Join — Records with No Match
merged = pd.merge(parking_df, payment_df,
on='parking_id', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
indicator=True adds a _merge column. 'left_only' rows had no match in the right table — that is the anti join result.
String Operations
df['license_plate'].str.upper()
df['license_plate'].str.lower()
df['license_plate'].str.strip() # remove spaces at the start and end
df['license_plate'].str.len()
df['station_name'].str.contains('North') # returns a boolean Series
df['station_name'].str.startswith('ST')
df['license_plate'].str.replace('-', '')
df['license_plate'].str.split('-', expand=True) # split into several columns
df['license_plate'].str[0:3] # first 3 characters
Use .str accessor methods inside boolean filters too:
df[df['station_name'].str.contains('Xinyi')]
df[df['license_plate'].str.startswith('E')]
Date & Time Operations
Convert to Datetime
df['entry_time'] = pd.to_datetime(df['entry_time'])
df['entry_time'] = pd.to_datetime(df['entry_time'], format='%Y-%m-%d %H:%M:%S')
Always convert on purpose. Date columns almost always arrive as strings.
Extract Parts
df['year'] = df['entry_time'].dt.year
df['month'] = df['entry_time'].dt.month
df['day'] = df['entry_time'].dt.day
df['hour'] = df['entry_time'].dt.hour
df['weekday'] = df['entry_time'].dt.day_name() # 'Monday', 'Tuesday' ...
df['date'] = df['entry_time'].dt.date # drop the time
df['ym'] = df['entry_time'].dt.to_period('M') # '2025-06' period object
Date Math
df['duration_mins'] = (df['exit_time'] - df['entry_time']).dt.total_seconds() / 60
df['days_since'] = (pd.Timestamp.today() - df['entry_time']).dt.days
Filter by Date
df[df['entry_time'] >= '2025-01-01']
df[df['entry_time'].dt.year == 2025]
df[df['entry_time'].dt.month.isin([1, 2, 3])] # Q1 only
Type Conversion
df['amount'] = df['amount'].astype(float)
df['is_electric'] = df['is_electric'].astype(bool)
df['station_code'] = df['station_code'].astype(str)
pd.to_numeric(df['amount'], errors='coerce') # non-numeric → NaN (safe)
errors='coerce' is the safe default. It turns conversion failures into NaN instead of raising an error. The same idea as SQL CAST with a null fallback.
Common DA Workflows
1. Data Cleaning Pipeline
df = pd.read_csv('parking_history.csv', parse_dates=['entry_time', 'exit_time'])
# inspect
print(df.shape)
print(df.isnull().sum())
print(df.duplicated().sum())
# remove exact duplicates
df = df.drop_duplicates()
# fix types
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
# normalize strings
df['license_plate'] = df['license_plate'].str.strip().str.upper()
# flag or drop rows with critical nulls
df = df.dropna(subset=['license_plate', 'entry_time'])
2. EDA Workflow
# overview
df.info()
df.describe()
# category distributions
for col in ['station_code', 'parking_type', 'payment_method']:
print(f"\n{col}:")
print(df[col].value_counts())
# which columns have nulls?
df.isnull().sum()[df.isnull().sum() > 0]
# numeric correlations
df.select_dtypes(include='number').corr()
3. Monthly Trend with Resample
monthly = (
df.set_index('entry_time')
.resample('ME')['amount']
.agg(total='sum', count='count')
.reset_index()
)
.resample('ME') is the Pandas version of GROUP BY DATE_FORMAT(date, '%Y-%m').
Common frequency aliases: 'D' = day, 'W' = week, 'ME' = month end, 'QE' = quarter end, 'YE' = year end.
4. Window Functions — Rolling & Shift
monthly['rolling_3m'] = monthly['total'].rolling(3).mean() # 3-month moving average
monthly['prev_month'] = monthly['total'].shift(1) # LAG(total, 1)
monthly['next_month'] = monthly['total'].shift(-1) # LEAD(total, 1)
monthly['mom_pct'] = monthly['total'].pct_change() * 100 # MoM % change
monthly['running_total'] = monthly['total'].cumsum() # running total
5. De-duplication — Keep the Latest per Group
df_latest = (
df.sort_values('entry_time', ascending=False)
.drop_duplicates(subset='license_plate', keep='first')
)
Sort descending so keep='first' keeps the latest row per group. SQL version: ROW_NUMBER() OVER (PARTITION BY license_plate ORDER BY entry_time DESC) = 1.
6. Top N per Group
top3 = (
df.sort_values('amount', ascending=False)
.groupby('station_code', group_keys=False)
.head(3)
)
SQL version: ROW_NUMBER() OVER (PARTITION BY station_code ORDER BY amount DESC) <= 3.
7. Conditional Column — CASE WHEN Equivalent
# pd.cut for numeric bins
df['duration_category'] = pd.cut(
df['duration_mins'],
bins=[0, 120, 480, float('inf')],
labels=['Short-term', 'Medium', 'Long-term']
)
# np.select for several independent conditions
conditions = [
df['amount'] > 1000,
df['amount'].between(500, 1000),
]
choices = ['High', 'Medium']
df['spend_tier'] = np.select(conditions, choices, default='Low')
pd.cut— best for numeric ranges (BETWEEN style).np.select— best for several independent conditions (searched CASE WHEN).
8. Period-over-Period Comparison
monthly = (
df.groupby(df['entry_time'].dt.to_period('M'))['amount']
.sum()
.reset_index()
)
monthly.columns = ['month', 'revenue']
monthly['prev_revenue'] = monthly['revenue'].shift(1)
monthly['mom_change'] = monthly['revenue'] - monthly['prev_revenue']
monthly['mom_pct'] = monthly['revenue'].pct_change().mul(100).round(1)
9. Anti Join — Find Unmatched Records
merged = pd.merge(
parking_df, payment_df,
on='parking_id', how='left', indicator=True
)
unpaid = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
10. Conditional Aggregation
summary = df.groupby('station_code').agg(
total_visits = ('parking_id', 'count'),
hourly_count = ('parking_type', lambda x: (x == 'hourly').sum()),
credit_revenue = ('amount', lambda x: x[df.loc[x.index, 'payment_method'] == 'Credit_Card'].sum()),
avg_amount = ('amount', 'mean'),
)
A cleaner option — build the columns first, then aggregate:
df['is_hourly'] = df['parking_type'] == 'hourly'
df['credit_amount'] = df['amount'].where(df['payment_method'] == 'Credit_Card', 0)
summary = df.groupby('station_code').agg(
total_visits = ('parking_id', 'count'),
hourly_count = ('is_hourly', 'sum'),
credit_revenue = ('credit_amount', 'sum'),
avg_amount = ('amount', 'mean'),
)
Build boolean / conditional columns first, then aggregate them. Cleaner and easier to debug than a lambda inside agg.