Pandas Notebook

Pandas is the main Python library for cleaning and shaping data. This notebook covers the core operations and the workflows you will see most often in DA work — from loading raw data to grouping and reshaping it for a report.

Many of these have a direct SQL version. Where it helps, the SQL version is shown next to it. See SQL Notes and SQL Patterns for the SQL side. Once your data is ready, see Matplotlib / Seaborn Notebook for charts and Machine Learning Notebook for models.

If you are coming from Excel, the Excel Notebook has an Excel vs Pandas quick reference at the end.

For the NumPy layer that Pandas is built on, see NumPy Notebook.

All examples use a parking system dataset: parking_df, payment_df, and station_df.

import pandas as pd
import numpy as np

Reading & Inspecting Data

Read Files

df = pd.read_csv('parking_history.csv')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
df = pd.read_csv('data.csv', parse_dates=['entry_time', 'exit_time'])

Always use parse_dates for date columns. They come in as strings by default.

Inspect a DataFrame

df.head(10)               # first 10 rows (default 5)
df.tail(5)                # last 5 rows
df.shape                  # (rows, columns)
df.dtypes                 # data type per column
df.info()                 # dtypes + non-null counts + memory usage
df.describe()             # count, mean, std, min, max for numeric columns
df.describe(include='all') # include string columns too

Column Summary

df['parking_type'].value_counts()               # how many of each value
df['parking_type'].value_counts(normalize=True) # share (0–1)
df['license_plate'].nunique()                   # number of unique values
df.isnull().sum()                               # nulls per column
df.duplicated().sum()                           # number of duplicate rows

Selecting & Filtering

Select Columns

df['station_code']                       # one column → Series
df[['station_code', 'entry_time']]       # several columns → DataFrame

Filter Rows (Boolean Indexing)

df[df['parking_type'] == 'hourly']
df[df['amount'] > 500]
df[df['exit_time'].isnull()]             # WHERE exit_time IS NULL

Combine Conditions

# AND
df[(df['parking_type'] == 'hourly') & (df['amount'] > 500)]
# OR
df[(df['parking_type'] == 'hourly') | (df['parking_type'] == 'monthly')]
# NOT
df[~(df['parking_type'] == 'monthly')]

Always wrap each condition in () when you use & or |. Without the brackets, Python's operator order will cause bugs.

.loc and .iloc

df.loc[0:4, 'station_code':'entry_time']         # slice by labels
df.iloc[0:5, 0:3]                                # slice by integer position
df.loc[df['amount'] > 500, 'license_plate']      # filter rows + pick column
  • .loc uses column names and index labels.
  • .iloc uses integer positions.

.query() (Easier to Read Than Boolean Indexing)

df.query("parking_type == 'hourly' and amount > 500")
df.query("exit_time.isnull()", engine='python')

limit = 500
df.query("amount > @limit")    # use a Python variable with @

.isin() and .between()

df[df['payment_method'].isin(['Credit_Card', 'Mobile_Pay'])]   # IN (...)
df[df['amount'].between(100, 500)]                              # BETWEEN 100 AND 500

Handling Missing Values

Detect

df.isnull().sum()                        # null count per column
df.isnull().sum() / len(df) * 100        # null percentage per column
df[df['exit_time'].isnull()]             # rows where exit_time is missing

Drop

df.dropna()                              # drop rows with ANY null
df.dropna(subset=['exit_time'])          # drop only where exit_time is null
df.dropna(thresh=3)                      # keep rows with at least 3 non-null values

Fill

df['exit_time'].fillna('Still Parked')
df['amount'].fillna(0)
df['amount'].fillna(df['amount'].mean())   # fill with the column mean
df['amount'].ffill()                       # forward fill — carry the last value forward
df['amount'].bfill()                       # backward fill

All of these return a new DataFrame by default. Add inplace=True only when you really mean to change the original.

Sorting & Ranking

Sort

df.sort_values('entry_time')                                        # ascending
df.sort_values('amount', ascending=False)                           # descending
df.sort_values(['station_code', 'amount'], ascending=[True, False]) # several columns

Rank

df['rank'] = df['amount'].rank(ascending=False, method='dense')
method SQL equivalent
'min' RANK()
'dense' DENSE_RANK()
'first' ROW_NUMBER()

Aggregation & GroupBy

Basic GroupBy

df.groupby('station_code')['amount'].sum()
df.groupby('station_code')['amount'].mean()
df.groupby('station_code').size()           # row count per group (COUNT(*))

Several Aggregations (Named)

df.groupby('station_code').agg(
    total_revenue = ('amount',     'sum'),
    avg_amount    = ('amount',     'mean'),
    max_amount    = ('amount',     'max'),
    visit_count   = ('parking_id', 'count')
)

Named aggregation (col, func) gives you clean column names without an extra rename step.

GroupBy on Several Columns

df.groupby(['station_code', 'payment_method']).agg(
    total = ('amount', 'sum'),
    count = ('amount', 'count')
).reset_index()

.reset_index() turns the group keys back into normal columns — almost always what you want.

Custom Aggregation

df.groupby('station_code').agg(
    revenue_range = ('amount', lambda x: x.max() - x.min())
)

Pivot Table (SQL Pivot Equivalent)

pd.pivot_table(
    df,
    values='amount',
    index='station_code',
    columns='payment_method',
    aggfunc='sum',
    fill_value=0
)

index = GROUP BY, columns = the column whose values become new column headers, aggfunc = the aggregate function.

Cross-tabulation (Count Distribution)

pd.crosstab(df['station_code'], df['payment_method'])
pd.crosstab(df['station_code'], df['payment_method'], normalize='index')  # row %

Merging & Joining

merge() — SQL JOIN Equivalent

# INNER JOIN
pd.merge(parking_df, payment_df, on='parking_id')

# LEFT JOIN
pd.merge(parking_df, payment_df, on='parking_id', how='left')

# Different key names in each table
pd.merge(parking_df, station_df, left_on='station_code', right_on='code')
how SQL equivalent
'inner' INNER JOIN
'left' LEFT JOIN
'right' RIGHT JOIN
'outer' FULL OUTER JOIN

concat() — Stack DataFrames (UNION ALL)

pd.concat([df_2024, df_2025])                     # stack rows
pd.concat([df_2024, df_2025], ignore_index=True)  # reset the index after stacking
pd.concat([df_a, df_b], axis=1)                   # stack columns side by side

Anti Join — Records with No Match

merged = pd.merge(parking_df, payment_df,
                  on='parking_id', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

indicator=True adds a _merge column. 'left_only' rows had no match in the right table — that is the anti join result.

String Operations

df['license_plate'].str.upper()
df['license_plate'].str.lower()
df['license_plate'].str.strip()                  # remove spaces at the start and end
df['license_plate'].str.len()
df['station_name'].str.contains('North')         # returns a boolean Series
df['station_name'].str.startswith('ST')
df['license_plate'].str.replace('-', '')
df['license_plate'].str.split('-', expand=True)  # split into several columns
df['license_plate'].str[0:3]                     # first 3 characters

Use .str accessor methods inside boolean filters too:

df[df['station_name'].str.contains('Xinyi')]
df[df['license_plate'].str.startswith('E')]

Date & Time Operations

Convert to Datetime

df['entry_time'] = pd.to_datetime(df['entry_time'])
df['entry_time'] = pd.to_datetime(df['entry_time'], format='%Y-%m-%d %H:%M:%S')

Always convert on purpose. Date columns almost always arrive as strings.

Extract Parts

df['year']    = df['entry_time'].dt.year
df['month']   = df['entry_time'].dt.month
df['day']     = df['entry_time'].dt.day
df['hour']    = df['entry_time'].dt.hour
df['weekday'] = df['entry_time'].dt.day_name()      # 'Monday', 'Tuesday' ...
df['date']    = df['entry_time'].dt.date            # drop the time
df['ym']      = df['entry_time'].dt.to_period('M')  # '2025-06' period object

Date Math

df['duration_mins'] = (df['exit_time'] - df['entry_time']).dt.total_seconds() / 60
df['days_since']    = (pd.Timestamp.today() - df['entry_time']).dt.days

Filter by Date

df[df['entry_time'] >= '2025-01-01']
df[df['entry_time'].dt.year == 2025]
df[df['entry_time'].dt.month.isin([1, 2, 3])]   # Q1 only

Type Conversion

df['amount']       = df['amount'].astype(float)
df['is_electric']  = df['is_electric'].astype(bool)
df['station_code'] = df['station_code'].astype(str)

pd.to_numeric(df['amount'], errors='coerce')   # non-numeric → NaN (safe)

errors='coerce' is the safe default. It turns conversion failures into NaN instead of raising an error. The same idea as SQL CAST with a null fallback.

Common DA Workflows

1. Data Cleaning Pipeline

df = pd.read_csv('parking_history.csv', parse_dates=['entry_time', 'exit_time'])

# inspect
print(df.shape)
print(df.isnull().sum())
print(df.duplicated().sum())

# remove exact duplicates
df = df.drop_duplicates()

# fix types
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# normalize strings
df['license_plate'] = df['license_plate'].str.strip().str.upper()

# flag or drop rows with critical nulls
df = df.dropna(subset=['license_plate', 'entry_time'])

2. EDA Workflow

# overview
df.info()
df.describe()

# category distributions
for col in ['station_code', 'parking_type', 'payment_method']:
    print(f"\n{col}:")
    print(df[col].value_counts())

# which columns have nulls?
df.isnull().sum()[df.isnull().sum() > 0]

# numeric correlations
df.select_dtypes(include='number').corr()

3. Monthly Trend with Resample

monthly = (
    df.set_index('entry_time')
      .resample('ME')['amount']
      .agg(total='sum', count='count')
      .reset_index()
)

.resample('ME') is the Pandas version of GROUP BY DATE_FORMAT(date, '%Y-%m'). Common frequency aliases: 'D' = day, 'W' = week, 'ME' = month end, 'QE' = quarter end, 'YE' = year end.

4. Window Functions — Rolling & Shift

monthly['rolling_3m']  = monthly['total'].rolling(3).mean()  # 3-month moving average
monthly['prev_month']  = monthly['total'].shift(1)           # LAG(total, 1)
monthly['next_month']  = monthly['total'].shift(-1)          # LEAD(total, 1)
monthly['mom_pct']     = monthly['total'].pct_change() * 100 # MoM % change
monthly['running_total'] = monthly['total'].cumsum()         # running total

5. De-duplication — Keep the Latest per Group

df_latest = (
    df.sort_values('entry_time', ascending=False)
      .drop_duplicates(subset='license_plate', keep='first')
)

Sort descending so keep='first' keeps the latest row per group. SQL version: ROW_NUMBER() OVER (PARTITION BY license_plate ORDER BY entry_time DESC) = 1.

6. Top N per Group

top3 = (
    df.sort_values('amount', ascending=False)
      .groupby('station_code', group_keys=False)
      .head(3)
)

SQL version: ROW_NUMBER() OVER (PARTITION BY station_code ORDER BY amount DESC) <= 3.

7. Conditional Column — CASE WHEN Equivalent

# pd.cut for numeric bins
df['duration_category'] = pd.cut(
    df['duration_mins'],
    bins=[0, 120, 480, float('inf')],
    labels=['Short-term', 'Medium', 'Long-term']
)

# np.select for several independent conditions
conditions = [
    df['amount'] > 1000,
    df['amount'].between(500, 1000),
]
choices = ['High', 'Medium']
df['spend_tier'] = np.select(conditions, choices, default='Low')
  • pd.cut — best for numeric ranges (BETWEEN style).
  • np.select — best for several independent conditions (searched CASE WHEN).

8. Period-over-Period Comparison

monthly = (
    df.groupby(df['entry_time'].dt.to_period('M'))['amount']
      .sum()
      .reset_index()
)
monthly.columns = ['month', 'revenue']
monthly['prev_revenue'] = monthly['revenue'].shift(1)
monthly['mom_change']   = monthly['revenue'] - monthly['prev_revenue']
monthly['mom_pct']      = monthly['revenue'].pct_change().mul(100).round(1)

9. Anti Join — Find Unmatched Records

merged = pd.merge(
    parking_df, payment_df,
    on='parking_id', how='left', indicator=True
)
unpaid = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

10. Conditional Aggregation

summary = df.groupby('station_code').agg(
    total_visits     = ('parking_id',   'count'),
    hourly_count     = ('parking_type', lambda x: (x == 'hourly').sum()),
    credit_revenue   = ('amount',       lambda x: x[df.loc[x.index, 'payment_method'] == 'Credit_Card'].sum()),
    avg_amount       = ('amount',       'mean'),
)

A cleaner option — build the columns first, then aggregate:

df['is_hourly']       = df['parking_type'] == 'hourly'
df['credit_amount']   = df['amount'].where(df['payment_method'] == 'Credit_Card', 0)

summary = df.groupby('station_code').agg(
    total_visits   = ('parking_id',    'count'),
    hourly_count   = ('is_hourly',     'sum'),
    credit_revenue = ('credit_amount', 'sum'),
    avg_amount     = ('amount',        'mean'),
)

Build boolean / conditional columns first, then aggregate them. Cleaner and easier to debug than a lambda inside agg.