Pandas Notebook

Pandas is the primary Python library for data manipulation and analysis. This notebook covers the core operations and common data-analysis (DA) workflows — from loading raw data to aggregating and reshaping it for reporting.

Many operations here have direct SQL equivalents. Where relevant, the SQL counterpart is noted. See SQL Notes and SQL Patterns for the SQL side.

All examples use a parking system dataset: parking_df, payment_df, and station_df.

import pandas as pd
import numpy as np

Reading & Inspecting Data

Read Files

df = pd.read_csv('parking_history.csv')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
df = pd.read_csv('data.csv', parse_dates=['entry_time', 'exit_time'])

Always use parse_dates for date columns — they come in as strings by default.

Inspect a DataFrame

df.head(10)               # first 10 rows (default 5)
df.tail(5)                # last 5 rows
df.shape                  # (rows, columns)
df.dtypes                 # data type per column
df.info()                 # dtypes + non-null counts + memory usage
df.describe()             # count, mean, std, min, max for numeric columns
df.describe(include='all') # include string columns too

Column Summary

df['parking_type'].value_counts()               # frequency count
df['parking_type'].value_counts(normalize=True) # proportions (0–1)
df['license_plate'].nunique()                   # number of unique values
df.isnull().sum()                               # null count per column
df.duplicated().sum()                           # number of duplicate rows

Selecting & Filtering

Select Columns

df['station_code']                       # single column → Series
df[['station_code', 'entry_time']]       # multiple columns → DataFrame

Filter Rows (Boolean Indexing)

df[df['parking_type'] == 'hourly']
df[df['amount'] > 500]
df[df['exit_time'].isnull()]             # WHERE exit_time IS NULL

Combine Conditions

# AND
df[(df['parking_type'] == 'hourly') & (df['amount'] > 500)]
# OR
df[(df['parking_type'] == 'hourly') | (df['parking_type'] == 'monthly')]
# NOT
df[~(df['parking_type'] == 'monthly')]

Always wrap each condition in () when using & or | — Python's operator precedence will cause bugs otherwise.
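A minimal sketch of why the parentheses matter, using toy data (the values here are assumed for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'parking_type': ['hourly', 'monthly', 'hourly'],
                    'amount': [300, 800, 900]})

# Correct: each condition wrapped in ()
ok = toy[(toy['parking_type'] == 'hourly') & (toy['amount'] > 500)]

# Without parentheses, & binds tighter than ==, so Python would evaluate
# 'hourly' & toy['amount'] first and raise an error:
# toy[toy['parking_type'] == 'hourly' & toy['amount'] > 500]
```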

.loc and .iloc

df.loc[0:4, 'station_code':'entry_time']         # label-based slicing
df.iloc[0:5, 0:3]                                # position-based slicing
df.loc[df['amount'] > 500, 'license_plate']      # filter rows + select column
  • .loc uses column names and index labels.
  • .iloc uses integer positions.

.query() (Readable Alternative to Boolean Indexing)

df.query("parking_type == 'hourly' and amount > 500")
df.query("exit_time.isnull()", engine='python')

limit = 500
df.query("amount > @limit")    # reference a Python variable with @

.isin() and .between()

df[df['payment_method'].isin(['Credit_Card', 'Mobile_Pay'])]   # IN (...)
df[df['amount'].between(100, 500)]                              # BETWEEN 100 AND 500

Handling Missing Values

Detect

df.isnull().sum()                        # null count per column
df.isnull().sum() / len(df) * 100        # null percentage per column
df[df['exit_time'].isnull()]             # rows where exit_time is missing

Drop

df.dropna()                              # drop rows with ANY null
df.dropna(subset=['exit_time'])          # drop only where exit_time is null
df.dropna(thresh=3)                      # keep rows with at least 3 non-null values

Fill

df['exit_time'].fillna('Still Parked')     # caution: turns a datetime column into object dtype
df['amount'].fillna(0)
df['amount'].fillna(df['amount'].mean())   # fill with column mean
df['amount'].ffill()                       # forward fill — carry last value forward
df['amount'].bfill()                       # backward fill

All of these return a new Series (or DataFrame) rather than modifying in place — assign the result back to keep the change. Use inplace=True only when you intentionally want to modify the original; modern pandas discourages it.
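The assign-back pattern, as a small self-contained sketch (toy values assumed):

```python
import pandas as pd

s = pd.DataFrame({'amount': [100.0, None, 300.0]})

# fillna returns a new Series — assign it back instead of using inplace=True
s['amount'] = s['amount'].fillna(0)
```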

Sorting & Ranking

Sort

df.sort_values('entry_time')                                        # ascending
df.sort_values('amount', ascending=False)                           # descending
df.sort_values(['station_code', 'amount'], ascending=[True, False]) # multi-column

Rank

df['rank'] = df['amount'].rank(ascending=False, method='dense')
method     SQL equivalent
'min'      RANK()
'dense'    DENSE_RANK()
'first'    ROW_NUMBER()
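A tiny worked example (toy values assumed) showing how the three methods handle the tie at 300:

```python
import pandas as pd

amounts = pd.Series([500, 300, 300, 100])

rank_min   = amounts.rank(ascending=False, method='min')    # RANK():       1, 2, 2, 4
rank_dense = amounts.rank(ascending=False, method='dense')  # DENSE_RANK(): 1, 2, 2, 3
rank_first = amounts.rank(ascending=False, method='first')  # ROW_NUMBER(): 1, 2, 3, 4
```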

Aggregation & GroupBy

Basic GroupBy

df.groupby('station_code')['amount'].sum()
df.groupby('station_code')['amount'].mean()
df.groupby('station_code').size()           # row count per group (COUNT(*))

Multiple Aggregations (Named)

df.groupby('station_code').agg(
    total_revenue = ('amount',     'sum'),
    avg_amount    = ('amount',     'mean'),
    max_amount    = ('amount',     'max'),
    visit_count   = ('parking_id', 'count')
)

Named aggregation (col, func) gives clean column names without extra renaming afterward.

GroupBy Multiple Columns

df.groupby(['station_code', 'payment_method']).agg(
    total = ('amount', 'sum'),
    count = ('amount', 'count')
).reset_index()

.reset_index() converts group keys from the index back to regular columns — almost always what you want.

Custom Aggregation

df.groupby('station_code').agg(
    revenue_range = ('amount', lambda x: x.max() - x.min())
)

Pivot Table (SQL Pivot Equivalent)

pd.pivot_table(
    df,
    values='amount',
    index='station_code',
    columns='payment_method',
    aggfunc='sum',
    fill_value=0
)

index = GROUP BY, columns = the column whose values become new column headers, aggfunc = the aggregation function.
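A self-contained sketch on toy data (the station/payment values are assumed) to make the reshaping concrete:

```python
import pandas as pd

toy = pd.DataFrame({
    'station_code':   ['A', 'A', 'B', 'B'],
    'payment_method': ['Cash', 'Card', 'Cash', 'Cash'],
    'amount':         [100, 200, 50, 70],
})

# Each distinct payment_method becomes a column; amounts are summed per cell
wide = pd.pivot_table(toy, values='amount', index='station_code',
                      columns='payment_method', aggfunc='sum', fill_value=0)
# station A: Card=200, Cash=100; station B: Card=0 (filled), Cash=120
```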

Cross-tabulation (Count Distribution)

pd.crosstab(df['station_code'], df['payment_method'])
pd.crosstab(df['station_code'], df['payment_method'], normalize='index')  # row %

Merging & Joining

merge() — SQL JOIN Equivalent

# INNER JOIN
pd.merge(parking_df, payment_df, on='parking_id')

# LEFT JOIN
pd.merge(parking_df, payment_df, on='parking_id', how='left')

# Different key names in each table
pd.merge(parking_df, station_df, left_on='station_code', right_on='code')
how        SQL equivalent
'inner'    INNER JOIN
'left'     LEFT JOIN
'right'    RIGHT JOIN
'outer'    FULL OUTER JOIN

concat() — Stack DataFrames (UNION ALL)

pd.concat([df_2024, df_2025])                     # stack rows
pd.concat([df_2024, df_2025], ignore_index=True)  # reset index after stacking
pd.concat([df_a, df_b], axis=1)                   # stack columns side by side

Anti Join — Records with No Match

merged = pd.merge(parking_df, payment_df,
                  on='parking_id', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

indicator=True adds a _merge column. 'left_only' rows had no match in the right table — the anti join result.

String Operations

df['license_plate'].str.upper()
df['license_plate'].str.lower()
df['license_plate'].str.strip()                  # remove leading/trailing spaces
df['license_plate'].str.len()
df['station_name'].str.contains('North')         # returns boolean Series
df['station_name'].str.startswith('ST')
df['license_plate'].str.replace('-', '')
df['license_plate'].str.split('-', expand=True)  # split into multiple columns
df['license_plate'].str[0:3]                     # slice first 3 characters

Use .str accessor methods directly inside boolean filters:

df[df['station_name'].str.contains('Xinyi')]
df[df['license_plate'].str.startswith('E')]

Date & Time Operations

Convert to Datetime

df['entry_time'] = pd.to_datetime(df['entry_time'])
df['entry_time'] = pd.to_datetime(df['entry_time'], format='%Y-%m-%d %H:%M:%S')

Always convert explicitly — date columns almost always arrive as strings.

Extract Components

df['year']    = df['entry_time'].dt.year
df['month']   = df['entry_time'].dt.month
df['day']     = df['entry_time'].dt.day
df['hour']    = df['entry_time'].dt.hour
df['weekday'] = df['entry_time'].dt.day_name()      # 'Monday', 'Tuesday' ...
df['date']    = df['entry_time'].dt.date            # strip time component
df['ym']      = df['entry_time'].dt.to_period('M')  # '2025-06' period object

Date Arithmetic

df['duration_mins'] = (df['exit_time'] - df['entry_time']).dt.total_seconds() / 60
df['days_since']    = (pd.Timestamp.today() - df['entry_time']).dt.days

Filter by Date

df[df['entry_time'] >= '2025-01-01']
df[df['entry_time'].dt.year == 2025]
df[df['entry_time'].dt.month.isin([1, 2, 3])]   # Q1 only

Type Conversion

df['amount']       = df['amount'].astype(float)
df['is_electric']  = df['is_electric'].astype(bool)
df['station_code'] = df['station_code'].astype(str)

pd.to_numeric(df['amount'], errors='coerce')   # non-numeric → NaN (safe)

errors='coerce' is the safe default — converts failures to NaN instead of raising an error, similar in spirit to TRY_CAST / SAFE_CAST in SQL dialects that support them.
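A quick self-contained sketch of the coerce behavior (toy values assumed):

```python
import pandas as pd

raw = pd.Series(['120', 'N/A', '45.5'])
clean = pd.to_numeric(raw, errors='coerce')
# '120' → 120.0, 'N/A' → NaN, '45.5' → 45.5 — bad values become NaN, no exception
```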

Common DA Workflows

1. Data Cleaning Pipeline

df = pd.read_csv('parking_history.csv', parse_dates=['entry_time', 'exit_time'])

# inspect
print(df.shape)
print(df.isnull().sum())
print(df.duplicated().sum())

# remove exact duplicates
df = df.drop_duplicates()

# fix types
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# normalize strings
df['license_plate'] = df['license_plate'].str.strip().str.upper()

# flag or drop rows with critical nulls
df = df.dropna(subset=['license_plate', 'entry_time'])

2. EDA Workflow

# overview
df.info()
df.describe()

# categorical distributions
for col in ['station_code', 'parking_type', 'payment_method']:
    print(f"\n{col}:")
    print(df[col].value_counts())

# which columns have nulls?
df.isnull().sum()[df.isnull().sum() > 0]

# numeric correlations
df.select_dtypes(include='number').corr()

3. Monthly Trend with Resample

monthly = (
    df.set_index('entry_time')
      .resample('ME')['amount']
      .agg(total='sum', count='count')
      .reset_index()
)

.resample('ME') is the Pandas equivalent of GROUP BY DATE_FORMAT(date, '%Y-%m').
Common frequency aliases: 'D' = day, 'W' = week, 'ME' = month end, 'QE' = quarter end, 'YE' = year end. ('ME'/'QE'/'YE' require pandas ≥ 2.2; older versions use 'M', 'Q', 'Y'.)

4. Window Functions — Rolling & Shift

monthly['rolling_3m']  = monthly['total'].rolling(3).mean()  # 3-month moving average
monthly['prev_month']  = monthly['total'].shift(1)           # LAG(total, 1)
monthly['next_month']  = monthly['total'].shift(-1)          # LEAD(total, 1)
monthly['mom_pct']     = monthly['total'].pct_change() * 100 # MoM % change
monthly['running_total'] = monthly['total'].cumsum()         # running total

5. De-duplication — Keep Latest per Group

df_latest = (
    df.sort_values('entry_time', ascending=False)
      .drop_duplicates(subset='license_plate', keep='first')
)

Sort descending so keep='first' retains the latest row per group. SQL equivalent: ROW_NUMBER() OVER (PARTITION BY license_plate ORDER BY entry_time DESC) = 1.
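A small sketch on toy data (plates and dates assumed) showing that only the latest row per plate survives:

```python
import pandas as pd

toy = pd.DataFrame({
    'license_plate': ['ABC-123', 'ABC-123', 'XYZ-999'],
    'entry_time': pd.to_datetime(['2025-01-01', '2025-03-01', '2025-02-15']),
})

# Newest first, then drop_duplicates keeps the first (= latest) row per plate
latest = (
    toy.sort_values('entry_time', ascending=False)
       .drop_duplicates(subset='license_plate', keep='first')
)
```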

6. Top N per Group

top3 = (
    df.sort_values('amount', ascending=False)
      .groupby('station_code', group_keys=False)
      .head(3)
)

SQL equivalent: ROW_NUMBER() OVER (PARTITION BY station_code ORDER BY amount DESC) <= 3.
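The same top-N pattern on toy data (values assumed), here keeping the top 2 per station:

```python
import pandas as pd

toy = pd.DataFrame({
    'station_code': ['A', 'A', 'A', 'B'],
    'amount':       [10, 30, 20, 5],
})

# Sort once, then .head(n) within each group keeps the n largest amounts
top2 = (
    toy.sort_values('amount', ascending=False)
       .groupby('station_code', group_keys=False)
       .head(2)
)
# station A keeps 30 and 20 (drops 10); station B keeps its only row, 5
```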

7. Conditional Column — CASE WHEN Equivalent

# pd.cut for numeric bins
df['duration_category'] = pd.cut(
    df['duration_mins'],
    bins=[0, 120, 480, float('inf')],
    labels=['Short-term', 'Medium', 'Long-term']
)

# np.select for complex multi-condition logic
conditions = [
    df['amount'] > 1000,
    df['amount'].between(500, 1000),
]
choices = ['High', 'Medium']
df['spend_tier'] = np.select(conditions, choices, default='Low')
  • pd.cut — best for numeric ranges (BETWEEN style).
  • np.select — best for multiple independent conditions (searched CASE WHEN).

8. Period-over-Period Comparison

monthly = (
    df.groupby(df['entry_time'].dt.to_period('M'))['amount']
      .sum()
      .reset_index()
)
monthly.columns = ['month', 'revenue']
monthly['prev_revenue'] = monthly['revenue'].shift(1)
monthly['mom_change']   = monthly['revenue'] - monthly['prev_revenue']
monthly['mom_pct']      = monthly['revenue'].pct_change().mul(100).round(1)

9. Anti Join — Find Unmatched Records

merged = pd.merge(
    parking_df, payment_df,
    on='parking_id', how='left', indicator=True
)
unpaid = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

10. Conditional Aggregation

summary = df.groupby('station_code').agg(
    total_visits     = ('parking_id',   'count'),
    hourly_count     = ('parking_type', lambda x: (x == 'hourly').sum()),
    credit_revenue   = ('amount',       lambda x: x[df.loc[x.index, 'payment_method'] == 'Credit_Card'].sum()),
    avg_amount       = ('amount',       'mean'),
)

Cleaner alternative — compute columns first, then aggregate:

df['is_hourly']       = df['parking_type'] == 'hourly'
df['credit_amount']   = df['amount'].where(df['payment_method'] == 'Credit_Card', 0)

summary = df.groupby('station_code').agg(
    total_visits   = ('parking_id',    'count'),
    hourly_count   = ('is_hourly',     'sum'),
    credit_revenue = ('credit_amount', 'sum'),
    avg_amount     = ('amount',        'mean'),
)

Create boolean/conditional columns first, then aggregate them — cleaner and easier to debug than lambda inside agg.
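The compute-then-aggregate pattern end to end, on toy data (all values assumed):

```python
import pandas as pd

toy = pd.DataFrame({
    'station_code':   ['A', 'A', 'B'],
    'parking_type':   ['hourly', 'monthly', 'hourly'],
    'payment_method': ['Credit_Card', 'Cash', 'Credit_Card'],
    'amount':         [100, 300, 50],
    'parking_id':     [1, 2, 3],
})

# Precompute the conditional columns, then use plain named aggregations
toy['is_hourly']     = toy['parking_type'] == 'hourly'
toy['credit_amount'] = toy['amount'].where(toy['payment_method'] == 'Credit_Card', 0)

summary = toy.groupby('station_code').agg(
    total_visits   = ('parking_id',    'count'),
    hourly_count   = ('is_hourly',     'sum'),
    credit_revenue = ('credit_amount', 'sum'),
)
# A: 2 visits, 1 hourly, 100 in credit revenue; B: 1 visit, 1 hourly, 50 in credit revenue
```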