Regex Notebook

Regular expressions (regex) describe text patterns. One syntax works in three places a DA uses every day: Python's re module, Pandas .str methods, and SQL REGEXP. Learn the pattern language once and you can clean text anywhere.

See Pandas Notebook for the rest of the .str toolbox and SQL Advanced Notebook for the SQL string functions that work alongside REGEXP.

All examples use the parking system dataset — mostly the license_plate column (format ABC-1234) and messy imported text.

import re
import pandas as pd

Core Syntax

Metacharacters

Token Matches
. Any single character (except newline)
\d One digit (0–9)
\D One non-digit
\w One word character (letter, digit, underscore)
\W One non-word character
\s One whitespace character (space, tab, newline)
\S One non-whitespace character

Character Classes

Token Matches
[ABC] One character: A, B, or C
[A-Z] One uppercase letter
[a-z0-9] One lowercase letter or digit
[^0-9] One character that is NOT a digit
[-.] A literal dash or dot (most metacharacters lose their meaning inside [ ])

Avoid [A-z] — that range silently includes punctuation like [, \, and ^ that sit between Z and a in the character table. Write [A-Za-z].

Quantifiers

Token Meaning
* 0 or more of the previous token
+ 1 or more
? 0 or 1 (optional)
{4} Exactly 4
{2,5} Between 2 and 5
{2,} 2 or more

Anchors & Boundaries

Token Meaning
^ Start of the string
$ End of the string
\b Word boundary (edge between a word character and a non-word character)

Without anchors, a pattern matches anywhere inside the string. \d{4} finds four digits in 'ABC-12345' too — ^\d{4}$ only matches a string that is exactly four digits.

Groups & Alternation

(ABC)        # group: treat 'ABC' as one unit, and capture what it matched
(?:ABC)      # non-capturing group: grouping only, nothing captured
(hourly|monthly|daily)   # alternation: match any one of the three words
(?P<plate>[A-Z]{3})      # named group: capture and give it a name

Groups do two jobs at once: they bundle tokens so quantifiers apply to the whole unit, and they capture the matched text so you can extract or reuse it.

Greedy vs Lazy

text = '<b>Station A</b> and <b>Station B</b>'

re.findall(r'<b>.*</b>', text)    # ['<b>Station A</b> and <b>Station B</b>']  — greedy: as MUCH as possible
re.findall(r'<b>.*?</b>', text)   # ['<b>Station A</b>', '<b>Station B</b>']   — lazy: as little as possible

Quantifiers are greedy by default — they grab the longest possible match. Add ? after the quantifier (*?, +?, {2,5}?) to make it lazy. The classic symptom of a missing ?: one giant match instead of several small ones.

Reading a Pattern Step by Step

The license plate format ABC-1234:

pattern = r'^[A-Z]{3}-\d{4}$'
Piece Meaning
^ Start of the string
[A-Z]{3} Exactly 3 uppercase letters
- A literal dash
\d{4} Exactly 4 digits
$ End of the string

Read every pattern this way — left to right, one token at a time. Writing works the same: describe the format in words first, then translate each word into a token.

Python — the re Module

Find

text = 'Plate ABC-1234 entered at 09:15'

re.search(r'[A-Z]{3}-\d{4}', text)      # first match anywhere → Match object (or None)
re.match(r'[A-Z]{3}-\d{4}', text)       # match at the START only → None here ('Plate ...' starts the string)
re.fullmatch(r'[A-Z]{3}-\d{4}', text)   # the WHOLE string must match → None here
re.findall(r'\d+', text)                # every match → ['1234', '09', '15']

search scans anywhere, match anchors to the start, fullmatch anchors both ends. For validation, use fullmatch — it cannot be fooled by extra text around the value.

Use the Match Object

m = re.search(r'([A-Z]{3})-(\d{4})', text)
if m:
    m.group(0)    # 'ABC-1234'  — the whole match
    m.group(1)    # 'ABC'       — first capture group
    m.group(2)    # '1234'      — second capture group

# named groups read better in longer patterns
m = re.search(r'(?P<letters>[A-Z]{3})-(?P<digits>\d{4})', text)
m.group('letters')    # 'ABC'

Always check for None before calling .group() — a failed match returns None, not an empty match.

Replace & Split

re.sub(r'\s+', ' ', 'too   many    spaces')        # 'too many spaces'
re.sub(r'(\d{4})', r'****', 'ABC-1234')            # 'ABC-****'  — mask the digits
re.sub(r'([A-Z]{3})-(\d{4})', r'\2-\1', 'ABC-1234')  # '1234-ABC' — \1 \2 reuse the groups

re.split(r'[,;/]', 'A,B;C/D')                      # ['A', 'B', 'C', 'D'] — split on any of the three

Compile and Flags

plate_re = re.compile(r'^[A-Z]{3}-\d{4}$')   # compile once, reuse in a loop
plate_re.fullmatch('ABC-1234')

re.search(r'north', 'North Station', re.IGNORECASE)   # case-insensitive

Always write patterns as raw strings (r'...'). Without the r, Python eats the backslashes before regex ever sees them — '\d' happens to survive, '\b' silently becomes a backspace character.

Pandas — .str Methods

Pandas runs a regex against every row of a Series at once. Most .str methods accept regex by default.

Filter Rows

df[df['license_plate'].str.contains(r'^E', na=False)]            # plates starting with E
df[df['license_plate'].str.fullmatch(r'[A-Z]{3}-\d{4}', na=False)]   # valid format only
df[~df['license_plate'].str.fullmatch(r'[A-Z]{3}-\d{4}', na=False)]  # the invalid rows — data quality check

na=False makes NaN rows evaluate to False instead of raising or propagating NaN — almost always what you want inside a filter.

Extract into New Columns

# one column per capture group
df[['plate_letters', 'plate_digits']] = (
    df['license_plate'].str.extract(r'([A-Z]{3})-(\d{4})')
)

# named groups become the column names automatically
df['license_plate'].str.extract(r'(?P<letters>[A-Z]{3})-(?P<digits>\d{4})')

# every match per row (returns one row per match, MultiIndex)
df['notes'].str.extractall(r'(\d+)')

str.extract is the workhorse: pattern in, tidy columns out. Rows that do not match get NaN — filter or flag them afterwards.

Clean & Replace

df['plate_clean'] = df['license_plate'].str.replace(r'[^A-Z0-9]', '', regex=True)  # keep letters+digits only
df['amount_num']  = (
    df['amount_text']                          # 'NT$ 1,200'
      .str.replace(r'[^\d.]', '', regex=True)  # '1200'
      .astype(float)
)
df['masked'] = df['license_plate'].str.replace(r'\d{4}$', '****', regex=True)      # 'ABC-****'

str.replace needs regex=True to treat the pattern as regex — since pandas 2.0 the default is regex=False (plain text replacement).

Split

df['station_name'].str.split(r'\s*-\s*', expand=True)   # split on dash, eat surrounding spaces

SQL — REGEXP

Regex syntax inside the pattern is the same; the function names differ by database.

MySQL (8.0+)

-- filter: plates that match the standard format
SELECT license_plate
FROM parking_history
WHERE license_plate REGEXP '^[A-Z]{3}-[0-9]{4}$';

-- extract: the digit part of the plate
SELECT REGEXP_SUBSTR(license_plate, '[0-9]{4}') AS plate_digits
FROM parking_history;

-- replace: strip everything that is not a digit
SELECT REGEXP_REPLACE(amount_text, '[^0-9]', '') AS amount_clean
FROM raw_import_table;

MySQL 8 regex does not support \d / \w shorthand the way Python does — use explicit classes like [0-9], [A-Za-z0-9_], or POSIX classes like [[:digit:]].

PostgreSQL

-- filter: ~ is "matches regex", ~* is case-insensitive, !~ is "does not match"
SELECT license_plate
FROM parking_history
WHERE license_plate ~ '^[A-Z]{3}-\d{4}$';

-- extract: the first capture group
SELECT SUBSTRING(license_plate FROM '([A-Z]{3})') AS plate_letters
FROM parking_history;

-- replace ('g' flag = replace every match, not just the first)
SELECT REGEXP_REPLACE(amount_text, '[^0-9]', '', 'g') AS amount_clean
FROM raw_import_table;

MySQL vs PostgreSQL Quick Map

Task MySQL PostgreSQL
Match test col REGEXP 'pattern' col ~ 'pattern'
Case-insensitive match REGEXP_LIKE(col, 'p', 'i') col ~* 'pattern'
Extract a match REGEXP_SUBSTR(col, 'p') SUBSTRING(col FROM 'p')
Replace all matches REGEXP_REPLACE(col, 'p', 'x') REGEXP_REPLACE(col, 'p', 'x', 'g')
Digit shorthand [0-9] or [[:digit:]] \d works
Group backreference in replace $1 \1

Note the PostgreSQL replace default: without the 'g' flag it replaces only the first match — the opposite of what MySQL and Python re.sub do.

Common DA Recipes

1. Parse Money Text into Numbers

# 'NT$ 1,200' / '1200元' / ' 1,200 ' → 1200.0
df['amount'] = (
    df['amount_text']
      .str.replace(r'[^\d.]', '', regex=True)
      .replace('', pd.NA)          # rows with no digits at all
      .astype('Float64')
)

2. Validate a Format and Flag Bad Rows

valid = df['license_plate'].str.fullmatch(r'[A-Z]{3}-\d{4}', na=False)
df['plate_ok'] = valid
df[~valid]            # review the bad rows instead of silently dropping them

3. Extract a Date Buried in Text

# 'report_20250615_final.csv' → '20250615'
df['file_date'] = pd.to_datetime(
    df['filename'].str.extract(r'(\d{8})')[0],
    format='%Y%m%d', errors='coerce'
)

4. Extract an Email Domain

df['domain'] = df['email'].str.extract(r'@([\w.-]+)$')

5. Normalize Messy Whitespace and Case

df['station_name'] = (
    df['station_name']
      .str.strip()
      .str.replace(r'\s+', ' ', regex=True)   # collapse runs of spaces/tabs
      .str.title()
)

6. Mask Personal Data Before Sharing

df['plate_masked'] = df['license_plate'].str.replace(r'\d{4}$', '****', regex=True)
# 'ABC-1234' → 'ABC-****'

7. Find Rows Containing Any of Several Keywords

keywords = ['refund', 'error', 'dispute']
pattern  = '|'.join(keywords)                 # 'refund|error|dispute'
df[df['notes'].str.contains(pattern, case=False, na=False)]

Common Mistakes

1. Forgetting the Raw String

re.search('\borth\b', text)     # BROKEN: '\b' is a backspace character, not a boundary
re.search(r'\borth\b', text)    # CORRECT: raw string keeps the backslash

Make r'...' a habit for every pattern, even ones that work without it today.

2. Unescaped Dot

re.findall(r'1.5', 'rates: 1.5, 125, 1x5')    # ['1.5', '125', '1x5'] — . matches anything
re.findall(r'1\.5', 'rates: 1.5, 125, 1x5')   # ['1.5'] — escaped: a literal dot

The same applies to + * ? ( ) [ ] { } ^ $. When matching them literally, escape with a backslash — or use re.escape(user_input) when the text comes from a variable.

3. Validating Without Anchors

re.search(r'[A-Z]{3}-\d{4}', 'XXABC-12345')      # matches! the pattern is found INSIDE the junk
re.fullmatch(r'[A-Z]{3}-\d{4}', 'XXABC-12345')   # None — whole string must match

For validation use fullmatch (or ^...$). For searching use search. Mixing them up is the most common regex bug.

4. str.contains Crashing on Special Characters

df['notes'].str.contains('cost (NTD)')                 # BROKEN: ( ) are regex metacharacters
df['notes'].str.contains('cost (NTD)', regex=False)    # CORRECT: plain-text search
df['notes'].str.contains(re.escape('cost (NTD)'))      # also correct: escaped regex

str.contains treats the pattern as regex by default. For literal text — especially anything with brackets, dots, or dollar signs — pass regex=False.

5. Capture Groups Inside str.contains

df['notes'].str.contains(r'(refund|error)')     # works, but warns: "match groups"
df['notes'].str.contains(r'(?:refund|error)')   # clean: non-capturing group

contains only needs a yes/no, so capturing is wasted work — pandas warns about it. Use (?: ...) when grouping inside a filter.

6. Greedy Quantifier Eating Too Much

re.sub(r'".*"', 'X', 'say "a" then "b"')     # 'say X' — one greedy match ate both
re.sub(r'".*?"', 'X', 'say "a" then "b"')    # 'say X then X'

7. Using Regex When a Plain Method Is Enough

df['station_name'].str.contains('North', regex=False)   # simple substring — no regex needed
df['license_plate'].str[:3]                              # first 3 chars — slicing beats extract

Regex is for patterns. For fixed text, plain string methods are faster and impossible to get wrong. Reach for regex when the text varies in a way you can describe but not enumerate.


Where to use this next: the cleaning pipeline in Pandas Notebook (recipe 1) is where most of these patterns earn their keep, and SQL Advanced Notebook covers the string functions that pair with REGEXP when a pattern alone is not enough.