Regex Notebook
Regular expressions (regex) describe text patterns. One syntax works in three places a DA uses every day: Python's re module, Pandas .str methods, and SQL REGEXP. Learn the pattern language once and you can clean text anywhere.
See Pandas Notebook for the rest of the .str toolbox and SQL Advanced Notebook for the SQL string functions that work alongside REGEXP.
All examples use the parking system dataset — mostly the license_plate column (format ABC-1234) and messy imported text.
import re
import pandas as pd
Core Syntax
Metacharacters
| Token | Matches |
|---|---|
. |
Any single character (except newline) |
\d |
One digit (0–9) |
\D |
One non-digit |
\w |
One word character (letter, digit, underscore) |
\W |
One non-word character |
\s |
One whitespace character (space, tab, newline) |
\S |
One non-whitespace character |
Character Classes
| Token | Matches |
|---|---|
[ABC] |
One character: A, B, or C |
[A-Z] |
One uppercase letter |
[a-z0-9] |
One lowercase letter or digit |
[^0-9] |
One character that is NOT a digit |
[-.] |
A literal dash or dot (most metacharacters lose their meaning inside [ ]) |
Avoid [A-z] — that range silently includes punctuation like [, \, and ^ that sit between Z and a in the character table. Write [A-Za-z].
Quantifiers
| Token | Meaning |
|---|---|
* |
0 or more of the previous token |
+ |
1 or more |
? |
0 or 1 (optional) |
{4} |
Exactly 4 |
{2,5} |
Between 2 and 5 |
{2,} |
2 or more |
Anchors & Boundaries
| Token | Meaning |
|---|---|
^ |
Start of the string |
$ |
End of the string |
\b |
Word boundary (edge between a word character and a non-word character) |
Without anchors, a pattern matches anywhere inside the string. \d{4} finds four digits in 'ABC-12345' too — ^\d{4}$ only matches a string that is exactly four digits.
Groups & Alternation
(ABC) # group: treat 'ABC' as one unit, and capture what it matched
(?:ABC) # non-capturing group: grouping only, nothing captured
(hourly|monthly|daily) # alternation: match any one of the three words
(?P<plate>[A-Z]{3}) # named group: capture and give it a name
Groups do two jobs at once: they bundle tokens so quantifiers apply to the whole unit, and they capture the matched text so you can extract or reuse it.
Greedy vs Lazy
text = '<b>Station A</b> and <b>Station B</b>'
re.findall(r'<b>.*</b>', text) # ['<b>Station A</b> and <b>Station B</b>'] — greedy: as MUCH as possible
re.findall(r'<b>.*?</b>', text) # ['<b>Station A</b>', '<b>Station B</b>'] — lazy: as little as possible
Quantifiers are greedy by default — they grab the longest possible match. Add ? after the quantifier (*?, +?, {2,5}?) to make it lazy. The classic symptom of a missing ?: one giant match instead of several small ones.
Reading a Pattern Step by Step
The license plate format ABC-1234:
pattern = r'^[A-Z]{3}-\d{4}$'
| Piece | Meaning |
|---|---|
^ |
Start of the string |
[A-Z]{3} |
Exactly 3 uppercase letters |
- |
A literal dash |
\d{4} |
Exactly 4 digits |
$ |
End of the string |
Read every pattern this way — left to right, one token at a time. Writing works the same: describe the format in words first, then translate each word into a token.
Python — the re Module
Find
text = 'Plate ABC-1234 entered at 09:15'
re.search(r'[A-Z]{3}-\d{4}', text) # first match anywhere → Match object (or None)
re.match(r'[A-Z]{3}-\d{4}', text) # match at the START only → None here ('Plate ...' starts the string)
re.fullmatch(r'[A-Z]{3}-\d{4}', text) # the WHOLE string must match → None here
re.findall(r'\d+', text) # every match → ['1234', '09', '15']
search scans anywhere, match anchors to the start, fullmatch anchors both ends. For validation, use fullmatch — it cannot be fooled by extra text around the value.
Use the Match Object
m = re.search(r'([A-Z]{3})-(\d{4})', text)
if m:
m.group(0) # 'ABC-1234' — the whole match
m.group(1) # 'ABC' — first capture group
m.group(2) # '1234' — second capture group
# named groups read better in longer patterns
m = re.search(r'(?P<letters>[A-Z]{3})-(?P<digits>\d{4})', text)
m.group('letters') # 'ABC'
Always check for None before calling .group() — a failed match returns None, not an empty match.
Replace & Split
re.sub(r'\s+', ' ', 'too many spaces') # 'too many spaces'
re.sub(r'(\d{4})', r'****', 'ABC-1234') # 'ABC-****' — mask the digits
re.sub(r'([A-Z]{3})-(\d{4})', r'\2-\1', 'ABC-1234') # '1234-ABC' — \1 \2 reuse the groups
re.split(r'[,;/]', 'A,B;C/D') # ['A', 'B', 'C', 'D'] — split on any of the three
Compile and Flags
plate_re = re.compile(r'^[A-Z]{3}-\d{4}$') # compile once, reuse in a loop
plate_re.fullmatch('ABC-1234')
re.search(r'north', 'North Station', re.IGNORECASE) # case-insensitive
Always write patterns as raw strings (r'...'). Without the r, Python eats the backslashes before regex ever sees them — '\d' happens to survive, '\b' silently becomes a backspace character.
Pandas — .str Methods
Pandas runs a regex against every row of a Series at once. Most .str methods accept regex by default.
Filter Rows
df[df['license_plate'].str.contains(r'^E', na=False)] # plates starting with E
df[df['license_plate'].str.fullmatch(r'[A-Z]{3}-\d{4}', na=False)] # valid format only
df[~df['license_plate'].str.fullmatch(r'[A-Z]{3}-\d{4}', na=False)] # the invalid rows — data quality check
na=False makes NaN rows evaluate to False instead of raising or propagating NaN — almost always what you want inside a filter.
Extract into New Columns
# one column per capture group
df[['plate_letters', 'plate_digits']] = (
df['license_plate'].str.extract(r'([A-Z]{3})-(\d{4})')
)
# named groups become the column names automatically
df['license_plate'].str.extract(r'(?P<letters>[A-Z]{3})-(?P<digits>\d{4})')
# every match per row (returns one row per match, MultiIndex)
df['notes'].str.extractall(r'(\d+)')
str.extract is the workhorse: pattern in, tidy columns out. Rows that do not match get NaN — filter or flag them afterwards.
Clean & Replace
df['plate_clean'] = df['license_plate'].str.replace(r'[^A-Z0-9]', '', regex=True) # keep letters+digits only
df['amount_num'] = (
df['amount_text'] # 'NT$ 1,200'
.str.replace(r'[^\d.]', '', regex=True) # '1200'
.astype(float)
)
df['masked'] = df['license_plate'].str.replace(r'\d{4}$', '****', regex=True) # 'ABC-****'
str.replace needs regex=True to treat the pattern as regex — since pandas 2.0 the default is regex=False (plain text replacement).
Split
df['station_name'].str.split(r'\s*-\s*', expand=True) # split on dash, eat surrounding spaces
SQL — REGEXP
Regex syntax inside the pattern is the same; the function names differ by database.
MySQL (8.0+)
-- filter: plates that match the standard format
SELECT license_plate
FROM parking_history
WHERE license_plate REGEXP '^[A-Z]{3}-[0-9]{4}$';
-- extract: the digit part of the plate
SELECT REGEXP_SUBSTR(license_plate, '[0-9]{4}') AS plate_digits
FROM parking_history;
-- replace: strip everything that is not a digit
SELECT REGEXP_REPLACE(amount_text, '[^0-9]', '') AS amount_clean
FROM raw_import_table;
MySQL 8 regex does not support \d / \w shorthand the way Python does — use explicit classes like [0-9], [A-Za-z0-9_], or POSIX classes like [[:digit:]].
PostgreSQL
-- filter: ~ is "matches regex", ~* is case-insensitive, !~ is "does not match"
SELECT license_plate
FROM parking_history
WHERE license_plate ~ '^[A-Z]{3}-\d{4}$';
-- extract: the first capture group
SELECT SUBSTRING(license_plate FROM '([A-Z]{3})') AS plate_letters
FROM parking_history;
-- replace ('g' flag = replace every match, not just the first)
SELECT REGEXP_REPLACE(amount_text, '[^0-9]', '', 'g') AS amount_clean
FROM raw_import_table;
MySQL vs PostgreSQL Quick Map
| Task | MySQL | PostgreSQL |
|---|---|---|
| Match test | col REGEXP 'pattern' |
col ~ 'pattern' |
| Case-insensitive match | REGEXP_LIKE(col, 'p', 'i') |
col ~* 'pattern' |
| Extract a match | REGEXP_SUBSTR(col, 'p') |
SUBSTRING(col FROM 'p') |
| Replace all matches | REGEXP_REPLACE(col, 'p', 'x') |
REGEXP_REPLACE(col, 'p', 'x', 'g') |
| Digit shorthand | [0-9] or [[:digit:]] |
\d works |
| Group backreference in replace | $1 |
\1 |
Note the PostgreSQL replace default: without the 'g' flag it replaces only the first match — the opposite of what MySQL and Python re.sub do.
Common DA Recipes
1. Parse Money Text into Numbers
# 'NT$ 1,200' / '1200元' / ' 1,200 ' → 1200.0
df['amount'] = (
df['amount_text']
.str.replace(r'[^\d.]', '', regex=True)
.replace('', pd.NA) # rows with no digits at all
.astype('Float64')
)
2. Validate a Format and Flag Bad Rows
valid = df['license_plate'].str.fullmatch(r'[A-Z]{3}-\d{4}', na=False)
df['plate_ok'] = valid
df[~valid] # review the bad rows instead of silently dropping them
3. Extract a Date Buried in Text
# 'report_20250615_final.csv' → '20250615'
df['file_date'] = pd.to_datetime(
df['filename'].str.extract(r'(\d{8})')[0],
format='%Y%m%d', errors='coerce'
)
4. Extract an Email Domain
df['domain'] = df['email'].str.extract(r'@([\w.-]+)$')
5. Normalize Messy Whitespace and Case
df['station_name'] = (
df['station_name']
.str.strip()
.str.replace(r'\s+', ' ', regex=True) # collapse runs of spaces/tabs
.str.title()
)
6. Mask Personal Data Before Sharing
df['plate_masked'] = df['license_plate'].str.replace(r'\d{4}$', '****', regex=True)
# 'ABC-1234' → 'ABC-****'
7. Find Rows Containing Any of Several Keywords
keywords = ['refund', 'error', 'dispute']
pattern = '|'.join(keywords) # 'refund|error|dispute'
df[df['notes'].str.contains(pattern, case=False, na=False)]
Common Mistakes
1. Forgetting the Raw String
re.search('\borth\b', text) # BROKEN: '\b' is a backspace character, not a boundary
re.search(r'\borth\b', text) # CORRECT: raw string keeps the backslash
Make r'...' a habit for every pattern, even ones that work without it today.
2. Unescaped Dot
re.findall(r'1.5', 'rates: 1.5, 125, 1x5') # ['1.5', '125', '1x5'] — . matches anything
re.findall(r'1\.5', 'rates: 1.5, 125, 1x5') # ['1.5'] — escaped: a literal dot
The same applies to + * ? ( ) [ ] { } ^ $. When matching them literally, escape with a backslash — or use re.escape(user_input) when the text comes from a variable.
3. Validating Without Anchors
re.search(r'[A-Z]{3}-\d{4}', 'XXABC-12345') # matches! the pattern is found INSIDE the junk
re.fullmatch(r'[A-Z]{3}-\d{4}', 'XXABC-12345') # None — whole string must match
For validation use fullmatch (or ^...$). For searching use search. Mixing them up is the most common regex bug.
4. str.contains Crashing on Special Characters
df['notes'].str.contains('cost (NTD)') # BROKEN: ( ) are regex metacharacters
df['notes'].str.contains('cost (NTD)', regex=False) # CORRECT: plain-text search
df['notes'].str.contains(re.escape('cost (NTD)')) # also correct: escaped regex
str.contains treats the pattern as regex by default. For literal text — especially anything with brackets, dots, or dollar signs — pass regex=False.
5. Capture Groups Inside str.contains
df['notes'].str.contains(r'(refund|error)') # works, but warns: "match groups"
df['notes'].str.contains(r'(?:refund|error)') # clean: non-capturing group
contains only needs a yes/no, so capturing is wasted work — pandas warns about it. Use (?: ...) when grouping inside a filter.
6. Greedy Quantifier Eating Too Much
re.sub(r'".*"', 'X', 'say "a" then "b"') # 'say X' — one greedy match ate both
re.sub(r'".*?"', 'X', 'say "a" then "b"') # 'say X then X'
7. Using Regex When a Plain Method Is Enough
df['station_name'].str.contains('North', regex=False) # simple substring — no regex needed
df['license_plate'].str[:3] # first 3 chars — slicing beats extract
Regex is for patterns. For fixed text, plain string methods are faster and impossible to get wrong. Reach for regex when the text varies in a way you can describe but not enumerate.
Where to use this next: the cleaning pipeline in Pandas Notebook (recipe 1) is where most of these patterns earn their keep, and SQL Advanced Notebook covers the string functions that pair with REGEXP when a pattern alone is not enough.