Python Course
Regular Expressions
Every application that handles text — web forms, log files, search engines, data pipelines — eventually needs to find patterns rather than exact strings. Regular expressions (regex) are a compact language for describing those patterns. Python's built-in re module puts the full power of regex at your fingertips in just a few function calls.
This lesson builds from basic matching all the way to groups, substitutions, and flags — the complete toolkit you need to handle real-world text confidently.
Importing the re Module
Everything in this lesson comes from Python's standard library re module — no installation needed.
import re  # built-in, always available, nothing to install
The Core Functions
The re module provides a small set of functions that cover almost every use case. Understanding what each one returns is the key to using them correctly.
- re.match(pattern, string): checks for a match only at the beginning of the string
- re.search(pattern, string): scans the entire string and returns the first match anywhere
- re.findall(pattern, string): returns a list of all non-overlapping matches
- re.finditer(pattern, string): returns an iterator of match objects for all matches
- re.sub(pattern, replacement, string): replaces all matches with a new string
- re.split(pattern, string): splits the string at every match
- re.compile(pattern): compiles a pattern into a reusable regex object
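As a quick orientation, the sketch below calls each function once on the same short string so the return types can be compared side by side. The sample text and variable names are illustrative only.

```python
import re

text = "cat bat rat"

first_word = re.match(r"\w+", text)      # match object (or None), anchored at index 0
first_b = re.search(r"b\w+", text)       # match object (or None), found anywhere
all_words = re.findall(r"\wat", text)    # list of strings
starts = [m.start() for m in re.finditer(r"\wat", text)]  # iterate over match objects
replaced = re.sub(r"at", "ow", text)     # new string
pieces = re.split(r"\s", text)           # list of substrings
word_re = re.compile(r"\wat")            # reusable pattern object

print(first_word.group())     # cat
print(first_b.group())        # bat
print(all_words)              # ['cat', 'bat', 'rat']
print(starts)                 # [0, 4, 8]
print(replaced)               # cow bow row
print(pieces)                 # ['cat', 'bat', 'rat']
print(word_re.findall(text))  # ['cat', 'bat', 'rat']
```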
Your First Pattern — re.search and re.match
Why it exists: most text tasks require finding whether something is present before doing anything with it. re.search is the most flexible starting point — it scans the whole string.
Real-world use: validating that a submitted form field contains an email address, phone number, or ZIP code before accepting the input.
# re.search — find a pattern anywhere in a string
# re.match — only checks from the very start
import re
text = "Order #4821 was shipped on 2024-03-15"
# Search anywhere in the string
result = re.search(r"\d+", text) # \d+ means one or more digits
if result:
    print("Found:", result.group())   # .group() returns the matched text
    print("Position:", result.start(), "to", result.end())
# match only checks the START, so it won't find the number here
m = re.match(r"\d+", text)
print("match result:", m)   # None: the string starts with "Order", not a digit
Output:
Found: 4821
Position: 7 to 11
match result: None
- Always use if result: before calling .group(), because the function returns None if nothing matches
- .group() returns the full matched string
- .start() and .end() return the index positions of the match
- Use raw strings r"..." for patterns, so that Python does not treat the backslashes as escape sequences
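For the form-validation use case mentioned above, the difference between matching part of a string and matching all of it matters. The sketch below uses re.fullmatch, which succeeds only when the entire string fits the pattern. The ZIP code rule and the is_zip helper are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical rule for illustration: 5 digits, optionally
# followed by a hyphen and 4 more digits.
ZIP_PATTERN = r"\d{5}(-\d{4})?"

def is_zip(value):
    # fullmatch succeeds only if the WHOLE string fits the pattern
    return re.fullmatch(ZIP_PATTERN, value) is not None

print(is_zip("90210"))         # True
print(is_zip("90210-1234"))    # True
print(is_zip("zip 90210"))     # False
print(is_zip("90210abc"))      # False

# search accepts digits found anywhere; match accepts trailing junk
print(re.search(ZIP_PATTERN, "zip 90210") is not None)   # True
print(re.match(ZIP_PATTERN, "90210abc") is not None)     # True
```

For strict validation, re.search and re.match usually need explicit ^ and $ anchors to behave like re.fullmatch.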
Pattern Syntax — The Building Blocks
Regex patterns are built from special characters called metacharacters. Learning these unlocks everything else.
| Pattern | Meaning | Example Match |
|---|---|---|
| \d | Any digit 0–9 | 5, 9 |
| \w | Word character (letter, digit, underscore) | a, 3, _ |
| \s | Any whitespace (space, tab, newline) | space, \t |
| . | Any character except newline | a, !, 5 |
| ^ | Start of string | ^Hello matches "Hello world" |
| $ | End of string | end$ matches "the end" |
| * | 0 or more of the preceding | ab* matches a, ab, abb |
| + | 1 or more of the preceding | \d+ matches 1, 42, 999 |
| ? | 0 or 1 of the preceding (optional) | colou?r matches color and colour |
| {n} | Exactly n repetitions | \d{4} matches 2024 |
| {n,m} | Between n and m repetitions | \d{2,4} matches 12, 123, 1234 |
| [abc] | Any one character from the set | [aeiou] matches any vowel |
| [^abc] | Any character NOT in the set | [^0-9] matches any non-digit |
| a\|b | Either a or b | cat\|dog matches cat or dog |
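A few of the table rows can be checked directly with re.findall and re.search. The sample strings below are made up for illustration.

```python
import re

print(re.findall(r"colou?r", "color colour colr"))    # ['color', 'colour']
print(re.findall(r"ab*", "a ab abbb"))                # ['a', 'ab', 'abbb']
print(re.findall(r"\d{2,4}", "7 42 2024"))            # ['42', '2024']
print(re.findall(r"[aeiou]", "regex"))                # ['e', 'e']
print(re.findall(r"[^0-9\s]+", "room 101 floor 3"))   # ['room', 'floor']
print(re.findall(r"cat|dog", "cat, dog, bird"))       # ['cat', 'dog']
print(bool(re.search(r"^Hello", "Hello world")))      # True
print(bool(re.search(r"end$", "the end")))            # True
```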
Finding All Matches — re.findall
re.findall() is probably the function you will use most often. It scans the entire string and returns a plain list of every match — no match objects, no loops needed.
Real-world use: extracting all dollar amounts from a financial report, all dates from a log file, or all hashtags from a social media post.
# re.findall — returns a list of all matches
import re
log = "Errors on 2024-01-05, 2024-02-18, and 2024-03-22"
# Extract all dates in YYYY-MM-DD format
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log)
print("Dates:", dates)
receipt = "Total: $12.99 Tax: $1.04 Tip: $2.50"
# Extract all dollar amounts
amounts = re.findall(r"\$\d+\.\d{2}", receipt)
print("Amounts:", amounts)
Output:
Dates: ['2024-01-05', '2024-02-18', '2024-03-22']
Amounts: ['$12.99', '$1.04', '$2.50']
- re.findall() returns an empty list [] if nothing matches, so it is safe to use without an if check
- If the pattern contains one capturing group, findall returns a list of that group's text; with two or more groups it returns a list of tuples instead of whole-match strings
- Use re.finditer() instead when you need match positions alongside the matched text
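The effect of groups on findall's return value, and the extra information finditer provides, can be seen in a short sketch that reuses the log string from the example above. The variable names are illustrative.

```python
import re

log = "Errors on 2024-01-05, 2024-02-18, and 2024-03-22"

# No groups: a list of whole matches
whole = re.findall(r"\d{4}-\d{2}-\d{2}", log)
# One group: a list of that group's text only
years = re.findall(r"(\d{4})-\d{2}-\d{2}", log)
# Several groups: a list of tuples, one tuple per match
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", log)
# finditer yields match objects, so positions are available
spans = [(m.group(), m.start(), m.end())
         for m in re.finditer(r"\d{4}-\d{2}-\d{2}", log)]

print(whole)   # ['2024-01-05', '2024-02-18', '2024-03-22']
print(years)   # ['2024', '2024', '2024']
print(parts)   # [('2024', '01', '05'), ('2024', '02', '18'), ('2024', '03', '22')]
print(spans)   # [('2024-01-05', 10, 20), ('2024-02-18', 22, 32), ('2024-03-22', 38, 48)]
```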
Capturing Groups
Wrapping part of a pattern in parentheses creates a group. Groups let you extract specific portions of a match rather than the whole thing. This is one of the most powerful features in regex.
Real-world use: parsing a log line to capture the timestamp, log level, and message as separate fields rather than the whole raw line.
# Capturing groups — extract parts of a match
import re
# Extract year, month, day separately from a date
date_str = "Invoice date: 2024-07-19"
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", date_str)
if m:
    print("Full match:", m.group(0))   # entire match
    print("Year:", m.group(1))         # first group
    print("Month:", m.group(2))        # second group
    print("Day:", m.group(3))          # third group
# Named groups make the intent even clearer
m2 = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date_str)
if m2:
    print("Named year:", m2.group("year"))
    print("Named month:", m2.group("month"))
Output:
Full match: 2024-07-19
Year: 2024
Month: 07
Day: 19
Named year: 2024
Named month: 07
- group(0) or group() always returns the entire match
- group(1), group(2), etc. return individual capturing groups, numbered left to right
- Named groups (?P<name>...) let you reference captures by name instead of number
- m.groups() returns all groups as a tuple in one call
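The log-parsing use case mentioned earlier can be sketched with named groups and groupdict(), which returns the named captures as a dictionary. The log line format here is hypothetical, invented for this example.

```python
import re

# Hypothetical log format for illustration: "<date> <time> <LEVEL> <message>"
line = "2024-03-15 08:42:10 ERROR Disk quota exceeded"

LOG_RE = (
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

m = re.search(LOG_RE, line)
if m:
    fields = m.groupdict()        # named groups as a dict
    print(fields["timestamp"])    # 2024-03-15 08:42:10
    print(fields["level"])        # ERROR
    print(fields["message"])      # Disk quota exceeded
    print(m.groups())             # the same three values as a tuple
```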
Substitution — re.sub
re.sub() finds all matches and replaces them with a new string. It is the regex equivalent of str.replace() but with the full power of pattern matching.
Real-world use: sanitizing user input by replacing phone numbers, emails, or credit card numbers with masked placeholders before storing or displaying them.
# re.sub — replace all pattern matches
import re
# Mask all digits in a string (e.g. hide sensitive numbers)
text = "Call us at 555-867-5309 or 555-246-1357"
masked = re.sub(r"\d", "*", text)
print(masked)
# Normalize spacing — replace multiple spaces with a single space
messy = "Name:   Alice     Age:   30"
clean = re.sub(r"\s+", " ", messy)
print(clean)
# Replace with a backreference — wrap matched text in brackets
tagged = re.sub(r"(\d+)", r"[\1]", "Order 4821 has 3 items")
print(tagged)
Output:
Call us at ***-***-**** or ***-***-****
Name: Alice Age: 30
Order [4821] has [3] items
- re.sub(pattern, replacement, string) returns a new string; the original is unchanged
- Use \1, \2 in a raw-string replacement to insert captured group content back
- Pass count=n as a keyword argument to limit the number of replacements
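The sketch below shows the count argument, a backreference that keeps part of the match, and one further option: re.sub also accepts a function as the replacement, which receives each match object and returns the replacement text. The double helper is a made-up name for illustration.

```python
import re

text = "Call us at 555-867-5309 or 555-246-1357"

# count=1 replaces only the first match
first_only = re.sub(r"\d{3}-\d{3}-\d{4}", "[phone]", text, count=1)
print(first_only)   # Call us at [phone] or 555-246-1357

# Mask everything except the last four digits, kept via \1
last_four = re.sub(r"\d{3}-\d{3}-(\d{4})", r"***-***-\1", text)
print(last_four)    # Call us at ***-***-5309 or ***-***-1357

# A function replacement computes each substitution from the match
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "Order 4821 has 3 items")
print(doubled)      # Order 9642 has 6 items
```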
Splitting — re.split
re.split() divides a string at every match of the pattern, just like str.split() — but with regex power to split on complex delimiters.
Real-world use: splitting a sentence on any punctuation mark, or splitting a CSV-like string that uses inconsistent delimiters such as commas, semicolons, or tabs.
# re.split — split on a pattern instead of a fixed character
import re
# Split on any punctuation followed by optional whitespace
sentence = "First.Second! Third? Fourth"
parts = re.split(r"[.!?]\s*", sentence)
print(parts)
# Split on one or more whitespace characters (spaces, tabs, newlines)
data = "alice bob\tcarol\ndave"
names = re.split(r"\s+", data)
print(names)
Output:
['First', 'Second', 'Third', 'Fourth']
['alice', 'bob', 'carol', 'dave']
- re.split() returns a list, just like str.split()
- If the pattern contains a capturing group, the matched delimiters are included in the result list
- Pass maxsplit=n to limit how many splits occur
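The three behaviors listed above can be sketched as follows: mixed delimiters, a capturing group that keeps the delimiters, and maxsplit. The sample strings are made up for illustration.

```python
import re

# Inconsistent delimiters: comma, semicolon, or tab, with optional spaces
record = "alice, bob;carol\tdave"
fields = re.split(r"[,;\t]\s*", record)
print(fields)        # ['alice', 'bob', 'carol', 'dave']

# A capturing group keeps the delimiters in the result
with_delims = re.split(r"([.!?])\s*", "First. Second! Third")
print(with_delims)   # ['First', '.', 'Second', '!', 'Third']

# maxsplit=1 splits once and leaves the remainder intact
head_tail = re.split(r"\s+", "alice bob carol dave", maxsplit=1)
print(head_tail)     # ['alice', 'bob carol dave']
```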
Compiling Patterns — re.compile
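If you use the same pattern many times, in a loop or across many strings, compile it once with re.compile(). The compiled object holds the parsed pattern, so repeated calls skip the pattern lookup that the module-level functions perform each time. Python also caches recently used patterns internally, so the speed gain is usually modest; the bigger benefit is a named, reusable pattern object.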
# re.compile — compile once, reuse many times
import re
# Compile the pattern once
email_pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w{2,}")
emails = [
    "contact@dataplexa.com",
    "not-an-email",
    "support@example.org",
    "hello@",
]
for addr in emails:
    if email_pattern.search(addr):   # reuse the compiled pattern
        print("Valid:", addr)
    else:
        print("Invalid:", addr)
Output:
Valid: contact@dataplexa.com
Invalid: not-an-email
Valid: support@example.org
Invalid: hello@
- re.compile() returns a pattern object with the same methods: .search(), .findall(), .sub(), etc.
- Compiling is especially worthwhile inside loops or functions called frequently
- Store compiled patterns as module-level constants when reused across a whole file
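One way the module-level constant advice might look in practice is sketched below. The names DATE_RE, extract_dates, and to_us_format are hypothetical, chosen for this example.

```python
import re

# Compiled once when the module is imported; DATE_RE is an illustrative name
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def extract_dates(line):
    # Pattern objects expose the same methods as the re module
    return DATE_RE.findall(line)

def to_us_format(line):
    # Reorder the captured groups: YYYY-MM-DD becomes MM/DD/YYYY
    return DATE_RE.sub(r"\2/\3/\1", line)

lines = ["Shipped 2024-03-15", "No date here", "Due 2024-07-19, paid 2024-07-01"]
for line in lines:
    print(extract_dates(line), "->", to_us_format(line))
```

Both helpers share one compiled pattern, so a change to the date format only has to be made in one place.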
Flags — Modifying Match Behavior
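Flags change how a pattern is interpreted. Pass them through the flags argument of any re function, or include them directly in the pattern string as an inline flag such as (?i). For re.search, re.match, re.findall, and re.finditer, flags is the third positional argument; for re.sub and re.split the slot after the string is count or maxsplit, so pass flags=... by keyword there.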
# Common flags — re.IGNORECASE and re.MULTILINE
import re
text = "Python is great. PYTHON is fast. python is fun."
# Case-insensitive matching
matches = re.findall(r"python", text, re.IGNORECASE)
print(matches) # finds all three regardless of case
# re.MULTILINE — ^ and $ match start/end of each LINE
multi = "first line\nsecond line\nthird line"
starts = re.findall(r"^\w+", multi, re.MULTILINE)
print(starts)   # first word of every line
Output:
['Python', 'PYTHON', 'python']
['first', 'second', 'third']
- re.IGNORECASE (or re.I): case-insensitive matching
- re.MULTILINE (or re.M): ^ and $ match at line boundaries, not just string boundaries
- re.DOTALL (or re.S): makes . match newlines as well
- Combine flags with |: re.IGNORECASE | re.MULTILINE
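The remaining flag behaviors (combining flags, re.DOTALL, inline flags, and the flags= keyword for re.sub) can be sketched in one place. The sample strings are invented for illustration.

```python
import re

text = "Error: disk full\nerror: retry\nWARNING: low memory"

# Combined flags: case-insensitive AND ^ anchored at each line start
errors = re.findall(r"^error", text, re.IGNORECASE | re.MULTILINE)
print(errors)         # ['Error', 'error']

# Without DOTALL, . stops at the newline, so this pattern cannot match
block = "<p>line one\nline two</p>"
no_dotall = re.search(r"<p>.*</p>", block)
with_dotall = re.search(r"<p>.*</p>", block, re.DOTALL)
print(no_dotall)                  # None
print(with_dotall is not None)    # True

# Inline flag: (?i) at the start of the pattern acts like re.IGNORECASE
inline = re.findall(r"(?i)warning", text)
print(inline)         # ['WARNING']

# re.sub takes count before flags, so pass flags by keyword
cleaned = re.sub(r"^error: ", "", text, flags=re.IGNORECASE | re.MULTILINE)
print(cleaned)
```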
Summary Table
| Function | Returns | Best Used For |
|---|---|---|
| re.search() | First match object or None | Checking if a pattern exists anywhere |
| re.match() | Match object at start or None | Validating string format from the start |
| re.findall() | List of all matches | Extracting all occurrences at once |
| re.finditer() | Iterator of match objects | Extracting matches with positions |
| re.sub() | New string with replacements | Cleaning, masking, reformatting text |
| re.split() | List of substrings | Splitting on complex delimiters |
| re.compile() | Compiled pattern object | Reusing the same pattern efficiently |
Practice Questions
Practice 1. Which re function scans the entire string and returns only the first match?
Practice 2. What regex pattern matches one or more digits?
Practice 3. What type does re.findall() always return?
Practice 4. What flag makes a regex pattern case-insensitive?
Practice 5. What string prefix should you use when writing regex patterns in Python to avoid backslash issues?
Quiz
Quiz 1. What is the difference between re.match() and re.search()?
Quiz 2. What does the pattern \d{4}-\d{2}-\d{2} match?
Quiz 3. What does re.sub(r"\s+", " ", text) do?
Quiz 4. In a regex pattern, what does ? mean when placed after a character?
Quiz 5. Why is re.compile() recommended when using the same pattern repeatedly?
Next up — Iterators explains how Python's for loops actually work under the hood, and teaches you to build your own iterable objects using __iter__ and __next__.