Python Lesson 24 – Regular Expressions | Dataplexa

Regular Expressions

Every application that handles text — web forms, log files, search engines, data pipelines — eventually needs to find patterns rather than exact strings. Regular expressions (regex) are a compact language for describing those patterns. Python's built-in re module puts the full power of regex at your fingertips in just a few function calls.

This lesson builds from basic matching all the way to groups, substitutions, and flags — the complete toolkit you need to handle real-world text confidently.

Importing the re Module

Everything in this lesson comes from Python's standard library re module — no installation needed.

import re   # built-in — always available, nothing to install

The Core Functions

The re module provides a small set of functions that cover almost every use case. Understanding what each one returns is the key to using them correctly.

  • re.match(pattern, string) — checks for a match only at the beginning of the string
  • re.search(pattern, string) — scans the entire string and returns the first match anywhere
  • re.findall(pattern, string) — returns a list of all non-overlapping matches
  • re.finditer(pattern, string) — returns an iterator of match objects for all matches
  • re.sub(pattern, replacement, string) — replaces all matches with a new string
  • re.split(pattern, string) — splits the string at every match
  • re.compile(pattern) — compiles a pattern into a reusable regex object

Your First Pattern — re.search and re.match

Why it exists: most text tasks require finding whether something is present before doing anything with it. re.search is the most flexible starting point — it scans the whole string.

Real-world use: validating that a submitted form field contains an email address, phone number, or ZIP code before accepting the input.

# re.search — find a pattern anywhere in a string
# re.match — only checks from the very start

import re

text = "Order #4821 was shipped on 2024-03-15"

# Search anywhere in the string
result = re.search(r"\d+", text)   # \d+ means one or more digits

if result:
    print("Found:", result.group())   # .group() returns the matched text
    print("Position:", result.start(), "to", result.end())

# match only checks the START — won't find the number here
m = re.match(r"\d+", text)
print("match result:", m)   # None — string starts with "Order", not a digit

Output:
Found: 4821
Position: 7 to 11
match result: None

  • Always use if result: before calling .group() — the function returns None if nothing matches
  • .group() returns the full matched string
  • .start() and .end() return the index positions of the match
  • Use raw strings r"..." for patterns so that Python passes backslashes such as \d to the regex engine unchanged instead of treating them as string escape sequences
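
The effect of the r prefix is easiest to see with a sequence that Python itself understands. The sketch below uses \b, which is not covered in this lesson's table: inside a normal string it is a backspace character, while in a regex it marks a word boundary. The variable names are illustrative only.

```python
import re

plain = "\bcat\b"    # Python converts each \b into a single backspace character
raw = r"\bcat\b"     # the backslashes reach the regex engine untouched

print(len(plain), len(raw))                   # 5 7

text = "the cat sat"
print(re.search(raw, text) is not None)       # True: \b acts as a word boundary
print(re.search(plain, text) is not None)     # False: the pattern looks for backspace characters
```

Sequences such as \d happen to survive in a normal string, but recent Python versions warn about them, so writing every pattern as a raw string is the simpler habit.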

Pattern Syntax — The Building Blocks

Regex patterns are built from special characters called metacharacters. Learning these unlocks everything else.

Pattern   Meaning                                       Example Match
\d        Any digit 0–9                                 5, 9
\w        Word character (letter, digit, underscore)    a, 3, _
\s        Any whitespace (space, tab, newline)          " ", \t
.         Any character except newline                  a, !, 5
^         Start of string                               ^Hello matches "Hello world"
$         End of string                                 end$ matches "the end"
*         0 or more of the preceding                    ab* matches a, ab, abb
+         1 or more of the preceding                    \d+ matches 1, 42, 999
?         0 or 1 of the preceding (optional)            colou?r matches color and colour
{n}       Exactly n repetitions                         \d{4} matches 2024
{n,m}     Between n and m repetitions                   \d{2,4} matches 12, 123, 1234
[abc]     Any one character from the set                [aeiou] matches any vowel
[^abc]    Any character NOT in the set                  [^0-9] matches any non-digit
a|b       Either a or b                                 cat|dog matches cat or dog
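
A few rows of the table can be tried out directly. This is a small sketch with made-up sample strings, not an exhaustive tour of the syntax.

```python
import re

# ? makes the preceding character optional; ^ and $ pin the pattern to the whole string
samples = ["color", "colour", "colouur"]
spellings = [s for s in samples if re.search(r"^colou?r$", s)]
print(spellings)     # ['color', 'colour']

# {n,m} takes as many repetitions as it can, up to m
runs = re.findall(r"\d{2,4}", "7 12 123 12345")
print(runs)          # ['12', '123', '1234']

# [^...] negates a set; + groups consecutive non-digits into one match
letters = re.findall(r"[^0-9]+", "a1b22c")
print(letters)       # ['a', 'b', 'c']

# | matches either alternative wherever it occurs, including inside longer words
pets = re.findall(r"cat|dog", "cat, dog, catalog")
print(pets)          # ['cat', 'dog', 'cat']
```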

Finding All Matches — re.findall

re.findall() is probably the function you will use most often. It scans the entire string and returns a plain list of every match — no match objects, no loops needed.

Real-world use: extracting all dollar amounts from a financial report, all dates from a log file, or all hashtags from a social media post.

# re.findall — returns a list of all matches

import re

log = "Errors on 2024-01-05, 2024-02-18, and 2024-03-22"

# Extract all dates in YYYY-MM-DD format
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log)
print("Dates:", dates)

receipt = "Total: $12.99  Tax: $1.04  Tip: $2.50"

# Extract all dollar amounts
amounts = re.findall(r"\$\d+\.\d{2}", receipt)
print("Amounts:", amounts)

Output:
Dates: ['2024-01-05', '2024-02-18', '2024-03-22']
Amounts: ['$12.99', '$1.04', '$2.50']

  • re.findall() returns an empty list [] if nothing matches — safe to use without an if check
  • If the pattern contains one capturing group (), findall returns a list of that group's strings; with two or more groups it returns a list of tuples
  • Use re.finditer() instead when you need match positions alongside the matched text
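
How groups change the return value of findall, and how finditer exposes positions, can both be checked on the log string from the example above. The variable names here are illustrative.

```python
import re

log = "Errors on 2024-01-05, 2024-02-18, and 2024-03-22"

# No groups: each list item is the whole match (as in the example above)
# One group: each list item is just that group's text
years = re.findall(r"(\d{4})-\d{2}-\d{2}", log)
print(years)         # ['2024', '2024', '2024']

# Several groups: each list item is a tuple with one entry per group
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", log)
print(parts[0])      # ('2024', '01', '05')

# finditer yields match objects, so positions are available too
spans = []
for m in re.finditer(r"\d{4}-\d{2}-\d{2}", log):
    spans.append((m.group(), m.start(), m.end()))
    print(m.group(), "at", m.start(), "to", m.end())
```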

Capturing Groups

Wrapping part of a pattern in parentheses creates a group. Groups let you extract specific portions of a match rather than the whole thing. This is one of the most powerful features in regex.

Real-world use: parsing a log line to capture the timestamp, log level, and message as separate fields rather than the whole raw line.

# Capturing groups — extract parts of a match

import re

# Extract year, month, day separately from a date
date_str = "Invoice date: 2024-07-19"
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", date_str)

if m:
    print("Full match:", m.group(0))   # entire match
    print("Year:",       m.group(1))   # first group
    print("Month:",      m.group(2))   # second group
    print("Day:",        m.group(3))   # third group

# Named groups — even clearer
m2 = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date_str)
if m2:
    print("Named year:", m2.group("year"))
    print("Named month:", m2.group("month"))

Output:
Full match: 2024-07-19
Year: 2024
Month: 07
Day: 19
Named year: 2024
Named month: 07

  • group(0) or group() always returns the entire match
  • group(1), group(2), etc. return individual capturing groups left to right
  • Named groups (?P<name>...) let you reference captures by name instead of number
  • m.groups() returns all groups as a tuple in one call
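
The log-parsing use case mentioned earlier might look like the sketch below. The log line format and the group names are assumptions made up for illustration; groupdict() is a standard match-object method that returns the named groups as a dictionary.

```python
import re

# Hypothetical log format: date, time, level, then free-text message
line = "2024-07-19 14:03:22 ERROR Disk quota exceeded"

pattern = (
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.+)"
)

m = re.search(pattern, line)
if m:
    fields = m.groups()       # all groups as a tuple, in order
    record = m.groupdict()    # named groups as a dict
    print(fields)
    print(record["level"], "-", record["message"])
```

Named groups still count as numbered groups, so m.group(3) and m.group("level") return the same text here.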

Substitution — re.sub

re.sub() finds all matches and replaces them with a new string. It is the regex equivalent of str.replace() but with the full power of pattern matching.

Real-world use: sanitizing user input by replacing phone numbers, emails, or credit card numbers with masked placeholders before storing or displaying them.

# re.sub — replace all pattern matches

import re

# Mask all digits in a string (e.g. hide sensitive numbers)
text = "Call us at 555-867-5309 or 555-246-1357"
masked = re.sub(r"\d", "*", text)
print(masked)

# Normalize spacing — replace multiple spaces with a single space
messy = "Name:    Alice     Age:   30"
clean = re.sub(r"\s+", " ", messy)
print(clean)

# Replace with a backreference — wrap matched text in brackets
tagged = re.sub(r"(\d+)", r"[\1]", "Order 4821 has 3 items")
print(tagged)

Output:
Call us at ***-***-**** or ***-***-****
Name: Alice Age: 30
Order [4821] has [3] items

  • re.sub(pattern, replacement, string) returns a new string — the original is unchanged
  • Use \1, \2 in the replacement to insert captured group content back
  • Pass count=n as a keyword argument to limit the number of replacements
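
The count argument and backreferences can be combined for more selective masking. This is a sketch that assumes phone numbers always appear in the NNN-NNN-NNNN shape used above; the placeholder text is arbitrary.

```python
import re

text = "Call us at 555-867-5309 or 555-246-1357"

# count=1 replaces only the first match and leaves the rest alone
first_only = re.sub(r"\d{3}-\d{3}-\d{4}", "[REDACTED]", text, count=1)
print(first_only)    # Call us at [REDACTED] or 555-246-1357

# Capture the last four digits and put them back with \1
partial = re.sub(r"\d{3}-\d{3}-(\d{4})", r"***-***-\1", text)
print(partial)       # Call us at ***-***-5309 or ***-***-1357

# Backreferences can also reorder captured pieces
us_date = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", "Shipped on 2024-03-15")
print(us_date)       # Shipped on 03/15/2024
```

Writing the replacement as a raw string matters here as well: in a normal string, "\1" would be read by Python as a control character before re.sub ever saw it.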

Splitting — re.split

re.split() divides a string at every match of the pattern, just like str.split() — but with regex power to split on complex delimiters.

Real-world use: splitting a sentence on any punctuation mark, or splitting a CSV-like string that uses inconsistent delimiters such as commas, semicolons, or tabs.

# re.split — split on a pattern instead of a fixed character

import re

# Split on any punctuation followed by optional whitespace
sentence = "First.Second! Third? Fourth"
parts = re.split(r"[.!?]\s*", sentence)
print(parts)

# Split on one or more whitespace characters (spaces, tabs, newlines)
data = "alice   bob\tcarol\ndave"
names = re.split(r"\s+", data)
print(names)

Output:
['First', 'Second', 'Third', 'Fourth']
['alice', 'bob', 'carol', 'dave']

  • re.split() returns a list, just like str.split()
  • If the pattern contains a capturing group, the matched delimiters are included in the result list
  • Pass maxsplit=n to limit how many splits occur
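
The three behaviours listed above can be sketched as follows. The sample strings are invented for illustration.

```python
import re

# Inconsistent delimiters: comma, semicolon, or tab, with optional spaces around them
row = "alice, bob;carol\tdave"
fields = re.split(r"\s*[,;\t]\s*", row)
print(fields)        # ['alice', 'bob', 'carol', 'dave']

# A capturing group around the delimiter keeps it in the result
kept = re.split(r"([.!?])\s*", "First.Second! Third")
print(kept)          # ['First', '.', 'Second', '!', 'Third']

# maxsplit=1 stops after the first split; the remainder stays in one piece
head_tail = re.split(r"\s+", "alice   bob\tcarol\ndave", maxsplit=1)
print(head_tail)     # ['alice', 'bob\tcarol\ndave']
```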

Compiling Patterns — re.compile

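If you use the same pattern many times, in a loop or across many strings, compile it once with re.compile(). The module-level functions already cache recently used patterns, so the speed gain is usually modest; the compiled object skips that cache lookup on every call and, just as usefully, gives the pattern a single named definition.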

# re.compile — compile once, reuse many times

import re

# Compile the pattern once
email_pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w{2,}")

emails = [
    "contact@dataplexa.com",
    "not-an-email",
    "support@example.org",
    "hello@"
]

for addr in emails:
    if email_pattern.search(addr):    # reuse compiled pattern
        print("Valid:", addr)
    else:
        print("Invalid:", addr)

Output:
Valid: contact@dataplexa.com
Invalid: not-an-email
Valid: support@example.org
Invalid: hello@

  • re.compile() returns a pattern object with the same methods: .search(), .findall(), .sub(), etc.
  • Compiling is especially worthwhile inside loops or functions called frequently
  • Store compiled patterns as module-level constants when reused across a whole file
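
A module-level constant shared by several helpers could be sketched like this. DATE_RE, extract_dates, and mask_dates are hypothetical names chosen for the example.

```python
import re

# Compiled once at import time and shared by every function below
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_dates(line):
    """Return every YYYY-MM-DD date found in the line."""
    return DATE_RE.findall(line)

def mask_dates(line):
    """Replace every date in the line with a fixed placeholder."""
    return DATE_RE.sub("YYYY-MM-DD", line)

lines = [
    "Errors on 2024-01-05 and 2024-02-18",
    "No dates here",
]

for line in lines:
    print(extract_dates(line), "|", mask_dates(line))
```

Keeping the pattern in one place also means a later correction to the regex only has to be made once.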

Flags — Modifying Match Behavior

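Flags change how a pattern is interpreted. Pass them through the flags argument of an re function, or embed them in the pattern string as inline flags such as (?i). Using the keyword form flags=... is the safest habit, because in re.sub and re.split the next positional argument is count or maxsplit, not flags.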

# Common flags — re.IGNORECASE and re.MULTILINE

import re

text = "Python is great. PYTHON is fast. python is fun."

# Case-insensitive matching
matches = re.findall(r"python", text, re.IGNORECASE)
print(matches)   # finds all three regardless of case

# re.MULTILINE — ^ and $ match start/end of each LINE
multi = "first line\nsecond line\nthird line"
starts = re.findall(r"^\w+", multi, re.MULTILINE)
print(starts)    # first word of every line

Output:
['Python', 'PYTHON', 'python']
['first', 'second', 'third']

  • re.IGNORECASE (or re.I) — case-insensitive matching
  • re.MULTILINE (or re.M) — ^ and $ match at line boundaries, not just string boundaries
  • re.DOTALL (or re.S) — makes . match newlines as well
  • Combine flags with |: re.IGNORECASE | re.MULTILINE
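
The remaining two bullets, re.DOTALL and combining flags, can be sketched as below, along with the keyword form flags=... used with re.sub. The sample strings are invented for illustration.

```python
import re

block = "START\nline one\nline two\nEND"

# By default . stops at newlines, so the pattern cannot span lines
no_dotall = re.search(r"START.*END", block)
print(no_dotall)                      # None

# With re.DOTALL, . matches newlines too
with_dotall = re.search(r"START.*END", block, flags=re.DOTALL)
print(with_dotall.group() == block)   # True

# Combining flags with |: case-insensitive AND line-by-line anchors
report = "error: disk\nError: net\nok"
hits = re.findall(r"^error", report, flags=re.IGNORECASE | re.MULTILINE)
print(hits)                           # ['error', 'Error']

# With re.sub, pass flags by keyword; the fourth positional argument is count
fixed = re.sub(r"python", "Python", "PYTHON and python", flags=re.IGNORECASE)
print(fixed)                          # Python and Python
```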

Summary Table

Function        Returns                          Best Used For
re.search()     First match object or None       Checking if a pattern exists anywhere
re.match()      Match object at start or None    Validating string format from the start
re.findall()    List of all matches              Extracting all occurrences at once
re.finditer()   Iterator of match objects        Extracting matches with positions
re.sub()        New string with replacements     Cleaning, masking, reformatting text
re.split()      List of substrings               Splitting on complex delimiters
re.compile()    Compiled pattern object          Reusing the same pattern efficiently

Practice Questions

Practice 1. Which re function scans the entire string and returns only the first match?



Practice 2. What regex pattern matches one or more digits?



Practice 3. What type does re.findall() always return?



Practice 4. What flag makes a regex pattern case-insensitive?



Practice 5. What string prefix should you use when writing regex patterns in Python to avoid backslash issues?



Quiz

Quiz 1. What is the difference between re.match() and re.search()?






Quiz 2. What does the pattern \d{4}-\d{2}-\d{2} match?






Quiz 3. What does re.sub(r"\s+", " ", text) do?






Quiz 4. In a regex pattern, what does ? mean when placed after a character?






Quiz 5. Why is re.compile() recommended when using the same pattern repeatedly?






Next up — Iterators explains how Python's for loops actually work under the hood, and teaches you to build your own iterable objects using __iter__ and __next__.