Regular Expressions (Regex)
In the previous lessons, you learned how text is broken into tokens. But before tokenization, we often need to clean, filter, or extract specific patterns from text.
This is where Regular Expressions (Regex) become extremely powerful. Regex allows us to search, match, and manipulate text using patterns.
Regex is used not only in NLP, but also in programming, data analysis, log processing, validation, and competitive exams.
What Is a Regular Expression?
A regular expression is a pattern used to match specific sequences of characters in text.
Instead of checking text character by character, regex lets us describe what we want in a compact and powerful way.
Examples of what a regex pattern can describe:
- Email addresses
- Phone numbers
- Dates
- Only alphabets or digits
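As a rough illustration of the list above, the sketch below describes a date and "only alphabets or digits" as patterns. The pattern names and the sample text are made up for this example, and real dates come in many more formats than the single YYYY-MM-DD shape assumed here. The symbols used are explained in the sections that follow.

```python
import re

# Hypothetical patterns for illustration only
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumes dates look like 2024-01-15
letters_only = re.compile(r"[A-Za-z]+")          # only alphabets
digits_only = re.compile(r"[0-9]+")              # only digits

text = "The workshop is on 2024-01-15 and again on 2024-02-20."
dates = date_pattern.findall(text)
print(dates)  # ['2024-01-15', '2024-02-20']

# fullmatch() succeeds only if the whole string fits the pattern
is_letters = letters_only.fullmatch("Regex") is not None
is_digits = digits_only.fullmatch("12a45") is not None
print(is_letters, is_digits)  # True False
```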
Why Regex Is Important in NLP
Real-world text is messy. It contains punctuation, numbers, symbols, emojis, URLs, and unwanted noise.
Regex helps us:
- Remove unwanted characters
- Extract useful information
- Normalize text before tokenization
- Prepare clean input for NLP models
Without regex, this kind of preprocessing usually means writing character-by-character checks by hand, which is slower to write and easier to get wrong.
Basic Regex Symbols You Must Know
These symbols form the foundation of regex. You will see them repeatedly in NLP pipelines and exams.
| Symbol | Meaning | Example |
|---|---|---|
| . | Any single character (except a newline) | a.c → abc, a1c |
| ^ | Start of string | ^Hello |
| $ | End of string | world$ |
| [a-z] | Any one lowercase letter from a to z | c, a, t |
| [0-9] | Any one digit from 0 to 9 | 1, 2, 3 |
| \d | Any digit (same as [0-9] for ASCII text) | 0–9 |
| \w | Word character | a–z, A–Z, 0–9, _ |
| \s | Whitespace character | space, tab, newline |
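To see the symbols from the table in action, here is a small sketch using Python's re module. The sample strings and variable names are invented for this example.

```python
import re

# "." stands for exactly one character between "a" and "c"
samples = ["abc", "a1c", "ac", "abbc"]
dot_matches = [s for s in samples if re.fullmatch(r"a.c", s)]
print(dot_matches)  # ['abc', 'a1c']

# "^" anchors to the start of the string, "$" to the end
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))
print(starts, ends)  # True True

# Character classes match one character at a time
digits = re.findall(r"\d", "Room 42, floor 7")
word_chars = re.findall(r"\w", "a_1 !")
spaces = re.findall(r"\s", "a b\tc")
lower = re.findall(r"[a-z]", "Cat")
print(digits)      # ['4', '2', '7']
print(word_chars)  # ['a', '_', '1']
print(spaces)      # [' ', '\t']
print(lower)       # ['a', 't']
```

Each class matches a single character, so \d returns '4' and '2' separately instead of '42'. Matching a whole number needs a quantifier.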
Regex Quantifiers (How Many Times)
Quantifiers specify how often a pattern should appear.
- * → 0 or more times
- + → 1 or more times
- ? → 0 or 1 time
- {n} → exactly n times
These are heavily used in validation and extraction tasks.
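The quantifiers above can be compared side by side. This is a minimal sketch: the helper function matching() and the candidate strings are made up for illustration.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matching(r"ab*c")           # "b" 0 or more times
plus = matching(r"ab+c")           # "b" 1 or more times
optional = matching(r"ab?c")       # "b" 0 or 1 time
exactly_two = matching(r"ab{2}c")  # "b" exactly 2 times

print(star)         # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)         # ['abc', 'abbc', 'abbbc']
print(optional)     # ['ac', 'abc']
print(exactly_two)  # ['abbc']
```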
Regex in Python
Python provides a built-in module called re to work with regular expressions.
We commonly use:
- re.findall() → find all matches
- re.sub() → replace patterns
- re.search() → search for a pattern
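A quick sketch of how the three functions differ on the same input. The sample sentence is invented for this example.

```python
import re

text = "Order 66 shipped on day 3"

# findall() returns a list of every match
all_numbers = re.findall(r"\d+", text)
print(all_numbers)  # ['66', '3']

# sub() returns a new string with every match replaced
masked = re.sub(r"\d+", "#", text)
print(masked)  # Order # shipped on day #

# search() returns a match object for the first match, or None if there is none
first = re.search(r"\d+", text)
if first:
    print(first.group(), first.start())  # 66 6
```

Because re.search() returns None when nothing matches, it is safer to check the result before calling .group() on it.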
Code Example: Removing Digits from Text
Let us clean text by removing numbers. This is a very common NLP preprocessing step.
You can run this code in:
- Google Colab (recommended for beginners)
- Jupyter Notebook
- VS Code / PyCharm (Python environment)
import re
text = "NLP in 2024 is powerful and useful"
# Replace every run of one or more digits with an empty string
clean_text = re.sub(r'\d+', '', text)
print(clean_text)
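Output:
NLP in  is powerful and useful
Notice the two spaces between "in" and "is": the digits are removed, but the spaces on either side of them are kept.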
Understanding the Output
Here:
- \d+ means one or more digits
- re.sub() replaces each run of digits with an empty string
This leaves only meaningful words behind. Such cleaning is usually done before tokenization.
Extracting Words Using Regex
Regex can also extract information instead of removing it.
import re
text = "Email me at support@dataplexa.com"
# \w+ matches each run of letters, digits, or underscores
words = re.findall(r'\w+', text)
print(words)
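Output:
['Email', 'me', 'at', 'support', 'dataplexa', 'com']
Notice that the email address is split into three pieces, because @ and . are not word characters and \w+ stops at them.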
Regex in Real-Life NLP Applications
- Removing URLs from text
- Cleaning social media data
- Extracting hashtags or mentions
- Normalizing chat messages
Regex acts as the filter before deeper NLP processing.
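Some of the applications above could be sketched as follows. The post text, the handle @some_user, and the URL are made up, and the patterns are deliberately simple assumptions: real URLs, hashtags, and mentions have edge cases these patterns do not handle.

```python
import re

# Hypothetical social media post
post = "Loving #NLP with @some_user! Details: https://example.com/regex #regex"

# Simplified assumption: a URL is http:// or https:// followed by non-space characters
url_pattern = r"https?://\S+"

hashtags = re.findall(r"#\w+", post)     # "#" followed by word characters
mentions = re.findall(r"@\w+", post)     # "@" followed by word characters
no_urls = re.sub(url_pattern, "", post)  # drop anything that looks like a URL

print(hashtags)  # ['#NLP', '#regex']
print(mentions)  # ['@some_user']
print(no_urls)
```

The simple @\w+ pattern would also pick up the domain part of an email address, which is one reason patterns should be tested on varied input.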
Common Mistakes to Avoid
- Overusing regex for complex language understanding
- Forgetting to escape special characters
- Not testing regex patterns properly
Regex is powerful, but it should be used wisely.
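The second mistake, forgetting to escape special characters, is easy to demonstrate. In this sketch the sample text is invented; re.escape() is a standard function in Python's re module that escapes special characters for you.

```python
import re

text = "Price: 3.50 or 3x50?"

# Unescaped "." means "any character", so it also matches the "x"
unescaped = re.findall(r"3.50", text)
print(unescaped)  # ['3.50', '3x50']

# A backslash makes "." match only a literal dot
escaped = re.findall(r"3\.50", text)
print(escaped)  # ['3.50']

# re.escape() builds the escaped pattern from a plain string
auto = re.findall(re.escape("3.50"), text)
print(auto)  # ['3.50']
```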
Homework Practice (Important)
This practice is mandatory for building confidence.
👉 Where to practice:
- Google Colab (free, no setup)
- Jupyter Notebook
👉 Your tasks:
- Write regex to remove punctuation from a sentence
- Extract all email IDs from a paragraph
- Remove extra spaces from text
- Extract only words with more than 4 letters
Try different inputs and observe how regex behaves.
Quick Quiz
Q1. What does regex mainly help with?
Q2. Which regex symbol represents digits?
Quick Recap
- Regex uses patterns to process text
- It is essential for NLP preprocessing
- Python uses the re module
- Regex cleans and extracts text efficiently
- Practice is the key to mastering regex