NLP Lesson 5 – Regular Expressions | Dataplexa

Regular Expressions (Regex)

In the previous lessons, you learned how text is broken into tokens. But before tokenization, we often need to clean, filter, or extract specific patterns from text.

This is where Regular Expressions (Regex) become extremely powerful. Regex allows us to search, match, and manipulate text using patterns.

Regex is used not only in NLP, but also in programming, data analysis, log processing, validation, and competitive exams.

What Is a Regular Expression?

A regular expression is a pattern used to match specific sequences of characters in text.

Instead of checking text character by character, regex lets us describe what we want in a compact and powerful way.

Examples of things a regex can describe:

Email addresses
Phone numbers
Dates
Text made up only of letters or only of digits

Why Regex Is Important in NLP

Real-world text is messy. It contains punctuation, numbers, symbols, emojis, URLs, and unwanted noise.

Regex helps us:

Remove unwanted characters
Extract useful information
Normalize text before tokenization
Prepare clean input for NLP models

Without regex, these steps usually mean writing character-by-character loops by hand, which is tedious and error-prone.

Basic Regex Symbols You Must Know

These symbols form the foundation of regex. You will see them repeatedly in NLP pipelines and exams.

Symbol	Meaning	Example
.	Any single character except a newline	a.c → abc, a1c
^	Start of string	^Hello
$	End of string	world$
[a-z]	Any one lowercase letter	a, m, z
[0-9]	Any one digit	0, 5, 9
\d	Any digit	0–9
\w	Word character	a–z, A–Z, 0–9, _
\s	Whitespace	space, tab, newline
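The symbols in the table can be tried out directly. The snippet below is a minimal sketch using Python's built-in re module (introduced later in this lesson); the sample strings and variable names are made up for illustration.

```python
import re

# "." matches any single character except a newline.
dot = re.findall(r"a.c", "abc a1c ac")

# "^" anchors the match at the start of the string, "$" at the end.
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))
not_at_start = bool(re.search(r"^world", "Hello world"))

# A character class such as [a-z] matches exactly one character at a time.
lower = re.findall(r"[a-z]", "Cat 42")
digits = re.findall(r"\d", "Cat 42")

# \w covers letters, digits, and underscore; \s covers whitespace.
word_chars = re.findall(r"\w", "a_b-1")
spaces = re.findall(r"\s", "a b\tc")

print(dot)                          # ['abc', 'a1c']
print(starts, ends, not_at_start)   # True True False
print(lower, digits)                # ['a', 't'] ['4', '2']
print(word_chars, spaces)           # ['a', '_', 'b', '1'] [' ', '\t']
```

Note that [a-z] and \d each return single characters here. To match whole words or whole numbers, they need a quantifier.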

Regex Quantifiers (How Many Times)

Quantifiers specify how often a pattern should appear.

* → 0 or more times
+ → 1 or more times
? → 0 or 1 time
{n} → exactly n times

These are heavily used in validation and extraction tasks.
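To see how each quantifier changes what is matched, here is a small sketch that applies all four to the same made-up string. The sample text and variable names are illustrative only.

```python
import re

text = "ct cat caat caaat"

# "a*" allows zero or more "a" characters between "c" and "t".
star = re.findall(r"ca*t", text)

# "a+" requires at least one "a".
plus = re.findall(r"ca+t", text)

# "a?" allows at most one "a".
optional = re.findall(r"ca?t", text)

# "a{2}" requires exactly two "a" characters.
exactly_two = re.findall(r"ca{2}t", text)

print(star)         # ['ct', 'cat', 'caat', 'caaat']
print(plus)         # ['cat', 'caat', 'caaat']
print(optional)     # ['ct', 'cat']
print(exactly_two)  # ['caat']
```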

Regex in Python

Python provides a built-in module called re to work with regular expressions.

We commonly use:

re.findall() → find all matches
re.sub() → replace patterns
re.search() → search for a pattern
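As a quick side-by-side comparison, the sketch below runs all three functions with the same pattern on one made-up sentence. Note that re.search() returns a match object (or None when nothing matches), not a string.

```python
import re

text = "Order 66 shipped on day 12"

# findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", text)

# sub returns a new string with each match replaced.
masked = re.sub(r"\d+", "#", text)

# search returns a match object for the first match, or None.
first = re.search(r"\d+", text)

print(all_numbers)  # ['66', '12']
print(masked)       # Order # shipped on day #
if first:
    print(first.group(), first.start())  # 66 6
```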

Code Example: Removing Digits from Text

Let us clean text by removing numbers. This is a very common NLP preprocessing step.

You can run this code in:

Google Colab (recommended for beginners)
Jupyter Notebook
VS Code / PyCharm (Python environment)

Python Example: Remove Numbers Using Regex

import re

text = "NLP in 2024 is powerful and useful"

# Replace every run of one or more digits with an empty string
clean_text = re.sub(r'\d+', '', text)
print(clean_text)

Output:

NLP in  is powerful and useful

Understanding the Output

Here:

\d+ means one or more digits
re.sub() replaces digits with an empty string

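This leaves only the words behind, although a double space remains where "2024" used to be, so extra whitespace is usually collapsed in a separate step. Such cleaning is usually done before tokenization.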

Extracting Words Using Regex

Regex can also extract information instead of removing it.

Python Example: Extract Words

import re

text = "Email me at support@dataplexa.com"

# \w+ matches each run of letters, digits, or underscores
words = re.findall(r'\w+', text)
print(words)

Output:

['Email', 'me', 'at', 'support', 'dataplexa', 'com']

Notice that the email address is split into three pieces. The characters @ and . are not word characters, so \w+ stops at each of them.

Regex in Real-Life NLP Applications

Removing URLs from text
Cleaning social media data
Extracting hashtags or mentions
Normalizing chat messages

Regex acts as the filter before deeper NLP processing.
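The sketch below shows how a few of these tasks might look on a single social media post. The post text, the @dataplexa handle, and the URL are invented for illustration, and the patterns are deliberately simple. Real URLs and hashtags have edge cases (trailing punctuation, non-Latin scripts) that these patterns do not try to handle.

```python
import re

post = "Loving #NLP with @dataplexa! Details: https://example.com/regex #regex"

# A hashtag is treated here as "#" followed by one or more word characters.
hashtags = re.findall(r"#\w+", post)

# A mention is treated here as "@" followed by one or more word characters.
mentions = re.findall(r"@\w+", post)

# A URL is treated here as http:// or https:// followed by non-whitespace.
no_urls = re.sub(r"https?://\S+", "", post)

print(hashtags)  # ['#NLP', '#regex']
print(mentions)  # ['@dataplexa']
print(no_urls)
```

Removing the URL leaves a double space behind, so this step is typically followed by whitespace cleanup.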

Common Mistakes to Avoid

Overusing regex for complex language understanding
Forgetting to escape special characters
Not testing regex patterns properly

Regex is powerful, but it should be used wisely.
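The second mistake, forgetting to escape special characters, is easy to demonstrate. In this small sketch (the sample text is made up), an unescaped dot matches any character, so the pattern finds more than intended. Escaping it by hand or with re.escape() fixes that.

```python
import re

text = "Price: 3.50 or 3x50"

# Unescaped: "." matches any character, so "3x50" matches too.
unescaped = re.findall(r"3.50", text)

# Escaped: "\." matches only a literal dot.
escaped = re.findall(r"3\.50", text)

# re.escape() escapes all special characters in a literal string for you.
auto_escaped = re.findall(re.escape("3.50"), text)

print(unescaped)     # ['3.50', '3x50']
print(escaped)       # ['3.50']
print(auto_escaped)  # ['3.50']
```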

Homework Practice (Important)

This practice is mandatory for building confidence.

👉 Where to practice:

Google Colab (free, no setup)
Jupyter Notebook

👉 Your tasks:

Write regex to remove punctuation from a sentence
Extract all email IDs from a paragraph
Remove extra spaces from text
Extract only words with more than 4 letters

Try different inputs and observe how regex behaves.

Quick Quiz

Q1. What does regex mainly help with?

Pattern matching and text manipulation.

Q2. Which regex symbol represents digits?

\d (a character class such as [0-9] also works).

Quick Recap

Regex uses patterns to process text
It is essential for NLP preprocessing
Python uses the re module
Regex cleans and extracts text efficiently
Practice is the key to mastering regex
