NLP Lesson 5 – Regular Expressions | Dataplexa

Regular Expressions (Regex)

In the previous lessons, you learned how text is broken into tokens. But before tokenization, we often need to clean, filter, or extract specific patterns from text.

This is where Regular Expressions (Regex) become extremely powerful. Regex allows us to search, match, and manipulate text using patterns.

Regex is used not only in NLP, but also in programming, data analysis, log processing, validation, and competitive exams.


What Is a Regular Expression?

A regular expression is a pattern used to match specific sequences of characters in text.

Instead of checking text character by character, regex lets us describe what we want in a compact and powerful way.

Typical things a regex pattern can describe:

  • Email pattern
  • Phone numbers
  • Dates
  • Only alphabets or digits

Why Regex Is Important in NLP

Real-world text is messy. It contains punctuation, numbers, symbols, emojis, URLs, and unwanted noise.

Regex helps us:

  • Remove unwanted characters
  • Extract useful information
  • Normalize text before tokenization
  • Prepare clean input for NLP models

Without regex, the same cleanup has to be written as manual string-handling code, which is longer and easier to get wrong.


Basic Regex Symbols You Must Know

These symbols form the foundation of regex. You will see them repeatedly in NLP pipelines and exams.

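Symbol | Meaning | Example
. | Any single character except a newline | a.c matches abc, a1c
^ | Start of string | ^Hello matches "Hello world"
$ | End of string | world$ matches "Hello world"
[a-z] | One lowercase letter | [a-z]+ matches cat, dog
[0-9] | One digit | [0-9]+ matches 123
\d | Any digit | 0–9
\w | Word character | a–z, A–Z, 0–9, _
\s | Whitespace character | space, tab, newline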
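
To see these symbols in action, here is a small sketch using Python's re module (introduced later in this lesson). The sample strings and variable names are made up for illustration.

```python
import re

# "." matches any single character except a newline
dot_matches = re.findall(r"a.c", "abc a1c ac")

# "^" anchors the pattern at the start, "$" at the end of the string
starts_with_hello = bool(re.search(r"^Hello", "Hello world"))
ends_with_world = bool(re.search(r"world$", "Hello world"))

# A character class or shorthand matches ONE character at a time
lowercase = re.findall(r"[a-z]", "Cat 42")
digits = re.findall(r"\d", "Cat 42")
spaces = re.findall(r"\s", "a b\tc")

print(dot_matches)        # ['abc', 'a1c']
print(starts_with_hello)  # True
print(ends_with_world)    # True
print(lowercase)          # ['a', 't']
print(digits)             # ['4', '2']
print(spaces)             # [' ', '\t']
```

Notice that [a-z] and \d each return single characters. To match whole words or numbers, they need a quantifier.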

Regex Quantifiers (How Many Times)

Quantifiers specify how often a pattern should appear.

  • * → 0 or more times
  • + → 1 or more times
  • ? → 0 or 1 time
  • {n} → exactly n times

These are heavily used in validation and extraction tasks.
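
The four quantifiers above can be compared side by side. This is a minimal sketch on an invented test string, where each pattern differs only in how many times the letter b may appear.

```python
import re

text = "ac abc abbc abbbc"

star = re.findall(r"ab*c", text)      # b zero or more times
plus = re.findall(r"ab+c", text)      # b one or more times
optional = re.findall(r"ab?c", text)  # b zero or one time
exact = re.findall(r"ab{2}c", text)   # b exactly two times

print(star)      # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)      # ['abc', 'abbc', 'abbbc']
print(optional)  # ['ac', 'abc']
print(exact)     # ['abbc']
```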


Regex in Python

Python provides a built-in module called re to work with regular expressions.

We commonly use:

  • re.findall() → find all matches
  • re.sub() → replace patterns
  • re.search() → find the first match anywhere in the text
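
As a quick comparison, the sketch below applies all three functions to the same pattern. The sample sentence and the <NUM> placeholder are assumptions chosen for illustration.

```python
import re

text = "Order 12 shipped, order 345 pending"

# findall returns every match as a list of strings
all_numbers = re.findall(r"\d+", text)

# sub replaces every match with the given replacement
masked = re.sub(r"\d+", "<NUM>", text)

# search returns a match object for the first match, or None if there is none
first = re.search(r"\d+", text)
first_number = first.group() if first else None

print(all_numbers)   # ['12', '345']
print(masked)        # Order <NUM> shipped, order <NUM> pending
print(first_number)  # 12
```

Because re.search() can return None, it is safer to check the result before calling .group() on it.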

Code Example: Removing Digits from Text

Let us clean text by removing numbers. This is a very common NLP preprocessing step.

You can run this code in:

  • Google Colab (recommended for beginners)
  • Jupyter Notebook
  • VS Code / PyCharm (Python environment)
Python Example: Remove Numbers Using Regex
import re

text = "NLP in 2024 is powerful and useful"

clean_text = re.sub(r'\d+', '', text)
print(clean_text)

Output:

NLP in  is powerful and useful

Understanding the Output

Here:

  • \d+ means one or more digits
  • re.sub() replaces every run of digits with an empty string

This leaves only the words behind, although a double space remains where the number was removed. Such cleaning is usually done before tokenization.


Extracting Words Using Regex

Regex can also extract information instead of removing it.

Python Example: Extract Words
import re

text = "Email me at support@dataplexa.com"

words = re.findall(r'\w+', text)
print(words)

Output:

['Email', 'me', 'at', 'support', 'dataplexa', 'com']
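
The email address is split into three pieces because @ and . are not word characters, so \w+ stops at each of them. If such tokens should stay whole, one option is to match runs of non-whitespace with \S+ instead. This is only a sketch, not a full tokenizer.

```python
import re

text = "Email me at support@dataplexa.com"

# \w+ breaks at '@' and '.' because they are not word characters
word_chunks = re.findall(r"\w+", text)

# \S+ keeps every run of non-whitespace characters together
whitespace_chunks = re.findall(r"\S+", text)

print(word_chunks)        # ['Email', 'me', 'at', 'support', 'dataplexa', 'com']
print(whitespace_chunks)  # ['Email', 'me', 'at', 'support@dataplexa.com']
```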

Regex in Real-Life NLP Applications

  • Removing URLs from text
  • Cleaning social media data
  • Extracting hashtags or mentions
  • Normalizing chat messages

Regex acts as the filter before deeper NLP processing.
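
A few of these tasks can be sketched in a handful of lines. The sample post and the example.com URL are invented, and the URL pattern is deliberately simplified (anything starting with http:// or https:// up to the next whitespace), so treat it as a starting point rather than a complete solution.

```python
import re

post = "Loving #NLP with @dataplexa! Details: https://example.com/regex #regex"

# Extract hashtags and mentions: a marker followed by word characters
hashtags = re.findall(r"#\w+", post)
mentions = re.findall(r"@\w+", post)

# Remove URLs with a simplified pattern, then collapse leftover whitespace
no_urls = re.sub(r"https?://\S+", "", post)
cleaned = re.sub(r"\s+", " ", no_urls).strip()

print(hashtags)  # ['#NLP', '#regex']
print(mentions)  # ['@dataplexa']
print(cleaned)   # Loving #NLP with @dataplexa! Details: #regex
```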


Common Mistakes to Avoid

  • Overusing regex for complex language understanding
  • Forgetting to escape special characters
  • Not testing regex patterns properly

Regex is powerful, but it should be used wisely.
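
The escaping mistake is easy to reproduce. In the sketch below (sample text invented for illustration), an unescaped dot matches any character, so the pattern also picks up a number that contains no dot at all. Escaping it by hand, or with re.escape(), restricts the match to a literal dot.

```python
import re

text = "Price: 3.50 or 3850"

# Unescaped: '.' matches any character, so '3850' matches too
unescaped = re.findall(r"3.50", text)

# Escaped: '\.' matches only a literal dot
escaped = re.findall(r"3\.50", text)

# re.escape() escapes all special characters in a literal string
auto_escaped = re.findall(re.escape("3.50"), text)

print(unescaped)     # ['3.50', '3850']
print(escaped)       # ['3.50']
print(auto_escaped)  # ['3.50']
```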


Homework Practice (Important)

This practice is mandatory for building confidence.

👉 Where to practice:

  • Google Colab (free, no setup)
  • Jupyter Notebook

👉 Your tasks:

  • Write regex to remove punctuation from a sentence
  • Extract all email IDs from a paragraph
  • Remove extra spaces from text
  • Extract only words with more than 4 letters

Try different inputs and observe how regex behaves.


Quick Quiz

Q1. What does regex mainly help with?

Pattern matching and text manipulation.

Q2. Which regex symbol represents digits?

\d

Quick Recap

  • Regex uses patterns to process text
  • It is essential for NLP preprocessing
  • Python uses the re module
  • Regex cleans and extracts text efficiently
  • Practice is the key to mastering regex