Text Cleaning Techniques
Before text can be converted into features like Bag of Words, TF-IDF, or N-grams, it must be cleaned.
Raw text from the real world is usually messy: it contains symbols, numbers, mixed cases, extra spaces, emojis, and noise.
Text cleaning is the process of preparing raw text into a clean, consistent format so machines can learn meaningful patterns.
Why Is Text Cleaning Important?
Machine learning models are very sensitive to small differences.
For example, the strings below look interchangeable to a human, but a machine treats each one as a completely different token:
- “NLP”
- “nlp”
- “Nlp!”
Without cleaning:
- Vocabulary size increases unnecessarily
- Noise reduces model accuracy
- Training becomes slower
Clean text = better features = better models.
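The vocabulary-size point can be made concrete with a small illustrative comparison (the three sample sentences and the `basic_clean` helper are made up for this sketch):

```python
# Illustrative comparison: vocabulary size before and after basic cleaning.
import string

docs = ["NLP is great!", "nlp IS Great", "I love NLP!!!"]

# Vocabulary over raw tokens: "NLP", "nlp", "great!", and "Great" all count
# as separate entries.
raw_vocab = {token for doc in docs for token in doc.split()}

def basic_clean(text):
    # Lowercase, then delete every punctuation character.
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

clean_vocab = {token for doc in docs for token in basic_clean(doc).split()}

print(len(raw_vocab), len(clean_vocab))  # 9 raw tokens shrink to 5 clean ones
```

The same three short sentences produce a vocabulary nearly twice as large when left uncleaned.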
Common Text Cleaning Steps
Text cleaning is not a single fixed recipe; the right steps depend on the problem.
However, most NLP pipelines use these core techniques:
| Step | Purpose |
|---|---|
| Lowercasing | Normalize text |
| Removing punctuation | Remove noise |
| Removing numbers | Remove irrelevant tokens |
| Removing extra spaces | Clean formatting |
| Removing stopwords | Focus on meaningful words |
Step 1: Lowercasing Text
Lowercasing converts all characters to lowercase.
This ensures that words like “Data” and “data” are treated as the same token.
```python
text = "Natural Language Processing Is Powerful"
clean_text = text.lower()
print(clean_text)
```

Output:

```
natural language processing is powerful
```
Step 2: Removing Punctuation
Punctuation adds little meaning in most NLP tasks, so we remove it to reduce noise.
```python
import string

text = "Hello!!! NLP, is awesome."
# str.maketrans('', '', chars) builds a translation table that deletes
# every character listed in chars.
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)
```

Output:

```
Hello NLP is awesome
```
Step 3: Removing Numbers
Numbers are sometimes useful (prices, dates), but often they add noise.
We remove them when the task is purely language-based.
```python
import re

text = "I bought 3 books in 2024"
# \d+ matches one or more consecutive digits
clean_text = re.sub(r'\d+', '', text)
print(clean_text)
```

Output (the digits are gone, but a double space and a trailing space remain; the next step fixes them):

```
I bought  books in 
```
Step 4: Removing Extra Whitespace
After cleaning, text may contain extra spaces.
These should be normalized.
```python
text = "NLP   is   very   powerful"
# split() with no argument splits on any run of whitespace,
# so joining with a single space collapses the extra spaces.
clean_text = " ".join(text.split())
print(clean_text)
```

Output:

```
NLP is very powerful
```
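The table above also lists stopword removal. A minimal sketch follows; the tiny `STOPWORDS` set here is purely illustrative, and real pipelines typically use a library list such as NLTK's stopwords corpus instead:

```python
# A tiny illustrative stopword set; production code would load a full
# list from a library such as NLTK.
STOPWORDS = {"is", "a", "the", "of", "and", "in", "to"}

text = "nlp is the study of language and computers"
# Keep only the tokens that are not stopwords.
tokens = [word for word in text.split() if word not in STOPWORDS]
clean_text = " ".join(tokens)
print(clean_text)  # nlp study language computers
```

Stopword removal is applied after lowercasing, since stopword lists are usually lowercase.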
Combining All Cleaning Steps
In real projects, we combine multiple steps into one pipeline.
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
```python
import re
import string

def clean_text(text):
    text = text.lower()                                               # normalize case
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    text = re.sub(r'\d+', '', text)                                   # drop digits
    text = " ".join(text.split())                                     # collapse whitespace
    return text

sample = "NLP in 2024 is AWESOME!!!"
print(clean_text(sample))
```

Output:

```
nlp in is awesome
```
When Should You Clean Text?
- Before vectorization
- Before N-grams
- Before ML/DL training
- Before clustering or similarity
Text cleaning is the foundation of all NLP pipelines.
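As a sketch of the first point above, cleaning can feed directly into a simple Bag-of-Words count; `collections.Counter` stands in here for a full vectorizer, and the two sample documents are invented for illustration:

```python
import re
import string
from collections import Counter

def clean_text(text):
    # Same pipeline as above: lowercase, strip punctuation, drop digits,
    # collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    return " ".join(text.split())

docs = ["NLP is FUN!!!", "nlp is fun, in 2024."]

# Without cleaning, the two documents share almost no tokens;
# after cleaning, their word counts line up.
bow = [Counter(clean_text(doc).split()) for doc in docs]
print(bow)
```

The same idea carries over unchanged when `Counter` is replaced by a real vectorizer such as scikit-learn's `CountVectorizer` or `TfidfVectorizer`.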
Assignment / Homework
Practice Environment:
- Google Colab
- Jupyter Notebook
Tasks:
- Clean a paragraph from a news article
- Compare vocabulary size before and after cleaning
- Apply cleaning before TF-IDF
- Experiment with keeping vs removing numbers
Practice Questions
Q1. Why is lowercasing important?
Q2. Should numbers always be removed?
Quick Quiz
Q1. What happens if text is not cleaned?
Q2. Which step removes symbols like “!” and “?”
Quick Recap
- Raw text is noisy
- Cleaning improves model performance
- Lowercasing and punctuation removal are core steps
- Cleaning happens before vectorization
- Strong cleaning leads to strong NLP models