NLP Lesson 12 – Text Cleaning | Dataplexa

Text Cleaning Techniques

Before text can be converted into features like Bag of Words, TF-IDF, or N-grams, it must be cleaned.

Raw text from the real world is usually messy: it contains symbols, numbers, mixed cases, extra spaces, emojis, and noise.

Text cleaning is the process of preparing raw text into a clean, consistent format so machines can learn meaningful patterns.


Why Is Text Cleaning Important?

Machine learning models are very sensitive to small differences.

For example, the words below look similar to humans, but are completely different to a machine:

  • “NLP”
  • “nlp”
  • “Nlp!”

Without cleaning:

  • Vocabulary size increases unnecessarily
  • Noise reduces model accuracy
  • Training becomes slower

Clean text = better features = better models.
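The vocabulary inflation above is easy to demonstrate. This small sketch counts distinct tokens in a raw string versus its cleaned form (the strings here are made up for illustration):

```python
# Without cleaning, the same word shows up as several distinct tokens.
raw = "NLP nlp Nlp! NLP? nlp."
clean = "nlp nlp nlp nlp nlp"

raw_vocab = set(raw.split())
clean_vocab = set(clean.split())

print(len(raw_vocab))    # 5 distinct tokens before cleaning
print(len(clean_vocab))  # 1 distinct token after cleaning
```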


Common Text Cleaning Steps

Text cleaning is not one fixed rule. The steps depend on the problem.

However, most NLP pipelines use these core techniques:

Step                     Purpose
Lowercasing              Normalize text
Removing punctuation     Remove noise
Removing numbers         Remove irrelevant tokens
Removing extra spaces    Clean formatting
Removing stopwords       Focus on meaningful words

Step 1: Lowercasing Text

Lowercasing converts all characters to lowercase.

This ensures that words like “Data” and “data” are treated as the same token.

Lowercasing Example
text = "Natural Language Processing Is Powerful"

clean_text = text.lower()
print(clean_text)

Output:
natural language processing is powerful

Step 2: Removing Punctuation

For most NLP tasks, punctuation does not add meaning.

We remove it to reduce noise.

Removing Punctuation
import string

text = "Hello!!! NLP, is awesome."

clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)

Output:
Hello NLP is awesome

Step 3: Removing Numbers

Numbers are sometimes useful (prices, dates), but often they add noise.

We remove them when the task is purely language-based.

Removing Numbers
import re

text = "I bought 3 books in 2024"

clean_text = re.sub(r'\d+', '', text)
print(clean_text)

Output:
I bought  books in 

Step 4: Removing Extra Whitespaces

After cleaning, text may contain extra spaces.

These should be normalized.

Removing Extra Spaces
text = "NLP    is   very    powerful"

clean_text = " ".join(text.split())
print(clean_text)

Output:
NLP is very powerful
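
Step 5: Removing Stopwords

Stopwords are very common words (like “is”, “the”, “and”) that usually carry little meaning on their own. The sketch below uses a tiny hand-written stopword list to stay self-contained; real projects typically use the fuller lists shipped with libraries such as NLTK or spaCy.

Removing Stopwords
```python
# A tiny illustrative stopword list; NLTK and spaCy ship much fuller lists.
stopwords = {"is", "a", "the", "and", "of", "in", "to"}

text = "nlp is a branch of artificial intelligence"

clean_text = " ".join(word for word in text.split() if word not in stopwords)
print(clean_text)
```

Output:
nlp branch artificial intelligence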

Combining All Cleaning Steps

In real projects, we combine multiple steps into one pipeline.

Where to run this code:

  • Google Colab (recommended)
  • Jupyter Notebook
  • VS Code with Python

Complete Text Cleaning Function
import re
import string

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    text = " ".join(text.split())
    return text

sample = "NLP in 2024 is AWESOME!!!"
print(clean_text(sample))

Output:
nlp in is awesome

When Should You Clean Text?

  • Before vectorization
  • Before N-grams
  • Before ML/DL training
  • Before clustering or similarity

Text cleaning is the foundation of all NLP pipelines.
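
As a sketch of “clean before vectorization”, the example below applies a cleaning function like the one above before building simple Bag of Words counts with `collections.Counter` (the two sample sentences are made up for illustration):

```python
import re
import string
from collections import Counter

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    return " ".join(text.split())

docs = ["NLP is POWERFUL!!!", "Nlp, IS powerful."]

# Clean first, then count tokens: after cleaning, both
# documents produce identical token counts.
counts = [Counter(clean_text(doc).split()) for doc in docs]
print(counts[0])
print(counts[1])
```

Without the cleaning step, “NLP” and “Nlp,” would be counted as different tokens and the two documents would look unrelated to the model.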


Assignment / Homework

Practice Environment:

  • Google Colab
  • Jupyter Notebook

Tasks:

  • Clean a paragraph from a news article
  • Compare vocabulary size before and after cleaning
  • Apply cleaning before TF-IDF
  • Experiment with keeping vs removing numbers

Practice Questions

Q1. Why is lowercasing important?

It normalizes text so words are treated consistently.

Q2. Should numbers always be removed?

No. It depends on the problem and data.

Quick Quiz

Q1. What happens if text is not cleaned?

Noise increases and model accuracy decreases.

Q2. Which step removes symbols like “!” and “?”?

Removing punctuation.

Quick Recap

  • Raw text is noisy
  • Cleaning improves model performance
  • Lowercasing, punctuation removal are core steps
  • Cleaning happens before vectorization
  • Strong cleaning leads to strong NLP models