NLP Lesson 7 – Stemming | Dataplexa

Stemming

In the previous lesson, you learned how to remove low-information words using a stopword list. Now we go one step further and reduce the remaining words to a common base form.

This process is called stemming. It helps machines treat similar words as the same, even when their grammatical forms differ.

This lesson covers what stemming is, how it works, when to use it, and where it falls short.

What Is Stemming?

Stemming is a text preprocessing technique that reduces words to a base form, called the stem, by stripping suffixes.

The goal is not to produce a real dictionary word, but to bring related words to a common base.

Examples:

playing → play
played → play
connection → connect
connected → connect

After stemming, these words are treated as the same by NLP models.

Why Do We Use Stemming?

Stemming reduces vocabulary size and helps models generalize better by grouping related words.

Main advantages:

Reduces the number of unique words
Improves efficiency of classic NLP models
Helps in search and document matching

Without stemming, models may treat play and playing as completely different words.
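The vocabulary reduction described above can be sketched as follows. This is a minimal illustration, assuming NLTK is installed; the token list is a made-up toy corpus, not real data.

```python
from nltk.stem import PorterStemmer

# Toy corpus chosen for illustration only (an assumption, not a real dataset).
tokens = ["play", "playing", "played", "plays",
          "connect", "connected", "connection"]

stemmer = PorterStemmer()

# The vocabulary is the set of distinct tokens.
raw_vocab = set(tokens)
stemmed_vocab = {stemmer.stem(t) for t in tokens}

print(len(raw_vocab), "unique words before stemming")
print(len(stemmed_vocab), "unique stems after stemming")
print(sorted(stemmed_vocab))
```

Seven surface forms collapse into two stems here, which is the kind of grouping that lets a model treat play and playing as one feature.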

Real-Life Intuition

Think like a human:

When you hear the words:

run
running
ran

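You immediately understand they relate to the same action. Stemming gives machines a rough version of this ability: suffix stripping maps running to run, but an irregular form such as ran has no suffix to strip, so most stemmers leave it unchanged.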

How Stemming Works

Stemming works by applying rule-based suffix stripping. It does not understand grammar or meaning.

It simply removes common endings such as:

-ing
-ed
-ly
-es
-s

Because of this, stemming can sometimes produce non-real or incomplete words.
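To make the idea of suffix stripping concrete, here is a deliberately naive sketch in plain Python. It is not the Porter algorithm or any library's implementation; the function name naive_stem, the suffix order, and the min_stem_len parameter are assumptions made for illustration.

```python
# Suffixes are tried in this order, so "-es" is checked before "-s".
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "quickly", "boxes", "cats", "running", "making"]:
    print(w, "->", naive_stem(w))
```

The last two words come out as runn and mak, which shows why blind suffix removal yields incomplete words. Real stemmers add extra rules (for example, undoing doubled consonants) to handle some of these cases.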

Popular Stemming Algorithms

Different stemming algorithms exist. The most commonly used ones are:

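Porter Stemmer – most popular, balanced between accuracy and aggressiveness
Snowball Stemmer – improved version of Porter (also called Porter2), available for several languages
Lancaster Stemmer – most aggressive of the three, often produces short, hard-to-read stems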

For most NLP tasks, Porter or Snowball stemmers are preferred.
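A quick way to see how the three algorithms differ is to run them side by side. The sketch below assumes NLTK is installed and uses its PorterStemmer, SnowballStemmer, and LancasterStemmer classes; the word list is an arbitrary choice for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words; swap in your own to explore.
words = ["running", "generously", "maximum"]

for word in words:
    # Stem the same word with every algorithm and print the results together.
    results = {name: s.stem(word) for name, s in stemmers.items()}
    print(word, results)
```

Where the stems disagree, the shorter one usually comes from the more aggressive algorithm.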

Example: Text Without Stemming

Sentence:

"He is playing and played very well"

Tokens without stemming:

he
is
playing
and
played
very
well

Here, playing and played are treated as different words.

Example: Text With Stemming

After stemming, the two verb forms become:

playing → play
played → play

Now both forms point to the same concept.
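The same sentence can be pushed through a stemmer in a few lines. This is a minimal sketch, assuming NLTK is installed; splitting on whitespace stands in for a proper tokenizer.

```python
from nltk.stem import PorterStemmer

sentence = "He is playing and played very well"

# Whitespace splitting is a simplification; a real pipeline would use a tokenizer.
tokens = sentence.lower().split()

stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Print each token next to its stem for easy comparison.
for token, stem in zip(tokens, stems):
    print(f"{token:>8} -> {stem}")
```

Other tokens can change as well: Porter typically turns very into veri, a reminder that stems are not always dictionary words.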

Code Example: Stemming Using NLTK

You can run this code in:

Google Colab (recommended)
Jupyter Notebook
VS Code / PyCharm

Python Example: Porter Stemmer

from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()

words = ["playing", "played", "plays", "connection", "connected"]

# Apply the stemmer to each word in the list
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

Output:

['play', 'play', 'play', 'connect', 'connect']

Understanding the Output

Here:

Different word forms are reduced to the same stem
Vocabulary size becomes smaller
Counts for related word forms are pooled, so models see each pattern more often

This is especially useful in classic NLP pipelines like Bag of Words and TF-IDF.

Disadvantages of Stemming

Stemming is fast, but not perfect.

Main limitations:

Can produce non-real words
Ignores grammar and context
Can merge words with different meanings, which hurts tasks that depend on fine distinctions

Example:

university → univers (not a real word)
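A related problem is over-stemming, where words with different meanings collapse into one stem. The sketch below assumes NLTK is installed; the word pair is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Related spellings, different meanings.
words = ["university", "universe"]
stems = {w: stemmer.stem(w) for w in words}
print(stems)

# True when both words share one stem, so a model can no longer tell them apart.
print(len(set(stems.values())) == 1)
```

For tasks such as search over academic documents, merging these two words may return irrelevant matches, which is one reason stemming is not always the right choice.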

Stemming vs Stopword Removal

These two techniques solve different problems.

Aspect	Stopwords	Stemming
Purpose	Remove unnecessary words	Reduce words to base form
Example	is, the, and	playing → play
Output	Real words, fewer tokens	Stems that may not be dictionary words

Where Is Stemming Used?

Search engines
Document similarity
Topic modeling
Spam filtering
Classic ML-based NLP systems
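As a rough illustration of the search use case, the sketch below matches a query against documents by comparing stems instead of exact words. It assumes NLTK is installed; the documents, the stem_set helper, and the search function are hypothetical names invented for this example, not part of any library.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini document collection, for illustration only.
documents = {
    "doc1": "The team played well today",
    "doc2": "She is connecting the cables",
    "doc3": "Children love playing outside",
}


def stem_set(text):
    """Lowercase, split on whitespace, and return the set of stems."""
    return {stemmer.stem(tok) for tok in text.lower().split()}


# Precompute the stems of every document once.
index = {doc_id: stem_set(text) for doc_id, text in documents.items()}


def search(query):
    """Return ids of documents that contain every stemmed query term."""
    query_stems = stem_set(query)
    return sorted(d for d, stems in index.items() if query_stems <= stems)


print(search("plays"))
print(search("connected"))
```

The query plays finds documents containing played and playing even though the exact word never appears, because all three share the stem play.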

Assignment / Homework

Where to practice:

Google Colab
Jupyter Notebook

Your tasks:

Apply stemming on 10 different sentences
Compare text before and after stemming
Try Snowball stemmer and compare results
Note down incorrect stems you observe

Practice Questions

Q1. What is the goal of stemming?

To reduce related words to a common base form.

Q2. Does stemming always produce real words?

No, stemming may produce non-dictionary words.

Q3. Name one popular stemming algorithm.

Porter Stemmer.

Quick Quiz

Q1. Which is more aggressive: Porter or Lancaster?

Lancaster stemmer.

Q2. Should stemming always be used in NLP?

No, it depends on the task.

Quick Recap

Stemming reduces words to their base form
It is rule-based and fast
Helps classic NLP models
May produce imperfect words
Choose based on task requirements
