NLP Lesson 7 – Stemming | Dataplexa

Stemming

In the previous lesson, you learned how to remove stopwords, the common words that add little meaning. Now we go one step further and reduce the remaining words to a base form.

This process is called stemming. It helps machines treat similar words as the same, even when their grammatical forms differ.

In this lesson, you will learn what stemming is, how it works, when to use it, and where it falls short.


What Is Stemming?

Stemming is a text preprocessing technique where words are reduced to their root form by removing suffixes.

The goal is not to produce a real dictionary word, but to bring related words to a common base.

Examples:

  • playing → play
  • played → play
  • connection → connect
  • connected → connect

After stemming, the words in each group share one stem, so NLP models treat them as the same token.


Why Do We Use Stemming?

Stemming reduces vocabulary size and can help models generalize by grouping related word forms under one stem.

Main advantages:

  • Reduces number of unique words
  • Improves efficiency of classic NLP models
  • Helps in search and document matching

Without stemming, models may treat play and playing as completely different words.
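
The vocabulary reduction described above can be sketched in a few lines. This is a minimal illustration, assuming NLTK is installed; the token list is a made-up sample, not data from a real corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A tiny hypothetical token list; real corpora are far larger.
tokens = ["play", "playing", "played", "plays",
          "connect", "connected", "connection"]

raw_vocab = set(tokens)                             # unique surface forms
stemmed_vocab = {stemmer.stem(t) for t in tokens}   # unique stems

print(len(raw_vocab), "unique words before stemming")
print(len(stemmed_vocab), "unique stems after stemming")
print(sorted(stemmed_vocab))
```

Seven distinct word forms collapse into two stems here. On real text the reduction is smaller, but the effect is the same.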


Real-Life Intuition

Think like a human:

When you hear the words:

  • run
  • running
  • runs

You immediately understand they relate to the same action. Stemming helps machines do the same. Note that an irregular form such as ran has no suffix to strip, so a stemmer leaves it unchanged.


How Stemming Works

Stemming works by applying rule-based suffix stripping. It does not understand grammar or meaning.

It simply removes common endings such as:

  • -ing
  • -ed
  • -ly
  • -es
  • -s

Because of this, stemming can sometimes produce non-real or incomplete words.
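
To make the idea concrete, here is a deliberately naive suffix stripper in plain Python. It is only a sketch of the principle, not how Porter or any real stemmer is implemented: the function name naive_stem and the min_stem_len parameter are hypothetical, and real algorithms apply many more rules and conditions.

```python
# Endings from the list above, longest first so "-es" is tried before "-s".
SUFFIXES = ["ing", "ed", "ly", "es", "s"]

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["playing", "played", "quickly", "boxes", "plays", "running", "ran"]:
    print(w, "->", naive_stem(w))
```

With this sketch, running becomes runn, which is not a real word, and ran is left untouched because no rule matches it. Production stemmers add extra rules (for example, undoing doubled consonants) to reduce such cases, but they still work without any knowledge of meaning.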


Popular Stemming Algorithms

Different stemming algorithms exist. The most commonly used ones are:

  • Porter Stemmer – most popular, balanced
  • Snowball Stemmer – improved version of Porter (also called Porter2), available for several languages
  • Lancaster Stemmer – aggressive, faster but rough

For most NLP tasks, Porter or Snowball stemmers are preferred.
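
One way to see the differences is to run the same words through all three stemmers. The snippet below is a small comparison sketch, assuming NLTK is installed; the word list is arbitrary, and the exact stems depend on your NLTK version, so no output is shown.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # language name is required
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words chosen for illustration
words = ["playing", "connection", "generously", "maximum"]

for word in words:
    # Collect each algorithm's stem for the same input word
    results = {name: s.stem(word) for name, s in stemmers.items()}
    print(word, results)
```

Run it and look for words where the three results disagree; Lancaster usually produces the shortest stems.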


Example: Text Without Stemming

Sentence:

"He is playing and played very well"

Tokens without stemming:

  • he
  • is
  • playing
  • and
  • played
  • very
  • well

Here, playing and played are treated as different words.


Example: Text With Stemming

After stemming, the two verb forms become:

  • play
  • play

Now both forms point to the same concept.


Code Example: Stemming Using NLTK

You can run this code in:

  • Google Colab (recommended)
  • Jupyter Notebook
  • VS Code / PyCharm

Python Example: Porter Stemmer

# Requires NLTK: pip install nltk
from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()

words = ["playing", "played", "plays", "connection", "connected"]

# Apply the Porter rules to each word
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

Output:

['play', 'play', 'play', 'connect', 'connect']

Understanding the Output

Here:

  • Different word forms are reduced to the same stem
  • Vocabulary size becomes smaller
  • Models see more examples per stem, which can make patterns easier to learn

This is especially useful in classic NLP pipelines like Bag of Words and TF-IDF.
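
As a rough illustration of the Bag of Words case, the sketch below counts tokens from the earlier example sentence before and after stemming. It assumes NLTK is installed and uses a simple whitespace split instead of a proper tokenizer, purely to keep the example short.

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

sentence = "He is playing and played very well"
tokens = sentence.lower().split()  # whitespace split, a simplifying assumption

raw_counts = Counter(tokens)                                # bag of raw words
stemmed_counts = Counter(stemmer.stem(t) for t in tokens)   # bag of stems

print(raw_counts["playing"], raw_counts["played"])  # each form counted separately
print(stemmed_counts["play"])                       # both forms counted together
```

In the raw counts, playing and played each appear once in separate columns. After stemming, they contribute to a single play feature with a count of two.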


Disadvantages of Stemming

Stemming is fast, but not perfect.

Main limitations:

  • Can produce non-real words
  • Ignores grammar and context
  • May merge words with different meanings, which hurts tasks that depend on exact wording

Example:

university → univers (not a real word)
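
The same example also shows how stemming can merge unrelated words. The short check below, assuming NLTK's PorterStemmer, stems university and universe and compares the results.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Two words with clearly different meanings
pair = ["university", "universe"]
stems = [stemmer.stem(w) for w in pair]

print(stems)
print("same stem:", stems[0] == stems[1])
```

Both words end up as univers, so a model working on stems can no longer tell them apart. This effect is often called over-stemming.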


Stemming vs Stopword Removal

These two techniques solve different problems.

Aspect         | Stopword Removal         | Stemming
Purpose        | Remove unnecessary words | Reduce words to base form
Example        | is, the, and             | playing → play
Output quality | Clean text               | May be non-dictionary words

Where Is Stemming Used?

  • Search engines
  • Document similarity
  • Topic modeling
  • Spam filtering
  • Classic ML-based NLP systems
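
For the search use case, a minimal sketch might stem both the query and the documents and then look for shared stems. Everything here is hypothetical and simplified: the documents dictionary, the stem_set and search helpers, and the whitespace tokenization are assumptions for illustration, not how a real search engine is built.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_set(text):
    """Lowercase, split on whitespace, and return the set of stems."""
    return {stemmer.stem(tok) for tok in text.lower().split()}

# Hypothetical mini-collection of documents
documents = {
    "doc1": "she played football yesterday",
    "doc2": "the network connection failed",
    "doc3": "stock prices are rising",
}

def search(query):
    """Return ids of documents sharing at least one stem with the query."""
    query_stems = stem_set(query)
    return [doc_id for doc_id, text in documents.items()
            if query_stems & stem_set(text)]

print(search("playing"))    # query form differs from the form in doc1
print(search("connected"))  # query form differs from the form in doc2
```

Neither query word appears literally in any document, yet each query finds the relevant document because the stems match.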

Assignment / Homework

Where to practice:

  • Google Colab
  • Jupyter Notebook

Your tasks:

  • Apply stemming on 10 different sentences
  • Compare text before and after stemming
  • Try Snowball stemmer and compare results
  • Note down incorrect stems you observe

Practice Questions

Q1. What is the goal of stemming?

To reduce related words to a common base form.

Q2. Does stemming always produce real words?

No, stemming may produce non-dictionary words.

Q3. Name one popular stemming algorithm.

Porter Stemmer.

Quick Quiz

Q1. Which is more aggressive: Porter or Lancaster?

Lancaster stemmer.

Q2. Should stemming always be used in NLP?

No, it depends on the task.

Quick Recap

  • Stemming reduces words to their base form
  • It is rule-based and fast
  • Helps classic NLP models
  • May produce imperfect words
  • Decide whether to stem based on task requirements