Stemming
In the previous lesson, you learned how to remove stopwords, the common words that carry little meaning on their own. Now we go one step further and reduce the remaining words to their base form.
This process is called stemming. It helps machines treat similar words as the same, even when their grammatical forms differ.
In this lesson, you will clearly understand what stemming is, how it works, when to use it, and its limitations.
What Is Stemming?
Stemming is a text preprocessing technique that reduces words to their root form, called the stem, by removing suffixes.
The goal is not to produce a real dictionary word, but to bring related words to a common base.
Examples:
- playing → play
- played → play
- connection → connect
- connected → connect
After stemming, words that share a stem (playing and played, or connection and connected) are treated as the same token by NLP models.
Why Do We Use Stemming?
Stemming reduces vocabulary size and can help models generalize by grouping related word forms under one stem.
Main advantages:
- Reduces the number of unique words
- Improves the efficiency of classic NLP models
- Helps in search and document matching
Without stemming, a model that works on raw tokens treats play and playing as two unrelated words.
Real-Life Intuition
Think like a human:
When you hear the words:
- run
- running
- ran
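You immediately understand that they all refer to the same action. Stemming gives machines a rough version of this ability: it can link running to run by stripping the suffix, but an irregular form such as ran has no suffix to strip and is left unchanged.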
How Stemming Works
Stemming works by applying rule-based suffix stripping. It does not understand grammar or meaning.
It simply removes common endings such as:
- -ing
- -ed
- -ly
- -es
- -s
Because of this, stemming can sometimes produce non-real or incomplete words.
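To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. It is a toy written for this lesson, not the Porter algorithm: the function name naive_stem and the min_stem_len parameter are assumptions made for illustration, and real stemmers apply many more rules and conditions.

```python
# Toy suffix stripper: illustrative only, NOT the Porter algorithm.
# The endings are the ones listed above, checked in this order.
SUFFIXES = ["ing", "ed", "ly", "es", "s"]


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left over.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "quickly", "running", "ran"]:
    print(w, "->", naive_stem(w))
```

With these rules, playing, played and plays all collapse to play, but running becomes runn and ran is left untouched. That is exactly the kind of rough output a purely rule-based approach can produce.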
Popular Stemming Algorithms
Different stemming algorithms exist. The most commonly used ones are:
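- Porter Stemmer – the classic and most widely used; moderate in how much it strips
- Snowball Stemmer – a refined successor to Porter (its English stemmer is also known as Porter2), with support for several languages
- Lancaster Stemmer – the most aggressive of the three; its stems are often shorter and rougher
For most NLP tasks, Porter or Snowball is a reasonable default.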
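One way to get a feel for the differences is to run the same words through all three stemmers. The sketch below assumes NLTK is installed and uses its PorterStemmer, SnowballStemmer and LancasterStemmer classes; the word list is an arbitrary choice for illustration, and the exact stems may vary slightly between NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each algorithm; Snowball needs a language name.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words chosen for illustration.
words = ["playing", "connection", "university", "generously"]

# One entry per word, holding the stem produced by each algorithm.
comparison = {
    word: {name: s.stem(word) for name, s in stemmers.items()}
    for word in words
}

for word, stems in comparison.items():
    print(word, stems)
```

Reading the printed rows side by side shows where the algorithms agree and where one of them cuts more of the word away.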
Example: Text Without Stemming
Sentence:
"He is playing and played very well"
Tokens without stemming:
- he
- is
- playing
- and
- played
- very
- well
Here, playing and played are treated as different words.
Example: Text With Stemming
After stemming, playing and played both become:
- play
- play
Now both forms point to the same concept.
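The same sentence can be stemmed token by token. This is a small sketch that assumes NLTK is installed; splitting on whitespace is a simplification chosen for brevity, and a real pipeline would normally use a proper tokenizer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "He is playing and played very well"

# Whitespace splitting is a simplifying assumption for this example.
tokens = sentence.lower().split()

# Stem every token, not just the ones we expect to change.
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

Both playing and played come out as play. Note that the stemmer is applied to every token, so other words can change too: with NLTK's Porter implementation, very becomes veri.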
Code Example: Stemming Using NLTK
You can run this code in:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code / PyCharm
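```python
# Requires NLTK (pip install nltk); the Porter stemmer needs no extra data downloads.
from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every word.
stemmer = PorterStemmer()
words = ["playing", "played", "plays", "connection", "connected"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
```
Output:
```
['play', 'play', 'play', 'connect', 'connect']
```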
Understanding the Output
Here:
- Different word forms are reduced to the same stem
- Vocabulary size becomes smaller
- Models can learn patterns more effectively
This is especially useful in classic NLP pipelines like Bag of Words and TF-IDF.
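The effect on vocabulary size can be checked by counting tokens before and after stemming. The sketch below is a hand-rolled bag-of-words count using collections.Counter, not a full vectorizer, and it assumes NLTK is installed.

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["playing", "played", "plays", "connection", "connected"]

# Bag of words on the raw tokens: every surface form is its own entry.
raw_counts = Counter(tokens)

# Bag of words on the stems: related forms share one entry.
stem_counts = Counter(stemmer.stem(t) for t in tokens)

print(len(raw_counts), dict(raw_counts))
print(len(stem_counts), dict(stem_counts))
```

Five distinct tokens shrink to two stems, play and connect, so a model built on these counts has fewer features to learn from the same text.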
Disadvantages of Stemming
Stemming is fast, but not perfect.
Main limitations:
- Can produce non-real words
- Ignores grammar and context
- May reduce meaning in some tasks
Example:
university → univers (not a real word)
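The loss of meaning is easiest to see when words with different meanings end up with the same stem. A short check with NLTK's PorterStemmer (assuming NLTK is installed; the three words are chosen for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Three words with clearly different meanings.
related = ["university", "universe", "universal"]

stems = {word: stemmer.stem(word) for word in related}
print(stems)
```

All three words are reduced to univers, so after stemming a model can no longer tell them apart. For tasks where that distinction matters, this is a real cost.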
Stemming vs Stopword Removal
These two techniques solve different problems.
| Aspect | Stopwords | Stemming |
|---|---|---|
| Purpose | Remove unnecessary words | Reduce words to base form |
| Example | is, the, and | playing → play |
| Output quality | Clean text | May be non-dictionary words |
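Because the two techniques solve different problems, they are often applied one after the other. The sketch below is one possible ordering, not a prescribed pipeline: the STOPWORDS set is a tiny hypothetical list written for this example, preprocess is an illustrative helper name, and NLTK is assumed to be installed.

```python
from nltk.stem import PorterStemmer

# Hypothetical mini stopword list for illustration; real lists are much longer.
STOPWORDS = {"he", "is", "and", "very"}

stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, split on whitespace, drop stopwords, then stem what is left."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOPWORDS]
    return [stemmer.stem(t) for t in kept]


print(preprocess("He is playing and played very well"))
```

Stopword removal decides which tokens survive, and stemming decides what form the surviving tokens take.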
Where Is Stemming Used?
- Search engines
- Document similarity
- Topic modeling
- Spam filtering
- Classic ML-based NLP systems
Assignment / Homework
Where to practice:
- Google Colab
- Jupyter Notebook
Your tasks:
- Apply stemming to 10 different sentences
- Compare text before and after stemming
- Try Snowball stemmer and compare results
- Note down incorrect stems you observe
Practice Questions
Q1. What is the goal of stemming?
Q2. Does stemming always produce real words?
Q3. Name one popular stemming algorithm.
Quick Quiz
Q1. Which is more aggressive: Porter or Lancaster?
Q2. Should stemming always be used in NLP?
Quick Recap
- Stemming reduces words to their base form
- It is rule-based and fast
- Helps classic NLP models
- May produce imperfect words
- Whether to use it depends on the requirements of the task