AI Course
Stopwords, Stemming, and Lemmatization
After tokenization, text still contains many words that may not contribute much meaning to analysis. Words like “is”, “the”, “and”, or “of” appear very frequently but usually do not help in understanding intent or sentiment. NLP uses techniques like stopword removal, stemming, and lemmatization to reduce noise and improve clarity.
This lesson explains why these techniques are used, how they differ, and how they are applied in real NLP systems.
Real-World Connection
When a search engine processes your query, it focuses on important words and often ignores filler words. For example, in the query “best phone to buy in 2026”, words like “to” and “in” are ignored, while “best”, “phone”, and “buy” matter. This filtering is done using stopwords and word normalization techniques.
What Are Stopwords?
Stopwords are common words that usually carry little meaning on their own. Removing them helps models focus on important terms.
- Examples: is, the, and, a, an, of, to
- Reduces vocabulary size
- Improves processing speed
Stopword Removal Example
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "This is a simple example of stopword removal"
words = sentence.lower().split()  # lowercase first, then split on whitespace
filtered_words = [word for word in words if word not in ENGLISH_STOP_WORDS]
print(filtered_words)
What the Code Is Doing
The sentence is lowercased, split into words, and any word found in scikit-learn's built-in English stopword list is dropped. The remaining words carry more meaningful information.
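To see why bullet two above matters, the sketch below counts how much the vocabulary of a tiny made-up two-sentence corpus shrinks after stopword removal (the corpus and the counts are illustrative, not from the lesson):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Tiny illustrative corpus (hypothetical example sentences)
corpus = [
    "the cat sat on the mat",
    "the dog ran to the park",
]

# Collect the unique words, then drop the stopwords
tokens = [word for doc in corpus for word in doc.split()]
vocab = set(tokens)
vocab_no_stop = {word for word in vocab if word not in ENGLISH_STOP_WORDS}

print(len(vocab), len(vocab_no_stop))  # vocabulary size before and after
```

Even on two short sentences, removing "the", "on", and "to" noticeably shrinks the vocabulary, which is exactly the size and speed benefit listed above.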
What Is Stemming?
Stemming reduces words to their base form by removing suffixes. The resulting word may not always be a real dictionary word.
- Fast and simple
- May produce incorrect word forms
- Useful for basic text analysis
Stemming Example
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran", "runs"]
stems = [stemmer.stem(word) for word in words]  # apply the stemmer to each word
print(stems)  # ['run', 'runner', 'ran', 'run']
Understanding Stemming Output
Stemming cuts words mechanically. While fast, it may not always preserve the actual meaning of the word.
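The point above about not preserving meaning is easiest to see with words whose stems are not real dictionary words. A small sketch (the word list is an illustrative choice, not from the lesson):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter stemming strips suffixes by rule, so the result is
# often not a dictionary word at all:
for word in ["studies", "flies", "happiness"]:
    print(word, "->", stemmer.stem(word))
# studies -> studi
# flies -> fli
# happiness -> happi
```

Outputs like "studi" and "fli" are fine for matching related words inside a model, but they are not words a human would recognize, which is why the next technique exists.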
What Is Lemmatization?
Lemmatization reduces words to their meaningful base form, called a lemma. Unlike stemming, lemmatization considers grammar and context.
- Produces real dictionary words
- More accurate than stemming
- Requires linguistic knowledge
Lemmatization Example
import nltk
nltk.download("wordnet", quiet=True)  # lemma dictionary (one-time download)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The lemmatizer needs a part-of-speech tag and defaults to noun,
# so "running" and "better" must be tagged as verb and adjective.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("cars"))              # car
Stemming vs Lemmatization
- Stemming is rule-based and fast
- Lemmatization is grammar-aware and accurate
- Stemming may produce non-words
- Lemmatization produces valid words
When to Use Each Technique
The choice depends on the application:
- Use stopwords for noise reduction
- Use stemming for speed-focused tasks
- Use lemmatization for semantic understanding
Practice Questions
Practice 1: What do we call common words removed from text?
Practice 2: Which technique removes suffixes mechanically?
Practice 3: Which technique returns dictionary words?
Quick Quiz
Quiz 1: Why are stopwords removed?
Quiz 2: Which method may produce non-dictionary words?
Quiz 3: Which technique considers grammar and meaning?
Coming up next: Bag of Words and TF-IDF — converting text into numerical features.