AI Course
Stopwords, Stemming, and Lemmatization
After tokenization, text still contains many words that may not contribute much meaning to analysis. Words like “is”, “the”, “and”, or “of” appear very frequently but usually do not help in understanding intent or sentiment. NLP uses techniques like stopword removal, stemming, and lemmatization to reduce noise and improve clarity.
This lesson explains why these techniques are used, how they differ, and how they are applied in real NLP systems.
Real-World Connection
When a search engine processes your query, it focuses on important words and often ignores filler words. For example, in the query “best phone to buy in 2026”, words like “to” and “in” are ignored, while “best”, “phone”, and “buy” matter. This filtering is done using stopwords and word normalization techniques.
What Are Stopwords?
Stopwords are common words that usually carry little meaning on their own. Removing them helps models focus on important terms.
- Examples: is, the, and, a, an, of, to
- Reduces vocabulary size
- Improves processing speed
Stopword Removal Example
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "This is a simple example of stopword removal"
words = sentence.lower().split()  # lowercase first, then split on whitespace
filtered_words = [word for word in words if word not in ENGLISH_STOP_WORDS]
print(filtered_words)
What the Code Is Doing
The sentence is lowercased, split into words, and any word found in scikit-learn's built-in English stopword list is dropped. The remaining words carry more meaningful information.
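To see why bullet two above matters, the sketch below counts how much the vocabulary of a tiny made-up two-sentence corpus shrinks after stopword removal (the corpus and the counts are illustrative, not from the lesson):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Tiny illustrative corpus (hypothetical example sentences)
corpus = [
    "the cat sat on the mat",
    "the dog ran to the park",
]

# Collect the unique words, then drop the stopwords
tokens = [word for doc in corpus for word in doc.split()]
vocab = set(tokens)
vocab_no_stop = {word for word in vocab if word not in ENGLISH_STOP_WORDS}

print(len(vocab), len(vocab_no_stop))  # vocabulary size before and after
```

Even on two short sentences, removing "the", "on", and "to" noticeably shrinks the vocabulary, which is exactly the size and speed benefit listed above.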
What Is Stemming?
Stemming reduces words to their base form by removing suffixes. The resulting word may not always be a real dictionary word.
- Fast and simple
- May produce incorrect word forms
- Useful for basic text analysis
Stemming Example
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran", "runs"]
stems = [stemmer.stem(word) for word in words]  # apply the stemmer to each word
print(stems)  # ['run', 'runner', 'ran', 'run']
Understanding Stemming Output
Stemming cuts words mechanically. While fast, it may not always preserve the actual meaning of the word.
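The point above about not preserving meaning is easiest to see with words whose stems are not real dictionary words. A small sketch (the word list is an illustrative choice, not from the lesson):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter stemming strips suffixes by rule, so the result is
# often not a dictionary word at all:
for word in ["studies", "flies", "happiness"]:
    print(word, "->", stemmer.stem(word))
# studies -> studi
# flies -> fli
# happiness -> happi
```

Outputs like "studi" and "fli" are fine for matching related words inside a model, but they are not words a human would recognize, which is why the next technique exists.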
What Is Lemmatization?
Lemmatization reduces words to their meaningful base form, called a lemma. Unlike stemming, lemmatization considers grammar and context.
- Produces real dictionary words
- More accurate than stemming
- Requires linguistic knowledge
Lemmatization Example
import nltk
nltk.download("wordnet", quiet=True)  # lemma dictionary (one-time download)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The lemmatizer needs a part-of-speech tag and defaults to noun,
# so "running" and "better" must be tagged as verb and adjective.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("cars"))              # car
Stemming vs Lemmatization
- Stemming is rule-based and fast
- Lemmatization is grammar-aware and accurate
- Stemming may produce non-words
- Lemmatization produces valid words
When to Use Each Technique
The choice depends on the application:
- Use stopwords for noise reduction
- Use stemming for speed-focused tasks
- Use lemmatization for semantic understanding
Practice Questions
Practice 1: What do we call common words removed from text?
Practice 2: Which technique removes suffixes mechanically?
Practice 3: Which technique returns dictionary words?
Quick Quiz
Quiz 1: Why are stopwords removed?
Quiz 2: Which method may produce non-dictionary words?
Quiz 3: Which technique considers grammar and meaning?
Coming up next: Bag of Words and TF-IDF — converting text into numerical features.