AI Lesson 64 – Stopwords, Stemming & Lemmatization | Dataplexa

Stopwords, Stemming, and Lemmatization

After tokenization, text still contains many words that may not contribute much meaning to analysis. Words like “is”, “the”, “and”, or “of” appear very frequently but usually do not help in understanding intent or sentiment. NLP uses techniques like stopword removal, stemming, and lemmatization to reduce noise and improve clarity.

This lesson explains why these techniques are used, how they differ, and how they are applied in real NLP systems.

Real-World Connection

When a search engine processes your query, it focuses on important words and often ignores filler words. For example, in the query “best phone to buy in 2026”, words like “to” and “in” are ignored, while “best”, “phone”, and “buy” matter. This filtering is done using stopword removal and word normalization techniques.

What Are Stopwords?

Stopwords are common words that usually carry little meaning on their own. Removing them helps models focus on important terms.

  • Examples: is, the, and, a, an, of, to
  • Reduces vocabulary size
  • Improves processing speed

Stopword Removal Example


from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "This is a simple example of stopword removal"
words = sentence.lower().split()

# Keep only the words that are not in sklearn's built-in English stopword list
filtered_words = [word for word in words if word not in ENGLISH_STOP_WORDS]
print(filtered_words)

Output:
['simple', 'example', 'stopword', 'removal']

What the Code Is Doing

The sentence is lowercased, split into words, and common stopwords are filtered out. The remaining words carry more meaningful information.

What Is Stemming?

Stemming reduces words to their base form by removing suffixes. The resulting word may not always be a real dictionary word.

  • Fast and simple
  • May produce incorrect word forms
  • Useful for basic text analysis

Stemming Example


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran", "runs"]

# Apply Porter's suffix-stripping rules to each word
stems = [stemmer.stem(word) for word in words]
print(stems)

Output:
['run', 'runner', 'ran', 'run']

Understanding Stemming Output

Stemming cuts suffixes off mechanically, without checking a dictionary. That is why “runner” keeps its “-er” and the irregular form “ran” is untouched: Porter’s rules only strip recognized suffixes, so the result may not always preserve the actual meaning of the word.
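The mechanical cutting is easiest to see on words whose stems are not dictionary words. A short sketch reusing the same PorterStemmer on a few illustrative words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Porter's rules can leave fragments that are not real English words
examples = ["studies", "caring", "university", "flies"]
print([stemmer.stem(w) for w in examples])
# ['studi', 'care', 'univers', 'fli']
```

“caring” happens to come out as a real word, but “studi”, “univers”, and “fli” do not exist in any dictionary; for search or clustering this is often acceptable, since all variants still map to the same string.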

What Is Lemmatization?

Lemmatization reduces words to their meaningful base form, called a lemma. Unlike stemming, lemmatization considers grammar and context.

  • Produces real dictionary words
  • More accurate than stemming
  • Requires linguistic knowledge

Lemmatization Example


from nltk.stem import WordNetLemmatizer  # requires the WordNet corpus: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["running", "better", "cars"]

# With no part-of-speech tag, lemmatize() treats every word as a noun
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)

Output:
['running', 'better', 'car']

Stemming vs Lemmatization

  • Stemming is rule-based and fast
  • Lemmatization is grammar-aware and accurate
  • Stemming may produce non-words
  • Lemmatization produces valid words

When to Use Each Technique

The choice depends on the application:

  • Use stopwords for noise reduction
  • Use stemming for speed-focused tasks
  • Use lemmatization for semantic understanding

Practice Questions

Practice 1: What do we call common words removed from text?



Practice 2: Which technique removes suffixes mechanically?



Practice 3: Which technique returns dictionary words?



Quick Quiz

Quiz 1: Why are stopwords removed?





Quiz 2: Which method may produce non-dictionary words?





Quiz 3: Which technique considers grammar and meaning?





Coming up next: Bag of Words and TF-IDF — converting text into numerical features.