TF-IDF (Term Frequency – Inverse Document Frequency)
In the previous lesson, we learned the Bag of Words (BoW) model. BoW is simple and powerful, but it has one big weakness:
All words were treated equally.
Common words like “is”, “the”, “and” got the same importance as meaningful words like “fraud”, “cancer”, “nlp”.
TF-IDF solves this exact problem by giving less importance to common words and more importance to rare but meaningful words.
Why Do We Need TF-IDF?
Let us understand the problem first.
Imagine a document collection:
- The word “the” appears in almost every document
- The word “neural” appears only in NLP documents
Which word is more useful for classification?
Obviously, “neural”. TF-IDF mathematically captures this idea.
What Is TF-IDF?
TF-IDF is a numerical statistic that reflects:
- How important a word is in a document
- Relative to the entire document collection
It is a combination of two parts:
- TF – Term Frequency
- IDF – Inverse Document Frequency
Term Frequency (TF)
Term Frequency measures how often a word appears in a document.
Basic idea:
More occurrences → higher importance inside that document
Simple formula (raw count):
TF(word) = Number of times the word appears in the document
A common variant normalizes by document length: TF(word) = count / total words in the document, so longer documents do not dominate just by having more words.
Example:
- Sentence: “NLP NLP is powerful”
- Raw TF(NLP) = 2; normalized TF(NLP) = 2 / 4 = 0.5
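Counting raw term frequency takes only a couple of lines. A minimal sketch (whitespace tokenization and lowercasing are illustrative choices, not the only option):

```python
# Raw term frequency: count each token's occurrences in one document.
from collections import Counter

sentence = "NLP NLP is powerful"
tokens = sentence.lower().split()  # simple whitespace tokenizer
tf = Counter(tokens)

print(tf["nlp"])  # 2
```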
Inverse Document Frequency (IDF)
IDF measures how rare a word is across all documents.
If a word appears in many documents, it is less useful for distinguishing them.
Intuition:
- Rare word → high IDF
- Common word → low IDF
Conceptual formula:
IDF(word) = log(Total documents / Documents containing the word)
So common words automatically get penalized.
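The conceptual IDF formula can be sketched directly in Python (the three-document toy corpus here is illustrative):

```python
# Conceptual IDF: log(total documents / documents containing the word).
import math

documents = [
    "the cat sat",
    "the dog ran",
    "neural networks learn",
]

def idf(word, docs):
    # df = number of documents that contain the word at least once
    df = sum(1 for d in docs if word in d.lower().split())
    return math.log(len(docs) / df)

print(idf("the", documents))     # log(3/2) ≈ 0.405 (common word → low IDF)
print(idf("neural", documents))  # log(3/1) ≈ 1.099 (rare word → high IDF)
```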
TF-IDF = TF × IDF
TF-IDF combines both ideas:
- Word must be important in the document (TF)
- Word must be rare across documents (IDF)
Only words satisfying both get high scores.
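Multiplying the two parts together can be sketched in a few lines of plain Python (the toy corpus and the helper name `tfidf` are illustrative):

```python
# TF-IDF = TF × IDF, computed by hand on a toy corpus.
import math
from collections import Counter

docs = ["nlp is powerful", "nlp is fun", "statistics is useful"]

def tfidf(word, doc, docs):
    tf = Counter(doc.split())[word]                 # term frequency in this doc
    df = sum(1 for d in docs if word in d.split())  # documents containing word
    idf = math.log(len(docs) / df)                  # inverse document frequency
    return tf * idf

# "is" appears in every document → idf = log(3/3) = 0 → score 0
print(tfidf("is", docs[0], docs))        # 0.0
# "powerful" appears in one document → idf = log(3/1) → score ≈ 1.099
print(tfidf("powerful", docs[0], docs))
```

A word scores high only when its TF and its IDF are both high, which is exactly the "important here, rare elsewhere" idea.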
BoW vs TF-IDF (Quick Comparison)
| Aspect | Bag of Words | TF-IDF |
|---|---|---|
| Word importance | Equal for all words | Weighted importance |
| Common words | High influence | Low influence |
| Rare words | Same weight | Higher weight |
| Use case | Baseline models | Better text classification |
TF-IDF Using Python (TfidfVectorizer)
You can run this code using:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
We will:
- Convert text into TF-IDF vectors
- Observe how weights differ from BoW
```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "I love NLP",
    "NLP is powerful",
    "I love learning NLP"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nTF-IDF Matrix:")
print(X.toarray())
```
Output (values rounded to three decimals):
Vocabulary:
['is' 'learning' 'love' 'nlp' 'powerful']
TF-IDF Matrix:
[[0.    0.    0.790 0.613 0.   ]
 [0.652 0.    0.    0.385 0.652]
 [0.    0.720 0.548 0.425 0.   ]]
Note that “I” is missing from the vocabulary: the default tokenizer keeps only tokens of two or more characters.
How to Understand This Output
Unlike BoW, the values are no longer simple counts.
- Each number is a weight
- Higher value = more important word in that document
- Common words receive lower weights
Notice:
- “nlp” appears in all three documents → it gets the lowest weight in every row
- “learning” appears in only one document → it gets the highest weight in its row
The numbers differ slightly from the conceptual formula because sklearn uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and then L2-normalizes each row.
This weighting is why TF-IDF often performs better than BoW.
Where TF-IDF Is Used
- Search engines
- Document classification
- Information retrieval
- Spam detection
- Resume screening systems
Limitations of TF-IDF
Even TF-IDF has limitations:
- Does not capture word meaning
- Does not understand context
- Word order is still ignored
These problems are later solved using word embeddings and transformers.
Assignment / Homework
Where to practice:
- Google Colab
- Jupyter Notebook
Tasks:
- Apply TF-IDF on 10 sentences
- Compare BoW vs TF-IDF weights
- Remove stopwords and observe changes
- Try different n-gram ranges
Practice Questions
Q1. What problem does TF-IDF solve?
Q2. What does IDF measure?
Q3. Is TF-IDF better than BoW?
Quick Quiz
Q1. Does TF-IDF understand word meaning?
Q2. Which sklearn class is used for TF-IDF?
Quick Recap
- TF-IDF improves Bag of Words
- TF measures importance within a document
- IDF penalizes common words
- Widely used in search and classification
- Foundation for embeddings and deep NLP