TF-IDF (Term Frequency – Inverse Document Frequency)
In the previous lesson, we learned the Bag of Words (BoW) model. BoW is simple and powerful, but it has one big weakness:
All words were treated equally.
Common words like “is”, “the”, “and” got the same importance as meaningful words like “fraud”, “cancer”, “nlp”.
TF-IDF solves this exact problem by giving less importance to common words and more importance to rare but meaningful words.
Why Do We Need TF-IDF?
Let us understand the problem first.
Imagine a document collection:
- The word “the” appears in almost every document
- The word “neural” appears only in NLP documents
Which word is more useful for classification?
Obviously, “neural”. TF-IDF mathematically captures this idea.
What Is TF-IDF?
TF-IDF is a numerical statistic that reflects:
- How important a word is in a document
- Relative to the entire document collection
It is a combination of two parts:
- TF – Term Frequency
- IDF – Inverse Document Frequency
Term Frequency (TF)
Term Frequency measures how often a word appears in a document.
Basic idea:
More occurrences → higher importance inside that document
Simple formula (raw count):
TF(word) = Number of times the word appears in the document
A common variant normalizes by document length: TF(word) = count / total words in the document, so longer documents do not dominate just by having more words.
Example:
- Sentence: “NLP NLP is powerful”
- Raw TF(NLP) = 2; normalized TF(NLP) = 2 / 4 = 0.5
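Counting raw term frequency takes only a couple of lines. A minimal sketch (whitespace tokenization and lowercasing are illustrative choices, not the only option):

```python
# Raw term frequency: count each token's occurrences in one document.
from collections import Counter

sentence = "NLP NLP is powerful"
tokens = sentence.lower().split()  # simple whitespace tokenizer
tf = Counter(tokens)

print(tf["nlp"])  # 2
```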
Inverse Document Frequency (IDF)
IDF measures how rare a word is across all documents.
If a word appears in many documents, it is less useful for distinguishing them.
Intuition:
- Rare word → high IDF
- Common word → low IDF
Conceptual formula:
IDF(word) = log(Total documents / Documents containing the word)
So common words automatically get penalized.
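The conceptual IDF formula can be sketched directly in Python (the three-document toy corpus here is illustrative):

```python
# Conceptual IDF: log(total documents / documents containing the word).
import math

documents = [
    "the cat sat",
    "the dog ran",
    "neural networks learn",
]

def idf(word, docs):
    # df = number of documents that contain the word at least once
    df = sum(1 for d in docs if word in d.lower().split())
    return math.log(len(docs) / df)

print(idf("the", documents))     # log(3/2) ≈ 0.405 (common word → low IDF)
print(idf("neural", documents))  # log(3/1) ≈ 1.099 (rare word → high IDF)
```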
TF-IDF = TF × IDF
TF-IDF combines both ideas:
- Word must be important in the document (TF)
- Word must be rare across documents (IDF)
Only words satisfying both get high scores.
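Multiplying the two parts together can be sketched in a few lines of plain Python (the toy corpus and the helper name `tfidf` are illustrative):

```python
# TF-IDF = TF × IDF, computed by hand on a toy corpus.
import math
from collections import Counter

docs = ["nlp is powerful", "nlp is fun", "statistics is useful"]

def tfidf(word, doc, docs):
    tf = Counter(doc.split())[word]                 # term frequency in this doc
    df = sum(1 for d in docs if word in d.split())  # documents containing word
    idf = math.log(len(docs) / df)                  # inverse document frequency
    return tf * idf

# "is" appears in every document → idf = log(3/3) = 0 → score 0
print(tfidf("is", docs[0], docs))        # 0.0
# "powerful" appears in one document → idf = log(3/1) → score ≈ 1.099
print(tfidf("powerful", docs[0], docs))
```

A word scores high only when its TF and its IDF are both high, which is exactly the "important here, rare elsewhere" idea.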
BoW vs TF-IDF (Quick Comparison)
| Aspect | Bag of Words | TF-IDF |
|---|---|---|
| Word importance | Equal for all words | Weighted importance |
| Common words | High influence | Low influence |
| Rare words | Same weight | Higher weight |
| Use case | Baseline models | Better text classification |
TF-IDF Using Python (TfidfVectorizer)
You can run this code using:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
We will:
- Convert text into TF-IDF vectors
- Observe how weights differ from BoW
```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "I love NLP",
    "NLP is powerful",
    "I love learning NLP"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nTF-IDF Matrix:")
print(X.toarray())
```
Output (values rounded to three decimals):
Vocabulary:
['is' 'learning' 'love' 'nlp' 'powerful']
TF-IDF Matrix:
[[0.    0.    0.790 0.613 0.   ]
 [0.652 0.    0.    0.385 0.652]
 [0.    0.720 0.548 0.425 0.   ]]
Note that “I” is missing from the vocabulary: the default tokenizer keeps only tokens of two or more characters.
How to Understand This Output
Unlike BoW, the values are no longer simple counts.
- Each number is a weight
- Higher value = more important word in that document
- Common words receive lower weights
Notice:
- “nlp” appears in all three documents → it gets the lowest weight in every row
- “learning” appears in only one document → it gets the highest weight in its row
The numbers differ slightly from the conceptual formula because sklearn uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and then L2-normalizes each row.
This weighting is why TF-IDF often performs better than BoW.
Where TF-IDF Is Used
- Search engines
- Document classification
- Information retrieval
- Spam detection
- Resume screening systems
Limitations of TF-IDF
Even TF-IDF has limitations:
- Does not capture word meaning
- Does not understand context
- Word order is still ignored
These problems are later solved using word embeddings and transformers.
Assignment / Homework
Where to practice:
- Google Colab
- Jupyter Notebook
Tasks:
- Apply TF-IDF on 10 sentences
- Compare BoW vs TF-IDF weights
- Remove stopwords and observe changes
- Try different n-gram ranges
Practice Questions
Q1. What problem does TF-IDF solve?
Q2. What does IDF measure?
Q3. Is TF-IDF better than BoW?
Quick Quiz
Q1. Does TF-IDF understand word meaning?
Q2. Which sklearn class is used for TF-IDF?
Quick Recap
- TF-IDF improves Bag of Words
- TF measures importance within a document
- IDF penalizes common words
- Widely used in search and classification
- Foundation for embeddings and deep NLP