NLP Lesson 10 – TF-IDF | Dataplexa

TF-IDF (Term Frequency – Inverse Document Frequency)

In the previous lesson, we learned Bag of Words (BoW). BoW was simple and powerful, but it had one big weakness:

All words were treated equally.

Common words like “is”, “the”, “and” got the same importance as meaningful words like “fraud”, “cancer”, “nlp”.

TF-IDF solves this exact problem by giving less importance to common words and more importance to rare but meaningful words.


Why Do We Need TF-IDF?

Let us understand the problem first.

Imagine a document collection:

  • The word “the” appears in almost every document
  • The word “neural” appears only in NLP documents

Which word is more useful for classification?

Obviously, “neural”. TF-IDF mathematically captures this idea.


What Is TF-IDF?

TF-IDF is a numerical statistic that reflects:

  • How important a word is in a document
  • Relative to the entire document collection

It is a combination of two parts:

  • TF – Term Frequency
  • IDF – Inverse Document Frequency

Term Frequency (TF)

Term Frequency measures how often a word appears in a document.

Basic idea:

More occurrences → higher importance inside that document

Simple formula:

TF(word) = Number of times word appears in document

In practice, TF is often normalized by document length so long documents do not dominate:

TF(word) = (Number of times word appears) / (Total words in document)

Example:

  • Sentence: “NLP NLP is powerful”
  • TF(NLP) = 2 as a raw count, or 2/4 = 0.5 normalized
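The raw-count version of TF above can be sketched in a few lines. Note that tf() here is a hypothetical helper written for illustration, not a library function:

```python
# Minimal sketch of raw term frequency: count occurrences of a word
# in a single document after lowercasing and whitespace tokenization.
def tf(word, document):
    tokens = document.lower().split()
    return tokens.count(word.lower())

print(tf("NLP", "NLP NLP is powerful"))  # 2
```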

Inverse Document Frequency (IDF)

IDF measures how rare a word is across all documents.

If a word appears in many documents, it is less useful for distinguishing them.

Intuition:

  • Rare word → high IDF
  • Common word → low IDF

Conceptual formula:

IDF(word) = log(Total documents / Documents containing the word)

So common words automatically get penalized. (Libraries such as scikit-learn use a smoothed variant of this formula so the result is never a division by zero and never exactly drops a term.)
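The conceptual formula above can be sketched directly. Again, idf() is a hypothetical helper for illustration; real libraries add smoothing:

```python
import math

# Minimal sketch of the conceptual IDF formula:
# log(total documents / documents containing the word).
def idf(word, documents):
    containing = sum(1 for doc in documents
                     if word.lower() in doc.lower().split())
    return math.log(len(documents) / containing)

docs = ["I love NLP", "NLP is powerful", "I love learning NLP"]
print(idf("nlp", docs))       # log(3/3) = 0.0 -> appears everywhere
print(idf("powerful", docs))  # log(3/1) ≈ 1.0986 -> rare, high IDF
```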


TF-IDF = TF × IDF

TF-IDF combines both ideas:

  • Word must be important in the document (TF)
  • Word must be rare across documents (IDF)

Only words satisfying both get high scores.
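Putting the two parts together gives the full score. The tfidf() helper below is a hypothetical sketch of the unsmoothed textbook formula, not scikit-learn's exact computation:

```python
import math

# Minimal sketch: TF-IDF = (raw term count) * log(N / document frequency).
def tfidf(word, document, documents):
    term_count = document.lower().split().count(word.lower())
    doc_freq = sum(1 for d in documents
                   if word.lower() in d.lower().split())
    return term_count * math.log(len(documents) / doc_freq)

docs = ["I love NLP", "NLP is powerful", "I love learning NLP"]
print(tfidf("nlp", docs[0], docs))       # 0.0 -> common everywhere
print(tfidf("learning", docs[2], docs))  # ≈ 1.0986 -> rare and present
```

A word scores high only when it is frequent in its own document and rare across the collection.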


BoW vs TF-IDF (Quick Comparison)

Aspect            Bag of Words           TF-IDF
Word importance   Equal for all words    Weighted importance
Common words      High influence         Low influence
Rare words        Same weight            Higher weight
Use case          Baseline models        Better text classification
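The contrast in the table can be seen side by side with scikit-learn, since CountVectorizer (BoW) and TfidfVectorizer share the same vocabulary-building step:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]

# BoW produces integer counts; TF-IDF produces float weights
# over the same vocabulary, with common words downweighted.
bow = CountVectorizer().fit_transform(sentences)
tfidf = TfidfVectorizer().fit_transform(sentences)

print(bow.toarray())    # raw counts
print(tfidf.toarray())  # weighted scores
```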

TF-IDF Using Python (TfidfVectorizer)

You can run this code using:

  • Google Colab (recommended)
  • Jupyter Notebook
  • VS Code with Python

We will:

  • Convert text into TF-IDF vectors
  • Observe how weights differ from BoW

Python Example: TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "I love NLP",
    "NLP is powerful",
    "I love learning NLP"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nTF-IDF Matrix:")
print(X.toarray())

Output (values rounded to three decimals):

Vocabulary:
['is' 'learning' 'love' 'nlp' 'powerful']

TF-IDF Matrix:
[[0.    0.    0.79  0.613 0.   ]
 [0.652 0.    0.    0.385 0.652]
 [0.    0.72  0.548 0.425 0.   ]]

(Note: the single-character token “I” is dropped by the default tokenizer.)

How to Understand This Output

Unlike BoW, values are no longer simple counts.

  • Each number is a weight
  • Higher value = more important word
  • Common words receive lower weights

Notice:

  • “nlp” appears in every document → its IDF is low, so its weight is held down
  • “learning” appears in only one document → its IDF is high, giving it a larger relative weight

This is why TF-IDF often performs better than BoW on classification and retrieval tasks.
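To see exactly why common words are held down, you can inspect the fitted vectorizer's learned IDF values. idf_ and vocabulary_ are real scikit-learn attributes; by default sklearn uses the smoothed formula ln((1 + N) / (1 + df)) + 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]

vectorizer = TfidfVectorizer()
vectorizer.fit(sentences)

# Print each vocabulary word with its learned IDF weight.
# Words in every document get the minimum IDF of 1.0.
for word, idx in sorted(vectorizer.vocabulary_.items()):
    print(word, round(float(vectorizer.idf_[idx]), 3))
```

Here “nlp” gets IDF 1.0 (it is in all three documents), while “is”, “learning”, and “powerful” get about 1.693.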


Where TF-IDF Is Used

  • Search engines
  • Document classification
  • Information retrieval
  • Spam detection
  • Resume screening systems

Limitations of TF-IDF

Even TF-IDF has limitations:

  • Does not capture word meaning
  • Does not understand context
  • Word order is still ignored

These problems are later solved using word embeddings and transformers.


Assignment / Homework

Where to practice:

  • Google Colab
  • Jupyter Notebook

Tasks:

  • Apply TF-IDF on 10 sentences
  • Compare BoW vs TF-IDF weights
  • Remove stopwords and observe changes
  • Try different n-gram ranges (the ngram_range parameter)

Practice Questions

Q1. What problem does TF-IDF solve?

It reduces the importance of common words and increases the importance of rare words.

Q2. What does IDF measure?

How rare a word is across all documents.

Q3. Is TF-IDF better than BoW?

Usually, yes. It assigns importance weights instead of raw counts, which helps most classification and retrieval tasks.

Quick Quiz

Q1. Does TF-IDF understand word meaning?

No, it only uses statistical frequency.

Q2. Which sklearn class is used for TF-IDF?

TfidfVectorizer.

Quick Recap

  • TF-IDF improves Bag of Words
  • TF measures importance within a document
  • IDF penalizes common words
  • Widely used in search and classification
  • Foundation for embeddings and deep NLP