Bag of Words (BoW) and TF-IDF
So far, we have learned how to clean text using tokenization, stopword removal, stemming, and lemmatization. However, machine learning models cannot directly understand words or sentences; they only understand numbers. This lesson explains how text is converted into numerical form using Bag of Words and TF-IDF.
These techniques are foundational in NLP and are still widely used in search engines, spam filters, and text classification systems.
Real-World Connection
When Gmail decides whether an email is spam, it does not read the message like a human. Instead, it converts the email text into numbers and checks patterns learned from previous spam emails. Bag of Words and TF-IDF are classic techniques for exactly this kind of text-to-numbers transformation.
What Is Bag of Words?
Bag of Words is a simple technique that represents text by counting how often each word appears. Grammar and word order are ignored. Only word frequency matters.
- Each document becomes a vector of word counts
- Vocabulary is built from all documents
- Word order is ignored
Bag of Words Example
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "AI is transforming technology",
    "Technology is evolving with AI",
    "AI and data drive innovation"
]

# Build the vocabulary and count how often each word appears per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # vocabulary (column order)
print(bow_matrix.toarray())                # one row of counts per document
Understanding the Output
Each row represents a document. Each column represents a word from the vocabulary. The numbers indicate how many times a word appears in a document. This numerical matrix is what machine learning models use as input.
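To make the matrix easier to read, you can label the rows and columns with the vocabulary. Here is a minimal sketch, assuming pandas is installed (it is used only for display and is not required by scikit-learn):

import pandas as pd  # assumption: pandas is available in your environment

# Rows = documents, columns = vocabulary words, values = word counts
df = pd.DataFrame(bow_matrix.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df)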
Limitations of Bag of Words
While simple and fast, Bag of Words has limitations (the snippet below illustrates the first one):
- Common words such as "is" dominate the representation
- Rare but informative words do not stand out
- Every occurrence counts equally, so word importance is not captured
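A minimal sketch of the first limitation, reusing bow_matrix and vectorizer from the example above:

# Total count per word across the corpus: frequent filler words dominate
totals = bow_matrix.toarray().sum(axis=0)
for word, count in zip(vectorizer.get_feature_names_out(), totals):
    print(word, count)

Running this on our three documents shows "ai", "is", and "technology" at the top, even though they tell us little about what makes any single document distinctive.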
To solve this, we use TF-IDF.
What Is TF-IDF?
TF-IDF stands for Term Frequency – Inverse Document Frequency. It assigns higher weight to words that are important in a document but not common across all documents.
- TF measures how often a word appears in a document
- IDF down-weights words that appear in many documents
- Together they highlight terms that are distinctive for a document (see the arithmetic sketch below)
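For intuition, here is the underlying arithmetic. Note that scikit-learn's TfidfVectorizer uses a smoothed variant of IDF by default and then L2-normalizes each row, so its exact numbers differ slightly from the textbook formula; the sketch below uses the smoothed variant:

import math

# TF-IDF weight: tf(t, d) * idf(t)
# scikit-learn's default (smooth_idf=True) computes:
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
# where n = number of documents and df(t) = documents containing t.
n = 3                 # the corpus below has three documents
df_ai = 3             # "ai" appears in every document
df_innovation = 1     # "innovation" appears in only one document

idf_ai = math.log((1 + n) / (1 + df_ai)) + 1
idf_innovation = math.log((1 + n) / (1 + df_innovation)) + 1

print(idf_ai)          # 1.0   -> no boost for a ubiquitous word
print(idf_innovation)  # ~1.69 -> rare word is weighted up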
TF-IDF Example
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "AI is transforming technology",
    "Technology is evolving with AI",
    "AI and data drive innovation"
]

# Compute TF-IDF weights instead of raw counts
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print(tfidf.get_feature_names_out())  # vocabulary (column order)
print(tfidf_matrix.toarray())         # one row of weights per document
Why TF-IDF Is Better
TF-IDF reduces the impact of very common words like “is” and increases the importance of distinctive words like “innovation” or “transforming”. This typically improves model performance on classification and search tasks.
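You can verify this directly by inspecting the idf_ attribute that TfidfVectorizer learns during fitting (continuing from the example above):

# Learned IDF weights: lower for ubiquitous words, higher for rare ones
for word, weight in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{word}: {weight:.2f}")

Words such as "ai" that appear in every document get the minimum weight of 1.0, while words such as "innovation" that appear in only one document get a noticeably higher weight.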
BoW vs TF-IDF
- BoW uses raw word counts
- TF-IDF uses weighted importance
- BoW is simpler and faster
- TF-IDF gives a stronger signal about which words matter in each document
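A quick side-by-side sketch, reusing vectorizer, bow_matrix, and tfidf_matrix from the examples above (both vectorizers build the same alphabetical vocabulary for this corpus):

# The same document in both representations: raw counts vs. TF-IDF weights
words = vectorizer.get_feature_names_out()
print(list(zip(words, bow_matrix.toarray()[0])))              # counts
print(list(zip(words, tfidf_matrix.toarray()[0].round(2))))   # weights

In the count vector, "is" and "transforming" look equally important; in the TF-IDF vector, "transforming" receives the larger weight.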
Practice Questions
Practice 1: Which technique represents text using word counts?
Practice 2: Which technique reduces the importance of common words?
Practice 3: Why is text converted into numbers?
Quick Quiz
Quiz 1: What does Bag of Words focus on?
Quiz 2: What does IDF stand for?
Quiz 3: Which technique is better for highlighting important words?
Coming up next: Word Embeddings — representing words using meaning and context instead of counts.