AI Lesson 65: Bag of Words (BoW) and TF-IDF

So far, we have learned how to clean text using tokenization, stopword removal, stemming, and lemmatization. However, machine learning models cannot directly understand words or sentences; they only understand numbers. This lesson explains how text is converted into numerical form using Bag of Words and TF-IDF.

These techniques are foundational in NLP and are still widely used in search engines, spam filters, and text classification systems.

Real-World Connection

When Gmail decides whether an email is spam, it does not read the message like a human. Instead, it converts the email text into numbers and checks patterns learned from previous spam emails. Bag of Words and TF-IDF play a key role in this transformation.

What Is Bag of Words?

Bag of Words is a simple technique that represents text by counting how often each word appears. Grammar and word order are ignored. Only word frequency matters.

  • Each document becomes a vector of word counts
  • Vocabulary is built from all documents
  • Word order is ignored

Bag of Words Example


from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "AI is transforming technology",
    "Technology is evolving with AI",
    "AI and data drive innovation"
]

# Learn the vocabulary and count each word's occurrences per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
  
['ai' 'and' 'data' 'drive' 'evolving' 'innovation' 'is' 'technology' 'transforming' 'with']
[[1 0 0 0 0 0 1 1 1 0]
 [1 0 0 0 1 0 1 1 0 1]
 [1 1 1 1 0 1 0 0 0 0]]

Understanding the Output

Each row represents a document. Each column represents a word from the vocabulary. The numbers indicate how many times a word appears in a document. This numerical matrix is what machine learning models use as input.
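
To make this mapping concrete, the short continuation below pairs each document with the words it actually contains. The loop is just one convenient way to inspect the matrix; all variable names carry over from the snippet above.

# Pair each document with its non-zero word counts
words = vectorizer.get_feature_names_out()
for doc, row in zip(documents, bow_matrix.toarray()):
    counts = {word: int(count) for word, count in zip(words, row) if count > 0}
    print(doc, "->", counts)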

Limitations of Bag of Words

While simple and fast, Bag of Words has limitations:

  • Common words dominate the representation
  • Rare but meaningful words receive no extra weight
  • There is no notion of how important a word is across the corpus

To address these limitations, we use TF-IDF.

What Is TF-IDF?

TF-IDF stands for Term Frequency – Inverse Document Frequency. It assigns higher weight to words that are important in a document but not common across all documents.

  • TF measures how often a word appears in a document
  • IDF lowers the weight of words that appear in many documents
  • Together, they highlight terms that characterize a specific document
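
Before looking at library output, it helps to see the arithmetic once. The sketch below computes IDF by hand using scikit-learn's default smoothed formula; the document frequencies come from the three-sentence corpus used in this lesson's examples.

import math

# scikit-learn's default smoothed IDF:
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
# where n is the number of documents and df(t) counts documents containing t
n = 3  # three documents in this lesson's corpus
document_frequency = {"ai": 3, "is": 2, "transforming": 1}

for term, df in document_frequency.items():
    idf = math.log((1 + n) / (1 + df)) + 1
    print(term, round(idf, 2))  # ai 1.0, is 1.29, transforming 1.69

TfidfVectorizer multiplies each word's count by its IDF and then normalizes every document row to unit length, so the weights printed in the next example are scaled versions of these raw values.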

TF-IDF Example


from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "AI is transforming technology",
    "Technology is evolving with AI",
    "AI and data drive innovation"
]

# Learn the vocabulary and compute normalized TF-IDF weights per document
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())
  
['ai' 'and' 'data' 'drive' 'evolving' 'innovation' 'is' 'technology' 'transforming' 'with']
[[0.37 0.   0.   0.   0.   0.   0.48 0.48 0.63 0.  ]
 [0.32 0.   0.   0.   0.53 0.   0.41 0.41 0.   0.53]
 [0.28 0.48 0.48 0.48 0.   0.48 0.   0.   0.   0.  ]]

(weights rounded to two decimal places)

Why TF-IDF Is Better

TF-IDF reduces the impact of very common words like “is” and increases the importance of words like “innovation” or “transforming”. This results in better model performance for classification and search tasks.
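
You can verify this directly on the matrix from the example above. The vocabulary_ attribute maps each word to its column index; variable names continue from that snippet, and the approximate values assume scikit-learn's default settings.

# Compare a common word and a rare word in the first document
row0 = tfidf_matrix.toarray()[0]  # "AI is transforming technology"
print(row0[tfidf.vocabulary_["is"]])            # lower weight (~0.48): "is" is in two documents
print(row0[tfidf.vocabulary_["transforming"]])  # higher weight (~0.63): in only one document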

BoW vs TF-IDF

  • BoW uses raw word counts
  • TF-IDF uses weighted importance
  • BoW is simpler and faster
  • TF-IDF usually provides a stronger signal for classification and search
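
To show where either vectorizer fits into a complete system like the spam filter mentioned earlier, here is a minimal sketch of a classification pipeline. The training texts and labels are made up purely for illustration, and Multinomial Naive Bayes is just one reasonable classifier choice, not the only option.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, purely for illustration
texts = [
    "win a free prize now",
    "claim your free reward",
    "meeting at noon tomorrow",
    "project update attached",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize inside"]))  # expected: [1], a spam-like message

Swapping TfidfVectorizer() for CountVectorizer() in the pipeline is all it takes to compare both representations on the same task.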

Practice Questions

Practice 1: Which technique represents text using word counts?

Practice 2: Which technique reduces the importance of common words?

Practice 3: Why is text converted into numbers?

Quick Quiz

Quiz 1: What does Bag of Words focus on?

Quiz 2: What does IDF stand for?

Quiz 3: Which technique is better for highlighting important words?

Coming up next: Word Embeddings — representing words using meaning and context instead of counts.