AI Lesson 65: Bag of Words (BoW) and TF-IDF

So far, we have learned how to clean text using tokenization, stopword removal, stemming, and lemmatization. However, machine learning models cannot directly understand words or sentences; they only understand numbers. This lesson explains how text is converted into numerical form using Bag of Words and TF-IDF.

These techniques are foundational in NLP and are still widely used in search engines, spam filters, and text classification systems.

Real-World Connection

When Gmail decides whether an email is spam, it does not read the message like a human. Instead, it converts the email text into numbers and checks patterns learned from previous spam emails. Bag of Words and TF-IDF play a key role in this transformation.

What Is Bag of Words?

Bag of Words is a simple technique that represents text by counting how often each word appears. Grammar and word order are ignored. Only word frequency matters.

  • Each document becomes a vector of word counts
  • Vocabulary is built from all documents
  • Word order is ignored

Bag of Words Example


from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "AI is transforming technology",
    "Technology is evolving with AI",
    "AI and data drive innovation"
]

# Learn the vocabulary and count each word's occurrences per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
  
['ai' 'and' 'data' 'drive' 'evolving' 'innovation' 'is' 'technology' 'transforming' 'with']
[[1 0 0 0 0 0 1 1 1 0]
 [1 0 0 0 1 0 1 1 0 1]
 [1 1 1 1 0 1 0 0 0 0]]

Understanding the Output

Each row represents a document. Each column represents a word from the vocabulary. The numbers indicate how many times a word appears in a document. This numerical matrix is what machine learning models use as input.
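
To make this mapping concrete, the short continuation below pairs each document with the words it actually contains. The loop is just one convenient way to inspect the matrix; all variable names carry over from the snippet above.

# Pair each document with its non-zero word counts
words = vectorizer.get_feature_names_out()
for doc, row in zip(documents, bow_matrix.toarray()):
    counts = {word: int(count) for word, count in zip(words, row) if count > 0}
    print(doc, "->", counts)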

Limitations of Bag of Words

While simple and fast, Bag of Words has limitations:

  • Common words dominate the representation
  • Rare but meaningful words receive no extra weight
  • There is no notion of how important a word is across the corpus

To address these limitations, we use TF-IDF.

What Is TF-IDF?

TF-IDF stands for Term Frequency – Inverse Document Frequency. It assigns higher weight to words that are important in a document but not common across all documents.

  • TF measures how often a word appears in a document
  • IDF lowers the weight of words that appear in many documents
  • Together, they highlight terms that characterize a specific document
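
Before looking at library output, it helps to see the arithmetic once. The sketch below computes IDF by hand using scikit-learn's default smoothed formula; the document frequencies come from the three-sentence corpus used in this lesson's examples.

import math

# scikit-learn's default smoothed IDF:
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
# where n is the number of documents and df(t) counts documents containing t
n = 3  # three documents in this lesson's corpus
document_frequency = {"ai": 3, "is": 2, "transforming": 1}

for term, df in document_frequency.items():
    idf = math.log((1 + n) / (1 + df)) + 1
    print(term, round(idf, 2))  # ai 1.0, is 1.29, transforming 1.69

TfidfVectorizer multiplies each word's count by its IDF and then normalizes every document row to unit length, so the weights printed in the next example are scaled versions of these raw values.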

TF-IDF Example


from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "AI is transforming technology",
    "Technology is evolving with AI",
    "AI and data drive innovation"
]

# Learn the vocabulary and compute normalized TF-IDF weights per document
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())
  
['ai' 'and' 'data' 'drive' 'evolving' 'innovation' 'is' 'technology' 'transforming' 'with']
[[0.37 0.   0.   0.   0.   0.   0.48 0.48 0.63 0.  ]
 [0.32 0.   0.   0.   0.53 0.   0.41 0.41 0.   0.53]
 [0.28 0.48 0.48 0.48 0.   0.48 0.   0.   0.   0.  ]]

(weights rounded to two decimal places)

Why TF-IDF Is Better

TF-IDF reduces the impact of very common words like “is” and increases the importance of words like “innovation” or “transforming”. This results in better model performance for classification and search tasks.
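
You can verify this directly on the matrix from the example above. The vocabulary_ attribute maps each word to its column index; variable names continue from that snippet, and the approximate values assume scikit-learn's default settings.

# Compare a common word and a rare word in the first document
row0 = tfidf_matrix.toarray()[0]  # "AI is transforming technology"
print(row0[tfidf.vocabulary_["is"]])            # lower weight (~0.48): "is" is in two documents
print(row0[tfidf.vocabulary_["transforming"]])  # higher weight (~0.63): in only one document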

BoW vs TF-IDF

  • BoW uses raw word counts
  • TF-IDF uses weighted importance
  • BoW is simpler and faster
  • TF-IDF usually provides a stronger signal for classification and search
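
To show where either vectorizer fits into a complete system like the spam filter mentioned earlier, here is a minimal sketch of a classification pipeline. The training texts and labels are made up purely for illustration, and Multinomial Naive Bayes is just one reasonable classifier choice, not the only option.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, purely for illustration
texts = [
    "win a free prize now",
    "claim your free reward",
    "meeting at noon tomorrow",
    "project update attached",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize inside"]))  # expected: [1], a spam-like message

Swapping TfidfVectorizer() for CountVectorizer() in the pipeline is all it takes to compare both representations on the same task.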

Practice Questions

Practice 1: Which technique represents text using word counts?

Practice 2: Which technique reduces the importance of common words?

Practice 3: Why is text converted into numbers?

Quick Quiz

Quiz 1: What does Bag of Words focus on?

Quiz 2: What does IDF stand for?

Quiz 3: Which technique is better for highlighting important words?

Coming up next: Word Embeddings — representing words using meaning and context instead of counts.