NLP Lesson 9 – Bag of Words | Dataplexa

Bag of Words (BoW)

After learning text cleaning, stopwords, stemming, and lemmatization, we now reach a very important turning point in NLP.

So far, we prepared text. Now we answer the big question:

How does a machine understand text?

The answer is simple but powerful: machines do not understand words — they understand numbers.

The Bag of Words (BoW) model is the first and most fundamental method that converts text into numbers.


What Is Bag of Words?

Bag of Words is a text representation technique that converts text into numerical vectors based on word frequency.

Important idea:

  • Text is treated as a collection (bag) of words
  • Grammar and word order are ignored
  • Only word occurrence or frequency matters

That is why it is called a "bag" — order does not matter.


Why Do We Need Bag of Words?

Machine learning algorithms cannot work directly with raw text.

BoW helps by:

  • Converting text into numbers
  • Creating fixed-length vectors
  • Making text usable for ML models

Once text becomes numbers, we can apply:

  • Naive Bayes
  • Logistic Regression
  • SVM
  • Any ML algorithm
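To make this concrete, here is a minimal, hedged sketch of one of the algorithms above (Naive Bayes) trained on BoW vectors. The tiny spam/not-spam dataset below is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical tiny dataset: 1 = spam, 0 = not spam
texts = [
    "win a free prize now",
    "claim your free reward",
    "meeting at noon tomorrow",
    "lunch with the team today",
]
labels = [1, 1, 0, 0]

# Step 1: convert text into BoW count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: train any ML model on the numeric matrix
model = MultinomialNB()
model.fit(X, labels)

# New text must be transformed with the SAME fitted vectorizer
new = vectorizer.transform(["free prize inside"])
print(model.predict(new))  # spammy words -> predicts class 1
```

The key point is that the classifier never sees words, only the count matrix `X`.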

Core Idea of Bag of Words

BoW works in three main steps:

  1. Create a vocabulary (unique words)
  2. Count how many times each word appears
  3. Represent each document as a numeric vector

Let us understand this with a simple example.


Example Sentences

Consider the following sentences:

  • Sentence 1: I love NLP
  • Sentence 2: NLP is powerful
  • Sentence 3: I love learning NLP

Step 1: Build Vocabulary

We collect all unique words from all sentences (lowercased).

Vocabulary:

  • i
  • love
  • nlp
  • is
  • powerful
  • learning

Each word becomes one column in the vector.


Step 2: Convert Sentences to Vectors

Each sentence is converted into numbers based on how many times each vocabulary word appears.

Sentence              i   love  nlp  is  powerful  learning
I love NLP            1   1     1    0   0         0
NLP is powerful       0   0     1    1   1         0
I love learning NLP   1   1     1    0   0         1

This numeric matrix is the Bag of Words representation.
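The three steps above can be sketched in a few lines of plain Python, with no libraries at all (this is an illustrative from-scratch version, not how you would do it in practice):

```python
sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]

# Step 1: build the vocabulary of unique lowercase words
vocabulary = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 2 and 3: count each vocabulary word per sentence -> numeric vector
vectors = []
for sentence in sentences:
    words = sentence.lower().split()
    vectors.append([words.count(word) for word in vocabulary])

print(vocabulary)  # ['i', 'love', 'nlp', 'is', 'powerful', 'learning']
print(vectors)     # [[1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 0], [1, 1, 1, 0, 0, 1]]
```

The printed vectors match the table above row by row.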


Important Observations

  • Word order is ignored
  • Only frequency matters
  • Vector length = vocabulary size

This simplicity is both BoW’s strength and weakness.


Bag of Words Using Python (CountVectorizer)

You can run this code using:

  • Google Colab (recommended)
  • Jupyter Notebook
  • VS Code with Python

Python Example: Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is powerful",
    "I love learning NLP"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nBoW Matrix:")
print(X.toarray())

Output:
Vocabulary:
['is' 'learning' 'love' 'nlp' 'powerful']

BoW Matrix:
[[0 0 1 1 0]
 [1 0 0 1 1]
 [0 1 1 1 0]]

How to Understand This Output

Each row represents a sentence. Each column represents a word.

For example:

  • Sentence 1 ("I love NLP") contains the words love and nlp
  • So their columns contain 1
  • Other words are 0

Notice that "i" does not appear in sklearn's vocabulary, unlike in our manual example. By default, CountVectorizer only keeps tokens with two or more characters, so single-letter words like "i" are dropped.

This numeric form is what ML models actually learn from.
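One more useful detail: once the vectorizer is fitted, any new sentence can be mapped into the same vector space, and words outside the vocabulary are simply ignored. A short sketch (the new sentence here is a made-up example):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]

vectorizer = CountVectorizer()
vectorizer.fit(sentences)  # vocabulary: ['is', 'learning', 'love', 'nlp', 'powerful']

# "deep" is not in the vocabulary, so it contributes nothing
new_vec = vectorizer.transform(["I love deep NLP"])
print(new_vec.toarray())  # [[0 0 1 1 0]]
```

This is why you must always reuse the fitted vectorizer at prediction time instead of fitting a new one.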


Advantages of Bag of Words

  • Very simple to understand
  • Easy to implement
  • Works well for small datasets
  • Strong baseline for text classification

Limitations of Bag of Words

  • Ignores word order
  • Ignores meaning and context
  • Vocabulary can become very large
  • Does not handle semantics

Because of these limitations, BoW is later improved using TF-IDF and embeddings.


Where Bag of Words Is Used

  • Spam detection
  • Sentiment analysis (basic)
  • Document classification
  • Text clustering

Even today, BoW is widely used as a baseline model.


Assignment / Homework

Where to practice:

  • Google Colab
  • Jupyter Notebook

Tasks:

  • Create BoW vectors for 5 sentences
  • Apply stopword removal before BoW
  • Compare BoW before and after lemmatization
  • Note vocabulary size changes

Practice Questions

Q1. What does Bag of Words ignore?

Word order and grammar.

Q2. What does each column represent in BoW?

A unique word from the vocabulary.

Q3. Why is BoW suitable for ML models?

Because it converts text into numeric vectors.

Quick Quiz

Q1. Does Bag of Words capture meaning?

No, it only captures word frequency.

Q2. Which sklearn class is used for BoW?

CountVectorizer.

Quick Recap

  • Bag of Words converts text into numbers
  • It uses word frequency
  • Order and meaning are ignored
  • Simple but powerful baseline technique
  • Foundation for TF-IDF and embeddings