Bag of Words (BoW)
After learning text cleaning, stopwords, stemming, and lemmatization, we now reach a very important turning point in NLP.
So far, we prepared text. Now we answer the big question:
How does a machine understand text?
The answer is simple but powerful: machines do not understand words — they understand numbers.
The Bag of Words (BoW) model is the first and most fundamental method that converts text into numbers.
What Is Bag of Words?
Bag of Words is a text representation technique that converts text into numerical vectors based on word frequency.
Important idea:
- Text is treated as a collection (bag) of words
- Grammar and word order are ignored
- Only word occurrence or frequency matters
That is why it is called a "bag" — order does not matter.
Why Do We Need Bag of Words?
Machine learning algorithms cannot work directly with raw text.
BoW helps by:
- Converting text into numbers
- Creating fixed-length vectors
- Making text usable for ML models
Once text becomes numbers, we can apply:
- Naive Bayes
- Logistic Regression
- SVM
- Any ML algorithm
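As a quick sketch of this pipeline, here is a toy spam classifier that feeds BoW vectors into Naive Bayes. The messages and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy dataset: two "spam" and two "ham" messages (invented for this sketch)
texts = [
    "win a free prize now",
    "free money win now",
    "see you at the meeting",
    "lunch at noon tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1: turn text into BoW count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: train any ML model on the numeric vectors
model = MultinomialNB()
model.fit(X, labels)

# Step 3: classify a new message
test = vectorizer.transform(["free prize money"])
print(model.predict(test))  # ['spam']
```

Note that new text must go through `transform` (not `fit_transform`) so it is mapped onto the vocabulary learned from the training data.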
Core Idea of Bag of Words
BoW works in three main steps:
- Create a vocabulary (unique words)
- Count how many times each word appears
- Represent each document as a numeric vector
Let us understand this with a simple example.
Example Sentences
Consider the following sentences:
- Sentence 1: I love NLP
- Sentence 2: NLP is powerful
- Sentence 3: I love learning NLP
Step 1: Build Vocabulary
We collect all unique words from all sentences.
Vocabulary:
- i
- love
- nlp
- is
- powerful
- learning
Each word becomes one column in the vector.
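Building this vocabulary takes only a few lines of plain Python. This sketch lowercases each sentence and splits on whitespace:

```python
sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]

# Lowercase, split on whitespace, and collect the unique words
vocabulary = sorted({word for s in sentences for word in s.lower().split()})

print(vocabulary)
# ['i', 'is', 'learning', 'love', 'nlp', 'powerful']
```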
Step 2: Convert Sentences to Vectors
Each sentence is converted into numbers based on how many times each vocabulary word appears.
| Sentence | i | love | nlp | is | powerful | learning |
|---|---|---|---|---|---|---|
| I love NLP | 1 | 1 | 1 | 0 | 0 | 0 |
| NLP is powerful | 0 | 0 | 1 | 1 | 1 | 0 |
| I love learning NLP | 1 | 1 | 1 | 0 | 0 | 1 |
This numeric matrix is the Bag of Words representation.
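You can reproduce this matrix by hand in plain Python by counting each vocabulary word in each sentence:

```python
sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]
vocabulary = ["i", "love", "nlp", "is", "powerful", "learning"]

# One count vector per sentence: how often does each vocabulary word occur?
vectors = []
for s in sentences:
    words = s.lower().split()
    vectors.append([words.count(v) for v in vocabulary])

for v in vectors:
    print(v)
# [1, 1, 1, 0, 0, 0]
# [0, 0, 1, 1, 1, 0]
# [1, 1, 1, 0, 0, 1]
```

Each row matches one row of the table above.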
Important Observations
- Word order is ignored
- Only frequency matters
- Vector length = vocabulary size
This simplicity is both BoW’s strength and weakness.
Bag of Words Using Python (CountVectorizer)
You can run this code using:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is powerful",
    "I love learning NLP"
]

# Build the vocabulary and count word occurrences in one step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nBoW Matrix:")
print(X.toarray())
```
Output:

```
Vocabulary:
['is' 'learning' 'love' 'nlp' 'powerful']

BoW Matrix:
[[0 0 1 1 0]
 [1 0 0 1 1]
 [0 1 1 1 0]]
```
How to Understand This Output
Each row represents a sentence. Each column represents a word.
For example:
- Sentence 1 ("I love NLP") contains love and nlp, so those columns contain 1
- All other columns are 0
Notice that the word "i" is missing here, even though it appeared in our manual vocabulary: CountVectorizer's default tokenizer keeps only words with two or more characters, so single-letter tokens are dropped.
This numeric form is what ML models actually learn from.
Advantages of Bag of Words
- Very simple to understand
- Easy to implement
- Works well for small datasets
- Strong baseline for text classification
Limitations of Bag of Words
- Ignores word order
- Ignores meaning and context
- Vocabulary can become very large
- Does not handle semantics
Because of these limitations, BoW is typically improved upon with TF-IDF weighting and, later, word embeddings.
Where Bag of Words Is Used
- Spam detection
- Sentiment analysis (basic)
- Document classification
- Text clustering
Even today, BoW is widely used as a baseline model.
Assignment / Homework
Where to practice:
- Google Colab
- Jupyter Notebook
Tasks:
- Create BoW vectors for 5 sentences
- Apply stopword removal before BoW
- Compare BoW before and after lemmatization
- Note vocabulary size changes
Practice Questions
Q1. What does Bag of Words ignore?
Q2. What does each column represent in BoW?
Q3. Why is BoW suitable for ML models?
Quick Quiz
Q1. Does Bag of Words capture meaning?
Q2. Which sklearn class is used for BoW?
Quick Recap
- Bag of Words converts text into numbers
- It uses word frequency
- Order and meaning are ignored
- Simple but powerful baseline technique
- Foundation for TF-IDF and embeddings