Bag of Words (BoW)
After learning text cleaning, stopwords, stemming, and lemmatization, we now reach a very important turning point in NLP.
So far, we prepared text. Now we answer the big question:
How does a machine understand text?
The answer is simple but powerful: machines do not understand words — they understand numbers.
The Bag of Words (BoW) model is the first and most fundamental method that converts text into numbers.
What Is Bag of Words?
Bag of Words is a text representation technique that converts text into numerical vectors based on word frequency.
Important idea:
- Text is treated as a collection (bag) of words
- Grammar and word order are ignored
- Only word occurrence or frequency matters
That is why it is called a "bag" — order does not matter.
Why Do We Need Bag of Words?
Machine learning algorithms cannot work directly with raw text.
BoW helps by:
- Converting text into numbers
- Creating fixed-length vectors
- Making text usable for ML models
Once text becomes numbers, we can apply:
- Naive Bayes
- Logistic Regression
- SVM
- Any ML algorithm
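As a quick sketch of this pipeline, here is a toy spam classifier that feeds BoW vectors into Naive Bayes. The messages and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy dataset: two "spam" and two "ham" messages (invented for this sketch)
texts = [
    "win a free prize now",
    "free money win now",
    "see you at the meeting",
    "lunch at noon tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1: turn text into BoW count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: train any ML model on the numeric vectors
model = MultinomialNB()
model.fit(X, labels)

# Step 3: classify a new message
test = vectorizer.transform(["free prize money"])
print(model.predict(test))  # ['spam']
```

Note that new text must go through `transform` (not `fit_transform`) so it is mapped onto the vocabulary learned from the training data.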
Core Idea of Bag of Words
BoW works in three main steps:
- Create a vocabulary (unique words)
- Count how many times each word appears
- Represent each document as a numeric vector
Let us understand this with a simple example.
Example Sentences
Consider the following sentences:
- Sentence 1: I love NLP
- Sentence 2: NLP is powerful
- Sentence 3: I love learning NLP
Step 1: Build Vocabulary
We collect all unique words from all sentences.
Vocabulary:
- i
- love
- nlp
- is
- powerful
- learning
Each word becomes one column in the vector.
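Building this vocabulary takes only a few lines of plain Python. This sketch lowercases each sentence and splits on whitespace:

```python
sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]

# Lowercase, split on whitespace, and collect the unique words
vocabulary = sorted({word for s in sentences for word in s.lower().split()})

print(vocabulary)
# ['i', 'is', 'learning', 'love', 'nlp', 'powerful']
```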
Step 2: Convert Sentences to Vectors
Each sentence is converted into numbers based on how many times each vocabulary word appears.
| Sentence | i | love | nlp | is | powerful | learning |
|---|---|---|---|---|---|---|
| I love NLP | 1 | 1 | 1 | 0 | 0 | 0 |
| NLP is powerful | 0 | 0 | 1 | 1 | 1 | 0 |
| I love learning NLP | 1 | 1 | 1 | 0 | 0 | 1 |
This numeric matrix is the Bag of Words representation.
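You can reproduce this matrix by hand in plain Python by counting each vocabulary word in each sentence:

```python
sentences = ["I love NLP", "NLP is powerful", "I love learning NLP"]
vocabulary = ["i", "love", "nlp", "is", "powerful", "learning"]

# One count vector per sentence: how often does each vocabulary word occur?
vectors = []
for s in sentences:
    words = s.lower().split()
    vectors.append([words.count(v) for v in vocabulary])

for v in vectors:
    print(v)
# [1, 1, 1, 0, 0, 0]
# [0, 0, 1, 1, 1, 0]
# [1, 1, 1, 0, 0, 1]
```

Each row matches one row of the table above.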
Important Observations
- Word order is ignored
- Only frequency matters
- Vector length = vocabulary size
This simplicity is both BoW’s strength and weakness.
Bag of Words Using Python (CountVectorizer)
You can run this code using:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is powerful",
    "I love learning NLP"
]

# Build the vocabulary and count word occurrences in one step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nBoW Matrix:")
print(X.toarray())
```
Output:

```
Vocabulary:
['is' 'learning' 'love' 'nlp' 'powerful']

BoW Matrix:
[[0 0 1 1 0]
 [1 0 0 1 1]
 [0 1 1 1 0]]
```
How to Understand This Output
Each row represents a sentence. Each column represents a word.
For example:
- Sentence 1 ("I love NLP") contains love and nlp, so those columns contain 1
- All other columns are 0
Notice that the word "i" is missing here, even though it appeared in our manual vocabulary: CountVectorizer's default tokenizer keeps only words with two or more characters, so single-letter tokens are dropped.
This numeric form is what ML models actually learn from.
Advantages of Bag of Words
- Very simple to understand
- Easy to implement
- Works well for small datasets
- Strong baseline for text classification
Limitations of Bag of Words
- Ignores word order
- Ignores meaning and context
- Vocabulary can become very large
- Does not handle semantics
Because of these limitations, BoW is typically improved upon with TF-IDF weighting and, later, word embeddings.
Where Bag of Words Is Used
- Spam detection
- Sentiment analysis (basic)
- Document classification
- Text clustering
Even today, BoW is widely used as a baseline model.
Assignment / Homework
Where to practice:
- Google Colab
- Jupyter Notebook
Tasks:
- Create BoW vectors for 5 sentences
- Apply stopword removal before BoW
- Compare BoW before and after lemmatization
- Note vocabulary size changes
Practice Questions
Q1. What does Bag of Words ignore?
Q2. What does each column represent in BoW?
Q3. Why is BoW suitable for ML models?
Quick Quiz
Q1. Does Bag of Words capture meaning?
Q2. Which sklearn class is used for BoW?
Quick Recap
- Bag of Words converts text into numbers
- It uses word frequency
- Order and meaning are ignored
- Simple but powerful baseline technique
- Foundation for TF-IDF and embeddings