N-grams
So far, we have learned how text is converted into numbers using Bag of Words and TF-IDF.
But both share one major limitation:
They treat every word independently, ignoring word order.
This means the phrase:
- “not good”
is treated the same as:
- “good”
That is a serious problem in real NLP tasks. N-grams are introduced to solve this.
What Is an N-gram?
An N-gram is a sequence of N consecutive words from a given text.
Instead of looking at single words, N-grams capture word combinations.
Examples (for the sentence “NLP is powerful”):
- Unigram (1-gram): NLP, is, powerful
- Bigram (2-gram): NLP is, is powerful
- Trigram (3-gram): NLP is powerful
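Extracting n-grams is just a sliding window over the words. Here is a minimal pure-Python sketch (the helper name `make_ngrams` is ours, not a library function):

```python
def make_ngrams(text, n):
    """Return all sequences of n consecutive words in text (a sketch)."""
    words = text.split()
    # Slide a window of size n over the word list
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "NLP is powerful"
print(make_ngrams(sentence, 1))  # ['NLP', 'is', 'powerful']
print(make_ngrams(sentence, 2))  # ['NLP is', 'is powerful']
print(make_ngrams(sentence, 3))  # ['NLP is powerful']
```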
Why Do We Need N-grams?
Language meaning often depends on word order.
Consider these sentences:
- “This movie is good”
- “This movie is not good”
Using only unigrams:
- Both sentences contain the word “good”, so they look almost identical to the model
Using bigrams:
- “not good” becomes a feature of its own, separating the two sentences
So N-grams help models understand local context.
Types of N-grams
| N-gram Type | N Value | Example |
|---|---|---|
| Unigram | 1 | nlp |
| Bigram | 2 | machine learning |
| Trigram | 3 | natural language processing |
N-grams with CountVectorizer
Let us see how N-grams are created in practice.
We will use:
- CountVectorizer
- ngram_range = (1, 2)
This means:
- Include unigrams
- Include bigrams
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is very powerful",
    "I love learning NLP"
]

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("\nMatrix:")
print(X.toarray())
Output:
Vocabulary:
['is' 'is very' 'learning' 'learning nlp' 'love' 'love learning'
 'love nlp' 'nlp' 'nlp is' 'powerful' 'very' 'very powerful']

Matrix:
[[0 0 0 0 1 0 1 1 0 0 0 0]
 [1 1 0 0 0 0 0 1 1 1 1 1]
 [0 0 1 1 1 1 0 1 0 0 0 0]]

Note: the word “I” does not appear because CountVectorizer’s default token pattern keeps only words of two or more characters. The vocabulary is listed in alphabetical order.
How to Understand This Output
Now each column represents:
- A word (unigram)
- OR a word pair (bigram)
Example:
- “love nlp” records that “love” and “nlp” occurred together, in that order
- “learning nlp” distinguishes learning about NLP from the word “learning” on its own
This is much richer than plain Bag of Words.
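To see exactly which sentences contain a given bigram, you can look up its column with the fitted vectorizer's `vocabulary_` mapping. A sketch using the same three sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is very powerful",
    "I love learning NLP",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# vocabulary_ maps each n-gram string to its column index in the matrix
col = vectorizer.vocabulary_["love nlp"]
print(X.toarray()[:, col])  # prints [1 0 0]: only the first sentence has it
```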
N-grams with TF-IDF
N-grams can also be combined with TF-IDF to reduce noise from common phrases.
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the same sentences list from the CountVectorizer example
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())
When Should You Use N-grams?
- Sentiment analysis
- Spam detection
- Search engines
- Text classification
- Short-text problems
Especially useful when:
- Negation matters (not good, not bad)
- Phrase meaning matters
Limitations of N-grams
N-grams improve context, but they introduce new problems:
- The feature space grows very fast as N increases
- Memory usage increases accordingly
- They still cannot capture long-distance relationships between words
These limitations lead us toward word embeddings in the next lessons.
Assignment / Homework
Practice Environment:
- Google Colab
- Jupyter Notebook
Tasks:
- Create unigrams, bigrams, and trigrams
- Compare feature sizes
- Use TF-IDF with bigrams
- Test with a sentiment dataset
Practice Questions
Q1. What problem do N-grams solve?
Q2. What is a bigram?
Q3. What is the downside of large N?
Quick Quiz
Q1. Can N-grams capture meaning?
Q2. Which is better for context: unigram or bigram?
Quick Recap
- N-grams capture word order
- They improve BoW and TF-IDF
- Commonly used in sentiment analysis
- Feature size grows rapidly
- Foundation for embeddings