N-grams
So far, we have learned how text is converted into numbers using Bag of Words and TF-IDF.
But both share one major limitation:
They treat every word independently, ignoring word order.
This means the phrase:
- “not good”
is treated the same as:
- “good”
That is a serious problem in real NLP tasks. N-grams are introduced to solve this.
What Is an N-gram?
An N-gram is a sequence of N consecutive words from a given text.
Instead of looking at single words, N-grams capture word combinations.
Examples (for the sentence “NLP is powerful”):
- Unigram (1-gram): NLP, is, powerful
- Bigram (2-gram): NLP is, is powerful
- Trigram (3-gram): NLP is powerful
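Extracting n-grams is just a sliding window over the words. Here is a minimal pure-Python sketch (the helper name `make_ngrams` is ours, not a library function):

```python
def make_ngrams(text, n):
    """Return all sequences of n consecutive words in text (a sketch)."""
    words = text.split()
    # Slide a window of size n over the word list
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "NLP is powerful"
print(make_ngrams(sentence, 1))  # ['NLP', 'is', 'powerful']
print(make_ngrams(sentence, 2))  # ['NLP is', 'is powerful']
print(make_ngrams(sentence, 3))  # ['NLP is powerful']
```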
Why Do We Need N-grams?
Language meaning often depends on word order.
Consider these sentences:
- “This movie is good”
- “This movie is not good”
Using only unigrams:
- Both sentences contain the word “good”, so they look almost identical to the model
Using bigrams:
- “not good” becomes a feature of its own, separating the two sentences
So N-grams help models understand local context.
Types of N-grams
| N-gram Type | N Value | Example |
|---|---|---|
| Unigram | 1 | nlp |
| Bigram | 2 | machine learning |
| Trigram | 3 | natural language processing |
N-grams with CountVectorizer
Let us see how N-grams are created in practice.
We will use:
- CountVectorizer
- ngram_range = (1, 2)
This means:
- Include unigrams
- Include bigrams
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is very powerful",
    "I love learning NLP"
]

# ngram_range=(1, 2) extracts both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())
print("\nMatrix:")
print(X.toarray())
Output:
Vocabulary:
['is' 'is very' 'learning' 'learning nlp' 'love' 'love learning'
 'love nlp' 'nlp' 'nlp is' 'powerful' 'very' 'very powerful']

Matrix:
[[0 0 0 0 1 0 1 1 0 0 0 0]
 [1 1 0 0 0 0 0 1 1 1 1 1]
 [0 0 1 1 1 1 0 1 0 0 0 0]]

Note: the word “I” does not appear because CountVectorizer’s default token pattern keeps only words of two or more characters. The vocabulary is listed in alphabetical order.
How to Understand This Output
Now each column represents:
- A word (unigram)
- OR a word pair (bigram)
Example:
- “love nlp” records that “love” and “nlp” occurred together, in that order
- “learning nlp” distinguishes learning about NLP from the word “learning” on its own
This is much richer than plain Bag of Words.
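To see exactly which sentences contain a given bigram, you can look up its column with the fitted vectorizer's `vocabulary_` mapping. A sketch using the same three sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is very powerful",
    "I love learning NLP",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# vocabulary_ maps each n-gram string to its column index in the matrix
col = vectorizer.vocabulary_["love nlp"]
print(X.toarray()[:, col])  # prints [1 0 0]: only the first sentence has it
```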
N-grams with TF-IDF
N-grams can also be combined with TF-IDF to reduce noise from common phrases.
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the same sentences list from the CountVectorizer example
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())
When Should You Use N-grams?
- Sentiment analysis
- Spam detection
- Search engines
- Text classification
- Short-text problems
Especially useful when:
- Negation matters (not good, not bad)
- Phrase meaning matters
Limitations of N-grams
N-grams improve context, but they introduce new problems:
- The feature space grows very fast as N increases
- Memory usage increases accordingly
- They still cannot capture long-distance relationships between words
These limitations lead us toward word embeddings in the next lessons.
Assignment / Homework
Practice Environment:
- Google Colab
- Jupyter Notebook
Tasks:
- Create unigrams, bigrams, and trigrams
- Compare feature sizes
- Use TF-IDF with bigrams
- Test with a sentiment dataset
Practice Questions
Q1. What problem do N-grams solve?
Q2. What is a bigram?
Q3. What is the downside of large N?
Quick Quiz
Q1. Can N-grams capture meaning?
Q2. Which is better for context: unigram or bigram?
Quick Recap
- N-grams capture word order
- They improve BoW and TF-IDF
- Commonly used in sentiment analysis
- Feature size grows rapidly
- Foundation for embeddings