NLP Lesson 11 – N-grams | Dataplexa

N-grams

So far, we have learned how text is converted into numbers using Bag of Words and TF-IDF.

But both techniques share one major limitation:

They treat words independently.

This means the phrase:

  • “not good”

is treated the same as:

  • “good”

That is a serious problem in real NLP tasks. N-grams are introduced to solve this.


What Is an N-gram?

An N-gram is a sequence of N consecutive words from a given text.

Instead of looking at single words, N-grams capture word combinations.

Examples:

  • Unigram (1-gram): NLP, is, powerful
  • Bigram (2-gram): NLP is, is powerful
  • Trigram (3-gram): NLP is powerful
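The examples above can be generated with a few lines of plain Python. Here is a minimal sketch (the `ngrams` helper is hypothetical, written just for this illustration) that slices a token list into N-grams:

```python
def ngrams(words, n):
    """Return all sequences of n consecutive words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "NLP is powerful".split()

print(ngrams(tokens, 1))  # unigrams: ['NLP', 'is', 'powerful']
print(ngrams(tokens, 2))  # bigrams:  ['NLP is', 'is powerful']
print(ngrams(tokens, 3))  # trigram:  ['NLP is powerful']
```

A sentence with T tokens yields T − n + 1 n-grams, which is why the trigram list above has only one entry.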

Why Do We Need N-grams?

Language meaning often depends on word order.

Consider these sentences:

  • “This movie is good”
  • “This movie is not good”

Using only unigrams:

  • Both sentences contain the word “good”

Using bigrams:

  • “not good” becomes a meaningful feature

So N-grams help models understand local context.


Types of N-grams

  N-gram Type   N Value   Example
  Unigram       1         nlp
  Bigram        2         machine learning
  Trigram       3         natural language processing

N-grams with CountVectorizer

Let us see how N-grams are created in practice.

We will use:

  • CountVectorizer
  • ngram_range = (1, 2)

This means:

  • Include unigrams
  • Include bigrams

Where to run this code:

  • Google Colab (recommended)
  • Jupyter Notebook
  • VS Code with Python
Python Example: Unigrams + Bigrams
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is very powerful",
    "I love learning NLP"
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print("Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nMatrix:")
print(X.toarray())

Output:

Vocabulary:
['is' 'is very' 'learning' 'learning nlp' 'love' 'love learning'
 'love nlp' 'nlp' 'nlp is' 'powerful' 'very' 'very powerful']

Matrix:
[[0 0 0 0 1 0 1 1 0 0 0 0]
 [1 1 0 0 0 0 0 1 1 1 1 1]
 [0 0 1 1 1 1 0 1 0 0 0 0]]

How to Understand This Output

Now each column represents:

  • A word (unigram)
  • OR a word pair (bigram)

Note that the word “I” does not appear in the vocabulary: CountVectorizer’s default token pattern keeps only tokens of two or more characters, so single-letter words are dropped.

Example:

  • “love nlp” captures the verb–object relationship as one feature
  • “learning nlp” ties the activity to its subject

This is much richer than plain Bag of Words.


N-grams with TF-IDF

N-grams can also be combined with TF-IDF to reduce noise from common phrases.

Python Example: TF-IDF with Bigrams
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the `sentences` list defined in the previous example
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())

When Should You Use N-grams?

  • Sentiment analysis
  • Spam detection
  • Search engines
  • Text classification
  • Short-text problems

Especially useful when:

  • Negation matters (not good, not bad)
  • Phrase meaning matters

Limitations of N-grams

N-grams improve context, but they introduce new problems:

  • The feature space grows very fast as N increases
  • Memory usage rises accordingly
  • They still miss long-distance dependencies (related words that sit far apart in a sentence)

These limitations lead us toward word embeddings in the next lessons.


Assignment / Homework

Practice Environment:

  • Google Colab
  • Jupyter Notebook

Tasks:

  • Create unigrams, bigrams, and trigrams
  • Compare feature sizes
  • Use TF-IDF with bigrams
  • Test with a sentiment dataset

Practice Questions

Q1. What problem do N-grams solve?

They capture word order and local context.

Q2. What is a bigram?

A sequence of two consecutive words.

Q3. What is the downside of large N?

Feature explosion and high memory usage.

Quick Quiz

Q1. Can N-grams capture meaning?

Partially, by capturing local word combinations.

Q2. Which is better for context: unigram or bigram?

Bigram.

Quick Recap

  • N-grams capture word order
  • They improve BoW and TF-IDF
  • Commonly used in sentiment analysis
  • Feature size grows rapidly
  • Foundation for embeddings