Word2Vec – CBOW and Skip-Gram
In the previous lesson, you learned what word embeddings are and why they are essential for modern NLP.
In this lesson, we study Word2Vec, the first widely successful algorithm that learns meaningful word embeddings directly from text.
You will understand:
- What Word2Vec is
- How it learns word meaning
- CBOW vs Skip-Gram
- When to use each model
What Is Word2Vec?
Word2Vec is a technique that learns dense vector representations of words by training a shallow neural network on large text data.
Instead of counting words (like Bag of Words), Word2Vec learns word meaning based on context.
Important point:
Word2Vec does NOT understand language. It learns statistical patterns from word usage.
The Key Idea Behind Word2Vec
Word2Vec is built on one simple idea:
Words that appear in similar contexts should have similar vectors.
Example:
- "I love deep learning"
- "I love machine learning"
Here, deep and machine appear in similar positions and contexts. Word2Vec learns this relationship automatically.
How Word2Vec Learns (High-Level View)
Word2Vec converts text into a learning problem.
The model:
- Takes words as input
- Tries to predict nearby words
- Adjusts vector values to reduce prediction error
After training:
- Hidden layer weights become word embeddings
- Similar words end up with similar vectors
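The loop described above can be sketched in plain NumPy. This is a simplified, illustrative Skip-Gram trainer with a full softmax (real Word2Vec uses tricks like negative sampling for speed); the corpus, dimension, learning rate, and epoch count are made up for the demo:

```python
import numpy as np

np.random.seed(0)
corpus = [["i", "love", "nlp"], ["i", "love", "machine", "learning"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                     # vocabulary size, embedding dimension

# Build (target, context) training pairs with a window of 1
pairs = []
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                pairs.append((idx[w], idx[sent[j]]))

W_in = 0.1 * np.random.randn(V, D)       # hidden-layer weights: the embeddings
W_out = 0.1 * np.random.randn(D, V)      # output-layer weights

def avg_loss():
    total = 0.0
    for t, c in pairs:
        scores = W_in[t] @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()
        total -= np.log(p[c])
    return total / len(pairs)

loss_before = avg_loss()
for epoch in range(100):
    for t, c in pairs:
        h = W_in[t]                      # embedding lookup (hidden layer)
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()                     # softmax over the vocabulary
        grad = p.copy()
        grad[c] -= 1.0                   # gradient of -log p(context | target)
        grad_h = W_out @ grad
        W_out -= 0.05 * np.outer(h, grad)
        W_in[t] -= 0.05 * grad_h         # adjust vectors to reduce error
loss_after = avg_loss()
print(loss_before, loss_after)
```

After training, the rows of W_in are the learned word vectors; the prediction loss drops as the vectors adjust.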
Two Architectures of Word2Vec
Word2Vec has two main training architectures:
- CBOW (Continuous Bag of Words)
- Skip-Gram
Both learn embeddings, but they frame the prediction task in opposite directions.
CBOW (Continuous Bag of Words)
CBOW predicts the target word using its context words.
Example sentence:
"I love natural language processing"
If the target word is natural, CBOW uses the surrounding context words (here, with a window of 1):
- love
- language
to predict natural.
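The context-target pairs that CBOW trains on can be generated with a few lines of Python. This is an illustrative sketch (the helper name and the window size of 1 are choices for the demo, not part of any library):

```python
sentence = ["i", "love", "natural", "language", "processing"]

def cbow_pairs(tokens, window=1):
    # For each position, collect the surrounding words as context
    # and pair them with the word at that position (the target).
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs(sentence):
    print(context, "->", target)
```

For the word natural, this produces the pair from the example above: the context ["love", "language"] predicts the target "natural".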
CBOW – Key Characteristics
- Faster training
- Works well with large datasets
- Better for frequent words
- Slightly less accurate for rare words
CBOW is commonly used when speed matters.
Skip-Gram
Skip-Gram does the opposite of CBOW.
It uses a target word to predict its context words.
Example:
Target word: natural
Predict:
- love
- language
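Skip-Gram's training pairs are the mirror image: one (target, context) pair per neighbor. A sketch under the same assumptions as before (illustrative helper, window of 1):

```python
sentence = ["i", "love", "natural", "language", "processing"]

def skipgram_pairs(tokens, window=1):
    # For each target word, emit one (target, context) pair
    # per neighboring word inside the window.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

for target, context in skipgram_pairs(sentence):
    print(target, "->", context)
```

Note that one target word now yields several training examples, one per context word, which is part of why Skip-Gram is slower but learns more from rare words.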
Skip-Gram – Key Characteristics
- Slower than CBOW
- Works very well for rare words
- Produces higher-quality embeddings
- Preferred for semantic accuracy
Most research-grade embeddings use Skip-Gram.
CBOW vs Skip-Gram (Comparison)
| Aspect | CBOW | Skip-Gram |
|---|---|---|
| Prediction direction | Context → Target | Target → Context |
| Training speed | Faster | Slower |
| Rare words | Less effective | Very effective |
| Embedding quality | Good | Better |
| Used when | Large data, speed | Accuracy matters |
Neural Network Structure (Conceptual)
Word2Vec uses a very simple neural network:
- Input layer (one-hot encoded word)
- Hidden layer (embedding layer)
- Output layer (softmax prediction)
The hidden layer weights are what we finally use as word embeddings.
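The "hidden layer weights become embeddings" point can be seen directly: multiplying a one-hot vector by the weight matrix simply selects one row of that matrix. A minimal illustration with made-up numbers:

```python
import numpy as np

V, D = 5, 3                      # vocabulary size, embedding dimension
W = np.arange(V * D, dtype=float).reshape(V, D)  # hidden-layer weight matrix

one_hot = np.zeros(V)
one_hot[2] = 1.0                 # one-hot vector for the word with index 2

embedding = one_hot @ W          # the matrix product picks out row 2 of W
print(embedding)                 # identical to W[2]
```

This is why, after training, we can throw away the output layer and keep only the hidden-layer weight matrix as the embedding table.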
Simple Code Example (Word2Vec with Gensim)
Now let us see how Word2Vec is used in practice.
In this example, we:
- Train Word2Vec on small sentences
- Generate word embeddings
- Check similarity between words
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
```python
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "nlp"],
    ["nlp", "is", "powerful"],
    ["i", "enjoy", "learning", "nlp"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    min_count=1,      # keep every word, even one-time words
    sg=1,             # sg=1 selects Skip-Gram; sg=0 selects CBOW
)

print(model.wv["nlp"])
print(model.wv.similarity("nlp", "learning"))
```
Output Explanation:
- The first output is a 50-dimensional vector for the word nlp
- The similarity score shows how close two words are (range: −1 to 1)
A higher similarity value means the words appear in similar contexts.
Why Word2Vec Was a Breakthrough
- Captured semantic relationships
- Efficient and scalable
- Enabled vector arithmetic
- Foundation for modern NLP models
Almost all later embedding techniques build upon Word2Vec ideas.
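The famous "king − man + woman ≈ queen" vector arithmetic can be illustrated with hand-crafted toy vectors (these are made up for the demo, not trained embeddings; real Word2Vec vectors exhibit this pattern only approximately):

```python
import numpy as np

# Toy 2-D vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender"
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"),
           key=lambda w: cosine(target, vecs[w]))
print(best)
```

With trained embeddings, gensim exposes the same idea via most_similar with positive and negative word lists.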
Assignment / Homework
Theory:
- Explain CBOW in your own words
- Explain Skip-Gram in your own words
Practical:
- Run the Word2Vec code in Google Colab
- Change sg=0 and observe the difference
- Try increasing vector_size
Practice Questions
Q1. What does Word2Vec learn?
Q2. Which model is better for rare words?
Quick Quiz
Q1. CBOW predicts what?
Q2. What layer gives embeddings in Word2Vec?
Quick Recap
- Word2Vec learns embeddings using context
- CBOW: context → target
- Skip-Gram: target → context
- Skip-Gram works better for rare words
- Word2Vec is the foundation of modern embeddings
In the next lesson, we will study GloVe, which combines global statistics with Word2Vec ideas.