Word2Vec – CBOW and Skip-Gram
In the previous lesson, you learned what word embeddings are and why they are essential for modern NLP.
In this lesson, we study Word2Vec, the first widely successful algorithm that learns meaningful word embeddings directly from text.
You will understand:
- What Word2Vec is
- How it learns word meaning
- CBOW vs Skip-Gram
- When to use each model
What Is Word2Vec?
Word2Vec is a technique that learns dense vector representations of words by training a shallow neural network on large text data.
Instead of counting words (like Bag of Words), Word2Vec learns word meaning based on context.
Important point:
Word2Vec does NOT understand language. It learns statistical patterns from word usage.
The Key Idea Behind Word2Vec
Word2Vec is built on one simple idea:
Words that appear in similar contexts should have similar vectors.
Example:
- "I love deep learning"
- "I love machine learning"
Here, deep and machine appear in similar positions and contexts. Word2Vec learns this relationship automatically.
How Word2Vec Learns (High-Level View)
Word2Vec converts text into a learning problem.
The model:
- Takes words as input
- Tries to predict nearby words
- Adjusts vector values to reduce prediction error
After training:
- Hidden layer weights become word embeddings
- Similar words end up with similar vectors
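The loop described above can be sketched in plain NumPy. This is a simplified, illustrative Skip-Gram trainer with a full softmax (real Word2Vec uses tricks like negative sampling for speed); the corpus, dimension, learning rate, and epoch count are made up for the demo:

```python
import numpy as np

np.random.seed(0)
corpus = [["i", "love", "nlp"], ["i", "love", "machine", "learning"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                     # vocabulary size, embedding dimension

# Build (target, context) training pairs with a window of 1
pairs = []
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                pairs.append((idx[w], idx[sent[j]]))

W_in = 0.1 * np.random.randn(V, D)       # hidden-layer weights: the embeddings
W_out = 0.1 * np.random.randn(D, V)      # output-layer weights

def avg_loss():
    total = 0.0
    for t, c in pairs:
        scores = W_in[t] @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()
        total -= np.log(p[c])
    return total / len(pairs)

loss_before = avg_loss()
for epoch in range(100):
    for t, c in pairs:
        h = W_in[t]                      # embedding lookup (hidden layer)
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()                     # softmax over the vocabulary
        grad = p.copy()
        grad[c] -= 1.0                   # gradient of -log p(context | target)
        grad_h = W_out @ grad
        W_out -= 0.05 * np.outer(h, grad)
        W_in[t] -= 0.05 * grad_h         # adjust vectors to reduce error
loss_after = avg_loss()
print(loss_before, loss_after)
```

After training, the rows of W_in are the learned word vectors; the prediction loss drops as the vectors adjust.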
Two Architectures of Word2Vec
Word2Vec has two main training architectures:
- CBOW (Continuous Bag of Words)
- Skip-Gram
Both learn embeddings, but they frame the prediction task in opposite directions.
CBOW (Continuous Bag of Words)
CBOW predicts the target word using its context words.
Example sentence:
"I love natural language processing"
If the target word is natural, CBOW uses the surrounding context words (here, with a window of 1):
- love
- language
to predict natural.
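The context-target pairs that CBOW trains on can be generated with a few lines of Python. This is an illustrative sketch (the helper name and the window size of 1 are choices for the demo, not part of any library):

```python
sentence = ["i", "love", "natural", "language", "processing"]

def cbow_pairs(tokens, window=1):
    # For each position, collect the surrounding words as context
    # and pair them with the word at that position (the target).
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs(sentence):
    print(context, "->", target)
```

For the word natural, this produces the pair from the example above: the context ["love", "language"] predicts the target "natural".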
CBOW – Key Characteristics
- Faster training
- Works well with large datasets
- Better for frequent words
- Slightly less accurate for rare words
CBOW is commonly used when speed matters.
Skip-Gram
Skip-Gram does the opposite of CBOW.
It uses a target word to predict its context words.
Example:
Target word: natural
Predict:
- love
- language
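Skip-Gram's training pairs are the mirror image: one (target, context) pair per neighbor. A sketch under the same assumptions as before (illustrative helper, window of 1):

```python
sentence = ["i", "love", "natural", "language", "processing"]

def skipgram_pairs(tokens, window=1):
    # For each target word, emit one (target, context) pair
    # per neighboring word inside the window.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

for target, context in skipgram_pairs(sentence):
    print(target, "->", context)
```

Note that one target word now yields several training examples, one per context word, which is part of why Skip-Gram is slower but learns more from rare words.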
Skip-Gram – Key Characteristics
- Slower than CBOW
- Works very well for rare words
- Produces higher-quality embeddings
- Preferred for semantic accuracy
Most research-grade embeddings use Skip-Gram.
CBOW vs Skip-Gram (Comparison)
| Aspect | CBOW | Skip-Gram |
|---|---|---|
| Prediction direction | Context → Target | Target → Context |
| Training speed | Faster | Slower |
| Rare words | Less effective | Very effective |
| Embedding quality | Good | Better |
| Used when | Large data, speed | Accuracy matters |
Neural Network Structure (Conceptual)
Word2Vec uses a very simple neural network:
- Input layer (one-hot encoded word)
- Hidden layer (embedding layer)
- Output layer (softmax prediction)
The hidden layer weights are what we finally use as word embeddings.
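The "hidden layer weights become embeddings" point can be seen directly: multiplying a one-hot vector by the weight matrix simply selects one row of that matrix. A minimal illustration with made-up numbers:

```python
import numpy as np

V, D = 5, 3                      # vocabulary size, embedding dimension
W = np.arange(V * D, dtype=float).reshape(V, D)  # hidden-layer weight matrix

one_hot = np.zeros(V)
one_hot[2] = 1.0                 # one-hot vector for the word with index 2

embedding = one_hot @ W          # the matrix product picks out row 2 of W
print(embedding)                 # identical to W[2]
```

This is why, after training, we can throw away the output layer and keep only the hidden-layer weight matrix as the embedding table.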
Simple Code Example (Word2Vec with Gensim)
Now let us see how Word2Vec is used in practice.
In this example, we:
- Train Word2Vec on small sentences
- Generate word embeddings
- Check similarity between words
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
```python
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "nlp"],
    ["nlp", "is", "powerful"],
    ["i", "enjoy", "learning", "nlp"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    min_count=1,      # keep every word, even one-time words
    sg=1,             # sg=1 selects Skip-Gram; sg=0 selects CBOW
)

print(model.wv["nlp"])
print(model.wv.similarity("nlp", "learning"))
```
Output Explanation:
- The first output is a 50-dimensional vector for the word nlp
- The similarity score shows how close two words are (range: −1 to 1)
A higher similarity value means the words appear in similar contexts.
Why Word2Vec Was a Breakthrough
- Captured semantic relationships
- Efficient and scalable
- Enabled vector arithmetic
- Foundation for modern NLP models
Almost all later embedding techniques build upon Word2Vec ideas.
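The famous "king − man + woman ≈ queen" vector arithmetic can be illustrated with hand-crafted toy vectors (these are made up for the demo, not trained embeddings; real Word2Vec vectors exhibit this pattern only approximately):

```python
import numpy as np

# Toy 2-D vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender"
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"),
           key=lambda w: cosine(target, vecs[w]))
print(best)
```

With trained embeddings, gensim exposes the same idea via most_similar with positive and negative word lists.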
Assignment / Homework
Theory:
- Explain CBOW in your own words
- Explain Skip-Gram in your own words
Practical:
- Run the Word2Vec code in Google Colab
- Change sg=0 and observe the difference
- Try increasing vector_size
Practice Questions
Q1. What does Word2Vec learn?
Q2. Which model is better for rare words?
Quick Quiz
Q1. CBOW predicts what?
Q2. What layer gives embeddings in Word2Vec?
Quick Recap
- Word2Vec learns embeddings using context
- CBOW: context → target
- Skip-Gram: target → context
- Skip-Gram works better for rare words
- Word2Vec is the foundation of modern embeddings
In the next lesson, we will study GloVe, which combines global statistics with Word2Vec ideas.