NLP Lesson 18 – Word2Vec | Dataplexa

Word2Vec – CBOW and Skip-Gram

In the previous lesson, you learned what word embeddings are and why they are essential for modern NLP.

In this lesson, we study Word2Vec, the first widely successful algorithm that learns meaningful word embeddings directly from text.

You will understand:

  • What Word2Vec is
  • How it learns word meaning
  • CBOW vs Skip-Gram
  • When to use each model

What Is Word2Vec?

Word2Vec is a technique that learns dense vector representations of words by training a shallow neural network on large text data.

Instead of counting words (like Bag of Words), Word2Vec learns word meaning based on context.

Important point:

Word2Vec does NOT understand language. It learns statistical patterns from word usage.


The Key Idea Behind Word2Vec

Word2Vec is built on one simple idea:

Words that appear in similar contexts should have similar vectors.

Example:

  • "I love deep learning"
  • "I love machine learning"

Here, deep and machine appear in similar positions and contexts. Word2Vec learns this relationship automatically.


How Word2Vec Learns (High-Level View)

Word2Vec converts raw text into a prediction problem that a neural network can learn from.

The model:

  • Takes words as input
  • Tries to predict nearby words
  • Adjusts vector values to reduce prediction error

After training:

  • Hidden layer weights become word embeddings
  • Similar words end up with similar vectors
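
To make this concrete, here is a minimal sketch in plain Python (not Gensim code; the sentence and window size are made up for illustration) showing how a sentence becomes (input word, word to predict) training pairs:

# Slide a window over the sentence and pair each word with its neighbours.
# Word2Vec builds training examples like these internally.
sentence = ["i", "love", "natural", "language", "processing"]
window = 2  # number of words to look at on each side

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

print(pairs[:4])
# [('i', 'love'), ('i', 'natural'), ('love', 'i'), ('love', 'natural')]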

Two Architectures of Word2Vec

Word2Vec has two main training architectures:

  • CBOW (Continuous Bag of Words)
  • Skip-Gram

Both learn embeddings, but they make their predictions in opposite directions.


CBOW (Continuous Bag of Words)

CBOW predicts the target word using its context words.

Example sentence:

"I love natural language processing"

If the target word is natural, CBOW uses the surrounding context words (here, a window of one word on each side):

  • love
  • language

to predict natural.
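
A minimal sketch of how CBOW frames this (plain Python, window of one word on each side, sentence made up for illustration). The grouped context words are the input, and the target word is what the model must predict:

# CBOW training examples: context words (input) -> target word (output)
sentence = ["i", "love", "natural", "language", "processing"]
window = 1

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]
    print(context, "->", target)
# e.g. ['love', 'language'] -> natural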


CBOW – Key Characteristics

  • Faster training
  • Works well with large datasets
  • Better for frequent words
  • Slightly less accurate for rare words

CBOW is commonly used when speed matters.


Skip-Gram

Skip-Gram does the opposite of CBOW.

It uses a target word to predict its context words.

Example:

Target word: natural

Predict:

  • love
  • language
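
The same idea reversed: a minimal sketch (plain Python, same made-up sentence) of the Skip-Gram framing, where one target word must predict each of its context words:

# Skip-Gram training examples: target word (input) -> one context word (output)
sentence = ["i", "love", "natural", "language", "processing"]
window = 1

for i, target in enumerate(sentence):
    for context_word in sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]:
        print(target, "->", context_word)
# e.g. natural -> love, natural -> language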

Skip-Gram – Key Characteristics

  • Slower than CBOW
  • Works very well for rare words
  • Produces higher-quality embeddings
  • Preferred for semantic accuracy

In practice, Skip-Gram is usually the variant chosen when embedding quality matters most.


CBOW vs Skip-Gram (Comparison)

Aspect                 CBOW                  Skip-Gram
Prediction direction   Context → Target      Target → Context
Training speed         Faster                Slower
Rare words             Less effective        Very effective
Embedding quality      Good                  Better
Used when              Large data, speed     Accuracy matters

Neural Network Structure (Conceptual)

Word2Vec uses a very simple neural network:

  • Input layer (one-hot encoded word)
  • Hidden layer (embedding layer)
  • Output layer (softmax prediction)

The hidden layer weights are what we ultimately keep and use as the word embeddings.
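
Here is a minimal NumPy sketch of that structure (toy vocabulary size and dimensions, chosen only for illustration). The one-hot input simply selects one row of the input weight matrix, and that row is the word's embedding:

import numpy as np

vocab_size, embedding_dim = 5, 3                    # toy sizes
W_in = np.random.rand(vocab_size, embedding_dim)    # input -> hidden weights
W_out = np.random.rand(embedding_dim, vocab_size)   # hidden -> output weights

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                                    # pretend word index 2 is the input

hidden = one_hot @ W_in             # equals W_in[2]: the word's embedding
scores = hidden @ W_out             # one score per word in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()       # softmax prediction

print(hidden)                       # the row of W_in we keep as the embedding
print(round(probs.sum(), 6))        # 1.0: a probability distribution over words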


Simple Code Example (Word2Vec with Gensim)

Now let us see how Word2Vec is used in practice.

In this example, we:

  • Train Word2Vec on small sentences
  • Generate word embeddings
  • Check similarity between words

Where to run this code:

  • Google Colab (recommended)
  • Jupyter Notebook
  • VS Code with Python

Python Example: Word2Vec (Skip-Gram)
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "nlp"],
    ["nlp", "is", "powerful"],
    ["i", "enjoy", "learning", "nlp"]
]

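# vector_size: embedding dimension, window: context size on each side,
# min_count=1: keep words even if they appear only once (needed for tiny data)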
model = Word2Vec(
    sentences,
    vector_size=50,
    window=2,
    min_count=1,
    sg=1   # sg=1 means Skip-Gram
)

print(model.wv["nlp"])
print(model.wv.similarity("nlp", "learning"))

Output Explanation:

  • The first output is a 50-dimensional vector for the word nlp
  • The similarity score shows how close two words are (range: −1 to 1)

A higher similarity value means the words appear in similar contexts.
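
In the same session, you can also ask the trained model directly for the nearest neighbours of a word. On a corpus this tiny the results are mostly noise, but it shows the call you would use on real data:

# Find the 2 words whose vectors are closest to "nlp" in the trained model
print(model.wv.most_similar("nlp", topn=2))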


Why Word2Vec Was a Breakthrough

  • Captured semantic relationships
  • Efficient and scalable
  • Enabled vector arithmetic
  • Foundation for modern NLP models

Almost all later embedding techniques build upon Word2Vec ideas.
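
Vector arithmetic only becomes visible with embeddings trained on a large corpus, so the toy model above will not show it. The sketch below uses Gensim's downloader with the pretrained "glove-wiki-gigaword-50" vectors (these are GloVe vectors, previewed in the next lesson, but they load into the same KeyedVectors interface; the download needs an internet connection):

import gensim.downloader as api

# Downloads the pretrained vectors on first use (roughly 65 MB)
vectors = api.load("glove-wiki-gigaword-50")

# The classic analogy: king - man + woman should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))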


Assignment / Homework

Theory:

  • Explain CBOW in your own words
  • Explain Skip-Gram in your own words

Practical:

  • Run the Word2Vec code in Google Colab
  • Change sg=0 and observe the difference
  • Try increasing vector_size

Practice Questions

Q1. What does Word2Vec learn?

Dense vector representations of words based on context.

Q2. Which model is better for rare words?

Skip-Gram.

Quick Quiz

Q1. CBOW predicts what?

Target word from context words.

Q2. What layer gives embeddings in Word2Vec?

The hidden layer.

Quick Recap

  • Word2Vec learns embeddings using context
  • CBOW: context → target
  • Skip-Gram: target → context
  • Skip-Gram works better for rare words
  • Word2Vec is the foundation of modern embeddings

In the next lesson, we will study GloVe, which combines global co-occurrence statistics with the local context ideas behind Word2Vec.