Cosine Similarity
In the previous lesson, you learned what text similarity is and how machines compare text using numeric vectors.
Now we focus on the most widely used similarity measure in NLP: Cosine Similarity.
Cosine similarity is simple, powerful, and used in search engines, recommendation systems, document clustering, and modern NLP pipelines.
What Is Cosine Similarity?
Cosine similarity measures the cosine of the angle between two vectors, not their length.
Instead of asking:
“How far apart are the vectors?”
It asks:
“Are the vectors pointing in the same direction?”
This makes cosine similarity ideal for text data, where document length should not dominate similarity.
Why Angle Matters More Than Distance
Consider two documents:
- Short review: “Good product”
- Long review: “Good product with excellent quality and service”
Even though lengths differ, their meaning is closely related.
Cosine similarity captures this by ignoring magnitude and focusing on direction.
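This length-invariance is easy to check numerically. The sketch below (assuming NumPy is available) scales a made-up vector by a constant and shows that the cosine similarity stays at 1:

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the two magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short = np.array([1.0, 2.0, 3.0])
longer = 4 * short  # same direction, four times the magnitude

print(cosine(short, longer))  # ≈ 1.0: the length changed, the direction did not
```

This mirrors the short-review vs long-review intuition: repeating the same words makes a vector longer, not differently oriented.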
Cosine Similarity Formula
The mathematical formula is:
cos(θ) = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product of vectors
- ||A|| = magnitude (length) of vector A
- ||B|| = magnitude (length) of vector B
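To make the formula concrete, here it is worked through step by step in plain Python for two small, made-up vectors:

```python
import math

A = [2, 1]
B = [3, 4]

dot = sum(a * b for a, b in zip(A, B))    # 2*3 + 1*4 = 10
mag_A = math.sqrt(sum(a * a for a in A))  # sqrt(5) ≈ 2.236
mag_B = math.sqrt(sum(b * b for b in B))  # sqrt(25) = 5

cos_theta = dot / (mag_A * mag_B)         # 10 / (2.236 * 5)
print(round(cos_theta, 3))                # 0.894
```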
For vectors with non-negative components, such as TF-IDF vectors, the result always lies between 0 and 1.
How to Interpret Cosine Similarity Values
Understanding the score is critical.
- 1.0 → vectors point in exactly the same direction
- 0.7 – 1.0 → very similar
- 0.4 – 0.7 → moderately similar
- 0 – 0.4 → weak similarity
- 0 → no overlap at all
These ranges are rules of thumb, not strict boundaries. Negative values (down to −1) can appear in other domains, such as dense embeddings with negative components, but TF-IDF vectors are non-negative, so their scores never fall below 0.
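A quick sketch of both cases, using tiny hand-picked vectors rather than real text vectors:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Non-negative vectors (like TF-IDF): similarity stays in [0, 1]
print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ≈ 0.707

# Vectors with negative components (like embeddings):
# similarity can fall anywhere in [-1, 1]
print(cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```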
Cosine Similarity with Text (Conceptual View)
The process is always:
- Convert text into vectors (TF-IDF, embeddings)
- Compute cosine similarity
- Use the score for ranking or decision-making
Cosine similarity does not care about word order, only vector orientation.
Manual Intuition with Simple Vectors
Assume we have two vectors:
- A = [1, 1, 0]
- B = [1, 1, 0]
They point in the same direction, so cosine similarity = 1.
Now compare:
- C = [1, 0, 0]
- D = [0, 1, 0]
They are perpendicular, so cosine similarity = 0.
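Both cases can be verified with a few lines of NumPy (a minimal sketch reusing the vectors above):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([1, 1, 0])
B = np.array([1, 1, 0])
C = np.array([1, 0, 0])
D = np.array([0, 1, 0])

print(cosine(A, B))  # ≈ 1.0 (same direction)
print(cosine(C, D))  # 0.0 (perpendicular)
```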
Practical Demo: Cosine Similarity with TF-IDF
We now calculate cosine similarity between real text sentences.
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook (Anaconda)
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

# Convert the documents into TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Compute pairwise cosine similarity between all documents
cos_sim = cosine_similarity(X)
print(cos_sim)
```
Understanding the Output Matrix
The output is a square matrix.
- Rows represent documents
- Columns represent documents
- Each cell shows similarity score
Diagonal values are always 1 because each document is identical to itself.
Observe:
- Sentences 1 and 2 → higher similarity, because they share the token “learning” (note that “machine” and “machines” count as different words for TF-IDF)
- Sentence 3 → near-zero similarity with the others, because it shares no vocabulary with them
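A common next step is to pull the most similar pair of distinct documents out of this matrix. One way to sketch it: zero out the diagonal (self-similarity) and take the largest remaining entry.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

X = TfidfVectorizer().fit_transform(documents)
cos_sim = cosine_similarity(X)

# Zero out the diagonal so a document cannot match itself,
# then locate the largest remaining score
masked = cos_sim - np.eye(len(documents))
i, j = np.unravel_index(masked.argmax(), masked.shape)
print(f"Most similar pair: documents {i} and {j}, score {cos_sim[i, j]:.3f}")
```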
Cosine Similarity vs Euclidean Distance
This comparison is common in exams.
| Aspect | Cosine Similarity | Euclidean Distance |
|---|---|---|
| Focus | Direction | Distance |
| Text length sensitivity | Low | High |
| Common in NLP | Very common | Less common |
Applications of Cosine Similarity
- Search result ranking
- Document similarity
- Duplicate detection
- Recommendation systems
- Question–answer matching
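The first application, search ranking, can be sketched in a few lines. The documents and query below are made up for illustration; the key point is that the query must be transformed with the same fitted vectorizer as the documents, so all vectors share one vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how to train a neural network",
    "best pizza recipes at home",
    "machine learning model training tips"
]
query = "training neural networks"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # fit on the corpus
query_vector = vectorizer.transform([query])  # reuse the same vocabulary

scores = cosine_similarity(query_vector, doc_vectors)[0]
for idx in scores.argsort()[::-1]:            # highest score first
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

The pizza document shares no terms with the query, so its score is 0; the other two documents rank above it.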
Common Mistakes to Avoid
- Using raw text without vectorization
- Comparing vectors of different dimensions
- Confusing similarity with distance
Correct preprocessing is essential.
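The second mistake is worth seeing once. A minimal sketch (with made-up sentences) of the wrong and right way to vectorize:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Wrong: a separate vectorizer per document builds a separate vocabulary,
# so the resulting vectors have different dimensions and cannot be compared
v1 = TfidfVectorizer().fit_transform(["good product"])
v2 = TfidfVectorizer().fit_transform(["excellent quality and service"])
print(v1.shape, v2.shape)  # different numbers of columns

# Right: fit one vectorizer on all documents so they share one vocabulary
X = TfidfVectorizer().fit_transform([
    "good product",
    "excellent quality and service"
])
print(cosine_similarity(X).shape)  # one score per document pair: (2, 2)
```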
Assignment / Homework
Theory:
- Explain why cosine similarity ignores magnitude
- Compare cosine similarity and dot product
Practical:
- Take 5 sentences of your choice
- Compute TF-IDF vectors
- Find the most similar sentence pair
Practice Environment:
- Google Colab
- Jupyter Notebook
Practice Questions
Q1. What does cosine similarity measure?
Q2. Why is cosine similarity preferred for text?
Quick Quiz
Q1. What is the cosine similarity of identical vectors?
Q2. Can cosine similarity be used without vectorization?
Quick Recap
- Cosine similarity measures vector direction
- Widely used in NLP applications
- Insensitive to document length
- Works best with TF-IDF and embeddings
In the next lesson, we will use cosine similarity to build Document Clustering and real-world NLP systems.