Cosine Similarity
In the previous lesson, you learned what text similarity is and how machines compare text using numeric vectors.
Now we focus on the most widely used similarity measure in NLP: Cosine Similarity.
Cosine similarity is simple, powerful, and used in search engines, recommendation systems, document clustering, and modern NLP pipelines.
What Is Cosine Similarity?
Cosine similarity measures the cosine of the angle between two vectors, not their length.
Instead of asking:
“How far apart are the vectors?”
It asks:
“Are the vectors pointing in the same direction?”
This makes cosine similarity ideal for text data, where document length should not dominate similarity.
Why Angle Matters More Than Distance
Consider two documents:
- Short review: “Good product”
- Long review: “Good product with excellent quality and service”
Even though lengths differ, their meaning is closely related.
Cosine similarity captures this by ignoring magnitude and focusing on direction.
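This length-invariance is easy to check numerically. The sketch below (assuming NumPy is available) scales a made-up vector by a constant and shows that the cosine similarity stays at 1:

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the two magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short = np.array([1.0, 2.0, 3.0])
longer = 4 * short  # same direction, four times the magnitude

print(cosine(short, longer))  # ≈ 1.0: the length changed, the direction did not
```

This mirrors the short-review vs long-review intuition: repeating the same words makes a vector longer, not differently oriented.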
Cosine Similarity Formula
The mathematical formula is:
cos(θ) = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product of vectors
- ||A|| = magnitude (length) of vector A
- ||B|| = magnitude (length) of vector B
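To make the formula concrete, here it is worked through step by step in plain Python for two small, made-up vectors:

```python
import math

A = [2, 1]
B = [3, 4]

dot = sum(a * b for a, b in zip(A, B))    # 2*3 + 1*4 = 10
mag_A = math.sqrt(sum(a * a for a in A))  # sqrt(5) ≈ 2.236
mag_B = math.sqrt(sum(b * b for b in B))  # sqrt(25) = 5

cos_theta = dot / (mag_A * mag_B)         # 10 / (2.236 * 5)
print(round(cos_theta, 3))                # 0.894
```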
For vectors with non-negative components, such as TF-IDF vectors, the result always lies between 0 and 1.
How to Interpret Cosine Similarity Values
Understanding the score is critical.
- 1.0 → vectors point in exactly the same direction
- 0.7 – 1.0 → very similar
- 0.4 – 0.7 → moderately similar
- 0 – 0.4 → weak similarity
- 0 → no overlap at all
These ranges are rules of thumb, not strict boundaries. Negative values (down to −1) can appear in other domains, such as dense embeddings with negative components, but TF-IDF vectors are non-negative, so their scores never fall below 0.
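A quick sketch of both cases, using tiny hand-picked vectors rather than real text vectors:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Non-negative vectors (like TF-IDF): similarity stays in [0, 1]
print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ≈ 0.707

# Vectors with negative components (like embeddings):
# similarity can fall anywhere in [-1, 1]
print(cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```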
Cosine Similarity with Text (Conceptual View)
The process is always:
- Convert text into vectors (TF-IDF, embeddings)
- Compute cosine similarity
- Use the score for ranking or decision-making
Cosine similarity does not care about word order, only vector orientation.
Manual Intuition with Simple Vectors
Assume we have two vectors:
- A = [1, 1, 0]
- B = [1, 1, 0]
They point in the same direction, so cosine similarity = 1.
Now compare:
- C = [1, 0, 0]
- D = [0, 1, 0]
They are perpendicular, so cosine similarity = 0.
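Both cases can be verified with a few lines of NumPy (a minimal sketch reusing the vectors above):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([1, 1, 0])
B = np.array([1, 1, 0])
C = np.array([1, 0, 0])
D = np.array([0, 1, 0])

print(cosine(A, B))  # ≈ 1.0 (same direction)
print(cosine(C, D))  # 0.0 (perpendicular)
```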
Practical Demo: Cosine Similarity with TF-IDF
We now calculate cosine similarity between real text sentences.
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook (Anaconda)
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

# Convert the documents into TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Compute pairwise cosine similarity between all documents
cos_sim = cosine_similarity(X)
print(cos_sim)
```
Understanding the Output Matrix
The output is a square matrix.
- Rows represent documents
- Columns represent documents
- Each cell shows similarity score
Diagonal values are always 1 because each document is identical to itself.
Observe:
- Sentences 1 and 2 → higher similarity, because they share the token “learning” (note that “machine” and “machines” count as different words for TF-IDF)
- Sentence 3 → near-zero similarity with the others, because it shares no vocabulary with them
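A common next step is to pull the most similar pair of distinct documents out of this matrix. One way to sketch it: zero out the diagonal (self-similarity) and take the largest remaining entry.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

X = TfidfVectorizer().fit_transform(documents)
cos_sim = cosine_similarity(X)

# Zero out the diagonal so a document cannot match itself,
# then locate the largest remaining score
masked = cos_sim - np.eye(len(documents))
i, j = np.unravel_index(masked.argmax(), masked.shape)
print(f"Most similar pair: documents {i} and {j}, score {cos_sim[i, j]:.3f}")
```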
Cosine Similarity vs Euclidean Distance
This comparison is common in exams.
| Aspect | Cosine Similarity | Euclidean Distance |
|---|---|---|
| Focus | Direction | Distance |
| Text length sensitivity | Low | High |
| Common in NLP | Very common | Less common |
Applications of Cosine Similarity
- Search result ranking
- Document similarity
- Duplicate detection
- Recommendation systems
- Question–answer matching
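The first application, search ranking, can be sketched in a few lines. The documents and query below are made up for illustration; the key point is that the query must be transformed with the same fitted vectorizer as the documents, so all vectors share one vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how to train a neural network",
    "best pizza recipes at home",
    "machine learning model training tips"
]
query = "training neural networks"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # fit on the corpus
query_vector = vectorizer.transform([query])  # reuse the same vocabulary

scores = cosine_similarity(query_vector, doc_vectors)[0]
for idx in scores.argsort()[::-1]:            # highest score first
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

The pizza document shares no terms with the query, so its score is 0; the other two documents rank above it.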
Common Mistakes to Avoid
- Using raw text without vectorization
- Comparing vectors of different dimensions
- Confusing similarity with distance
Correct preprocessing is essential.
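The second mistake is worth seeing once. A minimal sketch (with made-up sentences) of the wrong and right way to vectorize:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Wrong: a separate vectorizer per document builds a separate vocabulary,
# so the resulting vectors have different dimensions and cannot be compared
v1 = TfidfVectorizer().fit_transform(["good product"])
v2 = TfidfVectorizer().fit_transform(["excellent quality and service"])
print(v1.shape, v2.shape)  # different numbers of columns

# Right: fit one vectorizer on all documents so they share one vocabulary
X = TfidfVectorizer().fit_transform([
    "good product",
    "excellent quality and service"
])
print(cosine_similarity(X).shape)  # one score per document pair: (2, 2)
```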
Assignment / Homework
Theory:
- Explain why cosine similarity ignores magnitude
- Compare cosine similarity and dot product
Practical:
- Take 5 sentences of your choice
- Compute TF-IDF vectors
- Find the most similar sentence pair
Practice Environment:
- Google Colab
- Jupyter Notebook
Practice Questions
Q1. What does cosine similarity measure?
Q2. Why is cosine similarity preferred for text?
Quick Quiz
Q1. What is the cosine similarity of identical vectors?
Q2. Can cosine similarity be used without vectorization?
Quick Recap
- Cosine similarity measures vector direction
- Widely used in NLP applications
- Insensitive to document length
- Works best with TF-IDF and embeddings
In the next lesson, we will use cosine similarity to build Document Clustering and real-world NLP systems.