NLP Lesson 29 – Cosine Similarity | Dataplexa

Cosine Similarity

In the previous lesson, you learned what text similarity is and how machines compare text using numeric vectors.

Now we focus on the most widely used similarity measure in NLP: Cosine Similarity.

Cosine similarity is simple, powerful, and used in search engines, recommendation systems, document clustering, and modern NLP pipelines.


What Is Cosine Similarity?

Cosine similarity measures the angle between two vectors, not their length.

Instead of asking:

“How far apart are the vectors?”

It asks:

“Are the vectors pointing in the same direction?”

This makes cosine similarity ideal for text data, where document length should not dominate similarity.


Why Angle Matters More Than Distance

Consider two documents:

  • Short review: “Good product”
  • Long review: “Good product with excellent quality and service”

Even though lengths differ, their meaning is closely related.

Cosine similarity captures this by ignoring magnitude and focusing on direction.
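A quick sketch makes this concrete. The vectors below are hypothetical term counts (not real TF-IDF values): scaling a vector, as a longer document would, leaves the cosine score unchanged.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-count vectors: "good product" vs the same words repeated
short = np.array([[1.0, 1.0, 0.0]])
long_ = 5 * short  # five times the magnitude, but the same direction

print(cosine_similarity(short, long_)[0, 0])  # ≈ 1.0 — length is ignored
```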


Cosine Similarity Formula

The mathematical formula is:

cos(θ) = (A · B) / (||A|| × ||B||)

Where:

  • A · B = dot product of vectors
  • ||A|| = magnitude (length) of vector A
  • ||B|| = magnitude (length) of vector B

For non-negative text vectors such as TF-IDF, the result always lies between 0 and 1.
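The formula can be computed step by step with NumPy (the two vectors here are purely illustrative):

```python
import numpy as np

A = np.array([3.0, 4.0, 0.0])
B = np.array([4.0, 3.0, 0.0])

dot = np.dot(A, B)          # A · B = 3*4 + 4*3 = 24
norm_A = np.linalg.norm(A)  # ||A|| = sqrt(9 + 16) = 5
norm_B = np.linalg.norm(B)  # ||B|| = 5

print(dot / (norm_A * norm_B))  # 24 / 25 = 0.96
```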


How to Interpret Cosine Similarity Values

Understanding the score is critical.

  • 1.0 → vectors point in exactly the same direction
  • 0.7 – 0.9 → very similar
  • 0.4 – 0.7 → moderately similar
  • below 0.4 → weak similarity
  • 0 → no overlap at all (orthogonal vectors)

Negative values appear in other domains, but NLP vectors usually produce non-negative scores.
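One way to turn the guide above into code is a small labelling helper. The thresholds are illustrative, matching the ranges listed here, and not any standard scale:

```python
def similarity_label(score):
    # Illustrative thresholds following the guide above — not a standard scale
    if score >= 0.999:
        return "identical"
    if score >= 0.7:
        return "very similar"
    if score >= 0.4:
        return "moderately similar"
    if score > 0.0:
        return "weak similarity"
    return "unrelated"

print(similarity_label(0.82))  # very similar
print(similarity_label(0.15))  # weak similarity
```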


Cosine Similarity with Text (Conceptual View)

The process is always:

  1. Convert text into vectors (TF-IDF, embeddings)
  2. Compute cosine similarity
  3. Use the score for ranking or decision-making

Cosine similarity does not care about word order, only vector orientation.


Manual Intuition with Simple Vectors

Assume we have two vectors:

  • A = [1, 1, 0]
  • B = [1, 1, 0]

They point in the same direction, so cosine similarity = 1.

Now compare:

  • C = [1, 0, 0]
  • D = [0, 1, 0]

They are perpendicular, so cosine similarity = 0.
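Both claims can be verified with scikit-learn's cosine_similarity on these exact vectors:

```python
from sklearn.metrics.pairwise import cosine_similarity

A, B = [1, 1, 0], [1, 1, 0]
C, D = [1, 0, 0], [0, 1, 0]

print(cosine_similarity([A], [B])[0, 0])  # ≈ 1.0 — same direction
print(cosine_similarity([C], [D])[0, 0])  # 0.0 — perpendicular
```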


Practical Demo: Cosine Similarity with TF-IDF

We now calculate cosine similarity between real text sentences.

Where to run this code:

  • Google Colab (recommended)
  • Jupyter Notebook (Anaconda)

Python Example: Cosine Similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

# Convert each document into a TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all documents
cos_sim = cosine_similarity(X)

print(cos_sim)

Understanding the Output Matrix

The output is a square matrix.

  • Rows represent documents
  • Columns represent documents
  • Each cell shows similarity score

Diagonal values are always 1 because each document is identical to itself.

Observe:

  • Sentence 1 & 2 → higher similarity
  • Sentence 3 → low similarity with others
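Individual pairs can be read straight out of the matrix by indexing it (the demo setup is repeated here so the snippet runs on its own):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]
X = TfidfVectorizer().fit_transform(documents)
cos_sim = cosine_similarity(X)

print(cos_sim[0, 1])  # sentences 1 & 2: share "learning" -> nonzero score
print(cos_sim[0, 2])  # sentences 1 & 3: no shared terms -> 0.0
```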

Cosine Similarity vs Euclidean Distance

This comparison is common in exams.

  Aspect                    Cosine Similarity   Euclidean Distance
  Focus                     Direction           Distance
  Text length sensitivity   Low                 High
  Common in NLP             Yes                 Less common
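The table above can be illustrated numerically. Using hypothetical term-count vectors for a short review and the same review repeated five times:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

short = np.array([[1, 1]])  # counts for "good product"
long_ = np.array([[5, 5]])  # same words, repeated five times

print(cosine_similarity(short, long_)[0, 0])    # ≈ 1.0 — direction unchanged
print(euclidean_distances(short, long_)[0, 0])  # ≈ 5.66 — length dominates
```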

Applications of Cosine Similarity

  • Search result ranking
  • Document similarity
  • Duplicate detection
  • Recommendation systems
  • Question–answer matching

Common Mistakes to Avoid

  • Using raw text without vectorization
  • Comparing vectors of different dimensions
  • Confusing similarity with distance

Correct preprocessing is essential.


Assignment / Homework

Theory:

  • Explain why cosine similarity ignores magnitude
  • Compare cosine similarity and dot product

Practical:

  • Take 5 sentences of your choice
  • Compute TF-IDF vectors
  • Find the most similar sentence pair

Practice Environment:

  • Google Colab
  • Jupyter Notebook

Practice Questions

Q1. What does cosine similarity measure?

The angle between two vectors.

Q2. Why is cosine similarity preferred for text?

Because it ignores document length and focuses on direction.

Quick Quiz

Q1. What is the cosine similarity of identical vectors?

1

Q2. Can cosine similarity be used without vectorization?

No

Quick Recap

  • Cosine similarity measures vector direction
  • Widely used in NLP applications
  • Insensitive to document length
  • Works best with TF-IDF and embeddings

In the next lesson, we will use cosine similarity to build document clustering and other real-world NLP systems.