NLP Lesson 28 – Text Similarity | Dataplexa

Text Similarity

So far, you have learned how NLP models can discover topics and represent text numerically.

Now we move to a very important concept: Text Similarity. This tells us how similar or different two pieces of text are.

Text similarity is the foundation of:

  • Search engines
  • Plagiarism detection
  • Recommendation systems
  • Chatbots and Q&A systems
  • Document clustering

What Is Text Similarity?

Text similarity measures how closely two texts are related in meaning or structure.

Instead of saying “same or not same”, similarity gives a score, usually between 0 and 1.

  • Higher score → more similar
  • Lower score → less similar


Why Text Similarity Is Important

In real life, users rarely phrase the same request in exactly the same words.

Example:

  • “Best mobile under 20000”
  • “Good phones below 20k”

A smart system must understand that both sentences mean almost the same thing.


Types of Text Similarity

Text similarity can be categorized into two major types:

  • Lexical Similarity: based on words
  • Semantic Similarity: based on meaning

We will start with lexical similarity and later move to semantic methods.


Lexical Similarity (Word-Based)

Lexical similarity compares text using:

  • Common words
  • Word frequency
  • Vector representations (BoW, TF-IDF)

It works well when wording is similar, but struggles with synonyms.
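As a minimal sketch of a word-based measure, here is Jaccard similarity, which scores two texts by the fraction of distinct words they share (this particular measure is used here only for illustration; it is not the TF-IDF approach demoed later):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Lexical similarity: shared words divided by total distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# The two queries from earlier share no exact words, so a lexical
# measure scores them 0 even though they mean the same thing.
print(jaccard_similarity("best mobile under 20000", "good phones below 20k"))  # 0.0
print(jaccard_similarity("I love machine learning", "I love deep learning"))   # 0.6
```

Note how the synonym problem shows up immediately: "mobile"/"phones" and "20000"/"20k" contribute nothing to the score.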


Semantic Similarity (Meaning-Based)

Semantic similarity tries to capture meaning rather than exact words.

It uses:

  • Word embeddings
  • Sentence embeddings
  • Transformer-based models

We will cover these methods in upcoming lessons.


How Machines Compare Text

Machines cannot compare raw text.

The process is:

  1. Convert text into vectors
  2. Apply a similarity measure
  3. Produce a similarity score

One of the most common similarity measures is Cosine Similarity, which we will study in detail in the next lesson.
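The three steps above can be sketched with two tiny hand-made count vectors (the vocabulary and counts below are made up purely for illustration; cosine similarity itself is covered properly next lesson):

```python
import numpy as np

# Toy count vectors over the vocabulary ["love", "machine", "learning", "weather"]
a = np.array([1, 1, 1, 0])  # "I love machine learning"
c = np.array([0, 0, 0, 1])  # "the weather is hot today"

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, a))  # ~1.0 — a vector compared with itself
print(cosine(a, c))  # 0.0 — no overlapping words
```

Identical vectors score 1, vectors with no overlap score 0, which is exactly the 0-to-1 scale described earlier.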


Simple Example of Similar vs Dissimilar Text

Consider these sentences:

  • Sentence A: “I love machine learning”
  • Sentence B: “I enjoy learning machines”
  • Sentence C: “The weather is hot today”

A similarity system should assign:

  • High similarity between A and B
  • Low similarity between A and C

Practical Demo: Text Similarity Using TF-IDF

We will now calculate text similarity using TF-IDF vectors.

Where to run this code:

  • Google Colab (recommended)
  • Jupyter Notebook (Anaconda)

Python Example: Text Similarity with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

# Step 1: convert the texts into TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Step 2: apply a similarity measure (cosine similarity)
similarity_matrix = cosine_similarity(X)

# Step 3: the score for every pair of sentences
print(similarity_matrix)

Understanding the Output

The output is a similarity matrix.

Each value represents similarity between two sentences.

  • Values close to 1 → very similar
  • Values close to 0 → very different

You will observe:

  • Sentence 1 & 2 have higher similarity
  • Sentence 3 is dissimilar to both

This shows that, once text is vectorized, machines can compare it numerically.
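Rather than reading the matrix by eye, you can pick out the most similar pair programmatically. This sketch repeats the demo above and then zeroes the diagonal (each sentence is trivially identical to itself) before taking the largest remaining entry:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today",
]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)

# Ignore the diagonal: every sentence has similarity 1 with itself.
np.fill_diagonal(sim, 0.0)

# Row/column of the largest remaining score = the most similar pair.
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"Most similar pair: sentence {i + 1} and sentence {j + 1} "
      f"(score {sim[i, j]:.2f})")
```

Here the winner is sentences 1 and 2; the same pattern works for the 10-sentence homework exercise.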


Interpreting Similarity Scores

Similarity scores do not have fixed meaning.

Typical interpretation:

  • 0.8 – 1.0 → highly similar
  • 0.4 – 0.8 → moderately similar
  • Below 0.4 → weak similarity

Thresholds depend on the application.
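A score on its own rarely drives a decision; applications usually map it to a coarse label or a yes/no via a cutoff. This sketch uses the illustrative bands from the table above (the 0.4 and 0.8 cutoffs are examples, not universal rules, and should be tuned per application):

```python
def label_similarity(score: float) -> str:
    """Map a raw similarity score to a coarse, application-tunable label."""
    if score >= 0.8:
        return "highly similar"
    if score >= 0.4:
        return "moderately similar"
    return "weak similarity"

print(label_similarity(0.91))  # highly similar
print(label_similarity(0.55))  # moderately similar
print(label_similarity(0.10))  # weak similarity
```

A plagiarism checker might only flag scores above 0.8, while a search engine might happily rank results scoring 0.3, which is why the cutoffs belong to the application, not the measure.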


Applications of Text Similarity

  • Search result ranking
  • Duplicate content detection
  • Resume–job matching
  • Chatbot response selection
  • Document clustering

Common Mistakes to Avoid

  • Comparing raw text directly
  • Ignoring preprocessing
  • Expecting lexical methods to capture meaning

Better representations lead to better similarity.


Assignment / Homework

Theory:

  • Difference between lexical and semantic similarity
  • Why cosine similarity is commonly used

Practical:

  • Take 10 sentences of your choice
  • Compute TF-IDF similarity matrix
  • Identify the most similar pair

Practice Environment:

  • Google Colab
  • Jupyter Notebook

Practice Questions

Q1. Can similarity be greater than 1?

No. Cosine similarity never exceeds 1, and with non-negative TF-IDF vectors it stays between 0 and 1. (In general, cosine similarity can range from −1 to 1.)

Q2. Does high lexical similarity guarantee same meaning?

No. Lexical similarity ignores meaning and synonyms.

Quick Quiz

Q1. What must text be converted into before comparison?

Numeric vectors.

Q2. Which similarity measure is most common in NLP?

Cosine similarity.

Quick Recap

  • Text similarity measures closeness between texts
  • Similarity is numeric, not binary
  • Lexical and semantic similarity differ
  • Vectorization is mandatory

In the next lesson, we will dive deep into Cosine Similarity and understand its mathematics and intuition.