NLP Lesson 28 – Text Similarity | Dataplexa

Text Similarity

So far, you have learned how NLP models can discover topics and represent text numerically.

Now we move to a very important concept: Text Similarity. This tells us how similar or different two pieces of text are.

Text similarity is the foundation of:

  • Search engines
  • Plagiarism detection
  • Recommendation systems
  • Chatbots and Q&A systems
  • Document clustering

What Is Text Similarity?

Text similarity measures how closely two texts are related in meaning or structure.

Instead of saying “same or not same”, similarity gives a score, usually between 0 and 1.

  • Higher score → more similar
  • Lower score → less similar


Why Text Similarity Is Important

In real life, users rarely phrase the same request in exactly the same words.

Example:

  • “Best mobile under 20000”
  • “Good phones below 20k”

A smart system must understand that both sentences mean almost the same thing.


Types of Text Similarity

Text similarity can be categorized into two major types:

  • Lexical Similarity: based on words
  • Semantic Similarity: based on meaning

We will start with lexical similarity and later move to semantic methods.


Lexical Similarity (Word-Based)

Lexical similarity compares text using:

  • Common words
  • Word frequency
  • Vector representations (BoW, TF-IDF)

It works well when wording is similar, but struggles with synonyms.
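As a minimal sketch of a word-based measure, here is Jaccard similarity, which scores two texts by the fraction of distinct words they share (this particular measure is used here only for illustration; it is not the TF-IDF approach demoed later):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Lexical similarity: shared words divided by total distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# The two queries from earlier share no exact words, so a lexical
# measure scores them 0 even though they mean the same thing.
print(jaccard_similarity("best mobile under 20000", "good phones below 20k"))  # 0.0
print(jaccard_similarity("I love machine learning", "I love deep learning"))   # 0.6
```

Note how the synonym problem shows up immediately: "mobile"/"phones" and "20000"/"20k" contribute nothing to the score.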


Semantic Similarity (Meaning-Based)

Semantic similarity tries to capture meaning rather than exact words.

It uses:

  • Word embeddings
  • Sentence embeddings
  • Transformer-based models

We will cover these methods in upcoming lessons.


How Machines Compare Text

Machines cannot compare raw text.

The process is:

  1. Convert text into vectors
  2. Apply a similarity measure
  3. Produce a similarity score

One of the most common similarity measures is Cosine Similarity, which we will study in detail in the next lesson.
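The three steps above can be sketched with two tiny hand-made count vectors (the vocabulary and counts below are made up purely for illustration; cosine similarity itself is covered properly next lesson):

```python
import numpy as np

# Toy count vectors over the vocabulary ["love", "machine", "learning", "weather"]
a = np.array([1, 1, 1, 0])  # "I love machine learning"
c = np.array([0, 0, 0, 1])  # "the weather is hot today"

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, a))  # ~1.0 — a vector compared with itself
print(cosine(a, c))  # 0.0 — no overlapping words
```

Identical vectors score 1, vectors with no overlap score 0, which is exactly the 0-to-1 scale described earlier.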


Simple Example of Similar vs Dissimilar Text

Consider these sentences:

  • Sentence A: “I love machine learning”
  • Sentence B: “I enjoy learning machines”
  • Sentence C: “The weather is hot today”

A similarity system should assign:

  • High similarity between A and B
  • Low similarity between A and C

Practical Demo: Text Similarity Using TF-IDF

We will now calculate text similarity using TF-IDF vectors.

Where to run this code:

  • Google Colab (recommended)
  • Jupyter Notebook (Anaconda)

Python Example: Text Similarity with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today"
]

# Step 1: convert the texts into TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Step 2: apply a similarity measure (cosine similarity)
similarity_matrix = cosine_similarity(X)

# Step 3: the score for every pair of sentences
print(similarity_matrix)

Understanding the Output

The output is a similarity matrix.

Each value represents similarity between two sentences.

  • Values close to 1 → very similar
  • Values close to 0 → very different

You will observe:

  • Sentence 1 & 2 have higher similarity
  • Sentence 3 is dissimilar to both

This shows that, once text is vectorized, machines can compare it numerically.
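Rather than reading the matrix by eye, you can pick out the most similar pair programmatically. This sketch repeats the demo above and then zeroes the diagonal (each sentence is trivially identical to itself) before taking the largest remaining entry:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today",
]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)

# Ignore the diagonal: every sentence has similarity 1 with itself.
np.fill_diagonal(sim, 0.0)

# Row/column of the largest remaining score = the most similar pair.
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"Most similar pair: sentence {i + 1} and sentence {j + 1} "
      f"(score {sim[i, j]:.2f})")
```

Here the winner is sentences 1 and 2; the same pattern works for the 10-sentence homework exercise.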


Interpreting Similarity Scores

Similarity scores do not have fixed meaning.

Typical interpretation:

  • 0.8 – 1.0 → highly similar
  • 0.4 – 0.8 → moderately similar
  • Below 0.4 → weak similarity

Thresholds depend on the application.
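A score on its own rarely drives a decision; applications usually map it to a coarse label or a yes/no via a cutoff. This sketch uses the illustrative bands from the table above (the 0.4 and 0.8 cutoffs are examples, not universal rules, and should be tuned per application):

```python
def label_similarity(score: float) -> str:
    """Map a raw similarity score to a coarse, application-tunable label."""
    if score >= 0.8:
        return "highly similar"
    if score >= 0.4:
        return "moderately similar"
    return "weak similarity"

print(label_similarity(0.91))  # highly similar
print(label_similarity(0.55))  # moderately similar
print(label_similarity(0.10))  # weak similarity
```

A plagiarism checker might only flag scores above 0.8, while a search engine might happily rank results scoring 0.3, which is why the cutoffs belong to the application, not the measure.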


Applications of Text Similarity

  • Search result ranking
  • Duplicate content detection
  • Resume–job matching
  • Chatbot response selection
  • Document clustering

Common Mistakes to Avoid

  • Comparing raw text directly
  • Ignoring preprocessing
  • Expecting lexical methods to capture meaning

Better representations lead to better similarity.


Assignment / Homework

Theory:

  • Difference between lexical and semantic similarity
  • Why cosine similarity is commonly used

Practical:

  • Take 10 sentences of your choice
  • Compute TF-IDF similarity matrix
  • Identify the most similar pair

Practice Environment:

  • Google Colab
  • Jupyter Notebook

Practice Questions

Q1. Can similarity be greater than 1?

No. Cosine similarity never exceeds 1, and with non-negative TF-IDF vectors it stays between 0 and 1. (In general, cosine similarity can range from −1 to 1.)

Q2. Does high lexical similarity guarantee same meaning?

No. Lexical similarity ignores meaning and synonyms.

Quick Quiz

Q1. What must text be converted into before comparison?

Numeric vectors.

Q2. Which similarity measure is most common in NLP?

Cosine similarity.

Quick Recap

  • Text similarity measures closeness between texts
  • Similarity is numeric, not binary
  • Lexical and semantic similarity differ
  • Vectorization is mandatory

In the next lesson, we will dive deep into Cosine Similarity and understand its mathematics and intuition.