Text Similarity
So far, you have learned how NLP models can discover topics and represent text numerically.
Now we move to an important concept: Text Similarity, which tells us how similar or different two pieces of text are.
Text similarity is the foundation of:
- Search engines
- Plagiarism detection
- Recommendation systems
- Chatbots and Q&A systems
- Document clustering
What Is Text Similarity?
Text similarity measures how closely two texts are related in meaning or structure.
Instead of saying “same or not same”, similarity gives a score, usually between 0 and 1.
- Higher score → more similar
- Lower score → less similar
Why Text Similarity Is Important
In real life, users rarely type the exact same sentence to express the same intent.
Example:
- “Best mobile under 20000”
- “Good phones below 20k”
A smart system must understand that both sentences mean almost the same thing.
Types of Text Similarity
Text similarity can be categorized into two major types:
- Lexical Similarity: based on words
- Semantic Similarity: based on meaning
We will start with lexical similarity and later move to semantic methods.
Lexical Similarity (Word-Based)
Lexical similarity compares text using:
- Common words
- Word frequency
- Vector representations (BoW, TF-IDF)
It works well when wording is similar, but struggles with synonyms.
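One simple lexical measure is word-level Jaccard similarity: the number of shared words divided by the total number of distinct words. The small sketch below (the function name and the example sentences from earlier are just illustrations) shows why lexical methods struggle with synonyms: two sentences with the same meaning but no shared words score zero.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# Identical wording scores the maximum of 1.0
print(jaccard_similarity("best mobile under 20000", "best mobile under 20000"))  # 1.0

# Synonymous wording scores 0.0, because no surface words overlap
print(jaccard_similarity("best mobile under 20000", "good phones below 20k"))  # 0.0
```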
Semantic Similarity (Meaning-Based)
Semantic similarity tries to capture meaning rather than exact words.
It uses:
- Word embeddings
- Sentence embeddings
- Transformer-based models
We will cover these methods in upcoming lessons.
How Machines Compare Text
Machines cannot compare raw text directly; text must first be represented numerically.
The process is:
- Convert text into vectors
- Apply a similarity measure
- Produce a similarity score
One of the most common similarity measures is Cosine Similarity, which we will study in detail in the next lesson.
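The three steps above can be sketched with a toy example. The vectors below are hand-built word counts over a made-up five-word vocabulary (an assumption for illustration, not the output of any particular vectorizer); cosine similarity is the dot product of two vectors divided by the product of their lengths.

```python
import numpy as np

# Step 1: convert text into vectors (toy word counts over the
# vocabulary ["i", "love", "machine", "learning", "weather"])
vec_a = np.array([1, 1, 1, 1, 0])  # "I love machine learning"
vec_c = np.array([0, 0, 0, 0, 1])  # a weather sentence with no shared words

# Step 2: apply a similarity measure (cosine: dot product / norms)
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Step 3: produce a similarity score
print(cosine(vec_a, vec_a))  # identical vectors → 1.0
print(cosine(vec_a, vec_c))  # no shared words → 0.0
```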
Simple Example of Similar vs Dissimilar Text
Consider these sentences:
- Sentence A: “I love machine learning”
- Sentence B: “I enjoy learning machines”
- Sentence C: “The weather is hot today”
A similarity system should assign:
- High similarity between A and B
- Low similarity between A and C
Practical Demo: Text Similarity Using TF-IDF
We will now calculate text similarity using TF-IDF vectors.
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook (Anaconda)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I love machine learning",
    "I enjoy learning machines",
    "The weather is hot today",
]

# Convert each sentence into a TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Compare every pair of sentence vectors
similarity_matrix = cosine_similarity(X)
print(similarity_matrix)
Understanding the Output
The output is a similarity matrix.
Each value represents similarity between two sentences.
- Values close to 1 → very similar
- Values close to 0 → very different
You will observe:
- Sentence 1 & 2 have higher similarity
- Sentence 3 is dissimilar to both
This shows that machines can compare text numerically.
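You can also pick out the most similar pair of sentences programmatically: zero out the diagonal (every sentence is trivially identical to itself) and take the position of the largest remaining value. The matrix below is hand-filled with illustrative numbers; the exact values in the demo depend on the vectorizer.

```python
import numpy as np

# A hypothetical 3x3 similarity matrix shaped like the demo's output
sim = np.array([
    [1.00, 0.35, 0.02],
    [0.35, 1.00, 0.03],
    [0.02, 0.03, 1.00],
])

# Remove self-similarity on the diagonal, then find the best distinct pair
masked = sim - np.eye(len(sim))
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(i, j)  # → 0 1 (sentences 1 and 2 are the most similar pair)
```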
Interpreting Similarity Scores
Similarity scores do not have a fixed, universal meaning.
Typical interpretation:
- 0.8 – 1.0 → highly similar
- 0.4 – 0.8 → moderately similar
- Below 0.4 → weak similarity
Thresholds depend on the application.
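In practice, applications turn raw scores into decisions with thresholds. The sketch below applies the illustrative buckets from this lesson; the function name and cutoffs are assumptions you would tune per application.

```python
def label(score: float) -> str:
    # Illustrative buckets from the lesson; tune thresholds per application
    if score >= 0.8:
        return "highly similar"
    if score >= 0.4:
        return "moderately similar"
    return "weak similarity"

print(label(0.9))  # highly similar
print(label(0.5))  # moderately similar
print(label(0.1))  # weak similarity
```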
Applications of Text Similarity
- Search result ranking
- Duplicate content detection
- Resume–job matching
- Chatbot response selection
- Document clustering
Common Mistakes to Avoid
- Comparing raw text directly
- Ignoring preprocessing
- Expecting lexical methods to capture meaning
Better representations lead to better similarity.
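Preprocessing matters because lexical methods match surface strings: without it, "Learning!" and "learning" count as different words. A minimal cleanup sketch (the function name and the exact cleaning rules are illustrative assumptions):

```python
import re

def preprocess(text: str) -> str:
    # Lowercase and replace punctuation with spaces so that
    # "Machine-Learning!" and "machine learning" match
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text).strip()

print(preprocess("I LOVE Machine-Learning!"))  # i love machine learning
```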
Assignment / Homework
Theory:
- Difference between lexical and semantic similarity
- Why cosine similarity is commonly used
Practical:
- Take 10 sentences of your choice
- Compute TF-IDF similarity matrix
- Identify the most similar pair
Practice Environment:
- Google Colab
- Jupyter Notebook
Practice Questions
Q1. Can similarity be greater than 1?
Q2. Does high lexical similarity guarantee same meaning?
Quick Quiz
Q1. What must text be converted into before comparison?
Q2. Which similarity measure is most common in NLP?
Quick Recap
- Text similarity measures closeness between texts
- Similarity is numeric, not binary
- Lexical and semantic similarity differ
- Vectorization is mandatory
In the next lesson, we will dive deep into Cosine Similarity and understand its mathematics and intuition.