GenAI Lesson 15 – Similarity Search | Dataplexa

Similarity Search

Once text is converted into embeddings, the next challenge is retrieving the most relevant ones efficiently.

Similarity search is the mechanism that allows machines to retrieve relevant information based on semantics, not exact keyword matches.

This concept is central to search engines, recommendation systems, and RAG pipelines.

The Problem Similarity Search Solves

Traditional search relies on exact word matches.

This fails when:

  • Users phrase questions differently
  • Synonyms are used
  • Conceptual meaning matters more than keywords

Similarity search addresses these issues by comparing meaning rather than text.
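To make the failure mode concrete, here is a minimal sketch in plain Python (the sentences are invented for illustration) showing that two texts about the same topic can share zero keywords:

```python
# Two texts with the same meaning but no shared content words.
query = "how do I fix my car"
document = "automobile repair guide"

query_words = set(query.lower().split())
doc_words = set(document.lower().split())

# Keyword search scores this pair by word overlap -- which is empty.
overlap = query_words & doc_words
print(overlap)  # set() -- an exact-match search finds nothing
```

An embedding-based comparison would score "car" and "automobile" as close, which is exactly the gap similarity search fills.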

Thinking Before Coding

Before writing any code, ask:

What exactly are we comparing?

In similarity search, we compare vectors, not strings.

That means every input must be embedded first.

High-Level Similarity Search Flow

A typical workflow looks like this:

  • Embed all documents or chunks
  • Store embeddings
  • Embed the user query
  • Compare query embedding to stored vectors
  • Return the most similar results

Every production system follows this pattern.
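The five steps above can be sketched end to end. This is a toy version: the bag-of-words `embed` function below is a stand-in for a real embedding model, and the vocabulary and documents are invented for illustration.

```python
import numpy as np

VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]

def embed(text):
    # Toy stand-in for an embedding model: a word-count vector over VOCAB.
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# 1. Embed all documents; 2. store the vectors (here, a plain dict).
docs = {
    "pets": "the cat and the dog are pet animals",
    "finance": "the stock market price moved today",
}
store = {name: embed(text) for name, text in docs.items()}

# 3. Embed the user query.
q = embed("is a cat a good pet")

# 4. Compare the query vector to every stored vector.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {name: cosine(q, v) for name, v in store.items()}

# 5. Return the most similar result.
best = max(scores, key=scores.get)
print(best)  # "pets"
```

Swapping the toy `embed` for a real embedding model and the dict for a vector database gives you the production version of the same pattern.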

Vector Distance and Similarity

To compare embeddings, we need a numerical similarity measure.

The most common choice is cosine similarity.

Why Cosine Similarity?

Cosine similarity measures the angle between vectors, not their magnitude.

This makes it ideal for comparing semantic meaning.
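One consequence worth seeing directly: scaling a vector changes its magnitude but not its direction, so its cosine similarity to any query is unchanged. A quick check, with the cosine function written inline:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([0.8, 0.2])
w = 10 * v  # same direction, 10x the magnitude

q = np.array([0.78, 0.22])
print(cosine(q, v), cosine(q, w))  # identical scores: magnitude is ignored
```

This is why cosine similarity is robust to, say, one document chunk being much longer than another.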

Simple Similarity Calculation

Let’s start with a small, controlled example to understand how similarity works.


import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D vectors standing in for real embeddings,
# which typically have hundreds of dimensions.
doc_1 = np.array([0.8, 0.2])
doc_2 = np.array([0.75, 0.25])
doc_3 = np.array([0.1, 0.9])

# The embedded user query.
query = np.array([0.78, 0.22])

print(cosine_similarity(query, doc_1))
print(cosine_similarity(query, doc_2))
print(cosine_similarity(query, doc_3))
  

Before running this code, understand the intent:

  • query represents user intent
  • doc_* represent stored document chunks
  • The highest score indicates closest meaning

Output (rounded to four decimals):

0.9996
0.9989
0.3761

The query is semantically closer to doc_1 and doc_2 than to doc_3.

Ranking Results

In real systems, you do not return a single match.

You rank results by similarity and return the top-k items.

Ranking Logic Example


# Map each document name to its embedding vector.
documents = {
    "doc_1": doc_1,
    "doc_2": doc_2,
    "doc_3": doc_3
}

# Score every document against the query.
scores = {
    name: cosine_similarity(query, vec)
    for name, vec in documents.items()
}

# Sort by score, highest (most similar) first.
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
print(ranked)
  

This ranking step is critical.

It determines which information the model will see next in a RAG system.
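As a sketch of how that hand-off might look in a RAG pipeline (the document texts and the k=2 cutoff here are invented for illustration):

```python
# Ranked results as produced above: (name, score), best first.
ranked = [("doc_1", 0.9996), ("doc_2", 0.9989), ("doc_3", 0.3761)]

# Hypothetical raw texts behind each stored vector.
texts = {
    "doc_1": "Cosine similarity compares vector directions.",
    "doc_2": "Similar vectors point the same way.",
    "doc_3": "Bananas are rich in potassium.",
}

# Keep only the top-k matches and assemble them into model context.
k = 2
top_k = [name for name, _ in ranked[:k]]
context = "\n".join(texts[name] for name in top_k)
print(context)
```

Only the text inside `context` reaches the language model, so a bad ranking here cannot be recovered later in the pipeline.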

Output (scores rounded to four decimals):

[('doc_1', 0.9996), ('doc_2', 0.9989), ('doc_3', 0.3761)]

Similarity Search at Scale

The examples so far use only a few vectors.

In real applications, you may have millions of embeddings.

Brute-force comparison becomes too slow.

Approximate Nearest Neighbor (ANN)

To scale similarity search, systems use approximate methods.

These trade a small loss in accuracy for large gains in speed.

Vector databases implement these techniques internally.
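For intuition about what those databases optimize, here is the exact (brute-force) baseline as a vectorized NumPy sketch: normalize everything once, then a single matrix product scores the query against every stored vector. The data is random and the sizes are arbitrary; the point is that this approach still touches all N vectors, which is precisely the O(N) work that ANN indexes avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 64                      # 10k stored vectors, 64 dimensions
vectors = rng.normal(size=(N, d))

# Normalize rows once; cosine similarity then reduces to a dot product.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=d)
query /= np.linalg.norm(query)

scores = vectors @ query               # N cosine scores in one matmul: O(N*d)

k = 5
top_k = np.argsort(scores)[::-1][:k]   # indices of the k most similar vectors
print(top_k, scores[top_k])
```

At millions of vectors, even this vectorized scan becomes the bottleneck, which is when an approximate index inside a vector database pays off.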

Where Similarity Search Is Used

Similarity search powers:

  • Semantic document search
  • Recommendation engines
  • Duplicate detection
  • RAG pipelines

If embeddings are the foundation, similarity search is the engine.

Common Mistakes to Avoid

  • Comparing raw text instead of vectors
  • Mixing embedding models
  • Ignoring normalization

These mistakes lead to incorrect rankings.
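The normalization point is easy to verify: a raw dot product conflates magnitude with similarity, so an unnormalized long vector can outrank a better-aligned short one. A sketch with made-up vectors:

```python
import numpy as np

query = np.array([1.0, 0.0])
aligned = np.array([2.0, 0.1])     # nearly the same direction, modest length
long_off = np.array([10.0, 10.0])  # 45 degrees off, but much longer

# Raw dot product rewards sheer magnitude...
print(np.dot(query, aligned), np.dot(query, long_off))

# ...while cosine similarity (normalized) ranks by direction.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(query, aligned), cosine(query, long_off))
```

Here the dot product puts `long_off` first while cosine similarity correctly prefers `aligned` -- the kind of ranking inversion the "ignoring normalization" mistake produces.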

Practice

What must text be converted into before similarity search?



Which similarity metric is most commonly used?



What step orders results by relevance?



Quick Quiz

Similarity search compares:





What determines which result is returned first?





Why are vector databases used?





Recap: Similarity search retrieves relevant information by comparing embeddings, not keywords.

Next up: We introduce vector databases — purpose-built systems for storing and searching embeddings.