GenAI Lesson 15 – Similarity Search | Dataplexa

Similarity Search

Once text is converted into embeddings, the next challenge is retrieving the most relevant ones efficiently.

Similarity search is the mechanism that allows machines to retrieve relevant information based on semantics, not exact keyword matches.

This concept is central to search engines, recommendation systems, and RAG pipelines.

The Problem Similarity Search Solves

Traditional search relies on exact word matches.

This fails when:

  • Users phrase questions differently
  • Synonyms are used
  • Conceptual meaning matters more than keywords

Similarity search addresses these issues by comparing meaning rather than text.
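To make the failure mode concrete, here is a minimal sketch in plain Python (the sentences are invented for illustration) showing that two texts about the same topic can share zero keywords:

```python
# Two texts with the same meaning but no shared content words.
query = "how do I fix my car"
document = "automobile repair guide"

query_words = set(query.lower().split())
doc_words = set(document.lower().split())

# Keyword search scores this pair by word overlap -- which is empty.
overlap = query_words & doc_words
print(overlap)  # set() -- an exact-match search finds nothing
```

An embedding-based comparison would score "car" and "automobile" as close, which is exactly the gap similarity search fills.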

Thinking Before Coding

Before writing any code, ask:

What exactly are we comparing?

In similarity search, we compare vectors, not strings.

That means every input must be embedded first.

High-Level Similarity Search Flow

A typical workflow looks like this:

  • Embed all documents or chunks
  • Store embeddings
  • Embed the user query
  • Compare query embedding to stored vectors
  • Return the most similar results

Every production system follows this pattern.
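The five steps above can be sketched end to end. This is a toy version: the bag-of-words `embed` function below is a stand-in for a real embedding model, and the vocabulary and documents are invented for illustration.

```python
import numpy as np

VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]

def embed(text):
    # Toy stand-in for an embedding model: a word-count vector over VOCAB.
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# 1. Embed all documents; 2. store the vectors (here, a plain dict).
docs = {
    "pets": "the cat and the dog are pet animals",
    "finance": "the stock market price moved today",
}
store = {name: embed(text) for name, text in docs.items()}

# 3. Embed the user query.
q = embed("is a cat a good pet")

# 4. Compare the query vector to every stored vector.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {name: cosine(q, v) for name, v in store.items()}

# 5. Return the most similar result.
best = max(scores, key=scores.get)
print(best)  # "pets"
```

Swapping the toy `embed` for a real embedding model and the dict for a vector database gives you the production version of the same pattern.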

Vector Distance and Similarity

To compare embeddings, we need a numerical similarity measure.

The most common choice is cosine similarity.

Why Cosine Similarity?

Cosine similarity measures the angle between vectors, not their magnitude.

This makes it ideal for comparing semantic meaning.
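One consequence worth seeing directly: scaling a vector changes its magnitude but not its direction, so its cosine similarity to any query is unchanged. A quick check, with the cosine function written inline:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([0.8, 0.2])
w = 10 * v  # same direction, 10x the magnitude

q = np.array([0.78, 0.22])
print(cosine(q, v), cosine(q, w))  # identical scores: magnitude is ignored
```

This is why cosine similarity is robust to, say, one document chunk being much longer than another.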

Simple Similarity Calculation

Let’s start with a small, controlled example to understand how similarity works.


import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D vectors standing in for real embeddings,
# which typically have hundreds of dimensions.
doc_1 = np.array([0.8, 0.2])
doc_2 = np.array([0.75, 0.25])
doc_3 = np.array([0.1, 0.9])

# The embedded user query.
query = np.array([0.78, 0.22])

print(cosine_similarity(query, doc_1))
print(cosine_similarity(query, doc_2))
print(cosine_similarity(query, doc_3))
  

Before running this code, understand the intent:

  • query represents user intent
  • doc_* represent stored document chunks
  • The highest score indicates closest meaning

Output (rounded to four decimals):

0.9996
0.9989
0.3761

The query is semantically closer to doc_1 and doc_2 than to doc_3.

Ranking Results

In real systems, you do not return a single match.

You rank results by similarity and return the top-k items.

Ranking Logic Example


# Map each document name to its embedding vector.
documents = {
    "doc_1": doc_1,
    "doc_2": doc_2,
    "doc_3": doc_3
}

# Score every document against the query.
scores = {
    name: cosine_similarity(query, vec)
    for name, vec in documents.items()
}

# Sort by score, highest (most similar) first.
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
print(ranked)
  

This ranking step is critical.

It determines which information the model will see next in a RAG system.
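As a sketch of how that hand-off might look in a RAG pipeline (the document texts and the k=2 cutoff here are invented for illustration):

```python
# Ranked results as produced above: (name, score), best first.
ranked = [("doc_1", 0.9996), ("doc_2", 0.9989), ("doc_3", 0.3761)]

# Hypothetical raw texts behind each stored vector.
texts = {
    "doc_1": "Cosine similarity compares vector directions.",
    "doc_2": "Similar vectors point the same way.",
    "doc_3": "Bananas are rich in potassium.",
}

# Keep only the top-k matches and assemble them into model context.
k = 2
top_k = [name for name, _ in ranked[:k]]
context = "\n".join(texts[name] for name in top_k)
print(context)
```

Only the text inside `context` reaches the language model, so a bad ranking here cannot be recovered later in the pipeline.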

Output (scores rounded to four decimals):

[('doc_1', 0.9996), ('doc_2', 0.9989), ('doc_3', 0.3761)]

Similarity Search at Scale

The examples so far use only a few vectors.

In real applications, you may have millions of embeddings.

Brute-force comparison becomes too slow.

Approximate Nearest Neighbor (ANN)

To scale similarity search, systems use approximate methods.

These trade a small loss in accuracy for large gains in speed.

Vector databases implement these techniques internally.
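For intuition about what those databases optimize, here is the exact (brute-force) baseline as a vectorized NumPy sketch: normalize everything once, then a single matrix product scores the query against every stored vector. The data is random and the sizes are arbitrary; the point is that this approach still touches all N vectors, which is precisely the O(N) work that ANN indexes avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 64                      # 10k stored vectors, 64 dimensions
vectors = rng.normal(size=(N, d))

# Normalize rows once; cosine similarity then reduces to a dot product.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=d)
query /= np.linalg.norm(query)

scores = vectors @ query               # N cosine scores in one matmul: O(N*d)

k = 5
top_k = np.argsort(scores)[::-1][:k]   # indices of the k most similar vectors
print(top_k, scores[top_k])
```

At millions of vectors, even this vectorized scan becomes the bottleneck, which is when an approximate index inside a vector database pays off.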

Where Similarity Search Is Used

Similarity search powers:

  • Semantic document search
  • Recommendation engines
  • Duplicate detection
  • RAG pipelines

If embeddings are the foundation, similarity search is the engine.

Common Mistakes to Avoid

  • Comparing raw text instead of vectors
  • Mixing embedding models
  • Ignoring normalization

These mistakes lead to incorrect rankings.
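The normalization point is easy to verify: a raw dot product conflates magnitude with similarity, so an unnormalized long vector can outrank a better-aligned short one. A sketch with made-up vectors:

```python
import numpy as np

query = np.array([1.0, 0.0])
aligned = np.array([2.0, 0.1])     # nearly the same direction, modest length
long_off = np.array([10.0, 10.0])  # 45 degrees off, but much longer

# Raw dot product rewards sheer magnitude...
print(np.dot(query, aligned), np.dot(query, long_off))

# ...while cosine similarity (normalized) ranks by direction.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(query, aligned), cosine(query, long_off))
```

Here the dot product puts `long_off` first while cosine similarity correctly prefers `aligned` -- the kind of ranking inversion the "ignoring normalization" mistake produces.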

Practice

What must text be converted into before similarity search?



Which similarity metric is most commonly used?



What step orders results by relevance?



Quick Quiz

Similarity search compares:





What determines which result is returned first?





Why are vector databases used?





Recap: Similarity search retrieves relevant information by comparing embeddings, not keywords.

Next up: We introduce vector databases — purpose-built systems for storing and searching embeddings.