Generative AI Course
Indexing Strategies: How Chunks Are Stored and Retrieved
Once documents are chunked, the next critical question is simple:
How does the system find the right chunk at query time?
The answer lies in indexing.
What Indexing Means in GenAI Systems
Indexing is the process of organizing chunk embeddings so they can be searched efficiently.
Without indexing, retrieval would require comparing a query against every chunk — which is infeasible at scale.
How to Think Before Building an Index
Before writing code, engineers decide:
- How large is the dataset?
- How fast should retrieval be?
- How accurate must results be?
Indexing is always a trade-off between speed, accuracy, and cost.
Naive Linear Search (Baseline Understanding)
The simplest retrieval approach is brute-force similarity comparison.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def linear_search(query_vec, vectors):
    # Score the query against every stored vector: O(n) work per query.
    scores = []
    for idx, vec in enumerate(vectors):
        score = cosine_similarity(query_vec, vec)
        scores.append((idx, score))
    return sorted(scores, key=lambda x: x[1], reverse=True)
This works for learning but fails in production due to high latency.
Why Linear Search Breaks at Scale
Problems appear when:
- Vector counts grow beyond tens of thousands
- Queries arrive concurrently
- Latency requirements are strict
Indexing structures solve these issues.
Approximate Nearest Neighbor (ANN) Indexing
Most GenAI systems use ANN instead of exact search.
ANN trades a small amount of accuracy for massive speed gains.
How ANN Indexing Works Conceptually
Instead of comparing against all vectors, ANN:
- Groups similar vectors
- Reduces search space
- Searches only promising regions
This makes millisecond-level retrieval possible.
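The three steps above can be sketched with a toy two-stage search in plain NumPy. This is a simplified illustration of the cluster-then-probe idea behind IVF-style ANN, not a real implementation; production libraries train centroids with k-means and use far more sophisticated structures.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((1000, 64)).astype('float32')

# Stage 1: group similar vectors into coarse clusters. Here the
# centroids are just sampled vectors for simplicity; real systems
# learn them with k-means.
n_clusters = 10
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1
)

def ann_search(query, n_probe=2, k=5):
    # Stage 2: visit only the n_probe clusters whose centroids are
    # closest to the query, then score candidates within them.
    dists_to_centroids = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(dists_to_centroids)[:n_probe]
    candidates = np.where(np.isin(assignments, probe))[0]
    cand_dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(cand_dists)[:k]]

query = rng.random(64).astype('float32')
top = ann_search(query)
```

Because only a few clusters are scored, the search touches a fraction of the data; the cost is that the true nearest neighbor may sit in an unvisited cluster, which is exactly the accuracy-for-speed trade ANN makes.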
FAISS Index Example
FAISS is a common vector indexing library.
import faiss
import numpy as np

dimension = 768

# A flat index stores raw vectors and compares the query against
# all of them exactly (no approximation).
index = faiss.IndexFlatL2(dimension)

vectors = np.random.rand(1000, dimension).astype('float32')
index.add(vectors)

query = np.random.rand(1, dimension).astype('float32')
distances, indices = index.search(query, k=5)  # top-5 nearest vectors
Here, vectors are indexed once and reused for fast retrieval.
What Actually Happens During a Search
At query time:
- The query is embedded
- The index narrows candidate vectors
- Similarity is computed only on candidates
This is why indexing directly affects latency and relevance.
Index Types and Their Trade-offs
Common index types include:
- Flat indexes – exact results, but every query scans all vectors
- IVF – clusters vectors and probes only a few; fast, slightly lower recall
- HNSW – graph-based; strong balance of speed and accuracy, higher memory use
Production systems often benchmark multiple indexes.
Indexing and Updates
Indexes must handle:
- New documents
- Deleted content
- Re-embedding updates
Some indexes support dynamic updates better than others.
Indexing in Real RAG Pipelines
In practice:
- Chunk → embed → index once
- Query → embed → retrieve top-k
- Retrieved chunks → prompt
Indexing sits at the center of this loop.
How Learners Should Practice Indexing
To gain real skill:
- Compare linear vs indexed retrieval
- Measure latency differences
- Inspect retrieved chunk relevance
Indexing quality shows up in answer quality.
Practice
What process organizes embeddings for fast search?
What technique trades accuracy for speed?
Indexing primarily improves what system metric?
Quick Quiz
ANN stands for:
ANN reduces what during retrieval?
Choosing an index type is a:
Recap: Indexing determines how fast and how accurately knowledge is retrieved in GenAI systems.
Next up: Prompt caching — reducing cost and latency in repeated queries.