Generative AI Course
Indexing Strategies: How Chunks Are Stored and Retrieved
Once documents are chunked, the next critical question is simple:
How does the system find the right chunk at query time?
The answer lies in indexing.
What Indexing Means in GenAI Systems
Indexing is the process of organizing chunk embeddings so they can be searched efficiently.
Without indexing, retrieval would require comparing a query against every chunk — which is infeasible at scale.
How to Think Before Building an Index
Before writing code, engineers decide:
- How large is the dataset?
- How fast should retrieval be?
- How accurate must results be?
Indexing is always a trade-off between speed, accuracy, and cost.
Naive Linear Search (Baseline Understanding)
The simplest retrieval approach is brute-force similarity comparison.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def linear_search(query_vec, vectors):
    # Score the query against every stored vector: O(n) work per query.
    scores = []
    for idx, vec in enumerate(vectors):
        score = cosine_similarity(query_vec, vec)
        scores.append((idx, score))
    return sorted(scores, key=lambda x: x[1], reverse=True)
This works for learning but fails in production due to high latency.
Why Linear Search Breaks at Scale
Problems appear when:
- Vector counts grow beyond tens of thousands
- Queries arrive concurrently
- Latency requirements are strict
Indexing structures solve these issues.
Approximate Nearest Neighbor (ANN) Indexing
Most GenAI systems use ANN instead of exact search.
ANN trades a small amount of accuracy for massive speed gains.
How ANN Indexing Works Conceptually
Instead of comparing against all vectors, ANN:
- Groups similar vectors
- Reduces search space
- Searches only promising regions
This makes millisecond-level retrieval possible.
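The three steps above can be sketched with a toy two-stage search in plain NumPy. This is a simplified illustration of the cluster-then-probe idea behind IVF-style ANN, not a real implementation; production libraries train centroids with k-means and use far more sophisticated structures.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((1000, 64)).astype('float32')

# Stage 1: group similar vectors into coarse clusters. Here the
# centroids are just sampled vectors for simplicity; real systems
# learn them with k-means.
n_clusters = 10
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1
)

def ann_search(query, n_probe=2, k=5):
    # Stage 2: visit only the n_probe clusters whose centroids are
    # closest to the query, then score candidates within them.
    dists_to_centroids = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(dists_to_centroids)[:n_probe]
    candidates = np.where(np.isin(assignments, probe))[0]
    cand_dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(cand_dists)[:k]]

query = rng.random(64).astype('float32')
top = ann_search(query)
```

Because only a few clusters are scored, the search touches a fraction of the data; the cost is that the true nearest neighbor may sit in an unvisited cluster, which is exactly the accuracy-for-speed trade ANN makes.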
FAISS Index Example
FAISS is a common vector indexing library.
import faiss
import numpy as np

dimension = 768

# A flat index stores raw vectors and compares the query against
# all of them exactly (no approximation).
index = faiss.IndexFlatL2(dimension)

vectors = np.random.rand(1000, dimension).astype('float32')
index.add(vectors)

query = np.random.rand(1, dimension).astype('float32')
distances, indices = index.search(query, k=5)  # top-5 nearest vectors
Here, vectors are indexed once and reused for fast retrieval.
What Actually Happens During a Search
At query time:
- The query is embedded
- The index narrows candidate vectors
- Similarity is computed only on candidates
This is why indexing directly affects latency and relevance.
Index Types and Their Trade-offs
Common index types include:
- Flat indexes – exact results, but every query scans all vectors
- IVF – clusters vectors and probes only a few; fast, slightly lower recall
- HNSW – graph-based; strong balance of speed and accuracy, higher memory use
Production systems often benchmark multiple indexes.
Indexing and Updates
Indexes must handle:
- New documents
- Deleted content
- Re-embedding updates
Some indexes support dynamic updates better than others.
Indexing in Real RAG Pipelines
In practice:
- Chunk → embed → index once
- Query → embed → retrieve top-k
- Retrieved chunks → prompt
Indexing sits at the center of this loop.
How Learners Should Practice Indexing
To gain real skill:
- Compare linear vs indexed retrieval
- Measure latency differences
- Inspect retrieved chunk relevance
Indexing quality shows up in answer quality.
Practice
What process organizes embeddings for fast search?
What technique trades accuracy for speed?
Indexing primarily improves what system metric?
Quick Quiz
ANN stands for:
ANN reduces what during retrieval?
Choosing an index type is a:
Recap: Indexing determines how fast and how accurately knowledge is retrieved in GenAI systems.
Next up: Prompt caching — reducing cost and latency in repeated queries.