Generative AI Course
Document Embeddings
In real-world GenAI systems, information rarely lives in single sentences.
Knowledge exists in documents: policies, manuals, PDFs, emails, logs, and reports.
Document embeddings are the bridge between large unstructured text and intelligent retrieval systems.
Why Sentence Embeddings Are Not Enough
Sentence embeddings work well for short text, but enterprise knowledge is long and structured.
A document can span:
- Multiple topics
- Hundreds or thousands of words
- Sections written with different intents
Embedding an entire document as one vector often loses important details.
Thinking Before Coding
Before writing any code, ask:
What do we want to retrieve from a document?
Usually, we want the most relevant part, not the entire document.
What Is a Document Embedding?
A document embedding represents the semantic meaning of long-form text.
In practice, this is rarely done as a single vector.
Instead, documents are broken into smaller units and embedded piece by piece.
The Chunking Principle
Chunking means splitting a document into manageable, semantically meaningful parts.
Each chunk is embedded independently.
This enables precise retrieval later.
Why Chunking Exists
- Models have context length limits
- Smaller chunks improve retrieval accuracy
- Fine-grained matches outperform whole-doc matches
Simple Chunking Example
Let’s simulate a document and split it into logical chunks.
document = """Generative AI enables machines to create content.
It is used in chatbots and assistants.

Embeddings convert text into vectors.
They power search and retrieval systems."""

# Blank lines mark the chunk boundaries, so each paragraph becomes one chunk
chunks = document.strip().split("\n\n")
print(chunks)
Each chunk now represents a focused idea.
Embedding Chunks Instead of Documents
Once chunked, each piece is embedded separately.
This allows the system to retrieve only the most relevant section.
Conceptual Embedding Representation
embeddings = {
    "chunk_1": [0.8, 0.2],  # toy 2-D vectors; real embeddings have hundreds of dimensions
    "chunk_2": [0.1, 0.9]
}
print(embeddings)
Each chunk now has its own semantic identity.
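In a real system, these vectors come from an embedding model. The sketch below loops over the chunks and embeds each one with a hypothetical embed() function — a toy word-counting stand-in, not a real model — just to make the pattern runnable:

```python
# Sketch only: embed() is a hypothetical stand-in for a real embedding model,
# which would return vectors with hundreds of dimensions.
def embed(text: str) -> list[float]:
    # Toy 2-D vector: counts of "generation" words vs. "retrieval" words
    generation_words = {"generative", "ai", "chatbots", "assistants"}
    retrieval_words = {"embeddings", "vectors", "search", "retrieval"}
    tokens = text.lower().replace(".", "").split()
    return [
        float(sum(t in generation_words for t in tokens)),
        float(sum(t in retrieval_words for t in tokens)),
    ]

chunks = [
    "Generative AI enables machines to create content. It is used in chatbots and assistants.",
    "Embeddings convert text into vectors. They power search and retrieval systems.",
]

# Each chunk is embedded independently, keyed by its position in the document
chunk_embeddings = {f"chunk_{i + 1}": embed(c) for i, c in enumerate(chunks)}
print(chunk_embeddings)
```

In practice, embed() would call a real embedding model; the structure — one vector per chunk, keyed so it can be traced back to its source text — stays the same.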
Searching Inside Documents
The power of document embeddings comes from semantic search.
A user query is embedded and compared with stored chunk embeddings.
Query Matching Example
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity measures how closely two vectors point in the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_embedding = np.array([0.75, 0.25])
score_1 = cosine_similarity(query_embedding, embeddings["chunk_1"])
score_2 = cosine_similarity(query_embedding, embeddings["chunk_2"])
print(score_1, score_2)
The chunk with the higher similarity is the most relevant answer.
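Selecting that chunk is a single max() over the scores. This sketch repeats the toy data from above so it runs on its own:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy chunk embeddings and query, as in the example above
embeddings = {"chunk_1": [0.8, 0.2], "chunk_2": [0.1, 0.9]}
query_embedding = np.array([0.75, 0.25])

# Rank all chunks by similarity to the query and take the best match
best_chunk = max(embeddings, key=lambda k: cosine_similarity(query_embedding, embeddings[k]))
print(best_chunk)  # chunk_1 points in nearly the same direction as the query
```

In a full system, the text of the winning chunk — not just its key — is what gets returned to the user or passed to a language model.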
Why Document Embeddings Matter in GenAI
Document embeddings enable:
- Enterprise search
- Knowledge-base assistants
- Retrieval-Augmented Generation (RAG)
- Compliance and policy Q&A
Almost every serious GenAI product uses this pattern.
Document Embeddings vs Sentence Embeddings
Key differences:
- Sentence embeddings focus on short text
- Document embeddings manage long, structured text
- Document embeddings rely heavily on chunking strategy
The embedding model may be the same, but the system design is different.
Common Mistakes to Avoid
- Embedding entire documents as one vector
- Using chunks that are too large or too small
- Ignoring document structure
These mistakes lead to poor retrieval quality.
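One common way to control chunk size is a fixed-size sliding window with overlap, so an idea that straddles a boundary still appears intact in at least one chunk. Below is a minimal word-based sketch; the chunk_size and overlap values are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of chunk_size words, overlapping by overlap words."""
    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

sample = ("Generative AI enables machines to create content. "
          "Embeddings convert text into vectors for search and retrieval.")
for chunk in chunk_text(sample):
    print(chunk)
```

Production systems often chunk by tokens rather than words and respect sentence or section boundaries, but the trade-off is the same: larger chunks carry more context, smaller chunks give sharper matches.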
Practice
What technique splits documents into smaller semantic units?
What numerical representation is used for document search?
What process finds the most relevant document chunk?
Quick Quiz
What is embedded instead of the full document?
Document embeddings are primarily used for:
Which GenAI technique relies heavily on document embeddings?
Recap: Document embeddings enable precise retrieval by representing long text as searchable semantic chunks.
Next up: We move from theory to practice — generating embeddings using OpenAI APIs.