GenAI Lesson 13 – Document Embeddings | Dataplexa

Document Embeddings

In real-world GenAI systems, information rarely lives in single sentences.

Knowledge exists in documents: policies, manuals, PDFs, emails, logs, and reports.

Document embeddings are the bridge between large unstructured text and intelligent retrieval systems.

Why Sentence Embeddings Are Not Enough

Sentence embeddings work well for short text, but enterprise knowledge is long and structured.

A document can span:

  • Multiple topics
  • Hundreds or thousands of words
  • Sections with different intents

Embedding an entire document as one vector often loses important details.
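To see why, here is a purely illustrative sketch with toy 2-D vectors (real embeddings have hundreds or thousands of dimensions, and the "whole document" vector is approximated here as the average of its section vectors — an assumption for illustration, not how embedding models actually work):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors for two sections with very different meanings
section_about_ai = np.array([0.8, 0.2])
section_about_search = np.array([0.1, 0.9])

# One vector for the whole document, approximated as the mean of its sections
whole_document = (section_about_ai + section_about_search) / 2

query = np.array([0.75, 0.25])  # a query aimed at the AI section

print(round(cosine_similarity(query, section_about_ai), 3))  # strong match
print(round(cosine_similarity(query, whole_document), 3))    # weaker, diluted match
```

The single document vector sits between the two topics and matches the query noticeably worse than the relevant section does — the detail has been averaged away.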

Thinking Before Coding

Before writing any code, ask:

What do we want to retrieve from a document?

Usually, we want the most relevant part, not the entire document.

What Is a Document Embedding?

A document embedding represents the semantic meaning of long-form text.

In practice, this is rarely done as a single vector.

Instead, documents are broken into smaller units and embedded piece by piece.

The Chunking Principle

Chunking means splitting a document into manageable, semantically meaningful parts.

Each chunk is embedded independently.

This enables precise retrieval later.

Why Chunking Exists

  • Models have context length limits
  • Smaller chunks improve retrieval accuracy
  • Fine-grained matches outperform whole-doc matches

Simple Chunking Example

Let’s simulate a document and split it into logical chunks.


document = """
Generative AI enables machines to create content.
It is used in chatbots and assistants.

Embeddings convert text into vectors.
They power search and retrieval systems.
"""

# Blank lines mark topic boundaries, so split on double newlines
chunks = document.strip().split("\n\n")
print(chunks)
  

Each chunk now represents a focused idea.

['Generative AI enables machines to create content.\nIt is used in chatbots and assistants.', 'Embeddings convert text into vectors.\nThey power search and retrieval systems.']

Embedding Chunks Instead of Documents

Once chunked, each piece is embedded separately.

This allows the system to retrieve only the most relevant section.

Conceptual Embedding Representation


# Toy 2-D vectors; real embedding models produce hundreds or thousands of dimensions
embeddings = {
    "chunk_1": [0.8, 0.2],
    "chunk_2": [0.1, 0.9]
}

print(embeddings)
  

Each chunk now has its own semantic identity.

{'chunk_1': [0.8, 0.2], 'chunk_2': [0.1, 0.9]}

Searching Inside Documents

The power of document embeddings comes from semantic search.

A user query is embedded and compared with stored chunk embeddings.

Query Matching Example


import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_embedding = np.array([0.75, 0.25])

score_1 = cosine_similarity(query_embedding, embeddings["chunk_1"])
score_2 = cosine_similarity(query_embedding, embeddings["chunk_2"])

print(round(score_1, 3), round(score_2, 3))


The chunk with the higher similarity score is the most relevant match.

0.997 0.419
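Putting the pieces together, a minimal retrieval step ranks every chunk by similarity and returns the text of the best match. This is a self-contained sketch: the chunk texts, toy embeddings, and the helper name retrieve_best_chunk are illustrative, not part of any library:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative chunks and toy 2-D embeddings, as in the earlier steps
chunks = {
    "chunk_1": "Generative AI enables machines to create content.",
    "chunk_2": "Embeddings convert text into vectors.",
}
embeddings = {
    "chunk_1": np.array([0.8, 0.2]),
    "chunk_2": np.array([0.1, 0.9]),
}

def retrieve_best_chunk(query_embedding, embeddings, chunks):
    """Return the text and score of the chunk most similar to the query."""
    scores = {
        chunk_id: cosine_similarity(query_embedding, vector)
        for chunk_id, vector in embeddings.items()
    }
    best_id = max(scores, key=scores.get)
    return chunks[best_id], scores[best_id]

query_embedding = np.array([0.75, 0.25])
text, score = retrieve_best_chunk(query_embedding, embeddings, chunks)
print(text)  # the chunk about Generative AI wins
```

In a real system the same loop runs over thousands of chunks, usually inside a vector database rather than a Python dictionary.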

Why Document Embeddings Matter in GenAI

Document embeddings enable:

  • Enterprise search
  • Knowledge-base assistants
  • Retrieval-Augmented Generation (RAG)
  • Compliance and policy Q&A

Almost every serious GenAI product uses this pattern.

Document Embeddings vs Sentence Embeddings

Key differences:

  • Sentence embeddings focus on short text
  • Document embeddings manage long, structured text
  • Document embeddings rely heavily on chunking strategy

The embedding model may be the same, but the system design is different.
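As an illustration of that design difference, one common chunking strategy slides a fixed-size word window with overlap, so an idea that straddles a boundary still appears whole in at least one chunk. This is a sketch under simplified assumptions — the function chunk_words is hypothetical, and production pipelines typically count tokens rather than words and use dedicated text-splitting libraries:

```python
def chunk_words(text, chunk_size=40, overlap=10):
    """Split text into word windows of chunk_size, each overlapping
    the previous one by `overlap` words."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the final window reached the end of the document
    return chunks

sample = "word " * 100  # a 100-word placeholder document
print(len(chunk_words(sample)))  # → 3 overlapping chunks of 40 words each
```

Tuning chunk_size and overlap is exactly the kind of system-level decision that distinguishes document embeddings from sentence embeddings.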

Common Mistakes to Avoid

  • Embedding entire documents as one vector
  • Using chunks that are too large or too small
  • Ignoring document structure

These mistakes lead to poor retrieval quality.

Practice

What technique splits documents into smaller semantic units?



What numerical representation is used for document search?



What process finds the most relevant document chunk?



Quick Quiz

What is embedded instead of the full document?





Document embeddings are primarily used for:





Which GenAI technique relies heavily on document embeddings?





Recap: Document embeddings enable precise retrieval by representing long text as searchable semantic chunks.

Next up: We move from theory to practice — generating embeddings using OpenAI APIs.