GenAI Lesson 46 – Chunking | Dataplexa

Chunking Strategies: How to Split Documents for Effective RAG

In a RAG system, documents are not stored or retrieved as full files.

They are broken into smaller pieces called chunks.

How you create these chunks directly determines whether your system retrieves the right information or fails silently.

Why Chunking Is Required

Large documents cannot be embedded or retrieved effectively as a single unit.

If a document contains many topics, only a small part may be relevant to a user query.

Chunking solves this by isolating meaningful sections.

Think Before You Chunk

Before writing any code, ask:

  • What questions will users ask?
  • How dense is the information?
  • Do ideas span multiple paragraphs?

Chunking is a design decision, not a mechanical step.

Basic Fixed-Size Chunking

The simplest approach is splitting text by a fixed number of characters or tokens.


def fixed_chunk(text, size=500):
    """Split text into consecutive chunks of at most `size` characters."""
    chunks = []
    for i in range(0, len(text), size):
        chunks.append(text[i:i + size])
    return chunks
  

This method is easy to implement but often breaks semantic meaning.

What Goes Wrong with Fixed Chunking

Problems include:

  • Sentences cut in half
  • Concepts split across chunks
  • Loss of contextual continuity

Retrieval accuracy suffers even if embeddings are good.
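The failure is easy to see in a tiny demo. Using the same character-based slicing as fixed_chunk above (the sample text and size are placeholders chosen to force a bad split):

```python
text = ("Fixed-size splitting ignores sentence boundaries. "
        "A chunk can end in the middle of a word or idea.")

# Slice every 40 characters, exactly as fixed-size chunking does.
chunks = [text[i:i + 40] for i in range(0, len(text), 40)]

for c in chunks:
    print(repr(c))
```

The word "boundaries" itself gets cut in half across the first two chunks, so neither chunk embeds it cleanly.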

Sentence-Based Chunking

A better approach is splitting by sentences or paragraphs.

This preserves semantic boundaries.


import nltk

# One-time setup required for sent_tokenize: nltk.download("punkt")

def sentence_chunk(text, max_sentences=5):
    sentences = nltk.sent_tokenize(text)
    chunks, current = [], []

    for sentence in sentences:
        current.append(sentence)
        if len(current) >= max_sentences:
            chunks.append(" ".join(current))
            current = []

    # Keep any leftover sentences as a final, smaller chunk.
    if current:
        chunks.append(" ".join(current))

    return chunks
  

Each chunk now represents a complete thought.

Overlapping Chunks

Some ideas span chunk boundaries.

Overlapping reduces the chance that critical context is lost at a chunk boundary.


def overlapping_chunks(chunks, overlap=1):
    # Each output chunk is the current chunk plus the `overlap`
    # chunks that precede it, so boundary context is repeated.
    final_chunks = []
    for i in range(len(chunks)):
        start = max(0, i - overlap)
        combined = " ".join(chunks[start:i + 1])
        final_chunks.append(combined)
    return final_chunks
  

This improves recall at the cost of storage and compute.
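A minimal run makes the behavior concrete. The function is reproduced here so the snippet is self-contained, and the three sample chunks are placeholders:

```python
def overlapping_chunks(chunks, overlap=1):
    # Each output chunk repeats the `overlap` preceding chunks.
    final_chunks = []
    for i in range(len(chunks)):
        start = max(0, i - overlap)
        final_chunks.append(" ".join(chunks[start:i + 1]))
    return final_chunks

base = ["Chunk A.", "Chunk B.", "Chunk C."]
result = overlapping_chunks(base, overlap=1)
print(result)  # → ["Chunk A.", "Chunk A. Chunk B.", "Chunk B. Chunk C."]
```

Note that every chunk after the first is stored roughly twice, which is where the extra storage and compute cost comes from.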

Chunk Size Trade-offs

There is no perfect chunk size.

  • Small chunks → precise retrieval, less context
  • Large chunks → more context, higher noise

Most production systems tune chunk size experimentally.
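One simple way to start that experiment is to chunk the same corpus at several sizes and compare the results before embedding anything. This is only a sketch; the filler corpus and candidate sizes below are placeholders:

```python
def fixed_chunk(text, size):
    # The same fixed-size splitter introduced earlier in the lesson.
    return [text[i:i + size] for i in range(0, len(text), size)]

sample = "Lorem " * 200  # placeholder corpus: 1200 characters of filler
counts = {size: len(fixed_chunk(sample, size)) for size in (200, 500, 1000)}
print(counts)  # smaller sizes yield more, finer-grained chunks
```

In a real tuning loop you would also run representative queries against each configuration and compare retrieval quality, not just chunk counts.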

Chunking for Different Data Types

Chunking strategy depends on data:

  • Technical docs → section-based
  • FAQs → question-answer pairs
  • Logs → time-based windows

One strategy does not fit all.
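As an illustration of the section-based strategy for technical docs, here is one hedged sketch that splits a Markdown document on its headings, so each chunk is a heading plus the body under it (the sample document is a placeholder):

```python
import re

def section_chunks(markdown_text):
    # Split at the start of any Markdown heading line ('#' through '######').
    # The lookahead keeps the heading attached to its section body.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nOverview text.\n## Setup\nInstall steps.\n## Usage\nRun it.\n"
sections = section_chunks(doc)
print(sections)
```

FAQ pairs and time-windowed logs would each need their own splitter in the same spirit: find the natural boundary in the data and cut there.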

How Chunking Affects the Entire RAG Pipeline

Chunking influences:

  • Embedding quality
  • Retrieval relevance
  • Prompt size
  • Latency and cost

Bad chunking cannot be fixed downstream: changing the strategy later means re-chunking, re-embedding, and re-indexing the entire corpus.

How Learners Should Practice Chunking

To truly understand chunking:

  • Visualize chunks before embedding
  • Test the same query on different strategies
  • Inspect retrieved chunks manually

Chunking is learned by iteration, not memorization.
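For the first step, visualizing chunks can be as simple as printing an indexed preview of each one. A minimal helper (the name and format are illustrative, not a standard API):

```python
def chunk_previews(chunks, width=60):
    # One line per chunk: index, length, and the first `width` characters,
    # so splits can be eyeballed before anything is embedded.
    return [
        f"[{i:03d}] ({len(c)} chars) {c[:width]}".replace("\n", " ")
        for i, c in enumerate(chunks)
    ]

for line in chunk_previews(["First chunk of text.", "Second chunk, a bit longer."]):
    print(line)
```

Scanning this output for chunks that start or end mid-thought is often enough to catch a bad strategy early.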

Practice

What process splits documents into smaller units?



What technique preserves cross-boundary context?



Chunking quality most affects which RAG stage?



Quick Quiz

What is the biggest risk of fixed-size chunking?





Why are overlapping chunks used?





Chunking should be treated as a:





Recap: Chunking determines how knowledge is stored, retrieved, and understood in RAG systems.

Next up: Indexing strategies — how chunks are stored for fast and accurate retrieval.