GenAI Lesson 46 – Chunking | Dataplexa

Chunking Strategies: How to Split Documents for Effective RAG

In a RAG system, documents are not stored or retrieved as full files.

They are broken into smaller pieces called chunks.

How you create these chunks directly determines whether your system retrieves the right information or fails silently.

Why Chunking Is Required

Large documents cannot be embedded or retrieved effectively as a single unit.

If a document contains many topics, only a small part may be relevant to a user query.

Chunking solves this by isolating meaningful sections.

Think Before You Chunk

Before writing any code, ask:

  • What questions will users ask?
  • How dense is the information?
  • Do ideas span multiple paragraphs?

Chunking is a design decision, not a mechanical step.

Basic Fixed-Size Chunking

The simplest approach is splitting text by a fixed number of characters or tokens.


def fixed_chunk(text, size=500):
    """Split text into consecutive chunks of at most `size` characters."""
    chunks = []
    for i in range(0, len(text), size):
        chunks.append(text[i:i + size])
    return chunks
  

This method is easy to implement but often breaks semantic meaning.

What Goes Wrong with Fixed Chunking

Problems include:

  • Sentences cut in half
  • Concepts split across chunks
  • Loss of contextual continuity

Retrieval accuracy suffers even if embeddings are good.
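The failure is easy to see in a tiny demo. Using the same character-based slicing as fixed_chunk above (the sample text and size are placeholders chosen to force a bad split):

```python
text = ("Fixed-size splitting ignores sentence boundaries. "
        "A chunk can end in the middle of a word or idea.")

# Slice every 40 characters, exactly as fixed-size chunking does.
chunks = [text[i:i + 40] for i in range(0, len(text), 40)]

for c in chunks:
    print(repr(c))
```

The word "boundaries" itself gets cut in half across the first two chunks, so neither chunk embeds it cleanly.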

Sentence-Based Chunking

A better approach is splitting by sentences or paragraphs.

This preserves semantic boundaries.


import nltk

# One-time setup required for sent_tokenize: nltk.download("punkt")

def sentence_chunk(text, max_sentences=5):
    sentences = nltk.sent_tokenize(text)
    chunks, current = [], []

    for sentence in sentences:
        current.append(sentence)
        if len(current) >= max_sentences:
            chunks.append(" ".join(current))
            current = []

    # Keep any leftover sentences as a final, smaller chunk.
    if current:
        chunks.append(" ".join(current))

    return chunks
  

Each chunk now represents a complete thought.

Overlapping Chunks

Some ideas span chunk boundaries.

Overlapping reduces the chance that critical context is lost at a chunk boundary.


def overlapping_chunks(chunks, overlap=1):
    # Each output chunk is the current chunk plus the `overlap`
    # chunks that precede it, so boundary context is repeated.
    final_chunks = []
    for i in range(len(chunks)):
        start = max(0, i - overlap)
        combined = " ".join(chunks[start:i + 1])
        final_chunks.append(combined)
    return final_chunks
  

This improves recall at the cost of storage and compute.
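A minimal run makes the behavior concrete. The function is reproduced here so the snippet is self-contained, and the three sample chunks are placeholders:

```python
def overlapping_chunks(chunks, overlap=1):
    # Each output chunk repeats the `overlap` preceding chunks.
    final_chunks = []
    for i in range(len(chunks)):
        start = max(0, i - overlap)
        final_chunks.append(" ".join(chunks[start:i + 1]))
    return final_chunks

base = ["Chunk A.", "Chunk B.", "Chunk C."]
result = overlapping_chunks(base, overlap=1)
print(result)  # → ["Chunk A.", "Chunk A. Chunk B.", "Chunk B. Chunk C."]
```

Note that every chunk after the first is stored roughly twice, which is where the extra storage and compute cost comes from.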

Chunk Size Trade-offs

There is no perfect chunk size.

  • Small chunks → precise retrieval, less context
  • Large chunks → more context, higher noise

Most production systems tune chunk size experimentally.
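One simple way to start that experiment is to chunk the same corpus at several sizes and compare the results before embedding anything. This is only a sketch; the filler corpus and candidate sizes below are placeholders:

```python
def fixed_chunk(text, size):
    # The same fixed-size splitter introduced earlier in the lesson.
    return [text[i:i + size] for i in range(0, len(text), size)]

sample = "Lorem " * 200  # placeholder corpus: 1200 characters of filler
counts = {size: len(fixed_chunk(sample, size)) for size in (200, 500, 1000)}
print(counts)  # smaller sizes yield more, finer-grained chunks
```

In a real tuning loop you would also run representative queries against each configuration and compare retrieval quality, not just chunk counts.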

Chunking for Different Data Types

Chunking strategy depends on data:

  • Technical docs → section-based
  • FAQs → question-answer pairs
  • Logs → time-based windows

One strategy does not fit all.
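As an illustration of the section-based strategy for technical docs, here is one hedged sketch that splits a Markdown document on its headings, so each chunk is a heading plus the body under it (the sample document is a placeholder):

```python
import re

def section_chunks(markdown_text):
    # Split at the start of any Markdown heading line ('#' through '######').
    # The lookahead keeps the heading attached to its section body.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nOverview text.\n## Setup\nInstall steps.\n## Usage\nRun it.\n"
sections = section_chunks(doc)
print(sections)
```

FAQ pairs and time-windowed logs would each need their own splitter in the same spirit: find the natural boundary in the data and cut there.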

How Chunking Affects the Entire RAG Pipeline

Chunking influences:

  • Embedding quality
  • Retrieval relevance
  • Prompt size
  • Latency and cost

Bad chunking cannot be fixed downstream: changing the strategy later means re-chunking, re-embedding, and re-indexing the entire corpus.

How Learners Should Practice Chunking

To truly understand chunking:

  • Visualize chunks before embedding
  • Test the same query on different strategies
  • Inspect retrieved chunks manually

Chunking is learned by iteration, not memorization.
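For the first step, visualizing chunks can be as simple as printing an indexed preview of each one. A minimal helper (the name and format are illustrative, not a standard API):

```python
def chunk_previews(chunks, width=60):
    # One line per chunk: index, length, and the first `width` characters,
    # so splits can be eyeballed before anything is embedded.
    return [
        f"[{i:03d}] ({len(c)} chars) {c[:width]}".replace("\n", " ")
        for i, c in enumerate(chunks)
    ]

for line in chunk_previews(["First chunk of text.", "Second chunk, a bit longer."]):
    print(line)
```

Scanning this output for chunks that start or end mid-thought is often enough to catch a bad strategy early.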

Practice

What process splits documents into smaller units?



What technique preserves cross-boundary context?



Chunking quality most affects which RAG stage?



Quick Quiz

What is the biggest risk of fixed-size chunking?





Why are overlapping chunks used?





Chunking should be treated as a:





Recap: Chunking determines how knowledge is stored, retrieved, and understood in RAG systems.

Next up: Indexing strategies — how chunks are stored for fast and accurate retrieval.