GenAI Lesson 17 – ChromaDB | Dataplexa

ChromaDB

In the previous lesson, you learned what vector databases are and why they are essential for modern GenAI systems.

Now we move from theory to a real, developer-friendly tool: ChromaDB.

ChromaDB is an open-source vector database designed to be simple, lightweight, and easy to integrate into GenAI projects.

Why ChromaDB Exists

Many developers do not want to start with complex, cloud-heavy infrastructure.

They want:

Local development
Simple APIs
Fast semantic search
Easy experimentation with embeddings

ChromaDB was created to fill this gap.

When Should You Use ChromaDB?

Before writing code, ask yourself:

Am I building or testing a GenAI system that needs semantic search?

If the answer is yes, ChromaDB is often a good first choice.

It is commonly used for:

RAG prototypes
Local chatbots
Document Q&A systems
Learning and experimentation

High-Level Workflow

Most ChromaDB projects follow the same logical steps:

Create a collection
Add documents with embeddings
Store metadata
Query the collection
Retrieve relevant results

Understanding this flow is more important than memorizing the API.

Installing ChromaDB

Before coding, ensure ChromaDB is installed.


pip install chromadb

This installs the core ChromaDB library for local development.

Creating a ChromaDB Client

The first step in any ChromaDB project is creating a client.

Think of the client as the entry point to your vector database.


import chromadb

# In-memory client: data lives only for the lifetime of this process.
client = chromadb.Client()

At this stage:

No data is stored yet
No embeddings exist
You are just initializing the system

Creating a Collection

A collection is where vectors are stored.

You can think of it like a table, but optimized for embeddings.


# Creates a new, empty collection; raises an error if the name is already taken.
collection = client.create_collection(
    name="documents"
)

This collection will later store:

Text chunks
Embeddings
Metadata

Adding Documents

Before adding data, understand the goal:

We want ChromaDB to retrieve documents by meaning.

That means each document must be stored with its text content and a unique identifier.


collection.add(
    documents=[
        "Generative AI creates new content",
        "Vector databases store embeddings",
        "ChromaDB enables semantic search"
    ],
    ids=["doc1", "doc2", "doc3"]
)

What happens internally:

Text is embedded with the default embedding function, because no embeddings were supplied
Vectors are indexed
Documents become searchable

Querying the Collection

Now comes the most important part: retrieving relevant information.

Before writing the query, think:

What is the user trying to find?


results = collection.query(
    query_texts=["How does semantic search work?"],
    n_results=2
)

print(results)

ChromaDB performs:

Query embedding
Similarity search
Result ranking

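Abridged output (the full result dictionary also includes keys such as 'ids', 'distances', and 'metadatas'):

{'documents': [['ChromaDB enables semantic search', 'Vector databases store embeddings']]}

The results are ordered from most to least similar, and the outer list holds one inner list per query text.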
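The three steps ChromaDB performs for a query (embedding, similarity search, ranking) can be imitated in a few lines of plain Python. This is a conceptual sketch only: the hypothetical `toy_embed` function counts words from a tiny hand-picked vocabulary, whereas ChromaDB uses a learned embedding model and a nearest-neighbor index.

```python
import math

# Tiny hand-picked vocabulary (an assumption for this sketch).
VOCAB = ["semantic", "search", "vector", "embeddings", "content"]


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts vocabulary words."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


documents = [
    "Generative AI creates new content",
    "Vector databases store embeddings",
    "ChromaDB enables semantic search",
]

# Step 1: embed the query. Steps 2 and 3: score every document, then rank.
query_vector = toy_embed("How does semantic search work?")
ranked = sorted(
    documents,
    key=lambda doc: cosine_similarity(query_vector, toy_embed(doc)),
    reverse=True,
)
print(ranked[0])  # ChromaDB enables semantic search
```

A word-count vector only matches exact words; a real embedding model also places related wording close together, which is what makes the search semantic.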

Why This Matters for Jobs

In real GenAI roles, you will:

Store company knowledge
Retrieve relevant context
Feed results into LLMs

ChromaDB often serves as the retrieval layer when these systems are first prototyped.
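Feeding results into an LLM usually means pasting the retrieved text into a prompt. Here is a minimal sketch, assuming a `results` dictionary shaped like the query output shown earlier. The `build_prompt` helper and the prompt wording are hypothetical, and the actual LLM call is left out.

```python
def build_prompt(question, results):
    """Combine retrieved documents and the user question into one prompt string."""
    # results["documents"] holds one list of documents per query text.
    context_docs = results["documents"][0]
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-in for the dictionary returned by collection.query(...).
results = {
    "documents": [[
        "ChromaDB enables semantic search",
        "Vector databases store embeddings",
    ]]
}

prompt = build_prompt("How does semantic search work?", results)
print(prompt)
```

The resulting string would then be sent to whichever LLM the project uses.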

Common Beginner Mistakes

Not chunking documents properly
Storing too much text in one document
Ignoring metadata

These mistakes reduce retrieval quality.
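Chunking is the mistake that is easiest to address in code. The sketch below splits text into fixed-size word windows with a small overlap. The `chunk_text` helper and the sizes are hypothetical choices for illustration, and real projects often split on sentences or headings instead.

```python
def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into chunks of chunk_size words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = (
    "ChromaDB is an open-source vector database designed to be simple, "
    "lightweight, and easy to integrate into GenAI projects."
)

chunks = chunk_text(text)
ids = [f"chunk{i}" for i in range(len(chunks))]
for chunk_id, chunk in zip(ids, chunks):
    print(chunk_id, "->", chunk)
```

The `chunks` and `ids` lists have the shape that `collection.add(documents=..., ids=...)` expects, so each chunk becomes its own searchable document.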

Practice

What is the main storage unit in ChromaDB?

What does ChromaDB store internally?

Which operation retrieves similar documents?

Quick Quiz

ChromaDB primarily supports:

Semantic search
Keyword search
Sorting

ChromaDB is best suited for:

Local development
Model training
UI rendering

ChromaDB is commonly used in:

RAG pipelines
CSS frameworks
Logging systems

Recap: ChromaDB is a practical, developer-friendly vector database used to build semantic search and RAG systems.

Next up: Pinecone, a managed, production-grade vector database used at scale.
