GenAI Lesson 17 – ChromaDB | Dataplexa

ChromaDB

In the previous lesson, you learned what vector databases are and why they are essential for modern GenAI systems.

Now we move from theory to a real, developer-friendly tool: ChromaDB.

ChromaDB is an open-source vector database designed to be simple, lightweight, and easy to integrate into GenAI projects.

Why ChromaDB Exists

Many developers do not want to start with complex, cloud-heavy infrastructure.

They want:

  • Local development
  • Simple APIs
  • Fast semantic search
  • Easy experimentation with embeddings

ChromaDB was created to fill this gap.

When Should You Use ChromaDB?

Before writing code, ask yourself:

Am I building or testing a GenAI system that needs semantic search?

If the answer is yes, ChromaDB is often a good first choice.

It is commonly used for:

  • RAG prototypes
  • Local chatbots
  • Document Q&A systems
  • Learning and experimentation

High-Level Workflow

Every ChromaDB project follows the same logical steps:

  • Create a collection
  • Add documents with embeddings
  • Store metadata
  • Query the collection
  • Retrieve relevant results

Understanding this flow is more important than memorizing the API.

Installing ChromaDB

Before coding, ensure ChromaDB is installed.


pip install chromadb
  

This installs the core ChromaDB library for local development.

Creating a ChromaDB Client

The first step in any ChromaDB project is creating a client.

Think of the client as the entry point to your vector database.


import chromadb

client = chromadb.Client()
  

At this stage:

  • No data is stored yet
  • No embeddings exist
  • You are just initializing the system

Creating a Collection

A collection is where vectors are stored.

You can think of it like a table, but optimized for embeddings.


collection = client.create_collection(
    name="documents"
)
  

This collection will later store:

  • Text chunks
  • Embeddings
  • Metadata

Adding Documents

Before adding data, understand the goal:

We want ChromaDB to retrieve documents by meaning.

That means each document must be stored with its text content and a unique identifier.


collection.add(
    documents=[
        "Generative AI creates new content",
        "Vector databases store embeddings",
        "ChromaDB enables semantic search"
    ],
    ids=["doc1", "doc2", "doc3"]
)
  

What happens internally:

  • Text is embedded using the collection's embedding function (a default model is used when you do not configure one)
  • Vectors are indexed
  • Documents become searchable

Querying the Collection

Now comes the most important part: retrieving relevant information.

Before writing the query, think:

What is the user trying to find?


results = collection.query(
    query_texts=["How does semantic search work?"],
    n_results=2
)

print(results)
  

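ChromaDB performs:

  • Query embedding
  • Similarity search
  • Result ranking

Abbreviated output (the full dictionary also contains keys such as ids and distances):

{'documents': [['ChromaDB enables semantic search', 'Vector databases store embeddings']]}

The returned documents are the ones closest in meaning to the query, ordered from most to least similar.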

Why This Matters for Jobs

In real GenAI roles, you will:

  • Store company knowledge
  • Retrieve relevant context
  • Feed results into LLMs

A vector database such as ChromaDB often serves as the retrieval layer of these systems.
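
To make the last step concrete, here is a rough sketch of how retrieved documents might be turned into a prompt for an LLM. The `build_prompt` helper and the prompt wording are hypothetical, not part of ChromaDB, and the `retrieved` list stands in for `results["documents"][0]` from a real query.

```python
def build_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Combine retrieved text chunks and a user question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in for results["documents"][0] from a ChromaDB query.
retrieved = [
    "ChromaDB enables semantic search",
    "Vector databases store embeddings",
]
prompt = build_prompt("How does semantic search work?", retrieved)
print(prompt)
```

The resulting string is what you would send to whichever LLM API you use; that call is left out here because it depends on the provider.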

Common Beginner Mistakes

  • Not chunking documents properly
  • Storing too much text in one document
  • Ignoring metadata

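These mistakes reduce retrieval quality: oversized chunks blend several topics into one embedding, and missing metadata leaves you unable to filter results by source or type.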
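
As an illustration of the first two points, here is a minimal sketch of word-based chunking with overlap, together with per-chunk metadata. The `chunk_text` helper, the chunk sizes, and the `guide.txt` source name are assumptions made for the example. Real projects often chunk by sentences, paragraphs, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Hypothetical long document: 100 numbered words.
long_text = " ".join(f"word{i}" for i in range(100))
chunks = chunk_text(long_text, chunk_size=40, overlap=10)

# Parallel lists in the shape that collection.add(...) expects.
ids = [f"guide-chunk-{i}" for i in range(len(chunks))]
metadatas = [{"source": "guide.txt", "chunk_index": i} for i in range(len(chunks))]

print(len(chunks), ids[0], metadatas[-1])
```

With 100 words, a chunk size of 40, and an overlap of 10, this yields three chunks. You could then pass `chunks`, `ids`, and `metadatas` to `collection.add(documents=chunks, ids=ids, metadatas=metadatas)`.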

Practice

What is the main storage unit in ChromaDB?

What does ChromaDB store internally?

Which operation retrieves similar documents?

Quick Quiz

What does ChromaDB primarily support?

What is ChromaDB best suited for?

Where is ChromaDB commonly used?

Recap: ChromaDB is a practical, developer-friendly vector database used to build semantic search and RAG systems.

Next up: Pinecone — a managed, production-grade vector database used at scale.