NoSQL Lesson 16 – CouchDB Overview | Dataplexa
NoSQL Database Types · Lesson 16

CouchDB Overview

In 2005, Damien Katz was frustrated. He wanted a database that worked like the web — every operation a simple HTTP call, no special drivers, no binary protocol. A database that could survive network partitions gracefully, replicate across machines automatically, and work offline first. He built CouchDB in his spare time, released it as open source, and Apache adopted it in 2008. Today it powers offline-capable medical systems, field survey tools, and any application where the internet is a luxury rather than a guarantee.

CouchDB's Core Philosophy — HTTP Is the API

Every other database we've covered requires a dedicated driver or SDK to connect. CouchDB is different: its entire API is HTTP. Every document is a URL. Every operation is a standard HTTP verb. You can interact with CouchDB using curl, a browser, or any HTTP client — no special driver needed.

Every CouchDB operation = HTTP verb

GET Read a document
PUT Create or update a document
POST Create with auto-generated ID
DELETE Delete a document

URL structure

http://localhost:5984/
http://localhost:5984/{db}
http://localhost:5984/{db}/{doc_id}
http://localhost:5984/{db}/_design/{ddoc}/_view/{view}

Server → Database → Document → View. Every resource has a URL. REST from the ground up.
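The URL hierarchy above can be assembled mechanically. The following is a minimal sketch, assuming the local address used in this lesson; the helper names (db_url, doc_url, view_url) are illustrative and not part of any CouchDB client library.

```python
from urllib.parse import quote

BASE = "http://localhost:5984"  # assumed local CouchDB address from this lesson


def db_url(db: str) -> str:
    """URL of a database."""
    return f"{BASE}/{quote(db, safe='')}"


def doc_url(db: str, doc_id: str) -> str:
    """URL of a single document inside a database."""
    return f"{db_url(db)}/{quote(doc_id, safe='')}"


def view_url(db: str, design_doc: str, view: str) -> str:
    """URL of a view stored in the design document _design/<design_doc>."""
    return (
        f"{db_url(db)}/_design/{quote(design_doc, safe='')}"
        f"/_view/{quote(view, safe='')}"
    )


print(doc_url("patient_surveys", "patient_001"))
print(view_url("patient_surveys", "surveys", "by_village"))
```

Percent-encoding each path segment keeps a document ID that contains a slash from being read as extra path levels.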

MVCC — How CouchDB Handles Concurrent Writes

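CouchDB uses MVCC (Multi-Version Concurrency Control), a fundamentally different approach to managing concurrent writes than locking. Every document carries a revision number, and an update never overwrites in place: it creates a new revision. Older revisions are not a permanent version history, though. Their bodies are discarded when the database is compacted, and only the revision IDs are kept.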

How MVCC works — step by step:

Write 1
Create document. CouchDB assigns _rev: "1-abc123". The 1- is the generation number. abc123 stands for an MD5-based hash that CouchDB derives from the document's content and revision metadata.
Write 2
Update the document. Must provide _rev: "1-abc123". CouchDB creates a new version: _rev: "2-def456". The old version stays on disk only until the database is compacted.
Conflict
Two clients both read _rev: "1-abc123" and both try to update. The first one wins — CouchDB accepts it and creates _rev: "2-def456". The second one fails with 409 Conflict because their _rev is now stale.
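The three steps above can be modelled in a few lines. This is a toy in-memory sketch, not CouchDB's implementation: TinyStore and Conflict are hypothetical names, and the hash here covers only the JSON body, which is a simplification.

```python
import hashlib
import json


class Conflict(Exception):
    """Stands in for CouchDB's HTTP 409 response."""


class TinyStore:
    """Toy model of revision checking; not CouchDB's real algorithm."""

    def __init__(self):
        self._docs = {}  # doc_id -> (current_rev, body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"stale or missing rev for {doc_id}")
        generation = int(current_rev.split("-")[0]) + 1 if current_rev else 1
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode()).hexdigest()
        new_rev = f"{generation}-{digest}"
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = TinyStore()
rev1 = store.put("patient_001", {"diagnosis": "malaria_suspected"})

# Two clients both hold rev1. The first update succeeds...
rev2 = store.put("patient_001", {"diagnosis": "malaria_confirmed"}, rev=rev1)

# ...and the second is rejected because rev1 is no longer current.
try:
    store.put("patient_001", {"diagnosis": "typhoid"}, rev=rev1)
    second_write = "accepted"
except Conflict:
    second_write = "rejected"

print(rev1.split("-")[0], rev2.split("-")[0], second_write)
```

No lock is ever taken: the store simply compares the revision the client sends with the one it currently holds.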

Your First CouchDB Operations — Pure HTTP

The scenario: You're building a field survey app for an NGO. Survey workers collect patient data with no internet connectivity. Here's how to create a database, add documents, and read them — all with plain HTTP using curl:

# Create a database
curl -X PUT http://admin:password@localhost:5984/patient_surveys

# Create a document with a specific ID
curl -X PUT http://admin:password@localhost:5984/patient_surveys/patient_001 \
  -H "Content-Type: application/json" \
  -d '{
    "name":        "Amara Diallo",
    "age":         34,
    "village":     "Koumbia",
    "diagnosis":   "malaria_suspected",
    "collected_at": "2024-01-15T09:22:00Z",
    "worker_id":   "w_44"
  }'

-- Create database:
{"ok":true}

-- Create document:
{
  "ok":  true,
  "id":  "patient_001",
  "rev": "1-8a5c3f2b9d4e7a1f0c6b8d3e5f2a9c7b"
}

-- CouchDB returns the _rev automatically
-- Save this rev, you'll need it for any future update

PUT /patient_surveys/patient_001

The document ID is part of the URL. PUT with a specific ID creates or replaces that document. The database (patient_surveys) and document ID (patient_001) are both path segments — pure REST. No special query language needed to create a document.

"rev": "1-8a5c3f..."

CouchDB's revision token. The 1- prefix is the generation number: this is the first revision. The hex string after the dash is an MD5-based hash derived from the document's content and revision metadata. You must store the full token and send it with every subsequent update to prove you're working from the latest version.
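Splitting a revision token into its two parts is straightforward. A small sketch follows; parse_rev is a hypothetical helper name, not a CouchDB or library function.

```python
from typing import Tuple


def parse_rev(rev: str) -> Tuple[int, str]:
    """Split a token such as '1-8a5c...' into (generation, hash)."""
    generation, _, digest = rev.partition("-")
    return int(generation), digest


generation, digest = parse_rev("1-8a5c3f2b9d4e7a1f0c6b8d3e5f2a9c7b")
print(generation, digest)
```

The generation is useful for display and debugging only. An update is accepted when the whole token matches the current revision, so clients should send the token back unchanged rather than compare generations.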

# Read the document back
curl http://admin:password@localhost:5984/patient_surveys/patient_001

# Update the document — MUST include current _rev
curl -X PUT http://admin:password@localhost:5984/patient_surveys/patient_001 \
  -H "Content-Type: application/json" \
  -d '{
    "_rev":        "1-8a5c3f2b9d4e7a1f0c6b8d3e5f2a9c7b",
    "name":        "Amara Diallo",
    "age":         34,
    "village":     "Koumbia",
    "diagnosis":   "malaria_confirmed",
    "treatment":   "artemisinin",
    "collected_at": "2024-01-15T09:22:00Z",
    "updated_at":  "2024-01-15T14:30:00Z",
    "worker_id":   "w_44"
  }'

-- Read response:
{
  "_id":         "patient_001",
  "_rev":        "1-8a5c3f2b9d4e7a1f0c6b8d3e5f2a9c7b",
  "name":        "Amara Diallo",
  "diagnosis":   "malaria_suspected",
  ...all other fields...
}

-- Update response:
{
  "ok":  true,
  "id":  "patient_001",
  "rev": "2-a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"
}
-- New revision: 2- (second generation)

Update requires _rev in the body

Unlike MongoDB's $set which patches specific fields, CouchDB updates replace the entire document. You GET the document, modify it in memory, then PUT the whole modified document back — including _rev. This is intentional: CouchDB stores complete document snapshots, not field-level diffs.
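The same GET, modify, PUT cycle can be sketched with Python's standard library. The helper below only builds the request and does not send it; build_update_request is a hypothetical name, and the address is the local one assumed throughout this lesson.

```python
import json
import urllib.request


def build_update_request(base_url: str, db: str, doc: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT that replaces the whole document."""
    if "_rev" not in doc:
        raise ValueError("an update must carry the current _rev")
    body = {k: v for k, v in doc.items() if k != "_id"}  # _id travels in the URL
    return urllib.request.Request(
        url=f"{base_url}/{db}/{doc['_id']}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Document as a GET would return it, then modified in memory.
doc = {
    "_id": "patient_001",
    "_rev": "1-8a5c3f2b9d4e7a1f0c6b8d3e5f2a9c7b",
    "name": "Amara Diallo",
    "diagnosis": "malaria_suspected",
}
doc["diagnosis"] = "malaria_confirmed"

req = build_update_request("http://localhost:5984", "patient_surveys", doc)
payload = json.loads(req.data)
print(req.get_method(), req.full_url)
```

Every field is sent again, including the unchanged name, because the PUT replaces the stored document rather than patching it. Passing the request to urllib.request.urlopen would send it, given a running server and credentials.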

Wrong _rev → 409 Conflict

If you try to update with an outdated _rev — because someone else updated the document between your GET and your PUT — CouchDB returns HTTP 409 Conflict. Your application must GET the latest version, merge the changes, and retry. This is MVCC conflict detection in action.
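The fetch, change, retry loop described above might look like the sketch below. It is deliberately independent of any HTTP client: fetch and save are injected callables, Conflict is a hypothetical stand-in for a 409 response, and the in-memory server exists only to exercise the loop.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 from CouchDB."""


def update_with_retry(fetch, save, mutate, max_attempts=3):
    """Fetch the latest document, apply mutate, save; retry on Conflict."""
    for _ in range(max_attempts):
        doc = fetch()
        mutate(doc)
        try:
            return save(doc)
        except Conflict:
            continue  # someone else won the race; refetch and try again
    raise Conflict(f"gave up after {max_attempts} attempts")


# Hypothetical in-memory stand-in for the server.
server = {"_id": "patient_001", "_rev": "1-aaa", "diagnosis": "malaria_suspected"}
attempts = []


def fetch():
    return dict(server)


def save(doc):
    attempts.append(doc["_rev"])
    if len(attempts) == 1:
        server["_rev"] = "2-bbb"  # simulate another client updating first
    if doc["_rev"] != server["_rev"]:
        raise Conflict("stale _rev")
    generation = int(doc["_rev"].split("-")[0]) + 1
    doc["_rev"] = f"{generation}-ccc"
    server.clear()
    server.update(doc)
    return doc["_rev"]


def confirm(doc):
    doc["diagnosis"] = "malaria_confirmed"


new_rev = update_with_retry(fetch, save, confirm)
print(new_rev, attempts)
```

Re-applying the change to a freshly fetched copy, rather than resending the stale one, is what makes the retry safe. A real merge step may need to inspect what the other writer changed.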

Using CouchDB with Python

For production applications, the couchdb Python library wraps the HTTP API into a clean interface while preserving the document-centric model:

import couchdb

# Connect to CouchDB
couch = couchdb.Server('http://admin:password@localhost:5984/')

# Get the database, creating it if it does not exist yet
db_name = 'patient_surveys'
db = couch[db_name] if db_name in couch else couch.create(db_name)

# Create a document — CouchDB assigns _id and _rev automatically
doc_id, doc_rev = db.save({
    'name':       'Fatima Ndiaye',
    'age':        28,
    'village':    'Tambacounda',
    'diagnosis':  'anemia',
    'worker_id':  'w_12'
})
print(f"Saved: {doc_id} at revision {doc_rev}")

Saved: 8f3a2c1b9d4e5f0a7b6c8d3e at revision 1-4f2a8b1c9d3e5f0a7b2c4d6e

# Read and update safely: get fresh _rev first
doc = db[doc_id]                          # fetch with current _rev included
doc['diagnosis'] = 'anemia_confirmed'     # modify in memory
doc['treatment'] = 'iron_supplements'

db.save(doc)                              # save back — _rev is already in doc
print(f"Updated to revision: {doc['_rev']}")

Updated to revision: 2-9c1d3e5f7a2b4c6d8e0f1a3b5c7d9e2f

db[doc_id] returns the document as a dict-like object including _id and _rev. Modify it in memory, then call db.save(doc). The library sends the full document with the existing _rev back to CouchDB. If _rev matches, the update succeeds. If another client updated in between, CouchDB returns 409, which the library raises as couchdb.http.ResourceConflict; your code should catch it, refetch, and retry.

CouchDB's Killer Feature — Multi-Master Replication

This is where CouchDB genuinely has no equal. CouchDB supports multi-master replication: every node can accept reads AND writes simultaneously, and once replication is configured, changes sync between nodes whenever connectivity is available.

The offline-first workflow:

1. Worker offline: Tablet has local CouchDB. Worker reads and writes normally, no internet needed.
2. Data collected: 20 patient records written to local CouchDB. Fully functional database on device.
3. WiFi restored: CouchDB replication triggers automatically. Local docs sync to central server.
4. Conflict resolved: If two workers updated the same patient, CouchDB flags the conflict. App resolves it.

# Set up replication — from local tablet to central server
# This runs on the local CouchDB instance
curl -X POST http://admin:password@localhost:5984/_replicator \
  -H "Content-Type: application/json" \
  -d '{
    "_id":        "sync_to_central",
    "source":     "http://localhost:5984/patient_surveys",
    "target":     "https://central.ngo.org/couchdb/patient_surveys",
    "continuous": true,
    "filter":     "surveys/by_worker",
    "query_params": { "worker_id": "w_44" }
  }'

"continuous": true

Starts a long-running replication that watches for new changes and syncs them as they happen. While the device is offline, new writes simply accumulate in the local database and the replicator keeps retrying. Once connectivity is restored, replication resumes from its last checkpoint: no data is lost and no manual intervention is needed.
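Resuming from a checkpoint can be illustrated with a toy model. This is a sketch under loud assumptions: replicate is a hypothetical function, sequence numbers are plain integers (real CouchDB sequence identifiers are opaque values), and every change is kept in a simple list.

```python
def replicate(changes, target, checkpoint):
    """Copy changes with seq > checkpoint into target.

    Returns the new checkpoint and the number of changes sent.
    """
    sent = 0
    for seq, doc_id, doc in changes:
        if seq <= checkpoint:
            continue  # already transferred in an earlier run
        target[doc_id] = doc
        checkpoint = seq
        sent += 1
    return checkpoint, sent


local_changes = [
    (1, "patient_001", {"diagnosis": "malaria_suspected"}),
    (2, "patient_002", {"diagnosis": "anemia"}),
]
central = {}

checkpoint, sent_first = replicate(local_changes, central, checkpoint=0)

# Device goes offline; the worker keeps writing locally.
local_changes.append((3, "patient_003", {"diagnosis": "typhoid"}))
local_changes.append((4, "patient_001", {"diagnosis": "malaria_confirmed"}))

# Connectivity returns: only changes after the checkpoint are sent.
checkpoint, sent_second = replicate(local_changes, central, checkpoint)
print(checkpoint, sent_first, sent_second, len(central))
```

Because the checkpoint records how far the previous run got, the second run sends two changes instead of all four.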

"filter": "surveys/by_worker"

Filtered replication — only sync documents matching a condition. Worker 44's tablet only syncs their own records, not every patient from every worker. This reduces bandwidth and keeps local databases small. Filters are JavaScript functions stored in design documents.
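The lesson does not show the filter itself, so here is one plausible shape for it. The JavaScript body is an assumption about what surveys/by_worker could contain; the Python function mirrors its logic purely for illustration and is never run by CouchDB.

```python
import json

# Assumed JavaScript source for the filter, stored as a string in the design document.
BY_WORKER_JS = "function(doc, req) { return doc.worker_id === req.query.worker_id; }"

design_doc = {
    "_id": "_design/surveys",
    "filters": {"by_worker": BY_WORKER_JS},
}


def by_worker(doc: dict, query: dict) -> bool:
    """Python mirror of the JavaScript filter, for illustration only."""
    return doc.get("worker_id") == query.get("worker_id")


docs = [
    {"_id": "patient_001", "worker_id": "w_44"},
    {"_id": "patient_002", "worker_id": "w_12"},
]
selected = [d["_id"] for d in docs if by_worker(d, {"worker_id": "w_44"})]

print(json.dumps(design_doc, indent=2))
print(selected)
```

The name surveys/by_worker in the replication document reads as design document surveys, filter by_worker, and the query_params object arrives in the filter as req.query.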

Conflict Detection and Resolution

The scenario: Two field workers both edit the same patient record while offline. When their devices sync, CouchDB detects the conflict. Here's how to find and resolve it:

# Find all documents that have conflicts
conflicts = db.view('_all_docs', conflicts=True, include_docs=True)

for row in conflicts:
    doc = row.doc
    if doc.get('_conflicts'):
        print(f"Conflict in: {doc['_id']}")
        print(f"Conflicting revisions: {doc['_conflicts']}")

Conflict in: patient_001
Conflicting revisions: ['2-a1b2c3d4e5f6...', '2-f6e5d4c3b2a1...']

-- Both are generation 2 (both workers updated from rev 1)
-- CouchDB stored both versions
-- One is the "winner" (shown by default GET)
-- The other is the "loser" (stored, flagged as conflict)

# Fetch both conflicting versions and merge them
# A plain GET omits _conflicts, so request it explicitly
winning_doc = db.get('patient_001', conflicts=True)       # the winner by default
losing_rev  = winning_doc['_conflicts'][0]

# Fetch the losing revision
losing_doc = db.get('patient_001', rev=losing_rev)

# Merge: keep the winner's fields, combine treatments
merged = dict(winning_doc)
treatments = [winning_doc.get('treatment'), losing_doc.get('treatment')]
merged['treatment'] = ' + '.join(t for t in treatments if t)
merged.pop('_conflicts', None)                            # read-only metadata, not document data

db.save(merged)                                           # save merged version on top of the winner

# Delete the losing revision; this is what actually clears the conflict
db.delete({'_id': 'patient_001', '_rev': losing_rev})

CouchDB never loses data during conflicts. Both versions are stored. Your application decides the merge strategy — last write wins, field-level merge, user prompt, or any custom logic. CouchDB guarantees you always have access to all conflicting versions to make that decision. This is fundamentally different from MongoDB where one write silently overwrites another.
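A field-level merge is one of the strategies listed above. The sketch below is a hypothetical policy and not something CouchDB provides: it keeps the winner's values, fills in fields that only the loser has, and joins distinct treatments.

```python
def merge_revisions(winner: dict, loser: dict) -> dict:
    """Keep the winner, fill gaps from the loser, join distinct treatments."""
    merged = {k: v for k, v in winner.items() if k != "_conflicts"}
    for key, value in loser.items():
        if key.startswith("_"):
            continue  # never copy _id/_rev metadata from the losing revision
        merged.setdefault(key, value)
    treatments = [t for t in (winner.get("treatment"), loser.get("treatment")) if t]
    if treatments:
        # dict.fromkeys drops duplicates while keeping order
        merged["treatment"] = " + ".join(dict.fromkeys(treatments))
    return merged


winner = {
    "_id": "patient_001",
    "_rev": "2-a1b2",
    "_conflicts": ["2-f6e5"],
    "diagnosis": "malaria_confirmed",
    "treatment": "artemisinin",
}
loser = {
    "_id": "patient_001",
    "_rev": "2-f6e5",
    "diagnosis": "malaria_suspected",
    "treatment": "paracetamol",
    "notes": "fever 39C",
}

merged = merge_revisions(winner, loser)
print(merged)
```

The merged document keeps the winner's _rev, so saving it creates the next revision on the winning branch. The losing revision still has to be deleted separately for the conflict to clear.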

Views — CouchDB's Query System

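CouchDB's ad-hoc query API, Mango (the _find endpoint), is far more limited than MongoDB's query language. The primary query mechanism is the view: a JavaScript MapReduce function stored in the database that pre-computes query results. View indexes are updated incrementally, so only documents that changed since the last update are reprocessed.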

The scenario: You need to query all surveys by village, and count how many surveys each village has. Here's how you define and query a view:

// Design document — stores views in the database itself
// PUT to /_design/surveys
{
  "_id": "_design/surveys",
  "views": {
    "by_village": {
      "map": "function(doc) { if (doc.village) { emit(doc.village, 1); } }",
      "reduce": "_count"
    }
  }
}

emit(doc.village, 1)

The map function runs on every document. For each document, it emits zero or more key-value pairs into an index. Here: emit the village name as the key, and 1 as the value. The resulting index has one entry per document, sorted by village name alphabetically.

"reduce": "_count"

_count is a built-in reduce function that counts the number of map outputs per key. When you query with group=true, you get the count per village. Other built-ins: _sum (sums the values) and _stats (sum, count, min, max, and sum of squares).
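What the by_village view computes with group=true can be imitated in plain Python. This is only a model of the result, not of how CouchDB stores the index; map_by_village and query_grouped are hypothetical names, and Python's default string ordering stands in for CouchDB's own collation rules.

```python
from itertools import groupby


def map_by_village(doc):
    """Python mirror of the map function: yield (key, value) pairs."""
    if doc.get("village"):
        yield doc["village"], 1


def query_grouped(docs):
    """Imitate the _count reduce with group=true: one row per distinct key."""
    rows = sorted(pair for doc in docs for pair in map_by_village(doc))
    return [
        {"key": key, "value": sum(1 for _ in pairs)}
        for key, pairs in groupby(rows, key=lambda row: row[0])
    ]


docs = [
    {"_id": "patient_001", "village": "Koumbia"},
    {"_id": "patient_002", "village": "Tambacounda"},
    {"_id": "patient_003", "village": "Koumbia"},
    {"_id": "patient_004"},  # no village: the map emits nothing for it
]

result = query_grouped(docs)
print(result)
```

The document without a village never reaches the index, which is why map functions usually guard each emit with a field check.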

# Query the view — count surveys per village
curl "http://admin:password@localhost:5984/patient_surveys/_design/surveys/_view/by_village?group=true"

# Get all surveys from a specific village
curl "http://admin:password@localhost:5984/patient_surveys/_design/surveys/_view/by_village?key=%22Koumbia%22&reduce=false&include_docs=true"

-- group=true: count per village
{
  "rows": [
    { "key": "Koumbia",       "value": 47 },
    { "key": "Tambacounda",   "value": 31 },
    { "key": "Ziguinchor",    "value": 28 }
  ]
}

-- key="Koumbia" with include_docs: all Koumbia surveys
{
  "rows": [
    { "key": "Koumbia", "id": "patient_001", "value": 1, "doc": { ...full document... } },
    { "key": "Koumbia", "id": "patient_018", "value": 1, "doc": { ...full document... } },
    ...47 rows total...
  ]
}

Views are pre-computed and indexed. When you query, CouchDB reads from the index, not the raw documents. The first query against a new view has to build the index from every document (slow). Subsequent queries hit the existing index (fast), and after new writes the index is updated incrementally, processing only the documents that changed.
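The %22 in the curl query above is a URL-encoded double quote: view keys are sent as JSON, so a string key keeps its quotes. A small sketch of building such a URL follows; view_query_url is a hypothetical helper, and it assumes every parameter passed to it is JSON-valued (key, reduce, include_docs, group and similar).

```python
import json
from urllib.parse import urlencode


def view_query_url(base, db, design_doc, view, **params):
    """Build a view URL; each parameter value is JSON-encoded, then URL-encoded."""
    encoded = {name: json.dumps(value) for name, value in params.items()}
    url = f"{base}/{db}/_design/{design_doc}/_view/{view}"
    return f"{url}?{urlencode(encoded)}" if encoded else url


url = view_query_url(
    "http://localhost:5984", "patient_surveys", "surveys", "by_village",
    key="Koumbia", reduce=False, include_docs=True,
)
print(url)
```

Encoding through json.dumps also turns Python's False and True into the lowercase false and true that the HTTP API expects.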

The trade-off vs MongoDB: CouchDB views must be defined before you can query efficiently. MongoDB lets you query any field ad-hoc and add indexes later. CouchDB's pre-defined view approach is more predictable in performance but less flexible for exploratory queries.

CouchDB vs MongoDB — Choosing Between Them

Scenario | MongoDB | CouchDB
App works offline, syncs when online | Complex: manual sync logic required | ✅ Built-in: core design feature
Rich ad-hoc queries and aggregations | ✅ MQL + full aggregation pipeline | Limited: pre-defined views, plus the simpler Mango API
Multi-master replication, conflict handling | Not supported natively | ✅ Core feature: designed for this
REST/HTTP API without a driver | Requires MongoDB driver | ✅ Native: curl is enough
Ecosystem, tooling, community | ✅ Much larger ecosystem | Smaller, more specialist

Teacher's Note

CouchDB is the most underrated database in this entire course. Most engineers have never used it because MongoDB gets all the press. But when you have a genuinely offline-first requirement (field workers, medical devices, remote industrial sensors, POS systems in areas with poor connectivity), CouchDB's multi-master replication and built-in conflict detection are years ahead of anything else. The teams who discover CouchDB for the right use case become fierce advocates. The key is knowing that use case: if your app must work without internet and sync later, CouchDB is not one option, it is the option.

Practice Questions — You're the Engineer

Scenario:

A developer fetches a CouchDB document, modifies it, and tries to save it back. They get a 409 Conflict HTTP error. Investigation shows another process updated the same document between the GET and the PUT. What field must be included in the PUT request body — matching the current version — to avoid this error?


Scenario:

You configure CouchDB replication between a field tablet and a central server. When you test it, the replication runs once and then stops. New documents added to the tablet after the initial sync are never sent to the server. Which replication configuration property must be set to true to keep replication running and sync new changes as they happen?


Scenario:

Two nurses on different devices both open the same patient record (revision 1) offline and make different updates. When both devices sync, CouchDB stores both modified versions and flags the document as having a conflict — neither update is silently lost. What concurrency control mechanism does CouchDB use that makes this conflict detection possible?


Quiz — CouchDB in Production

Scenario:

A disaster relief organisation deploys volunteers across earthquake-affected regions. Volunteers use tablets to record damage assessments. Network connectivity is intermittent — sometimes days pass without internet access. When connectivity is restored, assessments from all volunteers must sync to a central coordination server. Multiple volunteers may assess the same building. Which database is purpose-built for this scenario?

Scenario:

Two field workers both edited the same inventory document offline. When both devices synced, CouchDB flagged a conflict. A junior developer asks: "Did CouchDB lose one of the updates?" What actually happened, and what does the application need to do?

Scenario:

A developer is switching from MongoDB to CouchDB for a new project. They try to run db.find({ region: "north", status: "active" }) style queries and find CouchDB's Mango query API is limited compared to MongoDB's MQL. Their team lead says "In CouchDB, you define your queries upfront as part of the database." What is this CouchDB query system called and what is its main trade-off?
