NO SQL Lesson 36 – Monitoring NoSQL Systems | Dataplexa
Enterprise & Cloud · Lesson 36

Monitoring NoSQL Systems

Monitoring is not about watching dashboards. It is about knowing your database is about to have a bad day before your users do. The difference between a 2-minute incident and a 4-hour outage is almost always the same: how fast you found out, and whether the signal was buried in noise or sitting in plain sight. This lesson is about building that signal.

The Four Golden Signals for NoSQL

Google's Site Reliability Engineering book introduced the four golden signals — the minimum set of metrics that together tell you whether a system is healthy. For NoSQL databases, they map directly to database-specific indicators:

① Latency
How long do queries take?
Track p50, p95, and p99 — not averages. An average of 5ms with a p99 of 2,000ms means 1 in 100 users waits 2 seconds. That user is your highest-value customer filling out a checkout form.
② Traffic
How many operations per second?
Reads/sec, writes/sec, active connections. A sudden spike reveals an anomaly. A gradual climb tells you when to scale. A drop to zero is often the worst alert of all.
③ Errors
What is failing and how often?
Write conflicts, auth failures, replication lag errors, OOM kills. Each type points to a different root cause. Track the error rate, not just the raw count — counts hide the severity.
④ Saturation
How full is the system?
Disk usage, memory pressure, connection pool exhaustion, CPU. Saturation metrics predict future failures — they tell you the system is approaching its limit before it actually hits it.

Triage with db.serverStatus()

MongoDB exposes a rich set of runtime statistics through db.serverStatus(). Every monitoring agent — Atlas, Datadog, Prometheus — ultimately scrapes this endpoint. Knowing what each field means lets you triage an incident in minutes instead of guessing.

The scenario: You are the on-call engineer at a SaaS company. Your alerting system pages you at 11:47pm — MongoDB query latency has spiked from 8ms to 340ms over the past 15 minutes. You open a terminal and need to triage in under 3 minutes. Is it a query problem? A memory problem? A connection problem? You reach for db.serverStatus().

MongoDB shell — triage with serverStatus()
const s = db.serverStatus();

// 1. Are operations queued waiting for a lock?
printjson({
  readWaiters:  s.globalLock.currentQueue.readers,
  writeWaiters: s.globalLock.currentQueue.writers
});

// 2. Is the working set fitting in RAM?
const cache = s.wiredTiger.cache;
printjson({
  cacheUsedGB:   (cache["bytes currently in the cache"] / 1e9).toFixed(2),
  cacheDirtyPct: (cache["tracked dirty bytes in the cache"] /
                  cache["maximum bytes configured"] * 100).toFixed(1) + "%",
  diskReads:     cache["pages read into cache"]  // high = cache misses = disk I/O
});

// 3. Is the connection pool near exhaustion?
printjson({
  currentConns:   s.connections.current,
  availableConns: s.connections.available
});
// Lock queue — something is holding a write lock
{ readWaiters: 0, writeWaiters: 47 }    ⚠️  47 writes queued!

// Cache — working set has outgrown available RAM
{
  cacheUsedGB:   15.94,      // at the 16GB limit
  cacheDirtyPct: "34.2%",    // high — WiredTiger is under flush pressure
  diskReads:     8841293     // 8.8M pages read from disk  ⚠️
}

// Connections — not the bottleneck here
{ currentConns: 312, availableConns: 688 }  ✓
globalLock.currentQueue.writers: 47

47 write operations are queued waiting for a lock. Something is holding a write lock for an unusually long time — a slow bulk operation, an index build, or a missing-index query that scans millions of documents. Every queued operation adds to your p99 latency. This is the primary triage signal: the spike is caused by lock contention, not hardware.

cacheDirtyPct: 34.2% + 8.8M disk reads

The WiredTiger cache is full. 34% of the cache contains modified pages waiting to be flushed to disk — MongoDB is under write pressure. 8.8 million page reads means the working set no longer fits in RAM and queries are hitting the disk. Cache pressure is the contributing cause slowing the writes that are causing the lock queue to build.

connections.available: 688

688 connection slots remain open — connections are not the bottleneck this time. In a different incident, seeing availableConns: 3 with totalCreated climbing rapidly would point directly to a connection leak, where application code creates new MongoClient instances on every request instead of reusing a shared pool.

Finding Slow Queries — The Database Profiler

Lock contention usually has a root cause: a query scanning millions of documents without an index. MongoDB's database profiler captures every query that exceeds a configurable time threshold, writing them to the system.profile collection — letting you pinpoint the exact query causing the problem.

The scenario: The serverStatus() triage pointed to write lock contention. You suspect a slow, unindexed query is the culprit. You enable the profiler at level 1 to log queries slower than 100ms and immediately find the offenders.

MongoDB shell — enable profiler, find worst queries
// Level 1 = log only queries slower than slowMs
// Level 0 = off | Level 2 = log everything (never in production)
db.setProfilingLevel(1, { slowMs: 100 });

// Find the 5 slowest queries in the last 10 minutes
db.system.profile.find({
  ts: { $gte: new Date(Date.now() - 10 * 60 * 1000) }
}).sort({ millis: -1 }).limit(5).forEach(op => {
  print(`${op.millis}ms | ${op.op} | ${op.ns}`);
  print(`  query:        ${JSON.stringify(op.query || op.command)}`);
  print(`  docsExamined: ${op.docsExamined}  returned: ${op.nreturned}`);
  print(`  planSummary:  ${op.planSummary}`);
  print("---");
});
847ms | query | myapp.orders
  query:        {"status":"pending","created_at":{"$lt":"2025-03-01T00:00:00Z"}}
  docsExamined: 2847391   returned: 12   ⚠️  examined 237K docs per result!
  planSummary:  COLLSCAN   ← full table scan, no index used
---
312ms | query | myapp.users
  query:        {"email":"user@example.com"}
  docsExamined: 941203    returned: 1
  planSummary:  COLLSCAN   ← no index on email field
---
Root cause confirmed. Fix: add compound index on orders(status, created_at)
and single-field index on users(email).
docsExamined: 2,847,391 vs returned: 12

This ratio is the clearest signal of a missing index. The query scanned 2.8 million documents to return 12 results — a ratio of 237,000:1. A well-indexed query has a ratio close to 1:1. MongoDB held a read lock throughout the entire scan, causing every write operation to queue behind it and explaining the 47 write waiters seen in serverStatus().

planSummary: COLLSCAN vs IXSCAN

COLLSCAN means MongoDB read every document in the collection. IXSCAN means MongoDB walked the B-tree index and jumped directly to matching documents. Any COLLSCAN on a collection with more than a few thousand documents is a candidate for indexing. In production, zero COLLSCAN entries in the profiler is a reasonable target.

setProfilingLevel(1, { slowMs: 100 })

Level 2 logs every single operation — on a busy cluster this produces gigabytes of profile entries per hour, adds latency to every operation, and floods system.profile until it rotates. Always use level 1 with a sensible slowMs threshold. Start at 100ms, lower to 50ms if you need to catch more. Disable the profiler again once the investigation is complete.

Prometheus Alert Rules — Alerts That Fire Before the Outage

Terminal triage is reactive — you are already in an incident. Prometheus alert rules are proactive — they fire when a metric crosses a threshold, giving you time to act before users are affected. The key to useful alerts is getting two things right: the threshold (not too sensitive, not too loose) and the for duration (long enough to avoid false positives from momentary spikes).

The scenario: You are setting up a production monitoring stack for a self-managed MongoDB replica set. Your team has been burned twice by replication lag surprises during failovers and once by a connection pool exhaustion that caused a 20-minute outage. You are writing the alert rules that would have caught both incidents early.

prometheus-alerts.yml — MongoDB alert rules
groups:
  - name: mongodb_alerts
    rules:

      # Replication lag — alert before a failover becomes a data loss event
      - alert: MongoReplicationLagHigh
        expr: mongodb_mongod_replset_member_replication_lag > 10
        for: 2m        # ignore transient 1-second spikes
        labels:
          severity: warning
        annotations:
          summary: "Replication lag {{ $value }}s on {{ $labels.instance }}"
          description: "Secondary falling behind. Writes could be lost on failover."

      # Connection pool — alert at 85% so you have headroom before rejection
      - alert: MongoConnectionsNearLimit
        expr: |
          mongodb_connections{state="current"} /
          (mongodb_connections{state="current"} +
           mongodb_connections{state="available"}) > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool {{ $value | humanizePercentage }} full"
          description: "At current growth rate, connections will be exhausted in minutes."

      # Disk space — alert at 80% with time to act, not at 99%
      - alert: MongoDiskSpaceLow
        expr: |
          (node_filesystem_size_bytes{mountpoint="/data"} -
           node_filesystem_avail_bytes{mountpoint="/data"}) /
           node_filesystem_size_bytes{mountpoint="/data"} > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MongoDB data disk {{ $value | humanizePercentage }} full"
ALERTS FIRING:

[WARNING] MongoReplicationLagHigh
  instance: mongo-secondary-2:9216
  value:    34.2s  (threshold: 10s, firing for: 4m12s)
  message:  Secondary is 34 seconds behind primary.
            If primary fails now, 34 seconds of committed writes will be lost.
  action:   Check secondary CPU/disk I/O, look for long-running writes on primary.

[CRITICAL] MongoConnectionsNearLimit
  instance: mongo-primary:9216
  value:    91%  (threshold: 85%, firing for: 6m30s)
  message:  910 of 1000 connection slots in use.
  action:   Identify which service is not releasing connections.
            Check for MongoClient created per-request instead of at startup.
for: 2m — why not fire immediately?

A single-second replication lag spike caused by a momentary network hiccup would wake your on-call engineer at 3am for something that self-resolved before they opened their laptop. The for clause requires the condition to remain true continuously for 2 minutes. Alert fatigue from false positives trains engineers to ignore alerts — which is worse than no alerting at all.

Replication lag alert at 10 seconds

When the output shows 34 seconds of lag: if the primary fails right now and a secondary at 34 seconds of lag is elected, every write in those 34 seconds is permanently lost — your application issued success responses, your users saw confirmations, the data is gone. Replication lag is a silent data loss time bomb. Alert at 10 seconds to give yourself time to investigate before a failover makes it irreversible.

Connection alert at 85%, not 100%

By the time connections hit 100%, MongoDB is already rejecting new connections and your application is returning errors to users. The incident has started. Alerting at 85% gives you 15% headroom — typically 5 to 15 minutes at normal growth rates — to identify the leaking service, adjust pool sizing, or spin up capacity before anyone notices.

Monitoring Cassandra — nodetool Triage

Cassandra has its own monitoring vocabulary. The first tool you reach for is nodetool — it speaks to the JMX interface of a running Cassandra node and gives you cluster ring topology, per-node health, and compaction status. Knowing how to read nodetool status output is a prerequisite for operating any Cassandra cluster.

The scenario: Your alerting system shows elevated read latency on a 6-node Cassandra cluster storing IoT event data. You need to identify which node is struggling, whether the ring is balanced, and whether a compaction backlog is the culprit — before you escalate to the database team.

bash — Cassandra triage with nodetool
# Step 1: cluster ring — are all nodes up and is load balanced?
nodetool status

# Step 2: are compactions backed up on any node?
nodetool compactionstats

# Step 3: check for dropped messages — a critical health signal
# Dropped mutations = writes that were rejected and lost
nodetool tpstats | grep -A3 "Dropped"
$ nodetool status
Datacenter: dc1
Status=Up/Down  State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns   State
UN  10.0.1.10    45.2 GiB    256     16.7%  Normal
UN  10.0.1.11    44.8 GiB    256     16.6%  Normal
DN  10.0.1.12    ?           256     16.7%  Normal   ⚠️  DN = DOWN!
UN  10.0.1.13    45.1 GiB    256     16.7%  Normal
UN  10.0.1.14    88.4 GiB    256     16.6%  Normal   ⚠️  2x the load!
UN  10.0.1.15    44.9 GiB    256     16.6%  Normal

$ nodetool compactionstats  (run on node 10.0.1.14)
pending tasks: 847     ⚠️  847 compactions queued!
Active compaction remaining time : 4h 12m

Root cause chain:
  node-12 is DOWN → its token ranges fell to node-14
  → node-14 handling 2x normal write load
  → compaction backlog building → SSTable read amplification
  → elevated read latency across the cluster  ✓
DN — Down Node

DN in nodetool status means the node is down and not serving traffic. With a replication factor of 3, one down node is tolerable — requests route to the other replicas. But the down node's write load has been redistributed to its neighbours. Node-14 shows double the data load of every other node, which is the immediate explanation for the compaction backlog.

pending compaction tasks: 847

Cassandra writes new data as new SSTable files and merges them during compaction. A backlog of 847 pending compactions means reads on node-14 must scan more SSTable files per query — this is read amplification. The more uncompacted SSTables, the slower each read becomes, because Cassandra must check multiple files and merge the results. Compaction backlog is one of the top causes of Cassandra read latency spikes.

Dropped mutations in tpstats

Dropped mutations are writes that Cassandra accepted, queued internally, but then dropped because the queue was full — meaning those writes are permanently lost without the client ever knowing. A non-zero and growing dropped mutation count is a critical alert. It means the node is overwhelmed and is silently discarding data. This is the most severe signal in nodetool tpstats.

Key Metrics Reference — What to Monitor and Why

Metric System Alert threshold What it means when firing
replication_lag MongoDB > 10s for 2m Secondary falling behind — data loss risk on failover
connections %used MongoDB > 85% for 5m Pool near exhaustion — likely connection leak in app
globalLock.writeWaiters MongoDB > 10 for 1m Lock contention — find the slow query holding the lock
cache dirty % MongoDB > 20% sustained WiredTiger under flush pressure — working set exceeds RAM
pending compactions Cassandra > 100 per node Read amplification building — latency will increase
dropped mutations Cassandra > 0 (any) Critical — writes being silently discarded, node overwhelmed
disk usage All > 80% Extend volume before compaction and journals run out of space
p99 read latency All 3× baseline for 3m Something changed — new slow query, resource pressure, or node issue

Teacher's Note

The most common monitoring mistake is alerting on too many things with thresholds set too low. After three months of false positives, engineers start dismissing alerts before reading them — and then miss the one real incident buried among the noise. Start with just five alerts: replication lag, connection pool saturation, disk usage, p99 latency baseline deviation, and any down node. Get those five alerts tuned to produce zero false positives before you add anything else. Five reliable signals beat fifty unreliable ones every time.

Practice Questions — You're the Engineer

Scenario:

You enable the MongoDB profiler at level 1 during a latency incident and find a query in system.profile showing docsExamined: 1,200,000 and nreturned: 3. The query took 620ms. The planSummary field shows a two-word value indicating MongoDB read every document in the collection without using any index. What is the value of planSummary for this query?


Scenario:

Your team's Prometheus MongoDB replication lag alert is firing 8 to 12 times per night. Each time you investigate, the lag has already dropped back to normal — it was a momentary 2-second spike caused by a brief network hiccup between the primary and secondary. The on-call rotation is furious. The alert fires immediately as soon as the expression is true, with no persistence requirement. What single Prometheus alert rule field should you add, and what value would prevent these transient spikes from firing the alert while still catching genuine sustained lag?


Scenario:

You are reviewing nodetool tpstats output on a Cassandra node that has been under heavy write load for the past hour. You notice a counter that was zero this morning is now at 4,271 and still climbing. This counter represents write operations that Cassandra accepted from the client, queued internally, but then silently discarded without completing — meaning those writes are permanently lost and the writing client received no error. Your SRE team considers any non-zero value for this metric an immediate critical alert. What is this metric called in Cassandra?


Quiz — Monitoring in Production

Scenario:

During a latency incident you check db.serverStatus() and see globalLock.currentQueue.writers: 62. You enable the profiler and find a query in system.profile with docsExamined: 3,100,000, nreturned: 8, millis: 1240, and planSummary: "COLLSCAN". What is the root cause of the write queue, and what does this profiler entry tell you about it?

Scenario:

A colleague reviews your Prometheus alert rules and questions why the MongoDB connection pool alert fires at 85% rather than 100%. They argue that firing at 85% means you are paging the on-call engineer 15% early, before there is actually a problem. During last month's connection leak incident, the pool went from 80% to 100% in 4 minutes. What is the correct justification for the 85% threshold?

Scenario:

You run nodetool status on a 6-node Cassandra cluster and see: five nodes with ~45GiB load each, one node marked DN (down), and one node showing 88GiB load — nearly double every other node. You then run nodetool compactionstats on the overloaded node and see 847 pending compaction tasks with an estimated completion time of 4 hours. Your monitoring alert is for elevated read latency across the cluster. Which option correctly explains the full causal chain?

Up Next · Lesson 37

Scaling NoSQL Databases

Vertical vs horizontal scaling, read replicas, and the moment you realise a single node will never be enough again.