NoSQL
Monitoring NoSQL Systems
Monitoring is not about watching dashboards. It is about knowing your database is about to have a bad day before your users do. The difference between a 2-minute incident and a 4-hour outage is almost always the same: how fast you found out, and whether the signal was buried in noise or sitting in plain sight. This lesson is about building that signal.
The Four Golden Signals for NoSQL
Google's Site Reliability Engineering book introduced the four golden signals — the minimum set of metrics that together tell you whether a system is healthy. For NoSQL databases, they map directly to database-specific indicators:
Triage with db.serverStatus()
MongoDB exposes a rich set of runtime statistics through db.serverStatus(). Every monitoring agent — Atlas, Datadog, Prometheus — ultimately scrapes this endpoint. Knowing what each field means lets you triage an incident in minutes instead of guessing.
The scenario: You are the on-call engineer at a SaaS company. Your alerting system pages you at 11:47pm — MongoDB query latency has spiked from 8ms to 340ms over the past 15 minutes. You open a terminal and need to triage in under 3 minutes. Is it a query problem? A memory problem? A connection problem? You reach for db.serverStatus().
const s = db.serverStatus();
// 1. Are operations queued waiting for a lock?
printjson({
readWaiters: s.globalLock.currentQueue.readers,
writeWaiters: s.globalLock.currentQueue.writers
});
// 2. Is the working set fitting in RAM?
const cache = s.wiredTiger.cache;
printjson({
cacheUsedGB: (cache["bytes currently in the cache"] / 1e9).toFixed(2),
cacheDirtyPct: (cache["tracked dirty bytes in the cache"] /
cache["maximum bytes configured"] * 100).toFixed(1) + "%",
diskReads: cache["pages read into cache"] // high = cache misses = disk I/O
});
// 3. Is the connection pool near exhaustion?
printjson({
currentConns: s.connections.current,
availableConns: s.connections.available
});
// Lock queue — something is holding a write lock
{ readWaiters: 0, writeWaiters: 47 } ⚠️ 47 writes queued!
// Cache — working set has outgrown available RAM
{
cacheUsedGB: 15.94, // at the 16GB limit
cacheDirtyPct: "34.2%", // high — WiredTiger is under flush pressure
diskReads: 8841293 // 8.8M pages read from disk ⚠️
}
// Connections — not the bottleneck here
{ currentConns: 312, availableConns: 688 } ✓globalLock.currentQueue.writers: 47
47 write operations are queued waiting for a lock. Something is holding a write lock for an unusually long time — a slow bulk operation, an index build, or a missing-index query that scans millions of documents. Every queued operation adds to your p99 latency. This is the primary triage signal: the spike is caused by lock contention, not hardware.
cacheDirtyPct: 34.2% + 8.8M disk reads
The WiredTiger cache is full. 34% of the cache contains modified pages waiting to be flushed to disk — MongoDB is under write pressure. 8.8 million page reads means the working set no longer fits in RAM and queries are hitting the disk. Cache pressure is the contributing cause slowing the writes that are causing the lock queue to build.
connections.available: 688
688 connection slots remain open — connections are not the bottleneck this time. In a different incident, seeing availableConns: 3 with totalCreated climbing rapidly would point directly to a connection leak, where application code creates new MongoClient instances on every request instead of reusing a shared pool.
Finding Slow Queries — The Database Profiler
Lock contention usually has a root cause: a query scanning millions of documents without an index. MongoDB's database profiler captures every query that exceeds a configurable time threshold, writing them to the system.profile collection — letting you pinpoint the exact query causing the problem.
The scenario: The serverStatus() triage pointed to write lock contention. You suspect a slow, unindexed query is the culprit. You enable the profiler at level 1 to log queries slower than 100ms and immediately find the offenders.
// Level 1 = log only queries slower than slowMs
// Level 0 = off | Level 2 = log everything (never in production)
db.setProfilingLevel(1, { slowMs: 100 });
// Find the 5 slowest queries in the last 10 minutes
db.system.profile.find({
ts: { $gte: new Date(Date.now() - 10 * 60 * 1000) }
}).sort({ millis: -1 }).limit(5).forEach(op => {
print(`${op.millis}ms | ${op.op} | ${op.ns}`);
print(` query: ${JSON.stringify(op.query || op.command)}`);
print(` docsExamined: ${op.docsExamined} returned: ${op.nreturned}`);
print(` planSummary: ${op.planSummary}`);
print("---");
});
847ms | query | myapp.orders
query: {"status":"pending","created_at":{"$lt":"2025-03-01T00:00:00Z"}}
docsExamined: 2847391 returned: 12 ⚠️ examined 237K docs per result!
planSummary: COLLSCAN ← full table scan, no index used
---
312ms | query | myapp.users
query: {"email":"user@example.com"}
docsExamined: 941203 returned: 1
planSummary: COLLSCAN ← no index on email field
---
Root cause confirmed. Fix: add compound index on orders(status, created_at)
and single-field index on users(email).docsExamined: 2,847,391 vs returned: 12
This ratio is the clearest signal of a missing index. The query scanned 2.8 million documents to return 12 results — a ratio of 237,000:1. A well-indexed query has a ratio close to 1:1. MongoDB held a read lock throughout the entire scan, causing every write operation to queue behind it and explaining the 47 write waiters seen in serverStatus().
planSummary: COLLSCAN vs IXSCAN
COLLSCAN means MongoDB read every document in the collection. IXSCAN means MongoDB walked the B-tree index and jumped directly to matching documents. Any COLLSCAN on a collection with more than a few thousand documents is a candidate for indexing. In production, zero COLLSCAN entries in the profiler is a reasonable target.
setProfilingLevel(1, { slowMs: 100 })
Level 2 logs every single operation — on a busy cluster this produces gigabytes of profile entries per hour, adds latency to every operation, and floods system.profile until it rotates. Always use level 1 with a sensible slowMs threshold. Start at 100ms, lower to 50ms if you need to catch more. Disable the profiler again once the investigation is complete.
Prometheus Alert Rules — Alerts That Fire Before the Outage
Terminal triage is reactive — you are already in an incident. Prometheus alert rules are proactive — they fire when a metric crosses a threshold, giving you time to act before users are affected. The key to useful alerts is getting two things right: the threshold (not too sensitive, not too loose) and the for duration (long enough to avoid false positives from momentary spikes).
The scenario: You are setting up a production monitoring stack for a self-managed MongoDB replica set. Your team has been burned twice by replication lag surprises during failovers and once by a connection pool exhaustion that caused a 20-minute outage. You are writing the alert rules that would have caught both incidents early.
groups:
- name: mongodb_alerts
rules:
# Replication lag — alert before a failover becomes a data loss event
- alert: MongoReplicationLagHigh
expr: mongodb_mongod_replset_member_replication_lag > 10
for: 2m # ignore transient 1-second spikes
labels:
severity: warning
annotations:
summary: "Replication lag {{ $value }}s on {{ $labels.instance }}"
description: "Secondary falling behind. Writes could be lost on failover."
# Connection pool — alert at 85% so you have headroom before rejection
- alert: MongoConnectionsNearLimit
expr: |
mongodb_connections{state="current"} /
(mongodb_connections{state="current"} +
mongodb_connections{state="available"}) > 0.85
for: 5m
labels:
severity: critical
annotations:
summary: "Connection pool {{ $value | humanizePercentage }} full"
description: "At current growth rate, connections will be exhausted in minutes."
# Disk space — alert at 80% with time to act, not at 99%
- alert: MongoDiskSpaceLow
expr: |
(node_filesystem_size_bytes{mountpoint="/data"} -
node_filesystem_avail_bytes{mountpoint="/data"}) /
node_filesystem_size_bytes{mountpoint="/data"} > 0.80
for: 10m
labels:
severity: warning
annotations:
summary: "MongoDB data disk {{ $value | humanizePercentage }} full"
ALERTS FIRING:
[WARNING] MongoReplicationLagHigh
instance: mongo-secondary-2:9216
value: 34.2s (threshold: 10s, firing for: 4m12s)
message: Secondary is 34 seconds behind primary.
If primary fails now, 34 seconds of committed writes will be lost.
action: Check secondary CPU/disk I/O, look for long-running writes on primary.
[CRITICAL] MongoConnectionsNearLimit
instance: mongo-primary:9216
value: 91% (threshold: 85%, firing for: 6m30s)
message: 910 of 1000 connection slots in use.
action: Identify which service is not releasing connections.
Check for MongoClient created per-request instead of at startup.for: 2m — why not fire immediately?
A single-second replication lag spike caused by a momentary network hiccup would wake your on-call engineer at 3am for something that self-resolved before they opened their laptop. The for clause requires the condition to remain true continuously for 2 minutes. Alert fatigue from false positives trains engineers to ignore alerts — which is worse than no alerting at all.
Replication lag alert at 10 seconds
When the output shows 34 seconds of lag: if the primary fails right now and a secondary at 34 seconds of lag is elected, every write in those 34 seconds is permanently lost — your application issued success responses, your users saw confirmations, the data is gone. Replication lag is a silent data loss time bomb. Alert at 10 seconds to give yourself time to investigate before a failover makes it irreversible.
Connection alert at 85%, not 100%
By the time connections hit 100%, MongoDB is already rejecting new connections and your application is returning errors to users. The incident has started. Alerting at 85% gives you 15% headroom — typically 5 to 15 minutes at normal growth rates — to identify the leaking service, adjust pool sizing, or spin up capacity before anyone notices.
Monitoring Cassandra — nodetool Triage
Cassandra has its own monitoring vocabulary. The first tool you reach for is nodetool — it speaks to the JMX interface of a running Cassandra node and gives you cluster ring topology, per-node health, and compaction status. Knowing how to read nodetool status output is a prerequisite for operating any Cassandra cluster.
The scenario: Your alerting system shows elevated read latency on a 6-node Cassandra cluster storing IoT event data. You need to identify which node is struggling, whether the ring is balanced, and whether a compaction backlog is the culprit — before you escalate to the database team.
# Step 1: cluster ring — are all nodes up and is load balanced?
nodetool status
# Step 2: are compactions backed up on any node?
nodetool compactionstats
# Step 3: check for dropped messages — a critical health signal
# Dropped mutations = writes that were rejected and lost
nodetool tpstats | grep -A3 "Dropped"
$ nodetool status Datacenter: dc1 Status=Up/Down State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns State UN 10.0.1.10 45.2 GiB 256 16.7% Normal UN 10.0.1.11 44.8 GiB 256 16.6% Normal DN 10.0.1.12 ? 256 16.7% Normal ⚠️ DN = DOWN! UN 10.0.1.13 45.1 GiB 256 16.7% Normal UN 10.0.1.14 88.4 GiB 256 16.6% Normal ⚠️ 2x the load! UN 10.0.1.15 44.9 GiB 256 16.6% Normal $ nodetool compactionstats (run on node 10.0.1.14) pending tasks: 847 ⚠️ 847 compactions queued! Active compaction remaining time : 4h 12m Root cause chain: node-12 is DOWN → its token ranges fell to node-14 → node-14 handling 2x normal write load → compaction backlog building → SSTable read amplification → elevated read latency across the cluster ✓
DN — Down Node
DN in nodetool status means the node is down and not serving traffic. With a replication factor of 3, one down node is tolerable — requests route to the other replicas. But the down node's write load has been redistributed to its neighbours. Node-14 shows double the data load of every other node, which is the immediate explanation for the compaction backlog.
pending compaction tasks: 847
Cassandra writes new data as new SSTable files and merges them during compaction. A backlog of 847 pending compactions means reads on node-14 must scan more SSTable files per query — this is read amplification. The more uncompacted SSTables, the slower each read becomes, because Cassandra must check multiple files and merge the results. Compaction backlog is one of the top causes of Cassandra read latency spikes.
Dropped mutations in tpstats
Dropped mutations are writes that Cassandra accepted, queued internally, but then dropped because the queue was full — meaning those writes are permanently lost without the client ever knowing. A non-zero and growing dropped mutation count is a critical alert. It means the node is overwhelmed and is silently discarding data. This is the most severe signal in nodetool tpstats.
Key Metrics Reference — What to Monitor and Why
| Metric | System | Alert threshold | What it means when firing |
|---|---|---|---|
replication_lag |
MongoDB | > 10s for 2m | Secondary falling behind — data loss risk on failover |
connections %used |
MongoDB | > 85% for 5m | Pool near exhaustion — likely connection leak in app |
globalLock.writeWaiters |
MongoDB | > 10 for 1m | Lock contention — find the slow query holding the lock |
cache dirty % |
MongoDB | > 20% sustained | WiredTiger under flush pressure — working set exceeds RAM |
pending compactions |
Cassandra | > 100 per node | Read amplification building — latency will increase |
dropped mutations |
Cassandra | > 0 (any) | Critical — writes being silently discarded, node overwhelmed |
disk usage |
All | > 80% | Extend volume before compaction and journals run out of space |
p99 read latency |
All | 3× baseline for 3m | Something changed — new slow query, resource pressure, or node issue |
Teacher's Note
The most common monitoring mistake is alerting on too many things with thresholds set too low. After three months of false positives, engineers start dismissing alerts before reading them — and then miss the one real incident buried among the noise. Start with just five alerts: replication lag, connection pool saturation, disk usage, p99 latency baseline deviation, and any down node. Get those five alerts tuned to produce zero false positives before you add anything else. Five reliable signals beat fifty unreliable ones every time.
Practice Questions — You're the Engineer
Scenario:
system.profile showing docsExamined: 1,200,000 and nreturned: 3. The query took 620ms. The planSummary field shows a two-word value indicating MongoDB read every document in the collection without using any index. What is the value of planSummary for this query?
Scenario:
Scenario:
nodetool tpstats output on a Cassandra node that has been under heavy write load for the past hour. You notice a counter that was zero this morning is now at 4,271 and still climbing. This counter represents write operations that Cassandra accepted from the client, queued internally, but then silently discarded without completing — meaning those writes are permanently lost and the writing client received no error. Your SRE team considers any non-zero value for this metric an immediate critical alert. What is this metric called in Cassandra?
Quiz — Monitoring in Production
Scenario:
db.serverStatus() and see globalLock.currentQueue.writers: 62. You enable the profiler and find a query in system.profile with docsExamined: 3,100,000, nreturned: 8, millis: 1240, and planSummary: "COLLSCAN". What is the root cause of the write queue, and what does this profiler entry tell you about it?
Scenario:
Scenario:
nodetool status on a 6-node Cassandra cluster and see: five nodes with ~45GiB load each, one node marked DN (down), and one node showing 88GiB load — nearly double every other node. You then run nodetool compactionstats on the overloaded node and see 847 pending compaction tasks with an estimated completion time of 4 hours. Your monitoring alert is for elevated read latency across the cluster. Which option correctly explains the full causal chain?
Up Next · Lesson 37
Scaling NoSQL Databases
Vertical vs horizontal scaling, read replicas, and the moment you realise a single node will never be enough again.