Kubernetes Course
Kubernetes Monitoring
Logs tell you what happened. Metrics tell you how your system performs right now and over time. This lesson covers the Prometheus and Grafana stack, the four golden signals every service should track, alerting with Alertmanager, and the dashboards platform teams rely on to keep production healthy.
The Monitoring Stack
The de-facto Kubernetes monitoring stack is Prometheus + Grafana, installed together as kube-prometheus-stack via Helm. It bundles everything needed out of the box: Prometheus server, Grafana, Alertmanager, node-exporter DaemonSet, kube-state-metrics, and a library of pre-built dashboards and alerts.
Prometheus
Scrapes metrics endpoints every 15–60s. Stores as time-series data. Query language: PromQL. Holds data for 15 days by default.
Grafana
Visualises Prometheus data as dashboards and charts. In this stack, alert notifications to Slack, PagerDuty, and email are sent by Alertmanager.
kube-state-metrics
Exposes Kubernetes object state as metrics: Deployment replica counts, Pod phase, PVC bound status, Job completion. Queries the API server.
node-exporter
DaemonSet that exposes host-level metrics: CPU, memory, disk I/O, network, filesystem usage. Runs on every node.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 57.0.3 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.retentionSize=50GB \
  --set grafana.adminPassword=changeme \
  --set alertmanager.alertmanagerSpec.replicas=2
# Access Grafana locally
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# Open http://localhost:3000 admin / changeme
$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace --version 57.0.3
NAME: kube-prometheus-stack
STATUS: deployed
REVISION: 1
$ kubectl get pods -n monitoring
NAME                                                   READY   STATUS
alertmanager-kube-prometheus-stack-alertmanager-0      2/2     Running
kube-prometheus-stack-grafana-7d9f4-xkp2m              3/3     Running
kube-prometheus-stack-kube-state-metrics-abc-def       1/1     Running
kube-prometheus-stack-operator-xyz-123                 1/1     Running
prometheus-kube-prometheus-stack-prometheus-0          2/2     Running
kube-prometheus-stack-prometheus-node-exporter-a1b2    1/1     Running   ← DaemonSet, one per node
kube-prometheus-stack-prometheus-node-exporter-c3d4    1/1     Running
# Access Grafana
$ kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring &
# Open http://localhost:3000 admin / changeme
The Four Golden Signals
Google's SRE practice defines four golden signals that, together, give a broad picture of a service's health. If you can only track four things per service, track these. The PromQL examples below can go straight into a Grafana panel, provided the metric and label names (such as http_request_duration_seconds_bucket and service) match what your application actually exports.
1. Latency — How long requests take
Track P50, P95, P99 — not averages. A 500ms average hides a P99 of 5 seconds. Alert on P99, not P50.
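To see why an average can hide a slow tail, here is a minimal Python sketch. The latency values are hypothetical, and the percentile helper uses the simple nearest-rank method rather than anything Prometheus-specific.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 98 fast requests and 2 very slow ones.
latencies = [0.050] * 98 + [5.0] * 2

mean = statistics.mean(latencies)
p50 = percentile(latencies, 0.50)
p99 = percentile(latencies, 0.99)

print(f"mean={mean:.3f}s  p50={p50:.3f}s  p99={p99:.3f}s")
```

The mean lands around 150ms and the median at 50ms, while the P99 is 5 seconds. Only the percentile view shows that some users are waiting a hundred times longer than the typical request.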
# P99 request latency per service over the last 5 minutes
histogram_quantile(0.99,
  sum by (service, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
# P50 and P95 for comparison
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
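histogram_quantile does not see individual requests. It estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below illustrates that idea under simplifying assumptions (cumulative counts, a final +Inf bucket, a lower bound of 0 for the first bucket); it is not Prometheus's actual implementation, and the bucket counts are sample values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Target falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that holds the target rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Sample cumulative bucket counts, keyed by the "le" upper bound in seconds.
sample_buckets = [(0.1, 15234), (0.5, 18100), (1.0, 18390), (math.inf, 18432)]

p50 = histogram_quantile(0.50, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
print(f"p50≈{p50:.3f}s  p99≈{p99:.3f}s")
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries. If your SLO threshold is 500ms, having a bucket boundary at 0.5 keeps the estimate meaningful around the value you alert on.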
# Result of the per-service P99 query in Grafana or the Prometheus UI:
{service="payment-api"} 0.127 ← P99 = 127ms ✓
{service="fraud-service"} 0.843 ← P99 = 843ms ⚠ approaching SLO
# P50 vs P95 vs P99 comparison (illustrative values):
{quantile="0.50"} 0.038 ← median 38ms (most users fine)
{quantile="0.95"} 0.092 ← 95th percentile 92ms
{quantile="0.99"} 0.127 ← P99 127ms -- alert threshold: 500ms, we're green
2. Traffic — How much demand hits the system
Requests per second. Use this to correlate latency spikes with traffic spikes, and to right-size HPA targets.
# Requests per second per service
sum by (service) (rate(http_requests_total[5m]))
# Requests per second broken down by HTTP status code
sum by (status_code) (rate(http_requests_total[5m]))
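The rate() calls above turn an ever-increasing counter into a per-second value. The Python sketch below shows the core idea, including why counter resets (a Pod restart sets the counter back to zero) must be handled. It is a simplification: the sample values are hypothetical, and real rate() also extrapolates to the edges of the time window.

```python
def per_second_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter went down, so the process restarted: count from zero.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 30 seconds; the Pod restarts before the 90s scrape.
scrapes = [(0, 1000), (30, 4000), (60, 7000), (90, 500), (120, 3500)]

reset_aware = per_second_rate(scrapes)
naive = (scrapes[-1][1] - scrapes[0][1]) / (scrapes[-1][0] - scrapes[0][0])
print(f"reset-aware: {reset_aware:.1f} req/s, naive last-minus-first: {naive:.1f} req/s")
```

The naive calculation badly under-reports traffic after a restart, which is why you should query counters through rate() or increase() rather than subtracting raw values.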
# Current traffic:
{service="payment-api"} 142.3 ← 142 req/s
{service="fraud-service"} 98.7
# By status code:
{status_code="200"} 140.1 ← 98.5% success
{status_code="500"} 1.4 ← 1.0% server errors
{status_code="400"} 0.8 ← 0.5% client errors
3. Errors — What fraction of requests fail
Error rate as a percentage of total traffic. Alert when it exceeds your SLO threshold (e.g. 0.1% error rate).
# HTTP 5xx error rate as percentage of total traffic
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# Kubernetes Pod restart rate (infrastructure errors)
sum by (namespace, pod) (
  increase(kube_pod_container_status_restarts_total[1h])
)
# Error rate result:
0.98 ← 0.98% 5xx rate -- SLO is 0.5%, we're BREACHING 🚨
# Pod restart rate (last hour):
{namespace="payments", pod="payment-api-7d9f4-xkp2m"} 0
{namespace="payments", pod="payment-api-7d9f4-rvqn2"} 3 ← crash-looping -- investigate!
4. Saturation — How full the system is
CPU throttling, memory pressure, queue depth. A system near saturation degrades before it fails — catch it early.
# CPU throttling ratio per container: throttled CFS periods / total CFS periods
# (high value = CPU limit too low for the workload)
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total[5m])
)
# Memory usage vs limit (alert when above 85%)
container_memory_working_set_bytes
/
container_spec_memory_limit_bytes
* 100
# CPU throttling result (fraction 0-1):
{pod="payment-api-7d9f4-xkp2m", container="payment-api"} 0.42 ← 42% throttled -- CPU limit too low!
# Memory vs limit:
{pod="payment-api-7d9f4-xkp2m", container="payment-api"} 87 ← 87% of limit -- near OOMKill risk
# Fix: increase CPU limit from 100m to 300m (throttling is enforced against the limit, not the request) and memory limit from 256Mi to 512Mi
Exposing Custom Application Metrics
Your application can expose its own business metrics — payment success rates, queue depths, cache hit ratios — in the Prometheus format. Prometheus scrapes them automatically via a ServiceMonitor resource.
# Python: expose metrics with prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

payments_total = Counter(
    'payments_total',
    'Total payment attempts',
    ['status', 'currency']  # Labels -- group metrics by status and currency
)
payment_duration = Histogram(
    'payment_duration_seconds',
    'Payment processing duration',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
)

# In your payment handler (process_payment is your own business logic):
def handle_payment(amount, currency):
    with payment_duration.time():  # Automatically record duration
        result = process_payment(amount, currency)
    payments_total.labels(
        status='success' if result.ok else 'failed',
        currency=currency
    ).inc()
    return result

# Serve /metrics for Prometheus to scrape (8000 is an example port; use your Service's metrics port)
start_http_server(8000)
# What the /metrics endpoint exposes:
$ kubectl port-forward svc/payment-api 8080:80 -n payments &
$ curl -s http://localhost:8080/metrics | grep payments
# HELP payments_total Total payment attempts
# TYPE payments_total counter
payments_total{currency="USD",status="success"} 18432.0
payments_total{currency="USD",status="failed"} 47.0
payments_total{currency="EUR",status="success"} 3291.0
# HELP payment_duration_seconds Payment processing duration
# TYPE payment_duration_seconds histogram
payment_duration_seconds_bucket{le="0.1"} 15234.0
payment_duration_seconds_bucket{le="0.5"} 18100.0
payment_duration_seconds_bucket{le="+Inf"} 18432.0
payment_duration_seconds_sum 1842.7
payment_duration_seconds_count 18432.0
# ServiceMonitor: tells Prometheus where to scrape your application
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-api
  namespace: payments
  labels:
    release: kube-prometheus-stack # Must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payment-api # Selects the Service exposing /metrics
  endpoints:
    - port: metrics # Named port on the Service
      path: /metrics
      interval: 30s # Scrape every 30 seconds
  namespaceSelector:
    matchNames:
      - payments
# What Prometheus sees at /metrics on your Pod:
# HELP payments_total Total payment attempts
# TYPE payments_total counter
payments_total{status="success",currency="USD"} 18432
payments_total{status="failed",currency="USD"} 47
payments_total{status="success",currency="EUR"} 3291
payments_total{status="failed",currency="EUR"} 12
# HELP payment_duration_seconds Payment processing duration
# TYPE payment_duration_seconds histogram
payment_duration_seconds_bucket{le="0.1"} 15234
payment_duration_seconds_bucket{le="0.5"} 18100
payment_duration_seconds_bucket{le="1.0"} 18390
payment_duration_seconds_bucket{le="+Inf"} 18432
payment_duration_seconds_sum 1842.7
payment_duration_seconds_count 18432
# PromQL: payment success rate
sum(rate(payments_total{status="success"}[5m]))
/
sum(rate(payments_total[5m]))
* 100
# Result: 99.72% -- your SLO is 99.5%, you're green
What just happened?
ServiceMonitor is the Kubernetes-native way to configure scraping. Instead of editing Prometheus's scrape_configs manually, the Prometheus Operator watches for ServiceMonitor resources and regenerates the Prometheus configuration automatically. Deploy a new service with a ServiceMonitor, and Prometheus picks it up once the Operator reloads the configuration, usually within a minute or two, with no Prometheus restart required.
Labels on metrics are dimensions for slicing. The status and currency labels let you break payment results down by currency. Which currency is failing most often? The PromQL sum by (currency) (rate(payments_total{status="failed"}[5m])) returns failed payments per second for each currency; divide it by sum by (currency) (rate(payments_total[5m])) to get a failure ratio. Every label you add becomes another dimension for filtering and grouping, but each distinct combination of label values is stored as its own time series, so keep label cardinality low (avoid user IDs or trace IDs as labels; those belong in logs, not metrics).
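The cost of a high-cardinality label is easy to estimate, because the worst-case number of time series is the product of the distinct values of each label. A small Python sketch, where the user count is a hypothetical figure:

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series: one per combination of label values."""
    return prod(len(values) for values in label_values.values())


labels = {"status": ["success", "failed"], "currency": ["USD", "EUR"]}
base = series_count(labels)

# Hypothetical: adding a user_id label with 100,000 distinct users.
with_user_id = series_count({**labels, "user_id": range(100_000)})

print(f"status x currency: {base} series")
print(f"status x currency x user_id: {with_user_id} series")
```

Four series become four hundred thousand for a single counter. Each series costs memory and storage in Prometheus, which is why identifiers like user_id belong in logs or traces.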
Alerting with Alertmanager
Prometheus evaluates alert rules against the metrics it holds. When a rule fires, it sends the alert to Alertmanager, which handles routing, deduplication, grouping, and notification to Slack, PagerDuty, email, or any webhook.
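Deduplication and grouping can be pictured with a short Python sketch. It is only an illustration of the concept: the alert label sets are hypothetical, and the real Alertmanager also applies timing (group_wait, group_interval, repeat_interval), silences, and inhibition rules.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest by the group_by labels."""
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            # Same label set already received, e.g. from a second Prometheus replica.
            continue
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incoming alerts, represented by their label sets.
incoming = [
    {"alertname": "PaymentErrorRateTooHigh", "team": "payments", "severity": "critical"},
    {"alertname": "PaymentErrorRateTooHigh", "team": "payments", "severity": "critical"},
    {"alertname": "PodCrashLooping", "team": "payments", "severity": "warning"},
    {"alertname": "HighCPUThrottling", "team": "platform", "severity": "warning"},
]

grouped = group_alerts(incoming, group_by=["team"])
for key, members in grouped.items():
    print(key, [m["alertname"] for m in members])
```

Grouping by team means the payments on-call receives one notification listing both payments alerts, rather than a separate page for each.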
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-api-alerts
  namespace: payments
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: payment-api
      interval: 30s # Evaluate these rules every 30 seconds
      rules:
        - alert: PaymentErrorRateTooHigh
          expr: |
            (
              sum(rate(payments_total{status="failed"}[5m]))
              /
              sum(rate(payments_total[5m]))
            ) * 100 > 1
          for: 2m # Must be true for 2 minutes before firing
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment error rate above 1%"
            description: "Payment error rate is {{ $value | humanize }}% -- SLO breach imminent."
            runbook_url: "https://wiki.company.com/runbooks/payment-errors"
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="payments"}[1h]) > 3
          for: 0m # Fire immediately -- crash loops need fast response
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash-looping"
            description: "{{ $labels.pod }} has restarted {{ $value }} times in the last hour."
        - alert: HighCPUThrottling
          # Ratio of throttled CFS periods to total CFS periods (0-1)
          expr: |
            sum by (pod, container) (
              rate(container_cpu_cfs_throttled_periods_total{namespace="payments"}[5m])
            )
            /
            sum by (pod, container) (
              rate(container_cpu_cfs_periods_total{namespace="payments"}[5m])
            )
            > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} heavily throttled"
            description: "CPU throttling at {{ $value | humanizePercentage }} -- increase CPU limit."
# Confirm the rule was created
$ kubectl get prometheusrule payment-api-alerts -n payments
NAME                 AGE
payment-api-alerts   2m
# Query current alert state in Prometheus UI or via API
$ curl -s prometheus:9090/api/v1/alerts | jq '.data.alerts[] | {alertname: .labels.alertname, state, labels: (.labels | del(.alertname))}'
{
  "alertname": "PaymentErrorRateTooHigh",
  "state": "firing",
  "labels": {
    "severity": "critical",
    "team": "payments"
  }
}
# Alertmanager routes this to the payments team's PagerDuty
# Slack message in #alerts-payments:
# FIRING: PaymentErrorRateTooHigh
# Payment error rate is 2.47% -- SLO breach imminent.
# Runbook: https://wiki.company.com/runbooks/payment-errors
Teacher's Note: Alert fatigue and the SLO-based alerting model
The most common monitoring failure is not too few alerts — it's too many. Teams that alert on CPU above 80% get paged at 2am for a CPU spike that resolved itself in 30 seconds. Engineers start ignoring alerts. Real incidents are missed.
The better model is SLO-based alerting: define what "good service" means to users (e.g., 99.5% of payments succeed, P99 latency below 500ms), then alert only when you're burning through your error budget faster than sustainable. Symptom-based alerts (user-facing error rate, latency) wake people up. Cause-based alerts (high CPU, memory usage) go to a dashboard for investigation during business hours. This distinction — page on symptoms, not causes — is the single biggest improvement most teams can make to their on-call experience.
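"Burning through your error budget faster than sustainable" can be made concrete with a burn rate: the observed error ratio divided by the error budget (1 minus the SLO). A burn rate of 1 spends the budget exactly over the SLO window; anything higher exhausts it early. The Python sketch below assumes a 30-day SLO window and a hypothetical observed error ratio.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo  # e.g. an SLO of 99.5% leaves 0.5% of requests that may fail
    return error_ratio / budget


def days_until_exhausted(rate, window_days=30):
    """Days until the window's whole budget is gone if the current burn rate continues."""
    return float("inf") if rate <= 0 else window_days / rate


slo = 0.995        # 99.5% of payments succeed
observed = 0.0098  # hypothetical: 0.98% of requests currently failing

current_burn = burn_rate(observed, slo)
days_left = days_until_exhausted(current_burn)
print(f"burn rate {current_burn:.2f}x, budget exhausted in about {days_left:.1f} days")
```

A burn rate just under 2 is worth a ticket but probably not a 2am page. Many teams reserve paging for much higher burn rates sustained over a short window, and pick those thresholds to suit their own SLO window.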
The for: 2m duration in alert rules is your friend. A transient spike that resolves in 90 seconds should never page anyone. Set for to at least 2–5 minutes for most alerts, and reserve for: 0m for genuinely catastrophic conditions like crash loops.
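The effect of the for field can be simulated in a few lines of Python. This is a sketch of the pending-to-firing behaviour, assuming one rule evaluation every 30 seconds; the function and its inputs are illustrative, not part of any Prometheus API.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            # The condition cleared, so the timer resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_s
        states.append("firing" if held_long_enough else "pending")
    return states


# A spike that clears within 90 seconds never leaves "pending", so nobody is paged.
transient = alert_states([True, True, True, False], eval_interval_s=30, for_s=120)

# A sustained problem starts firing once it has held for the full 2 minutes.
sustained = alert_states([True] * 6, eval_interval_s=30, for_s=120)

# With for: 0m the alert fires on the first evaluation where the condition is true.
immediate = alert_states([True], eval_interval_s=30, for_s=0)

print(transient)
print(sustained)
print(immediate)
```

Only alerts in the firing state are sent on to Alertmanager, so the pending period acts as a filter for short-lived spikes.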
Practice Questions
1. Which component of the kube-prometheus-stack exposes Kubernetes object state as Prometheus metrics — including Deployment replica counts, Pod phase, and PVC bound status?
2. Which Kubernetes custom resource tells the Prometheus Operator to automatically scrape a Service's /metrics endpoint — without manually editing Prometheus configuration?
3. In a Prometheus alert rule, which field prevents the alert from firing on a brief transient spike by requiring the condition to be true for a sustained duration?
Quiz
1. What are the four golden signals defined by Google SRE for monitoring services?
2. A developer wants to add user_id as a label on the payments_total counter. Why is this a problem?
3. Your team is getting paged at 2am for CPU alerts that resolve themselves in 2 minutes. What is the best practice to reduce this alert fatigue?