Kubernetes Lesson 53 – Kubernetes Monitoring | Dataplexa

Kubernetes Monitoring

Logs tell you what happened. Metrics tell you how your system performs right now and over time. This lesson covers the Prometheus and Grafana stack, the four golden signals every service should track, alerting with Alertmanager, and the dashboards platform teams rely on to keep production healthy.

The Monitoring Stack

The de-facto Kubernetes monitoring stack is Prometheus + Grafana, installed together as kube-prometheus-stack via Helm. It bundles everything needed out of the box: Prometheus server, Grafana, Alertmanager, node-exporter DaemonSet, kube-state-metrics, and a library of pre-built dashboards and alerts.

Prometheus

Scrapes metrics endpoints every 15–60s. Stores as time-series data. Query language: PromQL. Holds data for 15 days by default.

Grafana

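Visualises Prometheus data as dashboards and charts. Notifications to Slack, PagerDuty, or email are delivered by Alertmanager, which receives alerts from Prometheus; Grafana can also send notifications through its own alerting feature.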

kube-state-metrics

Exposes Kubernetes object state as metrics: Deployment replica counts, Pod phase, PVC bound status, Job completion. Queries the API server.

node-exporter

DaemonSet that exposes host-level metrics: CPU, memory, disk I/O, network, filesystem usage. Runs on every node.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 57.0.3 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.retentionSize=50GB \
  --set grafana.adminPassword=changeme \
  --set alertmanager.alertmanagerSpec.replicas=2

# Access Grafana locally
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# Open http://localhost:3000  admin / changeme
$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace --version 57.0.3
NAME: kube-prometheus-stack
STATUS: deployed  REVISION: 1

$ kubectl get pods -n monitoring
NAME                                                     READY   STATUS
alertmanager-kube-prometheus-stack-alertmanager-0        2/2     Running
kube-prometheus-stack-grafana-7d9f4-xkp2m                3/3     Running
kube-prometheus-stack-kube-state-metrics-abc-def          1/1     Running
kube-prometheus-stack-operator-xyz-123                    1/1     Running
prometheus-kube-prometheus-stack-prometheus-0             2/2     Running
kube-prometheus-stack-prometheus-node-exporter-a1b2      1/1     Running   ← DaemonSet, one per node
kube-prometheus-stack-prometheus-node-exporter-c3d4      1/1     Running


The Four Golden Signals

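Google's SRE book defines four golden signals that, together, give a broad picture of a service's health. If you can only track four things per service, track these. The PromQL examples below assume your services expose an http_requests_total counter and an http_request_duration_seconds histogram with service and status_code labels; adjust the metric and label names to match your own instrumentation before putting them into a Grafana panel.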

1. Latency — How long requests take

Track P50, P95, P99 — not averages. A 500ms average hides a P99 of 5 seconds. Alert on P99, not P50.

# P99 request latency per service over the last 5 minutes
histogram_quantile(0.99,
  sum by (service, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# P50 and P95 for comparison
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# In Grafana or prometheus-ui -- result at this moment:
{service="payment-api"} 0.127    ← P99 = 127ms  ✓
{service="fraud-service"} 0.843  ← P99 = 843ms  ⚠ approaching SLO

# P50 vs P95 vs P99 comparison:
{quantile="0.50"} 0.038   ← median 38ms (most users fine)
{quantile="0.95"} 0.095   ← 95th percentile 95ms
{quantile="0.99"} 0.127   ← P99 127ms -- alert threshold: 500ms, we're green
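
To make the "percentiles, not averages" point concrete, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the idea behind histogram_quantile. The function name and the bucket counts are illustrative assumptions, and the sketch is a simplified model rather than Prometheus's actual implementation.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket. Simplified model of histogram_quantile().
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total                      # position of the target observation
    lower_bound, lower_count = 0.0, 0.0   # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound        # tail bucket: fall back to the last finite bound
            if count == lower_count:
                return upper_bound        # empty bucket, nothing to interpolate
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

# Illustrative cumulative bucket counts (bounds in seconds) and sum of all observations
buckets = [(0.1, 15234), (0.5, 18100), (1.0, 18390), (math.inf, 18432)]
duration_sum = 1842.7

mean = duration_sum / buckets[-1][1]
p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)
print(f"mean={mean:.3f}s  p50={p50:.3f}s  p99={p99:.3f}s")
```

With these sample numbers the mean comes out near 100ms while the estimated P99 is roughly 755ms, so an average-based alert would miss the slow tail. The estimate is only as precise as the bucket boundaries, which is why bucket choice matters around your SLO threshold.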

2. Traffic — How much demand hits the system

Requests per second. Use this to correlate latency spikes with traffic spikes, and to right-size HPA targets.

# Requests per second per service
sum by (service) (rate(http_requests_total[5m]))

# Requests per second for one service, broken down by HTTP status code
sum by (status_code) (rate(http_requests_total{service="payment-api"}[5m]))
# Current traffic:
{service="payment-api"}  142.3    ← 142 req/s
{service="fraud-service"} 98.7

# By status code (payment-api):
{status_code="200"} 140.1   ← 98.5% success
{status_code="500"} 1.4     ← 1.0% server errors
{status_code="400"} 0.8     ← 0.6% client errors
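
The rate() function used above turns an ever-increasing counter into a per-second value. The following Python sketch shows the core idea under simplifying assumptions: it handles counter resets but skips the window-boundary extrapolation that Prometheus applies. The function name and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    Simplified model of rate(): handles counter resets, but does not
    extrapolate to the edges of the query window as Prometheus does.
    Timestamps are assumed to be strictly increasing.
    """
    if len(samples) < 2:
        return None                      # not enough data points in the window
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            increase += value            # counter reset (e.g. Pod restart): count from zero
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical scrapes of http_requests_total, 30 seconds apart.
# The drop from 160 to 10 simulates a Pod restart resetting the counter.
samples = [(0, 100), (30, 160), (60, 10), (90, 70)]
print(round(simple_rate(samples), 3))
```

The reset handling is why you should always apply rate() to the raw counter before aggregating with sum(): summing first would hide the reset from the calculation.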

3. Errors — What fraction of requests fail

Error rate as a percentage of total traffic. Alert when it exceeds your SLO threshold (e.g. 0.5% error rate).

# HTTP 5xx error rate as percentage of total traffic
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# Container restarts per Pod over the last hour (infrastructure errors)
sum by (namespace, pod) (
  increase(kube_pod_container_status_restarts_total[1h])
)
# Error rate result:
0.98   ← 0.98% 5xx rate -- SLO is 0.5%, we're BREACHING  🚨

# Restarts per Pod (last hour):
{namespace="payments", pod="payment-api-7d9f4-xkp2m"}  0
{namespace="payments", pod="payment-api-7d9f4-rvqn2"}  3  ← crash-looping -- investigate!

4. Saturation — How full the system is

CPU throttling, memory pressure, queue depth. A system near saturation degrades before it fails — catch it early.

# CPU throttling ratio per container (high value = CPU limit too low for the workload)
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total[5m])
)

# Memory usage vs limit (alert when above 85%); skips containers with no limit set
container_memory_working_set_bytes{container!=""}
/
(container_spec_memory_limit_bytes{container!=""} > 0)
* 100
# CPU throttling result (fraction 0-1):
{pod="payment-api-7d9f4-xkp2m", container="payment-api"}  0.42  ← throttled in 42% of CPU periods -- limit too low!

# Memory vs limit:
{pod="payment-api-7d9f4-xkp2m", container="payment-api"}  87   ← 87% of limit -- near OOMKill risk
# Fix: raise the CPU limit (throttling is enforced against the limit, not the request) and the memory limit from 256Mi to 512Mi
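
The two saturation ratios are simple divisions, but both need a guard for the "nothing to divide by" case. The Python sketch below mirrors the arithmetic; the helper names and input numbers are hypothetical, and the 0.25 and 85% thresholds are the ones used as examples in this lesson.

```python
MIB = 1024 * 1024

def throttled_fraction(throttled_periods, total_periods):
    """Fraction of CFS periods in which the container was throttled (0 to 1)."""
    if total_periods == 0:
        return 0.0                        # container did not run in this window
    return throttled_periods / total_periods

def memory_percent_of_limit(working_set_bytes, limit_bytes):
    """Working set as a percentage of the memory limit, or None if no limit is set."""
    if limit_bytes == 0:
        return None                       # a limit of 0 means "no limit configured"
    return working_set_bytes / limit_bytes * 100

# Hypothetical deltas over a 5-minute window for one container
cpu = throttled_fraction(throttled_periods=126, total_periods=300)
mem = memory_percent_of_limit(working_set_bytes=223 * MIB, limit_bytes=256 * MIB)

print(f"throttled={cpu:.2f}  memory={mem:.1f}%")
print("cpu warning:", cpu > 0.25)
print("memory warning:", mem is not None and mem > 85)
```

Treating a missing limit as "no result" rather than as zero avoids a division by zero and keeps unlimited containers out of the alert.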

Exposing Custom Application Metrics

Your application can expose its own business metrics — payment success rates, queue depths, cache hit ratios — in the Prometheus format. Prometheus scrapes them automatically via a ServiceMonitor resource.

# Python: expose metrics with prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

payments_total = Counter(
    'payments_total',
    'Total payment attempts',
    ['status', 'currency']   # Labels -- group metrics by status and currency
)
payment_duration = Histogram(
    'payment_duration_seconds',
    'Payment processing duration',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
)

# Serve /metrics on port 8000 (call once at startup; the port is an example)
start_http_server(8000)

# In your payment handler (process_payment is your own business logic):
def handle_payment(amount, currency):
    with payment_duration.time():       # Automatically record duration
        result = process_payment(amount, currency)
    payments_total.labels(
        status='success' if result.ok else 'failed',
        currency=currency
    ).inc()
    return result
# What the /metrics endpoint exposes:
$ kubectl port-forward svc/payment-api 8080:80 -n payments &
$ curl -s http://localhost:8080/metrics | grep payments

# HELP payments_total Total payment attempts
# TYPE payments_total counter
payments_total{currency="USD",status="success"} 18432.0
payments_total{currency="USD",status="failed"} 47.0
payments_total{currency="EUR",status="success"} 3291.0

# HELP payment_duration_seconds Payment processing duration
# TYPE payment_duration_seconds histogram
payment_duration_seconds_bucket{le="0.1"} 15234.0
payment_duration_seconds_bucket{le="0.5"} 18100.0
payment_duration_seconds_bucket{le="+Inf"} 18432.0
payment_duration_seconds_sum 1842.7
payment_duration_seconds_count 18432.0
# ServiceMonitor: tells Prometheus where to scrape your application
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-api
  namespace: payments
  labels:
    release: kube-prometheus-stack   # Must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payment-api               # Selects the Service exposing /metrics
  endpoints:
    - port: metrics                  # Named port on the Service
      path: /metrics
      interval: 30s                  # Scrape every 30 seconds
  namespaceSelector:
    matchNames:
      - payments
# What Prometheus sees at /metrics on your Pod:
# HELP payments_total Total payment attempts
# TYPE payments_total counter
payments_total{status="success",currency="USD"} 18432
payments_total{status="failed",currency="USD"} 47
payments_total{status="success",currency="EUR"} 3291
payments_total{status="failed",currency="EUR"} 12

# HELP payment_duration_seconds Payment processing duration
# TYPE payment_duration_seconds histogram
payment_duration_seconds_bucket{le="0.1"} 15234
payment_duration_seconds_bucket{le="0.5"} 18100
payment_duration_seconds_bucket{le="1.0"} 18390
payment_duration_seconds_bucket{le="+Inf"} 18432
payment_duration_seconds_sum 1842.7
payment_duration_seconds_count 18432

# PromQL: payment success rate
sum(rate(payments_total{status="success"}[5m]))
/
sum(rate(payments_total[5m]))
* 100
# Result: 99.72%  -- your SLO is 99.5%, you're green

What just happened?

ServiceMonitor is the Kubernetes-native way to configure scraping. Instead of editing Prometheus's scrape_configs manually, the Prometheus Operator watches for ServiceMonitor resources and automatically regenerates the Prometheus configuration. You deploy a new service with a ServiceMonitor, and Prometheus picks it up shortly afterwards (typically within a minute or so), with no Prometheus restart required.

Labels on metrics are dimensions for slicing. The status and currency labels let you slice payment results by currency. Which currency produces the most failures? PromQL: sum by (currency) (rate(payments_total{status="failed"}[5m])) returns failed payments per second for each currency; divide it by sum by (currency) (rate(payments_total[5m])) to get the failure rate as a fraction. Every label you add becomes a free dimension for filtering and grouping, but keep label cardinality low: avoid user IDs or trace IDs as labels, because those belong in logs, not metrics.
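
To see why cardinality matters, note that each distinct combination of label values becomes its own time series. The short Python sketch below computes the worst-case series count for a single metric; the label counts are hypothetical, and real usage is often lower because only observed combinations create series.

```python
from math import prod

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical numbers of distinct values per label on payments_total
low_cardinality = {"status": 2, "currency": 5}
with_user_id = {**low_cardinality, "user_id": 100_000}

print(max_series(low_cardinality))   # 10
print(max_series(with_user_id))      # 1000000
```

Adding one unbounded label multiplies the series count rather than adding to it, which is what drives up Prometheus memory use and slows queries.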

Alerting with Alertmanager

Prometheus evaluates alert rules against the metrics it holds. When a rule fires, it sends the alert to Alertmanager, which handles routing, deduplication, grouping, and notification to Slack, PagerDuty, email, or any webhook.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-api-alerts
  namespace: payments
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: payment-api
      interval: 30s                         # Evaluate these rules every 30 seconds
      rules:
        - alert: PaymentErrorRateTooHigh
          expr: |
            (
              sum(rate(payments_total{status="failed"}[5m]))
              /
              sum(rate(payments_total[5m]))
            ) * 100 > 1
          for: 2m                           # Must be true for 2 minutes before firing
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment error rate above 1%"
            description: "Payment error rate is {{ $value | humanize }}% — SLO breach imminent."
            runbook_url: "https://wiki.company.com/runbooks/payment-errors"

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="payments"}[1h]) > 3
          for: 0m                           # Fire immediately -- crash loops need fast response
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash-looping"
            description: "{{ $labels.pod }} has restarted {{ $value }} times in the last hour."

        - alert: HighCPUThrottling
          expr: |
            sum by (pod, container) (
              rate(container_cpu_cfs_throttled_periods_total{namespace="payments"}[5m])
            )
            /
            sum by (pod, container) (
              rate(container_cpu_cfs_periods_total{namespace="payments"}[5m])
            )
            > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} heavily throttled"
            description: "CPU throttling at {{ $value | humanizePercentage }} -- increase CPU limit."
# Confirm the rule was created
$ kubectl get prometheusrule payment-api-alerts -n payments
NAME                  AGE
payment-api-alerts    2m

# Query current alert state in Prometheus UI or via API (alertname is stored inside labels)
$ curl -s prometheus:9090/api/v1/alerts | jq '.data.alerts[] | {alertname: .labels.alertname, state, labels: (.labels | del(.alertname))}'
{
  "alertname": "PaymentErrorRateTooHigh",
  "state": "firing",
  "labels": {
    "severity": "critical",
    "team": "payments"
  }
}

# Alertmanager routes this to the payments team's PagerDuty
# Slack message in #alerts-payments:
# FIRING: PaymentErrorRateTooHigh
# Payment error rate is 2.47% -- SLO breach imminent.
# Runbook: https://wiki.company.com/runbooks/payment-errors

Teacher's Note: Alert fatigue and the SLO-based alerting model

The most common monitoring failure is not too few alerts — it's too many. Teams that alert on CPU above 80% get paged at 2am for a CPU spike that resolved itself in 30 seconds. Engineers start ignoring alerts. Real incidents are missed.

The better model is SLO-based alerting: define what "good service" means to users (e.g., 99.5% of payments succeed, P99 latency below 500ms), then alert only when you're burning through your error budget faster than sustainable. Symptom-based alerts (user-facing error rate, latency) wake people up. Cause-based alerts (high CPU, memory usage) go to a dashboard for investigation during business hours. This distinction — page on symptoms, not causes — is the single biggest improvement most teams can make to their on-call experience.
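
"Burning through your error budget" can be expressed as a burn rate: the observed error ratio divided by the error ratio the SLO allows. The Python sketch below shows that arithmetic. The function names are hypothetical, the 30-day window is an assumption, and 14.4 is used only as an example paging threshold (a value commonly cited for a one-hour window); pick thresholds that suit your own SLOs.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    error_budget = 1.0 - slo_target       # e.g. 99.5% SLO leaves a 0.5% budget
    return error_ratio / error_budget

def hours_until_exhausted(rate, window_days=30):
    """Hours until the whole budget is gone if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate

# Example: 99.5% payment success SLO, currently 2.47% of payments failing
rate = burn_rate(error_ratio=0.0247, slo_target=0.995)
hours = hours_until_exhausted(rate)

PAGE_THRESHOLD = 14.4                     # assumption: example fast-burn threshold
print(f"burn rate={rate:.2f}  budget lasts ~{hours:.0f}h  page={rate > PAGE_THRESHOLD}")
```

A burn rate of 1 means the budget lasts exactly the SLO window. In this example the budget would be gone in about six days, which may warrant a ticket or a slower-burn alert rather than an immediate page.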

The for: 2m duration in alert rules is your friend. A transient spike that resolves in 90 seconds should never page anyone. Set for to at least 2–5 minutes for most alerts, and reserve for: 0m for genuinely catastrophic conditions like crash loops.
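
The effect of the for field can be modelled as a small state machine: an alert whose condition is true stays pending until it has been continuously true for the configured duration, and only then fires. The Python sketch below is a simplified model of that behaviour, not Prometheus's code; the function name and evaluation data are hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) in time order.
    """
    states = []
    active_since = None                   # when the condition first became true
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None           # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states

# Rules evaluated every 30 seconds, with for: 2m (120 seconds)
transient_spike = [(0, True), (30, True), (60, True), (90, False)]
sustained = [(0, True), (30, True), (60, True), (90, True), (120, True), (150, True)]

print(alert_states(transient_spike, 120))
print(alert_states(sustained, 120))
print(alert_states([(0, True)], 0))       # for: 0m fires on the first evaluation
```

The 90-second spike never leaves the pending state, so nobody is paged, while the sustained condition starts firing once it has held for the full two minutes.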

Practice Questions

1. Which component of the kube-prometheus-stack exposes Kubernetes object state as Prometheus metrics — including Deployment replica counts, Pod phase, and PVC bound status?



2. Which Kubernetes custom resource tells the Prometheus Operator to automatically scrape a Service's /metrics endpoint — without manually editing Prometheus configuration?



3. In a Prometheus alert rule, which field prevents the alert from firing on a brief transient spike by requiring the condition to be true for a sustained duration?



Quiz

1. What are the four golden signals defined by Google SRE for monitoring services?


2. A developer wants to add user_id as a label on the payments_total counter. Why is this a problem?


3. Your team is getting paged at 2am for CPU alerts that resolve themselves in 2 minutes. What is the best practice to reduce this alert fatigue?

