Why Monitoring in Kubernetes is Essential
Unlike traditional applications running on fixed servers, Kubernetes workloads are highly dynamic. Pods are constantly created and destroyed, scaling up and down as needed. This makes monitoring more complex but also more critical.
Without proper monitoring, you risk:
- Unexpected outages – Pods crash, nodes fail, and services can break without warning.
- Overspending on cloud resources – Without visibility into CPU, memory, and storage usage, costs can spiral out of control.
- Difficult troubleshooting – Debugging issues in a distributed system without logs and metrics is like searching for a needle in a haystack.
A strong monitoring setup helps you catch problems before they escalate while keeping costs under control.
Key Monitoring Data
Monitoring in Kubernetes is typically divided into three main types of data:
1. Metrics (Performance Monitoring)
These provide real-time insights into the health of your cluster.
Key Metrics to Track:
- CPU and memory usage (per pod, node, and cluster level)
- Network traffic (incoming and outgoing data)
- Pod restarts and failures
- Cluster resource allocation (how efficiently resources are used)
Example:
If the Metrics Server is installed in the cluster, you can use kubectl to check current CPU and memory usage:
kubectl top pod --all-namespaces
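To turn that raw listing into something actionable, a small filter can flag pods above a memory threshold. The helper below is a minimal sketch: the function name `pods_over_memory` and the threshold are hypothetical, and it assumes that `kubectl top` reports memory in Mi with the namespace in the first column.

```shell
# pods_over_memory LIMIT_MI
# Prints "namespace/pod memory" for every pod whose current memory usage
# exceeds LIMIT_MI mebibytes. Needs the Metrics Server to be running.
pods_over_memory() {
  limit_mi="$1"
  kubectl top pod --all-namespaces --no-headers |
    awk -v limit="$limit_mi" '{
      mem = $4                  # column 4 is MEMORY(bytes), e.g. "345Mi"
      sub(/Mi$/, "", mem)       # assumption: the value carries an Mi suffix
      if (mem + 0 > limit + 0) print $1 "/" $2, $4
    }'
}

# Usage (hypothetical threshold of 500 Mi):
#   pods_over_memory 500
```

For a quick look without filtering, `kubectl top pod --all-namespaces --sort-by=memory` orders the same listing by memory usage.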
2. Logs (Application and System Logs)
Logs help you understand what happened when things go wrong.
Key Logs to Monitor:
- Pod logs (container output, errors, stack traces)
- Node logs (Kubelet, system messages)
- Control plane logs (API server, scheduler, controller manager)
Example:
Check logs for a specific pod:
kubectl logs my-pod -n my-namespace
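When a container has already crashed and restarted, the current logs often start after the interesting part. A possible wrapper for fetching the tail of the previous container instance is sketched below; the function name `crash_logs` and the default of 50 lines are illustrative choices, not a convention.

```shell
# crash_logs POD NAMESPACE [LINES]
# Prints the last LINES (default 50) log lines written by the previous,
# terminated instance of the pod's container.
crash_logs() {
  kubectl logs "$1" -n "$2" --previous --tail="${3:-50}"
}

# Usage:
#   crash_logs my-pod my-namespace
#   crash_logs my-pod my-namespace 200
```

For multi-container pods, add `-c <container-name>` to pick the container, and use `--since=1h` instead of `--tail` to limit by time.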
3. Events (Cluster-Wide Activity)
Events show what Kubernetes is doing behind the scenes.
Key Events to Monitor:
- Pod scheduling failures (if a pod can’t find a node to run on)
- OOMKills (containers killed for running out of memory)
- Network policies being applied or changed
Example:
See the most recent events in your cluster:
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
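On a busy cluster the full event list is noisy, so it can help to look only at warnings and count them by reason. The following is one way to sketch that; `warning_reasons` is a hypothetical helper name.

```shell
# warning_reasons
# Counts Warning events across all namespaces, grouped by reason,
# and prints the most frequent reasons first.
warning_reasons() {
  kubectl get events --all-namespaces --field-selector type=Warning \
    --no-headers -o custom-columns=REASON:.reason |
    awk '{ count[$1]++ } END { for (r in count) print count[r], r }' |
    sort -rn
}

# Usage:
#   warning_reasons
```

A spike in a reason such as FailedScheduling usually points at the scheduling failures described above.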
Open-Source Tools for Monitoring
There are many open-source tools to monitor Kubernetes efficiently. These tools provide visibility into costs, workloads, and performance without vendor lock-in.
1. OpenCost (Cost Monitoring)
What it does:
- Provides real-time cost visibility into Kubernetes workloads.
- Helps teams track and allocate cloud costs per namespace, pod, and container.
- Identifies overprovisioned resources to reduce waste.
How to Install and Use OpenCost
Step 1: Install OpenCost in Your Cluster
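OpenCost reads its usage data from Prometheus, so it needs a reachable Prometheus instance (see the Prometheus section below); check the chart's values for how to point it at yours. You can deploy OpenCost using Helm: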
helm repo add opencost https://opencost.github.io/opencost-helm-chart/
helm repo update
helm install opencost opencost/opencost --namespace opencost --create-namespace
Step 2: Verify the Installation
Check that the OpenCost pod is running:
kubectl get pods -n opencost
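Instead of re-running that command until everything is up, you could block until the pods report Ready. This is a minimal sketch; `wait_for_pods` is a hypothetical name and the 120 second default timeout is an arbitrary choice.

```shell
# wait_for_pods NAMESPACE [TIMEOUT]
# Blocks until every pod in NAMESPACE is Ready, or fails after TIMEOUT
# (default 120s).
wait_for_pods() {
  kubectl wait --for=condition=Ready pods --all -n "$1" --timeout="${2:-120s}"
}

# Usage:
#   wait_for_pods opencost
#   wait_for_pods monitoring 300s
```

The same helper works for the other namespaces used in this guide.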
Step 3: Access the OpenCost UI
Expose OpenCost with port forwarding:
kubectl port-forward -n opencost svc/opencost 9090:9090
Now, visit http://localhost:9090 in your browser.
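Step 4: View Costs for Workloads
OpenCost does not add a kubectl resource type, so query its API instead. Forward the API port in a second terminal, then request the allocation data aggregated by namespace (the endpoint path may differ between OpenCost versions):
kubectl port-forward -n opencost svc/opencost 9003:9003
curl -G http://localhost:9003/allocation --data-urlencode "window=7d" --data-urlencode "aggregate=namespace"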
Example Use Cases
Find the most expensive namespace
- Helps you identify cost-heavy workloads so you can optimize them.
Break down costs by CPU, memory, and storage
- If a pod is overprovisioned, you can adjust resource requests to reduce waste.
Track costs over time
- Integrate OpenCost with Grafana to visualize trends.
2. Prometheus (Metrics Collection & Monitoring)
What it does:
- Collects real-time CPU, memory, and network usage.
- Stores data in a time-series database for historical analysis.
- Discovers Kubernetes targets natively and pairs with exporters such as kube-state-metrics.
How to Install and Use Prometheus
Step 1: Install Prometheus with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
Step 2: Verify the Installation
Check running pods:
kubectl get pods -n monitoring
Step 3: Access the Prometheus UI
Expose Prometheus:
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090
Now visit http://localhost:9090.
Step 4: Query Metrics in Prometheus
Run this PromQL query to check CPU usage:
rate(container_cpu_usage_seconds_total[5m])
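The same query can be run from a terminal through the Prometheus HTTP API, which is handy for scripting. The sketch below assumes the port-forward from Step 3 is still running; `prom_query`, `PROM_URL`, and `CPU_BY_POD` are names made up for this example.

```shell
# prom_query EXPR
# Sends an instant query to the Prometheus HTTP API and prints the JSON reply.
# PROM_URL defaults to the port-forwarded address used in this guide.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Per-pod CPU usage in cores, averaged over the last five minutes.
CPU_BY_POD='sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'

# Usage:
#   prom_query "$CPU_BY_POD"
```

Summing by namespace and pod collapses the per-container series into one value per pod, which is usually easier to read than the raw output.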
Example Use Cases
Monitor pod resource usage
- Prevent pod crashes by setting resource limits based on actual usage.
Alert on high CPU/memory consumption
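- Example: Get an alert when a pod uses more than 80% of its CPU limit (this assumes CPU limits are set and kube-state-metrics is running):
groups:
  - name: high-cpu
    rules:
      - alert: HighCPUUsage
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
            > 0.8
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU Usage Detected!"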
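With kube-prometheus-stack, alert rules are normally delivered as PrometheusRule resources rather than edited into a rules file. The sketch below wraps a high-CPU rule that way. It assumes the Helm release is named `prometheus` (the `release` label is what the operator selects on by default) and that kube-state-metrics is available; the expression divides usage by the CPU limit so the threshold reads as a fraction of the limit. The function name `apply_high_cpu_rule` is hypothetical.

```shell
# apply_high_cpu_rule
# Creates or updates a PrometheusRule containing the HighCPUUsage alert.
apply_high_cpu_rule() {
  kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: high-cpu
  namespace: monitoring
  labels:
    release: prometheus   # assumption: matches the Helm release name
spec:
  groups:
    - name: high-cpu
      rules:
        - alert: HighCPUUsage
          expr: |
            sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
              / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
              > 0.8
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High CPU Usage Detected!"
EOF
}

# Usage:
#   apply_high_cpu_rule
```

After applying it, the rule should show up under Alerts in the Prometheus UI once the operator reloads the configuration.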
Track performance trends over time
- Helps with capacity planning by analyzing past performance.
3. Grafana (Kubernetes Dashboard & Visualization)
What it does:
- Provides beautiful dashboards for Kubernetes monitoring.
- Connects with Prometheus, OpenCost, and Loki.
- Supports alerting and notifications.
How to Install and Use Grafana
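Step 1: Install Grafana with Helm
The kube-prometheus-stack chart installed in the Prometheus section already bundles Grafana, so this step is only needed if you want a separate, standalone instance: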
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana --namespace monitoring
Step 2: Get the Admin Password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode
Step 3: Access the Grafana Dashboard
Expose Grafana:
kubectl port-forward svc/grafana 3000:80 -n monitoring
Now visit http://localhost:3000 and log in as admin with the retrieved password.
Step 4: Add Prometheus as a Data Source
- Go to Connections > Data sources (Configuration > Data Sources in older Grafana versions).
- Click Add data source.
- Select Prometheus and enter:
- URL: http://prometheus-kube-prometheus-prometheus.monitoring:9090
- Click Save & Test.
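If you would rather script this step than click through the UI, Grafana's HTTP API can create the data source. This is a minimal sketch that assumes the port-forward from Step 3 and the admin password from Step 2; `add_prometheus_datasource` and `GRAFANA_URL` are names invented for this example.

```shell
# add_prometheus_datasource ADMIN_PASSWORD
# Registers the in-cluster Prometheus service as a Grafana data source
# through the Grafana HTTP API.
add_prometheus_datasource() {
  curl -s -u "admin:$1" -H 'Content-Type: application/json' \
    -X POST "${GRAFANA_URL:-http://localhost:3000}/api/datasources" \
    -d '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"http://prometheus-kube-prometheus-prometheus.monitoring:9090"}'
}

# Usage (password retrieved in Step 2):
#   add_prometheus_datasource "<admin-password>"
```

Running it twice returns an error for the second call because a data source with that name already exists.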
Step 5: Import a Kubernetes Monitoring Dashboard
- Go to Dashboards > Import.
- Use the dashboard ID 3119 (Kubernetes Cluster Monitoring).
- Click Import and enjoy your real-time cluster dashboard!
Example Use Cases
Monitor CPU and memory usage visually
- Track trends over time to prevent bottlenecks.
Create alerts based on real-time data
- Example: Send Slack alerts if a pod restarts more than 5 times in 10 minutes.
Combine logs, metrics, and costs in one view
- Helps with troubleshooting and optimization.
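The restart alert mentioned above needs a query to evaluate. One possible expression, assuming kube-state-metrics is exporting `kube_pod_container_status_restarts_total`, can be generated with a small helper so the threshold and window are easy to tweak; `restart_alert_query` is a hypothetical name.

```shell
# restart_alert_query [THRESHOLD] [WINDOW]
# Prints a PromQL expression that is true for containers restarted more
# than THRESHOLD times (default 5) within WINDOW (default 10m).
restart_alert_query() {
  threshold="${1:-5}"
  window="${2:-10m}"
  printf 'increase(kube_pod_container_status_restarts_total[%s]) > %s\n' \
    "$window" "$threshold"
}

# Usage:
#   restart_alert_query          # defaults: 5 restarts in 10m
#   restart_alert_query 3 15m
```

Paste the printed expression into a Grafana alert rule or the Prometheus UI to check it against live data before attaching a Slack contact point.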
4. Loki (Log Aggregation & Analysis)
What it does:
- Collects container logs without massive storage overhead.
- Integrates with Grafana for centralized log visualization.
- Supports log-based alerts.
How to Install and Use Loki
Step 1: Install Loki with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --namespace logging --create-namespace
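Step 2: Forward Logs to Loki
The loki-stack chart deploys Promtail by default, which ships pod logs to Loki without extra configuration. If you use Fluent Bit instead, add a Loki output to its ConfigMap:
[OUTPUT]
    Name   loki
    Match  *
    Host   loki.logging.svc
    Port   3100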
Step 3: Query Logs in Grafana
- Go to Explore > Loki.
- Run a log query:
{namespace="default"} |= "error"
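The same LogQL query can be sent to Loki's HTTP API directly, which is useful for checking that logs are arriving before Grafana is wired up. The sketch assumes Loki has been port-forwarded locally (for example `kubectl port-forward -n logging svc/loki 3100:3100`); `loki_query` and `LOKI_URL` are hypothetical names.

```shell
# loki_query LOGQL [LIMIT]
# Runs a LogQL query over Loki's default time range and prints the JSON
# reply with at most LIMIT entries (default 20).
loki_query() {
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}

# Usage:
#   loki_query '{namespace="default"} |= "error"'
```

An empty result list usually means either no matching lines or that the log shipper is not yet pushing to Loki.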
Example Use Cases
Find errors in pod logs
- Search for all logs containing “Out of Memory”.
Correlate logs with metrics
- See when an error occurred and how it affected performance.
Set up log-based alerts
- Example: Notify teams if “connection refused” appears more than 5 times in 5 minutes.
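A log-based alert like this one boils down to a LogQL metric query that counts matching lines over a window. The helper below is a sketch of how such an expression might be assembled; `log_alert_expr` is a hypothetical name, and the namespace, pattern, window, and threshold are all parameters you would adapt.

```shell
# log_alert_expr NAMESPACE PATTERN WINDOW THRESHOLD
# Prints a LogQL expression that is true when PATTERN appears more than
# THRESHOLD times within WINDOW in the given namespace.
log_alert_expr() {
  printf 'sum(count_over_time({namespace="%s"} |= "%s" [%s])) > %s\n' \
    "$1" "$2" "$3" "$4"
}

# Usage:
#   log_alert_expr default "connection refused" 5m 5
```

The printed expression can be used as the condition of a Grafana alert rule backed by the Loki data source.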
Best Practices for Kubernetes Monitoring
1. Use OpenCost to Track Cloud Costs
- Identify overprovisioned pods and cut unnecessary spending.
- Allocate costs by namespace, team, or project.
2. Set Up Alerts for Resource Overuse
- Use Prometheus Alertmanager to notify teams of spikes.
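- Example alert for high CPU usage (assumes CPU limits are set and kube-state-metrics is running):
groups:
  - name: high-cpu
    rules:
      - alert: HighCPUUsage
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
            > 0.8
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU Usage Detected!"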
3. Correlate Logs and Metrics with Loki & Prometheus
- If a pod crashes, check logs and metrics together for root cause analysis.
4. Use Grafana for Easy Visualization
- Set up dashboards for CPU, memory, cost, and pod health.
Final Thoughts: Do You Need Kubernetes Monitoring?
If you’re running Kubernetes, monitoring is a must. Whether you’re a small team or a large enterprise, using open-source tools like OpenCost, Prometheus, and Grafana will help you prevent downtime, optimize costs, and troubleshoot faster.