Amazon Elastic Kubernetes Service (EKS) makes running Kubernetes a much smoother experience, but upgrading an EKS cluster can be anything but smooth. An upgrade touches both the control plane (the brain of your cluster) and the data plane (the worker nodes actually running workloads). Each component can break things if not handled properly.
Karpenter, the open-source node autoscaler from AWS, brings efficiency to node provisioning and scaling. But because it interacts directly with your workloads and AWS infrastructure, any EKS version change can affect how Karpenter behaves.
The reality: upgrading is unavoidable, but with the right plan you can do it without breaking production. In this guide, we’ll walk through the process end-to-end so you can upgrade smoothly and confidently.
Why you should upgrade your EKS cluster
Upgrading your cluster is not about chasing shiny features. It’s about avoiding risk and keeping your environment stable.
- Security patches: Kubernetes releases regularly fix vulnerabilities. Falling behind means exposing workloads to known exploits.
- Feature parity: APIs evolve. Some get deprecated. Others unlock new functionality. If you’re on an older version, you risk being blocked by unsupported APIs when deploying new workloads.
- AWS support window: EKS supports each Kubernetes version for a limited window, roughly 14 months of standard support, followed by paid extended support. Once your version ages out of standard support, staying patched costs extra, and clusters that reach the end of extended support are upgraded automatically whether you are ready or not.
- Performance gains: Upgrades often bring improvements in scheduling, scalability, and networking performance.
Delaying upgrades piles up technical debt. The further behind you fall, the riskier and more complex the eventual upgrade becomes.
Common issues during upgrades
Understanding the usual pain points will help you anticipate them.
- Control plane vs. data plane: The control plane upgrades first, but worker nodes must be updated separately. Running old nodes with a new control plane can lead to unexpected behavior.
- Version skew: Components like kubelet, kube-proxy, and the AWS VPC CNI plugin must remain compatible with the control plane. A mismatch is a classic cause of broken networking or pods stuck in Pending.
- API deprecations: Each Kubernetes release removes APIs. For example, PodSecurityPolicy was deprecated in 1.21 and removed in 1.25. Workloads still using it will fail on upgrade.
- Karpenter quirks:
  - Node provisioning logic may change between Karpenter versions. If you haven’t validated compatibility with the target Kubernetes release, Karpenter may fail to launch nodes or launch them with an incompatible AMI or configuration.
  - When nodes recycle, running workloads can be evicted. Without disruption budgets, you risk downtime.
Knowing these pitfalls upfront is the best way to avoid firefighting later. Next, let’s map out how to prepare properly.
Planning your upgrade
Preparation separates smooth upgrades from war stories.
Pre-upgrade checklist
- Audit your cluster for deprecated APIs:
```bash
kubectl get --raw /metrics | grep deprecated
```
- Review workloads and Helm charts for incompatible resources.
- Validate operators and controllers against the target Kubernetes version. Check release notes from vendors.
- Confirm Karpenter’s supported versions. The Karpenter docs clearly state compatibility with Kubernetes releases.
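If you're not sure which Karpenter release you're currently running, a quick way to check is to read the controller image tag. A minimal sketch, assuming Karpenter was installed with the standard Helm chart into a namespace named karpenter:

```bash
# Print the Karpenter controller image; the tag is the installed version.
kubectl get deployment -n karpenter karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```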
Staging environment
- Mirror production as closely as possible. Run the upgrade process there first to find breakages without risking real workloads.
Backup strategy
- Ensure you have etcd snapshots if using self-managed control planes.
- Export cluster manifests and custom resources so you can recover quickly.
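A minimal sketch of a manifest export, assuming you want quick YAML dumps rather than a full backup tool such as Velero; adjust the resource list to what your cluster actually runs:

```bash
# Dump common workload and config resources across all namespaces.
kubectl get deploy,sts,ds,svc,ing,cm,secret --all-namespaces -o yaml > workloads-backup.yaml

# Dump CRD definitions, then the instances of each custom resource type.
kubectl get crd -o yaml > crds-backup.yaml
for crd in $(kubectl get crd -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get "$crd" --all-namespaces -o yaml > "backup-${crd}.yaml"
done
```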
Capacity planning
- Plan for temporary double usage. When old nodes are drained, new nodes spin up to replace them, meaning you need enough AWS quota and budget for overlap.
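One way to sanity-check headroom before you start is to look up your On-Demand vCPU quota for the region. A sketch assuming On-Demand standard instances, whose quota code is L-1216C47A (the A, C, D, H, I, M, R, T, and Z families):

```bash
# Current regional vCPU limit for running On-Demand standard instances.
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --query 'Quota.Value'
```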
Once these basics are covered, you’re ready to start the upgrade with confidence.
Step by step: executing the upgrade with minimal disruption
Here’s the process that experienced practitioners follow to keep systems online.
Step 1: Upgrade the control plane
- Use the AWS console, CLI, or IaC (Terraform, CDK, etc.) to bump the cluster version.
- Verify the new control plane version:
```bash
aws eks describe-cluster --name my-cluster --query cluster.version
```
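If you drive the upgrade from the CLI rather than the console or IaC, the whole step is two calls. A minimal sketch with a hypothetical cluster name and target version; keep in mind the control plane only moves one minor version per upgrade:

```bash
# Request the control plane upgrade to the next minor version.
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.30

# Block until the control plane is active again before touching the data plane.
aws eks wait cluster-active --name my-cluster
```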
Step 2: Upgrade node groups
- For managed node groups: trigger an update through the console or CLI (a CLI sketch follows this step). AWS handles draining and replacement.
- For self-managed nodes: do a rolling update. Drain each node gracefully:
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
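For the managed node group path, a minimal sketch assuming a hypothetical node group named my-nodes and the same target version as the control plane step above:

```bash
# Roll the managed node group to the AMI for the new Kubernetes version.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --kubernetes-version 1.30

# Wait for the rolling replacement to complete.
aws eks wait nodegroup-active --cluster-name my-cluster --nodegroup-name my-nodes
```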
Step 3: Handle Karpenter provisioners
- Update the Karpenter controller and its manifests (NodePools, or Provisioners on older releases) to a version that supports the target Kubernetes release.
- Configure PodDisruptionBudgets (PDBs) so workloads are not evicted all at once (see the sketch after this list).
- Use topology spread constraints so pods distribute evenly across nodes and AZs.
- Pre-scale with Karpenter: set desired replicas higher before draining, so capacity is available when nodes go offline.
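A minimal sketch of the PDB and topology spread pieces, assuming a hypothetical Deployment named web labeled app: web; adapt the selector and replica floor to your own services:

```bash
# Keep at least 2 web pods available while Karpenter or kubectl drains nodes.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF

# Spread web pods across availability zones so draining one zone can't take
# out every replica at once.
kubectl patch deployment web --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/topologySpreadConstraints", "value": [
    {"maxSkew": 1, "topologyKey": "topology.kubernetes.io/zone",
     "whenUnsatisfiable": "ScheduleAnyway",
     "labelSelector": {"matchLabels": {"app": "web"}}}
  ]}
]'
```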
Step 4: Upgrade networking and addons
- Update the VPC CNI, CoreDNS, and kube-proxy using eksctl or the AWS console (see the CLI sketch after this step).
- Confirm pod networking works by deploying a simple busybox pod and testing DNS resolution:
```bash
kubectl run test-dns --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
```
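If these run as EKS managed add-ons, the updates can also be scripted. A sketch with a placeholder add-on version; use describe-addon-versions to find the builds that match your target Kubernetes release, and repeat for vpc-cni and kube-proxy:

```bash
# List CoreDNS builds compatible with the new cluster version.
aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name coredns

# Update the managed add-on (the version below is a placeholder).
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name coredns \
  --addon-version v1.11.1-eksbuild.4 \
  --resolve-conflicts OVERWRITE
```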
Tips for near-zero downtime
- Always enforce PodDisruptionBudgets.
- Leverage multiple availability zones so draining nodes in one AZ doesn’t knock out capacity.
Once your nodes are upgraded and workloads stable, the next step is validation.
Post-upgrade validation and testing
Don’t assume the cluster is healthy just because the upgrade finished.
Cluster health checks
- Confirm the control plane is healthy. Note that kubectl get componentstatuses has been deprecated since Kubernetes 1.19, so query the API server health endpoint instead:
```bash
kubectl get --raw '/readyz?verbose'
```
- Ensure all nodes are Ready and autoscaling responds correctly.
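A quick check that every node both reports Ready and runs the new kubelet version; an older version in the output usually means a node group that was missed:

```bash
# Node status and kubelet version; sort by age to spot stragglers.
kubectl get nodes -o wide --sort-by=.metadata.creationTimestamp
```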
Application smoke tests
- Run automated checks for your most critical services.
- Test ingress endpoints, service discovery, and DNS lookups.
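Smoke tests don't have to be elaborate; even a scripted probe of your public endpoints catches most regressions. A minimal sketch, assuming a hypothetical hostname and health path:

```bash
# Fail fast if the ingress, TLS, or backing service is broken after the upgrade.
curl -fsS --max-time 10 https://app.example.com/healthz || echo "smoke test FAILED"
```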
Karpenter validation
- Scale workloads up and down and confirm Karpenter provisions and deprovisions nodes as expected.
- Check logs for provisioning errors:
```bash
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
```
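To exercise the scale-up and scale-down check above, a common trick is a throwaway deployment of pause containers with explicit CPU requests so it cannot fit on existing nodes. A sketch with a hypothetical deployment name:

```bash
# Create an oversized dummy workload to force Karpenter to provision nodes.
kubectl create deployment inflate --image=registry.k8s.io/pause:3.9 --replicas=0
kubectl set resources deployment inflate --requests=cpu=1
kubectl scale deployment inflate --replicas=20

# Watch nodes appear, then scale to zero and confirm Karpenter removes them.
kubectl get nodes -w
kubectl scale deployment inflate --replicas=0
```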
Observability
- Validate monitoring dashboards and alerting rules.
- Verify log pipelines are collecting from new nodes.
This is the stage where small misconfigurations show up, so give it real attention.
Best practices and lessons learned to build a sustainable upgrade flow
After a few cycles, you’ll notice patterns. These practices will save you time and headaches:
- Keep Karpenter, CNI, CoreDNS, and kube-proxy aligned with cluster upgrades.
- Maintain a documented runbook so upgrades don’t depend on tribal knowledge.
- Schedule upgrades proactively, not under pressure when AWS deprecates a version.
Every upgrade is an opportunity to refine your process. Share lessons with your team so everyone benefits.
Upgrading without turning into a war story
Upgrading EKS clusters is a reality of operating Kubernetes in production. With careful planning, staged testing, and thorough validation, you can perform upgrades with minimal disruption.
Stay ahead of AWS deprecations, treat Karpenter as a first-class citizen in the upgrade, and keep your playbooks current. Done right, upgrades become routine rather than dreaded fire drills.
