Accelerating K8s node drains: How we reduced drain time at Zesty


Draining a Kubernetes node can be painfully slow. The process involves gradually evicting pods according to PodDisruptionBudget (PDB) rules, waiting for replacement pods to become ready, and repeating this cycle until the node is empty. This rolling eviction loop makes sense for safety and workload stability, but in situations where speed is critical, like during a Spot interruption, it can be a real problem.

At Zesty, we ran into this exact issue and implemented a solution that safely replaces all pods at once instead of relying on the rolling eviction loop of a standard Kubernetes drain. That dramatically reduced drain time and was a real game-changer in how we manage our resources.

In this article, I’ll walk through how we built that solution and the Kubernetes behaviors we leveraged to make it work. 

Understanding the behavior behind the solution

We took advantage of three Kubernetes behaviors that are often overlooked:

  1. ReplicaSet membership is based on labels.
    A pod is considered part of a ReplicaSet only if its labels match the ReplicaSet’s label selector. 
  2. ReplicaSets self-balance their replica count.
    When a ReplicaSet has more pods than its replicas value, it automatically deletes some of them. 
  3. Pod deletion preference can be influenced.
    Kubernetes decides which pods to delete based on a pod-deletion-cost annotation (controller.kubernetes.io/pod-deletion-cost). Pods with lower costs are deleted first. 

By combining these three behaviors, we were able to orchestrate a drain process that replaces all pods on a node almost simultaneously, instead of one by one.

How our approach works

Here’s a simplified version of the mechanism we developed:

  1. Set deletion priorities and cordon the node.
    We first mark all pods on the draining node with a pod-deletion-cost of math.MinInt32, ensuring that Kubernetes will prefer to delete these pods first. Then, we cordon the node to stop new pods from being scheduled on it.
  2. Create replacement pods outside the ReplicaSet.
    For each pod on the node, we create a replacement with identical configuration—except for one thing: its labels don’t match the ReplicaSet’s selector. This means Kubernetes won’t yet consider it part of the ReplicaSet. 
  3. Activate replacements as they become ready.
    As each replacement pod reaches the “Ready” state, we modify its labels so that it now matches the ReplicaSet’s selector. At that moment, the ReplicaSet sees that it has more replicas than desired. 
  4. Let Kubernetes handle deletion.
    The ReplicaSet then starts deleting pods to get back to the desired replica count. Because we assigned a minimal pod-deletion-cost to the pods on the draining node, Kubernetes prioritizes removing those first. This safely and automatically clears the draining node. 

Once all pods have been deleted from the node, the node itself can be safely removed.

Why our approach works

This approach essentially shifts the timing of pod creation and ReplicaSet membership, allowing replacements to start up in parallel rather than sequentially. It preserves ReplicaSet semantics and health checks but eliminates the unnecessary waiting period between evictions.

This method results in a significant reduction in drain time. That makes a huge difference in time-critical situations, most crucially Spot interruptions.

Drain your nodes, not your cluster

This solution reminded me that Kubernetes’ control loops are flexible once you understand their underlying mechanics. By understanding Kubernetes internals, you can use existing features in ways they weren’t originally intended. That’s how we came up with this fast-draining mechanism.

So if you ever find yourself constrained by the default limitations of the system, there might be a way to bypass them. Investigate the small details, because it’s not just the devil who lies there; sometimes it’s a perfect solution to a problem no one was trying to solve.

If you want to learn more about Zesty’s Headroom Reduction solution and how it leverages automation to reduce operational costs, click here.