Spot Instances are one of the most powerful levers we have for cutting cloud costs. They’re also one of the riskiest to use if not done right. The reason is simple: Spot capacity can vanish at any time. When AWS needs the capacity back, your nodes get interrupted, and if your workloads can’t survive that, you’ll be paged into a fire drill.
But this doesn’t have to be the story. Kubernetes gives us tools, and AWS gives us signals, that let us design systems where Spot interruptions are a nuisance rather than a disaster. This article walks you through the practices that make workloads interruption-tolerant, with a focus on the DevOps reality: clear steps, examples you can drop in, and the pitfalls you’ll want to avoid.
Step 1. Start With the Signals You’ll Receive
Every Spot strategy starts with one fact: AWS will tell you before it reclaims a node. There are two kinds of signals:
- Interruption notice: a two-minute warning before the instance shuts down.
- Rebalance recommendation: an early warning that your instance might be interrupted, but it isn’t a guarantee. It simply means capacity is tightening in that Availability Zone, so you should consider moving workloads. Sometimes you’ll get the recommendation first; other times you may only receive the two-minute interruption notice.
You can actually see these signals yourself through the EC2 metadata service. From a node or pod, run:
curl -s http://169.254.169.254/latest/meta-data/spot/instance-action
If there’s an active interruption, you’ll get JSON back with an action and timestamp. Most of the time you’ll see a 404, which just means you’re safe for now.
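If you want to see what an automated watcher does with these endpoints, here’s a minimal Go sketch that polls both the interruption notice and the rebalance recommendation (exposed at /latest/meta-data/events/recommendations/rebalance). It assumes the node still answers token-less IMDS requests, like the curl above; with IMDSv2 enforced you’d fetch a session token first. In practice, the tools in Step 2 do this job for you.
// A minimal sketch of a Spot-signal poller. Assumes IMDSv1-style access;
// with IMDSv2 enforced you would first fetch a token via PUT /latest/api/token.
package main

import (
    "io"
    "log"
    "net/http"
    "time"
)

const (
    interruptionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
    rebalanceURL    = "http://169.254.169.254/latest/meta-data/events/recommendations/rebalance"
)

// check returns the response body if a signal is active, or "" on 404.
func check(client *http.Client, url string) string {
    resp, err := client.Get(url)
    if err != nil {
        log.Printf("metadata request failed: %v", err)
        return ""
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "" // 404 means no signal right now
    }
    body, _ := io.ReadAll(resp.Body)
    return string(body)
}

func main() {
    client := &http.Client{Timeout: 2 * time.Second}
    for {
        if body := check(client, rebalanceURL); body != "" {
            log.Printf("rebalance recommendation: %s", body)
        }
        if body := check(client, interruptionURL); body != "" {
            log.Printf("interruption notice: %s", body)
            // This is where a real handler would start cordoning and draining.
        }
        time.Sleep(5 * time.Second)
    }
}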
These signals are the foundation. Everything else you’ll build is about catching them quickly and letting Kubernetes shuffle your workloads before the clock runs out.
Step 2. Automate the Response with Interruption Handlers
You don’t want humans watching for signals. You want software. That’s where interruption handlers come in.
- Karpenter: If you’re already using it to scale your cluster, Karpenter can also handle Spot interruptions. It drains affected nodes and even launches replacements when a rebalance recommendation arrives.
- AWS Node Termination Handler (NTH): If you’re not on Karpenter, NTH is a simple DaemonSet that runs on every node, cordons and drains when it sees interruption signals, and cleans up gracefully.
You typically install NTH as a DaemonSet using Helm. The safest approach is to follow the official installation guide, since configuration flags change over time.
One important detail: Karpenter’s interruption handling relies on an SQS queue fed by EventBridge rules for Spot interruption and rebalance notifications, so setting up that queue is required. NTH can work the same way in its queue-processor mode, or poll the instance metadata directly in the DaemonSet mode described above. You can find step-by-step instructions in the NTH setup docs and the Karpenter interruption handling guide.
Once NTH is installed, check it’s running:
kubectl get pods -n kube-system | grep termination-handler
You should see one pod per node in the Running state.
Important: Don’t run both Karpenter and NTH with interruption handling enabled. They will both try to drain the same node, and that creates more disruption instead of less.
Step 3. Diversify to Reduce Your Blast Radius
Even with great handling, if you run all your pods on one instance type in one availability zone, a single market fluctuation can wipe out your cluster. Diversification spreads that risk.
- Mix multiple instance families (m5, c6i, r6g, etc.).
- Deploy across multiple availability zones.
- Use price-capacity-optimized strategies when requesting Spot nodes.
Here’s a simplified Karpenter NodePool showing this approach:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      # A real NodePool also needs a nodeClassRef pointing at your EC2NodeClass;
      # it's omitted here to keep the example focused on diversification.
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["m5", "c6i", "r6g"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
With this in place, kubectl get nodes -o wide should show nodes coming from different families and zones. That’s your confirmation that you’re not putting all your eggs in one basket.
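If you want the same confirmation programmatically, here’s a small client-go sketch (assuming it runs in-cluster with RBAC permission to list nodes) that counts nodes per zone and instance type using the standard node labels. A healthy pool shows several combinations, not just one.
// A sketch that summarizes node spread across zones and instance types.
package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Count nodes per (zone, instance type) using the well-known labels.
    spread := map[string]int{}
    for _, n := range nodes.Items {
        key := n.Labels["topology.kubernetes.io/zone"] + " / " + n.Labels["node.kubernetes.io/instance-type"]
        spread[key]++
    }
    for key, count := range spread {
        fmt.Printf("%-35s %d nodes\n", key, count)
    }
}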
Step 4. Teach Kubernetes How to Evict Gracefully
Once a node is marked for termination, pods need to get out of the way smoothly. That’s where Pod Disruption Budgets (PDBs) and termination grace periods come into play.
A PDB defines the minimum number of pods that must remain available during evictions. Here’s an example for a deployment with three replicas:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
Add a terminationGracePeriodSeconds to the pod spec (spec.template.spec in a Deployment) to give each pod time to finish requests and flush data:
spec:
  terminationGracePeriodSeconds: 60
To test it, use AWS FIS (covered in more detail in Step 8).
The key here is balance. Too short, and requests are cut mid-flight. Too long, and you run out of the two-minute interruption window.
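The other half of this balance lives in the application itself: it has to treat SIGTERM as “finish what you’re doing, then exit,” otherwise the grace period buys you nothing. Here’s a minimal Go sketch of that pattern; the 50-second shutdown window is an illustrative number chosen to fit inside the 60-second grace period above.
// A minimal sketch of app-side graceful shutdown, so the pod actually uses
// its terminationGracePeriodSeconds instead of dropping in-flight requests.
package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    srv := &http.Server{Addr: ":8080", Handler: mux}

    // Kubernetes sends SIGTERM when the pod is evicted, then SIGKILL after
    // the grace period expires.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("server error: %v", err)
        }
    }()

    <-ctx.Done() // SIGTERM received: the node is going away

    // Give in-flight requests up to 50s, staying inside a 60s grace period.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 50*time.Second)
    defer cancel()
    if err := srv.Shutdown(shutdownCtx); err != nil {
        log.Printf("forced shutdown: %v", err)
    }
}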
Step 5. Make Statelessness a First-Class Citizen
Nothing undermines Spot resilience like state tied to a node that disappears. The solution is to design applications that don’t care which node they land on.
That means pushing state out to durable services:
- Cache or ephemeral session data → Redis.
- Long-term checkpoints → MongoDB or DynamoDB.
If you already write to a local file, swap it for a call to Redis. If your app needs to checkpoint, write to DynamoDB before acknowledging the request. Then when Kubernetes spins up the pod on a new node, it picks up exactly where it left off. To make that work, the app needs startup logic that looks for an existing checkpoint and resumes from it rather than always starting fresh.
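To make the checkpoint pattern concrete, here’s a Go sketch using Redis through the go-redis client (github.com/redis/go-redis/v9). The key name, the REDIS_ADDR environment variable, and the “batch-42” position are made up for illustration; what matters is the order of operations: load the checkpoint at startup, save it before acknowledging work.
// A sketch of the checkpoint-and-resume pattern described above.
package main

import (
    "context"
    "errors"
    "log"
    "os"

    "github.com/redis/go-redis/v9"
)

const checkpointKey = "worker:checkpoint" // hypothetical key name

// loadCheckpoint returns the last saved position, or "" if starting fresh.
func loadCheckpoint(ctx context.Context, rdb *redis.Client) string {
    val, err := rdb.Get(ctx, checkpointKey).Result()
    if errors.Is(err, redis.Nil) {
        return "" // no checkpoint yet: first run of this workload
    }
    if err != nil {
        log.Fatalf("loading checkpoint: %v", err)
    }
    return val
}

// saveCheckpoint persists progress before the work item is acknowledged.
func saveCheckpoint(ctx context.Context, rdb *redis.Client, position string) {
    if err := rdb.Set(ctx, checkpointKey, position, 0).Err(); err != nil {
        log.Fatalf("saving checkpoint: %v", err)
    }
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("REDIS_ADDR")})

    last := loadCheckpoint(ctx, rdb)
    log.Printf("resuming from checkpoint %q", last)

    // ... process work, calling saveCheckpoint after each durable step ...
    saveCheckpoint(ctx, rdb, "batch-42")
}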
Step 6. Cut Startup Time with Lean Images
When a Spot node goes away, Kubernetes schedules your pod somewhere else. How fast it comes back depends a lot on image size.
A few habits help:
- Use multi-stage Docker builds so build tooling never ends up in the final image.
- Start with slim base images like alpine or distroless.
- Keep images small enough to pull quickly.
Here’s a Go app example:
FROM golang:1.25 AS builder
WORKDIR /src
COPY . .
RUN go build -o app .
FROM gcr.io/distroless/base
COPY --from=builder /src/app /app
CMD ["/app"]
Run docker images and check the size. The smaller the better. And watch pod logs when a new replica spins up: if your SLO is “ready in under 10 seconds,” this is where you prove it.
Step 7. Add Probes So Kubernetes Knows the Truth
Even if a pod starts quickly, you don’t want traffic routed until it’s actually ready. That’s what health probes are for.
- Readiness probes: control when a pod enters service.
- Liveness probes: detect and restart pods that are stuck.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Run kubectl describe pod <name> to see probe status. Kill the app process and confirm Kubernetes restarts it. That’s your safety net working.
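For completeness, here’s what the application side of those two endpoints might look like in Go. The readiness flag that flips after startup work is an illustrative pattern, not something the probes themselves require.
// A minimal sketch of the /health and /ready endpoints behind the probes above.
package main

import (
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

func main() {
    var ready atomic.Bool

    mux := http.NewServeMux()

    // Liveness: a cheap check that the process is not wedged.
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: only accept traffic once startup work (config, caches,
    // checkpoint restore from Step 5) has finished.
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        if ready.Load() {
            w.WriteHeader(http.StatusOK)
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
    })

    go func() {
        time.Sleep(2 * time.Second) // stand-in for real startup work
        ready.Store(true)
    }()

    log.Fatal(http.ListenAndServe(":8080", mux))
}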
Step 8. Run a Game Day with a Fake Interruption
Theory is good. Practice is better. AWS makes it possible to simulate Spot interruptions using the Fault Injection Simulator (FIS) or the amazon-ec2-spot-interrupter tool.
Set up a test service with all the patterns we’ve covered. Then simulate an interruption. Watch what happens:
- Node cordons and drains.
- Replacement capacity spins up.
- Pods shut down gracefully and restart on a new node.
- Service stays available.
If you see errors spike, something needs tuning – maybe the PDB, maybe the grace period. Better to discover that now than during production traffic.
Step 9. Measure and Keep Improving
Spot resilience isn’t a one-and-done job. You want to measure:
- How long it takes pods to reschedule.
- Whether error rates spike during interruptions.
- How quickly new nodes join the cluster.
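Here’s a client-go sketch for the first of those metrics. It watches pods behind an illustrative app=web selector in the default namespace and logs every transition with a timestamp, so during a game day you can read the reschedule time straight out of the log.
// A sketch that logs pod transitions so you can measure time-to-Ready
// after an interruption. Selector and namespace are illustrative.
package main

import (
    "context"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// isReady reports whether the pod's Ready condition is True.
func isReady(pod *corev1.Pod) bool {
    for _, c := range pod.Status.Conditions {
        if c.Type == corev1.PodReady {
            return c.Status == corev1.ConditionTrue
        }
    }
    return false
}

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    w, err := clientset.CoreV1().Pods("default").Watch(context.Background(),
        metav1.ListOptions{LabelSelector: "app=web"})
    if err != nil {
        log.Fatal(err)
    }

    // The gap between a pod disappearing from the old node and turning
    // Ready on the new one is your reschedule time.
    for ev := range w.ResultChan() {
        pod, ok := ev.Object.(*corev1.Pod)
        if !ok {
            continue
        }
        log.Printf("%s pod=%s node=%s ready=%v", ev.Type, pod.Name, pod.Spec.NodeName, isReady(pod))
    }
}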
Schedule quarterly “game days” where you deliberately interrupt nodes and see how the system behaves. That’s how you stay confident that the cost savings don’t come at the expense of reliability.
Wrapping Up
Running workloads on Spot is like surfing: the waves will crash, but with balance and preparation you can ride them instead of being thrown off. By wiring in interruption handlers, diversifying capacity, externalizing state, and tuning Kubernetes behaviors, you make interruptions survivable.
The cost savings are real. The engineering investment is upfront. And the peace of mind you get the next time AWS reclaims your Spot node is priceless.