Spot Instances are one of the most powerful levers we have for cutting cloud costs. They’re also one of the riskiest to use if not done right. The reason is simple: Spot capacity can vanish at any time. When AWS needs the capacity back, your nodes get interrupted, and if your workloads can’t survive that, you’ll be paged into a fire drill.
But this doesn’t have to be the story. Kubernetes gives us tools, and AWS gives us signals, that let us design systems where Spot interruptions are a nuisance rather than a disaster. This article walks you through the practices that make workloads interruption-tolerant, with a focus on the DevOps reality: clear steps, examples you can drop in, and the pitfalls you’ll want to avoid.
Step 1. Start With the Signals You’ll Receive
Every Spot strategy starts with one fact: AWS will tell you before it reclaims a node. There are two kinds of signals:
- Interruption notice: a two-minute warning before the instance shuts down.
- Rebalance recommendation: an early warning that your instance might be interrupted, but it isn’t a guarantee. It simply means capacity is tightening in that Availability Zone, so you should consider moving workloads. Sometimes you’ll get the recommendation first; other times you may only receive the two-minute interruption notice.
You can actually see these signals yourself through the EC2 metadata service. From a node or pod, run:
curl -s http://169.254.169.254/latest/meta-data/spot/instance-action
If there’s an active interruption, you’ll get JSON back with an action and timestamp. Most of the time you’ll see a 404, which just means you’re safe for now.
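If you want to see what an automated watcher does with these endpoints, here’s a minimal Go sketch that polls both the interruption notice and the rebalance recommendation (exposed at /latest/meta-data/events/recommendations/rebalance). It assumes the node still answers token-less IMDS requests, like the curl above; with IMDSv2 enforced you’d fetch a session token first. In practice, the tools in Step 2 do this job for you.
// A minimal sketch of a Spot-signal poller. Assumes IMDSv1-style access;
// with IMDSv2 enforced you would first fetch a token via PUT /latest/api/token.
package main

import (
    "io"
    "log"
    "net/http"
    "time"
)

const (
    interruptionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
    rebalanceURL    = "http://169.254.169.254/latest/meta-data/events/recommendations/rebalance"
)

// check returns the response body if a signal is active, or "" on 404.
func check(client *http.Client, url string) string {
    resp, err := client.Get(url)
    if err != nil {
        log.Printf("metadata request failed: %v", err)
        return ""
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "" // 404 means no signal right now
    }
    body, _ := io.ReadAll(resp.Body)
    return string(body)
}

func main() {
    client := &http.Client{Timeout: 2 * time.Second}
    for {
        if body := check(client, rebalanceURL); body != "" {
            log.Printf("rebalance recommendation: %s", body)
        }
        if body := check(client, interruptionURL); body != "" {
            log.Printf("interruption notice: %s", body)
            // This is where a real handler would start cordoning and draining.
        }
        time.Sleep(5 * time.Second)
    }
}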
These signals are the foundation. Everything else you’ll build is about catching them quickly and letting Kubernetes shuffle your workloads before the clock runs out.
Step 2. Automate the Response with Interruption Handlers
You don’t want humans watching for signals. You want software. That’s where interruption handlers come in.
- Karpenter: If you’re already using it to scale your cluster, Karpenter can also handle Spot interruptions. It drains affected nodes and even launches replacements when a rebalance recommendation arrives.
- AWS Node Termination Handler (NTH): If you’re not on Karpenter, NTH is a simple DaemonSet that runs on every node, cordons and drains when it sees interruption signals, and cleans up gracefully.
You typically install NTH as a DaemonSet using Helm. The safest approach is to follow the official installation guide, since configuration flags change over time.
One important detail: Karpenter’s interruption handling relies on an SQS queue fed by EventBridge rules for Spot interruption and rebalance notifications, so setting up that queue is required. NTH can work the same way in its queue-processor mode, or poll the instance metadata directly in the DaemonSet mode described above. You can find step-by-step instructions in the NTH setup docs and the Karpenter interruption handling guide.
Once NTH is installed, check it’s running:
kubectl get pods -n kube-system | grep termination-handler
You should see one pod per node in the Running state.
Important: Don’t run both Karpenter and NTH with interruption handling enabled. They will both try to drain the same node, and that creates more disruption instead of less.
Step 3. Diversify to Reduce Your Blast Radius
Even with great handling, if you run all your pods on one instance type in one availability zone, a single market fluctuation can wipe out your cluster. Diversification spreads that risk.
- Mix multiple instance families (m5, c6i, r6g, etc.).
- Deploy across multiple availability zones.
- Use price-capacity-optimized strategies when requesting Spot nodes.
Here’s a simplified Karpenter NodePool showing this approach:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      # A real NodePool also needs a nodeClassRef pointing at your EC2NodeClass;
      # it's omitted here to keep the example focused on diversification.
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["m5", "c6i", "r6g"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
With this in place, kubectl get nodes -o wide should show nodes coming from different families and zones. That’s your confirmation that you’re not putting all your eggs in one basket.
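If you want the same confirmation programmatically, here’s a small client-go sketch (assuming it runs in-cluster with RBAC permission to list nodes) that counts nodes per zone and instance type using the standard node labels. A healthy pool shows several combinations, not just one.
// A sketch that summarizes node spread across zones and instance types.
package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // Count nodes per (zone, instance type) using the well-known labels.
    spread := map[string]int{}
    for _, n := range nodes.Items {
        key := n.Labels["topology.kubernetes.io/zone"] + " / " + n.Labels["node.kubernetes.io/instance-type"]
        spread[key]++
    }
    for key, count := range spread {
        fmt.Printf("%-35s %d nodes\n", key, count)
    }
}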
Step 4. Teach Kubernetes How to Evict Gracefully
Once a node is marked for termination, pods need to get out of the way smoothly. That’s where Pod Disruption Budgets (PDBs) and termination grace periods come into play.
A PDB defines the minimum number of pods that must remain available during evictions. Here’s an example for a deployment with three replicas:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
Add a terminationGracePeriodSeconds to the pod spec (spec.template.spec in a Deployment) to give each pod time to finish requests and flush data:
spec:
  terminationGracePeriodSeconds: 60
To test it, use AWS FIS (covered in more detail in Step 8).
The key here is balance. Too short, and requests are cut mid-flight. Too long, and you run out of the two-minute interruption window.
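The other half of this balance lives in the application itself: it has to treat SIGTERM as “finish what you’re doing, then exit,” otherwise the grace period buys you nothing. Here’s a minimal Go sketch of that pattern; the 50-second shutdown window is an illustrative number chosen to fit inside the 60-second grace period above.
// A minimal sketch of app-side graceful shutdown, so the pod actually uses
// its terminationGracePeriodSeconds instead of dropping in-flight requests.
package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    srv := &http.Server{Addr: ":8080", Handler: mux}

    // Kubernetes sends SIGTERM when the pod is evicted, then SIGKILL after
    // the grace period expires.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("server error: %v", err)
        }
    }()

    <-ctx.Done() // SIGTERM received: the node is going away

    // Give in-flight requests up to 50s, staying inside a 60s grace period.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 50*time.Second)
    defer cancel()
    if err := srv.Shutdown(shutdownCtx); err != nil {
        log.Printf("forced shutdown: %v", err)
    }
}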
Step 5. Make Statelessness a First-Class Citizen
Nothing undermines Spot resilience like state tied to a node that disappears. The solution is to design applications that don’t care which node they land on.
That means pushing state out to durable services:
- Cache or ephemeral session data → Redis.
- Long-term checkpoints → MongoDB or DynamoDB.
If you already write to a local file, swap it for a call to Redis. If your app needs to checkpoint, write to DynamoDB before acknowledging the request. Then when Kubernetes spins up the pod on a new node, it picks up exactly where it left off. To make that work, the app needs startup logic that looks for an existing checkpoint and resumes from it rather than always starting fresh.
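To make the checkpoint pattern concrete, here’s a Go sketch using Redis through the go-redis client (github.com/redis/go-redis/v9). The key name, the REDIS_ADDR environment variable, and the “batch-42” position are made up for illustration; what matters is the order of operations: load the checkpoint at startup, save it before acknowledging work.
// A sketch of the checkpoint-and-resume pattern described above.
package main

import (
    "context"
    "errors"
    "log"
    "os"

    "github.com/redis/go-redis/v9"
)

const checkpointKey = "worker:checkpoint" // hypothetical key name

// loadCheckpoint returns the last saved position, or "" if starting fresh.
func loadCheckpoint(ctx context.Context, rdb *redis.Client) string {
    val, err := rdb.Get(ctx, checkpointKey).Result()
    if errors.Is(err, redis.Nil) {
        return "" // no checkpoint yet: first run of this workload
    }
    if err != nil {
        log.Fatalf("loading checkpoint: %v", err)
    }
    return val
}

// saveCheckpoint persists progress before the work item is acknowledged.
func saveCheckpoint(ctx context.Context, rdb *redis.Client, position string) {
    if err := rdb.Set(ctx, checkpointKey, position, 0).Err(); err != nil {
        log.Fatalf("saving checkpoint: %v", err)
    }
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("REDIS_ADDR")})

    last := loadCheckpoint(ctx, rdb)
    log.Printf("resuming from checkpoint %q", last)

    // ... process work, calling saveCheckpoint after each durable step ...
    saveCheckpoint(ctx, rdb, "batch-42")
}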
Step 6. Cut Startup Time with Lean Images
When a Spot node goes away, Kubernetes schedules your pod somewhere else. How fast it comes back depends a lot on image size.
A few habits help:
- Use multi-stage Docker builds so build tooling never ends up in the final image.
- Start with slim base images like alpine or distroless.
- Keep images small enough to pull quickly.
Here’s a Go app example:
FROM golang:1.25 AS builder
WORKDIR /src
COPY . .
RUN go build -o app .
FROM gcr.io/distroless/base
COPY --from=builder /src/app /app
CMD ["/app"]
Run docker images and check the size. The smaller the better. And watch pod logs when a new replica spins up: if your SLO is “ready in under 10 seconds,” this is where you prove it.
Step 7. Add Probes So Kubernetes Knows the Truth
Even if a pod starts quickly, you don’t want traffic routed until it’s actually ready. That’s what health probes are for.
- Readiness probes: control when a pod enters service.
- Liveness probes: detect and restart pods that are stuck.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Run kubectl describe pod <name> to see probe status. Kill the app process and confirm Kubernetes restarts it. That’s your safety net working.
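For completeness, here’s what the application side of those two endpoints might look like in Go. The readiness flag that flips after startup work is an illustrative pattern, not something the probes themselves require.
// A minimal sketch of the /health and /ready endpoints behind the probes above.
package main

import (
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

func main() {
    var ready atomic.Bool

    mux := http.NewServeMux()

    // Liveness: a cheap check that the process is not wedged.
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: only accept traffic once startup work (config, caches,
    // checkpoint restore from Step 5) has finished.
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        if ready.Load() {
            w.WriteHeader(http.StatusOK)
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
    })

    go func() {
        time.Sleep(2 * time.Second) // stand-in for real startup work
        ready.Store(true)
    }()

    log.Fatal(http.ListenAndServe(":8080", mux))
}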
Step 8. Run a Game Day with a Fake Interruption
Theory is good. Practice is better. AWS makes it possible to simulate Spot interruptions using the Fault Injection Simulator (FIS) or the amazon-ec2-spot-interrupter tool.
Set up a test service with all the patterns we’ve covered. Then simulate an interruption. Watch what happens:
- Node cordons and drains.
- Replacement capacity spins up.
- Pods shut down gracefully and restart on a new node.
- Service stays available.
If you see errors spike, something needs tuning – maybe the PDB, maybe the grace period. Better to discover that now than during production traffic.
Step 9. Measure and Keep Improving
Spot resilience isn’t a one-and-done job. You want to measure:
- How long it takes pods to reschedule.
- Whether error rates spike during interruptions.
- How quickly new nodes join the cluster.
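Here’s a client-go sketch for the first of those metrics. It watches pods behind an illustrative app=web selector in the default namespace and logs every transition with a timestamp, so during a game day you can read the reschedule time straight out of the log.
// A sketch that logs pod transitions so you can measure time-to-Ready
// after an interruption. Selector and namespace are illustrative.
package main

import (
    "context"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// isReady reports whether the pod's Ready condition is True.
func isReady(pod *corev1.Pod) bool {
    for _, c := range pod.Status.Conditions {
        if c.Type == corev1.PodReady {
            return c.Status == corev1.ConditionTrue
        }
    }
    return false
}

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    w, err := clientset.CoreV1().Pods("default").Watch(context.Background(),
        metav1.ListOptions{LabelSelector: "app=web"})
    if err != nil {
        log.Fatal(err)
    }

    // The gap between a pod disappearing from the old node and turning
    // Ready on the new one is your reschedule time.
    for ev := range w.ResultChan() {
        pod, ok := ev.Object.(*corev1.Pod)
        if !ok {
            continue
        }
        log.Printf("%s pod=%s node=%s ready=%v", ev.Type, pod.Name, pod.Spec.NodeName, isReady(pod))
    }
}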
Schedule quarterly “game days” where you deliberately interrupt nodes and see how the system behaves. That’s how you stay confident that the cost savings don’t come at the expense of reliability.
Wrapping Up
Running workloads on Spot is like surfing: the waves will crash, but with balance and preparation you can ride them instead of being thrown off. By wiring in interruption handlers, diversifying capacity, externalizing state, and tuning Kubernetes behaviors, you make interruptions survivable.
The cost savings are real. The engineering investment is upfront. And the peace of mind you get the next time AWS reclaims your Spot node is priceless.