But as you and I both know, the catch is real. Spot capacity disappears without much warning, and if you’re not prepared, your workloads go with it. If you’re running a Kubernetes cluster, this risk gets multiplied across pods, deployments, and services. Fail once, and people start asking if you tested anything at all.

I’ve spent the last couple of years figuring out how to make Spot actually usable in production. Here’s what you really need to know if you want to make it work, without rewriting your entire architecture or waking up to broken services.

Spot Interruptions Aren’t a Maybe—They’re a When

Let’s start with what actually happens when AWS reclaims a Spot Instance.

You get a two-minute notice. That’s not a figure of speech—it’s a literal 120 seconds. AWS publishes the interruption notice via the instance metadata endpoint (http://169.254.169.254/latest/meta-data/spot/instance-action) and as an EventBridge event. Kubernetes has to react fast:

  • It needs to cordon the node (mark it unschedulable).
  • Drain all the pods (ideally gracefully).
  • Reschedule them onto healthy nodes.

That all sounds manageable, but it hinges on a few things:

  • Do you actually have capacity elsewhere for those pods to land?
  • Are your workloads built to handle shutdowns and rescheduling?
  • Do your health checks and startup probes tolerate this churn?

Even one “no” in that list means your app is going to take a hit.

In my earlier attempts, we thought we were safe because we had multiple replicas. Turns out, all replicas were on Spot nodes. AWS reclaimed them together. We had a temporary outage and a very uncomfortable postmortem.

How to Build Workloads That Can Survive a Spot Interruption

There’s no one-click solution here. But if you approach this like a battlefield strategy—plan for chaos—you’ll come out ahead.

1. Use Karpenter’s karpenter.sh/capacity-type Label

If you’re using Karpenter to provision your nodes, there’s no need to manually taint Spot nodes and configure tolerations. Karpenter makes this easier and cleaner with the karpenter.sh/capacity-type node label.

You can explicitly tell Karpenter whether a workload should run on Spot or On-Demand by selecting on this label in your pod spec or deployment, via a nodeSelector or node affinity. Under the hood, Karpenter applies the label to the nodes it provisions and uses your requirement to place the workload on the appropriate capacity type.

Here’s how to make sure a workload only runs on Spot nodes (irrelevant fields removed):

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values:
                - spot

If you want the opposite—workloads that only run on On-Demand nodes—just switch the value to "on-demand".

This is the cleanest way to separate fault-tolerant workloads (like stateless web servers, background workers, or batch jobs) from critical ones (like databases or stateful services) in Karpenter-managed environments.

2. Set Up Pod Disruption Budgets (PDBs)

These define how many pods of a workload must stay available (or how many can be unavailable) during voluntary disruptions like node drains. When Spot nodes go away, this becomes your first layer of defense to keep K8s from evicting everything at once.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-api

It doesn’t prevent Spot termination, but it prevents Karpenter/aws-node-termination-handler from evicting all replicas at once while draining nodes.

3. Use Topology Spread Constraints to Limit the Blast Radius

Spot interruptions don’t always happen one node at a time. In fact, entire availability zones can suddenly run out of Spot capacity. If all your pods are concentrated in a single zone—because it was cheaper at the time—you’re taking on more risk than you probably realize.

That’s where topology spread constraints come in.

These let you tell Kubernetes to spread pods across zones, instance types, or any other topology domain you define. The goal is to avoid putting all your eggs (pods) in one fragile basket.

Here’s an example that ensures your workload is balanced across zones:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: backend

This prevents Kubernetes from overloading a single zone—even if that zone happens to be the cheapest for Spot nodes right now.

The key takeaway: don’t assume zone-level availability is stable. Spot capacity is dynamic and often disappears in waves, taking out many instances at once. Spreading your workloads out keeps you resilient, even when AWS suddenly pulls back all Spot capacity in one zone.

And if you’re using Karpenter, this also plays nicely with its provisioning logic, allowing it to choose across zones based on availability and cost—if your pods aren’t pinned to a single location.

4. Handle Shutdown the Right Way

When a Spot Instance is marked for termination and the node gets drained, Kubernetes evicts its pods and the kubelet sends a SIGTERM to each container. If your application doesn’t handle this signal properly, it can get killed mid-request—before it finishes writing to disk, releasing locks, or sending the final log lines.

You can’t avoid the termination, but you can handle it gracefully.

Here’s how to do it properly:

Step 1: Set terminationGracePeriodSeconds

This tells Kubernetes how long to wait between sending the SIGTERM and forcefully killing the container (with SIGKILL). The default is 30 seconds, but you can and should increase this if your app needs more time to shut down cleanly.

spec:
  terminationGracePeriodSeconds: 60

This gives your application a window to exit on its own terms.

Step 2: Catch SIGTERM in Your Application Code

Your app needs to listen for SIGTERM and trigger shutdown logic—flush buffers, close DB connections, drain in-flight requests, etc. Here’s an example in Go:

// Graceful shutdown in Go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Relay SIGINT and SIGTERM into a channel so we can run cleanup first.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	done := make(chan bool, 1)

	go func() {
		sig := <-sigs
		fmt.Println()
		fmt.Println(sig)
		// Cleanup goes here: drain in-flight requests, flush buffers, close DB connections.
		done <- true
	}()

	fmt.Println("awaiting signal")
	<-done
	fmt.Println("exiting")
}

If you’re using Node.js, Python, Java—same idea. Set up a signal handler that knows how to exit properly.
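
In a long-running Go service, that handler usually ends up wrapping http.Server’s Shutdown method, which stops accepting new connections and waits for in-flight requests to finish. Here’s a minimal sketch of the pattern (the :8080 address and the 30-second timeout are placeholders; pick a timeout that stays comfortably below your terminationGracePeriodSeconds):

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		// Serve until Shutdown is called; ErrServerClosed is expected then.
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Block until Kubernetes (or Ctrl-C locally) sends a termination signal.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs

	// Give in-flight requests up to 30 seconds to complete before exiting.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}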

Why not rely on preStop?

You might have seen solutions that add a preStop hook with a sleep command like this:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]

That’s not a real shutdown strategy—it just stalls for time. It delays the termination but doesn’t help your application actually clean up. Worse, it adds confusion by creating artificial delays without any feedback to the app.

Use terminationGracePeriodSeconds + SIGTERM handling in your code. That’s the cleanest, most predictable way to make sure your app has a fighting chance when a Spot node goes away.

5. React to Spot Interruption Notices (If Your Stack Doesn’t Do It for You)

When AWS reclaims a Spot Instance, it sends a two-minute interruption notice before pulling the plug. If you can act fast—cordon the node and drain the pods—you give your workloads a better chance to reschedule gracefully.

But here’s the thing: you might not need to handle this manually at all.

If you’re using Karpenter, good news:

Karpenter handles Spot interruption notices natively. As long as its interruption queue is configured (the SQS queue it watches for EC2 interruption events), it cordons the node and drains the pods for you; it’s built into the control loop. You’re covered.

If you’re using cluster-autoscaler:

In this case, interruption handling is not automatic. But you can easily add AWS Node Termination Handler, an open-source project that watches the metadata endpoint or EventBridge for interruption signals. It then cordons and drains the node before it’s killed.

You typically deploy it as a DaemonSet, and it plays nicely with cluster-autoscaler and standard EKS setups.

If you’re building something custom, or running on unmanaged infrastructure:

You may need to handle Spot interruption notices from scratch. That means polling the metadata endpoint (http://169.254.169.254/latest/meta-data/spot/instance-action) and writing logic to drain the node and reschedule pods. This is rare, and usually only necessary in bespoke environments or legacy systems without modern tooling.
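
If you do end up rolling it yourself, the core loop is small: the instance-action endpoint returns 404 until an interruption is scheduled, then starts returning a JSON document describing the action and its time. Here’s a minimal Go sketch of that idea, assuming the instance allows plain IMDSv1-style requests (with IMDSv2 enforced you’d fetch a session token first); drainNode is a hypothetical placeholder for whatever cordon-and-drain logic you plug in:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// drainNode is a hypothetical placeholder: in a real setup this would call
// the Kubernetes API (or shell out to kubectl) to cordon and drain the node.
func drainNode() {
	fmt.Println("interruption scheduled, draining node...")
}

func main() {
	const actionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

	for {
		// 404 means no interruption is scheduled yet; 200 means the clock is ticking.
		resp, err := http.Get(actionURL)
		if err == nil {
			ok := resp.StatusCode == http.StatusOK
			resp.Body.Close()
			if ok {
				drainNode()
				return
			}
		}
		time.Sleep(5 * time.Second)
	}
}

You’d typically run something like this as a DaemonSet on your Spot nodes, but once you’re at that point it’s usually simpler to just deploy aws-node-termination-handler instead.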

When Resilience Just Isn’t Enough

There are times when you just can’t afford to be interrupted, even for a few seconds:

  • Stateful services like databases, Kafka brokers, or Zookeeper
  • Machine learning training jobs running for hours without checkpointing
  • Workloads that have strict uptime SLAs with penalties

Yes, you could redesign these workloads with retries, checkpoints, replication, and queues. But let’s be real—it takes time, budget, and buy-in from teams that aren’t always eager to refactor everything.

In these cases, Spot Instances are usually marked “off-limits” by default. And that means you leave a lot of potential savings on the table.

How I Run Spot-Intolerant Workloads on Spot (Without Gambling on Uptime)

This used to be a hard no for me. If a workload couldn’t tolerate interruption, I just kept it on On-Demand. The two-minute warning from AWS wasn’t enough to move anything critical—not without risk.

But I’ve been running these same kinds of workloads on Spot lately, and not because I rewrote them to be fault-tolerant. I’ve been using Zesty for Spot Optimization, which handles Spot differently than anything else I’ve tried.

It’s Not About Reacting Faster—It’s About Staying Ahead

The real problem with Spot is the short notice. You get a 2-minute heads-up before AWS kills the instance. For workloads that aren’t designed to restart instantly—or that need to be drained carefully—that’s not enough time.

Zesty solves this with something they call Hiberscale Technology. And the core idea is this:

don’t wait for the Spot to be interrupted—spin up a new node ahead of time.

What Hiberscale Does

  • Preempts the interruption: Instead of reacting to AWS’s two-minute warning, Zesty proactively deploys new nodes within 30 seconds of detecting a risk, giving your workloads a safe place to land before termination happens.
  • Maintains availability: This buffer time means your services don’t experience disruption. Even Spot-intolerant workloads like real-time apps, API backends, or stateful workers can be migrated safely.
  • Works automatically: You don’t need to write your own draining logic or eviction hooks—Zesty handles the orchestration.

It basically gives you the illusion of reliability—without forcing your app to become interruption-aware.

Spot Automation: Migrate to Spot Without Rewriting Workloads

If you’re still running a big chunk of your workloads on On-Demand because no one has time to refactor for Spot, this is where Zesty really pulls weight.

You enable Spot Automation, and Zesty:

  • Analyzes your On-Demand usage
  • Automatically shifts safe workloads to Spot
  • Provisions, rebalances, and scales Spot nodes as needed
  • Cuts your compute costs—up to 70%—with no service downtime

It’s like having a full-time infra engineer watching Spot capacity and making smart decisions constantly—but it’s all automated.

Spot Insights: See Where the Savings Are Hiding

The other piece is visibility. Zesty gives you real-time insights into how your clusters are using Spot capacity, where you’re overspending on On-Demand, and which workloads are good candidates for migration.

You can:

  • Spot underutilized On-Demand instances
  • Get recommendations for migrating workloads to Spot
  • Monitor Spot vs. On-Demand usage per workload, team, or cluster

For me, this solved a big cultural issue too: it gave teams data. Engineers could see which of their services were burning through On-Demand unnecessarily—and that made conversations about switching to Spot a lot easier.

Final Thoughts: Spot Can Be a Weapon—If You Use It Right

Spot Instances aren’t a silver bullet, but they’re too valuable to ignore. If you treat them like the unstable compute layer they are—and build your workloads (and node strategy) accordingly—you can extract a lot of value without risking stability.

If you can tolerate interruptions, build smart.

If you can’t, consider tools that think ahead for you.

Either way, don’t settle for paying On-Demand rates when you don’t have to. Every dollar you don’t burn on compute is a dollar you can reinvest in engineering, features, and uptime.

Let Spot work for you, not against you.