Spot instances offer up to 90% savings on EC2 compute costs, but interruptions can cause outages if not handled properly. This guide shows you how to adopt Spot safely by mixing capacity types, implementing termination handlers, and using Kubernetes scheduling primitives to keep critical workloads protected.
Key Takeaways
- Never run critical stateful workloads on Spot without On-Demand fallback capacity
- Spot termination handlers cordon and drain nodes gracefully during 2-minute warning windows
- Node labels, affinity, and PodDisruptionBudgets control which workloads land on Spot
- Start with batch jobs and stateless services before expanding to broader workloads
- Monitor interruption rates and adjust instance diversification strategies accordingly
Understanding Spot Interruptions
AWS can reclaim Spot instances with 2 minutes' notice when it needs the capacity back for On-Demand customers. Interruption frequency varies by instance type and Availability Zone: some combinations see interruptions weekly, others monthly.
During the 2-minute window, AWS sends a termination notice via EC2 instance metadata and EventBridge. Without handling this signal, pods are abruptly killed mid-request, potentially causing:
- Dropped database connections
- Failed in-flight transactions
- Incomplete batch jobs that must restart from scratch
- Service degradation if too many replicas disappear simultaneously
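You can observe this signal directly. The interruption notice surfaces at a well-known instance metadata path that returns 404 until a notice is issued; a minimal polling sketch using IMDSv2:

```bash
# Request an IMDSv2 session token (valid for up to 6 hours)
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# 404 under normal operation; returns {"action":"terminate","time":...}
# once an interruption notice has been issued for this instance
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
```

The termination handler installed later in this guide automates exactly this polling.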
The safety-first approach treats Spot as supplemental capacity, not primary.
Architecture: Mixed Capacity Strategy
The safest Spot adoption pattern uses separate node groups for different workload classes:
- Essential nodes (On-Demand or Savings Plans): databases, message queues, critical APIs
- Preemptible nodes (Spot): batch jobs, CI/CD, stateless microservices, dev environments
This prevents cascading failures—even if all Spot nodes disappear, essential services remain available.
Creating Node Groups with Lifecycle Labels
Label nodes during bootstrap so you can target them with affinity rules:
```bash
eksctl create nodegroup \
  --cluster=my-cluster \
  --name=ondemand-essential \
  --node-type=m5.large \
  --nodes=2 \
  --nodes-min=2 \
  --nodes-max=5 \
  --node-labels="kubernetes.io/lifecycle=essential"

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-preemptible \
  --instance-types=m5.large,m5a.large,m5n.large \
  --nodes=3 \
  --nodes-min=0 \
  --nodes-max=20 \
  --spot \
  --node-labels="kubernetes.io/lifecycle=preemptible"
```

Note the instance-type diversification in the Spot group (passed via `--instance-types`): using multiple types across instance families reduces interruption correlation.
Installing the Spot Termination Handler
The AWS Node Termination Handler monitors for Spot interruption notices and gracefully drains nodes before termination:
```bash
kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.19.0/all-resources.yaml
```

This deploys a DaemonSet that:
- Polls EC2 metadata for termination notices
- Cordons the node to prevent new pod scheduling
- Drains existing pods gracefully (respects PodDisruptionBudgets)
- Allows Kubernetes to reschedule pods on healthy nodes
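If you manage add-ons with Helm, the handler is also published in AWS's eks-charts repository; an equivalent install sketch (the value shown is one of the chart's toggles, adjust for your cluster):

```bash
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true
```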
Verify it’s running on Spot nodes:
```bash
kubectl get daemonset -n kube-system aws-node-termination-handler
kubectl logs -n kube-system ds/aws-node-termination-handler --tail=100
```

You should see log entries showing periodic metadata polling and a ready state.
Workload Placement: Affinity and Tolerations
Pinning Critical Workloads to On-Demand
Use requiredDuringSchedulingIgnoredDuringExecution to force databases and stateful services onto essential nodes:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "kubernetes.io/lifecycle"
                operator: "In"
                values:
                - essential
      containers:
      - name: postgres
        image: postgres:15
```

This pod will remain Pending if no essential nodes are available, which prevents it from ever landing on Spot.
Preferring Spot for Batch Jobs
For fault-tolerant workloads, use `preferredDuringSchedulingIgnoredDuringExecution` to favor Spot but allow fallback:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: "kubernetes.io/lifecycle"
                operator: "In"
                values:
                - preemptible
      containers:
      - name: processor
        image: my-batch-job:latest
      restartPolicy: OnFailure
```

The job prefers Spot nodes but can schedule on On-Demand if Spot capacity is unavailable. The `restartPolicy: OnFailure` ensures Kubernetes retries if a Spot interruption occurs mid-job.
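Affinity controls where pods prefer to go, but nothing yet stops an unrelated pod from landing on Spot. To make Spot strictly opt-in, you can also taint the preemptible nodes and give fault-tolerant workloads a matching toleration. A sketch, assuming a custom `spot=true:NoSchedule` taint:

```yaml
# Taint all preemptible nodes once (or bake the taint into your
# node group configuration):
#   kubectl taint nodes -l kubernetes.io/lifecycle=preemptible spot=true:NoSchedule
#
# Then add this under the pod template's spec for workloads that
# are allowed onto Spot:
tolerations:
- key: "spot"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```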
Pod Disruption Budgets: Controlling Eviction Rate
PodDisruptionBudgets (PDBs) ensure a minimum number of replicas remain available during voluntary disruptions like node drains:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```

If your API has 5 replicas and a Spot node terminates, the drain can evict pods only while more than 2 remain available; once just 2 are left, further evictions block until Kubernetes schedules replacement pods on healthy nodes. This prevents service degradation.
Critical: PDBs only work during graceful drains, not forced terminations. Always run enough replicas on non-Spot nodes to satisfy minAvailable even if all Spot capacity disappears.
Per-AZ Node Groups for Persistent Volumes
EBS volumes are single-AZ resources. If you run Spot nodes in multiple AZs via a single Auto Scaling group, StatefulSets with PersistentVolumeClaims can get stuck:
- A pod with a PVC in `us-east-1a` runs on a Spot node in `us-east-1a`
- That node is interrupted
- Cluster Autoscaler provisions a new Spot node in `us-east-1b`
- The pod remains Pending because the EBS volume can't attach cross-AZ
Solution: Create one Auto Scaling group per AZ:
```bash
eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-us-east-1a \
  --node-zones=us-east-1a \
  --spot \
  --node-labels="topology.kubernetes.io/zone=us-east-1a,kubernetes.io/lifecycle=preemptible"

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-us-east-1b \
  --node-zones=us-east-1b \
  --spot \
  --node-labels="topology.kubernetes.io/zone=us-east-1b,kubernetes.io/lifecycle=preemptible"
```

Then use Cluster Autoscaler's `--balance-similar-node-groups` flag to distribute scale-outs evenly:
```bash
--balance-similar-node-groups=true
```

Alternatively, use Karpenter with AZ-specific Provisioners and let it handle the topology automatically.
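A hedged sketch of the Karpenter route, using the v1alpha5 `Provisioner` API (newer Karpenter releases express the same idea as a `NodePool`); the name and CPU limit are illustrative:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-us-east-1a
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  - key: topology.kubernetes.io/zone
    operator: In
    values: ["us-east-1a"]
  labels:
    kubernetes.io/lifecycle: preemptible
  limits:
    resources:
      cpu: "100"
```

Because Karpenter provisions nodes in response to pending pods, it selects a zone compatible with a pod's existing volume, sidestepping the stuck-Pending failure mode described above.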
Testing Spot Interruptions
Don’t wait for a real interruption to discover your safety mechanisms don’t work. Simulate terminations:
Manual Drain Test
```bash
kubectl drain <spot-node-name> --ignore-daemonsets --delete-emptydir-data
```

Watch pod eviction and rescheduling behavior. Verify PDBs are respected and critical services maintain availability.
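From a second terminal, two standard commands make the behavior visible while the drain runs:

```bash
# Evictions and rescheduling in real time
kubectl get pods -o wide --watch

# Remaining disruption headroom for each budget
kubectl get pdb --all-namespaces
```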
Chaos Engineering with Spot Interruptions
Use AWS Fault Injection Simulator to trigger real Spot interruptions in a controlled test:
- Create an FIS experiment template targeting a specific Spot instance
- Run the experiment during a load test
- Measure latency spikes and error rates during node drain
- Verify termination handler logs show proper cordon/drain sequence
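A sketch of such a template; `aws:ec2:send-spot-instance-interruptions` is the FIS action for this scenario, while the role ARN and target tag below are placeholders for your own values:

```json
{
  "description": "Send a Spot interruption to one tagged instance",
  "roleArn": "arn:aws:iam::123456789012:role/fis-spot-experiment",
  "targets": {
    "SpotInstances": {
      "resourceType": "aws:ec2:spot-instance",
      "resourceTags": { "chaos-test": "true" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "interrupt": {
      "actionId": "aws:ec2:send-spot-instance-interruptions",
      "parameters": { "durationBeforeInterruption": "PT2M" },
      "targets": { "SpotInstances": "SpotInstances" }
    }
  },
  "stopConditions": [{ "source": "none" }]
}
```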
Monitoring and Alerting
Track Spot-related metrics to detect issues early:
- Interruption frequency: Log termination handler events to CloudWatch or Prometheus
- Pod eviction rate: Monitor spikes in `kube_pod_status_phase{phase="Pending"}`
- PDB violations: Alert on `kube_poddisruptionbudget_status_pod_disruptions_allowed < 1` (see the rule sketch after this list)
- Node replacement time: Measure time from termination notice to new node ready
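If you scrape kube-state-metrics with the Prometheus Operator, the PDB alert can be written as a rule like this sketch (the rule names and 5-minute window are placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-safety-alerts
spec:
  groups:
  - name: spot.rules
    rules:
    - alert: PDBDisruptionsExhausted
      expr: kube_poddisruptionbudget_status_pod_disruptions_allowed < 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "PDB {{ $labels.poddisruptionbudget }} has no disruptions left"
```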
Set up a CloudWatch alarm on the interruption rate. A sketch, assuming you publish interruption counts as a custom metric (for example, by forwarding termination-handler events to CloudWatch):

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name spot-interruption-rate \
  --metric-name SpotInstanceInterruption \
  --namespace Custom/Spot \
  --statistic Sum \
  --period 3600 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold
```

Common Pitfalls and How to Avoid Them
Running databases on Spot without fallback. Even with termination handlers, you risk data inconsistency during abrupt shutdowns. Always use On-Demand or Reserved Instances for stateful workloads.
Insufficient replica counts. If you run 3 replicas and all land on Spot, a simultaneous multi-node interruption (rare but possible) violates your PDB. Keep at least `minAvailable` + 1 replicas, with some pinned to On-Demand; one way to enforce the split is sketched below.
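A topology spread constraint keyed on the lifecycle label forces the scheduler to balance replicas across the two capacity classes, guaranteeing that some always sit on On-Demand. A sketch, as a fragment for the pod template's spec:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/lifecycle
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: api-server
```

With two label values, `maxSkew: 1` yields a roughly even split; use `whenUnsatisfiable: ScheduleAnyway` for a softer preference that still favors Spot.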
Single instance type in Spot pools. Using only `m5.large` creates a single point of failure: if that instance type has high interruption rates, your entire Spot fleet churns. Diversify across families: `m5.large`, `m5a.large`, `m5n.large`, `m6i.large`.
Ignoring PV topology. StatefulSets on Spot require per-AZ Auto Scaling groups or Karpenter with zone-aware provisioning. Otherwise, pods get stuck Pending after AZ-crossing interruptions.
No termination handler. Without aws-node-termination-handler or equivalent, pods are forcefully killed with no grace period—breaking in-flight requests and database transactions.
Progressive Rollout Strategy
Don’t convert your entire cluster to Spot overnight. Use this staged approach:
- Week 1: Add a small Spot node group (10% of capacity) for dev/test namespaces
- Week 2: Deploy batch jobs and CI/CD runners to Spot with affinity rules
- Week 3: Move stateless APIs with >5 replicas and PDBs to prefer Spot
- Week 4: Increase Spot percentage to 50-70% of total capacity
- Ongoing: Monitor interruption rates and adjust instance diversification
At each stage, run load tests and chaos experiments before proceeding.
Expected Savings
Spot pricing varies by instance type and region, but typical savings range from 60-90% versus On-Demand. For a cluster spending $10,000/month on EC2:
- 50% Spot adoption at 70% discount: $3,500/month savings
- 70% Spot adoption at 70% discount: $4,900/month savings
Combine with Savings Plans on your On-Demand baseline for additional 20-40% savings on the remaining 30-50% of capacity.
Conclusion
Spot instances are the single highest-impact cost lever in EKS, but only when implemented with proper safety guardrails. Never run critical stateful workloads on Spot alone—always maintain On-Demand fallback capacity. Install the termination handler, use node affinity to control placement, protect services with PodDisruptionBudgets, and create per-AZ node groups for workloads with persistent volumes. Start with batch jobs and stateless services, measure interruption impact, and expand gradually. With these practices, you can safely capture 60-90% compute savings while maintaining production reliability.