Spot Instances on EKS: A Safety-First Implementation Guide

Spot instances offer up to 90% savings on EC2 compute costs, but interruptions can cause outages if not handled properly. This guide shows you how to adopt Spot safely by mixing capacity types, implementing termination handlers, and using Kubernetes scheduling primitives to keep critical workloads protected.

Key Takeaways

  • Never run critical stateful workloads on Spot without On-Demand fallback capacity
  • Spot termination handlers cordon and drain nodes gracefully during the 2-minute warning window
  • Node labels, affinity, and PodDisruptionBudgets control which workloads land on Spot
  • Start with batch jobs and stateless services before expanding to broader workloads
  • Monitor interruption rates and adjust instance diversification strategies accordingly

Understanding Spot Interruptions

AWS can reclaim Spot instances with 2 minutes' notice when the capacity is needed for On-Demand customers. The interruption frequency varies by instance type and Availability Zone—some combinations see interruptions weekly, others monthly.

During the 2-minute window, AWS sends a termination notice via EC2 instance metadata and EventBridge. Without handling this signal, pods are abruptly killed mid-request, potentially causing:

  • Dropped database connections
  • Failed in-flight transactions
  • Incomplete batch jobs that must restart from scratch
  • Service degradation if too many replicas disappear simultaneously
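You can observe the notice yourself from inside a Spot node. A quick manual check against the instance metadata service (IMDSv2; the endpoint returns 404 until an interruption is actually scheduled):

```shell
# Run on a Spot node. Returns 404 until an interruption is scheduled;
# once AWS issues the notice, it returns JSON like:
# {"action": "terminate", "time": "2024-01-01T12:00:00Z"}
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
```

This is the same endpoint the termination handler polls on your behalf.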

The safety-first approach treats Spot as supplemental capacity, not primary.

Architecture: Mixed Capacity Strategy

The safest Spot adoption pattern uses separate node groups for different workload classes:

  • Essential nodes (On-Demand or Savings Plans): databases, message queues, critical APIs
  • Preemptible nodes (Spot): batch jobs, CI/CD, stateless microservices, dev environments

This prevents cascading failures—even if all Spot nodes disappear, essential services remain available.

Creating Node Groups with Lifecycle Labels

Label nodes during bootstrap so you can target them with affinity rules:

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=ondemand-essential \
  --node-type=m5.large \
  --nodes=2 \
  --nodes-min=2 \
  --nodes-max=5 \
  --node-labels="kubernetes.io/lifecycle=essential"

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-preemptible \
  --instance-types=m5.large,m5a.large,m5n.large \
  --nodes=3 \
  --nodes-min=0 \
  --nodes-max=20 \
  --spot \
  --node-labels="kubernetes.io/lifecycle=preemptible"

Note the instance type diversification in the Spot group—using multiple types across instance families reduces interruption correlation.

Installing the Spot Termination Handler

The AWS Node Termination Handler monitors for Spot interruption notices and gracefully drains nodes before termination:

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.19.0/all-resources.yaml

This deploys a DaemonSet that:

  1. Polls EC2 metadata for termination notices
  2. Cordons the node to prevent new pod scheduling
  3. Drains existing pods gracefully (respects PodDisruptionBudgets)
  4. Allows Kubernetes to reschedule pods on healthy nodes
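If you prefer Helm to the raw manifest, the eks-charts repository exposes the draining behavior as chart values. A sketch (value names per the aws-node-termination-handler chart; verify against your chart version):

```shell
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableRebalanceDraining=true
```

The Helm route makes upgrades and configuration drift easier to manage than re-applying a pinned release manifest.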

Verify it’s running on Spot nodes:

kubectl get daemonset -n kube-system aws-node-termination-handler
kubectl logs -n kube-system ds/aws-node-termination-handler --tail=100

You should see log entries showing periodic metadata polling and ready state.

Workload Placement: Affinity and Tolerations

Pinning Critical Workloads to On-Demand

Use requiredDuringSchedulingIgnoredDuringExecution to force databases and stateful services onto essential nodes:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "kubernetes.io/lifecycle"
                    operator: "In"
                    values:
                      - essential
      containers:
        - name: postgres
          image: postgres:15

This pod will remain in Pending state if no essential nodes are available—preventing it from landing on Spot.

Preferring Spot for Batch Jobs

For fault-tolerant workloads, use preferredDuringSchedulingIgnoredDuringExecution to favor Spot but allow fallback:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: "kubernetes.io/lifecycle"
                    operator: "In"
                    values:
                      - preemptible
      containers:
        - name: processor
          image: my-batch-job:latest
      restartPolicy: OnFailure

The job prefers Spot nodes but can schedule on On-Demand if Spot capacity is unavailable. The restartPolicy: OnFailure ensures Kubernetes retries if a Spot interruption occurs mid-job.
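Affinity alone only steers pods that carry a rule: any pod with no affinity at all can still land on Spot. To make preemptible nodes strictly opt-in, you can additionally taint them (for example, `kubectl taint nodes -l kubernetes.io/lifecycle=preemptible lifecycle=preemptible:NoSchedule` — the taint key and value here are illustrative, not from this guide's node groups) and add a matching toleration to fault-tolerant workloads:

```yaml
# Toleration matching the illustrative lifecycle=preemptible:NoSchedule taint;
# only pods carrying this toleration may schedule onto the tainted Spot nodes.
spec:
  tolerations:
    - key: "lifecycle"
      operator: "Equal"
      value: "preemptible"
      effect: "NoSchedule"
```

Combining a taint with the preferred affinity above gives you both directions of control: Spot-suitable workloads are pulled toward Spot, and everything else is kept off it.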

Pod Disruption Budgets: Controlling Eviction Rate

PodDisruptionBudgets (PDBs) ensure a minimum number of replicas remain available during voluntary disruptions like node drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server

With 5 replicas and minAvailable: 2, the budget allows three concurrent disruptions. If a Spot node drain would evict a fourth pod before replacement pods are Running on healthy nodes, that eviction blocks until the budget recovers. This prevents service degradation during Spot churn.

Critical: PDBs only work during graceful drains, not forced terminations. Always run enough replicas on non-Spot nodes to satisfy minAvailable even if all Spot capacity disappears.

Per-AZ Node Groups for Persistent Volumes

EBS volumes are single-AZ resources. If you run Spot nodes in multiple AZs via a single Auto Scaling group, StatefulSets with PersistentVolumeClaims can get stuck:

  1. Pod with PVC in us-east-1a runs on a Spot node in us-east-1a
  2. That node is interrupted
  3. Cluster Autoscaler provisions a new Spot node in us-east-1b
  4. Pod remains Pending because the EBS volume can’t attach cross-AZ

Solution: Create one Auto Scaling group per AZ:

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-us-east-1a \
  --node-zones=us-east-1a \
  --spot \
  --node-labels="topology.kubernetes.io/zone=us-east-1a,kubernetes.io/lifecycle=preemptible"

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-us-east-1b \
  --node-zones=us-east-1b \
  --spot \
  --node-labels="topology.kubernetes.io/zone=us-east-1b,kubernetes.io/lifecycle=preemptible"

Then use Cluster Autoscaler’s --balance-similar-node-groups flag to distribute scale-outs evenly:

--balance-similar-node-groups=true

Alternatively, use Karpenter with zone-scoped Provisioners (renamed NodePools in Karpenter v1) and let it handle volume topology automatically.
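For the Karpenter route, a sketch of a zone-scoped Spot NodePool using the v1 API (the EC2NodeClass name `default` is an assumption; match it to your own node class):

```yaml
# Karpenter v1 NodePool constrained to Spot capacity in a single AZ,
# so volumes provisioned for its pods never need to attach cross-AZ.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-us-east-1a
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]
```

One NodePool per AZ mirrors the per-AZ Auto Scaling group pattern while letting Karpenter pick instance types dynamically.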

Testing Spot Interruptions

Don’t wait for a real interruption to discover your safety mechanisms don’t work. Simulate terminations:

Manual Drain Test

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Watch pod eviction and rescheduling behavior. Verify PDBs are respected and critical services maintain availability.
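A way to observe the drain from a second terminal (the `app=api-server` label and `api-pdb` name match the PDB example earlier; substitute your own):

```shell
# Watch evictions and replacement pods landing on other nodes
kubectl get pods -l app=api-server -o wide --watch

# In parallel, confirm the PDB's allowed disruptions never drop below zero
kubectl get pdb api-pdb --watch
```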

Chaos Engineering with Spot Interruptions

Use AWS Fault Injection Simulator to trigger real Spot interruptions in a controlled test:

  1. Create an FIS experiment template targeting a specific Spot instance
  2. Run the experiment during a load test
  3. Measure latency spikes and error rates during node drain
  4. Verify termination handler logs show proper cordon/drain sequence
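If authoring FIS experiment templates by hand feels heavyweight, AWS publishes an open-source wrapper, amazon-ec2-spot-interrupter, that creates and runs the FIS experiment for you. A sketch (flag name per the project README; the instance ID is a placeholder):

```shell
# Sends a real Spot interruption notice, via FIS, to the specified instance
ec2-spot-interrupter --instance-ids i-0123456789abcdef0
```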

Monitoring and Alerting

Track Spot-related metrics to detect issues early:

  • Interruption frequency: Log termination handler events to CloudWatch or Prometheus
  • Pod eviction rate: Monitor kube_pod_status_phase{phase="Pending"} spikes
  • PDB violations: Alert on kube_poddisruptionbudget_status_pod_disruptions_allowed < 1
  • Node replacement time: Measure time from termination notice to new node ready
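The PDB alert above can be expressed as a Prometheus rule, assuming kube-state-metrics is installed and you use the prometheus-operator PrometheusRule CRD (group and alert names here are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-safety-alerts
spec:
  groups:
    - name: spot-safety
      rules:
        - alert: PDBNearExhaustion
          # Fires when a PodDisruptionBudget has no disruptions left to give
          expr: kube_poddisruptionbudget_status_pod_disruptions_allowed < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PDB {{ $labels.poddisruptionbudget }} has no allowed disruptions remaining"
```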

Set up a CloudWatch alarm on interruption volume. CloudWatch publishes no built-in interruption counter, so one lightweight approach is an EventBridge rule matching "EC2 Spot Instance Interruption Warning" events; EventBridge emits a per-rule MatchedEvents metric in the AWS/Events namespace (the rule name below is a placeholder):

aws cloudwatch put-metric-alarm \
  --alarm-name spot-interruption-rate \
  --namespace AWS/Events \
  --metric-name MatchedEvents \
  --dimensions Name=RuleName,Value=spot-interruption-warnings \
  --statistic Sum \
  --period 3600 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold

Common Pitfalls and How to Avoid Them

Running databases on Spot without fallback. Even with termination handlers, you risk data inconsistency during abrupt shutdowns. Always use On-Demand or Reserved Instances for stateful workloads.

Insufficient replica counts. If you run 3 replicas and all are on Spot, a simultaneous multi-node interruption (rare but possible) violates your PDB. Keep at least minAvailable + 1 replicas with some on On-Demand.

Single instance type in Spot pools. Using only m5.large creates a single point of failure—if that instance type has high interruption rates, your entire Spot fleet churns. Diversify across families: m5.large,m5a.large,m5n.large,m6i.large.

Ignoring PV topology. StatefulSets on Spot require per-AZ Auto Scaling groups or Karpenter with zone-aware provisioning. Otherwise, pods get stuck Pending after AZ-crossing interruptions.

No termination handler. Without aws-node-termination-handler or equivalent, pods are forcefully killed with no grace period—breaking in-flight requests and database transactions.

Progressive Rollout Strategy

Don’t convert your entire cluster to Spot overnight. Use this staged approach:

  1. Week 1: Add a small Spot node group (10% of capacity) for dev/test namespaces
  2. Week 2: Deploy batch jobs and CI/CD runners to Spot with affinity rules
  3. Week 3: Move stateless APIs with >5 replicas and PDBs to prefer Spot
  4. Week 4: Increase Spot percentage to 50-70% of total capacity
  5. Ongoing: Monitor interruption rates and adjust instance diversification

At each stage, run load tests and chaos experiments before proceeding.

Expected Savings

Spot pricing varies by instance type and region, but typical savings range from 60-90% versus On-Demand. For a cluster spending $10,000/month on EC2:

  • 50% Spot adoption at 70% discount: $3,500/month savings
  • 70% Spot adoption at 70% discount: $4,900/month savings

Combine with Savings Plans on your On-Demand baseline for additional 20-40% savings on the remaining 30-50% of capacity.
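The arithmetic above is easy to reproduce; a few lines of shell using the same illustrative figures (integer math, percentages as whole numbers):

```shell
monthly_spend=10000   # total EC2 spend per month, USD
spot_share=50         # percent of capacity moved to Spot
spot_discount=70      # average Spot discount vs On-Demand, percent

# savings = spend x share% x discount%
savings=$((monthly_spend * spot_share * spot_discount / 10000))
echo "Estimated monthly savings: \$${savings}"   # prints: Estimated monthly savings: $3500
```

Rerun with your own spend and adoption targets to size the opportunity before committing to a rollout plan.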

Conclusion

Spot instances are the single highest-impact cost lever in EKS, but only when implemented with proper safety guardrails. Never run critical stateful workloads on Spot alone—always maintain On-Demand fallback capacity. Install the termination handler, use node affinity to control placement, protect services with PodDisruptionBudgets, and create per-AZ node groups for workloads with persistent volumes. Start with batch jobs and stateless services, measure interruption impact, and expand gradually. With these practices, you can safely capture 60-90% compute savings while maintaining production reliability.