The EKS Cost Optimization Checklist

Optimizing EKS costs requires a structured approach that balances quick wins with sustainable practices. This 30/60/90 day plan walks you through measurement, right-sizing, autoscaling, and Spot adoption—the four levers that, in real-world implementations, commonly combine to reduce EKS spend by 50-80%.

Key Takeaways

  • Week 1-2 focuses on cost visibility: deploy Kubecost and enable AWS Cost & Usage Reports
  • Week 3-6 tackles right-sizing using VPA recommendations and actual resource usage data
  • Week 7-10 implements autoscaling with HPA and Cluster Autoscaler or Karpenter
  • Week 11-13 introduces Spot instances with safety guardrails for non-critical workloads
  • Monthly reviews and cleanup automation sustain savings long-term

Days 1-14: Establish Cost Visibility

You can’t optimize what you can’t measure. The first two weeks focus entirely on instrumentation—no optimization yet.

Deploy Kubecost or OpenCost

Kubecost provides per-pod, per-namespace cost attribution by mapping Kubernetes resource requests to AWS pricing data.

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set kubecostToken="your-token-here"

For OpenCost (the open-source alternative):

kubectl apply -f https://raw.githubusercontent.com/opencost/opencost/develop/kubernetes/opencost.yaml

Access the Kubecost UI:

kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090

Navigate to http://localhost:9090 to see cost breakdowns by namespace, deployment, and pod.

Enable AWS Cost & Usage Reports

Cost & Usage Reports (CUR) provide the authoritative source of AWS pricing data. Kubecost integrates with CUR for accurate cost allocation.

  1. Go to AWS Billing Console → Cost & Usage Reports
  2. Create a new report with hourly granularity
  3. Enable resource IDs and split cost allocation
  4. Configure S3 bucket for report delivery
  5. Update Kubecost configuration to point to your CUR S3 bucket
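The console steps above map to a report definition that can also be created from the CLI via `aws cur put-report-definition` (the `cur` API is only served from us-east-1). A sketch of the JSON it accepts; the bucket name and prefix are placeholders:

```json
{
  "ReportName": "eks-cur-hourly",
  "TimeUnit": "HOURLY",
  "Format": "textORcsv",
  "Compression": "GZIP",
  "AdditionalSchemaElements": ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"],
  "S3Bucket": "my-cur-bucket",
  "S3Prefix": "cur",
  "S3Region": "us-east-1",
  "RefreshClosedReports": true,
  "ReportVersioning": "OVERWRITE_REPORT"
}
```

`RESOURCES` and `SPLIT_COST_ALLOCATION_DATA` correspond to step 3 (resource IDs and split cost allocation).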

Tag Everything

Tags enable cost allocation by team, environment, and application. Apply tags to:

  • EC2 instances (via node group tags)
  • EBS volumes (via StorageClass parameters)
  • Load balancers (via Service annotations)

Example node group tags in eksctl:

nodeGroups:
  - name: production-nodes
    tags:
      Environment: production
      Team: platform
      CostCenter: engineering
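EBS volumes can be tagged the same way through StorageClass parameters, supported by recent versions of the AWS EBS CSI driver via `tagSpecification_N` keys (a sketch; the class name and tag values are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-tagged
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  # Each tagSpecification_N becomes a tag on the provisioned EBS volume
  tagSpecification_1: "Environment=production"
  tagSpecification_2: "Team=platform"
  tagSpecification_3: "CostCenter=engineering"
```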

Activate cost allocation tags in AWS Billing preferences so they appear in Cost Explorer.

Baseline Current Costs

Document your starting point:

aws ce get-cost-and-usage \
  --time-period Start=2025-01-01,End=2025-01-31 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

Record EC2, EBS, data transfer, and EKS control plane costs. This becomes your benchmark for measuring progress.

Days 15-45: Right-Size Resources

Right-sizing eliminates the gap between reserved resources (requests) and actual usage—the single biggest cost waste in most clusters.

Install Metrics Server

Metrics Server provides the foundation for autoscaling and right-sizing decisions:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify it’s working:

kubectl top nodes
kubectl top pods --all-namespaces

Analyze Resource Slack

Compare pod requests to actual usage:

kubectl get pods --all-namespaces -o custom-columns=\
'NAMESPACE:.metadata.namespace,'\
'NAME:.metadata.name,'\
'CPU_REQ:.spec.containers[*].resources.requests.cpu,'\
'MEM_REQ:.spec.containers[*].resources.requests.memory'

In Kubecost, navigate to the “Savings” tab to see rightsizing recommendations. Look for pods with <20% CPU utilization or excessive memory requests.
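To make the slack arithmetic concrete, here is the utilization calculation on fabricated numbers (pod names and millicore values are made up; in practice, feed in figures from `kubectl top` and the requests query above):

```shell
# Utilization = used / requested millicores. Anything well under 100%
# is slack you are paying for but not using.
printf '%s\n' "api 500 90" "worker 1000 240" "cache 250 200" |
awk '{ printf "%-8s request=%dm used=%dm util=%.0f%%\n", $1, $2, $3, 100*$3/$2 }'
```

Here `api` at 18% and `worker` at 24% are right-sizing candidates, while `cache` at 80% is already close to its request.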

Deploy Vertical Pod Autoscaler (Recommendation Mode)

VPA analyzes historical usage and recommends optimal requests and limits:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

Create a VPA in recommendation-only mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"

After 24-48 hours, check recommendations:

kubectl describe vpa my-app-vpa

Apply recommendations incrementally—reduce requests by 10-30% initially and monitor for OOMKilled events or CPU throttling.

Set Resource Limits and Quotas

Prevent future over-provisioning with LimitRanges and ResourceQuotas:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container
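The LimitRange sets per-container defaults; a ResourceQuota caps the namespace's aggregate footprint so no team can silently over-provision. A sketch with illustrative limits:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    # Aggregate ceilings across all pods in the namespace
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
```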

Days 46-75: Implement Autoscaling

Autoscaling eliminates manual capacity management and ensures you pay only for what you use.

Configure Horizontal Pod Autoscaler

HPA scales pod replicas based on CPU, memory, or custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Monitor HPA decisions:

kubectl get hpa --watch

Deploy Cluster Autoscaler or Karpenter

Cluster Autoscaler scales node groups based on pending pods. It’s mature and works well with managed node groups.

Create an IAM role with autoscaling permissions and annotate the ServiceAccount:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/cluster-autoscaler

Deploy Cluster Autoscaler with the --balance-similar-node-groups flag to distribute scale-outs evenly across AZs:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set extraArgs.balance-similar-node-groups=true

Karpenter is a newer alternative that provisions right-sized nodes faster and supports more aggressive consolidation. It’s ideal for dynamic workloads.

Choose based on your operational maturity—Cluster Autoscaler for stability, Karpenter for optimization.
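If you go the Karpenter route, provisioning is driven by a NodePool rather than node groups. A minimal sketch, assuming Karpenter v1 and a separately defined EC2NodeClass named `default`:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Let Karpenter choose between Spot and On-Demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    # Consolidate empty or underutilized nodes to cut waste
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```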

Test Autoscaling Behavior

Generate load to trigger scaling:

kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- \
  /bin/sh -c "while sleep 0.01; do wget -q -O- http://my-app; done"

Watch HPA scale pods and Cluster Autoscaler provision nodes. Verify that scale-down happens after the load subsides (default: 10 minutes).

Days 76-90: Introduce Spot Instances

Spot instances can reduce compute costs by up to 90%, but they require careful handling to avoid disruptions.

Create Mixed Node Groups

Start with a small Spot node group for non-critical workloads:

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=spot-nodes \
  --node-type=m5.large \
  --nodes=3 \
  --nodes-min=1 \
  --nodes-max=10 \
  --spot \
  --node-labels="kubernetes.io/lifecycle=preemptible"

Keep a separate On-Demand node group for critical workloads:

eksctl create nodegroup \
  --cluster=my-cluster \
  --name=ondemand-nodes \
  --node-type=m5.large \
  --nodes=2 \
  --node-labels="kubernetes.io/lifecycle=essential"

Install Spot Termination Handler

AWS sends a 2-minute warning before reclaiming Spot instances. A termination handler cordons and drains nodes gracefully:

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.19.0/all-resources.yaml

Verify it’s running:

kubectl get daemonset -n kube-system | grep aws-node-termination-handler

Use Node Affinity to Pin Critical Pods

Ensure databases and stateful workloads stay on On-Demand nodes:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "kubernetes.io/lifecycle"
              operator: "In"
              values:
                - essential

For batch jobs, prefer Spot:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: "kubernetes.io/lifecycle"
              operator: "In"
              values:
                - preemptible

Set Pod Disruption Budgets

PDBs prevent too many pods from being evicted simultaneously:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Ongoing: Sustain and Improve

Cost optimization isn’t a one-time project. Establish monthly rituals:

Monthly Cost Review

  • Review Kubecost “Savings” tab for new rightsizing opportunities
  • Check for orphaned EBS volumes and unused load balancers
  • Analyze Cost Explorer for anomalies and trends
  • Adjust Savings Plans coverage based on baseline usage

Automate Cleanup

Find and delete orphaned volumes:

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
  --output table

Schedule non-production cluster shutdowns using kube-downscaler:

helm repo add kube-downscaler https://charts.kiwigrid.com
helm upgrade --install kube-downscaler kube-downscaler/kube-downscaler \
  --set env.DEFAULT_UPTIME="Mon-Fri 08:00-18:00 America/New_York"

Measuring Success

Track these metrics monthly:

  • Total EKS spend (EC2 + EBS + data transfer + control plane)
  • Cost per pod (from Kubecost)
  • Node utilization (target: >60% average CPU/memory)
  • Spot instance percentage (target: 50-70% for tolerant workloads)
  • Orphaned resource count (target: zero)

Expect 15-25% savings from right-sizing alone, 10-20% from autoscaling, and 30-50% from Spot adoption—compounding to 50-80% total reduction when combined.
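Because each lever shrinks what remains after the previous one, the savings multiply rather than add. A quick check using mid-range figures (20%, 15%, and 40% are illustrative picks from the ranges above):

```shell
# 20% right-sizing, 15% autoscaling, 40% Spot: multiply the remainders.
awk 'BEGIN { remaining = (1-0.20) * (1-0.15) * (1-0.40)
             printf "total reduction: %.0f%%\n", (1-remaining) * 100 }'
# prints: total reduction: 59%
```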

Conclusion

This 90-day plan provides a structured path from measurement to meaningful savings. Start with visibility in weeks 1-2, tackle right-sizing in weeks 3-6, implement autoscaling in weeks 7-10, and carefully introduce Spot in weeks 11-13. The key is incremental progress with validation at each step—don’t skip measurement, don’t apply VPA in auto mode without testing, and don’t put critical workloads on Spot without fallback capacity. By day 90, you’ll have the foundation for sustainable cost optimization that adapts as your cluster grows.