

# Why We Moved from GKE to EKS

*Ajinkya · DEV Community*

When we initially adopted Kubernetes, Google Kubernetes Engine (GKE) Autopilot seemed like the perfect choice — fully managed, minimal operational overhead, and quick to get started. But as our workloads matured, three major challenges started to surface:

- Rising and unpredictable costs
- Compliance constraints
- The need for deeper infrastructure control

This blog walks through why we migrated to Amazon Elastic Kubernetes Service (EKS) with Karpenter, the architectural changes we made, and the lessons we learned running production workloads post-migration.

## Rising and Unpredictable Costs

GKE Autopilot pricing is based on requested resources, not actual usage. This sounds fine at small scale — but as traffic grows, the gaps between requested and actual usage start to compound. Problems we observed:

- Over-provisioned workloads leading to higher bills
- No access to Spot/Preemptible node strategies with the same level of flexibility
- Very few cost optimization knobs to tune

As traffic grew, costs increased almost linearly, with no meaningful way to optimize without restructuring our entire workload configuration.

## Compliance Constraints

Operating in a regulated environment required:

- Fine-grained IAM control at the workload level
- Strict network isolation between services
- Audit-level visibility into infrastructure activity

With GKE Autopilot, several configurations are abstracted away or restricted by design. This made it harder to enforce organization-wide security policies and satisfy compliance requirements from auditors. Specifically:

- Enforcing per-pod IAM permissions cleanly was non-trivial
- Network policy enforcement had gaps in our specific setup
- Generating audit-ready logs tied to individual workload actions required workarounds

We needed something that gave us first-class integration with cloud-native IAM and security tooling — without layering on custom solutions.

## The Need for Deeper Infrastructure Control

When performance-sensitive services started hitting bottlenecks, the inability to choose instance types became a real blocker.
We had no control over:

- CPU vs. memory-optimized instance selection
- ARM-based workloads on Graviton processors
- Custom AMIs or low-level networking tuning

For teams running general-purpose workloads, this abstraction is a feature. For us, it was a ceiling.

## What EKS Gave Us

Moving to EKS gave us direct control over:

- **Instance families** — CPU-optimized, memory-optimized, ARM (Graviton)
- **Custom AMIs** — hardened images meeting our internal security baseline
- **Networking** — VPC-native networking with fine-grained subnet and security group control

This unlocked workload-specific performance tuning that simply wasn't possible before.

## Karpenter Is Not Your Traditional Cluster Autoscaler

Instead of scaling pre-defined node groups, Karpenter:

- Watches for unschedulable pods in real time
- Selects the right-sized instance based on actual pod requirements
- Prioritizes Spot instances where workloads allow, falling back to On-Demand seamlessly
- Bin-packs nodes efficiently, reducing idle capacity

The result: faster scaling reactions and a dramatically lower compute bill — without sacrificing reliability.

## The Compliance Story

AWS gave us the compliance story we needed:

- **IRSA (IAM Roles for Service Accounts)** — precise, per-pod IAM permissions with no shared credentials
- **VPC-level isolation** — full control over ingress, egress, and inter-service communication
- **CloudTrail integration** — every API call, every node action, fully auditable out of the box
- **AWS Config + Security Hub** — continuous compliance checks against CIS benchmarks and custom rules

This made our next compliance audit significantly smoother. Auditors got clear, traceable logs without us having to build custom instrumentation.
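To make the IRSA model concrete, here's a minimal sketch of wiring a pod to a dedicated IAM role. The namespace, names, account ID, and role ARN are hypothetical examples, not taken from our actual setup — only the `eks.amazonaws.com/role-arn` annotation and `serviceAccountName` mechanics are the real EKS plumbing:

```yaml
# ServiceAccount annotated with an IAM role ARN (IRSA).
# Account ID, role name, and namespace are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # Pods using this ServiceAccount receive web-identity credentials
    # scoped to exactly this role — no shared node credentials.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/payments-api-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      serviceAccountName: payments-api  # binds the pod to the IAM role above
      containers:
        - name: app
          image: payments-api:latest
```

Each pod's AWS API calls then show up in CloudTrail under sessions of that specific role, which is what makes the audit trail traceable per workload instead of per node.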
## Architecture: Before and After

Here's what the high-level migration looked like architecturally:

| Component     | Before (GKE)               | After (EKS)                           |
| ------------- | -------------------------- | ------------------------------------- |
| Cluster       | GKE Autopilot              | EKS (Managed Node Groups + Karpenter) |
| Autoscaling   | Built-in Autopilot scaling | Karpenter                             |
| Spot Strategy | No control                 | Karpenter Spot-first provisioning     |
| IAM           | GCP Workload Identity      | AWS IRSA                              |
| Audit Logging | Cloud Audit Logs           | CloudTrail + CloudWatch               |
| Networking    | GKE VPC-native             | AWS VPC with custom subnets           |

## How We Configured Karpenter

Karpenter replaced our traditional Cluster Autoscaler, and the difference was immediately visible. How we configured it:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64
spec:
  template:
    metadata:
      labels:
        # -----------------------------------------------
        # These labels land on the EC2 node.
        # Your pod affinity rules match against these.
        # -----------------------------------------------
        node-pool: spot-arm64
        capacity-type: spot
        arch: arm64
        workload-class: standard
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # c6g, c7g, c8g — compute optimized Graviton
          # m6g, m7g, m8g — general purpose Graviton
          # r6g, r7g, r8g — memory optimized Graviton
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]  # Graviton2+ only (gen 6, 7, 8)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h  # 7 days — shorter for spot nodes
  limits:
    cpu: 500
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 100
```

Key decisions we made:

- **Spot-first provisioning** — workloads that tolerate interruptions run on Spot; stateful services stay on On-Demand
- **Multiple instance families** — Karpenter picks the cheapest right-sized option across families
- **Interruption handling** — we use the Karpenter interruption queue (SQS) to gracefully drain Spot nodes before AWS reclaims them
- **`consolidateAfter: 2m`** — nodes deprovision two minutes after going idle, eliminating ghost capacity

## The Results

Compute costs dropped significantly. The main drivers:

- Spot instances covering the majority of our non-critical workloads
- Karpenter's bin-packing eliminating idle node waste
- Right-sized instances instead of over-provisioned static node groups

Beyond cost:

- **Faster pod scheduling** — Karpenter provisions new nodes in under 60 seconds in most cases
- **Better workload isolation** — through custom node selectors and taints
- **Graviton (ARM) instances** — compatible workloads gained a meaningful price-performance improvement

On the compliance side:

- Audit reports are now generated directly from CloudTrail without custom tooling
- IRSA eliminated shared IAM credential risks
- Security Hub provides continuous posture monitoring against our compliance framework

## What Was Hard

Being honest here — this is what makes a migration story actually useful.

**Karpenter has a learning curve.** Karpenter's provisioning model is fundamentally different from Cluster Autoscaler's. Debugging why a node wasn't provisioned — or why Karpenter chose a specific instance type — required understanding its internal decision logic. The logs are verbose but not always immediately readable. What helped: running Karpenter in dry-run mode first, and adding structured logging to correlate provisioning decisions with pod events.

**Networking behaves differently.** GKE's VPC-native networking and AWS VPC behave differently in non-obvious ways — especially around CIDR planning, secondary IP ranges, and how pod IPs are allocated. We had to redesign our subnet layout and revisit some service-to-service communication assumptions.

**IRSA is powerful but requires careful role design.** Mapping GCP Workload Identity bindings to AWS IRSA role assumptions took time, especially for services that had broad IAM permissions under GCP and needed them tightened properly.

## Lessons Learned

**Managed ≠ always optimal at scale.**
Autopilot is excellent for getting started, but production-grade platforms eventually need control surfaces that fully managed offerings deliberately hide.

**Cost optimization requires infrastructure access.** You can't tune what you can't see.

**Autoscaling strategy matters more than cluster size.** Karpenter's approach of provisioning for the pod rather than scaling a group changed how we think about capacity planning entirely.

**Compliance is easier when the platform is designed for it.** AWS's native compliance tooling removed a category of work that we were previously solving with custom scripts and log forwarding pipelines.

**Migration should always be incremental.** Parallel environment, gradual DNS cutover, canary deployments — this approach meant we caught issues in staging before they became production incidents.

## Closing Thoughts

GKE Autopilot is an excellent choice for teams that want Kubernetes without the operational overhead — and we'd still recommend it for that use case. But for production environments that require cost control at scale, a fine-grained compliance posture, and workload-specific infrastructure decisions, EKS with Karpenter provided a more flexible and efficient platform.

The migration wasn't trivial, but the control, visibility, and cost profile on the other side made it worth it.

Have you gone through a similar migration? Or are you evaluating EKS vs GKE for your stack? Drop your questions in the comments — happy to dig into specifics.

Tags: kubernetes aws devops cloud karpenter