Setting up Amazon EKS: What you must know

Dan Maas
Jul 19, 2018

Coming from the “old school” of running web services on manually-deployed EC2 instances, I was very excited to try bringing up a Kubernetes cluster using EKS.

I encountered a bunch of problems and surprises, which I’m collecting here to save time for anyone else getting started with EKS.

First, keep this in mind:

EKS is NOT a completely managed, stand-alone Kubernetes cluster.

What you really get is a managed set of Kubernetes master servers, plus some “magic glue” that links traditional AWS resources to Kubernetes.

That glue consists of:

  • Kubernetes API authentication that integrates with IAM
  • The ability to use EC2 instances as workers in your cluster, connected with some VPC networking magic
  • An implementation of the Kubernetes Service resource that spawns a traditional EC2 Elastic Load Balancer when you set type: LoadBalancer.
  • An implementation of Kubernetes persistent volumes that spawns EC2 EBS volumes.
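
To make the last item concrete: the EBS glue is driven entirely from the Kubernetes side. A minimal sketch, assuming the in-tree kubernetes.io/aws-ebs provisioner (names are illustrative, and you may need to create the StorageClass yourself if the cluster doesn’t already have one):

    # Illustrative StorageClass + claim; each bound claim becomes an EBS volume.
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: gp2
    provisioner: kubernetes.io/aws-ebs
    parameters:
      type: gp2
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: example-data
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: gp2
      resources:
        requests:
          storage: 10Gi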

With these pieces, you can assemble yourself a working Kubernetes cluster on AWS, although there are some problems with the flexibility and stability of ingress systems, and with garbage collection of AWS resources, which I will discuss below.

Overview of the set-up process

The Amazon installation instructions worked smoothly. I was able to write a Terraform module that does all of the initial steps described there.

However, there were several points where I had to use hacky work-arounds to avoid needing manual intervention, and some points where a manual step is absolutely required.

(In some ways, EKS and Kubernetes feel like a step backwards from a fully-automated Terraformed deployment that goes from zero to a live web service just by typing “terraform apply”. I am concerned by the lack of attention to one-step automation in the Kubernetes community. So much “just kubectl this” or “just helm install that”. Automation, people!)

Anyway, the basic steps to setting up an EKS cluster are as follows:

Set up your VPC and security groups

You can use almost the same VPC settings as an ordinary EC2-based service. The only difference is a little magic tag that you must apply to the VPC and its subnets to make them visible to EKS.
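
For reference, the tag in question is kubernetes.io/cluster/<cluster-name> with the value shared, applied to the VPC and to each subnet. Here is a rough sketch of where it goes on a subnet, written in CloudFormation-style YAML purely for illustration (the resource names are made up; in a Terraform module the same key/value pair goes into the tags argument of the VPC and subnet resources):

    # Hypothetical subnet; only the Tags entry is EKS-specific.
    PrivateSubnetA:
      Type: AWS::EC2::Subnet
      Properties:
        VpcId: !Ref EksVpc
        CidrBlock: 10.0.1.0/24
        AvailabilityZone: us-west-2a
        Tags:
          - Key: kubernetes.io/cluster/my-eks-cluster   # substitute your cluster name
            Value: shared
    # The VPC resource itself carries the same tag.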

Set up IAM roles

There will be separate IAM roles for the master and worker nodes. There are special AWS-provided policies you must attach to each of these roles.
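
As of this writing, the cluster (master) role needs the AmazonEKSClusterPolicy and AmazonEKSServicePolicy managed policies, and the worker role needs AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly. A CloudFormation-style sketch of the cluster role, again just for illustration (the resource name is made up; in Terraform this maps to aws_iam_role plus policy attachments):

    EksClusterRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Principal:
                Service: eks.amazonaws.com
              Action: sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
          - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
    # The worker role looks the same, except it trusts ec2.amazonaws.com and
    # attaches the three worker policies listed above.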

Gain access to the cluster

EKS encourages you to use a tool called “heptio-authenticator-aws” that integrates with IAM to grab ephemeral access tokens for the Kubernetes API. (You can also use service accounts with pre-shared secrets, but the IAM-based method avoids the need to pass around secret keys.)

Once the cluster is operating, you can add an exec setting to your kubeconfig file, and from then on kubectl will invoke the authenticator automatically.
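
The relevant kubeconfig stanza looks roughly like this (the user and cluster names are illustrative):

    # Fragment of ~/.kube/config; "my-eks-cluster" is a placeholder.
    users:
      - name: eks-admin
        user:
          exec:
            apiVersion: client.authentication.k8s.io/v1alpha1
            command: heptio-authenticator-aws
            args:
              - token
              - "-i"
              - my-eks-cluster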

Note: Terraform’s Kubernetes provider cannot currently invoke this external authenticator, although there is a hacky work-around. See https://github.com/terraform-providers/terraform-provider-kubernetes/issues/161

Launch your worker nodes

These are launched just like any other EC2 instance. You must use the EKS-optimized worker AMI and apply a special block of user data to connect them to the EKS cluster.

Also, there is a special ConfigMap you must deploy into the cluster to allow any worker nodes to connect. This is a manual step, unless you use the hacky Terraform auth work-around.
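
The ConfigMap in question is aws-auth in the kube-system namespace; it maps the workers’ IAM role onto the Kubernetes groups that let nodes register. A minimal sketch (the account ID and role name are placeholders):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: aws-auth
      namespace: kube-system
    data:
      mapRoles: |
        - rolearn: arn:aws:iam::111122223333:role/my-eks-worker-role
          username: system:node:{{EC2PrivateDNSName}}
          groups:
            - system:bootstrappers
            - system:nodes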

Configure RBAC

Amazon EKS uses the new RBAC authorization system for Kubernetes. You may need to inject service accounts and ClusterRoleBindings to allow tools like Helm to work properly.

Terraform’s Kubernetes provider lacks support for RBAC resources, so this is an unavoidable manual step for now. The Terraform team has decided not to allow arbitrary manifests, in favor of taking the time to create type-specific resources for each Kubernetes object, but they are quite far behind in support for newer types like Ingress and ClusterRoleBinding.
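
For Helm v2, the usual manual step is a ServiceAccount for Tiller plus a ClusterRoleBinding, applied with kubectl. Something along these lines (binding Tiller to cluster-admin is the blunt-instrument approach; tighten it if you care):

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: tiller
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: tiller-cluster-admin
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
      - kind: ServiceAccount
        name: tiller
        namespace: kube-system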

Ingress Issues

After setting up a cluster, I immediately ran into problems deploying typical Kubernetes applications, because EKS handles Services and Ingress differently than other platforms.

Most Kubernetes examples will set up an Ingress based on the nginx ingress controller to make themselves visible to the internet. This won’t do anything at all on EKS out of the box.

On EKS, there are a few different options:

Ingress Option #1: Service with type: LoadBalancer

EKS maps LoadBalancer services directly to AWS ELBs (a minimal manifest is sketched after this list). This seems to be the most “native” option, but it has problems:

  • Each Service spawns its own ELB, incurring extra cost and preventing you from linking more than one Service to one hostname. There is no notion of path-based routing in ELBs.
  • Most recent Kubernetes codebases have already switched to the newer Ingress system and do not configure themselves with LoadBalancer services anyway.
  • LoadBalancer services lack the flexibility of modern ingress controllers like Nginx. You are missing features like automatic TLS certificate management and OAuth mix-ins.
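
For reference, this is all it takes to spawn an ELB on EKS; a minimal sketch with made-up names:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-web-app
    spec:
      type: LoadBalancer        # EKS provisions a classic ELB for this Service
      selector:
        app: my-web-app
      ports:
        - port: 80
          targetPort: 8080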

Ingress Option #2: alb-ingress-controller

This is a third-party project that spawns AWS ALBs corresponding to specially-marked Ingress resources in Kubernetes. It tries to automatically manage Target Groups and routing rules to match the Ingress specification.

Advantages:

  • Works with the new Ingress resources rather than services
  • Includes support for lots of tweakable options on the ALBs and target groups

But there are some serious drawbacks:

  • Creates a new ALB for each Ingress resource. So again you can’t really mount more than one Service behind a single ALB, unless you manually hack the Kubernetes configuration to look very different from the current standard (different logical services don’t like to share one Ingress).
  • Doesn’t always properly maintain the links between Target Groups and worker nodes. Sometimes it fails to get a Target Group into a “healthy” state, or drops live nodes for no apparent reason.

Note, the health check settings on Target Groups are a little delicate. By default, ALBs want to see “200 OK” responses on “/” before enabling a target group, and this may not happen if you are still in the set-up process.

Ingress Option #3: nginx-ingress-controller with ELB support

Recent versions of the standard Nginx ingress controller now have the ability to create AWS ELBs to accept traffic. I did not try this approach because it doesn’t offer as much flexibility as alb-ingress-controller and integrates awkwardly with EKS authentication.

Note that many Kubernetes examples assume you are using the Nginx ingress controller, because it has a lot of nice flexibility for routing and manipulating the traffic passing through it.

A working compromise: alb-ingress-controller + Nginx

To get things working, I settled on a combination of two ingress controllers.

I installed both alb-ingress-controller and nginx-ingress-controller. Then I manually deployed a single ALB Ingress resource that creates one ALB for the whole cluster, with all the AWS-specific settings like health checks and managed TLS certificates. This main Ingress has only one route, which forwards all “/” traffic to the nginx-ingress-controller Service.
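
Stripped down, that main Ingress looks something like the sketch below (the certificate ARN, health-check path, and Service name are placeholders, and the exact annotation set depends on your alb-ingress-controller version):

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: main-alb
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
        alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:111122223333:certificate/placeholder
        alb.ingress.kubernetes.io/healthcheck-path: /healthz
        # alb.ingress.kubernetes.io/subnets: subnet-aaa,subnet-bbb  (if subnet auto-discovery isn’t set up)
    spec:
      rules:
        - http:
            paths:
              - path: /*
                backend:
                  serviceName: nginx-ingress-controller
                  servicePort: 80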

The nginx-ingress-controller does not create any AWS resources and is not visible to the internet. It just sits there as a reverse proxy, passing requests from the ALB inward to services in the cluster that register with it. Furthermore, the Nginx controller is smart enough not to create any duplicate Kubernetes resources when multiple Ingresses register with it. Each new Ingress just adds routing rules into the existing controller service.

With all this set up, I can now deploy standard Kubernetes code that creates ordinary Nginx-class Ingress resources (no ELB support required), and they will receive traffic from the ALB.
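
In other words, application charts can keep shipping ordinary Nginx-class Ingress resources like this hypothetical one, and the ALB-to-Nginx chain takes care of the rest:

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: my-web-app
      annotations:
        kubernetes.io/ingress.class: nginx   # handled by nginx-ingress-controller, not the ALB
    spec:
      rules:
        - host: app.example.com
          http:
            paths:
              - path: /
                backend:
                  serviceName: my-web-app
                  servicePort: 80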

At this point you just add CNAMEs for the ALB into DNS, and you are live on the internet!

But, I still have the problem where alb-ingress-controller randomly decides to remove healthy hosts from its Target Group. Need to investigate why this is happening.

(By the way, all of the “installations” above are done with Helm charts managed by Terraform via the terraform-provider-helm plugin).

Important note on Garbage Collection

All of the above options involve the Kubernetes cluster creating AWS resources like ELBs, ALBs, and Target Groups. Often these resources don’t get cleaned up automatically, so you have to do some manual garbage collection. ELBs/ALBs are of particular concern because you pay every hour they are running, even if they are not receiving traffic.

Spinnaker

Update: I have been able to successfully deploy Spinnaker on EKS!

The Helm charts do not work, but Halyard does! (Note: as of this writing, the Halyard Docker image includes a broken Heptio authenticator.)

After installing Spinnaker with Halyard and configuring OAuth security, I created an nginx Ingress resource for the two UI services, Deck and Gate, and made them visible to the internet.

I encountered one glitch: EC2 worker instances can only run a limited number of pods due to contention for secondary IP addresses. t2.medium instances can run at most 17 pods. My basic cluster running the Kubernetes Dashboard plus Spinnaker requires at least 22 pods, so that means at least 2 worker instances are necessary just for administration. Some of the Spinnaker pods are memory-intensive (three or four require ~500MB each) so choose the instance size accordingly.
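
(If you are wondering where the 17 comes from: the VPC CNI plugin gives each pod a secondary IP address, so the ceiling is roughly ENIs × (IPv4 addresses per ENI - 1) + 2. A t2.medium supports 3 ENIs with 6 addresses each, hence 3 × (6 - 1) + 2 = 17 pods.)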
