Producer/Consumer Queue with Autoscaling on Google Kubernetes Engine

Dan Maas
Aug 24, 2018

The Kubernetes documentation about Jobs stops short of explaining exactly how to implement a long-running fleet of workers to process tasks from a work queue.

In particular, it assumes you know in advance how many tasks are in the queue, which is not possible for a long-running service. It also does not address how to scale up and down to optimize latency and cost.

Assume that you have:

  • A Kubernetes cluster running on GKE.
  • A continual stream of tasks arriving from a queueing system like Amazon SQS (sometimes the queue holds many tasks, and sometimes it sits empty for long periods).
  • A Dockerized worker that can grab tasks from this queue and execute them.
    — One execution takes on the order of 10 minutes; assume container start-up time is negligible in comparison.
    — The container can be configured to loop and await new tasks indefinitely, or to exit after a single execution.

You want Kubernetes to do the following:

  • Spawn worker pods and nodes automatically, in numbers proportional to the system load, so as to process tasks with reasonable latency and cost.
  • Whenever the queue stays empty for an extended period, shut down all workers and nodes dedicated to this work queue.

There are probably many ways to implement this, but I have settled on the following:

  • Use a Deployment to manage the workers, not a Job.
  • Configure the worker containers to poll the queue in an infinite loop.
  • Add a separate autoscaler that watches the work queue and scales the number of replicas in the worker Deployment up or down to meet latency/cost goals.
    — I found code for an auto-scaler at https://github.com/sideshowbandana/k8s-sqs-autoscaler; there are many others out there.
  • Create a dedicated GKE node pool to run the worker pods. Taint the pool’s nodes and give the worker pods matching tolerations (plus a node selector), so the workers run only on this pool and other workloads stay off it (see the manifest sketch after this list).
  • Give the worker pods accurate resource requests, sized against the capacity of the work-pool nodes; the node pool autoscaler relies on pending pods’ requests to decide when to add nodes.
  • Enable autoscaling on the GKE node pool.
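
To make the worker side concrete, here is a minimal sketch of what such a Deployment might look like. Everything in it (the names, the taint key, the image, the queue URL, and the resource numbers) is an illustrative placeholder, not a value from my actual setup:

```yaml
# Sketch of a worker Deployment; all names and numbers are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker
spec:
  replicas: 0                  # the queue autoscaler adjusts this; start at zero
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      # Run only on the dedicated work pool. Assumes the pool's nodes carry the
      # label dedicated=worker and the taint dedicated=worker:NoSchedule.
      nodeSelector:
        dedicated: worker
      tolerations:
        - key: dedicated
          operator: Equal
          value: worker
          effect: NoSchedule
      containers:
        - name: worker
          image: gcr.io/my-project/queue-worker:latest   # placeholder image
          env:
            - name: QUEUE_URL                            # hypothetical env var the worker reads
              value: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
          resources:
            requests:
              cpu: "1"         # size these against the work-pool machine type
              memory: 2Gi
```

On the GKE side, the matching node pool can be created with that taint and label and with autoscaling enabled (gcloud container node-pools create accepts --node-taints, --node-labels, --enable-autoscaling, --min-nodes, and --max-nodes).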

With minor tweaking of the autoscaler parameters, I was able to get this working smoothly. The in-cluster autoscaler moves the number of replicas in the worker Deployment up or down, and GKE’s node pool autoscaler responds by adding or removing compute nodes proportionally. The tracking is quite accurate: the node pool scales up, runs the tasks, and then scales back down to zero when the queue goes empty.

The necessary autoscaler tweaks included waiting long enough after a scale-up for the new nodes to come online before scaling up again, and accounting for in-flight tasks when deciding whether to scale down. I’m sure it is possible to dig deeper into precise latency/cost metrics and derive optimal scaling rules, but I have not bothered with that yet.
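
To make those tuning knobs concrete, here is the general shape the autoscaler’s own Deployment can take. The image and every flag name below are illustrative placeholders, not the real interface of k8s-sqs-autoscaler or any particular tool; check your autoscaler’s README for its actual parameters:

```yaml
# Sketch of the autoscaler Deployment; the image and flags are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: queue-worker-autoscaler
  template:
    metadata:
      labels:
        app: queue-worker-autoscaler
    spec:
      containers:
        - name: autoscaler
          image: example/sqs-autoscaler:latest      # placeholder image
          args:
            - --queue-url=https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
            - --deployment=queue-worker             # the worker Deployment to scale
            - --min-replicas=0                      # allow scale-to-zero when the queue is empty
            - --max-replicas=20
            - --scale-up-cooldown=300s              # give new nodes time to come online
            - --scale-down-cooldown=600s            # leave room for ~10-minute in-flight tasks
```

In practice the autoscaler also needs AWS credentials to read the queue depth and RBAC permission to patch the worker Deployment’s replica count; those details depend on the tool you pick.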

In summary, despite what the Kubernetes docs suggest, an autoscaling Deployment seems a better fit for this kind of system than a Job. Perhaps Jobs are meant more for situations like parallel builds or batch data processing, where all the tasks are known in advance.
