Tuning the Cluster Autoscaler
Background
During one of my assignments, we were evaluating Karpenter. While it is a rather cool piece of software, it caused us some pain. The general idea was to deploy Grafana Loki and use an autoscaling tool to maintain the node pools automatically.
Karpenter seemed to be the perfect tool for this use case. However, it had (maybe due to my misconfiguration) some undesired side effects. It was quite aggressive in adding and removing nodes, which caused disruptions.
Grafana Loki is a fairly complex piece of software with various components, each serving a different purpose. It handles varying data loads very well and scales nicely, but unfortunately, due to its consistency guarantees, it requires a stable underlying infrastructure. It contains mechanisms for stream reconciliation and is not completely stateless.
What Went Wrong
Karpenter was randomly adding and removing nodes, which made the deployment very unstable. Each node removal triggered a ring rebalancing, which disrupted our logging systems.
On the same cluster, we also had other workloads with varying loads, so to save costs we still needed the cluster to scale with demand somehow.
Back to Basics
I thought about possible solutions and realized I had overlooked the obvious one: the in-tree Cluster Autoscaler.
I’m a big fan of old, well-known solutions — which in the cloud world is not always obvious.
Sometimes I wonder if cloud tools and projects don’t resemble the Node.js/npm environment a bit: lots of hype, but the basics still work just fine. As an old dev who loves make (yes, I really do), I decided to stick with the plain-old way.
I installed Cluster Autoscaler on our AWS EKS clusters. It wasn’t without issues — but that’s a topic for another post.
It needed some tuning, as the default behavior is quite conservative. I added the following parameters (the sketch after the list shows where they fit in the Deployment):
- '--skip-nodes-with-local-storage=false'
- '--expander=least-waste'
- '--scale-down-enabled=true'
- '--scale-down-unneeded-time=5m'
- '--scale-down-utilization-threshold=0.8'
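For context, these flags end up in the command of the cluster-autoscaler container. Below is a minimal sketch of that part of the Deployment spec, assuming the standard AWS setup with ASG tag-based auto-discovery; the image tag and the cluster name (my-cluster) are placeholders, and the IAM/RBAC wiring is omitted.

```yaml
# Fragment of the cluster-autoscaler Deployment (spec.template.spec).
# The image tag, cluster name and auto-discovery tags are assumptions;
# adjust them to your cluster.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
      - --skip-nodes-with-local-storage=false
      - --expander=least-waste
      - --scale-down-enabled=true
      - --scale-down-unneeded-time=5m
      - --scale-down-utilization-threshold=0.8
```

As for the expander: least-waste picks the node group that would leave the least idle CPU (and memory) after the pending pods are scheduled, which pairs nicely with the higher scale-down threshold.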
By default, scale-down-unneeded-time is 10 minutes; we decreased it to 5. But the more interesting tweak is scale-down-utilization-threshold. By default, it's 0.5, so a node is only considered for removal when its pods request less than half of its resources, which often leaves you with underutilized nodes — especially with varying loads.
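To make that concrete, with rough, illustrative numbers: on a node with 4 vCPUs allocatable, pods requesting 2.4 vCPUs in total put utilization at about 0.6. With the default 0.5 threshold that node is never a scale-down candidate; with 0.8 it becomes one, and Cluster Autoscaler can drain it once its pods fit on other nodes and it has stayed unneeded for the configured 5 minutes.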
Yes, maybe it’s a bit slower or less sophisticated than Karpenter, but it’s not tied to a specific vendor, which is a big plus. If we ever need to move to another provider, that’s one less piece to reconfigure. Of course, many of our Helm charts would still need adjustments — but that’s a story for another day.
In the end, sometimes boring and proven solutions just work. If you run into similar issues, don’t underestimate the classics.