Mistake that cost thousands (Kubernetes, GKE)

Lessons learned scaling a Kubernetes cluster

No exaggeration, unfortunately. As a disclaimer, I will add that this was a really stupid mistake that shows my lack of experience managing auto-scaling deployments. However, it all started with a question that nobody could answer, and I feel obliged to share what I learned to help others avoid similar pitfalls.

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs 1x n1-standard-96 (96 vCPUs) VM, or 6x n1-standard-16 (16 vCPUs) VMs?

I asked this question multiple times in the Kubernetes community. No one suggested an answer. If you are unsure about the answer, then there is something for you to learn from my experience (or, for the impatient, skip to Answer). Here it goes:

I woke up in the middle of the night determined to reduce our infrastructure costs.

We are running a large Kubernetes cluster. “Large” is relative, of course. In our case that is 600 vCPUs during normal business hours. This number doubles during peak hours and drops to near 0 during some hours of the night.

The invoice for last month was USD 3,500.

This is already pretty darn good given the computing power that we get, and Google Kubernetes Engine (GKE) made cost management mostly easy:

We use the least expensive data center (europe-west2 (London) is ≈15% more expensive than europe-west4 (Netherlands))
We use different machine types for different deployments (memory heavy vs CPU heavy)
We use Horizontal Pod Autoscaler (HPA) and Custom Metrics to scale deployments
We use cluster autoscaler (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler) to scale node pools
We use preemptible VMs

Using exclusively preemptible VMs is what allows us to keep the costs low. To illustrate the savings, for the n1-standard-1 machine type hosted in europe-west4, the difference between a dedicated and a preemptible VM is USD 26.73/month vs USD 8.03/month. That is roughly 3.3x lower cost. Of course, preemptible VMs have limitations that you need to familiarise yourself with and counteract, but that is a whole different topic.

With all of the above in place, it felt like we were doing all the right things to keep the costs low. However, I always had a nagging feeling that something was off.
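The saving can be checked directly from the two monthly prices quoted above. This is only a small arithmetic sketch; the variable names are illustrative and the prices are the ones given in the text, not current list prices.

```python
# Monthly prices for n1-standard-1 in europe-west4, as quoted in the text.
dedicated_usd = 26.73
preemptible_usd = 8.03

# How many times cheaper the preemptible VM is.
ratio = dedicated_usd / preemptible_usd

# The same comparison expressed as a fractional saving.
saving = 1 - preemptible_usd / dedicated_usd

print(f"preemptible is {ratio:.2f}x cheaper ({saving:.0%} saving)")
```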

About that nagging feeling:

Average CPU usage per Node was low (10%-20%). This didn’t seem right.

My first thought was that I had misconfigured compute resources. What resources are required depends entirely on the program that you are running. Therefore, the best thing to do is to deploy your program without resource limits, observe how it behaves during idle, regular and peak loads, and set resource requests and limits based on the observed values.

I will illustrate my mistake using a single deployment, “admdesl”, as an example.

In our case, resource requirements are sporadic:

NAME                       CPU(cores)   MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc   3m           112Mi
admdesl-5fcfbb5544-mfsvf   3m           118Mi
admdesl-5fcfbb5544-nj49v   4m           107Mi
admdesl-5fcfbb5544-nkvk9   3m           103Mi
admdesl-5fcfbb5544-nxbrd   3m           117Mi
admdesl-5fcfbb5544-pb726   3m           98Mi
admdesl-5fcfbb5544-rhhgn   83m          119Mi
admdesl-5fcfbb5544-rhp76   2m           105Mi
admdesl-5fcfbb5544-scqgq   4m           117Mi
admdesl-5fcfbb5544-tn556   49m          101Mi
admdesl-5fcfbb5544-tngv4   2m           135Mi
admdesl-5fcfbb5544-vcmjm   22m          106Mi
admdesl-5fcfbb5544-w9dsv   180m         100Mi
admdesl-5fcfbb5544-whwtk   3m           103Mi
admdesl-5fcfbb5544-wjnnk   132m         110Mi
admdesl-5fcfbb5544-xrrvt   4m           124Mi
admdesl-5fcfbb5544-zhbqw   4m           112Mi
admdesl-5fcfbb5544-zs75s   144m         103Mi

Pods that average 5m are “idle”: there is a task in the queue for them to process, but we are waiting for some (external) condition to clear before proceeding. In the case of this particular deployment, pods switch between idle and active states multiple times every minute and spend 70%+ of their time in the idle state.

A minute later the same set of pods will look different:

NAME                       CPU(cores)   MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc   152m         107Mi
admdesl-5fcfbb5544-mfsvf   49m          102Mi
admdesl-5fcfbb5544-nj49v   151m         116Mi
admdesl-5fcfbb5544-nkvk9   105m         100Mi
admdesl-5fcfbb5544-nxbrd   160m         119Mi
admdesl-5fcfbb5544-pb726   6m           103Mi
admdesl-5fcfbb5544-rhhgn   20m          109Mi
admdesl-5fcfbb5544-rhp76   110m         103Mi
admdesl-5fcfbb5544-scqgq   13m          120Mi
admdesl-5fcfbb5544-tn556   131m         115Mi
admdesl-5fcfbb5544-tngv4   52m          113Mi
admdesl-5fcfbb5544-vcmjm   102m         104Mi
admdesl-5fcfbb5544-w9dsv   18m          125Mi
admdesl-5fcfbb5544-whwtk   173m         122Mi
admdesl-5fcfbb5544-wjnnk   31m          110Mi
admdesl-5fcfbb5544-xrrvt   91m          126Mi
admdesl-5fcfbb5544-zhbqw   49m          107Mi
admdesl-5fcfbb5544-zs75s   87m          148Mi
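To make the two snapshots easier to compare, here is a small Python sketch that summarises the CPU column of each. The readings are copied from the listings above; the 20m cut-off used to count a pod as "idle" is an assumption chosen for illustration, not something the cluster reports.

```python
# CPU readings in millicores, copied from the two snapshots above.
first = [3, 3, 4, 3, 3, 3, 83, 2, 4, 49, 2, 22, 180, 3, 132, 4, 4, 144]
second = [152, 49, 151, 105, 160, 6, 20, 110, 13, 131, 52, 102, 18, 173, 31, 91, 49, 87]

def summarise(samples_m, idle_threshold_m=20):
    """Summarise one snapshot; idle_threshold_m is an illustrative cut-off."""
    total = sum(samples_m)
    return {
        "pods": len(samples_m),
        "total_m": total,
        "mean_m": round(total / len(samples_m), 1),
        "max_m": max(samples_m),
        "idle_pods": sum(1 for s in samples_m if s <= idle_threshold_m),
    }

print(summarise(first))   # 18 pods, 648m in total, mean 36.0m, 12 idle
print(summarise(second))  # 18 pods, 1500m in total, mean 83.3m, 4 idle
```

The same 18 pods need more than twice as much CPU in total a minute later, while no single pod ever goes above 180m.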

Looking at the above, I thought that it made sense to have a configuration such as:

resources:
  requests:
    memory: '150Mi'
    cpu: '20m'
  limits:
    memory: '250Mi'
    cpu: '200m'

This translates to:

idle pods don’t consume more than 20m
active (healthy) pods peak at 200m

However, when I used this configuration, deployments became hectic:

admdesl-78fc6f5fc9-xftgr  0/1    Terminating                3         21m
admdesl-78fc6f5fc9-xgbcq  0/1    Init:CreateContainerError  0         10m
admdesl-78fc6f5fc9-xhfmh  0/1    Init:CreateContainerError  1         9m44s
admdesl-78fc6f5fc9-xjf4r  0/1    Init:CreateContainerError  0         10m
admdesl-78fc6f5fc9-xkcfw  0/1    Terminating                0         20m
admdesl-78fc6f5fc9-xksc9  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xktzq  1/1    Running                    0         10m
admdesl-78fc6f5fc9-xkwmw  0/1    Init:CreateContainerError  0         9m43s
admdesl-78fc6f5fc9-xm8pt  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xmhpn  0/1    CreateContainerError       0         8m56s
admdesl-78fc6f5fc9-xn25n  0/1    Init:0/1                   0         9m6s
admdesl-78fc6f5fc9-xnv4c  0/1    Terminating                0         20m
admdesl-78fc6f5fc9-xp8tf  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xpc2h  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xpdhr  0/1    Terminating                0         131m
admdesl-78fc6f5fc9-xqflf  0/1    CreateContainerError       0         10m
admdesl-78fc6f5fc9-xrqjv  1/1    Running                    0         10m
admdesl-78fc6f5fc9-xrrwx  0/1    Terminating                0         21m
admdesl-78fc6f5fc9-xs79k  0/1    Terminating                0         21m

This would happen whenever a Node was added to or removed from the cluster (which happens often due to auto-scaling).

As such, I kept increasing requested pod resources until I ended up with the following configuration for this deployment:

resources:
  requests:
    memory: '150Mi'
    cpu: '100m'
  limits:
    memory: '250Mi'
    cpu: '500m'

With this configuration the cluster was running smoothly, but it meant that even idle Pods were pre-allocated more CPU time than they needed. This is the reason why the average CPU usage per Node was low. However, I didn’t know what the solution was (reducing requested resources resulted in a hectic cluster state and outages), and as such I rolled with a variation of generous resource allocation for all the deployments.
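The effect of the CPU request on packing density can be sketched with a little arithmetic. The scheduler places pods based on their requests, not their actual usage, so the request sets an upper bound on pods per node. The sketch below assumes the whole vCPU (1000m) is schedulable, which overstates things slightly because real nodes reserve some CPU for system components; the function names are illustrative.

```python
NODE_CPU_M = 1000  # assumption: the whole vCPU of a 1 vCPU node is schedulable

def pods_per_node(request_m, node_cpu_m=NODE_CPU_M):
    """Upper bound on pods per node, judged by CPU requests alone."""
    return node_cpu_m // request_m

def unused_share(request_m, usage_m):
    """Fraction of a pod's CPU reservation that sits unused at a given usage."""
    return max(0.0, 1 - usage_m / request_m)

for request_m in (20, 100):
    print(
        f"request {request_m}m: at most {pods_per_node(request_m)} pods per node, "
        f"{unused_share(request_m, 5):.0%} of the reservation unused by a 5m idle pod"
    )
```

With a 100m request, a pod idling at around 5m leaves about 95% of its reservation unused, which is consistent with the low per-Node CPU usage described above.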

Back to my question:

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs 1x n1-standard-96 (96 vCPUs) VM, or 6x n1-standard-16 (16 vCPUs) VMs?

For starters, there is no price-per-vCPU difference between n1-standard-1 and n1-standard-96. Therefore, I reasoned that using a machine with fewer vCPUs is going to give me more granular control over the price.
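The "granular control" reasoning can be put into numbers: how large is one autoscaling step relative to the cluster? A minimal sketch, assuming the 600 vCPU business-hours size mentioned earlier as the baseline; the function name is illustrative.

```python
BASELINE_VCPUS = 600  # normal business-hours cluster size quoted earlier

def step_share(node_vcpus, cluster_vcpus=BASELINE_VCPUS):
    """Share of the cluster added or removed when one node joins or leaves."""
    return node_vcpus / cluster_vcpus

for node_vcpus in (1, 16, 96):
    print(f"one {node_vcpus:>2} vCPU node = {step_share(node_vcpus):.1%} of the cluster")
```

At this cluster size a single 16 vCPU node is still under 3% of capacity, while a 96 vCPU node is a 16% step.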

The other consideration I had was how fast the cluster will auto-scale, i.e. if there is a sudden surge, how fast the cluster autoscaler can provision new nodes for the unscheduled pods. This was not a concern though: our resource requirements grow and shrink gradually.

And so I went with mostly 1 vCPU nodes, the consequences of which I described in Premise.

In retrospect, it was an obvious mistake: distributing pods across nodes with a single vCPU does not allow efficient resource utilisation as individual deployments change between idle and active states. Put another way, the more vCPUs you have on the same machine, the more tightly you can pack pods, because when a portion of the pods go over their requested resources, there are readily available resources on the same machine for them to take.
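The pooling effect can be illustrated with a simple probability model. This is a sketch under loud assumptions, not a measurement: each pod is independently active 30% of the time (from the "70%+ idle" figure), uses 5m when idle and 200m when active (the limit from the first configuration), nodes are packed at a hypothetical 12 pods per vCPU, and the whole vCPU is available to pods.

```python
from math import comb

def overload_probability(pods, capacity_m, idle_m=5, burst_m=200, p_active=0.3):
    """P(total CPU demand > node capacity) under a binomial model of active pods."""
    prob = 0.0
    for active in range(pods + 1):
        demand_m = idle_m * (pods - active) + burst_m * active
        if demand_m > capacity_m:
            # Probability that exactly `active` pods are busy at the same moment.
            prob += comb(pods, active) * p_active**active * (1 - p_active) ** (pods - active)
    return prob

PODS_PER_VCPU = 12  # hypothetical packing density, identical for both node sizes

small = overload_probability(pods=PODS_PER_VCPU * 1, capacity_m=1_000)
large = overload_probability(pods=PODS_PER_VCPU * 16, capacity_m=16_000)

print(f"1 vCPU node, 12 pods:    overloaded {small:.1%} of the time")
print(f"16 vCPU node, 192 pods:  overloaded {large:.2%} of the time")
```

At the same density, the single-vCPU node runs out of CPU roughly a quarter of the time in this model, while the 16 vCPU node does so far less often, because simultaneous bursts average out across many more pods.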

What worked:

I switched to 16 vCPU machines because they provide a balance between fine-grained resource control when auto-scaling the cluster and sufficient resources per machine to enable tight scheduling of pods that are going through idle and active states.
I used a resource configuration that requests only marginally more than the resources needed during an idle state, but has generous limits. This allows many pods to be scheduled on the same machine while the majority of them are idle, but still allows resource-intensive bursts.
I switched to the n2 machine type: n2 machines are more expensive, but they have a 2.8 GHz base frequency (compared with the ~2.2 GHz available to n1-* machines). We are taking advantage of the higher clock frequency to process resource-intensive tasks as fast as possible and return pods to the idle state described earlier as quickly as possible.
