You Don’t Need an HA Control Plane
Let me preface this post by expressly and emphatically stating these are my opinions, and most definitely not those of my employer.
Note: in an effort to curb unintentionally hostile language, below I refer to master nodes as controller nodes.
Let me first give you a rather un-nuanced opening statement: no, you don’t need an HA control plane for your Kubernetes clusters.
To understand why I’m making such a claim, we first need to lay out a couple of axioms:
- Control plane HA does not directly influence or guarantee your applications’ availability
- Kubernetes worker nodes are designed to continue operating without any controller nodes for a considerable time (say, half an hour, for the sake of argument) with no disruption to the applications and services running on them (a quick demonstration follows this list)
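A quick way to convince yourself of that second axiom on a disposable, single-controller test cluster. This is a sketch, not a runbook: it assumes a kubeadm-provisioned cluster (where the API server runs as a static pod from `/etc/kubernetes/manifests`), and the NodePort app and `<worker-ip>` are hypothetical placeholders.

```sh
# On the (single) controller node: move the API server's static pod
# manifest aside, so the kubelet tears the API server down.
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# The control plane is now gone; kubectl can no longer connect...
kubectl get nodes                # -> "connection refused"

# ...yet pods already running on the workers keep serving, and
# kube-proxy's existing rules keep routing Service traffic.
curl http://<worker-ip>:30080/   # hypothetical NodePort app still answers

# What you lose while the control plane is down: scheduling of *new*
# pods, rescheduling after failures, kubectl, and reconciliation --
# not the containers that are already running.

# Put the control plane back:
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
```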
Presently there is quite a bit of confusion and a fundamental misunderstanding of what the role of the controller nodes (i.e. the control plane) is, and of how Kubernetes was built with fault tolerance as one of its tenets. Part of it is old, legacy, cover-your-butt corporate mentality (“more redundancy is better”); part of it is deliberate, coming from vendors whose bottom line benefits from market confusion; but a big part of it is that, until relatively recently, the Kubernetes documentation left a lot to be desired in many areas (fortunately, this is changing rapidly for the better).
So what’s the role of controller nodes? At a very high level, I like to think of them as “state reconcilers”: they are in charge of making sure that whatever description of your applications’ world you provided is realized and maintained. Whether it’s the number of replicas, affinity rules, or the container image, the control plane makes sure each workload is scheduled where it should be, with the correct cluster permissions, controllers, access, and so forth. From there on, it runs a control loop to keep the actual state matching the description you provided.
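You can watch that loop in action with a minimal sketch (the deployment name and image are placeholders I’m making up here): declare a desired state, break it on purpose, and let the control plane restore it.

```sh
# Declare the desired state: three replicas of a hypothetical web app.
kubectl create deployment web --image=nginx --replicas=3

# Break the actual state on purpose by deleting one of the pods.
kubectl delete "$(kubectl get pods -l app=web -o name | head -n 1)"

# Watch the Deployment controller notice the drift and spin up a
# replacement, bringing the cluster back to the declared three replicas.
kubectl get pods -l app=web --watch
```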
Sure, there’s a lot of technical and theoretical nuance I’m leaving aside here. But, in a way, that’s the point: you don’t need to know that level of detail to make an educated decision about whether your cluster actually requires an HA control plane (hint: very likely not).
Back to my initial statements: having 3, 5, or 7 controller nodes only improves the SLA of the control plane itself, not of your application. If your application is under-provisioned, can’t handle the workload, can’t handle the number of requests, and so forth, there is nothing the controller nodes will do unless you expressly specified ahead of time how, when, or whether it should scale out or otherwise fix the situation, or unless you change the description of the world after the fact. You need to deliberately architect, design, and implement your application (and the description of the world it needs to live in) to meet whatever SLA/SLO you have established. Having more controller nodes won’t magically make you a systems-architecture superstar. In short, you can’t Kubernetes your way to application high availability and fault tolerance.
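For instance, “expressly specifying how and when it should scale out” can be as simple as attaching a HorizontalPodAutoscaler to the hypothetical deployment above. A sketch, assuming a metrics server is installed and with made-up thresholds:

```sh
# Tell the control plane *ahead of time* how to fix an overload:
# keep CPU around 70% by scaling between 3 and 10 replicas.
kubectl autoscale deployment web --min=3 --max=10 --cpu-percent=70

# Inspect the resulting HorizontalPodAutoscaler.
kubectl get hpa web
```

Note that this is still you doing the architecture work: the control plane only enforces a policy you wrote down; it doesn’t invent one.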
Now, these aren’t just armchair opinions. I have seen and worked with operations teams whose Kubernetes cluster was pushing over 1 PB (yes, a petabyte) of web traffic per month with very small controller nodes (literally t2.xlarge AWS instances) which, by the way, they often took offline for upgrades and tweaking without any disruption to their application or users. I’ve also seen and worked with operations teams running complex CI/CD pipelines for 50-plus developers on a single controller node.
So, when is an HA control plane actually needed? There are a fair number of cases. For example, it’s a necessity in highly dynamic environments: aggressive continuous-delivery schedules, nodes that come and go based on demand, large clusters (say, 50-plus worker nodes). But for the “80% case,” an HA control plane means additional cost and complexity for very little benefit in terms of application availability. If you do land in the minority that needs one, a sketch of what the setup involves follows below.
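For completeness, here is roughly what a stacked-etcd HA control plane looks like with kubeadm. This is a sketch under assumptions: the load-balanced endpoint `k8s-api.example.internal:6443` is a placeholder you’d have to provide, the `<token>`, `<hash>`, and `<key>` values come from the real `kubeadm init` output, and you’d want an odd number of controller nodes (3 or 5) to keep etcd quorum.

```sh
# On the first controller node: point the cluster at a load-balanced
# API endpoint instead of a single machine, and upload the control
# plane certificates so additional controllers can fetch them.
sudo kubeadm init \
  --control-plane-endpoint "k8s-api.example.internal:6443" \
  --upload-certs

# On each additional controller node: join as a control plane member
# (kubeadm init prints the exact join command with the real values).
sudo kubeadm join k8s-api.example.internal:6443 \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>
```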