When running applications in Kubernetes, it is easy to take for granted all the great HA functionality that comes out of the box. If a node running pods goes down, those pods will automatically be rescheduled on another node if one is available. As a developer there is nothing you need to actively do to achieve this.
For a lot of applications this might be more than enough, but when your project reaches a certain size or level of complexity you might find it lacking. SLAs might come into play, and you will want to make sure that your services can handle a variety of different disruptions.
This guide is written with OpenShift as the platform of choice, but should be applicable to most Kubernetes distributions.
In cases where the provided examples are not extensive enough, please see the documentation referenced in each section.
The first recommendation is the simplest: run multiple copies (replicas) of your application.
If your application supports it, you should ensure that it runs with at least 2 replicas. The number of replicas should be chosen based on how many underlying nodes are available. Scaling up to a hundred replicas when you only have a single node available is pretty pointless from an HA perspective.
In a <highlight-mono>Deployment<highlight-mono> this can be set in <highlight-mono>.spec.replicas<highlight-mono>:
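A minimal sketch of what this could look like (the image and tag are arbitrary):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
```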
Kubernetes will now ensure that we always have 3 replicas of our nginx pod. However, it does not take the availability of the pods into account. We will cover how to handle this in the next section.
A <highlight-mono>PodDisruptionBudget<highlight-mono> (pdb) protects your pods from voluntary disruption, which can happen if nodes are drained for maintenance or during upgrades.
As the name suggests, it does this by creating a budget. You are essentially telling Kubernetes how many pods you can afford to lose.
You can create a pdb like this:
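A sketch, matching the nginx Deployment from the previous section:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
```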
This pdb would ensure that there are always 2 pods matching the <highlight-mono>app=nginx<highlight-mono> label available.
In our example this would target these pods:
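(The pod names below are just illustrative.)

```shell
$ kubectl get pods -l app=nginx
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7bf8c77b5b-4kd9x   1/1     Running   0          2m
nginx-7bf8c77b5b-9zltp   1/1     Running   0          2m
nginx-7bf8c77b5b-kq2vn   1/1     Running   0          2m
```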
If two of your pods happen to be scheduled on the same node and it gets drained for maintenance, Kubernetes will ensure that after evicting one of the pods, the other will not be evicted before the first one is up and running again (so that there are always 2 pods available).
To check the status of a pdb:
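(The output shown here is illustrative.)

```shell
$ kubectl get pdb nginx-pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               1                     5m
```

The <highlight-mono>ALLOWED DISRUPTIONS<highlight-mono> column shows how many pods can currently be evicted voluntarily without violating the budget.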
Availability must be specified with either <highlight-mono>.spec.minAvailable<highlight-mono> or <highlight-mono>.spec.maxUnavailable<highlight-mono>. Both values can be set either as an integer or as a percentage.
A pdb works on the following resources:
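- <highlight-mono>Deployment<highlight-mono>
- <highlight-mono>ReplicaSet<highlight-mono>
- <highlight-mono>StatefulSet<highlight-mono>
- <highlight-mono>ReplicationController<highlight-mono>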
Note: If you set <highlight-mono>minAvailable: 100%<highlight-mono>, that means the same as <highlight-mono>maxUnavailable: 0%<highlight-mono>. This leaves no room for any voluntary disruption and will prevent your pods from ever being evicted, making life difficult for your administrator.
Said another way: <highlight-mono>disruptionsAllowed<highlight-mono> should never be 0. This happens, for example, if you set <highlight-mono>minAvailable<highlight-mono> to 2 when your application only runs with 2 replicas.
The point of a pdb is to tell Kubernetes that a certain number of pods can go down without majorly affecting your application: a sort of compromise between the developer and the administrator.
Kubernetes will try to spread pods across nodes based on resource usage by default. This can be customized by using a scheduling profile[link] but we will not cover that in this guide.
By specifying pod affinity, you ensure certain pods run on the same node(s).
Two types of affinity can be used:
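- <highlight-mono>requiredDuringSchedulingIgnoredDuringExecution<highlight-mono>: a hard requirement; the pod will not be scheduled unless the rule is satisfied.
- <highlight-mono>preferredDuringSchedulingIgnoredDuringExecution<highlight-mono>: a soft preference; the scheduler will try to satisfy the rule but will still schedule the pod if it cannot.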
You can use either one or both of these. Here is an example from the Kubernetes docs of a pod with affinity:
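It combines a required pod affinity rule with a preferred pod anti-affinity rule, and looks roughly like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: security
                  operator: In
                  values:
                    - S2
            topologyKey: topology.kubernetes.io/zone
  containers:
    - name: with-pod-affinity
      image: registry.k8s.io/pause:2.0
```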
What we are interested in, however, is making sure our pods are spread across nodes to maximize availability, without blindly trusting the default scheduling. For this we will configure pod anti-affinity.
Anti-Affinity is defined in the <highlight-mono>Pod<highlight-mono> spec:
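A sketch, using the soft (<highlight-mono>preferred<highlight-mono>) form so that more replicas than nodes can still be scheduled:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname
```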
Since we rarely deploy pods individually, let's add that to our <highlight-mono>Deployment<highlight-mono> from the previous section:
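```yaml
# Same nginx Deployment as before, now with pod anti-affinity added to the pod template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: nginx
                topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
```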
Let's see what this would look like in practice.
In this environment we have 3 nodes available:
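(The node names and versions below are purely illustrative.)

```shell
$ kubectl get nodes
NAME       STATUS   ROLES    AGE   VERSION
worker-1   Ready    worker   90d   v1.28.3
worker-2   Ready    worker   90d   v1.28.3
worker-3   Ready    worker   90d   v1.28.3
```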
Since we are using <highlight-mono>topologyKey=kubernetes.io/hostname<highlight-mono> in our Anti-Affinity configuration, we can expect each of our 3 pods to end up on a different node.
Checking the pods in our deployment, this is indeed the case:
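```shell
# Some columns omitted; pod names are illustrative
$ kubectl get pods -l app=nginx -o wide
NAME                     READY   STATUS    NODE
nginx-7bf8c77b5b-4kd9x   1/1     Running   worker-1
nginx-7bf8c77b5b-9zltp   1/1     Running   worker-2
nginx-7bf8c77b5b-kq2vn   1/1     Running   worker-3
```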
Even after doubling the number of replicas, the pods are spread out nice and evenly:
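```shell
# Again with illustrative names, now with 6 replicas
$ kubectl get pods -l app=nginx -o wide
NAME                     READY   STATUS    NODE
nginx-7bf8c77b5b-4kd9x   1/1     Running   worker-1
nginx-7bf8c77b5b-6m8wd   1/1     Running   worker-1
nginx-7bf8c77b5b-9zltp   1/1     Running   worker-2
nginx-7bf8c77b5b-hv6fj   1/1     Running   worker-2
nginx-7bf8c77b5b-kq2vn   1/1     Running   worker-3
nginx-7bf8c77b5b-xw4rc   1/1     Running   worker-3
```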
If your nodes are labeled with availability zone / datacenter you could also use that to spread pods if preferred. The label usually used for this is <highlight-mono>topology.kubernetes.io/zone<highlight-mono>.
Nodes should be spread across multiple datacenters (or zones). This helps avoid downtime in the event of an infrastructure failure on one datacenter.
Topology is reflected inside Kubernetes with the previously mentioned <highlight-mono>topology.kubernetes.io/zone<highlight-mono> label.
Health checks help Kubernetes understand the state of your application. Without health checks, Kubernetes will only know that your application is unhealthy if the container process crashes (exits with a non-zero exit code).
There are three different types of probes: startup, readiness and liveness.
Depending on your application you might want to use all, some or none of them. It is up to you to consider.
Here's a useful diagram that visualizes the three different probes:
You can configure probes to use different types of tests:
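- <highlight-mono>httpGet<highlight-mono>: performs an HTTP GET request against the container and expects a successful response.
- <highlight-mono>tcpSocket<highlight-mono>: checks that a TCP port is open.
- <highlight-mono>exec<highlight-mono>: runs a command inside the container and checks that it exits with code 0.
- <highlight-mono>grpc<highlight-mono>: performs a gRPC health check, for applications that implement the gRPC health checking protocol.

As a sketch, a container with a readiness and a liveness probe could look like this (the image, port and <highlight-mono>/healthz<highlight-mono> endpoint are assumptions):

```yaml
    containers:
      - name: app
        image: my-app:1.0          # hypothetical application image
        ports:
          - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz         # assumed health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
```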
Being able to recover fast after an incident is key.
Usually in the event of an infrastructure failure, Kubernetes will recover automatically as soon as the underlying problem is fixed.
However, in the rare case where a cluster would have to be recreated or migrated from, there are a few things you can do to reduce the time spent on recovery.
One of the best ways to recover quickly in Kubernetes is by using GitOps / Infrastructure-as-Code.
By having your resources and even your entire cluster defined in Git (be it with Helm, Kustomize or just plain YAML files), you always have the blueprint for how to host your application if you should ever need to deploy it somewhere else.
Argo CD and Flux are some good examples of open source projects that can be used to do GitOps in Kubernetes.
If you have all your Kubernetes resources in Git, that can be considered a sort of backup and you can pat yourself on the back.
If your database is running in Kubernetes this alone is of course not enough, so you need to make sure that you take proper backups if you value your data. The same applies if you have any important data in persistent volumes.
There are a few ways to do this, here is one example:
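As one example, assuming a PostgreSQL database and an existing PersistentVolumeClaim named <highlight-mono>db-backup<highlight-mono> (all names here are hypothetical), a nightly <highlight-mono>CronJob<highlight-mono> running <highlight-mono>pg_dump<highlight-mono> could look something like this:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 2 * * *"              # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - pg_dump -h db -U app appdb > /backup/appdb-$(date +%F).sql
              env:
                - name: PGPASSWORD   # assumes a Secret holding the database password
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: db-backup
```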
You can also use tools like Velero to back up all resources in your Kubernetes cluster including volumes.
A very important factor in HA is to not overload the nodes your pods are running on. If you do not take preventative steps, a memory leak in one of your applications for example could bring down your entire production environment.
This section describes how resources and scheduling works and how you can reserve CPU and memory for your workloads.
The Kubernetes documentation describes scheduling like this:
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that Kubelet can run them.
A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on.
Because there is no way for the Kubernetes scheduler to know how many resources your application is going to use before it is scheduled, you need to provide this information up front. If you don't, there is nothing stopping a pod from using all the available resources on a node.
There are two pieces of information that have to be provided for both CPU and memory:
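- Requests: the amount of CPU or memory that is reserved for the container. The scheduler uses this to decide which node the pod fits on.
- Limits: the maximum amount of CPU or memory the container is allowed to use.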
Memory is specified in bytes. You can use a plain integer or one of these suffixes: E, P, T, G, M, k. As the Kubernetes documentation mentions, you can also use the equivalent power-of-two suffixes: Ei, Pi, Ti, Gi, Mi, Ki.
You need to be careful with the casing of suffixes, as warned about in the documentation:
If you request <highlight-mono>400m<highlight-mono> of memory, this is a request for 0.4 bytes. Someone who types that probably meant to ask for 400 mebibytes (<highlight-mono>400Mi<highlight-mono>) or 400 megabytes (<highlight-mono>400M<highlight-mono>)
CPU resources are specified in cores, which can be written like this: <highlight-mono>1.0<highlight-mono>, <highlight-mono>0.5<highlight-mono>, <highlight-mono>100m<highlight-mono>. The <highlight-mono>m<highlight-mono> suffix means millicore (or millicpu), a thousandth of a core.
To quote the Kubernetes documentation once again:
In Kubernetes, 1 CPU unit is equivalent to 1 physical CPU core, or 1 virtual core, depending on whether the node is a physical host or a virtual machine running inside a physical machine.
For a node with 4 CPUs, we have 4 cores, which also means 4000 millicores in total.
In a Pod spec, requests and limits can be set in <highlight-mono>.spec.containers[].resources<highlight-mono>:
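For example (adapted from the Kubernetes documentation; the image name is just a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
```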
Let's have a closer look:
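First, the requests:

```yaml
        requests:
          memory: "64Mi"
          cpu: "250m"
```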
This tells the scheduler to reserve at least 64 mebibytes of memory and 250 millicores for the <highlight-mono>app<highlight-mono> container in the <highlight-mono>frontend<highlight-mono> pod.
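Then the limits:

```yaml
        limits:
          memory: "128Mi"
          cpu: "500m"
```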
Here we tell Kubernetes to stop the <highlight-mono>app<highlight-mono> container from using more than 128 mebibytes of memory or 500 millicores. If the container exceeds its memory limit it will be terminated, and CPU usage above the limit will be throttled. Without a limit, a container could potentially consume so much CPU or memory that it affects the performance of the underlying node (worst case bringing it to a halt).
<info>When setting limits, Kubernetes is actually using something called CFS to throttle the container from using more CPU than the set limit. For some workloads this can have a significant impact on pod startup and response times. There are a lot of good blog posts on this topic, so if you struggle with it I suggest you take a look at the links below for more info.<info>
If you want to avoid setting requests and limits on all your deployments individually, you can create a <highlight-mono>LimitRange<highlight-mono> for an entire namespace.
This will add default limits and/or requests to all pods in the namespace that do not specify their own.
A simple <highlight-mono>LimitRange<highlight-mono> looks like this:
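For instance (the values here are arbitrary):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
spec:
  limits:
    - type: Container
      default:              # default limits for containers without their own
        cpu: 500m
        memory: 128Mi
      defaultRequest:       # default requests for containers without their own
        cpu: 250m
        memory: 64Mi
```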
The naming used here is a bit confusing, so just to clarify:
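- <highlight-mono>default<highlight-mono> sets the default limits for containers that do not specify their own.
- <highlight-mono>defaultRequest<highlight-mono> sets the default requests.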
Note: This only applies to OpenShift
For environments with extreme HA needs (>=99.9% SLA), one can consider setting nodes up for canary updates.
This allows an administrator to perform updates on a specific set of nodes within a set maintenance window. It also makes it possible to move pods away from the nodes that are about to be updated in a controlled manner.
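As a sketch based on the OpenShift documentation on canary rollouts (the pool name and labels are examples), you can create a custom <highlight-mono>MachineConfigPool<highlight-mono> for the nodes in question and keep it paused outside of your maintenance window:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: workerpool-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values:
          - worker
          - workerpool-canary
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/workerpool-canary: ""
  paused: true        # unpause during the maintenance window to let the update proceed
```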
If you want to go to the next step in terms of High Availability that would probably involve some sort of multi/hybrid cloud solution with multiple Kubernetes clusters running in different clouds, maybe even connected with a service mesh.
That is out of scope for this guide but if we ever go down that path in the future then who knows, a part two might show up.
That sums up this guide to High Availability in Kubernetes.
Much of this is based on my personal experience managing Kubernetes for many years, in addition to a lot of best practices found in the official documentation of both OpenShift and Kubernetes.
If you have any feedback, please don't hesitate to send it to me at stian.froystein@intility.