Initially, our Kubernetes platform was built around a single, large cluster. In this model, while each tenant's workloads operated on dedicated nodes, they shared a common control plane. Network segmentation was achieved by assigning nodes to specific tenants, with pod scheduling restricted to designated namespaces through node selectors. Access to these namespaces was controlled by RBAC, and only cluster administrators could modify node selector labels. For example, tenant-a had exclusive access to namespace-a, which utilized node-a-1 and node-a-2, whereas tenant-b utilized namespace-b backed by node-b-1 and node-b-2.
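As a rough sketch of how this looked in practice (names and the tenant group below are illustrative, not our actual configuration), a tenant namespace could be pinned to its dedicated nodes with the `openshift.io/node-selector` annotation, while an RBAC RoleBinding granted the tenant's group access to that namespace only:

```yaml
# Illustrative sketch - namespace, label, and group names are hypothetical
apiVersion: v1
kind: Namespace
metadata:
  name: namespace-a
  annotations:
    # OpenShift schedules all pods in this namespace onto nodes carrying this label
    openshift.io/node-selector: "tenant=tenant-a"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-edit
  namespace: namespace-a
subjects:
  - kind: Group
    name: tenant-a-developers   # hypothetical tenant group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                    # built-in aggregated role
  apiGroup: rbac.authorization.k8s.io
```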
This configuration offered several advantages that were particularly appealing during our early adoption phase of Kubernetes:
However, as the scale and complexity of workloads have grown, the limitations of a single shared cluster have become increasingly apparent:
Additionally, our shared cluster platform was deployed on top of our pre-existing virtualization stack. While generally effective, this setup made it difficult to run clusters without deep integration with the underlying vendor technology.
In response to these challenges, we have developed the Intility Developer Platform 2.0, or Platform 2.0 for short, which represents a significant evolution in our approach to Kubernetes infrastructure.
Platform 2.0 is built on a robust bare-metal OpenShift cluster, referred to as the "hub cluster". This foundational cluster hosts multiple tenant clusters using two key technologies: OpenShift Hosted Control Planes (HyperShift) and OpenShift Virtualization (KubeVirt).
OpenShift is Red Hat's enterprise distribution of Kubernetes and, as always, our platform of choice.
Hosted Control Planes (HCP, also known as HyperShift) was introduced as a Technology Preview in OpenShift 4.13 and is now generally available and fully supported.
HCP allows us to run the Kubernetes control plane components, such as the API server, etcd, and controller manager, as managed services. This model significantly reduces the overhead associated with maintaining separate control planes for each tenant cluster, enhancing our ability to scale while reducing resource consumption and provisioning time. Hosting clusters in this manner also simplifies management at scale, as they all exist as native Kubernetes resources within the hub cluster.
HCP-based clusters are fully compliant OpenShift clusters.
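To illustrate what this looks like on the hub, the two central resources are a HostedCluster (the control plane) and a NodePool (the workers). The sketch below is abbreviated and the values are illustrative; fields have been trimmed, so consult the HyperShift API reference for the full schema:

```yaml
# Abbreviated, illustrative example - not a complete or production spec
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: tenant-a
  namespace: clusters
spec:
  release:
    image: quay.io/openshift-release-dev/ocp-release:<version>  # desired OpenShift release
  platform:
    type: KubeVirt            # worker nodes run as KubeVirt VMs on the hub cluster
  pullSecret:
    name: tenant-a-pull-secret
  sshKey:
    name: tenant-a-ssh-key
---
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: tenant-a-workers
  namespace: clusters
spec:
  clusterName: tenant-a
  replicas: 3
  platform:
    type: KubeVirt
    kubevirt:
      compute:
        cores: 4
        memory: 16Gi
```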
OpenShift Virtualization (KubeVirt) is a technology that enables running virtual machines in Kubernetes and managing them like any other resource. This declarative approach to handling VMs makes it easy to provision and scale nodes for each cluster, and it integrates seamlessly with Hosted Control Planes.
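Below is a simple example of a KubeVirt VirtualMachine manifest. It is a minimal, illustrative sketch that boots a public containerDisk image; the node VMs on our platform are created and managed by HyperShift rather than written by hand:

```yaml
# Minimal illustrative example - not one of our actual node templates
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: example-vm
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 2Gi
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest  # public example image
```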
KubeVirt enables the use of Kubernetes as a hypervisor for virtual machines; combined with HCP, this transforms Kubernetes into a hypervisor for Kubernetes clusters.
If you had asked a Kubernetes admin five years ago whether running a cluster on bare metal was a good idea, their answer would likely have been no. However, thanks to projects like Metal3, the Assisted Installer, MetalLB, and many others, it is now a completely viable strategy. Since we are going to run the hosted clusters' nodes as VMs in the hub cluster, bare metal was the obvious choice.
A bare-metal Kubernetes cluster serves as the foundation of the platform; as mentioned, we call this the hub cluster. It is spread across three sites in Intility's existing data center infrastructure, and almost all automation and integration on the platform passes through this layer in some way.
The hub cluster will run operators such as Multi-Cluster Engine, OpenShift Data Foundation/Rook, KubeVirt, HCP, and Crossplane to fulfill its main purpose: hosting other Kubernetes clusters, along with their nodes, in the form of KubeVirt VMs.
Inside the hosted clusters, we have standardized on Cilium as the default CNI. Cilium is built on eBPF and gives us detailed, real-time visibility into all traffic in a cluster with negligible performance overhead. This visibility is crucial for both troubleshooting and optimizing network performance.
Cilium also supports advanced network policies, which are essential for ensuring the security and compliance of our multi-tenant environment. With Cilium, we can define fine-grained network policies that control the flow of traffic between different parts of the cluster, ensuring that each tenant’s data remains isolated and secure.
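As an example of the kind of policy this enables, the following illustrative CiliumNetworkPolicy only allows a tenant's frontend pods to reach its backend pods on a single port; the namespace and labels are hypothetical:

```yaml
# Illustrative policy - namespace and labels are hypothetical
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: tenant-a
spec:
  endpointSelector:
    matchLabels:
      app: backend          # policy applies to backend pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend   # only frontend pods may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
```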
Our new platform leverages Kubernetes to enhance automation and scalability, with Crossplane playing a pivotal role. By integrating Crossplane, we can automate complex tasks and centralize management, significantly simplifying the deployment and maintenance processes.
Crossplane allows us to define a single resource that encapsulates all the necessary components for a deployment, abstracting away the complexity. For instance, deploying a cluster can be as simple as defining a single YAML resource:
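The exact shape of this resource is defined by our own Crossplane compositions, so the sketch below is purely illustrative; the API group, kind, and fields are hypothetical stand-ins rather than our actual schema:

```yaml
# Hypothetical claim - the real API group, kind, and fields differ
apiVersion: platform.example.io/v1alpha1
kind: ManagedCluster
metadata:
  name: tenant-a
spec:
  openshiftVersion: "4.15"   # desired OpenShift release
  nodeCount: 3               # number of KubeVirt-backed workers
  nodeSize: medium           # abstract sizing tier
  network:
    cidr: 10.128.0.0/14
```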
This single definition triggers the deployment of all underlying resources required for a cluster.
A complete Intility Managed Cluster requires close to 20 different resources, both inside and outside the hub cluster, to meet our standard; with Crossplane we are able to simplify this significantly.
Hosted clusters will be equipped with a suite of security tools, integrated by Intility into what we call the Threat Detection Agent. This agent monitors abnormal and suspicious behavior in the Linux kernel and in audit logs from the nodes, generating alarms that are processed by AI. Examples of such behavior include unexpected spikes in network traffic, unauthorized access attempts, and unusual resource consumption.
The Threat Detection Agent gathers more information about the affected container and, when a high likelihood of a threat is detected, implements mitigation measures. For instance, it can automatically isolate or delete affected containers to prevent further damage. In other cases, the agent suggests possible mitigation measures, such as applying patches or adjusting network policies. The Developer Portal will display the alarms alongside detailed reports from the agent.
Even though all the new technology backing the platform is very exciting for an administrator, from the user's perspective it might all be irrelevant. Why would you want a cluster managed by Intility in a solution like this when you could get one with a few clicks in DigitalOcean or GCP?
As an end user, you might not care that your control plane is hosted or that the nodes are actually KubeVirt VMs running in an underlying Kubernetes cluster – you just want your cluster to work, with the ability to create and destroy clusters with ease if need be.
To truly add value on top of all this, we have built a management interface with unique features that simplify and streamline cluster operations. It offers intuitive controls, real-time monitoring, and automated maintenance tasks, ensuring that your clusters are not only easy to deploy but also effortless to manage. With built-in scalability, security enhancements, and seamless integration with your existing workflows, we aim to provide a robust, user-friendly solution that meets your needs and exceeds your expectations.