16 Jan, 2023

Kubernetes Operators: When not to create one

Over the past few years in the realm of DevOps and Kubernetes in particular, Kubernetes Operator pattern has been a trending topic. In my personal experience, while working with different projects for different applications, it was evident that some of the software development teams, or organizations were too quick to jump on the Kubernetes Operator bandwagon, without analyzing the real-world problem they were trying to solve through the implementation of a Kubernetes Operator.

More often than not, the implementation of a kubernetes operator was done only because it was understood as the most trending topic or the implementation pattern in the kubernetes domain at that point. I have seen certain software development teams going to the extent where they implement different wrapper CRDs, or controllers around an open source kubernetes operator, so that certain organizational practices and standards were encapsulated to these wrapper controllers while giving the organizations the ability to deploy and use the open-source Kubernetes Operator in their application stack. If simply put, all that cost, time and effort from the software developers were invested in these kinds of implementations only to give the ability to use a specific, well known Kubernetes Operator in their application stack, while they had completely forgotten why anyone should really have a Kubernetes Operator or what actual problems a Kubernetes Operator is supposed to solve fundamentally in your application.

The purpose of this article is not to criticize the use of Kubernetes Operator pattern or manifest it as a bad practice. It is in fact quite the opposite. Kubernetes Operator pattern is a highly useful concept in a Kubernetes stack, which helps devops engineers or software developers to tackle certain complexities of the application lifecycle while deployed on a Kubernetes platform. What we rather intend to illustrate through this article are the actual real world use cases that Kubernetes Operator pattern is supposed to solve, by first discussing the use cases which might not be applicable for it.

Let us first look at what is meant as an operator pattern in Kubernetes. There are three key points we should aim to suffice through an operator implementation.

It should be a piece of software that automates a repeatable task of a stateful application and replaces the necessity of a human operator.
It should be a software extension to the Kubernetes API that makes use of a Custom Resource.
It should follow the Kubernetes principals, mainly the Control Loop.

While all three points above have an equal importance, we think the most important point out of three is the first one.

Now let us jump into our main topic, when should we not create an operator, or if we try to rephrase it in a more diplomatic way, “When should we rethink our decision to write an operator”.

Is my application a stateful application?

Even though official documentation for Kubernetes on operator pattern does not strictly mention about the application you are building the operator for, to be a stateful application, what we should understand is, an application which does not have a state will not really require some controller to handle its deployment lifecycle. Notice how we have used the term “controller” here instead of “operator”. We will discuss this in a moment.

The simple reason being, if a particular application does not have a state, what it really means is you can pretty much use the native features of Kubernetes such as a deployment controller to handle the full life cycle of that application. But when an application is a stateful application then there is a possibility that you cannot freely replace a given replica of that particular application instance with a new replica (like it is supposed to happen in the kubernetes world all the time).

There is a chance some additional work such as leader elections, handling checkpoints, managing quorums, or restoring a backup is there that must be done when bringing up a new replica. Now, some of these tasks may not essentially be a part of the application code itself, maybe these are some manual steps that a human operator must execute. So, this is where the real use of an operator on kubernetes can pay off well. You can write a piece of software to encapsulate all that domain / application specific logic and then combine it to the Kubernetes API as an extension.

Where do the CRDs and Control Loop come into play in this context? It is quite simple, CRDs pave the way for the end user to declare the desired state of the application that the operator has to maintain. End user will create / update a CR declaring the spec of the desired state, and then through the control loop, the operator will try to bring the application to that desired state. It will also report back the status of the application to the same CR.

Does this mean, we cannot write a similar operator to a stateless application? You absolutely can. However, in that case this should rather be considered as a controller not an operator. Also, if you are thinking to write such a controller to manage the lifecycle of a stateless application, it would rather be an overkill because there should be other means to achieve the same thing just using the native Kubernetes API without having to extend it with a whole new set of CRDs. Or this could even mean that your actual problem must be lying somewhere else, and you are misusing the operator pattern simply because you either want to use a Kubernetes object to handle a non-kubernetes resource in your application or you are trying to fix a configuration management problem through it.

Any code is a liability

There are many frameworks available now to make the implementation of an operator an easy task. However, it still requires a certain effort from the developers to understand the complexity of the application and the actual business requirement you need to address through the operator’s control loop.

It is also seen now-a-days that certain teams, organizations develop operators for 3rd party open-source applications that are not owned by these teams or organizations. The logic written into an operator is an imperative workflow that couples your application’s business logic into the Kubernetes control loop. This may initially not look concerning to anyone, however the team that develops the operator will eventually be responsible to maintain the operator codebase to support the potential changes in the actual application that can impact its deployment lifecycle. Even though writing an operator is not a difficult task now with the availability of different frameworks, it will still be something more complex than writing a piece of configuration to support deployment lifecycle with the use of native Kubernetes resources.

Also, unless you have a clear understanding of the potential changes coming into the particular application you are writing the operator for, you are putting yourself in a position where you have to invest a dedicated time and effort just to maintain the operator codebase to adhere to those changes from time to time, as they come.

This is the reason why it is recommended to leave the decision of writing an operator to the application owners or at least get enough understanding of the future changes the application may face, before starting writing an operator yourself for that application.

In either case, it is better to make yourself liable for a piece of configuration that you can easily modify than to make yourself liable for an entire codebase of an operator. So, it is a wise decision to seek alternative ways that you can handle complexities in the lifecycle of an application rather than writing a piece of code that works as an operator and making yourself liable to it, especially if you do not own that application.

Resource Concerns: Operators are not exactly part of your actual workload

In certain cases, you may have to run your application in a resource critical cluster. When you have an operator to manage the lifecycle of this application, the operator itself will require a certain level of resources (cpu, memory, network) allocated towards it from the same cluster where you run the workload.

What we expect from an operator is to maintain the lifecycle of a given number of application instances, by reconciling their state to a desired state, as specified by the user through the custom resource. Therefore, operators are a part of your control plane rather than the workload itself. Now imagine a situation where you must provision a large number of application instances.

This means a few things,

You will have to create an equal number of custom resource objects.
The operator or operator instances will be running reconciliation loops to handle the state of all the application instances represented by each individual custom object. This could be resource intensive operation.
You are using the Kubernetes etcd store to keep track of all of the custom objects and your operator will be communicating with Kubernetes API quite frequently.
On all the other occasions where your custom objects are in the desired state, the operator will sit idle, but it still requires some resources from the cluster.

As you can see, having an operator to manage such a large number of custom objects could impact your cluster resource wise, in multiple ways.

Therefore, when you want to run your stateful application in a Kubernetes stack, writing an operator may sound appealing but you must remember the impact it may have on your cluster resources which is primarily meant for running your workload.

Security Concerns: Is it worth running an operator with elevated privileges for the duration of your application?

This is one of the reasons why writing an operator should be your last resort. An operator is a highly privileged entity compared to your actual application.

If we take a step back and consider what your operator does, all it does is maintain the state of your application instances to a desired state as specified by the CR. For doing so, it requires a certain level of privileges to your Kubernetes API. Now depending on the type of resource objects the operator is supposed to manage, you can grant permission to specific Kubernetes resources in either “namespace” scope or “cluster” scope. In most cases, what we have seen in certain existing open-source operators is that they sometimes get deployed with RBAC permission to your cluster resources than they require.

Nevertheless, what is important here is that the operator will be running with all those permission to your Kubernetes cluster for the duration of your application, even when an actual reconciliation of the custom resources happens occasionally. So, the amount of time the Kubernetes Operator will require these permissions to carry out its functionality is only a fraction compared to the time it will actually be running.

Considering these aspects of an operator, if you are thinking of writing one that will most probably do one-off tasks or tasks that will happen in a less frequent window (e.g., backup, restore), it could be worthwhile to analyze the possibility of using jobs or cron-jobs available in the standard Kubernetes API than investing all your effort to build a complex piece like a Kubernetes operator.

Configuration Management: Operators should not be a solution to your configuration management problem

Something that we have seen in common in most operator implementations is that it helps end-users to manage the application configurations through a well-structured object like a custom resource. A custom resource has a predefined schema, managed through a custom resource definition (CRD). This CRD will be a dedicated one for the application, so validating the inputs a user can provide to configure the application state is more controlled and streamlined. When you consider the standard ways that a user can pass application configurations, it is either using a configmap or a kubernetes secret which are more generic approaches.

A frequent implementation pattern that we have seen is that certain operators are sometimes implemented to just make use of this structured configuration that can be achieved using a CRD. These operators mainly target to expose the application configurations via a CRD, so there is a control over the user inputs. They should rather be called as controllers than operators because they do not essentially do anything specific to handle any application state related activities during the application lifecycle. While anyone is free to write a piece of software that is meant for handling configurations of an application through a CRD, it also may be an overkill. Because there is much more generic tooling available in the Kubernetes ecosystem, to achieve the same thing. For example, for someone using helm to manage a deployment and lifecycle of an application, certain features such as “--verify”, or json-schema validation are available to validate the user inputs which can eventually be mapped to a generic resource such as a configmap.

Question we should ask here really is, “is it really worth writing an application specific piece of software to manage the configuration, when there is much more generic tooling available to specifically address configuration related issues of applications deployed on Kubernetes?”

These are the key areas that we would like to think a devops engineer or a software developer should consider, before starting to write a Kubernetes operator. We would like to end this article with the following note, “The fact that it is possible to write a Kubernetes operator as a solution to a given problem does not always mean you should write one.”

References:

https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

https://thenewstack.io/kubernetes-when-to-use-and-when-to-avoid-the-operator-pattern/

https://sdk.operatorframework.io/docs/best-practices/best-practices/

Written by Nayanajith Chandradasa.