Purpose

This article reflects the current state of technical optimization engineering in cloud infrastructure operations. The objective is to raise awareness of the topic among cloud engineers. To emphasize, this is not about ML or any similar form of data-driven, analytical optimization approach. The focus here is on techniques that continuously explore and rearrange configuration spaces.

What are Metaheuristics?

It is neither feasible nor intended to cover optimization algorithms comprehensively in this article. Such details are also not a prerequisite for following the content.

As a brief introduction for the curious: metaheuristics are optimization algorithms designed to be problem-agnostic and therefore generic. Heuristics, in contrast, are often tied one-to-one to a specific problem and cannot be reused well across different problems. What all heuristics have in common is that they do not necessarily yield a globally optimal result, but the results they deliver are still valid solutions. There are many widely known metaheuristic algorithms, often metaphor-based; nature-inspired examples are genetic and evolutionary algorithms, particle swarm optimization and so forth.

Essentially, the algorithms revolve around invoking an objective function and evaluating the result by applying various techniques. That flow is repeated in a cycle until a stopping criterion is met.

[Figure: General flowchart for a metaheuristic algorithm]
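The evaluate-and-repeat cycle can be sketched in a few lines. The following is a minimal illustration using simulated annealing, chosen here only as one well-known metaheuristic; the toy `objective` function, the step size, and the cooling schedule are assumptions made for this example and stand in for a real measured metric such as latency.

```python
import math
import random


def objective(x):
    """Toy objective to minimize (assumption); stands in for a measured metric."""
    return (x - 3.0) ** 2 + 2.0


def simulated_annealing(objective, start, steps=2000, temp=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current, current_score = start, objective(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = current + rng.gauss(0.0, 0.5)  # propose a neighboring solution
        score = objective(candidate)               # invoke the objective function
        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature cools down.
        if score < current_score or rng.random() < math.exp((current_score - score) / temp):
            current, current_score = candidate, score
        if current_score < best_score:
            best, best_score = current, current_score
        temp = max(temp * cooling, 1e-9)
    return best, best_score


best, best_score = simulated_annealing(objective, start=-10.0)
print(f"best x = {best:.3f}, score = {best_score:.3f}")
```

Nothing in the loop knows what `objective` measures, which is the problem-agnostic property described above: swapping in a function that applies a configuration and measures a system would leave the search logic unchanged.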

As the notions around metaheuristics are well established, there is no shortage of better, more detailed material in the public domain.

A few good introductory sources of content are:

Interesting metaheuristics frameworks worth naming are:

How do Metaheuristics fit into the Cloud Engineering|Operations Context?

Metaheuristics have been shown to apply well to optimization problems of a dynamic character. Infrastructure operations are inherently a fairly dynamic setting over time, and cloud infrastructure operations are no exception. Traffic patterns change, loads are volatile, and the complex interplay of self-run and cloud-managed services, as well as service structures and interactions, changes rapidly. The list is long.

Currently, optimization, tuning and performance improvements are often neglected, on premises as much as in the cloud infrastructure field.

A non-exhaustive list of reasons for that:

  • intractable complexity as the main challenge
  • lack of resources in terms of engineering knowledge or capacity

By operationalizing generic optimization algorithms, one can industrialize and mechanize optimization endeavors. That tackles the widespread neglect of broader performance tuning efforts, as it gives operations engineers a standard tool at hand and drastically reduces the prior knowledge needed about technology-specific configuration changes and their implications. Toil is reduced massively; fewer engineering hours are required. Although no deterministic, comprehensive globally optimal state will be attained that way, a much better operational state can still be reached at any point in time than by simply letting everything run. Completely new uses for existing knobs can be explored, and exploring the combinatorial configuration spaces of technologies may even hint at knobs worth adding.

Especially noteworthy is that metaheuristics can benefit significantly from acceleration through parallelization. That means the ready and convenient availability of GPU/TPU acceleration is an excellent advantage of running such optimization instruments in the cloud.

Also, running metaheuristics in a highly distributed way is not new. It has been used in many scientific applications run as HPC loads.
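Population-based metaheuristics parallelize well because the candidates of one generation can be evaluated independently of each other. The sketch below illustrates this with a thread pool from the Python standard library. The thread pool is an assumption that stands in for GPU/TPU or distributed workers, and `measure`, `random_config` and the configuration keys are hypothetical names invented for this example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def measure(config):
    """Hypothetical objective: pretend to apply `config` and return a cost to minimize."""
    return (config["workers"] - 8) ** 2 + (config["buffer_kb"] - 256) ** 2 / 1024


def random_config(rng):
    """Draw one candidate from a made-up configuration space."""
    return {
        "workers": rng.randint(1, 32),
        "buffer_kb": rng.choice([64, 128, 256, 512, 1024]),
    }


rng = random.Random(42)
population = [random_config(rng) for _ in range(16)]

# Evaluate all candidates of this generation concurrently; pool.map
# returns the scores in the same order as the population.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(measure, population))

best_score, best_config = min(zip(scores, population), key=lambda pair: pair[0])
print(best_score, best_config)
```

In an infrastructure setting each evaluation would typically mean applying a configuration to a target and waiting for a measurement, so the wall-clock gain from evaluating candidates side by side can be substantial.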

Scenarios of Application

Let’s detail some scenarios of application based on the Google Cloud Platform.

IaaS Instances|Network Appliances

As a customer of cloud provider infrastructure, one often has requirements that fall outside the supported features. In that case, many build their own IaaS instances or appliances. Examples are NVAs (network virtual appliances), or simply some glue compute instances bridging an on-premises setup with the prevailing cloud services.

That’s legitimate overall, and also part of GCP recommendations, e.g., in the form of custom network appliances.

Yet it opens up the same performance optimization needs as perceived on premises, maybe even more intensely, because optimization targets are employed in an even more varied and dynamic environment.

GCP ships images with some configuration as pre-optimization for its platform, and there are many recommendations in place on how to go about performance tuning. Other vendors do the same for their images. But these preparatory settings are rather static, and it is fragile to assume they are sufficient to call what is running optimally configured. Therefore, for the major part of the dynamic runtime, one is on one’s own and faces the same woes as with on-premises instances when intending to optimize the setup.

K8S Node Pools

GCP allows you to run self-managed, self-configured K8S node pools. Here, one can intervene more deeply with performance optimizations geared to the particular workload characteristics. One can automatically pre-tune an image in the cloud and make it consumable by GKE pools. Additionally, it’s possible to adjust the configuration online, for example adjusting TCP windowing for network latency-sensitive workloads while serving traffic.
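To make the TCP example a little more tangible, here is a hedged sketch of how a search space over kernel network tunables might be described and turned into commands. The keys are existing Linux sysctl names, but the value ranges are illustrative assumptions rather than recommendations, and the helper names `sample` and `to_sysctl_commands` are hypothetical. The commands are only rendered as strings here, not executed.

```python
import random

# Search space: sysctl key -> (low, high) integer range.
# The ranges are assumptions for illustration only.
SEARCH_SPACE = {
    "net.core.rmem_max": (212992, 16777216),
    "net.core.wmem_max": (212992, 16777216),
    "net.ipv4.tcp_window_scaling": (0, 1),
}


def sample(rng):
    """Draw one candidate configuration from the search space."""
    return {key: rng.randint(low, high) for key, (low, high) in SEARCH_SPACE.items()}


def to_sysctl_commands(candidate):
    """Render a candidate as `sysctl -w` commands for a node."""
    return [f"sysctl -w {key}={value}" for key, value in sorted(candidate.items())]


candidate = sample(random.Random(7))
commands = to_sysctl_commands(candidate)
for command in commands:
    print(command)
```

An optimizer would propose such candidates, have them applied to a node, and feed a measured metric (for example request latency) back as the objective value.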

K8S Workloads

Any technology that we run as a workload in GKE usually boasts a broad set of knobs for tuning or rearranging towards better performance over time.

Hierarchical Optimization of Stacks

It gets interesting when one seeks to optimize stacked technology solutions. That’s where adjustments further down the application stack influence the performance further up and vice versa, e.g. a set of database tunables that depends on the load coming from a frontend service over time. The count of combinations in such a case is certainly intractable for humans.
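A quick back-of-the-envelope calculation gives a feel for that count. In the sketch below, the tunable names and the number of discrete settings per tunable are made-up assumptions; only the arithmetic matters.

```python
import math

# Hypothetical tunables, each mapped to its number of discrete settings.
frontend = {"worker_threads": 16, "keepalive_timeout": 10, "connection_pool": 8}
database = {"shared_buffers": 12, "max_connections": 10, "work_mem": 12, "checkpoint_interval": 6}


def space_size(layer):
    """Number of distinct configurations of one layer."""
    return math.prod(layer.values())


frontend_size = space_size(frontend)         # 16 * 10 * 8 = 1280
database_size = space_size(database)         # 12 * 10 * 12 * 6 = 8640
joint_size = frontend_size * database_size   # 11059200

print(frontend_size, database_size, joint_size)
```

Because the layers influence each other, they cannot simply be tuned one after the other, so the joint space is the product of the two. At one evaluation per minute, an exhaustive sweep of these roughly 11 million combinations would take about 21 years, which is why a guided search is needed.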

Here is an attempt by the autotune project to illustrate the problem statement:

[Figure: autotune-it-admin.png]

Sprouting AIops Optimization Projects leveraging Metaheuristics

Indeed, the systematic application of optimization frameworks for infrastructure operations purposes is far from novel. Many large global web companies and cloud providers wield optimization frameworks to a great extent for their internal infrastructure, certainly with metaheuristics in their arsenal as well.

Although some disclose their internal optimization frameworks, most have not been opened up.

Moreover, none of the most prominent cloud platform providers currently offers a fully managed, publicly available service that eases operationalizing metaheuristics optimization frameworks for infrastructure use cases.

Hence, there is room for open source technologies that can run essentially anywhere. The cloud native landscape shows a subsection called “Continuous Optimization”, though those projects are mostly about optimizing business aspects such as the running costs of active cloud deployments. There are also projects with a technical orientation, but these are more bound to data science or ML data analytics purposes. Their maturity or even viability appears vague to the author.

Hence, as of this writing, only two sprouting open source projects with metaheuristics at their core could be found.

cherusk/godon

The godon project embeds the apache/airflow (Cloud Composer) data engineering technology as a core component.

Custom logic is kept to a minimum where possible. The project rather develops glue logic to bind together preexisting open source technologies to achieve its mission.

It attempts to be as target-technology agnostic as possible: a universal and generic open source optimization service. Everything that is reachable via a network shall be a viable optimization target.

Further, it does not have to run in Kubernetes, but it potentially can.

The first metaheuristics frameworks it supports as backends are DEAP and optuna.

kruize/autotune

autotune is strongly Kubernetes oriented.

It’s meant to run only when orchestrated by Kubernetes, leveraging its APIs as part of the solution.

Moreover, it also mainly targets workloads on Kubernetes.

The metaheuristic framework it currently leverages is optuna.