Designing complex Big Data system with myriad of parameters and design choices is a daunting task. It’s almost a black art. Typically we stay with the default parameter settings, unless it fails to meet your requirement which forces you venture out of comfort zone of default settings. Essentially what we are dealing with is a complex optimization problem with no closed form solution. We have to perform a search in a multi dimensional parameter space, where the choice of parameter value combinations may run into hundreds of thousands if not millions.

With limited time and resource, the brute force approach of running tests for all the configuration value combinations is not a viable option. It’s clear that we have to do a guided search through the parameters space, so that we can arrive at the desired parameters values with a limited number of tests. It this post we will discuss an optimization technique called Bayesian optimization, which is popular for solving this kind of complex and intractable optimization problem.

Imagine yourself running tests to get optimum throughput out of a Big Data system. You have some test data based on various parameters settings. But you are hopeful of squeezing even better performance and you have half a dozen more candidate parameter settings in mind. Instead of running all of those 6 tests, what if an algorithm told which among those 6 setting is likely to give you the best result. That’s what Bayesian Optimization will do.

## Design Optimization

In a typical optimization problem, we try to maximize or minimize an objection function, which is a function of set of parameters. Additionally there may be set of constraints defined for the parameters. The problem may be exacerbated if we consider the following additional factors.

*The objective function may be multi modal, with many local minimum or maximum values**The constrains may be non linear.*

For Big Data systems the typical objective is throughput which we want to maximize. For storage system like Cassandra or messaging system like Kafka, throughput could be defined as number of reads and writes per sec. For processing systems like Hadoop or Spark, the throughput could be defined as the number of records processed per sec.

When posed as an optimization problem, Big Data system design has the following characteristics

*The objective function can not be expressed in closed form as a function of the parameters. We can only measure it. In optimization literature it’s known as black box objective function.**We can not assume that the objective function uni modal. It’s most likely to be multi modal.**The constraints on the parameters are inequality constraints. To be more specific, they are constrained to be within a range.**Because the objective function is a black box, there is no analytical closed form solution for this optimization problem. We have to run tests and measure objective function i.e. throughput for various parameter values**Because of time and resource constraint, we can perform only limited number of tests.*

The problem can be described as below, mathematical terms, whwre we to find the X that maximizes f(X)

*argmax(f(X)) | x _{i} > l_{i} and x_{i} < m_{i}*

*where*

*f(X) = objective function (throughput)*

*X = parameter vector*

*x*

_{i}= hardware or software parameter## Big Data System Parameters

In a typical Big Data system, there are tuneable parameters in all the layers of the hardware and software stack as shown below

**Hardware**: CPU, RAM etc.**OS**: Page cache, Swap space etc.**JVM**: Heap space etc.**Big Data System**: Depends on the system

Combining them all, we may have hundreds of parameters. Even if we consider 2 or 3 values for each, there will be millions of unique combination of values. Throughput will depend upon many of these parameters.

Any hope of mathematically expressing throughput i.e. our objective function in terms of these parameters is doomed. The best we can hope for is that for some of the parameters, we may have some intuition about how they will affect throughput.

## Bayesian Optimization

Our design problem is fraught with myriad of choices in a multi dimensional parameter space. Doing search through the parameter space looking for the optimum objective i.e. throughput is very challenging, if not possible even when the search is guided by intuitions.

I will side step the math and statistics and describe the Bayesian Optimization in plain English. Here are some references for the gory technical details. This article is a very good overview of Bayesian Optimization and it’s a good place to start. This article is also a very good in depth overview of Bayesian Optimization. Here is an article on application of Bayesian Optimization for hyper parameter tuning in Machine Learning.

Bayesian Optimization is a sequential search algorithm. After performing k measurements, the algorithm will suggest the next X to try that is most likely to give you f(X) higher that what we have found so far. Given that we are allowed to perform up to N tests, here is the algorithm

*for k = 1 to N*

*Based on the k values for X and f (X) that we have, calculate mean and variance of f(X) for some m number of X values that we have not tried before.**Among the m candidate X, choose the most promising X using an acquisition function.**Conduct the test for the chosen X and measure f (X)*

The function f(x) is modeled as Gaussian Process (GP). A Gaussian Process is a distribution over a function, specified by a mean function m(X) and a co variance function k(X, X1). A co variance function indicates how two points influence each other. A GP when evaluated at X returns a mean and variance of the function at that X.

The mean of f(X) for a previously unexplored point is function of the following and it’s computation requires several matrix operations.

*Co variance matrix of the previously explored points**Co variance between the point being considered and the previously explored points.**The function value f(X) for all previously explored points*

The variance of f(X) for a new unexplored point is function of the following, again involving matrix operations

*Co variance matrix of the previously explored points**Co variance between the new point being considered and the previously explored points.*

As we measure f(X) at new X values, the distribution of f(X) at some yet unexplored point X gets modified. This is precisely the reason, it’s called Bayesian, because we are updating the posterior distribution of f(X), where X is unexplored, based on the new evidence we gather at some other X.

Suppose we did measurement at X_{t} and X_{t+1} is data point not yet explored. The distribution of f(X_{t}+1) will be modified through the influence of f(X_{t}), becoming more accurate. For some X, f(X) is random, except for those X values where f(X) has been measured and then it becomes deterministic.

## Co Variance Function

The co variance function k(X1, X2) determines how two data points influence each other. Some of the choices are as follows

*Squared exponential**Squared exponential with automatic relevance determination**Matern kernel*

## Acquisition Function

Once we have estimated the function mean and variance at several candidate points, we have to choose one of them as the next data point to try. The choice should be made in way such that the new data is likely to yield better result i.e. higher f(X), compared to what we have seen so far.

The choice is driven by a trade off between exploitation and exploration as in Reinforcement Learning. If we choose X with high mean for f(X), we are leaning towards exploitation of what is already known. On the other hand, If we choose X with high variance for f(X), we are leaning towards exploration, venturing more into uncharted territory.

An acquisition function chooses an X among the candidate X values. There are several options for acquisition function as below.

*Maximizing the probability of improvement.**Maximizing the probability of improvement with trade off factor**Expected improvement**Gaussian process upper confidence bound**Thompson sampling*

## Some Practical Considerations

For some of the parameters, choice is obvious and does not require measuring f(X) e.g., throughput. For example, for RAM, we know that higher the value better it is for throughput. At least it won’t be worse. Given a choice between 8 GB and 16 GB, we will always choose 16 GB.

When we compute function mean variance for unexplored points, we can’t possibly do that for all unexplored points, as that will involve too much computation. A more pragmatic approach is to short list some candidate unexplored points based on intuition and domain expertise. Bayesian Optimization could run on the candidate points and find the best among them.

## Summing Up

Bayesian Optimization is a powerful tool for designing complex systems for optimum performance. It has various other applications, including for scientific experiment, hyper parameter tuning in Machine Learning etc. I am considering implementing Bayesian Optimization in near future.

Pingback: Machine Learning at Scale with Parallel Processing | Mawazo