The most challenging part of building a supervised machine learning model is the optimization: selecting the algorithm, the features and the algorithm specific hyper parameter values that yield the best performing model. Undertaking such a task manually is not feasible unless the model is very simple.

The purpose of Automated Machine Learning (*AutoML*) tools is to democratize Machine Learning by automating this optimization process. In this post we will use one such *AutoML* tool called *Hyperopt* along with *Scikitlearn*, and show how to choose the optimum *Scikitlearn* classification algorithm, feature subset and associated hyper parameters for the algorithm. The solution is available in my open source project *avenir* on github.

## Automated Machine Learning

Like any other optimization problem, in automated Machine Learning the goal is to select a set of parameter values that will minimize cost or error. All the parameters constitute the search space for the optimization problem.

With Machine Learning, unlike many other optimization problems, the cost function has no closed form mathematical expression that can be computed directly. Given a set of parameter values, you have to train a model and then estimate the generalization error with k-fold cross validation or some other technique.
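
As a minimal sketch of what evaluating the cost means in practice, the snippet below trains a classifier for one set of parameter values and estimates its generalization error with k-fold cross validation. The synthetic data set, the choice of an SVM and the function name `cv_error` are assumptions made for illustration; they are not taken from *avenir*.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def cv_error(params, X, y, k=5):
    """Return the average k-fold cross validation error for one parameter set."""
    model = SVC(C=params["C"], kernel=params["kernel"])
    # for a classifier, cross_val_score returns one accuracy score per fold
    scores = cross_val_score(model, X, y, cv=k)
    return 1.0 - scores.mean()


# synthetic data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
error = cv_error({"C": 1.0, "kernel": "rbf"}, X, y)
print(f"average cross validation error: {error:.3f}")
```

Every call to `cv_error` trains the model k times, which is why each point in the search space is expensive to evaluate.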

There are 3 areas in a Machine Learning pipeline that can be optimized with *AutoML* tools.

- Feature selection
- Algorithm selection
- Hyper parameter selection for a given algorithm

Optimizing all 3 areas manually is time consuming and incurs a prohibitively high computation cost. With hyper parameter tuning, even for a moderately complex model there is a combinatorial explosion problem when all possible combinations of parameter values need to be considered.

For example, if there are 10 parameters and each parameter has 3 different values, there will be 3^10 = 59,049, i.e. close to 60,000, possible combinations of values to explore. With complex models, you may have hundreds of thousands of parameter value combinations to search.
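
The count for this example can be checked directly. The sketch below computes it both as a product of the number of values per parameter and by enumerating the grid; the parameter names are placeholders.

```python
from itertools import product

# 10 hypothetical parameters, each with 3 candidate values
param_values = {f"param_{i}": [0, 1, 2] for i in range(10)}

# closed form: product of the number of values per parameter
closed_form = 1
for values in param_values.values():
    closed_form *= len(values)

# brute force enumeration, which is what an exhaustive search has to walk through
enumerated = sum(1 for _ in product(*param_values.values()))

print(closed_form, enumerated)
```

Both numbers come out as 3^10 = 59,049, and adding just one more parameter with 3 values triples the count.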

The two most popular tuning or optimization techniques are grid search and random search. As the name suggests, grid search is a brute force approach that searches the whole search space exhaustively. In random search, the parameter values to explore are selected randomly.
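
To make the difference concrete, here is a small sketch of both techniques on a toy cost function. The function `cost` is only a stand-in for an expensive train and validate step, and the parameter names and ranges are made up.

```python
import random
from itertools import product


def cost(c, gamma):
    """Toy stand-in for an expensive train-and-validate step."""
    return (c - 1.3) ** 2 + (gamma - 0.07) ** 2


c_grid = [0.5, 1.0, 1.5, 2.0]
gamma_grid = [0.01, 0.05, 0.1, 0.5]

# grid search: evaluate every combination on the grid
grid_best = min(product(c_grid, gamma_grid), key=lambda p: cost(*p))

# random search: same budget, but points are drawn at random from the ranges
rng = random.Random(0)
budget = len(c_grid) * len(gamma_grid)
samples = [(rng.uniform(0.5, 2.0), rng.uniform(0.01, 0.5)) for _ in range(budget)]
random_best = min(samples, key=lambda p: cost(*p))

print("grid search best  :", grid_best, cost(*grid_best))
print("random search best:", random_best, cost(*random_best))
```

Neither technique uses the costs already observed to decide where to look next.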

## Bayesian Optimization and Hyperopt

Neither of the two optimization techniques mentioned above is particularly intelligent. What we need is a more intelligent optimization algorithm that will focus more on the promising areas of the parameter search space.

Bayesian optimization is one such algorithm. In Bayesian Optimization, a probability distribution model for the cost is built based on all the parameters. The distribution model is refined each time a new point in the search space is explored and its cost obtained. The next point to explore is obtained by sampling the probability distribution model. *Hyperopt* is based on Bayesian Optimization.

Here are some salient characteristics of Bayesian Optimization. They describe where Bayesian Optimization is most appropriate and how it works.

- The input space dimension should not be too large. Preferably, it should be less than 20.
- The objective or cost function should be continuous.
- The cost is expensive to evaluate.
- The cost function lacks known structure such as concavity or derivatives, so traditional optimization methods are not applicable. In other words, the cost function is a black box.
- It typically uses Gaussian process regression for statistical inference. *Hyperopt*'s own default algorithm, TPE (Tree-structured Parzen Estimator), uses Parzen estimators instead.
- It uses an acquisition function to decide where to sample next.

The search space in *Hyperopt* is defined by the range of values for each parameter. The range definition depends on the data type of the parameter, as follows.

- Categorical: set of discrete values
- Integer: a range of values. *Hyperopt* will sample within the range
- Floating point: uniform distribution with two boundary values provided

For floating point, *Hyperopt* also supports other distributions, e.g. normal and log normal. Unless you have good intuition about how an algorithm behaves with respect to a certain parameter, you may not want to use these and should stay with the simple uniform distribution.

For categorical parameters, you can associate a probability with each value and *Hyperopt* will sample accordingly. Classifier names are all categorical values.

If you have the intuition that a certain classifier will work better for a given problem, you can assign a separate probability to each classifier, skewing the probability distribution towards the preferred classifier. If your intuition happens to be incorrect, your result is likely to be worse than in the case with no probabilities.

## Code Free Hyperopt Optimization

Earlier I had built wrapper classes around some of the *Scikitlearn* classification algorithms. One of the goals for the abstraction was to enable someone to build classification models without writing any Python code. This is accomplished by defining meta data about the data and all the algorithm specific parameters in a properties configuration file.

Using the properties configuration file, you can train and validate classification models without writing any Python code. The following classification algorithms are currently supported in this framework. There is a configuration file for each learning algorithm.

- Support Vector Machine
- Random Forest
- Gradient Boosted Trees

Although these abstractions help, Machine Learning expertise is still needed, because you still have to manually set the appropriate parameter values in the configuration file. To bring Machine Learning and model training within the reach of domain experts who are not necessarily Machine Learning experts, I have added support for *Hyperopt* parameter search space definition in the configuration file.

There are two steps in this automation process. First, through a particular configuration parameter, you provide a list of names of parameters to be included in the *Hyperopt* search space. Next, for each of the parameters in the list you provide a list of values, which is used to define the range of that parameter. Each parameter is one of the three types listed earlier. Here is one example.

    train.search.params=train.search.num.estimatorsgb:int,train.search.max.depth:int
    train.search.num.estimatorsgb=140,180
    train.search.max.depth=3,5

Here, I am searching over 2 parameters. The 2 properties that follow provide the ranges for those parameters, both of which happen to be integers. You can expand the search space by adding a name to the list in the first property and then adding a property that provides the range of values for it.
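
To illustrate how these properties hang together, here is a minimal sketch of a parser that turns them into a search space description. It is not the parser used in *avenir*. Only the `int` type appears in the example; the `float` and `string` type names and the function name `parse_search_space` are my assumptions.

```python
def parse_search_space(text):
    """Parse the search related properties into {name: (type, values)}."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    # "int" is from the example; "float" and "string" are assumed type names
    casts = {"int": int, "float": float, "string": str}
    space = {}
    for item in props["train.search.params"].split(","):
        name, _, type_name = item.partition(":")
        values = [casts[type_name](v) for v in props[name].split(",")]
        space[name] = (type_name, values)
    return space


config = """
train.search.params=train.search.num.estimatorsgb:int,train.search.max.depth:int
train.search.num.estimatorsgb=140,180
train.search.max.depth=3,5
"""

space = parse_search_space(config)
print(space)
```

Each entry in the result carries what is needed to build the corresponding *Hyperopt* expression: the parameter name, its type and its list of values.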

## Service Ticket Escalation

The use case is an example of service desk automation with Machine Learning. It has to do with escalation of customer service tickets. Consider a customer service ticketing system, where a support person periodically escalates tickets based on various ticket parameters and statuses.

This escalation process could be automated by using manual escalation data to train a supervised Machine Learning model. The model could then be used to predict and recommend the tickets that need to be escalated. Here are the different fields in the hypothetical customer service tickets. These are the features for the model to be trained.

- Number of days open
- Number of re open
- Number of messages exchanged
- Number of past tickets on the same issue
- Number of hours before the first response message
- Average number of hours before response messages
- Number of re assignments
- Customer type

There could be other features, e.g. keywords in the text messages or emails. I am using message as a generic term, which includes text message, email and phone call.

Here are the 3 configuration files for the 3 classification algorithms. For the most part they can be reused. For a different data set, some changes are necessary, which will be discussed later.

- Support Vector Machine configuration
- Random Forest configuration
- Gradient Boosted Trees configuration

## Running Hyperopt

When running the Python script for *Hyperopt*, you have to provide the number of iterations as a command line argument. You also need to have the 3 configuration files ready. The script has a callback function, which *Hyperopt* calls each time a combination of parameter values is sampled. It will be called as many times as the number of iterations you provide. *Hyperopt* keeps track of the best combination of parameter values, i.e. the one that results in the least cost or error.

The algorithms to be used, along with the corresponding configuration files, are specified in the command line arguments. You can choose all 3 classification algorithms or a subset of the 3. In future, as I implement abstractions of other classification algorithms, more algorithms will be added to the repertoire. Here is how to run it with 50 iterations, along with the tail end of the output.

    ./autosupv.py 50 svm:esc_svm.properties rf:esc_rf.properties gbt:esc_gbt.properties
    ...............
    next evaluation
    ...building svm model
    ...training and kfold cross validating model
    average error with k fold cross validation 0.025
    100% 50/50 [02:41<00:00, 3.24s/it, best loss: 0.0253984195984]
    {'train.penalty': 1.0453502981033078, 'train.kernel.function': 0, 'classifier': 0}

After *Hyperopt* has run, the script will output the best set of parameters found. For categorical and integer parameters it will output an index into the corresponding array of values. For floating point parameters, it will output the actual value.

Here we find that SVM (with index 0 in the classifier list) is the optimum model. For *train.penalty* we get the actual value. For *train.kernel.function*, index 0 is selected, which is *rbf*. I have used a restricted search space with only a few parameters. In reality you would want to include more search parameters and use a larger number of iterations.
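
Mapping the indexes back to values is a simple lookup, as the sketch below shows for the result printed above. Only *svm* at classifier index 0 and *rbf* at kernel index 0 are confirmed by the run; the other entries in the lists, and the assumption that classifier order follows the command line order, are mine.

```python
# best result as printed by the script in the run above
best = {"train.penalty": 1.0453502981033078, "train.kernel.function": 0, "classifier": 0}

# value lists for index valued parameters; only "svm" and "rbf" at index 0 are
# confirmed by the run, the remaining entries are assumed for illustration
choices = {
    "classifier": ["svm", "rf", "gbt"],
    "train.kernel.function": ["rbf", "linear", "poly"],
}


def decode(best, choices):
    """Replace indexes with the values they point to; keep floats unchanged."""
    return {
        name: choices[name][value] if name in choices else value
        for name, value in best.items()
    }


decoded = decode(best, choices)
print(decoded)
```

If you have the search space object at hand, *Hyperopt*'s `space_eval` function performs the same translation for you.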

You can also run *Hyperopt* with a biased classifier choice, assigning a separate probability to each classifier as below. In this example, the classifier choice is biased towards SVM.

    ./autosupv.py 50 svm:esc_svm.properties:0.4 rf:esc_rf.properties:0.3 gbt:esc_gbt.properties:0.3
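
For illustration, here is one way arguments of this shape could be parsed. This is a sketch, not the code in *autosupv.py*; the function name and the validation rules (probabilities given for all classifiers or none, and adding up to 1) are assumptions.

```python
def parse_classifier_args(args):
    """Split name:config[:probability] arguments into (name, config, prob) tuples."""
    specs = []
    for arg in args:
        parts = arg.split(":")
        if len(parts) not in (2, 3):
            raise ValueError(f"expected name:config[:probability], got {arg!r}")
        prob = float(parts[2]) if len(parts) == 3 else None
        specs.append((parts[0], parts[1], prob))

    probs = [p for _, _, p in specs if p is not None]
    if probs:
        # assumed rule: either every classifier has a probability or none has
        if len(probs) != len(specs):
            raise ValueError("probability must be given for all classifiers or none")
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("classifier probabilities must add up to 1")
    return specs


specs = parse_classifier_args([
    "svm:esc_svm.properties:0.4",
    "rf:esc_rf.properties:0.3",
    "gbt:esc_gbt.properties:0.3",
])
print(specs)
```

When no probabilities are given, the third element of each tuple is `None` and the classifier choice can be treated as unbiased.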

## Training and Deploying the Optimal Model

The last step is to train and save the model with the parameters chosen by *Hyperopt*. Take the optimal parameter values found by *Hyperopt*, set the corresponding parameters in the configuration file, and train the model with the configuration set to save the trained model.

You can deploy the trained model as a REST service. REST service implementations for the 3 classifiers are also available in *avenir*. More details can be found in the tutorial document.

## Using a Different Data Set

For your own use case and data set, some changes need to be made in the example configuration files. The first set of changes relates to the meta data of your data set, as below.

| parameter | comment |
| --- | --- |
| train.data.fields | Comma separated indexes of fields used to extract data from the training data file |
| train.data.feature.fields | Comma separated indexes of fields for features in the extracted data |
| train.data.class.field | Index of class variable field in extracted data |

The next set of changes relate to the parameter search space. Here is an example for SVM.

| parameter | comment |
| --- | --- |
| train.search.params | List of search parameters. Update as needed, adding and removing parameters |
| xxx | Add parameter with range of values, one for each parameter in the list |
| xxx | Add parameter with range of values, one for each parameter in the list |

The parameter with value range can be modified to expand or shrink the range. If you expand the search space, you should also increase the number of iterations to allow sufficient sampling of the search space by Hyperopt.

## Optimum Feature Set

Although I have not done it in the example, feature subset selection can be optimized with *Hyperopt*. This is accomplished by adding *train.search.data.feature.fields* to the list of values for train.search.params.

Then *train.search.data.feature.fields* could be defined with a list of values. Each value in the list is a colon separated list of feature column indexes in the data set. The list of indexes will be subset of all feature column indexes.
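
As a sketch of what such a value list looks like once parsed, the snippet below splits a hypothetical property value into feature subsets and applies each subset to a record. The comma between subsets, the specific indexes and the sample record are assumptions for illustration.

```python
def parse_feature_subsets(value):
    """Turn '0:1:2,0:2:4' into [[0, 1, 2], [0, 2, 4]]."""
    return [[int(i) for i in subset.split(":")] for subset in value.split(",")]


def select_features(row, indexes):
    """Keep only the columns of a record named by the subset."""
    return [row[i] for i in indexes]


# hypothetical property value with three candidate feature subsets
subsets = parse_feature_subsets("0:1:2:3,0:2:4:6,1:3:5:7")

# one made up ticket record with the 8 features listed earlier
row = [12, 1, 9, 2, 30, 18, 3, "business"]
for indexes in subsets:
    print(indexes, select_features(row, indexes))
```

Each subset is then just another categorical value that *Hyperopt* can pick, alongside the classifier and its hyper parameters.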

## Running Hyperopt in Parallel

Although Bayesian Optimization is essentially a sequential algorithm, *Hyperopt* can be run in parallel with *MongoDB*. We can also run *Hyperopt* in parallel with PySpark. One option is to split the parameter search space and use PySpark to process each parameter search subspace in parallel.

For example, we could split along different classifiers and feature subsets to create separate parameter search subspaces. *Hyperopt* could process each subspace in a separate Spark task. The results for each search subspace could then be combined to select the best parameter values.
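
The split and combine idea can be sketched without a cluster. In the snippet below, a thread pool from the Python standard library stands in for Spark tasks and a simple random search stands in for a *Hyperopt* run on each subspace. All names and numbers are made up; this only shows the structure, not a PySpark implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def search_subspace(subspace, iterations=30):
    """Stand-in for one Hyperopt run restricted to a single subspace."""
    rng = random.Random(subspace["seed"])
    low, high = subspace["value_range"]
    best = None
    for _ in range(iterations):
        value = rng.uniform(low, high)
        # toy cost; a real task would train and cross validate the classifier
        loss = (value - subspace["optimum"]) ** 2 + subspace["offset"]
        if best is None or loss < best["loss"]:
            best = {"classifier": subspace["classifier"], "value": value, "loss": loss}
    return best


# one subspace per classifier; all numbers are made up for illustration
subspaces = [
    {"classifier": "svm", "value_range": (0.5, 2.0), "optimum": 1.0, "offset": 0.02, "seed": 1},
    {"classifier": "rf", "value_range": (0.5, 2.0), "optimum": 1.5, "offset": 0.05, "seed": 2},
    {"classifier": "gbt", "value_range": (0.5, 2.0), "optimum": 0.8, "offset": 0.04, "seed": 3},
]

# each subspace is searched independently, like one Spark task per subspace
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(search_subspace, subspaces))

# combine: the overall best is the lowest loss across all subspaces
overall_best = min(results, key=lambda r: r["loss"])
print(overall_best)
```

The trade off is that each subspace search learns only from its own evaluations, so nothing is shared between classifiers during the search.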

## Summing Up

As shown in this post, you can build your own AutoML pipeline with *Hyperopt* and *Scikitlearn* without writing any Python code. *Hyperopt*, which is based on Bayesian optimization, usually needs fewer evaluations than the commonly used grid search or random search to find good hyper parameter values.

Please follow the tutorial, if you are interested in executing the autoML use case in this post.
