The most challenging phase in a supervised Machine Learning pipeline is parameter tuning. There are many parameters, each with a range of values. The so-called grid search is a brute force approach that tries all possible combinations of parameter values, looking for the combination that gives the smallest test error.
Most supervised Machine Learning algorithms involve a significant number of parameters, and grid search is often not practical. To put things in perspective, if there are 10 parameters and each can take on 5 possible values, there will be 5^10 (almost 10 million) possible combinations to try.
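To see how the grid blows up, we can count the combinations for a hypothetical search space (this 10-parameter space is only for illustration, not the one used later in this post):

```python
# Hypothetical search space: 10 parameters, 5 candidate values each
param_values = {"param_%d" % i: [1, 2, 3, 4, 5] for i in range(10)}

# Grid search enumerates the Cartesian product of all the value lists,
# so the grid size is the product of the per-parameter value counts
num_combinations = 1
for values in param_values.values():
    num_combinations *= len(values)

print(num_combinations)  # 5 ** 10 = 9765625
```

Even at one second per model training and validation, an exhaustive sweep of this grid would take months.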
In this post we demonstrate that with a stochastic optimization technique called Simulated Annealing, a near optimal solution can be found with a significantly smaller number of iterations. The implementation can be found in my open source project avenir.
Simulated Annealing
Details of Simulated Annealing can be found in my earlier post, although there it was applied to a different optimization problem. The algorithm can be summarized as follows.
- Generate an initial random solution. Generate a new solution by perturbing the current solution
- If the new solution is better, accept it as the current solution. If it's better than the best solution found so far, set the best solution to the new solution
- If the new solution is worse, accept it probabilistically, where the probability is a function of the temperature and the cost difference between the current and the new solution. The temperature is a parameter of the algorithm; its value drops as the iterations progress.
- When the temperature is high, a worse solution is more likely to be accepted, which enables more exploration of the search space during the initial iterations.
There are two important temperature related parameters. The first is the initial temperature: a higher initial temperature induces more exploration of the search space. The second is the temperature reduction factor: a higher reduction factor makes the algorithm switch to exploitation mode earlier as the iterations progress.
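As a sketch (my own simplified version, not the actual avenir code), the probabilistic acceptance rule and a cooling schedule with the behavior described above could look like this:

```python
import math
import random

def accept(curr_cost, next_cost, temp):
    """Metropolis-style acceptance rule: always accept an improvement;
    accept a worse solution with probability exp(-delta / temp), so a
    higher temperature makes accepting a worse solution more likely."""
    delta = next_cost - curr_cost
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temp)

def anneal_schedule(initial_temp, reduction_factor, num_iter):
    """One plausible cooling schedule: geometric decay, where a higher
    reduction factor cools faster and hence switches the search to
    exploitation mode earlier."""
    temp = initial_temp
    for _ in range(num_iter):
        yield temp
        temp *= (1.0 - reduction_factor)
```

For example, with a cost increase of 0.1, a temperature of 10 accepts the worse solution almost always, while a temperature near zero almost never does.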
Gradient Boosted Trees
Although we will use Gradient Boosted Trees as implemented in the python scikit-learn Machine Learning library, the search and optimization techniques could be used with any Supervised Machine Learning algorithm.
Please refer to my earlier post on Gradient Boosted Trees and its application to a CRM lead conversion prediction problem. We will use the same use case for the parameter search and tuning problem.
Reading through that article, you will find that I have written wrapper classes around some of the scikit-learn Supervised Machine Learning algorithms. They enable you to train, validate and perform parameter search and tuning just by making appropriate changes in a properties configuration file, without having to write any python code. Metadata about your data set is also defined in the configuration file.
Parameter Space Search
The following three parameter search and tuning algorithms are supported by my implementation; here is the python implementation.
- Grid
- Random
- Simulated Annealing
As alluded to earlier, in grid search the whole parameter space is searched exhaustively. In random search, a random combination of parameter values is generated in every iteration. Search by Simulated Annealing has been described earlier. The techniques differ only in how the set of parameter values for the next iteration is chosen.
The following steps are performed in every iteration through the parameter search space:
- A new set of parameter values is selected
- A model is built and trained with the set of parameters selected for that iteration
- The model is validated with k-fold cross validation and the test error is found
- The combination of parameter values that provides the best solution is kept track of
The best solution is the combination of parameter values that minimizes the test or generalization error.
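The per-iteration loop above can be sketched in python. This is a simplified illustration, not the actual avenir code: make_classification stands in for the CRM lead conversion data set, the keys are scikit-learn's native parameter names rather than the configuration file keys, and random selection of parameter values is shown because it is the simplest strategy to write down; grid search and Simulated Annealing differ only in how the next set of values is picked.

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the CRM lead conversion data set
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Candidate values mirroring the search space used later in the post;
# keys are scikit-learn parameter names, not the config file keys
space = {
    "learning_rate": [0.14, 0.16, 0.18],
    "n_estimators": [140, 160, 180],
    "max_depth": [3, 4, 5],
}

def evaluate(params):
    """Build a model with the given parameter values and return the
    k-fold cross validation error (1 - mean accuracy)."""
    model = GradientBoostingClassifier(**params)
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

# Track the best combination of parameter values seen so far
best_params, best_error = None, float("inf")
for _ in range(5):
    params = {name: random.choice(values) for name, values in space.items()}
    error = evaluate(params)
    if error < best_error:
        best_params, best_error = params, error
```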
Grid Search
First, we will perform grid search. For brevity we have kept the parameter search space small. The search space is defined with a configuration parameter that lists the parameters constituting the search space. For each such parameter, a list of candidate values is provided through another set of configuration parameters. You can always expand the search space by including more parameters and a wider range of values for each parameter.
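A hypothetical sketch of such a configuration is shown below. The three parameter keys appear in the search output later in this post; the key that lists the search space parameters, and the exact list syntax, are my illustrative guesses and may differ from avenir's actual configuration format.

```properties
# Hypothetical: key naming the parameters that constitute the search space
train.search.params=train.learning.rate,train.num.estimators,train.max.depth

# Candidate values for each search space parameter
train.learning.rate=0.14,0.16,0.18
train.num.estimators=140,160,180
train.max.depth=3,4,5
```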
Here is a summary of the results from grid search. There are 27 combinations of parameter values and hence 27 iterations. The best error rate we got is 0.046.
all parameter search results
train.learning.rate=0.14 train.num.estimators=140 train.max.depth=3 0.050
train.learning.rate=0.14 train.num.estimators=140 train.max.depth=4 0.048
train.learning.rate=0.14 train.num.estimators=140 train.max.depth=5 0.048
train.learning.rate=0.14 train.num.estimators=160 train.max.depth=3 0.049
train.learning.rate=0.14 train.num.estimators=160 train.max.depth=4 0.049
train.learning.rate=0.14 train.num.estimators=160 train.max.depth=5 0.049
train.learning.rate=0.14 train.num.estimators=180 train.max.depth=3 0.047
train.learning.rate=0.14 train.num.estimators=180 train.max.depth=4 0.049
train.learning.rate=0.14 train.num.estimators=180 train.max.depth=5 0.049
train.learning.rate=0.16 train.num.estimators=140 train.max.depth=3 0.049
train.learning.rate=0.16 train.num.estimators=140 train.max.depth=4 0.049
train.learning.rate=0.16 train.num.estimators=140 train.max.depth=5 0.050
train.learning.rate=0.16 train.num.estimators=160 train.max.depth=3 0.048
train.learning.rate=0.16 train.num.estimators=160 train.max.depth=4 0.049
train.learning.rate=0.16 train.num.estimators=160 train.max.depth=5 0.051
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=3 0.046
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=4 0.051
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=5 0.051
train.learning.rate=0.18 train.num.estimators=140 train.max.depth=3 0.048
train.learning.rate=0.18 train.num.estimators=140 train.max.depth=4 0.049
train.learning.rate=0.18 train.num.estimators=140 train.max.depth=5 0.051
train.learning.rate=0.18 train.num.estimators=160 train.max.depth=3 0.049
train.learning.rate=0.18 train.num.estimators=160 train.max.depth=4 0.049
train.learning.rate=0.18 train.num.estimators=160 train.max.depth=5 0.050
train.learning.rate=0.18 train.num.estimators=180 train.max.depth=3 0.049
train.learning.rate=0.18 train.num.estimators=180 train.max.depth=4 0.050
train.learning.rate=0.18 train.num.estimators=180 train.max.depth=5 0.051
best parameter search result
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=3 0.046
Simulated Annealing Search
With Simulated Annealing, we don't do an exhaustive search. The maximum number of iterations was set to 10. A new solution is generated from the current solution by randomly selecting a parameter and then randomly selecting a value for that parameter. Essentially, only one parameter value in the current solution is mutated.
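The single-parameter mutation described above can be sketched as follows. This is a simplified illustration rather than the actual avenir code; the candidate values match the grid search section of this post.

```python
import random

# Search space from the grid search section: parameter name -> candidate values
space = {
    "train.learning.rate": [0.14, 0.16, 0.18],
    "train.num.estimators": [140, 160, 180],
    "train.max.depth": [3, 4, 5],
}

def mutate(solution, space):
    """Generate a neighbor of the current solution by re-sampling the value
    of one randomly chosen parameter; all other parameter values carry over
    unchanged. (The re-sampled value may occasionally equal the old one.)"""
    neighbor = dict(solution)
    name = random.choice(list(space))
    neighbor[name] = random.choice(space[name])
    return neighbor
```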
Here is the detailed log output as Simulated Annealing searched through the parameter space.
running mode: trainValidate
...starting train validate with parameter search
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.050
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
next soln better than best
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
next soln better than best
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln worst
next soln worst and rejected
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.050
next soln worst
next soln worst but accepted
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln worst
next soln worst but accepted
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln worst
next soln worst and rejected
Here is a summary of the final results. The best solution has an error rate of 0.049, slightly worse than what we got through grid search, but achieved with about one third of the number of iterations of the grid search.
all parameter search results
train.num.estimators=180 train.max.depth=4 train.learning.rate=0.18 0.050
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.18 0.049
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.14 0.049
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.18 0.049
train.num.estimators=160 train.max.depth=5 train.learning.rate=0.18 0.050
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.18 0.049
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.14 0.049
train.num.estimators=160 train.max.depth=5 train.learning.rate=0.14 0.049
train.num.estimators=180 train.max.depth=5 train.learning.rate=0.14 0.049
train.num.estimators=160 train.max.depth=5 train.learning.rate=0.14 0.049
best parameter search result
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.14 0.049
Since Simulated Annealing is a stochastic optimization technique, you should expect a different result every time you run it.
Last Words
We have discussed various search and optimization techniques for tuning Supervised Machine Learning parameters. The focus of this post has been a stochastic search and optimization technique called Simulated Annealing.
Although in this post optimization has been performed at the level of the learning algorithm's parameters, in a supervised Machine Learning pipeline optimization can be performed at various other stages as well, e.g., feature selection and the selection of the learning algorithm itself.
To run the parameter search use case in this post, please refer to the last section of the tutorial.