The most challenging phase in a supervised Machine Learning pipeline is parameter tuning. There are many parameters, each with a range of values. The so-called grid search is a brute force approach that tries all possible combinations of parameter values, looking for the combination that gives the smallest test error.
Most supervised Machine Learning algorithms involve a significant number of parameters, and grid search is often not practical. To put things in perspective, if there are 10 parameters and each can take on 5 possible values, there will be 5^10 (almost 10 million) possible combinations to try.
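To see how the grid blows up, we can count the combinations for a hypothetical search space (this 10-parameter space is only for illustration, not the one used later in this post):

```python
# Hypothetical search space: 10 parameters, 5 candidate values each
param_values = {"param_%d" % i: [1, 2, 3, 4, 5] for i in range(10)}

# Grid search enumerates the Cartesian product of all the value lists,
# so the grid size is the product of the per-parameter value counts
num_combinations = 1
for values in param_values.values():
    num_combinations *= len(values)

print(num_combinations)  # 5 ** 10 = 9765625
```

Even at one second per model training and validation, an exhaustive sweep of this grid would take months.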
In this post we demonstrate that with a stochastic optimization technique called Simulated Annealing, a near optimal solution can be found with a significantly smaller number of iterations. The implementation can be found in my open source project avenir.
Simulated Annealing
Details of Simulated Annealing can be found in my earlier post, although there it was applied to a different optimization problem. The algorithm can be summarized as follows.
- Generate an initial random solution. Generate a new solution by perturbing the current solution
- If the new solution is better, accept it as the current solution. If it's better than the best solution found so far, set the best solution to the new solution
- If the new solution is worse, accept it probabilistically, where the probability is a function of the temperature and the cost difference between the current and the new solution. The temperature is a parameter of the algorithm; its value drops as the iterations progress.
- When the temperature is high, a worse solution is more likely to be accepted, which enables more exploration of the search space during the initial iterations.
There are two important temperature related parameters. The first is the initial temperature: a higher initial temperature induces more exploration of the search space. The second is the temperature reduction factor: a higher reduction factor makes the algorithm switch to exploitation mode earlier as the iterations progress.
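As a sketch (my own simplified version, not the actual avenir code), the probabilistic acceptance rule and a cooling schedule with the behavior described above could look like this:

```python
import math
import random

def accept(curr_cost, next_cost, temp):
    """Metropolis-style acceptance rule: always accept an improvement;
    accept a worse solution with probability exp(-delta / temp), so a
    higher temperature makes accepting a worse solution more likely."""
    delta = next_cost - curr_cost
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temp)

def anneal_schedule(initial_temp, reduction_factor, num_iter):
    """One plausible cooling schedule: geometric decay, where a higher
    reduction factor cools faster and hence switches the search to
    exploitation mode earlier."""
    temp = initial_temp
    for _ in range(num_iter):
        yield temp
        temp *= (1.0 - reduction_factor)
```

For example, with a cost increase of 0.1, a temperature of 10 accepts the worse solution almost always, while a temperature near zero almost never does.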
Gradient Boosted Trees
Although we will use Gradient Boosted Trees as implemented in the python scikit-learn Machine Learning library, the search and optimization techniques could be used with any Supervised Machine Learning algorithm.
Please refer to my earlier post on Gradient Boosted Trees and its application to a CRM lead conversion prediction problem. We will use the same use case for the parameter search and tuning problem.
Reading through that article, you will find that I have written wrapper classes around some of the scikit-learn Supervised Machine Learning algorithms. They enable you to train, validate and perform parameter search and tuning just by making appropriate changes in a properties configuration file, without having to write any python code. Metadata about your data set is also defined in the configuration file.
Parameter Space Search
The following three parameter search and tuning algorithms are supported by my implementation; here is the python implementation.
- Grid
- Random
- Simulated Annealing
As alluded to earlier, in grid search the whole parameter space is searched exhaustively. In random search, a random combination of parameter values is generated in every iteration. Search by Simulated Annealing has been described earlier. The techniques differ only in how the set of parameter values for the next iteration is chosen.
The following steps are performed in every iteration through the parameter search space:
- A new set of parameter values is selected
- A model is built and trained with the set of parameters selected for that iteration
- The model is validated with k-fold cross validation and the test error is found
- The combination of parameter values that provides the best solution is kept track of
The best solution is the combination of parameter values that minimizes the test or generalization error.
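The per-iteration loop above can be sketched in python. This is a simplified illustration, not the actual avenir code: make_classification stands in for the CRM lead conversion data set, the keys are scikit-learn's native parameter names rather than the configuration file keys, and random selection of parameter values is shown because it is the simplest strategy to write down; grid search and Simulated Annealing differ only in how the next set of values is picked.

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the CRM lead conversion data set
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Candidate values mirroring the search space used later in the post;
# keys are scikit-learn parameter names, not the config file keys
space = {
    "learning_rate": [0.14, 0.16, 0.18],
    "n_estimators": [140, 160, 180],
    "max_depth": [3, 4, 5],
}

def evaluate(params):
    """Build a model with the given parameter values and return the
    k-fold cross validation error (1 - mean accuracy)."""
    model = GradientBoostingClassifier(**params)
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

# Track the best combination of parameter values seen so far
best_params, best_error = None, float("inf")
for _ in range(5):
    params = {name: random.choice(values) for name, values in space.items()}
    error = evaluate(params)
    if error < best_error:
        best_params, best_error = params, error
```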
Grid Search
First, we will perform grid search. For brevity we have kept the parameter search space small. The search space is defined with a configuration parameter that lists the parameters constituting the search space. For each such parameter, a list of candidate values is provided through another set of configuration parameters. You can always expand the search space by including more parameters and a wider range of values for each parameter.
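A hypothetical sketch of such a configuration is shown below. The three parameter keys appear in the search output later in this post; the key that lists the search space parameters, and the exact list syntax, are my illustrative guesses and may differ from avenir's actual configuration format.

```properties
# Hypothetical: key naming the parameters that constitute the search space
train.search.params=train.learning.rate,train.num.estimators,train.max.depth

# Candidate values for each search space parameter
train.learning.rate=0.14,0.16,0.18
train.num.estimators=140,160,180
train.max.depth=3,4,5
```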
Here is a summary of the results from grid search. There are 27 combinations of parameter values and hence 27 iterations. The best error rate we got is 0.046.
all parameter search results
train.learning.rate=0.14 train.num.estimators=140 train.max.depth=3 0.050
train.learning.rate=0.14 train.num.estimators=140 train.max.depth=4 0.048
train.learning.rate=0.14 train.num.estimators=140 train.max.depth=5 0.048
train.learning.rate=0.14 train.num.estimators=160 train.max.depth=3 0.049
train.learning.rate=0.14 train.num.estimators=160 train.max.depth=4 0.049
train.learning.rate=0.14 train.num.estimators=160 train.max.depth=5 0.049
train.learning.rate=0.14 train.num.estimators=180 train.max.depth=3 0.047
train.learning.rate=0.14 train.num.estimators=180 train.max.depth=4 0.049
train.learning.rate=0.14 train.num.estimators=180 train.max.depth=5 0.049
train.learning.rate=0.16 train.num.estimators=140 train.max.depth=3 0.049
train.learning.rate=0.16 train.num.estimators=140 train.max.depth=4 0.049
train.learning.rate=0.16 train.num.estimators=140 train.max.depth=5 0.050
train.learning.rate=0.16 train.num.estimators=160 train.max.depth=3 0.048
train.learning.rate=0.16 train.num.estimators=160 train.max.depth=4 0.049
train.learning.rate=0.16 train.num.estimators=160 train.max.depth=5 0.051
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=3 0.046
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=4 0.051
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=5 0.051
train.learning.rate=0.18 train.num.estimators=140 train.max.depth=3 0.048
train.learning.rate=0.18 train.num.estimators=140 train.max.depth=4 0.049
train.learning.rate=0.18 train.num.estimators=140 train.max.depth=5 0.051
train.learning.rate=0.18 train.num.estimators=160 train.max.depth=3 0.049
train.learning.rate=0.18 train.num.estimators=160 train.max.depth=4 0.049
train.learning.rate=0.18 train.num.estimators=160 train.max.depth=5 0.050
train.learning.rate=0.18 train.num.estimators=180 train.max.depth=3 0.049
train.learning.rate=0.18 train.num.estimators=180 train.max.depth=4 0.050
train.learning.rate=0.18 train.num.estimators=180 train.max.depth=5 0.051
best parameter search result
train.learning.rate=0.16 train.num.estimators=180 train.max.depth=3 0.046
Simulated Annealing Search
With Simulated Annealing, we don't do an exhaustive search. The maximum number of iterations was set to 10. A new solution is generated from the current solution by randomly selecting a parameter and then randomly selecting a value for that parameter. Essentially, only one parameter value in the current solution is mutated.
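The single-parameter mutation described above can be sketched as follows. This is a simplified illustration rather than the actual avenir code; the candidate values match the grid search section of this post.

```python
import random

# Search space from the grid search section: parameter name -> candidate values
space = {
    "train.learning.rate": [0.14, 0.16, 0.18],
    "train.num.estimators": [140, 160, 180],
    "train.max.depth": [3, 4, 5],
}

def mutate(solution, space):
    """Generate a neighbor of the current solution by re-sampling the value
    of one randomly chosen parameter; all other parameter values carry over
    unchanged. (The re-sampled value may occasionally equal the old one.)"""
    neighbor = dict(solution)
    name = random.choice(list(space))
    neighbor[name] = random.choice(space[name])
    return neighbor
```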
Here is the detailed log output as Simulated Annealing searched through the parameter space.
running mode: trainValidate
...starting train validate with parameter search
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.050
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
next soln better than best
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
next soln better than best
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln worst
next soln worst and rejected
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.050
next soln worst
next soln worst but accepted
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln worst
next soln worst but accepted
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln better
...next parameter set
...building model
...training and kfold cross validating model
average error with k fold cross validation 0.049
next soln worst
next soln worst and rejected
Here is a summary of the final results. The best solution has an error rate of 0.049, slightly worse than what we got through grid search, but achieved with about one third of the number of iterations of the grid search.
all parameter search results
train.num.estimators=180 train.max.depth=4 train.learning.rate=0.18 0.050
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.18 0.049
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.14 0.049
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.18 0.049
train.num.estimators=160 train.max.depth=5 train.learning.rate=0.18 0.050
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.18 0.049
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.14 0.049
train.num.estimators=160 train.max.depth=5 train.learning.rate=0.14 0.049
train.num.estimators=180 train.max.depth=5 train.learning.rate=0.14 0.049
train.num.estimators=160 train.max.depth=5 train.learning.rate=0.14 0.049
best parameter search result
train.num.estimators=160 train.max.depth=4 train.learning.rate=0.14 0.049
Since Simulated Annealing is a stochastic optimization technique, you should expect a different result every time you run it.
Last Words
We have discussed various search and optimization techniques for tuning Supervised Machine Learning parameters. The focus of this post has been a stochastic search and optimization technique called Simulated Annealing.
Although in this post optimization has been performed at the level of the learning algorithm's parameters, in a supervised Machine Learning pipeline optimization can be performed at various other stages as well, e.g., feature selection and the selection of the learning algorithm itself.
To run the parameter search use case in this post, please refer to the last section of the tutorial.