Machine Learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this post we will discuss how to have Machine Learning at scale with Hadoop or Spark. We will consider three different ways parallel processing can benefit Machine Learning.
When thinking about parallel processing in the context Machine Learning, what immediately jumps in our mind is data partitioning along with divide and conquer learning algorithms. However as we will find out that data partitioning is not necessarily the best way to exploit parallel processing. There are other more fruitful areas.
Our discussion is focused on classical Machine Learning algorithms and not Deep Learning. Moreover we are only considering Hadoop and Spark for parallel processing platform.
With data partitioning and parallelism, the learning algorithm operates on each partition in parallel and finally the results for each partition are stitched together. This is essentially the divide and conquer pattern. The critical assumption is that learning algorithms can be recast to execute in parallel. However not all learning algorithms are amenable to parallel processing.
At first blush, this may sound very attractive. However careful consideration of the following observations will lead us to a different conclusion.
- Amount of training data required has a complex relationship with model complexity, expressed as a VC dimension. Higher model complexity demands more training data.
- Most Machine Learning algorithms are not capable of handling very complex model because of high generalization error. Deep learning is an exception to this.
- As a practical matter, we don’t build very complex model. Complexity is chosen based on tradeoff between error due to bias and error due to variance.
Based on these observations, we can conclude that Machine Learning algorithms as applied to real world problems do not require very large training data set and hence do not require parallel processing capabilities of Hadoop or Spark. More on relationship between model complexity and training data size can be found from my earlier post.
We could still use Hadoop or Spark. We can use a sequential learning algorithm that will operate on the whole data set without any partitioning.
This is the flip side of data partitioning. The function is decomposed into into number of independent function. Each function operates in parallel on the whole data set. Results are consolidated when all the functions have completed.
There is no learning algorithm that is amenable to functional decomposition. Moreover Hadoop and Spark provide provide parallel processing capabilities only through data partitioning.
Parallel processing through functional decomposition is not relevant in our discussion. I mentioned it only for the sake of completeness.
Hyper Parameter Learning
This is an area where we can exploit parallel processing in Hadoop and Spark very effectively. Generally, any learning algorithm has large number of parameters that influence the final result i.e. test error or generalization error. Our goal is to select a set of parameter values that will give us best performance i.e. minimum error.
This is essentially an optimization problem where we want to minimize error in the multi dimensional parameter space. However, the caveat s that error can not be expressed as function of the parameters in closed form and hence many classical optimization techniques can not be used.
Here are some of the optimization techniques that can be used for finding the optimum set of parameter values. The number of optimization techniques available is by no means limited to this list. For some set of parameter values we build the predictive model and test it to find the test or generalization error.
Not all of these optimization algorithms can be executed in a parallel fashion. Only the first two algorithms in the list below can be executed in parallel.
- Grid search
- Random search
- Gradient based search
- Bayesian optimization
- Simulated annealing
Grid search is a greedy algorithm. We consider an exhaustive set of all possible parameter values. The set of parameter values can be partitioned and executed in parallel. grid search is not practical when there are may parameters and each one having many possible values. The number of possible values is
In Random Search, we take a random sub set of parameter values and try them all. With random search, we will get sub optimal values for parameters. The probability of finding the parameters values for minimum error increases with the size of the set of parameter values. Random search can be performed in parallel.
In Gradient Based Search, gradients are taken with respect to parameters and used to select the next data point in the parameter space. It’s a sequential optimization algorithm.
Bayesian Optimization is smart guided search algorithm through the parameter space. Using the parameters values that have been tried so far and Bayesian Statistics, the next point in the parameter space is chosen. It’s a sequential algorithm.
Simulated Annealing is also a sequential search technique. Based on the current solution, a neighborhood point is selected in the parameter space. The neighborhood point is accepted based on a probability. Sometimes a point worse than the current one may be selected, which enables escaping from local minima.
In ensemble learning, multiple predictive models are built. Random forest is a good example where it is an ensemble of decision trees. The ensemble of models is used for prediction e.g., by taking majority vote. With ensemble learning, error due to variance can be reduced.
The models in the ensemble are generally built using a sub set of training data and / or subset of features. There are other generic ways to create ensemble e.g. bagging, boosting and stacking. There are also learning algorithm specific ensemble techniques. Since the models in the ensemble are trained independently, they can be trained in parallel.
Prediction at Scale
Having built a predictive model, some times the model needs to be deployed to predict on massive amount of data and with low latency. Additionally, the prediction may have to be made in near real time.
Here is an example where predictions are to be made in near real time and large amount of data is involved. Consider a a predictive model that predicts the probability of a customer buying something in the current visit to an eCommerce site, based real time click steam data.
This could be done with Spark Streaming with click stream data arriving through a Kafka topic. To maximize throughput, the data could be processed with multiple Spark partitions. Each Spark task processing a partition will load the predictive model.
The output of the prediction could be written back to another Kafka topic. The web site could personalize content based on prediction from the model.
Parallel processing offered by modern data processing platforms like Hadoop and Spark can be leveraged for Machine Learning in multiple ways, especially for hyper parameter learning and ensemble learning.