Machine Learning at Scale with Parallel Processing


Machine Learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this post we will discuss how to have Machine Learning at scale with Hadoop or Spark. We will consider three different ways parallel processing can benefit Machine Learning.

When thinking about parallel processing in the context Machine Learning, what immediately jumps in our mind is data partitioning along with divide and conquer learning algorithms. However as we will find out that  data partitioning is not necessarily the best way to exploit parallel processing. There are other more fruitful areas.

Our discussion is focused on classical Machine Learning algorithms and not Deep Learning. Moreover we are only considering Hadoop and Spark for parallel processing platform.

Data Partitioning

With data partitioning and parallelism, the learning algorithm operates on each partition in parallel and finally the results for each partition are stitched together. This is essentially the divide and conquer pattern. The critical assumption is that learning algorithms can be recast to execute in parallel. However not all learning algorithms are amenable to parallel processing.

At first blush, this may sound very attractive. However careful consideration of the following observations will lead us to a different conclusion.

  1. Amount of training data required has a complex relationship with model complexity, expressed as a VC dimension. Higher model complexity demands more training data.
  2. Most Machine Learning algorithms are not capable of handling very complex model because of high generalization error. Deep learning is an exception to this.
  3. As a practical matter, we don’t build very complex model. Complexity is chosen based on tradeoff between error due to bias and error due to variance.

Based on these observations, we can conclude that Machine Learning algorithms as applied to real world problems do not require very large training data set and hence do not require parallel processing capabilities of Hadoop or Spark. More on relationship between model complexity and training data size can be found from my earlier post.

We could still use Hadoop or Spark. We can use a sequential learning algorithm that will operate on the whole data set without any partitioning.

Function Partitioning

This is the flip side of data partitioning. The function is decomposed into into number of independent function. Each function operates in parallel on the whole data set. Results are consolidated when all the functions have completed.

There is no learning algorithm that is amenable to functional decomposition. Moreover Hadoop and Spark provide provide parallel processing capabilities only through data partitioning.

Parallel processing through functional decomposition is not relevant in our discussion. I mentioned it only for the sake of completeness.

Hyper Parameter Learning

This is an area where we can exploit parallel processing in Hadoop and Spark very effectively. Generally, any learning algorithm has large number of parameters that influence the final result i.e. test error or generalization error. Our goal is to select a set of parameter values that will give us best performance i.e. minimum error.

This is essentially an optimization problem where we want to minimize error in the multi dimensional parameter space. However, the caveat s that error can not be expressed as function of the parameters in closed form and hence many classical optimization techniques can not be used.

Here are some of the optimization techniques that can be used for finding the optimum set of parameter values. The number of optimization techniques available is by no means limited to this list. For some set of parameter values we build the predictive model and test it to find the test or generalization error.

Not all of these optimization algorithms can be executed in a parallel fashion. Only the first two algorithms in the list below can be executed in parallel.

  1. Grid search
  2. Random search
  3. Gradient based search
  4. Bayesian optimization
  5. Simulated annealing

Grid search is a greedy algorithm. We consider an exhaustive set of all possible parameter values. The set of parameter values can be partitioned and executed in parallel. grid search is not practical when there are may parameters and each one having many possible values. The number of possible values is

N = ∏i=1 to n ki   where
N = number of possible parameter values
n = number of parameters
ki = number of possible values for ith parameter

In Random Search, we take a random sub set of parameter values and try them all. With random search, we will get sub optimal values for parameters. The probability of finding the parameters values for minimum error increases with the size of the set of parameter values. Random search can be performed in parallel.

In Gradient Based Search, gradients are taken with respect to parameters and used to select the next data point in the parameter space. It’s a sequential optimization algorithm.

Bayesian Optimization is smart guided search algorithm through the parameter space. Using the parameters values that have been tried so far and Bayesian Statistics, the next point in the parameter space is chosen. It’s a sequential algorithm.

Simulated Annealing is also a sequential search technique. Based on the current solution, a neighborhood point  is selected in the parameter space. The neighborhood point is accepted based on a probability. Sometimes a point worse than the current one may be selected, which enables escaping from local minima.

Ensemble Learning

In ensemble learning, multiple predictive models are built. Random forest is a good example where it is an ensemble of decision trees.  The ensemble of models is used for prediction e.g., by taking majority vote. With ensemble learning, error due to variance can be reduced.

The models in the ensemble are generally built using a sub set of training data and / or subset of features.  There are other generic ways to  create ensemble e.g. bagging, boosting and stacking. There are also learning algorithm specific ensemble techniques.  Since the models in the ensemble are trained independently, they can be trained in parallel.

Prediction at Scale

Having built a predictive model, some times the model needs to be deployed to predict on massive amount of data and with low latency. Additionally, the prediction may have to be made in near real time.

Here is an example where predictions are to be made in near real time and large amount of data is involved.  Consider a a predictive  model that predicts the probability of a customer buying something in the current visit to an eCommerce site, based real time click steam data.

This could be done with Spark Streaming with click stream data arriving through a Kafka topic.   To maximize throughput, the data could be processed with multiple Spark partitions. Each Spark task processing a partition will load the predictive model.

The output of the prediction could be written back to another Kafka topic. The web site could personalize content based on prediction from the model.

Summing Up

Parallel processing offered by modern data processing platforms like Hadoop and Spark can be leveraged for Machine Learning in multiple ways, especially for hyper parameter learning and ensemble learning.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Hadoop and Map Reduce, Machine Learning, Spark and tagged , , . Bookmark the permalink.

One Response to Machine Learning at Scale with Parallel Processing

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s