Machine Learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this post we will discuss how to do Machine Learning at scale with Hadoop or Spark. We will consider three different ways parallel processing can benefit Machine Learning.
When thinking about parallel processing in the context of Machine Learning, what immediately comes to mind is data partitioning along with divide and conquer learning algorithms. However, as we will find out Continue reading
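To make the divide and conquer idea concrete, here is a minimal pure Python sketch, not taken from the post and with all names hypothetical, of fitting a simple linear regression by computing additive sufficient statistics per partition and merging them. This is the same map and reduce pattern a Spark job would use at scale:

```python
from functools import reduce

def partial_stats(partition):
    """Map step: per-partition sufficient statistics (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Reduce step: the statistics are additive, so merging is an elementwise sum."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Closed-form simple linear regression from the merged statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data following y = 2x + 1, split into two "partitions"
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
slope, intercept = fit_line(reduce(merge, map(partial_stats, partitions)))
```

Because the statistics are additive, the result is identical to training on the full data set, which is what makes this particular algorithm embarrassingly parallel; many learning algorithms do not decompose this cleanly.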
Insights gained from analyzing mobile phone usage data can be extremely valuable in marketing campaigns and customer engagement efforts. For example, the hour of the day when a user engages most with his or her mobile device could be used to choose the time to send a marketing message or email. The most frequent tower locations could be used for promotional efforts by nearby businesses.
In this post, we will go over a Spark based implementation of histograms and other simple statistics for mobile phone usage data. The solution is available Continue reading
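As a rough illustration of the kind of statistics involved, here is a pure Python sketch with hypothetical data and field names (not the chombo implementation) that builds a per user histogram of engagement by hour of the day and picks each user's peak hour:

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical usage records: (user_id, epoch timestamp in seconds)
records = [
    ("u1", 1500000000), ("u1", 1500003600),
    ("u2", 1500007200), ("u1", 1500090000),
]

def hour_of_day(ts):
    """Extract the hour of the day (0-23, UTC) from an epoch timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).hour

# Histogram keyed by (user, hour): the same aggregation a Spark
# map followed by reduceByKey would perform at scale
hist = Counter((user, hour_of_day(ts)) for user, ts in records)

# Peak engagement hour per user, with its count
peak = {}
for (user, hour), count in hist.items():
    if user not in peak or count > peak[user][1]:
        peak[user] = (hour, count)
```

The per user peak hour is exactly the kind of summary a campaign scheduler could consume directly.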
Broad brush statements like this about Machine Learning algorithms are made often, and there is a lot of online content alluding to this simplistic view of Machine Learning. It’s tempting to gravitate towards simplistic views and use a recipe-like approach while solving Machine Learning problems.
Unfortunately, the reality of Machine Learning is complex. The choice of learning algorithms and the algorithm tuning process can be confounding. Choosing the right learning algorithm for a given problem is not a trivial task.
In this post we will delve into computational learning theory and use it to debunk the myth of the so-called top ten Machine Learning algorithms. Hopefully, after reading this post, you will be more cautious about blindly using one of the algorithms from the so-called top ten list. I will keep the discussion Continue reading
If there is one data format that’s ubiquitous, it’s JSON. Whether you are calling an API or exporting data from some system, the format is most likely to be JSON these days. However, many databases cannot handle JSON, and you may want to store the data in relational format. You may want to do offline batch processing with Spark or Hadoop, and your Hadoop or Spark application may only support flat, field delimited, record oriented input.
Whatever the case, a tool is needed to map JSON to flat, record oriented data. In this post we will go over a Spark based solution for converting JSON to flat relational data. The implementation is part of my open source project chombo on GitHub. A Hadoop Continue reading
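To show the basic idea, here is a minimal Python sketch, an illustration rather than the chombo implementation, that recursively flattens nested JSON into dot delimited fields and emits a comma delimited record:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested JSON into dot-delimited flat fields."""
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        # List elements become index-suffixed fields
        for i, v in enumerate(obj):
            flat.update(flatten(v, f"{prefix}{i}."))
    else:
        flat[prefix[:-1]] = obj
    return flat

record = json.loads('{"name": "acme", "address": {"city": "SF", "zip": "94107"}}')
flat = flatten(record)
# Emit a flat comma delimited record, as a Spark map step would do per input line
line = ",".join(str(v) for v in flat.values())
```

A real converter also needs a schema or field ordering so every output record has the same columns; this sketch simply relies on the key order of the input.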
Although the goal of most predictive analytics problems is to make predictions, sometimes we are more interested in the model learnt by the learning algorithm. If the learnt model could be expressed as a set of rules, then those rules could be a source of valuable insight into the data. For most Machine Learning algorithms, the learnt model is a black box and cannot be interpreted in an intuitive way. However, Decision Tree is an exception; its result is essentially a set of rules. There are a few other rule extraction algorithms.
In this post I will cover solutions for mining simple rules based on one attribute only. It’s a quick and dirty way of gaining insight into data. The solution is part of my open source project avenir. The implementation Continue reading
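A bare bones Python sketch of the single attribute idea, with hypothetical data (the avenir implementation is more involved): for each value of one categorical attribute, emit a rule predicting the majority class, keeping only rules whose confidence clears a threshold:

```python
from collections import defaultdict

# Hypothetical labeled records: (attribute_value, class_label)
data = [("red", "stop"), ("red", "stop"), ("green", "go"),
        ("green", "go"), ("green", "stop"), ("yellow", "stop")]

def mine_rules(records, min_confidence=0.8):
    """Emit rules 'attr = value -> class' whose confidence meets the threshold."""
    counts = defaultdict(lambda: defaultdict(int))
    for value, label in records:
        counts[value][label] += 1
    rules = []
    for value, by_class in counts.items():
        total = sum(by_class.values())
        # Majority class for this attribute value
        label, n = max(by_class.items(), key=lambda kv: kv[1])
        conf = n / total
        if conf >= min_confidence:
            rules.append((value, label, conf))
    return rules

rules = mine_rules(data)
```

Here "green" predicts "go" with only 2/3 confidence, so it falls below the threshold and no rule is emitted for it; that is the quick and dirty filter at work.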
In a supply chain, quantities ordered from a downstream supplier or manufacturer are not necessarily always completely fulfilled, because of various factors. If the extent of under fulfillment could be predicted over a time horizon, then the shortfall items could be ordered from another fallback supplier. In this post we will go through a prediction model over a time horizon based on a Continuous Time Markov Chain (CTMC) and show how it can be used to solve the supply chain problem.
The Spark implementation for CTMC is available in my open source project avenir. It involves two Spark jobs. The first job Continue reading
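To illustrate what a CTMC prediction looks like, here is a small pure Python sketch with hypothetical rates (not the avenir implementation) using two supplier states; it approximates the transition probability matrix P(t) = exp(Qt) by compounding small Euler steps:

```python
# Hypothetical supplier states: 0 = fully fulfilling, 1 = under fulfilling.
# Q is the CTMC rate matrix; each row sums to zero.
Q = [[-0.2, 0.2],
     [0.5, -0.5]]

def transition_prob(Q, t, steps=10000):
    """Approximate P(t) = exp(Qt) by compounding (I + Q*dt) over many small steps."""
    n = len(Q)
    dt = t / steps
    # P starts as the identity matrix
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    step = [[(1.0 if i == j else 0.0) + Q[i][j] * dt for j in range(n)]
            for i in range(n)]
    for _ in range(steps):
        P = [[sum(P[i][k] * step[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return P

# Probability of being in each state 5 time units out, given the current state
P5 = transition_prob(Q, 5.0)
```

Row i of P5 gives the state distribution at the horizon given the supplier currently sits in state i; the probability of the under fulfilling state is what drives the fallback ordering decision.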
Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data sets where most of the records are grossly invalid. It might save a lot of pain and headache if we could do some simple sanity checks before feeding the data into a complex processing pipeline. If you suspect that the data is mostly invalid, validation checks could be performed on a fraction of the data by sub sampling. We will go through a Spark job that does what I just described. It’s part of my open source project chombo. A Hadoop based implementation is also available.
A solution for more rigorous data validation with many out of the box data validators Continue reading
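The sanity check idea can be sketched in a few lines of plain Python, with a hypothetical record format and function names rather than the chombo API: draw a random sub sample, run a cheap validity predicate, and report whether the valid fraction clears a threshold:

```python
import random

def mostly_valid(records, is_valid, sample_fraction=0.1, threshold=0.9, seed=42):
    """Validate a random sub sample and report whether the fraction
    of valid records clears the threshold."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * sample_fraction))
    sample = rng.sample(records, k)
    valid = sum(1 for r in sample if is_valid(r))
    return valid / k >= threshold

def is_valid(line):
    """Cheap check: a record has 3 comma delimited fields, second one numeric."""
    fields = line.split(",")
    return len(fields) == 3 and fields[1].isdigit()

clean = [f"id{i},{i},ok" for i in range(100)]
dirty = ["garbage"] * 100
```

On a clean data set the check passes and on fully garbled input it fails, letting you bail out cheaply before the expensive pipeline runs; being a sample based estimate, it can of course misjudge borderline data sets.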