Machine Learning at Scale with Parallel Processing

Machine Learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this post we will discuss how to have Machine Learning at scale with Hadoop or Spark. We will consider three different ways parallel processing can benefit Machine Learning.

When thinking about parallel processing in the context Machine Learning, what immediately jumps in our mind is data partitioning along with divide and conquer learning algorithms. However as we will find out Continue reading

Posted in Hadoop and Map Reduce, Machine Learning, Spark | Tagged , , | 1 Comment

Mobile Phone Usage Data Analytics for Effective Marketing Campaign

Insights gained from analyzing mobile phone usage data can be extremely valuable in marketing campaign and customer engagement efforts. For example, hour of the day when an user engages most with his or her mobile  device could be used to choose  the time to send a marketing message or email. Most frequent tower locations could be used to for promotional efforts of nearby businesses.

In this a post, we will go over a Spark based implementation for histogram and other simple statistics for mobile phone usage data. The solution is available Continue reading

Posted in Big Data, Data Profiling, Marketing Analytic, Spark, Statistics | Tagged , | Leave a comment

Debunking the Myth of Top Ten Machine Learning Algorithms

This kind of broad brush statements about Machine Learning algorithms are made often and there are lot of online content alluding to this simplistic view of Machine Learning. It’s tempting to gravitate towards simplistic views and use recipe like approach while solving Machine Learning problems. 

Unfortunately, the reality of Machine Learning is complex. The choice of learning algorithms and the algorithm tuning process can be confounding. Choosing the right learning algorithm for a given problem is not a trivial task.

In this post we will delve into the computational learning theory and use it to debunk the myth of the so called Top Ten Machine Learning algorithms. Hopefully, after reading this post, you will be more  cautious about blindly using one of the algorithms from the so called top 10 list. In this post, I will keep the discussion Continue reading

Posted in Machine Learning | Tagged | Leave a comment

JSON to Relational Mapping with Spark

If there one data format that’s ubiquitous, it’s JSON. Whether  you are calling an API, or exporting data from some system, the format is most likely to be JSON these days. However many databases can not handle  JSON and you want to store the data in relational format. You may want to do offline batch processing with Spark or Hadoop and and your Hadoop or Spark application may only support flat field delimited record oriented input.

Whatever the case, a tool is needed to map JSON to flat record oriented data. In this post we will go over a Spark based solution for converting JSON to flat relational data. The implementation is part of my open source project chombo on github. A Hadoop Continue reading

Posted in Big Data, ETL, Spark | Tagged , | Leave a comment

Gaining Insight by Mining Simple Rules from Customer Service Call Data

Although the goal for most predictive analytic problem is to make prediction, sometimes we are more interested in the model learnt by the learning algorithm. If the learnt model could be expressed as s set of rules, then those rules  could be source of valuable insight into the data. For most Machine Learning algorithms, the learnt model is a black box and can not be interpreted in an intuitive way. However, Decision Tree is an exception. Decision Tree result is essentially a set of rules. There are few other rule extraction algorithms.

In this post I will cover solutions for mining simple rules based on one attribute only. It’s quick and dirty way of gaining insight into data. The solution is part of my open source project avenir. The implementation Continue reading

Posted in Big Data, Data Science, Hadoop and Map Reduce, Machine Learning, Rule Mining | Tagged , , | Leave a comment

Supplier Fulfillment Forecasting with Continuous Time Markov Chain using Spark

In a supply chain, quantity ordered from a down stream supplier or manufacturer are not necessarily always completely fulfilled, because of various factors. If the extent of under fulfillment could be predicted over a time horizon, then the shortfall items could be ordered from another fallback supplier. In this post we will go through a prediction model over a time horizon based on Continuous Time Markov Chain and how it can be used to solve the supply chain problem.

The Spark implementation for CTMC  is available in my open source project avenir. It involves two Spark jobs. The first job Continue reading

Posted in Big Data, Data Science, Machine Learning, Scala, Spark | Tagged , , | Leave a comment

Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data, where most of the data is just grossly invalid. It might save lot of pain and headache, if we could do some simple sanity checks before feeding the data  into a complex processing pipe line. If you suspect that the data is mostly invalid, validation checks could be performed on a fraction of the data by sub sampling. We will go though a Spark job that does what I just described. It’s part of my open source project chombo. A Hadoop based implementation is also available.

Solution for a more rigorous data validation with many out of the box data validators Continue reading

Posted in ETL, Hadoop and Map Reduce, Spark | Tagged | Leave a comment