Debunking the Myth of Top Ten Machine Learning Algorithms

Broad-brush statements like this are often made about Machine Learning algorithms, and there is a lot of online content alluding to this simplistic view of Machine Learning. It's tempting to gravitate towards simplistic views and use a recipe-like approach when solving Machine Learning problems.

Unfortunately, the reality of Machine Learning is complex. The choice of learning algorithms and the algorithm tuning process can be confounding. Choosing the right learning algorithm for a given problem is not a trivial task.
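
One concrete way to see this is to benchmark a few candidate algorithms on your own data and let the numbers decide, rather than trusting a ranked list. Below is a minimal Spark MLlib sketch along those lines, meant for the Spark shell; the `compareLearners` helper, the two candidates, and the assumed `features`/`label` columns are all invented for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.DataFrame

// Benchmark several learners on the same split; the ranking that comes out
// depends on the data set, not on any universal "top ten" ordering.
// Assumes `data` has the usual MLlib "features" and "label" columns.
def compareLearners(data: DataFrame): Unit = {
  val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
  val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")
  val candidates: Seq[(String, PipelineStage)] = Seq(
    "logistic regression" -> new LogisticRegression(),
    "random forest"       -> new RandomForestClassifier())
  for ((name, learner) <- candidates) {
    val model = new Pipeline().setStages(Array(learner)).fit(train)
    val auc = evaluator.evaluate(model.transform(test))
    println(f"$name%-20s AUC = $auc%.3f")
  }
}
```

Run it on two different data sets and the ranking will often flip, which is precisely the point.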

In this post we will delve into computational learning theory and use it to debunk the myth of the so-called Top Ten Machine Learning algorithms. Hopefully, after reading this post, you will be more cautious about blindly using an algorithm just because it appears on such a list. I will keep the discussion… Continue reading


JSON to Relational Mapping with Spark

If there is one data format that's ubiquitous, it's JSON. Whether you are calling an API or exporting data from some system, the format these days is most likely to be JSON. However, many databases cannot handle JSON, and you may want to store the data in a relational format. You may also want to do offline batch processing with Spark or Hadoop, and your application may only support flat, field-delimited, record-oriented input.

Whatever the case, a tool is needed to map JSON to flat, record-oriented data. In this post we will go over a Spark based solution for converting JSON to flat relational data. The implementation is part of my open source project chombo on github. A Hadoop… Continue reading
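
To make the mapping concrete, here is a sketch using plain Spark SQL rather than the chombo implementation; the input schema (orderId, customer, items) and the file paths are invented. Nested objects map to dotted column paths and arrays are exploded into child rows, the usual parent/child relational treatment of repeated groups.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object JsonToFlat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("jsonToFlat").getOrCreate()

    // Read newline-delimited JSON; Spark infers the nested schema.
    val raw = spark.read.json("orders.json")

    // Flatten: nested objects become dotted paths, the items array becomes
    // one child row per element.
    val flat = raw
      .withColumn("item", explode(col("items")))
      .select(
        col("orderId"),
        col("customer.name").alias("customer_name"),
        col("item.sku").alias("sku"),
        col("item.qty").alias("qty"))

    // Write flat, field-delimited, record-oriented output.
    flat.write.option("sep", "\t").csv("orders_flat")
    spark.stop()
  }
}
```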


Gaining Insight by Mining Simple Rules from Customer Service Call Data

Although the goal of most predictive analytics problems is to make predictions, sometimes we are more interested in the model learnt by the learning algorithm. If the learnt model can be expressed as a set of rules, those rules can be a source of valuable insight into the data. For most Machine Learning algorithms, the learnt model is a black box that cannot be interpreted in an intuitive way. However, Decision Tree is an exception: its result is essentially a set of rules. There are a few other rule extraction algorithms as well.

In this post I will cover solutions for mining simple rules based on one attribute only. It's a quick and dirty way of gaining insight into data. The solution is part of my open source project avenir. The implementation… Continue reading
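
To give a flavor of what mining single-attribute rules looks like, here is a small Spark SQL sketch for the Spark shell, not the avenir implementation; the column names and thresholds are made up. For each value v of one attribute it computes the support and confidence of the rule "attribute = v implies class = c" and keeps the strong ones.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Mine rules of the form "attr = v => label = c" with enough support
// (row count) and confidence (conditional probability of the class).
def mineSingleAttributeRules(df: DataFrame, attr: String, label: String,
    minSupport: Long = 50, minConfidence: Double = 0.8): DataFrame = {
  val counts = df.groupBy(col(attr), col(label)).count()
  val totals = df.groupBy(col(attr)).agg(count("*").alias("total"))
  counts.join(totals, attr)
    .withColumn("confidence", col("count") / col("total"))
    .filter(col("count") >= minSupport && col("confidence") >= minConfidence)
    .orderBy(desc("confidence"))
}

// e.g. mineSingleAttributeRules(callData, "issueType", "escalated").show()
```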


Supplier Fulfillment Forecasting with Continuous Time Markov Chain using Spark

In a supply chain, the quantity ordered from a downstream supplier or manufacturer is not necessarily always completely fulfilled, because of various factors. If the extent of under-fulfillment could be predicted over a time horizon, the shortfall items could be ordered from a fallback supplier. In this post we will go through a prediction model over a time horizon based on Continuous Time Markov Chain (CTMC) and how it can be used to solve this supply chain problem.
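
The core computation behind such a model is the transition probability matrix over the horizon, P(t) = exp(Qt), where Q is the generator matrix of transition rates. Here is a self-contained Scala sketch, separate from the avenir implementation; the three fulfillment states and the rates in Q are invented, and the naive truncated Taylor series would be replaced by uniformization or scaling-and-squaring in production code.

```scala
// States (invented): 0 = fully fulfilled, 1 = partially fulfilled, 2 = unfulfilled.
object Ctmc {
  type Mat = Array[Array[Double]]

  def matMul(a: Mat, b: Mat): Mat =
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      a(0).indices.map(k => a(i)(k) * b(k)(j)).sum
    }

  // P(t) = exp(Q t) via truncated Taylor series: sum over k of (Qt)^k / k!
  def expm(q: Mat, t: Double, terms: Int = 30): Mat = {
    val n = q.length
    var result = Array.tabulate(n, n)((i, j) => if (i == j) 1.0 else 0.0)
    var term = result
    for (k <- 1 to terms) {
      term = matMul(term, q).map(_.map(_ * t / k))
      result = Array.tabulate(n, n)((i, j) => result(i)(j) + term(i)(j))
    }
    result
  }

  def main(args: Array[String]): Unit = {
    // Generator: rows sum to zero; off-diagonals are transition rates per week.
    val q: Mat = Array(
      Array(-0.20,  0.15,  0.05),
      Array( 0.30, -0.40,  0.10),
      Array( 0.25,  0.35, -0.60))
    val p = expm(q, t = 4.0) // transition probabilities over a 4-week horizon
    println(p.map(_.map(v => f"$v%.3f").mkString(" ")).mkString("\n"))
  }
}
```

Row i of P(t) gives the probability of each fulfillment state t weeks out given the current state, which is exactly the quantity needed to size the fallback order.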

The Spark implementation of CTMC is available in my open source project avenir. It involves two Spark jobs. The first job… Continue reading


Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data that is mostly grossly invalid. It might save a lot of pain and headache if we could do some simple sanity checks before feeding the data into a complex processing pipeline. If you suspect that the data is mostly invalid, validation checks can be performed on a fraction of the data by sub-sampling. We will go through a Spark job that does just that, as sketched below. It's part of my open source project chombo. A Hadoop based implementation is also available.
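
A minimal sketch of the idea for the Spark shell, not the chombo job itself: sample a small fraction, run a handful of cheap validators, and abort if the invalid fraction is too high. The column names, the 1% sample fraction, and the 5% threshold are all placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

def sanityCheck(spark: SparkSession, path: String): Unit = {
  // Validate on a 1% sub-sample instead of the full data set.
  val sample = spark.read.option("header", "true").csv(path).sample(0.01, seed = 7)
  val n = sample.count()
  require(n > 0, "empty sample; increase the sample fraction")

  // A few cheap validators: missing key, non-numeric amount, malformed date.
  val invalid = sample.filter(
    col("userId").isNull ||
    col("amount").cast("double").isNull ||
    to_date(col("txDate"), "yyyy-MM-dd").isNull)

  val badFraction = invalid.count().toDouble / n
  println(f"sampled $n%d records, invalid fraction = $badFraction%.3f")
  require(badFraction < 0.05, "data looks grossly invalid; aborting the pipeline")
}
```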

A solution for more rigorous data validation, with many out-of-the-box data validators,… Continue reading


Alarm Flooding Control with Event Clustering Using Spark Streaming

You show up at work in the morning and open your email to find 100 alarm emails in your inbox, all for the same error from an application running on some server, within a short time window of 1 minute. You are off to a bad start, struggling to find your other emails among the flood. This unpleasant experience motivated me to come up with a solution to stop the deluge of identical alarm emails within a small time window.

When there is a burst of events, it's essentially a cluster in the temporal dimension. If we can identify the clusters in the real-time stream of events, then we can send only one or a few alarms per cluster, instead of one alarm per event. If a cluster extends over a long period, we could send multiple alarms.
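
Here is a much simplified Spark Streaming sketch of that idea, not the ruscello implementation: instead of true temporal clustering it collapses all events sharing an alarm key within a fixed 1-minute window into a single notification. The socket source, the input format, and the choice of (host, errorCode) as the cluster key are invented.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AlarmDedup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("alarmDedup")
    val ssc = new StreamingContext(conf, Seconds(10))

    // One event per line: "host,errorCode,message"
    val events = ssc.socketTextStream("localhost", 9999)
    val keyed = events.map { line =>
      val f = line.split(",", 3)
      ((f(0), f(1)), 1L) // cluster key: (host, errorCode)
    }

    // One count per key per 1-minute window, i.e. one alert per event burst.
    val perWindow = keyed.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(60))
    perWindow.foreachRDD { rdd =>
      rdd.collect().foreach { case ((host, code), count) =>
        // A real job would send a single email here instead of printing.
        println(s"ALERT $host/$code: $count occurrences in the last minute")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```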

I have implemented the solution in Spark Streaming; it's available in my OSS project ruscello on github. Continue reading


Big Data System Design with Bayesian Optimization

Designing a complex Big Data system with a myriad of parameters and design choices is a daunting task. It's almost a black art. Typically we stay with the default parameter settings, unless they fail to meet our requirements and force us to venture out of the comfort zone of the defaults. Essentially, what we are dealing with is a complex optimization problem with no closed form solution. We have to perform a search in a multi-dimensional parameter space, where the number of parameter value combinations may run into hundreds of thousands, if not millions.

With limited time and resources, the brute force approach of running tests for all the configuration value combinations is not a viable option. Clearly, we have to do a guided search through the parameter space, so that we can arrive at the desired parameter values with a limited number of tests. In this post we will discuss an optimization technique called Bayesian optimization, which is popular for solving… Continue reading
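
To make the idea of a guided search concrete, here is a toy sequential model-based search in plain Scala. It is not Bayesian optimization proper: a nearest-neighbor estimate plus a distance-based exploration bonus stands in for the Gaussian process surrogate and acquisition function that a real library would supply, and the benchmark being tuned is a made-up two-parameter cost function.

```scala
import scala.util.Random

object GuidedSearch {
  val rng = new Random(7)

  // Stand-in for an expensive benchmark run of the system under config x.
  def runBenchmark(x: Array[Double]): Double =
    math.pow(x(0) - 0.3, 2) + math.pow(x(1) - 0.7, 2) + rng.nextGaussian() * 0.01

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

  def main(args: Array[String]): Unit = {
    // A handful of random configurations seeds the surrogate.
    var observed = Vector.fill(5) {
      val x = Array.fill(2)(rng.nextDouble()); (x, runBenchmark(x))
    }
    for (_ <- 1 to 20) {
      // Score many cheap candidate configs; benchmark only the most promising.
      val candidates = Vector.fill(500)(Array.fill(2)(rng.nextDouble()))
      val next = candidates.minBy { c =>
        val (nearestX, y) = observed.minBy { case (x, _) => dist(x, c) }
        y - 2.0 * dist(nearestX, c) // estimated cost minus exploration bonus
      }
      observed = observed :+ ((next, runBenchmark(next)))
    }
    val (best, cost) = observed.minBy(_._2)
    println(f"best config: (${best(0)}%.2f, ${best(1)}%.2f) with cost $cost%.3f")
  }
}
```

The payoff is 25 benchmark runs instead of hundreds; a Gaussian process surrogate sharpens the choice of the next test, but the loop structure stays the same.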
