Broad-brush statements like this about Machine Learning algorithms are made often, and a lot of online content alludes to this simplistic view of Machine Learning. It’s tempting to gravitate towards simplistic views and to use a recipe-like approach when solving Machine Learning problems.
Unfortunately, the reality of Machine Learning is complex. The choice of learning algorithms and the algorithm tuning process can be confounding. Choosing the right learning algorithm for a given problem is not a trivial task.
In this post we will delve into computational learning theory and use it to debunk the myth of the so-called Top Ten Machine Learning algorithms. Hopefully, after reading this post, you will be more cautious about blindly using one of the algorithms from the so-called top 10 list. In this post, I will keep the discussion Continue reading
If there is one data format that’s ubiquitous, it’s JSON. Whether you are calling an API or exporting data from some system, the format is most likely to be JSON these days. However, many databases cannot handle JSON, and you may want to store the data in relational format. You may also want to do offline batch processing with Spark or Hadoop, and your Hadoop or Spark application may only support flat, field-delimited, record-oriented input.
Whatever the case, a tool is needed to map JSON to flat record-oriented data. In this post we will go over a Spark-based solution for converting JSON to flat relational data. The implementation is part of my open source project chombo on GitHub. A Hadoop Continue reading
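To make the target format concrete, here is a minimal Python sketch of flattening a nested JSON document into delimited records. It is only an illustration under my own assumptions, not the chombo implementation: the function names (`flatten`, `to_records`, `to_delimited`), the dotted column naming, and the sample order document are all hypothetical.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_records(doc, list_field):
    """Emit one flat record per element of a child list, repeating the parent fields."""
    parent = flatten({k: v for k, v in doc.items() if k != list_field})
    return [{**parent, **flatten(child, list_field)} for child in doc.get(list_field, [])]


def to_delimited(record, columns, delim=","):
    """Render a flat record as one delimited line; missing columns become empty strings."""
    return delim.join(str(record.get(col, "")) for col in columns)


# Hypothetical input: an order with a nested customer and a list of line items.
doc = json.loads(
    '{"orderId": "A1", "customer": {"id": 7, "zip": "94105"}, '
    '"items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]}'
)
columns = ["orderId", "customer.id", "customer.zip", "items.sku", "items.qty"]
lines = [to_delimited(rec, columns) for rec in to_records(doc, "items")]
for line in lines:
    print(line)
```

Repeating the parent fields on every child row is the usual denormalized layout for record-oriented batch input. A relational target could just as well split parent and child into two tables joined on `orderId`.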
Although the goal of most predictive analytics problems is to make predictions, sometimes we are more interested in the model learnt by the learning algorithm. If the learnt model could be expressed as a set of rules, then those rules could be a source of valuable insight into the data. For most Machine Learning algorithms, the learnt model is a black box and cannot be interpreted in an intuitive way. However, Decision Tree is an exception. A Decision Tree result is essentially a set of rules. There are a few other rule extraction algorithms.
In this post I will cover solutions for mining simple rules based on one attribute only. It’s a quick and dirty way of gaining insight into data. The solution is part of my open source project avenir. The implementation Continue reading
In a supply chain, quantities ordered from a downstream supplier or manufacturer are not always completely fulfilled, because of various factors. If the extent of under-fulfillment could be predicted over a time horizon, then the shortfall items could be ordered from another fallback supplier. In this post we will go through a prediction model over a time horizon based on a Continuous Time Markov Chain (CTMC) and how it can be used to solve the supply chain problem.
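As a rough illustration of prediction over a time horizon with a Continuous Time Markov Chain, the sketch below computes the transition probability matrix P(t) = exp(Qt) from a rate matrix Q in plain Python. Everything specific here is an assumption made for the example: the three fulfillment states, the rate values, and the 7 day horizon are made up, and the matrix exponential is a simple Taylor series with scaling and squaring, not the approach used by the Spark jobs in avenir.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Return P(t) = exp(Q t) using a Taylor series on Q t / 2^s, then s squarings."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds a^k / k! after this step
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical states and rates per day; each row of Q sums to zero.
states = ["full", "partial", "poor"]
rate_matrix = [
    [-0.20, 0.15, 0.05],
    [0.30, -0.50, 0.20],
    [0.10, 0.40, -0.50],
]

horizon_days = 7.0
p = transition_probabilities(rate_matrix, horizon_days)
for name, row in zip(states, p):
    print(name, [round(x, 3) for x in row])
```

Row i of the result gives the probability of each fulfillment state after the horizon, given that the supplier is in state i now. The probability mass on the under-fulfilled states is what would drive an order to a fallback supplier.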
The Spark implementation for CTMC is available in my open source project avenir. It involves two Spark jobs. The first job Continue reading
Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data that is mostly just grossly invalid. It might save a lot of pain and headache if we could do some simple sanity checks before feeding the data into a complex processing pipeline. If you suspect that the data is mostly invalid, validation checks could be performed on a fraction of the data by subsampling. We will go through a Spark job that does what I just described. It’s part of my open source project chombo. A Hadoop-based implementation is also available.
A solution for more rigorous data validation with many out-of-the-box data validators Continue reading
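The idea of a cheap sanity check on a subsample can be sketched in a few lines of plain Python. This is only an illustration, not the chombo Spark job: the two checks (field count and numeric columns), the function names, and the sample data are assumptions chosen for the example.

```python
import random


def is_valid(line, num_fields, numeric_cols, delim=","):
    """Basic sanity check: expected field count and parseable numeric columns."""
    fields = line.split(delim)
    if len(fields) != num_fields:
        return False
    for idx in numeric_cols:
        try:
            float(fields[idx])
        except ValueError:
            return False
    return True


def sampled_invalid_fraction(lines, sample_fraction, num_fields, numeric_cols, seed=42):
    """Validate a random subsample and return the fraction of sampled records that fail."""
    rng = random.Random(seed)
    sampled = [ln for ln in lines if rng.random() < sample_fraction]
    if not sampled:
        return 0.0
    invalid = sum(1 for ln in sampled if not is_valid(ln, num_fields, numeric_cols))
    return invalid / len(sampled)


# Hypothetical records: id, quantity, price. Two of the four are malformed.
records = ["a1,3,9.99", "a2,x,4.50", "a3,1", "a4,7,2.25"]
fraction = sampled_invalid_fraction(records, sample_fraction=1.0, num_fields=3, numeric_cols=[1, 2])
print(f"invalid fraction in sample: {fraction:.2f}")
```

In a real pipeline the sample fraction would be small, and the job would stop before the expensive stages when the invalid fraction crosses a threshold you choose.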
You show up at work in the morning and open your email to find 100 alarm emails in your inbox for the same error from an application running on some server, all within a short time window of 1 minute. You are off to a bad start, struggling to find other emails. I was motivated by this unpleasant experience to come up with a solution to stop the deluge of the same alarm emails in a small time window.
When there is a burst of events, it’s essentially a cluster on the temporal dimension. If we can identify the clusters from the real-time stream of events, then we can send only one or a few alarms per cluster, instead of one alarm per event. If the cluster extends over a long period, we could send multiple alarms.
I have implemented the solution in Spark Streaming and it’s available in my OSS project ruscello on GitHub. Continue reading
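Here is a minimal Python sketch of that idea, assuming a simple gap-based definition of a temporal cluster: an event joins the current cluster if it arrives within `max_gap` of the previous event, and a long-running cluster triggers a repeat alarm every `realarm_interval`. The function name, both parameters, and the clustering rule are assumptions for illustration, and the sketch works on a batch of timestamps. It is not the Spark Streaming implementation in ruscello.

```python
def alarms_to_send(timestamps, max_gap, realarm_interval):
    """Return the event timestamps at which an alarm should actually be sent."""
    alarms = []
    last_event = None
    last_alarm = None
    for ts in sorted(timestamps):
        # A gap larger than max_gap since the previous event starts a new cluster.
        new_cluster = last_event is None or ts - last_event > max_gap
        # Within a long cluster, repeat the alarm once realarm_interval has passed.
        if new_cluster or ts - last_alarm >= realarm_interval:
            alarms.append(ts)
            last_alarm = ts
        last_event = ts
    return alarms


# Hypothetical event times in seconds: one short burst, then another one later.
events = [0, 1, 2, 3, 100, 101]
print(alarms_to_send(events, max_gap=10, realarm_interval=60))
```

With these inputs, six events collapse into two alarms, one per burst. A sustained burst that keeps going past `realarm_interval` produces periodic reminders instead of one alarm per event.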
Designing a complex Big Data system with a myriad of parameters and design choices is a daunting task. It’s almost a black art. Typically we stay with the default parameter settings, unless they fail to meet our requirements, which forces us to venture out of the comfort zone of default settings. Essentially, what we are dealing with is a complex optimization problem with no closed-form solution. We have to perform a search in a multi-dimensional parameter space, where the number of parameter value combinations may run into hundreds of thousands, if not millions.
With limited time and resources, the brute force approach of running tests for all the configuration value combinations is not a viable option. It’s clear that we have to do a guided search through the parameter space, so that we can arrive at the desired parameter values with a limited number of tests. In this post we will discuss an optimization technique called Bayesian optimization, which is popular for solving Continue reading