Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data that is mostly grossly invalid. It can save a lot of pain and headache if we do some simple sanity checks before feeding the data into a complex processing pipeline. If you suspect that the data is mostly invalid, the validation checks can be performed on a fraction of the data by sub-sampling. We will go through a Spark job that does just that. It's part of my open source project chombo. A Hadoop based implementation is also available.
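Below is a minimal PySpark sketch of the idea, not the chombo implementation: sub-sample the input and run a few simple field-level sanity checks on the sample. The input path, column names, and valid ranges are hypothetical.

```python
# Minimal PySpark sketch (assumptions: hypothetical input path and columns id, age, amount)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanityCheck").getOrCreate()

# hypothetical input file
raw = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)

# validate only a 5% sample when the data set is large
sample = raw.sample(withReplacement=False, fraction=0.05, seed=42)

# simple sanity checks: null fields and out-of-range values
invalid = sample.filter(
    sample.age.isNull() | (sample.age < 0) | (sample.age > 120) |
    sample.amount.isNull() | (sample.amount < 0)
)

total = sample.count()
bad = invalid.count()
print("invalid fraction in sample: %.3f" % (bad / float(total)))
```

If the invalid fraction in the sample is high, it is probably not worth running the full pipeline on the data set.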

A solution for more rigorous data validation with many out-of-the-box data validators Continue reading

Posted in ETL, Hadoop and Map Reduce, Spark | Tagged | Leave a comment

Alarm Flooding Control with Event Clustering Using Spark Streaming

You show up at work in the morning and open your email to find 100 alarm emails in your inbox for the same error, from an application running on some server, within a short time window of 1 minute. You are off to a bad start, struggling to find your other emails. I was motivated by this unpleasant experience to come up with a solution to stop the deluge of identical alarm emails within a small time window.

When there is a burst of events, it's essentially a cluster along the temporal dimension. If we can identify the clusters from the real time stream of events, then we can send only one or a few alarms per cluster, instead of one alarm per event. If the cluster extends over a long period, we could send multiple alarms.
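Here is a minimal plain-Python sketch of the temporal clustering idea (the actual ruscello implementation runs on Spark Streaming; the gap threshold here is an assumed parameter). Events closer together than the gap threshold are treated as one cluster, and only one alarm is emitted per cluster.

```python
# Hedged sketch: collapse a burst of events into one alarm per temporal cluster.
def alarm_filter(event_times, max_gap=60):
    """Return the times at which an alarm should be sent.

    Events separated by less than max_gap seconds belong to the same
    cluster; one alarm is emitted when a new cluster starts.
    """
    alarms = []
    last_time = None
    for t in sorted(event_times):
        if last_time is None or t - last_time > max_gap:
            # gap is large enough: a new cluster begins, send one alarm
            alarms.append(t)
        last_time = t
    return alarms

# 100 duplicate errors arriving within a minute collapse to a single alarm
print(alarm_filter([10 + i * 0.5 for i in range(100)]))
```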

I have implemented the solution in Spark Streaming and it's available in my OSS project ruscello on GitHub. Continue reading

Posted in Anomaly Detection, Big Data, Real Time Processing, Spark, stream processing | Tagged , , , | 1 Comment

Big Data System Design with Bayesian Optimization

Designing a complex Big Data system, with its myriad parameters and design choices, is a daunting task. It's almost a black art. Typically we stay with the default parameter settings, unless they fail to meet our requirements, which forces us to venture out of the comfort zone of the defaults. Essentially what we are dealing with is a complex optimization problem with no closed form solution. We have to perform a search in a multi-dimensional parameter space, where the number of parameter value combinations may run into hundreds of thousands if not millions.

With limited time and resources, the brute force approach of running tests for all the configuration value combinations is not a viable option. It's clear that we have to do a guided search through the parameter space, so that we can arrive at the desired parameter values with a limited number of tests. In this post we will discuss an optimization technique called Bayesian optimization, which is popular for solving Continue reading
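For illustration, here is a hedged sketch of a guided search with Bayesian optimization using the scikit-optimize library over two hypothetical Spark configuration parameters. This is not the implementation discussed in the post; the synthetic cost surface stands in for actually running benchmark jobs.

```python
# Illustrative Bayesian optimization sketch (assumptions: scikit-optimize,
# made-up parameter ranges, synthetic objective standing in for a real benchmark).
from skopt import gp_minimize
from skopt.space import Integer

def run_benchmark(params):
    num_partitions, executor_mem_gb = params
    # stand-in for launching a real test job: a synthetic cost surface
    # with a minimum around (400 partitions, 16 GB of executor memory)
    return (num_partitions - 400) ** 2 / 1e4 + (executor_mem_gb - 16) ** 2

search_space = [
    Integer(50, 2000, name="num_partitions"),
    Integer(2, 32, name="executor_mem_gb"),
]

# the Gaussian process surrogate guides the search, so far fewer test
# runs are needed than a brute force grid over the whole space
result = gp_minimize(run_benchmark, search_space, n_calls=30, random_state=0)
print("best configuration:", result.x, "cost:", result.fun)
```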

Posted in Big Data, Cluster Computation, Optimization | Tagged , | Leave a comment

Customer Segmentation Based on Online Behavior using ScikitLearn

Customer segmentation or clustering is useful in various ways. It can be used for targeted marketing. Sometimes when building a predictive model, it's more effective to cluster the data and build a separate predictive model for each cluster. In this post, we will segment customers based on their online behavior on an eCommerce web site.

The focus of this post is on solving a specific problem and interpreting the results, not on giving a broad overview of clustering techniques. We will use the python scikit-learn machine learning library. The python implementation can be found in Continue reading
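As a flavor of the clustering step, here is a minimal scikit-learn sketch. The behavior features and data are hypothetical, not the ones used in the post.

```python
# Minimal clustering sketch with scikit-learn (hypothetical features and data)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# hypothetical per-customer behavior features:
# sessions per month, average session minutes, purchases per month
X = np.array([
    [12, 8.5, 1], [30, 22.0, 4], [3, 2.0, 0],
    [25, 18.0, 3], [14, 9.0, 1], [2, 1.5, 0],
])

# scale the features so no single one dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)
```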

Posted in Data Mining, Machine Learning | Tagged , , , , | 2 Comments

Inventory Forecasting with Markov Chain Monte Carlo

Sometimes you want to calculate statistics about a variable that has a complex, possibly non-linear relationship with another variable whose probability distribution is available but may be non-standard or non-parametric. That's the situation we face when trying to predict and plan inventory in the face of demand with some arbitrary probability distribution. For this problem, the goal is to choose an inventory level, given an arbitrary demand distribution, so that some statistic on earning is maximized. The fact that the demand distribution is arbitrary and non-standard, and that earning has a complex non-linear relation with inventory and demand, crushes any hope for an analytical solution.

One way out of this quagmire is to simulate earning by sampling from the demand distribution and applying the non-linear function to convert inventory and demand to earning. That's the approach Continue reading
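Here is a minimal numpy sketch of the sampling idea just described: draw demand samples from an arbitrary (here, empirical) distribution, push each sample through a non-linear earning function, and pick the inventory level with the best mean earning. The prices, costs, and demand data are hypothetical, and this simple sketch does not include the MCMC machinery discussed in the post.

```python
# Hedged Monte Carlo sketch (hypothetical demand history, prices, and costs)
import numpy as np

rng = np.random.default_rng(42)

# arbitrary empirical demand distribution, e.g. from historical data
historical_demand = np.array([80, 95, 100, 110, 120, 130, 150, 170, 200, 90])

def earning(inventory, demand, price=10.0, unit_cost=6.0, holding_cost=1.0):
    # non-linear earning: revenue from units sold, minus purchase and holding costs
    sold = np.minimum(inventory, demand)
    unsold = inventory - sold
    return price * sold - unit_cost * inventory - holding_cost * unsold

candidate_levels = range(80, 201, 10)
samples = rng.choice(historical_demand, size=10000, replace=True)

best = max(candidate_levels, key=lambda inv: earning(inv, samples).mean())
print("best inventory level:", best)
```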

Posted in Machine Learning, Optimization, Python, Simulation | Tagged , , , | Leave a comment

Exactly Once Stream Processing Semantics? Not Exactly

Stream processing systems are characterized by at least once, at most once, and exactly once processing semantics. These are important characteristics that should be carefully considered from the point of view of consistency and durability of a stream processing application. However, if a stream processing product claims to guarantee exactly once processing semantics, you should carefully read the fine print.

The inconvenient truth is that a stream processing product cannot unilaterally guarantee exactly once processing semantics. It's only true under certain assumptions, or when the application and the stream processing framework collaborate in certain ways.
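One common collaboration pattern, shown as a hedged Python sketch (not necessarily the mechanism the post goes on to describe): the framework guarantees at-least-once delivery, and the application makes its output idempotent by de-duplicating on an event id, so that replays after a failure do not produce duplicate effects.

```python
# Hedged sketch of an idempotent consumer; the event ids and the in-memory
# stores are illustrative (in practice these would be durable stores).
processed_ids = set()
results = {}

def process_event(event):
    if event["id"] in processed_ids:
        return  # replayed event: skip, the effect was already applied
    results[event["id"]] = event["value"] * 2   # some computation
    processed_ids.add(event["id"])

# the same event delivered twice yields exactly one result
process_event({"id": "e1", "value": 21})
process_event({"id": "e1", "value": 21})
print(results)
```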

From a system architecture point of view a stream processing framework can only implement Continue reading

Posted in Big Data, Real Time Processing, Spark Streaming, Storm, stream processing | Tagged , , , | 1 Comment

Customer Churn Prediction with SVM using Scikit-Learn

Support Vector Machine (SVM) is unique among supervised machine learning algorithms in the sense that it focuses on the training data points along the separating hyperplanes. In this post, I will go over the details of how I have used SVM from the excellent python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.
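For orientation, here is a minimal scikit-learn sketch of this kind of churn model. The feature names and data are hypothetical, and this is not the avenir abstraction described in the post.

```python
# Minimal SVM churn sketch with scikit-learn (hypothetical features and data)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# hypothetical features: monthly minutes, support calls, tenure in months
X = np.array([
    [300, 1, 24], [120, 5, 3], [450, 0, 36], [90, 7, 2],
    [500, 1, 48], [150, 6, 5], [400, 2, 30], [110, 4, 4],
])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 1 = churned

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# RBF kernel SVM; C and gamma control model complexity
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```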

Along the way we will also explore the interplay between model complexity, training data size, and generalization error rate to gain deeper insight into learning problems.

The python implementation is available in my open source project avenir on GitHub. The implementation provides a nice abstraction over the SVM implementation of scikit-learn. It can handle Continue reading

Posted in Machine Learning, Predictive Analytic, Python | Tagged , , , , | 2 Comments