Exactly Once Stream Processing Semantics? Not Exactly

Stream processing systems are characterized by at-least-once, at-most-once, and exactly-once processing semantics. These are important characteristics that should be carefully considered from the point of view of the consistency and durability of a stream processing application. However, if a stream processing product claims to guarantee exactly-once processing semantics, you should read the fine print carefully.

The inconvenient truth is that a stream processing product cannot unilaterally guarantee exactly-once processing semantics. The guarantee holds only under certain assumptions, or when the application and the stream processing framework collaborate in certain ways.

From a system architecture point of view, a stream processing framework can only implement Continue reading
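One common form of such collaboration: the framework delivers messages at least once, and the application makes its processing idempotent by deduplicating redeliveries. The sketch below illustrates the idea with made-up names; it is not the API of any particular framework.

```python
# Sketch: effectively exactly-once processing built on at-least-once
# delivery. The framework may redeliver a message after a failure; the
# application deduplicates by message id so each state update is applied
# exactly once. All names here are illustrative.

class DedupingProcessor:
    def __init__(self):
        self.seen_ids = set()   # in practice: a durable store
        self.total = 0          # application state being updated

    def process(self, msg_id, value):
        if msg_id in self.seen_ids:
            return False        # duplicate redelivery: skip
        self.seen_ids.add(msg_id)
        self.total += value     # id check and update must commit atomically
        return True

proc = DedupingProcessor()
# at-least-once delivery: message "m2" arrives twice
for msg_id, value in [("m1", 10), ("m2", 5), ("m2", 5), ("m3", 7)]:
    proc.process(msg_id, value)
# proc.total is 22, not 27
```

Note the caveat in the comment: the id check and the state update must be committed atomically, which is exactly the kind of assumption the fine print tends to hide.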

Posted in Big Data, Real Time Processing, Spark Streaming, Storm, stream processing | 1 Comment

Customer Churn Prediction with SVM using Scikit-Learn

Support Vector Machine (SVM) is unique among supervised machine learning algorithms in that it focuses on the training data points that lie along the separating hyperplanes. In this post, I will go over the details of how I have used the SVM from the excellent Python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.

Along the way, we will also explore the interplay between model complexity, training data size, and generalization error rate to gain deeper insight into learning problems.

The Python implementation is available in my open source project avenir on GitHub. The implementation provides a nice abstraction over the SVM implementation in scikit-learn. It can handle Continue reading
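For a flavor of what the scikit-learn side looks like, here is a minimal sketch of training an SVM churn classifier. The synthetic data and parameter values are stand-ins; the actual feature set and abstraction layer live in the avenir project.

```python
# Hedged sketch: an SVM churn classifier with scikit-learn.
# Synthetic data stands in for real customer features
# (usage, plan type, complaints, ...).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)      # SVMs are scale-sensitive
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(scaler.transform(X_train), y_train)

accuracy = model.score(scaler.transform(X_test), y_test)
```

The RBF kernel and the `C` penalty are the two main knobs controlling model complexity here, which ties directly into the complexity-versus-training-size discussion above.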

Posted in Machine Learning, Predictive Analytic, Python | Leave a comment

Is a Neural Network Better Off with Big Data?

How does a neural network, or for that matter any machine learning model, relate to Big Data? Do we get a better quality learning model with bigger data? That's what we will explore in this post. We will explore sample complexity, i.e., the way model performance varies with training sample size, which is particularly interesting from a Big Data point of view. We will also look at model complexity, which tells us how model performance varies with the size and structure of the model.

Although I have used a multi-layer neural network for my experiments, the findings should Continue reading
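A sample-complexity experiment of this kind can be sketched in a few lines: train the same small network on increasing training set sizes and record the test error at each size. The data, sizes, and network shape below are illustrative, not the ones from the original experiments.

```python
# Sketch of a sample-complexity experiment: generalization error of a
# small neural network as the training set grows. All sizes and
# hyperparameters here are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20,
                           n_informative=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=7)

errors = {}
for n in (100, 500, 2000):                 # increasing training sizes
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                        random_state=7)
    clf.fit(X_train[:n], y_train[:n])
    errors[n] = 1.0 - clf.score(X_test, y_test)   # generalization error
```

Plotting `errors` against `n` gives a learning curve; the typical finding is that the error drops quickly at first and then flattens, which frames the Big Data question above: more data helps only until the curve plateaus for a given model complexity.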

Posted in Big Data, Machine Learning, Optimization, Predictive Analytic, Uncategorized | 3 Comments

Customer Lifetime Value, Present and Future

Customer lifetime value for a business is the monetary value associated with the relationship with a customer, although there have been attempts to include non-monetary value as well. It's an important metric to have for any marketing initiative, e.g., customer retention. The metric is also useful if preferential treatment is to be given to high value customers during various interactions, e.g., customer service.

In this post we will cover a Hadoop based solution for customer lifetime value. The solution starts with customer transaction history and computes a customer lifetime value score using multiple map reduce jobs. The solution is part of my open source project visitante. Continue reading
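To make the computation concrete, here is a single-customer, in-memory sketch of one common historical CLV formula: average order value times purchase frequency times a projection horizon. This is an illustrative formula, not necessarily the one used in the visitante map reduce jobs, and the transaction data is made up.

```python
# Hedged sketch of a historical CLV score from transaction history.
# Formula (illustrative): avg order value x purchase frequency x horizon.
from datetime import date

# (transaction date, amount) history for one customer -- made-up data
history = [(date(2023, 1, 5), 40.0), (date(2023, 3, 2), 60.0),
           (date(2023, 6, 20), 50.0)]

def clv_score(transactions, horizon_days=365):
    amounts = [amt for _, amt in transactions]
    first, last = transactions[0][0], transactions[-1][0]
    span_days = max((last - first).days, 1)
    avg_order = sum(amounts) / len(amounts)   # average order value
    freq = len(amounts) / span_days           # purchases per day
    return avg_order * freq * horizon_days    # projected one-year value

score = clv_score(history)
```

The Hadoop solution in the post does essentially this per customer, but over the full transaction log and across multiple map reduce stages.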

Posted in Big Data, Hadoop and Map Reduce, Marketing Analytic, Statistics | Leave a comment

Detecting Incidents with Context from Log Data

Analyzing vast amounts of machine generated unstructured or semi-structured data is Hadoop's forte. Many of us have gone through the exercise of searching log files, most likely with grep, for some pattern and then looking at the surrounding log lines as context to get a better understanding of some incident or event reported in the log.

This effort could be necessary as part of troubleshooting some problem, or to gain insight from log data about some significant event. The Hadoop based solution presented in this post can be thought of as grep on steroids.

When dealing with large volumes of log data, the manual way of searching is not viable. One of my earlier posts outlined a Hadoop based solution for this problem. In this post, the focus will be on the implementation, which is part of my open source project visitante. Continue reading
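The core "grep with context" idea can be sketched very compactly: match a pattern and emit a window of surrounding lines for each hit. The Hadoop version in visitante does this at scale; the pattern, window size, and sample log below are just examples.

```python
# Minimal sketch of "grep with context": find lines matching a pattern
# and return each matched line together with a window of surrounding
# lines. Pattern and window sizes here are illustrative.
import re

def grep_with_context(lines, pattern, before=1, after=1):
    rx = re.compile(pattern)
    hits = []
    for i, line in enumerate(lines):
        if rx.search(line):
            lo = max(0, i - before)
            hi = min(len(lines), i + after + 1)
            hits.append(lines[lo:hi])   # matched line plus its context
    return hits

log = ["user login ok", "db timeout", "ERROR request failed",
       "retrying request", "user logout"]
incidents = grep_with_context(log, r"ERROR")
# one hit: ["db timeout", "ERROR request failed", "retrying request"]
```

The distributed version has to solve an extra wrinkle this sketch ignores: context lines may fall on the other side of an input split boundary.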

Posted in Big Data, Hadoop and Map Reduce, Log Analysis, Uncategorized, Web Analytic | Leave a comment

Association Mining with Improved Apriori Algorithm

Association mining solves many real life problems, e.g., finding items frequently bought together or songs frequently listened to in one session. Apriori is a popular algorithm for mining frequent item sets. In this post, we will go over a Hadoop based implementation of Apriori available in my open source project avenir. Frequent item set mining results can also be used for collaborative filtering based recommendation and rule mining.

We will use retail sales data with a twist. Our interest will be Continue reading
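As a toy illustration of the Apriori idea, the sketch below mines frequent item sets from a handful of made-up baskets: candidates of size k+1 are joined from the frequent sets of size k, pruning by minimum support at each level. The real retail data and the Hadoop implementation live in avenir.

```python
# Toy Apriori-style frequent item set mining. The join step builds
# size k+1 candidates only from frequent size-k sets, which is the
# pruning insight behind Apriori. Baskets and thresholds are made up.

def apriori(transactions, min_support=2, max_size=2):
    frequent = {}
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]     # size-1 candidates
    for k in range(1, max_size + 1):
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # join step: size k+1 candidates from frequent size-k sets
        current = list({a | b for a in survivors for b in survivors
                        if len(a | b) == k + 1})
    return frequent

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk", "bread", "eggs"},
            {"bread", "eggs"}, {"milk", "eggs"}]]
freq = apriori(baskets)
# e.g. {"milk", "bread"} appears in 2 baskets
```

The Hadoop version distributes the counting pass of each level as a map reduce job, since that pass is the expensive part on real retail data.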

Posted in Association Mining, Big Data, Data Mining, Hadoop and Map Reduce, Marketing Analytic, Rule Mining | Leave a comment

Transforming Big Data

This is a sequel to my earlier posts on Hadoop based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant ETL functionality to my OSS project chombo on GitHub, including validation, transformation and profiling.

This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using the retail sales data of a fictitious multinational retailer to showcase Continue reading
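In spirit, field-level transformation means mapping each column of a record through a configured transformer. The sketch below shows the idea in plain Python; the field names and transformers are made up and are not chombo's actual configuration format.

```python
# Illustrative sketch of field-level ETL transformation: each column
# is mapped through a configured transformer; unconfigured columns
# pass through unchanged. Field names here are hypothetical.
transformers = {
    "amount": lambda v: round(float(v), 2),   # normalize numeric field
    "country": lambda v: v.strip().upper(),   # canonicalize country code
    "date": lambda v: v.replace("/", "-"),    # unify date separator
}

def transform_row(row):
    return {col: transformers.get(col, lambda v: v)(val)
            for col, val in row.items()}

raw = {"amount": "19.999", "country": " us ", "date": "2024/05/01",
       "sku": "A-100"}
clean = transform_row(raw)
# {"amount": 20.0, "country": "US", "date": "2024-05-01", "sku": "A-100"}
```

In a Hadoop setting the same per-row mapping runs inside the mapper, with the transformer configuration supplied externally rather than hard-coded.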

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | 1 Comment