Stream processing systems are characterized by at-least-once, at-most-once, and exactly-once processing semantics. These guarantees should be considered carefully from the point of view of the consistency and durability of a stream processing application. However, if a stream processing product claims to guarantee exactly-once processing semantics, you should read the fine print carefully.
The inconvenient truth is that a stream processing product cannot unilaterally guarantee exactly-once processing semantics. The guarantee holds only under certain assumptions, or when the application and the stream processing framework collaborate in specific ways.
From a system architecture point of view, a stream processing framework can only implement Continue reading
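One common way the application collaborates with an at-least-once framework to get effectively exactly-once results is idempotent processing with deduplication. The sketch below is purely illustrative; all names are hypothetical and not from any specific stream processing product, and a real consumer would keep the seen-id set in durable storage.

```python
# Minimal sketch of application-side deduplication: redelivered messages
# from an at-least-once framework are detected by message id and skipped,
# so the side effect is applied exactly once. Illustrative names only.

class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()   # in production this would be durable storage
        self.total = 0

    def process(self, msg_id, amount):
        # skip duplicates caused by at-least-once redelivery
        if msg_id in self.seen_ids:
            return False
        self.seen_ids.add(msg_id)
        self.total += amount
        return True

consumer = IdempotentConsumer()
consumer.process("m1", 10)
consumer.process("m2", 5)
consumer.process("m1", 10)   # redelivery; ignored
print(consumer.total)        # → 15
```

The framework guarantees delivery; the application guarantees that repeated delivery has the effect of a single delivery. That division of labor is the "collaboration" referred to above.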
Support Vector Machine (SVM) is unique among supervised machine learning algorithms in that it focuses on the training data points lying along the separating hyperplane, the so-called support vectors. In this post, I will go over the details of how I have used the SVM implementation from the excellent Python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.
Along the way we will also explore the interplay between model complexity, training data size and generalization error rate to gain deeper insight into learning problems.
The Python implementation is available in my open source project avenir on GitHub. It provides a convenient abstraction over the SVM implementation in scikit-learn. It can handle Continue reading
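A minimal sketch of the underlying scikit-learn usage, on a made-up, linearly separable one-feature "churn" dataset (the feature and labels are invented for illustration; the real post uses telecom customer data through the avenir wrapper):

```python
# Hedged sketch: a linear SVM on toy churn data. The single feature could
# stand in for something like the number of service calls (hypothetical).
from sklearn.svm import SVC

X = [[0.0], [1.0], [2.0], [5.0], [6.0], [7.0]]
y = [0, 0, 0, 1, 1, 1]        # 1 = churned

model = SVC(kernel="linear")
model.fit(X, y)

# classify a low and a high value of the feature
print(model.predict([[1.5], [6.5]]))   # → [0 1]
```

With a linear kernel the decision boundary falls between the two clusters, and only the boundary-adjacent points (the support vectors) determine it.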
How does a neural network, or for that matter any machine learning model, relate to Big Data? Do we get a better-quality model with bigger data? That's what we will explore in this post. We will explore sample complexity, i.e., how model performance varies with training sample size, which is particularly interesting from a Big Data point of view. We will also look at model complexity, i.e., how model performance varies with the complexity of the model itself.
Although I have used a multi-layer neural network for my experiments, the findings should Continue reading
Posted in Big Data, Machine Learning, Optimization, Predictive Analytic, Uncategorized
Tagged bias, fortunately, generalization error, model complexity, neural network, sample complexity, variance, VC dimension
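The shape of a sample-complexity experiment can be sketched with a deliberately simple classifier; the post's actual experiments use a multi-layer neural network, but the idea of measuring test error at increasing training set sizes is the same. Everything below (the synthetic two-Gaussian data, the nearest-mean classifier) is an illustrative assumption, not the post's setup.

```python
# Hedged sketch of a learning-curve experiment: test error of a nearest-mean
# classifier on synthetic 1-D data, measured at growing training sizes.
import random

random.seed(42)

def make_data(n):
    # two overlapping Gaussian classes, n points each
    data = [(random.gauss(0.0, 1.0), 0) for _ in range(n)]
    data += [(random.gauss(2.0, 1.0), 1) for _ in range(n)]
    return data

def nearest_mean_error(train, test_set):
    # fit: class means; predict: nearer mean; return test error rate
    m0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    wrong = 0
    for x, y in test_set:
        pred = 1 if abs(x - m1) < abs(x - m0) else 0
        wrong += pred != y
    return wrong / len(test_set)

test_set = make_data(500)
errors = []
for n in (5, 50, 500):
    err = nearest_mean_error(make_data(n), test_set)
    errors.append(err)
    print(n, round(err, 3))
```

The test error settles toward the irreducible (Bayes) error as the training size grows; more data stops helping once the estimation error is small relative to the model's intrinsic limitations, which is exactly the Big Data question the post examines.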
Customer lifetime value is the monetary value a business associates with its relationship with a customer, although there have been attempts to also include non-monetary value. It's an important metric for any marketing initiative, e.g., customer retention. The metric is also useful when high-value customers are to be given preferential treatment during various interactions, e.g., customer service.
In this post we will cover a Hadoop-based solution for customer lifetime value. The solution starts with customer transaction history and computes a customer lifetime value score using multiple MapReduce jobs. The solution is part of the open source project visitante. Continue reading
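At its simplest, a lifetime value score from transaction history multiplies average order value by purchase frequency and expected lifespan. The sketch below shows that one common simplification on toy data; the visitante solution computes its score at scale with multiple MapReduce jobs, and the constants and records here are made up for illustration.

```python
# Hedged sketch of a simple CLV score: average order value x purchases per
# year x expected lifespan in years. All numbers are illustrative.
from collections import defaultdict

# (customer_id, amount) transaction records, invented for the example
transactions = [
    ("c1", 40.0), ("c1", 60.0), ("c1", 50.0),
    ("c2", 20.0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for cust, amount in transactions:
    totals[cust] += amount
    counts[cust] += 1

EXPECTED_LIFESPAN_YEARS = 3      # assumed constant for the sketch
PURCHASES_PER_YEAR = 4           # assumed purchase frequency

scores = {}
for cust in sorted(totals):
    avg_order = totals[cust] / counts[cust]
    scores[cust] = avg_order * PURCHASES_PER_YEAR * EXPECTED_LIFESPAN_YEARS
    print(cust, scores[cust])
```

In a MapReduce setting the per-customer aggregation above is the natural reduce step, with transactions keyed by customer id in the map step.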
Analyzing vast amounts of machine-generated unstructured or semi-structured data is Hadoop's forte. Many of us have gone through the exercise of searching log files, most likely with grep, for some pattern, and then looking at the surrounding log lines as context to better understand some incident or event reported in the log.
This effort may be necessary when troubleshooting a problem or when mining log data for insight about some significant event. The Hadoop-based solution presented in this post can be thought of as grep on steroids.
When dealing with large volumes of log data, manual searching is not viable. One of my earlier posts outlined a Hadoop-based solution for this problem. In this post, the focus will be on the implementation, which is part of my open source project visitante. Continue reading
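The single-machine analogue of the job described here is grep with a context window, along the lines of `grep -C`. This sketch is only a toy stand-in for the Hadoop implementation; the log lines and function name are invented for illustration.

```python
# Hedged sketch of "grep with context": return each matching line together
# with a window of surrounding lines, like grep -C on a single machine.
import re

def grep_context(lines, pattern, window=1):
    rx = re.compile(pattern)
    hits = []
    for i, line in enumerate(lines):
        if rx.search(line):
            lo, hi = max(0, i - window), min(len(lines), i + window + 1)
            hits.append(lines[lo:hi])
    return hits

log = [
    "10:00 INFO starting",
    "10:01 WARN slow response",
    "10:02 ERROR timeout on request",
    "10:03 INFO retrying",
]
for ctx in grep_context(log, r"ERROR"):
    print(ctx)
```

The Hadoop version has to reassemble the context window across splits, since the lines surrounding a match may land in a different mapper's input; that is the interesting part the post covers.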
Association mining solves many real life problems, e.g., finding items frequently bought together or songs frequently listened to in one session. Apriori is a popular algorithm for mining frequent item sets. In this post, we will go over a Hadoop-based implementation of Apriori available in my open source project avenir. Frequent item set mining results can also be used for collaborative filtering based recommendation and rule mining.
We will use retail sales data with a twist. Our interest will be Continue reading
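The core Apriori loop, generating (k+1)-item candidates from surviving k-item sets and pruning by support count, can be sketched in a few lines. This is a minimal in-memory version on invented baskets, not the avenir implementation, which runs the same counting and candidate-generation steps as Hadoop MapReduce jobs.

```python
# Hedged sketch of Apriori: grow candidate item sets one item at a time,
# keeping only those whose support count meets the threshold.
from itertools import combinations

def apriori(baskets, min_support):
    items = sorted({i for b in baskets for i in b})
    freq = {}
    current = [frozenset([i]) for i in items]   # frequent 1-item candidates
    k = 1
    while current:
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        freq.update(survivors)
        # build (k+1)-item candidates by joining surviving k-item sets
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return freq

baskets = [frozenset(b) for b in
           [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]]
result = apriori(baskets, min_support=2)
print(sorted(tuple(sorted(s)) for s in result))
```

The key property exploited is anti-monotonicity: no item set can be frequent unless its subsets are, which is what lets each pass prune the candidate space before the next counting pass.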
This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant ETL functionality to my OSS project chombo on GitHub, including validation, transformation and profiling.
This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using retail sales data for a fictitious multinational retailer to showcase Continue reading
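A common shape for field-level transformation is a per-column chain of transformer functions. The sketch below is loosely in that spirit; the function names, column mapping, and sample row are all invented for illustration and do not reflect chombo's actual configuration format.

```python
# Hedged sketch of field-level ETL transformation: each column index maps
# to an ordered chain of transformer functions. Illustrative names only.
def trim(v): return v.strip()
def upper(v): return v.upper()
def to_cents(v): return str(int(round(float(v) * 100)))

# column index -> ordered list of transformers (hypothetical config)
transforms = {0: [trim], 1: [trim, upper], 2: [trim, to_cents]}

def transform_row(row):
    out = []
    for i, value in enumerate(row):
        for fn in transforms.get(i, []):
            value = fn(value)
        out.append(value)
    return out

row = [" 1001 ", " us ", " 12.50 "]
print(transform_row(row))   # → ['1001', 'US', '1250']
```

Because each row is transformed independently, this maps naturally onto a map-only Hadoop job, with the transformer chain driven by configuration rather than code.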