Support Vector Machine (SVM) is unique among supervised machine learning algorithms in that it focuses on the training data points lying closest to the separating hyperplane, the so-called support vectors. In this post, I will go over the details of how I have used the SVM implementation from the excellent Python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.
Along the way we will also explore the interplay between model complexity, training data size and generalization error to gain deeper insight into learning problems.
The Python implementation is available in my open source project avenir on GitHub. It provides a nice abstraction over the SVM implementation in scikit-learn. It can handle Continue reading
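As a flavor of the approach, here is a minimal, self-contained sketch of churn prediction with scikit-learn's SVM. The data is a synthetic stand-in, not the actual telecom data set, and the parameter choices are illustrative only:

```python
# Minimal sketch: SVM churn classifier with scikit-learn.
# Synthetic data stands in for real customer features (usage, tenure, ...).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# feature scaling matters for SVM, since it works with distances
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(scaler.transform(X_train), y_train)

# the fitted model keeps only the boundary points as support vectors
accuracy = model.score(scaler.transform(X_test), y_test)
```

Note that only the support vectors, the points along the class boundary, end up defining the decision function.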
How does a neural network, or for that matter any machine learning model, relate to Big Data? Do we get a better quality learning model with bigger data? That's what we will explore in this post. We will look at sample complexity, i.e., the way model performance varies with training sample size, which is particularly interesting from a Big Data point of view. We will also look at model complexity, which tells us how model performance varies with the complexity of the model itself.
Although I have used a multi-layer neural network for my experiments, the findings should Continue reading
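The sample complexity question can be probed directly with a learning curve: train on increasing fractions of the data and watch how training and validation scores move. A small hedged sketch with scikit-learn, on synthetic data with an illustrative small network:

```python
# Sketch: learning curve, i.e. model performance vs. training sample size.
# The data set and network size are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# fit the same model on 20%, ..., 100% of the training folds
sizes, train_scores, val_scores = learning_curve(
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    X, y, train_sizes=np.linspace(0.2, 1.0, 4), cv=3)

# a shrinking gap between the two mean curves as sizes grow indicates
# that more data is reducing the variance part of the error
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
```

Plotting `train_mean` and `val_mean` against `sizes` gives the familiar learning-curve picture: when the curves have converged, additional data buys little.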
Customer lifetime value is the monetary value a business associates with its relationship with a customer, although there have been attempts to include non-monetary value as well. It's an important metric for any marketing initiative, e.g., customer retention. The metric is also useful when high-value customers are to be given preferential treatment during various interactions, e.g., customer service.
In this post we will cover a Hadoop based solution for customer lifetime value. The solution starts with customer transaction history and computes a customer lifetime value score using multiple map reduce jobs. The solution is part of my open source project visitante. Continue reading
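The core computation can be sketched on a single machine before scaling it out. This is a simplified stand-in, not the actual visitante pipeline: the transaction records and the recency-discounted scoring formula below are illustrative assumptions.

```python
# Sketch: lifetime value score from transaction history.
# Records and the scoring formula are illustrative, not visitante's.
from collections import defaultdict
from datetime import date

# (customer_id, transaction_date, amount) records
transactions = [
    ("c1", date(2014, 1, 5), 120.0),
    ("c1", date(2014, 3, 2), 80.0),
    ("c2", date(2014, 2, 1), 30.0),
    ("c2", date(2014, 2, 20), 45.0),
]

# first pass: aggregate per customer (total spend, last purchase date)
stats = defaultdict(lambda: {"total": 0.0, "last": date.min})
for cust, dt, amount in transactions:
    stats[cust]["total"] += amount
    stats[cust]["last"] = max(stats[cust]["last"], dt)

today = date(2014, 4, 1)

def score(s):
    # illustrative score: total monetary value discounted by recency
    recency_days = (today - s["last"]).days
    return s["total"] / (1.0 + recency_days / 30.0)

clv = {c: round(score(s), 1) for c, s in stats.items()}
```

In the Hadoop version the per-customer aggregation is the natural reducer-side step, with the customer id as the key.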
Analyzing vast amounts of machine generated, unstructured or semi-structured data is Hadoop's forte. Many of us have gone through the exercise of searching log files, most likely with grep, for some pattern and then looking at the surrounding log lines as context to better understand some incident or event reported in the log.
This effort may be necessary when troubleshooting a problem or when seeking insight from log data about some significant event. The Hadoop based solution presented in this post can be thought of as grep on steroids.
When dealing with large volumes of log data, the manual way of searching is not viable. One of my earlier posts outlined the Hadoop based solution for this problem. In this post, the focus will be on the implementation, which is part of my open source project visitante. Continue reading
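The essence of the idea, matching a pattern and retaining surrounding lines as context, can be sketched in a few lines of Python. This is a single-machine illustration of the concept only; the Hadoop job does the same thing at scale, and the log lines here are made up:

```python
# Sketch: grep with context lines, like grep -B/-A.
# The log content is a made-up example.
import re
from collections import deque

def grep_context(lines, pattern, before=2, after=2):
    regex = re.compile(pattern)
    results = []
    buf = deque(maxlen=before)   # rolling window of preceding lines
    pending = 0                  # trailing context lines still owed
    for line in lines:
        if regex.search(line):
            results.extend(buf)  # flush preceding context
            buf.clear()
            results.append(line)
            pending = after
        elif pending > 0:
            results.append(line)
            pending -= 1
        else:
            buf.append(line)
    return results

log = ["boot ok", "disk check", "ERROR: disk fail",
       "retrying", "recovered", "idle"]
matches = grep_context(log, r"ERROR", before=1, after=2)
```

In a MapReduce setting the trick is preserving line order within a file split so the context windows come out right; on one machine the rolling buffer above is all that is needed.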
Association mining solves many real life problems, e.g., finding items frequently bought together or songs frequently listened to in one session. Apriori is a popular algorithm for mining frequent item sets. In this post, we will go over a Hadoop based implementation of Apriori available in my open source project avenir. Frequent item set mining results can also be used for collaborative filtering based recommendation and rule mining.
We will use retail sales data with a twist. Our interest will be Continue reading
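Apriori's level-wise idea is worth seeing in miniature: candidate k-item sets are built from the frequent (k-1)-item sets and pruned by minimum support. The tiny in-memory sketch below uses made-up transactions; the Hadoop version runs one pass over the data per level.

```python
# Sketch: level-wise Apriori on in-memory transactions.
# Transactions and the support threshold are made-up examples.

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 3  # absolute count

def frequent_itemsets(transactions, min_support):
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}  # size-1 candidates
    frequent = {}
    k = 1
    while level:
        # count support of each candidate with one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # join step: size-k candidates from surviving size-(k-1) sets
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == k}
    return frequent

result = frequent_itemsets(transactions, min_support)
```

The pruning is what makes Apriori tractable: any superset of an infrequent item set is skipped automatically, since candidates are only generated from survivors.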
This is a sequel to my earlier posts on Hadoop based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant ETL functionality to my open source project chombo on GitHub, including validation, transformation and profiling.
This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using the retail sales data for a fictitious multinational retailer to showcase Continue reading
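The shape of field-level transformation can be sketched as a chain of transformers applied per column, record by record. The transformers, column layout and sample record below are illustrative assumptions, not chombo's actual configuration format:

```python
# Sketch: per-field transformer chains over a delimited record.
# Transformers and the sample record are illustrative only.

def trim(v):
    return v.strip()

def upper(v):
    return v.upper()

def to_cents(v):
    # normalize a decimal amount to integer cents
    return str(int(round(float(v) * 100)))

# transformer chain per column index
transformers = {0: [trim], 1: [trim, upper], 2: [trim, to_cents]}

def transform(record, delim=","):
    fields = record.split(delim)
    for idx, chain in transformers.items():
        for fn in chain:
            fields[idx] = fn(fields[idx])
    return delim.join(fields)

cleaned = transform(" C1023 , new york ,19.99")
```

The same pattern maps cleanly onto a mapper-only Hadoop job: each record is transformed independently, so no shuffle is needed.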
Time sequence data, which is all around us, may contain seasonal components. Data is seasonal when it has a recurring component tied to a time cycle, e.g., month of the year, day of the week or hour of the weekday. A seasonal component is defined by a time range and a period.
My open source project chombo has solutions for seasonality analysis. The solution is twofold. First, there is a map reduce job to detect seasonality in data. Second, there is another map reduce job to calculate statistics for the seasonal components. In this post we will go through the steps for analyzing operational data with seasonal Continue reading
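The second step, per-component statistics, amounts to grouping readings by their seasonal cycle index and computing summary statistics per group. A single-machine sketch, with hour of day as the cycle and made-up readings; the MapReduce job keys records the same way:

```python
# Sketch: statistics per seasonal component (hour of day).
# The operational readings are made-up examples.
import math
from collections import defaultdict

# (hour_of_day, value) readings
readings = [(9, 40.0), (9, 44.0), (14, 70.0), (14, 74.0), (14, 72.0)]

# group by the seasonal cycle index; in MapReduce this is the key
groups = defaultdict(list)
for hour, value in readings:
    groups[hour].append(value)

# per-group mean and (population) standard deviation
stats = {}
for hour, values in groups.items():
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    stats[hour] = (mean, math.sqrt(var))
```

A large spread between the per-hour means relative to the within-hour standard deviations is the signal that an hour-of-day seasonal component is present.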