Big Data System Design with Bayesian Optimization

Designing a complex Big Data system with a myriad of parameters and design choices is a daunting task. It’s almost a black art. Typically we stay with the default parameter settings unless they fail to meet our requirements, which forces us to venture out of the comfort zone of the defaults. Essentially, what we are dealing with is a complex optimization problem with no closed-form solution. We have to perform a search in a multidimensional parameter space, where the number of parameter value combinations may run into hundreds of thousands, if not millions.
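To make the size of that search space concrete, here is a small arithmetic sketch. The parameter names, the number of candidate values per parameter, and the ten-minute test duration are all hypothetical and chosen only for illustration; they do not come from any particular system.

```python
from math import prod

# Hypothetical tunable parameters and the number of candidate values for each.
candidate_counts = {
    "executor_memory": 8,
    "executor_cores": 6,
    "shuffle_partitions": 10,
    "compression_codec": 4,
    "serializer": 2,
    "batch_size": 10,
    "replication_factor": 3,
}


def grid_size(counts):
    """Number of configurations in an exhaustive grid over all parameters."""
    return prod(counts.values())


def exhaustive_test_days(counts, minutes_per_test=10):
    """Days needed to run one test per configuration, back to back."""
    return grid_size(counts) * minutes_per_test / (60 * 24)


if __name__ == "__main__":
    print(grid_size(candidate_counts))             # 115200 configurations
    print(exhaustive_test_days(candidate_counts))  # 800.0 days at 10 minutes each
```

Even with only seven parameters and a handful of values each, the exhaustive grid under these assumptions takes more than two years of back-to-back testing.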

With limited time and resources, the brute force approach of running tests for all the configuration value combinations is not a viable option. It’s clear that we have to do a guided search through the parameter space, so that we can arrive at the desired parameter values with a limited number of tests. In this post we will discuss an optimization technique called Bayesian optimization, which is popular for solving Continue reading

Posted in Big Data, Cluster Computation, Data Science, Optimization

Customer Segmentation Based on Online Behavior using ScikitLearn

Customer segmentation or clustering is useful in various ways. It could be used for targeted marketing. Sometimes when building a predictive model, it’s more effective to cluster the data and build a separate predictive model for each cluster. In this post, we will segment customers based on their online behavior on an eCommerce website.

The focus of this post is on solving a specific problem and interpreting the results, not on a broad overview of clustering techniques. We will use the Python machine learning library scikit-learn. The Python implementation can be found in Continue reading

Posted in Data Mining, Data Science, Machine Learning

Inventory Forecasting with Markov Chain Monte Carlo

Sometimes you want to calculate statistics about a variable that has a complex, possibly nonlinear relationship with another variable whose probability distribution is available but may be non-standard or non-parametric. That’s the situation we face when trying to predict and plan inventory in the face of demand with some arbitrary probability distribution. For this problem, the goal is to choose an inventory level, given an arbitrary demand distribution, so that some statistic on earning is maximized. The fact that demand has an arbitrary, non-standard distribution and that earning has a complex nonlinear relation with inventory and demand crushes any hope for an analytical solution.

One way out of this quagmire is to simulate earning by sampling from the demand distribution and applying the nonlinear function to convert inventory and demand to earning. That’s the approach Continue reading
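The simulation idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the post: it uses plain Monte Carlo sampling from a discrete demand distribution rather than Markov Chain Monte Carlo, and the demand values, their weights, and the earning function with its price, cost, salvage, and shortage parameters are all hypothetical assumptions.

```python
import random
from statistics import mean

# Hypothetical non-parametric demand distribution: values and their probabilities.
DEMAND_VALUES = [20, 40, 60, 80, 100]
DEMAND_WEIGHTS = [0.10, 0.25, 0.30, 0.25, 0.10]


def earning(inventory, demand, price=10.0, unit_cost=6.0,
            salvage=2.0, shortage_penalty=1.0):
    """Assumed nonlinear earning function of inventory and demand."""
    sold = min(inventory, demand)
    leftover = inventory - sold   # unsold units, recovered at salvage value
    unmet = demand - sold         # lost sales, charged a penalty
    return (price * sold + salvage * leftover
            - unit_cost * inventory - shortage_penalty * unmet)


def expected_earning(inventory, n_samples=5000, seed=42):
    """Estimate mean earning by sampling demand and applying earning()."""
    rng = random.Random(seed)
    demands = rng.choices(DEMAND_VALUES, weights=DEMAND_WEIGHTS, k=n_samples)
    return mean(earning(inventory, d) for d in demands)


def best_inventory(candidates, n_samples=5000, seed=42):
    """Pick the candidate inventory level with the highest estimated mean earning.

    The same seed is reused for every candidate, so all candidates are
    compared against the same simulated demand sequence.
    """
    scores = {inv: expected_earning(inv, n_samples, seed) for inv in candidates}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = best_inventory(range(20, 101, 20))
    for inv, score in scores.items():
        print(inv, round(score, 1))
    print("best inventory level:", best)
```

The statistic maximized here is the mean, but any other statistic on the simulated earnings (a percentile, for example) could be substituted in `expected_earning`.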

Posted in Data Science, Machine Learning, Optimization, Python, Simulation

Exactly Once Stream Processing Semantics? Not Exactly

Stream processing systems are characterized by at-least-once, at-most-once, and exactly-once processing semantics. These are important characteristics that should be carefully considered from the point of view of consistency and durability of a stream processing application. However, if a stream processing product claims to guarantee exactly-once processing semantics, you should carefully read the fine print.

The inconvenient truth is that a stream processing product cannot unilaterally guarantee exactly-once processing semantics. The guarantee holds only under certain assumptions, or when the application and the stream processing framework collaborate in certain ways.
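One common form of such collaboration, offered here as an assumption rather than as the specific mechanism the post goes on to describe, is for the framework to deliver at least once while the application makes its state updates idempotent. The sketch below simulates this with hypothetical names (`IdempotentCounterSink`, `deliver_at_least_once`) and in-memory state; no real streaming framework API is used.

```python
class IdempotentCounterSink:
    """Hypothetical application-side sink that applies each message id once."""

    def __init__(self):
        self.totals = {}          # running sum per key
        self.applied_ids = set()  # ids of messages already applied

    def apply(self, msg_id, key, amount):
        """Apply the update unless this message id was seen before."""
        if msg_id in self.applied_ids:
            return False  # duplicate delivery, ignored
        self.totals[key] = self.totals.get(key, 0) + amount
        self.applied_ids.add(msg_id)
        return True


def deliver_at_least_once(messages, sink, redeliver_ids):
    """Simulate a framework that replays some messages after a failure."""
    for msg_id, key, amount in messages:
        sink.apply(msg_id, key, amount)
        if msg_id in redeliver_ids:
            # Replay: the framework cannot tell whether the first attempt took effect.
            sink.apply(msg_id, key, amount)


if __name__ == "__main__":
    messages = [(1, "a", 5), (2, "b", 3), (3, "a", 2)]
    sink = IdempotentCounterSink()
    deliver_at_least_once(messages, sink, redeliver_ids={1, 3})
    print(sink.totals)  # {'a': 7, 'b': 3}
```

In a real deployment the state update and the record of applied ids would have to be persisted atomically, which is exactly the kind of assumption the framework alone cannot enforce.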

From a system architecture point of view, a stream processing framework can only implement Continue reading

Posted in Big Data, Real Time Processing, Spark Streaming, Storm, stream processing

Customer Churn Prediction with SVM using Scikit-Learn

Support Vector Machine (SVM) is unique among supervised machine learning algorithms in that it focuses on the training data points along the separating hyperplanes. In this post, I will go over the details of how I used SVM from the excellent Python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.

Along the way we will also explore the interplay between model complexity, training data size, and generalization error rate to gain deeper insight into learning problems.

The Python implementation is available in my open source project avenir on GitHub. The implementation provides a nice abstraction over the SVM implementation of scikit-learn. It can handle Continue reading

Posted in Data Science, Machine Learning, Predictive Analytic, Python

Is Neural Network Better Off with Big Data

How does a neural network, or for that matter any machine learning model, relate to Big Data? Do we get a better quality learning model with bigger data? That’s what we will explore in this post. We will explore sample complexity, i.e., the way model performance varies with training sample size. This will be particularly interesting from a Big Data point of view. We will also look at how model performance varies with model complexity.

Although I have used a multilayer neural network for my experiments, the findings should Continue reading

Posted in Big Data, Data Science, Machine Learning, Optimization, Predictive Analytic, Uncategorized

Customer Lifetime Value, Present and Future

Customer lifetime value for a business is the monetary value associated with its relationship with a customer, although there have been attempts to include non-monetary value associated with a customer. It’s an important metric to have for any marketing initiative, e.g., customer retention. The metric is also useful if preferential treatment is to be given to high-value customers during various interactions, e.g., customer service.

In this post we will cover a Hadoop-based solution for customer lifetime value. The solution starts with customer transaction history and computes a customer lifetime value score using multiple MapReduce jobs. The solution is part of the open source project visitante. Continue reading

Posted in Big Data, Data Science, Hadoop and Map Reduce, Marketing Analytic, Statistics