Handling Rare Events and Class Imbalance in Predictive Modeling for Machine Failure

Most supervised Machine Learning algorithms have difficulty when there is class imbalance in the training data, i.e., when the data belonging to one class heavily outnumbers the data belonging to the other class. However, there are many real life problems where we encounter this situation, e.g., fraud, customer churn and machine failure. There are various techniques to address this thorny problem of class imbalance.

In this post we will go over a technique based on oversampling of the minority class data, called Synthetic Minority Over-sampling Technique (SMOTE). We will go into the details of a Hadoop based implementation using machine failure data Continue reading
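To make the oversampling idea concrete, here is a minimal single-machine sketch of SMOTE in plain Python. It is not the Hadoop implementation from the post. The function name `smote`, its parameters and the two-feature sample records are all made up for illustration. Each synthetic record is created by picking a minority record, choosing one of its k nearest minority neighbors, and interpolating between the two at a random point.

```python
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic new minority records by interpolation (illustrative sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of base (squared Euclidean distance), excluding base itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical failure records: (temperature, vibration)
failure_samples = [(91.0, 0.80), (95.5, 0.92), (89.0, 0.75), (97.2, 0.88)]
new_samples = smote(failure_samples, n_synthetic=6, k=2, seed=42)
```

Because every synthetic record lies on a line segment between two real minority records, the new data stays inside the region already occupied by the minority class.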

Posted in Big Data, Data Science, ETL, Hadoop and Map Reduce

Measuring Campaign Effectiveness for an Online Service on Spark

Measuring campaign effectiveness is critical for any company to justify the marketing money being spent. Consider a company that provides a free online service on signup. It’s critical for the company to convert these free users into paid subscribers as soon as possible.

In this post, we will use simple statistical techniques to compare the relative merits of different campaigns in terms of effectiveness, which is measured by conversions. The Spark based solution is available Continue reading

Posted in Big Data, Data Science, Marketing Analytic, Spark

Processing Missing Values with Hadoop

Missing values are just part of life in the data processing world. In most cases you cannot simply ignore the missing values, as they may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing data consists of two steps: gaining some insight into the missing fields in the data, and then taking some action based on that insight. In this post the focus will be primarily on the first step.
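As a rough illustration of the first step, the sketch below counts missing fields per column in delimited text records. It is plain Python, not the Hadoop implementation. The function name `missing_counts`, the set of tokens treated as missing and the sample records are assumptions made for this example. The same logic maps naturally onto a map step that emits a column index for each missing field and a reduce step that sums the counts.

```python
from collections import Counter


def missing_counts(lines, delimiter=",", missing_tokens=("", "NA", "null")):
    """Return {column index: number of records where that field is missing}."""
    counts = Counter()
    for line in lines:
        for idx, field in enumerate(line.split(delimiter)):
            # a field counts as missing if it is empty or one of the marker tokens
            if field.strip() in missing_tokens:
                counts[idx] += 1
    return dict(sorted(counts.items()))


# Hypothetical records: id, temperature, humidity, status
records = [
    "1,34.5,,ok",
    "2,,NA,ok",
    "3,36.1,0.7,",
]
profile = missing_counts(records)
```

For the sample records, column 2 is missing in two records and columns 1 and 3 in one record each, which is the kind of insight that drives the follow-up action in the second step.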

The Hadoop based implementation is available in my OSS project chombo on GitHub. In future Continue reading

Posted in Big Data, Data Profiling, Data Science, ETL, Hadoop and Map Reduce

Project Assignment Optimization with Simulated Annealing on Spark

Optimizing the assignment of people to projects is a very complex problem, and classical optimization techniques are not very useful for it. The topic of this post is a project assignment optimization problem where people should be assigned to projects in a way that minimizes the cost.

Optimization problems of this kind, involving discrete or categorical variables, are called combinatorial optimization problems, and they generally don’t have an analytical solution. You have to resort to other, unconventional techniques. However, these alternative techniques won’t guarantee an optimal solution. Simulated Annealing is one such technique, and it broadly falls under a category of algorithms called Stochastic Optimization.
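A bare-bones sketch of Simulated Annealing for this kind of problem is shown below, in plain Python rather than the Spark based implementation. The cost matrix, the one-person-per-project constraint, the swap move and the geometric cooling schedule are all simplifying assumptions chosen for illustration. A worse assignment is accepted with probability exp(-delta / temperature), which lets the search escape local minima early on and become greedy as the temperature drops.

```python
import math
import random


def anneal(cost, iterations=5000, start_temp=10.0, cooling=0.999, seed=0):
    """Search for a low cost assignment; cost[p][j] is the cost of person p on project j."""
    rng = random.Random(seed)
    n = len(cost)

    def total(assign):
        return sum(cost[p][assign[p]] for p in range(n))

    current = list(range(n))  # current[p] = project assigned to person p
    rng.shuffle(current)
    current_cost = total(current)
    best, best_cost = current[:], current_cost
    temp = start_temp
    for _ in range(iterations):
        # candidate move: swap the projects of two randomly chosen people
        i, j = rng.sample(range(n), 2)
        current[i], current[j] = current[j], current[i]
        delta = total(current) - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost += delta  # accept the move
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        else:
            current[i], current[j] = current[j], current[i]  # reject: undo the swap
        temp *= cooling  # geometric cooling
    return best, best_cost


# Hypothetical cost matrix: 4 people (rows) x 4 projects (columns)
cost_matrix = [
    [9, 2, 7, 8],
    [6, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]
assignment, assignment_cost = anneal(cost_matrix)
```

The function returns the best assignment seen during the run, which is not guaranteed to be the global optimum. Repeating the run with different seeds is a cheap way to gain confidence in the result.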

We will discuss a solution to the project assignment optimization problem using Simulated Annealing implemented on Spark. The Scala based implementation Continue reading

Posted in Data Science, Optimization, Spark | 1 Comment

Mining Seasonal Products from Sales Data

The other day someone asked me how to include products with seasonal demand in recommendations based on collaborative filtering or some other technique. The solution to the problem involves two steps. The first step is to identify products with seasonal demand. The second step involves merging two ranked lists, one list being the recommendation list based on some recommendation algorithm, the other being the list of products with seasonal demand. The second list is a function of time.

Our focus in this post is on the first step. As we will explore in this post, products with seasonal demand can be found with simple statistical techniques. The solution Continue reading
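One simple statistic that could be used for the first step is the entropy of a product's monthly sales distribution. This is an assumption made for illustration, not necessarily the technique used in the post. The function name `seasonality_score` and the sales figures are hypothetical. Sales spread evenly over the year have high entropy, while sales concentrated in a few months have low entropy, so a normalized score of one minus the relative entropy is close to 1 for strongly seasonal products and close to 0 for products with steady demand.

```python
import math


def seasonality_score(monthly_sales):
    """Return 1 - normalized entropy of the sales distribution over periods (0 = flat, 1 = one period)."""
    total = sum(monthly_sales)
    if total == 0:
        return 0.0  # no sales at all, so nothing to call seasonal
    probs = [s / total for s in monthly_sales if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(monthly_sales))


# Hypothetical monthly unit sales, January to December
steady_product = [100] * 12
holiday_product = [2, 1, 0, 0, 0, 0, 0, 0, 3, 10, 150, 400]

steady_score = seasonality_score(steady_product)
holiday_score = seasonality_score(holiday_product)
```

Ranking products by this score and applying a threshold would give a candidate list of seasonal products. Knowing which months carry the demand is what makes that list a function of time.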

Posted in Big Data, Data Mining, Data Science, eCommerce, Map Reduce, Recommendation Engine

Predicting Call Hangup in Customer Service Calls with Decision Tree and Random Forest

When customers hang up after a long wait in a call, it’s money wasted for the company. Moreover, it leaves the customer with a poor experience. It would be nice if we could predict, in real time while the customer is on hold, how likely the customer is to hang up and, based on the prediction, give queue priority to the customer or take some other action.

In this post, we will discuss a Decision Tree and Random Forest based solution implemented on Hadoop. Random forest is Continue reading

Posted in Big Data, Customer Service, Hadoop and Map Reduce, Machine Learning, Predictive Analytic | 1 Comment

Machine Learning at Scale with Parallel Processing

Machine Learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this post we will discuss how to do Machine Learning at scale with Hadoop or Spark. We will consider three different ways parallel processing can benefit Machine Learning.

When thinking about parallel processing in the context of Machine Learning, what immediately comes to mind is data partitioning along with divide and conquer learning algorithms. However, as we will find out Continue reading

Posted in Hadoop and Map Reduce, Machine Learning, Spark | 2 Comments