Project Assignment Optimization with Simulated Annealing on Spark

Optimizing the assignment of people to projects is a very complex problem, and classical optimization techniques are not very useful for it. The topic of this post is a project assignment optimization problem where people should be assigned to projects in a way that minimizes the cost.

Optimization problems of this kind, involving discrete or categorical variables, are called combinatorial optimization problems, and they generally don’t have an analytical solution. You have to resort to other, non-conventional techniques. However, these alternative techniques won’t guarantee an optimal solution. Simulated Annealing is one such technique and broadly falls under the category of algorithms called Stochastic Optimization.
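To make the idea concrete, here is a minimal single-machine sketch of Simulated Annealing in Python. It is not the Spark implementation from the post. The cost matrix, the per-project head counts (`capacities`), the swap move and the geometric cooling schedule are all assumptions chosen for illustration.

```python
import math
import random


def total_cost(assignment, cost):
    """Sum of cost[person][project] over all people."""
    return sum(cost[person][project] for person, project in enumerate(assignment))


def anneal_assignment(cost, capacities, iterations=5000, start_temp=10.0,
                      cooling=0.999, seed=42):
    """Assign people to projects; capacities[j] people must work on project j."""
    rng = random.Random(seed)
    # Initial solution: one slot per required head count, randomly shuffled.
    current = [proj for proj, cap in enumerate(capacities) for _ in range(cap)]
    if len(current) != len(cost):
        raise ValueError("capacities must add up to the number of people")
    rng.shuffle(current)
    current_cost = total_cost(current, cost)
    best, best_cost = list(current), current_cost
    temp = start_temp
    for _ in range(iterations):
        # Neighbor move: swap the projects of two people (keeps head counts).
        a, b = rng.sample(range(len(current)), 2)
        pa, pb = current[a], current[b]
        delta = (cost[a][pb] + cost[b][pa]) - (cost[a][pa] + cost[b][pb])
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current[a], current[b] = pb, pa
            current_cost += delta
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= cooling
    return best, best_cost
```

Calling `anneal_assignment(cost, capacities)` returns the best assignment seen and its cost. As noted above, the result is not guaranteed to be optimal.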

We will discuss a solution to the project assignment optimization problem using Simulated Annealing implemented on Spark. The Scala based implementation Continue reading

Posted in Data Science, Optimization, Spark

Mining Seasonal Products from Sales Data

The other day someone asked me how to include products with seasonal demand in recommendations based on collaborative filtering or some other technique. The solution to the problem involves two steps. The first step is to identify products with seasonal demand. The second step involves merging two ranked lists, one list being the recommendation list based on some recommendation algorithm, the other being the list of products with seasonal demand. The second list is a function of time.
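As a rough illustration of the first step, the sketch below flags a product as seasonal when a single month accounts for a large share of its total sales. The peak-share statistic, the `min_peak_share` threshold and the `(product_id, month, quantity)` input layout are assumptions made for this example, not necessarily the technique used in the post.

```python
from collections import defaultdict


def seasonal_products(sales, min_peak_share=0.4):
    """sales: iterable of (product_id, month 1..12, quantity) tuples.

    Returns {product_id: (peak_month, peak_share)} for products whose
    busiest month holds at least min_peak_share of total sales.
    """
    monthly = defaultdict(lambda: [0.0] * 12)
    for product, month, quantity in sales:
        monthly[product][month - 1] += quantity

    seasonal = {}
    for product, hist in monthly.items():
        total = sum(hist)
        if total <= 0:
            continue  # no sales, nothing to say about seasonality
        peak = max(range(12), key=hist.__getitem__)
        share = hist[peak] / total
        if share >= min_peak_share:
            seasonal[product] = (peak + 1, share)
    return seasonal
```

A product that sells evenly all year has a peak share near 1/12, so it falls well below the threshold.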

Our focus in this post is on the first problem. As we will explore in this post, products with seasonal demand can be found with some simple statistical techniques. The solution Continue reading

Posted in Big Data, Data Mining, Data Science, eCommerce, Map Reduce, Recommendation Engine

Predicting Call Hangup in Customer Service Calls with Decision Tree and Random Forest

When customers hang up after a long wait in a call, it’s money wasted for the company. Moreover, it leaves the customer with a poor experience. It would be nice if we could predict in real time, while the customer is on hold, how likely the customer is to hang up and, based on the prediction, give queue priority to the customer or take some other action.

In this post, we will discuss a Decision Tree and Random Forest based solution implemented on Hadoop. Random forest is Continue reading

Posted in Big Data, Customer Service, Hadoop and Map Reduce, Machine Learning, Predictive Analytic

Machine Learning at Scale with Parallel Processing

Machine Learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this post we will discuss how to have Machine Learning at scale with Hadoop or Spark. We will consider three different ways parallel processing can benefit Machine Learning.

When thinking about parallel processing in the context of Machine Learning, what immediately jumps to mind is data partitioning along with divide and conquer learning algorithms. However, as we will find out Continue reading

Posted in Hadoop and Map Reduce, Machine Learning, Spark

Mobile Phone Usage Data Analytics for Effective Marketing Campaign

Insights gained from analyzing mobile phone usage data can be extremely valuable in marketing campaigns and customer engagement efforts. For example, the hour of the day when a user engages most with his or her mobile device could be used to choose the time to send a marketing message or email. The most frequent tower locations could be used for promotional efforts by nearby businesses.
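The hour-of-day idea can be sketched in a few lines of plain Python. This is not the Spark implementation from the post. The `(user_id, ISO timestamp)` event layout and the function name are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def peak_usage_hours(events):
    """events: iterable of (user_id, ISO 8601 timestamp string).

    Builds an hour-of-day histogram per user and returns the hour
    (0..23) with the most usage events for each user.
    """
    histograms = {}
    for user, timestamp in events:
        hour = datetime.fromisoformat(timestamp).hour
        histograms.setdefault(user, Counter())[hour] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in histograms.items()}
```

The resulting hour per user could then drive the send time of a marketing message. The same counting approach applies to tower locations.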

In this post, we will go over a Spark based implementation of histograms and other simple statistics for mobile phone usage data. The solution is available Continue reading

Posted in Big Data, Data Profiling, Marketing Analytic, Spark, Statistics

Debunking the Myth of Top Ten Machine Learning Algorithms

Broad brush statements of this kind about Machine Learning algorithms are made often, and there is a lot of online content alluding to this simplistic view of Machine Learning. It’s tempting to gravitate towards simplistic views and use a recipe-like approach while solving Machine Learning problems.

Unfortunately, the reality of Machine Learning is complex. The choice of learning algorithms and the algorithm tuning process can be confounding. Choosing the right learning algorithm for a given problem is not a trivial task.

In this post we will delve into computational learning theory and use it to debunk the myth of the so-called Top Ten Machine Learning algorithms. Hopefully, after reading this post, you will be more cautious about blindly using one of the algorithms from the so-called top 10 list. In this post, I will keep the discussion Continue reading

Posted in Machine Learning

JSON to Relational Mapping with Spark

If there is one data format that’s ubiquitous, it’s JSON. Whether you are calling an API or exporting data from some system, the format is most likely to be JSON these days. However, many databases cannot handle JSON, and you may want to store the data in relational format. You may also want to do offline batch processing with Spark or Hadoop, and your Hadoop or Spark application may only support flat, field delimited, record oriented input.
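To illustrate what such a mapping involves, here is a small Python sketch that flattens a nested JSON document into one delimited record. It is not the chombo API. The dotted-path column names, the handling of arrays by index and the function names are assumptions for this example.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: scalar}."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix[:-1]] = obj  # drop the trailing dot
    return flat


def to_record(json_text, columns, delim=","):
    """Emit one delimited line with the requested columns, in order.

    Paths that are missing or null in the document become empty fields.
    """
    flat = flatten(json.loads(json_text))
    return delim.join("" if flat.get(col) is None else str(flat[col])
                      for col in columns)
```

Addressing array elements by index keeps the sketch short. A fuller relational mapping would more likely turn each array element into its own child row.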

Whatever the case, a tool is needed to map JSON to flat, record oriented data. In this post we will go over a Spark based solution for converting JSON to flat relational data. The implementation is part of my open source project chombo on GitHub. A Hadoop Continue reading

Posted in Big Data, ETL, Spark