Research has shown that customers who have abandoned shopping carts, when subjected to a retargeting email campaign, often come back and, in many cases, end up buying more than what was originally in the cart.
Such email campaigns have many attributes. In this post, we will find the attribute values that make a retargeting campaign most effective. A Hadoop-based decision tree algorithm will be used to mine existing retargeting campaign data. Continue reading
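At the heart of decision tree induction is picking the attribute whose values best separate the outcomes, measured by information gain. A minimal single-machine sketch of that selection step (the attribute names, values, and records here are hypothetical, not from the post's campaign data, and the Hadoop-based algorithm distributes this over MapReduce):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(records, attr, label="converted"):
    """Entropy reduction from splitting records on attr."""
    base = entropy([r[label] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r[label])
    weighted = sum(len(g) / len(records) * entropy(g)
                   for g in groups.values())
    return base - weighted

# Hypothetical campaign records: attribute values plus conversion outcome
data = [
    {"send_hour": "morning", "discount": "10%",  "converted": 1},
    {"send_hour": "morning", "discount": "none", "converted": 0},
    {"send_hour": "evening", "discount": "10%",  "converted": 1},
    {"send_hour": "evening", "discount": "none", "converted": 0},
]
# The attribute with the highest gain becomes the root split of the tree
best = max(["send_hour", "discount"], key=lambda a: info_gain(data, a))
```

In this toy data the discount attribute perfectly predicts conversion, so it yields the maximum gain; a real campaign data set would rank many attributes this way, recursively, to grow the tree.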
Real time fraud detection is one of the use cases where multiple components of the Big Data ecosystem come into play in a significant way: Hadoop batch processing for building the predictive model, and Storm for predicting fraud from the real time transaction stream using that model. Additionally, Redis is used as the glue between the different subsystems.
In this post I will go through the end-to-end solution for real time fraud detection, using credit card transactions as an example, although the same solution can be used for any kind of sequence-based outlier detection. I will be building a Markov chain model using the Hadoop-based implementation in my open source project avenir. The prediction algorithm implementation Continue reading
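The core idea of sequence-based outlier detection with a Markov chain is: learn transition probabilities between transaction states from historical data, then flag new sequences whose transitions are improbable under the learned model. A minimal sketch under the assumption that each transaction has already been encoded as a discrete token (the tokens "L" and "H" below are hypothetical; avenir's Hadoop implementation builds the same kind of transition matrix at scale):

```python
import math
from collections import defaultdict

def train_markov(sequences):
    """Estimate first-order transition probabilities from token sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def sequence_log_prob(model, seq, floor=1e-6):
    """Average log-probability of transitions; low values suggest outliers."""
    lp = sum(math.log(model.get(a, {}).get(b, floor))
             for a, b in zip(seq, seq[1:]))
    return lp / max(len(seq) - 1, 1)

# Train on sequences assumed to be mostly legitimate
model = train_markov([["L", "L", "L", "H"]] * 5)
```

A sequence scoring far below the typical average log-probability of the training data would be flagged as a potential fraud candidate; in the full solution Storm would compute this score per incoming transaction window, with Redis carrying the model between the batch and real time layers.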
Customer loyalty is the strength of the relationship a customer has with a business, as manifested by the customer purchasing more and at higher frequency. There are various signals or events related to a customer’s engagement with a business; some examples are transactions, customer service calls and social media comments. These events are indicative of a customer’s loyalty to the business. Loyalty is an internal state that cannot be directly observed and measured, but it can be inferred probabilistically.
The customer events over a time window reflect a corresponding sequence of internal loyalty states. The theme of this post is to predict that sequence of internal loyalty states using a Hidden Markov Model (HMM). Continue reading
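Given an HMM's parameters, the most likely sequence of hidden states behind an observed event sequence is recovered with the Viterbi algorithm. A self-contained sketch, with entirely hypothetical state names, events, and probabilities (the post's model would estimate these from real customer data):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for the observations (Viterbi)."""
    # V[t][s]: probability of the best path ending in state s at step t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    return path[max(states, key=lambda s: V[-1][s])]

# Hypothetical two-state loyalty model
states = ["loyal", "at_risk"]
start = {"loyal": 0.5, "at_risk": 0.5}
trans = {"loyal":   {"loyal": 0.9, "at_risk": 0.1},
         "at_risk": {"loyal": 0.3, "at_risk": 0.7}}
emit = {"loyal":   {"purchase": 0.8, "complaint": 0.2},
        "at_risk": {"purchase": 0.3, "complaint": 0.7}}
decoded = viterbi(["purchase", "purchase", "complaint", "complaint"],
                  states, start, trans, emit)
```

With these made-up parameters the decoder maps the purchase-heavy prefix to the loyal state and the complaint-heavy suffix to the at-risk state, which is exactly the kind of state trajectory the post sets out to infer.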
The number of choices for big data solutions can sometimes be overwhelming and confusing. The purpose of this post is to lay out a road map for big data solutions. I will be categorizing the products under four categories of solutions: 1. Query 2. Analytic 3. Real time processing 4. Search. My list will include only well-known open source solutions. I won’t be doing a comprehensive analysis of the pros and cons of the different products; there are a lot of posts out there with comparative studies, sometimes very strongly opinionated. I will be providing links for further exploration of the listed products for interested readers. Continue reading
Posted in Big Data
Tagged Cassandra, druid, elastic search, Hadoop, HBase, Hive, impala, MongoDB, road map, shark, Solr, spark, stinger
I was prompted to write this post by a recent discussion thread in the LinkedIn Hadoop Users Group regarding fuzzy string matching for duplicate record identification with Hadoop. As part of my open source Hadoop-based recommendation engine project sifarish, I have a MapReduce class for fuzzy matching between entities with multiple attributes; it’s used for content-based recommendation. My initial reaction was that the same MapReduce class could be used for duplicate detection, by modeling all the fields of a record as text fields and calculating Jaccard similarity between corresponding text fields.
On further thought, I realized that Jaccard similarity may not be the right choice and that Levenshtein distance is more appropriate. Continue reading
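The intuition is that token-set Jaccard similarity treats near-identical strings like "John" and "Jon" as completely different tokens, while Levenshtein distance counts the character edits between them. A minimal sketch of the standard dynamic-programming edit distance (illustrative only, not sifarish's MapReduce implementation):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming.

    Keeps only the previous row of the DP table, so memory is O(len(b)).
    A normalized similarity is then 1 - d / max(len(a), len(b), 1).
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

For duplicate record detection, "John" vs "Jon" comes out at distance 1 (one deleted character), a strong match signal that a Jaccard comparison of whole-word tokens would score as zero overlap.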
In an earlier post, I did a survey of a class of reinforcement learning algorithms known as Multi Arm Bandit (MAB). Essentially, these algorithms make decisions and learn from rewards received from the environment. You can also think of them as experiential or trial-and-error learning.
Now it’s time to put them into practice. So, in this post, my focus is on using some of these algorithms for a real-life use case: price optimization. When selling a product, how do you choose the optimal price to maximize profit? Continue reading
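Framed as a bandit problem, each candidate price is an arm; each round we offer a price, observe the resulting profit, and shift traffic toward prices that earn more. A minimal epsilon-greedy sketch with a hypothetical deterministic demand curve (real reward feedback would be noisy per-customer conversions, and the post covers more algorithms than this one):

```python
import random

def epsilon_greedy_price(prices, reward_fn, rounds=1000, epsilon=0.1, seed=42):
    """Treat each candidate price as a bandit arm; pull, observe profit, adapt."""
    rng = random.Random(seed)
    counts = [0] * len(prices)
    totals = [0.0] * len(prices)
    # Pull every arm once so the running averages are defined
    for arm in range(len(prices)):
        counts[arm] += 1
        totals[arm] += reward_fn(prices[arm], rng)
    for _ in range(rounds - len(prices)):
        if rng.random() < epsilon:
            arm = rng.randrange(len(prices))          # explore
        else:                                         # exploit best average so far
            arm = max(range(len(prices)), key=lambda i: totals[i] / counts[i])
        counts[arm] += 1
        totals[arm] += reward_fn(prices[arm], rng)
    return prices[max(range(len(prices)), key=lambda i: totals[i] / counts[i])]

# Hypothetical demand: buy probability falls linearly with price, so
# expected profit price * (1 - price / 20) peaks at a price of 10
profit = lambda price, rng: price * max(0.0, 1 - price / 20)
best_price = epsilon_greedy_price([5, 10, 15], profit, rounds=200)
```

The epsilon parameter trades off exploration against exploitation; the MAB variants surveyed in the earlier post (e.g., softmax or UCB-style strategies) differ mainly in how they make that trade-off.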
One of the popular features of MongoDB is the ability to store arbitrarily nested objects and to index on any nested field. In this post I will show how to store nested objects in Cassandra using composite columns; I have recently added this feature to my open source Cassandra project agiato. In Cassandra, as in many other NoSQL databases, stored data is highly denormalized. The denormalized data often manifests itself in the form of a nested object, e.g., when denormalizing one-to-many relations.
In the solution presented here, the object data is stored in column families with composite columns. An object will typically have some statically defined fields; the rest of the fields will be dynamic. Continue reading
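The mapping rests on one idea: each leaf of the nested object becomes a column whose composite name is the path from the root to that leaf. A sketch of that flattening step in Python (agiato itself is not shown here; the ":"-joined string is only a stand-in for Cassandra's serialized composite column name components, and the example object is hypothetical):

```python
def flatten(obj, prefix=()):
    """Flatten a nested dict into (composite column name, value) pairs.

    Each path from the root to a leaf becomes one column; list elements
    get their index as a path component.
    """
    cols = []
    for key, val in obj.items():
        path = prefix + (key,)
        if isinstance(val, dict):
            cols.extend(flatten(val, path))
        elif isinstance(val, list):
            for i, item in enumerate(val):
                cols.extend(flatten({str(i): item}, path))
        else:
            cols.append((":".join(path), val))
    return cols

# Hypothetical denormalized order object with static and dynamic fields
order = {"id": 7, "addr": {"city": "SF"}, "items": ["book", "pen"]}
columns = dict(flatten(order))
```

Here the nested object becomes rows of columns such as `addr:city` and `items:0`, so range queries over a common prefix can pull back a whole sub-object, which is what composite columns make cheap in Cassandra.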