Data Quality Control With Outlier Detection

For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent on data cleaning and preparation. In most ETL tools, you typically define constraints and rules statically for data validation. Examples of such rules are limit checks for numerical quantities and pattern matching for text data.

Sometimes it’s not feasible to define the rules statically, because there may be too many variables and the variables may be non-stationary. Data is non-stationary when its statistical properties change with time. In this post, we will go through a technique for deciding whether numerical data lies outside an acceptable range by detecting outliers.
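
As a rough illustration of the idea (not necessarily the technique covered in the full post), here is a minimal Python sketch that flags values using the modified z-score, built from the median and the median absolute deviation (MAD). Because median and MAD are robust statistics, the outliers themselves don’t distort the acceptable range the way mean and standard deviation would:

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.
    3.5 is a commonly used cutoff for the modified z-score."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values]) or 1e-9
    return [v for v in values if abs(v - med) / (1.4826 * mad) > threshold]

print(find_outliers([10.1, 9.8, 10.3, 10.0, 45.2, 9.9]))  # [45.2]
```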

Posted in Big Data, ETL, Hadoop and Map Reduce, Internet of Things, Outlier Detection, Statistics

Is Bigger Data Better for Machine Learning?

I have often seen the topic of data size as it relates to machine learning being discussed. But the discussions are mostly opinions, views, and innuendo, not backed by any rational explanation. Comments like “we need meaningful data, not big data” or “insightful data, not big data” abound, and they often leave me puzzled: how does Big Data preclude meaning or insight in the data? Motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore the theories of Machine Learning and get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.

In this post I will summarize my findings. As part of this exercise, I have implemented a Python script to estimate the minimum training set size required by a learner, based on PAC theory.
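
The script itself is covered in the full post; as a taste of what PAC theory gives you, here is a minimal sketch using the classic VC-dimension-based sample complexity bound (Blumer et al.), m ≥ (1/ε)(4 log₂(2/δ) + 8·VC(H)·log₂(13/ε)), where ε is the error tolerance and δ the failure probability:

```python
import math

def pac_sample_size(vc_dim, epsilon, delta):
    """Minimum number of training examples for a consistent learner,
    per the VC-dimension-based PAC bound:
    m >= (1/eps) * (4*log2(2/delta) + 8*vc*log2(13/eps))."""
    return math.ceil((1.0 / epsilon) *
                     (4 * math.log2(2.0 / delta) +
                      8 * vc_dim * math.log2(13.0 / epsilon)))

# e.g., a linear classifier over 10 features (VC dimension 11),
# with 5% error tolerance at 95% confidence
print(pac_sample_size(vc_dim=11, epsilon=0.05, delta=0.05))  # 14546
```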

Posted in Big Data, Machine Learning, Predictive Analytic

Bulk Insert, Update and Delete in Hadoop Data Lake

A Hadoop data lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository for data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only fashion and is easy to handle. Such data tends to be fact data, e.g., user behavior tracking data or sensor data.

However, dimension data or master data, e.g., customer data or product data, will typically be mutable. It generally arrives in batches from some external source system, reflecting incremental inserts, updates, and deletes in that system. All incremental changes need to be consolidated with the existing master data. This is a problem faced in many Hadoop or Hive based Big Data analytic projects: even if you are using the latest version of Hive, there is no bulk update or delete support. This post is about a Map Reduce job that performs bulk insert, update, and delete on data in HDFS.
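
To give a flavor of the consolidation logic, here is a local Python sketch, not the Map Reduce job itself, under the assumption that each record carries a key, a timestamp, and an operation code ('I'nsert, 'U'pdate, or 'D'elete):

```python
from itertools import groupby
from operator import itemgetter

def consolidate(master, delta):
    """Merge incremental changes into master data. Records are
    (key, timestamp, op, payload) tuples; for each key the newest
    record wins, and a trailing delete removes the key entirely."""
    combined = sorted(master + delta, key=itemgetter(0, 1))
    merged = []
    for key, group in groupby(combined, key=itemgetter(0)):
        latest = max(group, key=itemgetter(1))  # newest record for this key
        if latest[2] != 'D':                    # drop deleted keys
            merged.append(latest)
    return merged

master = [("cust1", 1, "I", "alice"), ("cust2", 1, "I", "bob")]
delta  = [("cust1", 2, "U", "alice2"), ("cust2", 2, "D", None),
          ("cust3", 2, "I", "carol")]
print(consolidate(master, delta))
# [('cust1', 2, 'U', 'alice2'), ('cust3', 2, 'I', 'carol')]
```

In a Map Reduce setting the same logic maps onto shuffling records by key and applying the newest-record-wins rule in the reducer.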

Posted in Big Data, ETL, Hadoop and Map Reduce, Hive

Customer Service and Recommendation System

You may be wondering about the relationship I alluded to in the title. A personalization and recommendation system like sifarish bootstraps from user and item engagement data. This kind of data is gleaned from various signals, e.g., a user’s engagement with various items or a user’s explicit rating of an item. A recommendation system could benefit significantly from customer service data residing in Customer Relationship Management (CRM) systems: it’s yet another customer item engagement signal.

In this post the focus will be on extracting customer engagement data from CRM and on how it can be combined with other customer item engagement signals.
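
To make the idea concrete, here is a toy Python sketch of folding multiple engagement signals, including hypothetical CRM-derived ones, into a single implicit rating. The signal names and weights are purely illustrative, not the scheme used in sifarish:

```python
# Illustrative event types and weights only
SIGNAL_WEIGHTS = {
    "page_view": 1.0,
    "purchase": 5.0,
    "explicit_rating": 4.0,
    "support_complaint": -2.0,  # CRM signal: a problem reported with the item
    "service_praise": 3.0,      # CRM signal: positive service interaction
}

def implicit_rating(events):
    """Fold a user's engagement events for one item into a single
    implicit rating by summing the weights of the observed signals."""
    return sum(SIGNAL_WEIGHTS.get(e, 0.0) for e in events)

print(implicit_rating(["page_view", "purchase", "support_complaint"]))  # 4.0
```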

Posted in Big Data, Customer Service, eCommerce, Hadoop and Map Reduce, Recommendation Engine

Real Time Detection of Outliers in Sensor Data using Spark Streaming

As far as analytics of sensor generated data is concerned, in the Internet of Things (IoT) and in a connected-everything world, it’s mostly about real time analytics of time series data. In this post, I will be addressing a use case involving detecting outliers in sensor generated data with Spark Streaming. Outliers are data points that deviate significantly from most of the data. The implementation is part of my new open source project ruscello, implemented in Scala. The project is focused on real time analytics of sensor data with various IoT use cases in mind.

The specific use case we will be addressing in this post is outlier detection in sensor generated time series data.
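
ruscello itself is in Scala; as a rough Python/PySpark illustration of the general approach, not ruscello’s actual algorithm, here is a sketch that flags per-sensor outliers over a sliding window, assuming readings arrive as "sensorId,value" lines on a socket:

```python
import statistics
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SensorOutlierDetection")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

# Readings assumed to arrive as "sensorId,value" lines on a socket
readings = (ssc.socketTextStream("localhost", 9999)
               .map(lambda line: line.split(","))
               .map(lambda f: (f[0], float(f[1]))))

def flag_outliers(kv):
    """Within one window, flag a sensor's values lying more than
    3 standard deviations from the window mean."""
    sensor, values = kv
    vals = list(values)
    if len(vals) < 2:
        return []
    mean, std = statistics.mean(vals), statistics.stdev(vals) or 1e-9
    return [(sensor, v) for v in vals if abs(v - mean) / std > 3.0]

# Sliding 60-second window, advancing every 10 seconds
outliers = readings.groupByKeyAndWindow(60, 10).flatMap(flag_outliers)
outliers.pprint()

ssc.start()
ssc.awaitTermination()
```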

Posted in Big Data, Internet of Things, Outlier Detection, Real Time Processing, Spark, Time Series Analytic

Diversity in Personalization with Attribute Diffusion

One of the nagging problems in personalized recommendation systems is crowding of the recommendation list with items that have the same attribute values. For example, if you happen to like a certain artiste, songs by that artiste will tend to flood the recommended song list in a music recommendation system. The flooding happens precisely because of the accuracy of the collaborative filtering algorithm, but that accuracy may not always generate the best outcome. The only criterion of success for a recommender is whether or not the user has clicked on one or more items in the recommendation impression list.

As I mentioned in my earlier post, getting a recommendation list through Collaborative Filtering (CF) machine learning algorithms goes only part of the way towards providing an effective solution. Various post processing mechanisms are required to modify and improve the recommendation list.
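
The attribute diffusion technique itself is described in the full post; as a simple illustration of the general idea of attribute-based diversification, here is a greedy Python re-ranker that damps the scores of items sharing an already-picked attribute value:

```python
def diversify(ranked, get_attr, penalty=0.3):
    """Greedy re-rank of a scored recommendation list: every time an
    attribute value (e.g. an artiste) is picked, damp the scores of
    remaining items sharing it, so no single attribute value floods
    the top of the list. ranked: list of (item, score) pairs;
    get_attr maps an item to its attribute value."""
    remaining = dict(ranked)
    result = []
    while remaining:
        item = max(remaining, key=remaining.get)  # best current score
        result.append(item)
        del remaining[item]
        for other in remaining:
            if get_attr(other) == get_attr(item):
                remaining[other] *= (1.0 - penalty)
    return result

songs = [("s1", 0.9), ("s2", 0.8), ("s3", 0.7), ("s4", 0.6)]
artiste = {"s1": "A", "s2": "A", "s3": "B", "s4": "A"}
print(diversify(songs, artiste.get))  # ['s1', 's3', 's2', 's4']
```

Note how s3, the only song by a different artiste, jumps ahead of s2 after the first pick by artiste A.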

Posted in Big Data, Hadoop and Map Reduce, Personalization, Recommendation Engine

Positive Feedback Driven Recommendation Rank Reordering

The basic recommendation output, consisting of the tuple (user, item, predicted rating), is easy to obtain from any Collaborative Filtering (CF) based recommendation and personalization engine, including sifarish. It’s been reported that applying various post processing logic to the basic CF output yields a bigger return in recommendation quality than tweaking the core machine learning algorithm. We can think of these post processing units as plugins that operate on the basic recommendation output.

Many of these have been implemented in sifarish. They are not based on rigorous machine learning algorithms, but on simple intuition and heuristics. In this post I will go through one that was implemented recently: the actual rating of an item by a user is considered in conjunction with the predicted rating to derive a net rating.
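
The exact formula is in the full post; as an illustrative sketch, one simple way to blend the two ratings is a weighted combination, where the weight below is purely hypothetical:

```python
# The 0.7 weight is hypothetical; sifarish's actual formula may differ.
def net_rating(predicted, actual=None, feedback_weight=0.7):
    """Blend the CF-predicted rating with the user's actual rating
    (positive feedback) when one exists."""
    if actual is None:          # no explicit feedback: use prediction as-is
        return predicted
    return feedback_weight * actual + (1.0 - feedback_weight) * predicted

print(net_rating(predicted=3.8))               # 3.8
print(net_rating(predicted=3.8, actual=5.0))   # 4.64
```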

Posted in Big Data, Collaborative Filtering, Hadoop and Map Reduce, Personalization, Recommendation Engine