Validating Big Data

Data quality is a thorny issue in most Big Data projects. It’s been reported that up to two thirds of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation features that have been added recently to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable common validation functions is provided out of the box. I will use product data as a test case ...

Posted in Big Data, data quality, ETL, Hadoop and Map Reduce

Customer Conversion Prediction with Markov Chain Classifier

For online users, conversion generally refers to a user action that results in some tangible gain for a business, e.g., a user opening an account or making a first purchase. Next to drawing a large number of users to a web site, getting a user to convert is the most critical event in the user’s relationship with an online business. Being able to predict when a user will convert and become a customer is an important tool for online businesses to have at their disposal. A business could initiate a targeted marketing campaign based on the prediction result.

In this post, I will use online user behavior data to predict whether a user will convert, using a Markov Chain Classifier. The Hadoop based implementation ...

Posted in Big Data, Hadoop and Map Reduce, Machine Learning, Marketing Analytic, Predictive Analytic, Statistics

Data Quality Control With Outlier Detection

For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent on data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for data validation. Some examples of such rules are limit checking for numerical quantities and pattern matching for text data.
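To make the idea of statically defined rules concrete, here is a minimal Python sketch of a limit check and a pattern match. The field names (`price`, `sku`), the limits, and the helper functions are all hypothetical and chosen only for illustration; this is not the API of any particular ETL tool.

```python
import re

def in_range(lo, hi):
    """Limit check: the value must parse as a number within [lo, hi]."""
    def check(value):
        try:
            return lo <= float(value) <= hi
        except (TypeError, ValueError):
            return False
    return check

def matches(pattern):
    """Pattern check: the whole value must match the regular expression."""
    compiled = re.compile(pattern)
    def check(value):
        return compiled.fullmatch(value) is not None
    return check

# Hypothetical static rules, keyed by field name (assumed, not from any tool).
RULES = {
    "price": in_range(0.0, 10000.0),
    "sku": matches(r"[A-Z]{3}-\d{4}"),
}

def validate(record):
    """Return the names of the fields that fail their rule."""
    return [name for name, rule in RULES.items()
            if not rule(record.get(name, ""))]
```

Calling `validate` on a record returns an empty list when every field passes, and otherwise the list of failing field names, which a job could use to route bad records to a separate output.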

Sometimes it’s not feasible to define the rules statically, because there could be too many variables and the variables could be non-stationary. Data is non-stationary when its statistical properties change with time. In this post, we will go through a technique for detecting whether some numerical data is outside an acceptable range by detecting outliers.
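As a rough illustration of deriving the acceptable range from the data itself rather than from a fixed rule, the sketch below flags values by z-score. The z-score method and the threshold of 3.0 are assumptions made for this example; the excerpt does not say which outlier technique the post uses.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean of `values`. The threshold is an assumed default."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / std > threshold]
```

Because the mean and standard deviation are recomputed from each batch, the effective limits move with the data, which is what makes this kind of check usable when the data is non-stationary.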

Posted in Big Data, ETL, Hadoop and Map Reduce, Internet of Things, Outlier Detection, Statistics

Is Bigger Data Better for Machine Learning

I have often seen the topic of data size as it relates to machine learning being discussed. But what I see is mostly opinions, views and innuendos, not backed by any rational explanation. Comments like “we need meaningful data, not big data” or “insightful data, not big data” abound, and they often leave me puzzled. How does Big Data preclude meaning or insight in the data? Motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore theories of Machine Learning to get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.

In this post I will summarize my findings. As part of this exercise, I have implemented a Python script to estimate the minimum training set size required by a learner, based on PAC theory.
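For a sense of what such an estimate looks like, here is a minimal sketch using one standard PAC bound: for a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data, m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples suffice. Whether the script mentioned above uses this particular bound is an assumption; the function name and parameters are hypothetical.

```python
import math

def pac_min_samples(hypothesis_count, epsilon, delta):
    """Sample size sufficient for a consistent learner over a finite
    hypothesis space: m >= (ln|H| + ln(1/delta)) / epsilon.

    hypothesis_count: size of the hypothesis space |H|
    epsilon: maximum tolerated error of the learned hypothesis
    delta: maximum tolerated probability of exceeding that error
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)
```

The bound grows only logarithmically with the size of the hypothesis space and with 1/delta, but linearly with 1/epsilon, so tightening the error tolerance is what drives the required training set size up fastest.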

Posted in Big Data, Machine Learning, Predictive Analytic

Bulk Insert, Update and Delete in Hadoop Data Lake

A Hadoop Data Lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only fashion and is easy to handle. Such data tends to be fact data, e.g., user behavior tracking data or sensor data.

However, dimension data or master data, e.g., customer data or product data, will typically be mutable. Generally it arrives in batches from some external source system, reflecting incremental inserts, updates and deletes in that system. All incremental changes need to be consolidated with the existing master data. This is a problem faced in many Hadoop or Hive based Big Data analytic projects. Even if you are using the latest version of Hive, there is no bulk update or delete support. This post is about a Map Reduce job that will perform bulk insert, update and delete with data in HDFS.
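The consolidation logic itself can be sketched in a few lines of Python. This is an in-memory illustration only, not the Map Reduce job described in the post: the operation codes (`insert`, `update`, `delete`) and the tuple layout of a change are assumptions made for the example.

```python
def apply_changes(master, changes):
    """Consolidate a batch of incremental changes with master data.

    master: dict mapping record key -> record
    changes: iterable of (op, key, record) tuples, applied in order;
             record is ignored for deletes. The op codes are hypothetical.
    Returns a new dict and leaves `master` unmodified.
    """
    merged = dict(master)
    for op, key, record in changes:
        if op in ("insert", "update"):
            # An insert or update replaces whatever is stored under the key.
            merged[key] = record
        elif op == "delete":
            merged.pop(key, None)
        else:
            raise ValueError("unknown operation: " + repr(op))
    return merged
```

In a distributed setting the same idea is usually expressed by grouping the existing records and the incoming changes by record key, so that each key can be resolved independently.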

Posted in Big Data, ETL, Hadoop and Map Reduce, Hive

Customer Service and Recommendation System

You may be wondering about the relationship I alluded to in the title. A personalization and recommendation system like sifarish bootstraps from user and item engagement data. This kind of data is gleaned from various signals, e.g., a user’s engagement with various items or a user’s explicit rating of an item. A recommendation system could benefit significantly from customer service data residing in Customer Relationship Management (CRM) systems. It’s yet another customer item engagement signal.

In this post the focus will be on extracting customer engagement data from CRM and how it can be combined with other customer item engagement signals to define a ...

Posted in Big Data, Customer Service, eCommerce, Hadoop and Map Reduce, Recommendation Engine

Real Time Detection of Outliers in Sensor Data using Spark Streaming

As far as analytics of sensor generated data is concerned, in the Internet of Things (IoT) and in a connected-everything world, it’s mostly about real time analytics of time series data. In this post, I will address a use case involving detecting outliers in sensor generated data with Spark Streaming. Outliers are data points that deviate significantly from most of the data. The implementation is part of my new open source project ruscello, implemented in Scala. The project is focused on real time analytics of sensor data with various IoT use cases in mind.

The specific use we will be addressing in this post ...

Posted in Big Data, Internet of Things, Outlier Detection, Real Time Processing, Spark, Time Series Analytic