Data profiling is the process of examining data to learn about its important characteristics. It's an important part of any ETL process, and it's often necessary before embarking on any serious analytic work. I have implemented various open source Hadoop based data profiling Map Reduce jobs. Most of them are in the project chombo; some are in other projects.
I will provide an overview of the various data profiling Hadoop Map Reduce implementations in this post. Continue reading
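To make the idea concrete, here is a minimal sketch, in plain Python rather than one of the actual Map Reduce jobs, of the kind of statistics a profiling pass computes for a single numeric column. The function name and the set of fields are illustrative, not taken from chombo.

```python
def profile_numeric_column(values):
    # Basic profile statistics for one numeric column; None marks a
    # missing value. A real profiling job computes these per column
    # across a distributed data set.
    present = [v for v in values if v is not None]
    n = len(present)
    return {
        "count": len(values),                # total records
        "missing": len(values) - n,          # null / absent values
        "distinct": len(set(present)),       # cardinality
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / n,
    }

stats = profile_numeric_column([3.0, 5.0, None, 5.0, 7.0])
```

Even this toy version surfaces the questions profiling answers: how complete is the column, what is its range, and how many distinct values does it take.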
Anomaly detection with various statistical modeling based techniques is simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and its mean, normalized by the standard deviation. A data point with a Zscore above some threshold is considered a potential outlier. One criticism of Zscore is that it is itself influenced by outliers, since the mean and standard deviation are sensitive to them. To remedy that, a technique called robust Zscore, which is much more tolerant of outliers, can be used.
In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project chombo. Continue reading
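The statistic itself is easy to sketch. The common formulation of robust Zscore replaces the mean with the median and the standard deviation with the MAD (median absolute deviation); the 0.6745 scale factor and the 3.5 cutoff below are conventional choices for this formulation, not values quoted from the post, and this is plain Python, not the chombo Map Reduce job.

```python
import statistics

def robust_zscore(x, data):
    # Median and MAD are far less sensitive to extreme values than
    # mean and standard deviation; 0.6745 rescales MAD so it is
    # comparable to the standard deviation for normal data.
    med = statistics.median(data)
    mad = statistics.median([abs(v - med) for v in data])
    return 0.6745 * abs(x - med) / mad

data = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
score = robust_zscore(100.0, data)  # far above a typical 3.5 cutoff
```

Note that the single extreme value 100.0 barely moves the median and MAD, which is exactly why the robust version tolerates the outliers it is trying to detect.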
Data quality is a thorny issue in most Big Data projects. It's been reported that up to two thirds of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation features that have been added recently to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable common validation functions is provided out of the box. I will use product data as a test case. Continue reading
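As a rough illustration of what "configurable validation functions" means, here is a hedged Python sketch, not chombo's actual API: each field is bound to a check such as a numeric range or a text pattern, and a record fails validation if any bound check fails. The field names and the SKU pattern are made up for the example.

```python
import re

# Two common validator factories: limit checking for numbers,
# pattern matching for text (illustrative, not chombo's API).
def min_max_check(lo, hi):
    return lambda v: lo <= v <= hi

def pattern_check(regex):
    pat = re.compile(regex)
    return lambda v: bool(pat.fullmatch(v))

# Configuration binds a validator to each field of a product record.
validators = {
    "price": min_max_check(0.0, 10000.0),
    "sku": pattern_check(r"[A-Z]{3}-\d{4}"),
}

record = {"price": 19.99, "sku": "ABC-1234"}
invalid_fields = [f for f, check in validators.items()
                  if not check(record[f])]
```

The appeal of this style is that adding a new rule is a configuration change, not a code change, which matters when validation logic evolves faster than the pipeline.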
For online users, conversion generally refers to a user action that results in some tangible gain for a business, e.g., a user opening an account or making his or her first purchase. Next to drawing a large number of users to a web site, getting a user to convert is the most critical event in a user's relationship with an online business. Being able to predict when a user will convert to become a customer is an important tool that online businesses should have at their disposal. A business could initiate a targeted marketing campaign based on the prediction result.
In this post, I will use online behavior data to predict whether a user will convert, using a Markov Chain Classifier. The implementation is Hadoop based. Continue reading
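The core idea of a Markov chain classifier can be sketched in a few lines: train one transition matrix per class (converter and non-converter), then score a user's event sequence by its log likelihood under each matrix and pick the more likely class. The states and probabilities below are toy values invented for the example, and this is plain Python, not the Hadoop implementation.

```python
import math

# One transition matrix per class over toy behavior states
# (illustrative numbers; in practice these are estimated from
# labeled click sequences).
trans = {
    "converter": {
        "browse": {"browse": 0.3, "search": 0.3, "cart": 0.4},
        "search": {"browse": 0.2, "search": 0.3, "cart": 0.5},
        "cart":   {"browse": 0.2, "search": 0.2, "cart": 0.6},
    },
    "non_converter": {
        "browse": {"browse": 0.6, "search": 0.3, "cart": 0.1},
        "search": {"browse": 0.5, "search": 0.4, "cart": 0.1},
        "cart":   {"browse": 0.7, "search": 0.2, "cart": 0.1},
    },
}

def classify(sequence):
    # Log likelihood of the observed transitions under each class;
    # the class with the higher value wins.
    def log_lik(label):
        return sum(math.log(trans[label][a][b])
                   for a, b in zip(sequence, sequence[1:]))
    return max(trans, key=log_lik)

label = classify(["browse", "search", "cart", "cart"])
```

A sequence that keeps moving toward the cart scores higher under the converter matrix, which is the intuition the classifier exploits.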
For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent in data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for data validation. Some examples of such rules are limit checking for numerical quantities and pattern matching for text data.
Sometimes it's not feasible to define the rules statically, because there may be too many variables, and the variables may be non-stationary. Data is non-stationary when its statistical properties change with time. In this post, we will go through a technique for detecting whether some numerical data is outside an acceptable range by detecting outliers. Continue reading
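One simple way to turn a static limit check into a dynamic one, offered here as an illustration rather than the technique the post actually uses, is to derive the acceptable range from a recent window of the data itself, for example mean plus or minus a few standard deviations, so the bounds track a drifting signal.

```python
import statistics

def dynamic_bounds(window, k=3.0):
    # Acceptable range derived from recent data instead of a static
    # rule: mean +/- k standard deviations over the window.
    mean = statistics.mean(window)
    std = statistics.pstdev(window)
    return mean - k * std, mean + k * std

window = [20.0, 21.0, 19.0, 20.5, 19.5]
lo, hi = dynamic_bounds(window)
is_outlier = not (lo <= 55.0 <= hi)  # 55.0 falls well outside
```

As the window slides forward, the bounds follow the data's current mean and spread, which is what a static ETL rule cannot do.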
I have often seen the topic of data size as it relates to machine learning being discussed. But the discussions are mostly opinions, views and innuendos, not backed by any rational explanation. Comments like "we need meaningful data, not big data" or "insightful data, not big data" abound, which often leave me puzzled. How does Big Data preclude meaning or insight in the data? Motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore theories of Machine Learning to get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.
In this post I will summarize my findings. As part of this exercise, I have implemented a Python script to estimate the minimum training set size required by a learner, based on PAC theory. Continue reading
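To give a flavor of the kind of estimate PAC theory yields, here is a sketch of the standard sample complexity bound for a consistent learner over a finite hypothesis space: m >= (1/epsilon) * (ln|H| + ln(1/delta)), where epsilon is the tolerated error and 1 - delta the desired confidence. This is the textbook bound, not necessarily the exact formula the post's script uses.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    # Minimum training set size guaranteeing that, with probability
    # at least 1 - delta, a consistent learner over a finite
    # hypothesis space of size |H| has true error at most epsilon:
    #   m >= (1/epsilon) * (ln|H| + ln(1/delta))
    return math.ceil(
        (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    )

# e.g. |H| = 2^10 hypotheses, 5% error, 95% confidence
m = pac_sample_size(hypothesis_count=2**10, epsilon=0.05, delta=0.05)
```

Notice the bound grows only logarithmically in the number of hypotheses but linearly in 1/epsilon: demanding lower error drives the data requirement up much faster than a richer model does.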
A Hadoop data lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only fashion and is easy to handle. Such data tends to be fact data, e.g., user behavior tracking data or sensor data.
However, dimension data or master data, e.g., customer data or product data, will typically be mutable. Generally it arrives in batches from some external source system, reflecting incremental inserts, updates and deletes in that system. All incremental changes need to be consolidated with the existing master data. This is a problem faced in many Hadoop or Hive based Big Data analytic projects. Even in the latest version of Hive, there is no bulk update or delete support. This post is about a Map Reduce job that performs bulk insert, update and delete on data in HDFS. Continue reading
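The consolidation logic itself is a key-based merge: bring master records and incremental changes together by key and let the change win. The sketch below shows the idea in plain Python with an invented "U"/"D" operation flag; in the actual Map Reduce job the grouping by key happens in the shuffle phase and the reducer applies this same resolution.

```python
# Existing master data, keyed by record id (toy product records).
master = {
    "p1": {"id": "p1", "price": 10},
    "p2": {"id": "p2", "price": 20},
}

# Incremental batch in arrival order; "U" covers both insert and
# update (upsert), "D" is delete. Flag names are illustrative.
changes = [
    ("U", {"id": "p2", "price": 25}),  # update existing record
    ("U", {"id": "p3", "price": 30}),  # insert new record
    ("D", {"id": "p1"}),               # delete existing record
]

def consolidate(master, changes):
    # For each key, the incremental change overrides the master copy.
    merged = dict(master)
    for op, rec in changes:
        if op == "D":
            merged.pop(rec["id"], None)
        else:
            merged[rec["id"]] = rec
    return merged

result = consolidate(master, changes)
```

Because the output is a complete rewritten copy of the master data, the approach sidesteps the lack of in-place update in HDFS: files are never modified, only replaced.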