This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Since more than 50% of the time in most data projects is spent on data cleaning and munging, I have added significant ETL functionality to my OSS project chombo on GitHub, including validation, transformation and profiling.
This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using the retail sales data for a fictitious multinational retailer to showcase Continue reading
Time sequence data, which is all around us, may contain seasonal components. Data is seasonal when it contains a seasonal component, e.g., month of the year, day of the week or hour of a week day. A seasonal component is defined by a time range and a period.
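To make the idea concrete, here is a minimal Python sketch that maps each timestamp to a seasonal cycle index and averages the values per index. It is only an illustration of the concept, not the chombo implementation; the function names `seasonal_index` and `seasonal_means` and the three cycle names are assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def seasonal_index(ts: datetime, cycle: str) -> int:
    """Map a timestamp to its position within a seasonal cycle (hypothetical helper)."""
    if cycle == "month_of_year":
        return ts.month - 1      # 0 (January) .. 11 (December)
    if cycle == "day_of_week":
        return ts.weekday()      # 0 (Monday) .. 6 (Sunday)
    if cycle == "hour_of_day":
        return ts.hour           # 0 .. 23
    raise ValueError(f"unknown seasonal cycle: {cycle}")


def seasonal_means(records, cycle: str) -> dict:
    """Average the values of (timestamp, value) pairs per seasonal index."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[seasonal_index(ts, cycle)].append(value)
    return {idx: sum(vals) / len(vals) for idx, vals in buckets.items()}


if __name__ == "__main__":
    data = [
        (datetime(2024, 1, 1, 9), 10.0),   # a Monday
        (datetime(2024, 1, 8, 9), 20.0),   # the following Monday
        (datetime(2024, 1, 2, 9), 5.0),    # a Tuesday
    ]
    print(seasonal_means(data, "day_of_week"))
```

If the per-index averages differ markedly from each other, that is a hint that the chosen cycle carries a seasonal effect.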
My open source project chombo has solutions for seasonality analysis. The solution is twofold. First, there is a map reduce job to detect seasonality in data. Second, there is another map reduce job to calculate statistics of the seasonal components. In this post we will go through the steps for analysing operational data, with seasonal Continue reading
Data profiling is the process of examining data to learn about its important characteristics. It’s an important part of any ETL process. It’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented various open source Hadoop based data profiling Map Reduce jobs. Most of them are in the project chombo. Some are in other projects.
I will provide an overview of the various data profiling Hadoop Map Reduce implementations in this post. Continue reading
Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and the mean, normalized by the standard deviation. A data point with a Zscore value above some threshold is considered to be a potential outlier. One criticism of Zscore is that it’s prone to be influenced by outliers, because the mean and standard deviation are themselves distorted by them. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers.
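As a rough illustration of the difference, the Python sketch below computes both scores for a small sample. The robust variant shown here uses the common median and median absolute deviation (MAD) formulation with the 0.6745 scaling constant; that choice is my assumption for this example and is not necessarily the exact formula used in chombo. The function names are likewise made up for illustration.

```python
from statistics import mean, median, pstdev


def zscores(values):
    """Ordinary Zscore: |x - mean| / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [abs(v - mu) / sigma for v in values]


def robust_zscores(values):
    """Robust Zscore (assumed formulation): 0.6745 * |x - median| / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    return [0.6745 * abs(v - med) / mad for v in values]


if __name__ == "__main__":
    sample = [10, 11, 9, 10, 12, 10, 11, 100]
    threshold = 3.0
    for v, z, rz in zip(sample, zscores(sample), robust_zscores(sample)):
        print(v, round(z, 2), round(rz, 2), "outlier" if rz > threshold else "")
```

In this sample the value 100 inflates both the mean and the standard deviation, so its ordinary Zscore stays under the threshold of 3, while its robust Zscore is far above it.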
In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project chombo. Continue reading
Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation features that have been added recently to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable common validation functions is provided out of the box. I will use product data as a test case Continue reading
For online users, conversion generally refers to a user action that results in some tangible gain for a business, e.g., a user opening an account or making his or her first purchase. Next to drawing a large number of users to a web site, getting a user to convert is the most critical event in a user’s relationship with an online business. Being able to predict when a user will convert to become a customer is an important tool that online businesses should have at their disposal. A business could initiate a targeted marketing campaign based on the prediction result.
In this post, I will be using user online behavior data to predict whether a user will convert, using a Markov Chain Classifier. The Hadoop based implementation Continue reading
For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent in data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for data validation. Some examples of such rules are limit checking for numerical quantities and pattern matching for text data.
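A statically defined rule set of this kind could look like the following Python sketch. The field names (`quantity`, `sku`), the limits and the pattern are all invented for the example, and this is not how any particular ETL tool configures its rules.

```python
import re


def in_range(value: float, low: float, high: float) -> bool:
    """Limit check for a numerical quantity."""
    return low <= value <= high


def matches(text: str, pattern: str) -> bool:
    """Pattern match for text data; the whole field must match."""
    return re.fullmatch(pattern, text) is not None


# Static rules, fixed ahead of time (hypothetical fields and limits).
RULES = {
    "quantity": lambda v: in_range(int(v), 1, 1000),
    "sku": lambda v: matches(v, r"[A-Z]{3}-\d{4}"),
}


def validate(record: dict) -> list:
    """Return the names of the fields that violate their rule."""
    return [field for field, rule in RULES.items() if not rule(record[field])]


if __name__ == "__main__":
    print(validate({"quantity": "5", "sku": "ABC-1234"}))
    print(validate({"quantity": "0", "sku": "abc-12"}))
```

The limits and patterns are hard coded, which is exactly what makes such rules static.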
Sometimes it’s not feasible to define the rules statically, because there could be too many variables and the variables could be non-stationary. Data is non-stationary when its statistical properties change with time. In this post, we will go through a technique for detecting whether some numerical data is outside an acceptable range by detecting outliers. Continue reading