Analyzing vast amounts of machine-generated unstructured or semi-structured data is Hadoop's forte. Many of us have gone through the exercise of searching log files, most likely with grep, for some pattern and then looking at the surrounding log lines for context to better understand some incident or event reported in the log.
This effort could be necessary as part of troubleshooting a problem, or to gain insight from log data about some significant event. The Hadoop-based solution presented in this post can be thought of as grep on steroids.
When dealing with large volumes of log data, the manual way of searching is not viable. One of my earlier posts outlined a Hadoop-based solution for this problem. In this post, the focus will be on the implementation, which is part of my open source project visitante. Continue reading
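The core idea of searching for a pattern and keeping the surrounding lines as context can be sketched in a few lines. This is a minimal, in-memory illustration of the concept (mimicking grep's -B/-A flags), not the actual MapReduce implementation in visitante:

```python
from collections import deque

def grep_with_context(lines, pattern, before=2, after=2):
    """Yield (line_no, line) for lines matching pattern, plus the
    surrounding context lines, like grep -B/-A."""
    buf = deque(maxlen=before)   # trailing window of preceding lines
    pending = 0                  # context lines still owed after a match
    for i, line in enumerate(lines, start=1):
        if pattern in line:
            # flush the buffered lines that precede the match
            for j, ctx in enumerate(buf, start=i - len(buf)):
                yield j, ctx
            buf.clear()
            yield i, line
            pending = after
        elif pending > 0:
            yield i, line
            pending -= 1
        else:
            buf.append(line)

log = ["start", "ok", "ERROR disk full", "retry", "ok", "done"]
hits = list(grep_with_context(log, "ERROR", before=1, after=1))
# hits == [(2, "ok"), (3, "ERROR disk full"), (4, "retry")]
```

In the distributed version, the sliding context window has to be maintained per input split, which is what makes the MapReduce formulation interesting.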
Association mining solves many real-life problems, e.g., items frequently bought together, or songs frequently listened to in one session. Apriori is a popular algorithm for mining frequent item sets. In this post, we will go over a Hadoop-based implementation of Apriori available in my open source project avenir. Frequent item mining results can also be used for collaborative filtering based recommendation and rule mining.
We will use retail sales data with a twist. Our interest will be Continue reading
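To make the algorithm concrete, here is a minimal in-memory sketch of Apriori: count single items, keep those meeting minimum support, then repeatedly join frequent (k-1)-sets into k-set candidates and count again. This is purely illustrative; the Hadoop version in avenir distributes the counting across mappers and reducers:

```python
def apriori(transactions, min_support):
    """Return all frequent item sets with support count >= min_support.
    In-memory conceptual sketch of the level-wise Apriori algorithm."""
    transactions = [frozenset(t) for t in transactions]
    # level 1: count individual items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = {s: counts[s] for s in frequent}
    k = 2
    while frequent:
        # candidate generation: join frequent (k-1)-sets into k-sets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s for s, c in counts.items() if c >= min_support}
        all_frequent.update({s: counts[s] for s in frequent})
        k += 1
    return all_frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}]
freq = apriori(baskets, min_support=2)
# every single item and every pair is frequent; the triple is not
```

The Apriori pruning property, that no superset of an infrequent set can be frequent, is what keeps the candidate space manageable at each level.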
This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant ETL functionality to my OSS project chombo on GitHub, including validation, transformation and profiling.
This post will focus on data transformation. Validation, transformation and profiling are the key activities in any ETL process. I will be using the retail sales data for a fictitious multinational retailer to showcase Continue reading
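As a flavor of what configurable field-level transformation looks like, here is a hypothetical minimal sketch: a registry of named transformers applied per field according to a spec. The names and structure here are illustrative only, not chombo's actual configuration format:

```python
# Hypothetical registry of named field transformers (illustrative only)
TRANSFORMERS = {
    "trim":   lambda v: v.strip(),
    "upper":  lambda v: v.upper(),
    "intify": lambda v: int(float(v)),
}

def transform_record(record, spec):
    """Apply an ordered list of transformer names to each field index."""
    out = list(record)
    for idx, names in spec.items():
        for name in names:
            out[idx] = TRANSFORMERS[name](out[idx])
    return out

row = ["  store21 ", "usa", "129.99"]
clean = transform_record(row, {0: ["trim", "upper"], 2: ["intify"]})
# clean == ["STORE21", "usa", 129]
```

The appeal of this style is that transformation logic lives in configuration, so new pipelines can be assembled without writing new MapReduce code.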
Time sequence data, which is all around us, may contain seasonal components. Data is seasonal when its values vary with a recurring cycle, e.g., month of the year, day of the week, or hour of the weekday. A seasonal component is defined by a time range and a period.
My open source project chombo has solutions for seasonality analysis. The solution is twofold. First, there is a Map Reduce job to detect seasonality in data. Second, there is another Map Reduce job to calculate statistics of the seasonal components. In this post we will go through the steps for analyzing operational data, with seasonal Continue reading
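The second step, computing statistics per seasonal component, amounts to grouping each value by its cycle index (for example, hour of day) and aggregating. A minimal single-machine sketch of that aggregation, which the Map Reduce job performs in a distributed fashion:

```python
import math
from collections import defaultdict

def seasonal_stats(records, cycle_fn):
    """Group (timestamp, value) records by a seasonal cycle index and
    compute mean and population std dev per index."""
    groups = defaultdict(list)
    for ts, value in records:
        groups[cycle_fn(ts)].append(value)
    stats = {}
    for idx, vals in groups.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[idx] = (mean, math.sqrt(var))
    return stats

# timestamps in seconds since epoch; cycle index = hour of day
hour_of_day = lambda ts: (ts // 3600) % 24
data = [(0, 10.0), (86400, 14.0), (3600, 50.0), (90000, 54.0)]
stats = seasonal_stats(data, hour_of_day)
# stats[0] == (12.0, 2.0) and stats[1] == (52.0, 2.0)
```

In the MapReduce formulation, the cycle index becomes the mapper output key, so each reducer sees all the values for one seasonal component.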
Data profiling is the process of examining data to learn about its important characteristics. It's an important part of any ETL process. It's often necessary to do data profiling before embarking on any serious analytic work. I have implemented various open source Hadoop-based data profiling Map Reduce jobs. Most of them are in the project chombo. Some are in other projects.
I will provide an overview of the various data profiling Hadoop Map Reduce implementations in this post. Continue reading
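To give a sense of what a profile contains, here is a minimal sketch of a per-column profile: record count, missing values, cardinality, and numeric range. This is only a conceptual illustration; the actual jobs compute these per field across files on HDFS:

```python
def profile_column(values):
    """Basic per-column profile: count, missing, cardinality, numeric range."""
    non_missing = [v for v in values if v not in ("", None)]
    numeric = []
    for v in non_missing:
        try:
            numeric.append(float(v))
        except ValueError:
            pass   # non-numeric field value; excluded from range stats
    return {
        "count": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(set(non_missing)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }

col = ["3.5", "7.0", "", "3.5", "n/a"]
prof = profile_column(col)
# {'count': 5, 'missing': 1, 'distinct': 3, 'min': 3.5, 'max': 7.0}
```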
Anomaly detection with various statistical modeling based techniques is simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and its mean, normalized by the standard deviation. A data point with a Zscore above some threshold is considered to be a potential outlier. One criticism of Zscore is that it's prone to be influenced by outliers, since both the mean and the standard deviation are themselves sensitive to extreme values. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers.
In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project chombo. Continue reading
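The robust Zscore replaces the mean with the median and the standard deviation with the scaled median absolute deviation (MAD), so a handful of extreme values cannot distort the baseline they are measured against. A minimal sketch of the computation (again, a conceptual illustration rather than the Hadoop implementation in chombo):

```python
def robust_zscores(data):
    """Robust Zscore: |x - median| / (MAD * 1.4826), where MAD is the
    median absolute deviation. The 1.4826 factor makes MAD consistent
    with the standard deviation for normally distributed data."""
    def median(vals):
        s = sorted(vals)
        n, mid = len(vals), len(vals) // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    med = median(data)
    mad = median([abs(x - med) for x in data])
    scale = mad * 1.4826
    return [abs(x - med) / scale for x in data]

readings = [10.0, 10.5, 9.8, 10.2, 50.0]   # 50.0 is an obvious outlier
scores = robust_zscores(readings)
outliers = [x for x, z in zip(readings, scores) if z > 3.0]
# outliers == [50.0]
```

With the ordinary Zscore on this data, the outlier would inflate the mean and standard deviation enough to partially mask itself; the median and MAD are barely affected by it.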
Data quality is a thorny issue in most Big Data projects. It's been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation features that have been added recently to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable common validation functions is provided out of the box. I will use product data as a test case Continue reading
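To illustrate what configurable validation looks like, here is a hypothetical minimal sketch: a registry of named validators, with per-field rules supplied as configuration. The validator names and rule format here are illustrative only, not chombo's actual API:

```python
# Hypothetical registry of named field validators (illustrative only)
VALIDATORS = {
    "notMissing": lambda v, arg: v not in ("", None),
    "maxLength":  lambda v, arg: len(v) <= arg,
    "inRange":    lambda v, arg: arg[0] <= float(v) <= arg[1],
}

def validate_record(record, rules):
    """Return a list of (field_index, rule_name) for every failed check."""
    failures = []
    for idx, checks in rules.items():
        for name, arg in checks:
            if not VALIDATORS[name](record[idx], arg):
                failures.append((idx, name))
    return failures

rules = {0: [("notMissing", None), ("maxLength", 12)],
         1: [("inRange", (0.0, 1000.0))]}
bad = validate_record(["", "1500.0"], rules)
# bad == [(0, "notMissing"), (1, "inRange")]
```

Because the rules are plain configuration, the same validation job can be pointed at different data sets without code changes, which is the main appeal of the out-of-the-box validator approach.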