In the life of a Data Scientist, it’s not uncommon to run into a data set with no knowledge or very little knowledge about the data. You may be interested in learning about such data with missing meta data through some tools instead of going through the tedious process of manually perusing the data and try to make sense out of it.
In this a post we will go through a Spark based implementation to automatically discover data types for the various fields in a data set. The implementation in available in my OSS project chombo.
Data type discovery is only one of the ways Continue reading
Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly.
In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted from lack of normalization
The Spark based implementation is available in my open source project chombo. Continue reading
If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process and parallel cluster processing with Hadoop or Spark becomes a necessity. In this post we will focus on de duplication based on exact match, whether for the whole record or set of specified key fields. De duplication can also be performed based on fuzzy matching. We will address de duplication for flat record oriented data only.
The Spark based implementation is available Continue reading
Typical training data set for real world machine learning problems has mixture of different types of data including numerical and categorical. Many machine learning algorithms can not handle categorical variables. Those that can, categorical data can pose a serious problem if they have high cardinality i.e too many unique values.
In this post we will go though a technique to convert high cardinality categorical attributes to numerical values, based on how the categorical variable correlates with the class or target variable. The Map Reduce implementations are available Continue reading
Most supervised Machine Learning algorithms face difficulty when there is class imbalance in the training data i.e., amount of data belonging one class heavily outnumber the other class. However, there are may real life problems where we encounter this situation e.g., fraud, customer churn and machine failure. There are various techniques to address this thorny problem of class imbalance.
In this post we will go over a technique based on oversampling o the minority class data called Synthetic Minority Over-sampling Technique (SMOTE). We will go into the details of a Hadoop based implementation using machine failure data Continue reading
Measuring campaign effectiveness is critical for any company to justify the marketing money being spent. Consider a company providing a free online service on signup. It’s critical for the company to convert them so that they subscribe to a paid service as soon as possible.
In this post, we will use simple statistical techniques to find the relative merits for different campaigns in terms of effectiveness which is measured by conversions. The Spark based solution is available Continue reading
Missing values are just part of life in the data processing world. In most cases you can not simply ignore the missing values as it may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing data consists of two steps, gaining some insight on missing fields in the data and then taking some actions based on the insight gained from the first step. In this post the focus will be primarily on the first step.
The Hadoop based implementation is available in my OSS project chombo on github. In future Continue reading