Data profiling is the process of examining data to learn about important characteristics of data. It’s an important part of any ETL process. It’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented various open source Hadoop based data profiling Map Reduce jobs. Most of them are in the project chombo. Some are are in other projects.
I will provide an overview of the various data profiling Hadoop Map Reduce implementations in this post. It complements my earlier post on Hadoop Map Reduce based data validation.
What Constitutes Data Profiling
Broadly speaking, data profiling focuses on analyzing the following characteristics of data. Some of them are only relevant for certain data types. for example, it makes very little sense to do uniqueness analysis for real valued data.
Most of the analysis involves only one attribute. However, some are based on pairs of attributes or multiple attributes.
- Completeness : How complete is the data? What percentage of records has missing or null values for some column
- Uniqueness : How many unique values does a column have? Does a column that is supposed to be unique key, have all unique values?
- Distribution : What is the distribution of values (i.e histogram) for a certain column?
- Basic statistics : For numerical data, what is the mean, standard deviation, minimum, maximum and median
- Pattern matching : What patterns are matched by data values of a column? Are there data values that don’t match any of the expected patterns
- Outliers : Are there outliers in the numerical data. You may be interested in removing them before further analysis.
- Correlation : What is the correlation between two given attributes? This kind of profiling may be important for feature analysis prior to building predictive models
- Functional dependency : Is there functional dependency between two attributes?
Next, we will explore various data profiling techniques including Map Reduce implementations for various data types.
Attribute of Any Data Type
Completeness analysis can be performed for any data type. We may be interested in counting missing values or null values for all columns of data. Perhaps, a column with too many missing values will be barred from use in further analysis.
The map reduce class ValueCounter counts number of occurrences of specified values for an attribute. The values for an attribute is specified as a list of values as follows
Missing value can be specified either by having a missing value in the coma separated list of values or by not having any value for the parameter as below
The map reduce output consists of attribute ordinal, value and the count for that value as follows. According to the output, the second column had 23 missing values.
It’s useful for finding the distribution between different values.It can also be used to check membership of categorical attributes.
The Map Reduce class PatternMatchedValueCounter will count number of matches, given one or more regular expression patterns for a column. It’s output is as follows
attr ordinal,pattern,match count
If a field (e.g. phone number) is supposed to be in one of some specific number of formats, this Map Reduce class helps to identify any violation. It tells us the distribution between different formats as found in the data.
The Map Reduce class UniqueCounter finds unique values for specified attributes. It will out either the count of unique values or the actual unique values. The output is as follows
attr ordinal,unique value count
If a set of attributes together is supposed to be unique, the Map Reduce class UniqueKeyAnalyzer, will find any violation of uniqueness of a set of attributes constituting the uniqye key. The output will contain records as below, only if the attribute set is not found to be unique
attr1 ordinal,attr2 ordinal,count
The output is generated only when count is greater than 1, for some unique key attribute combination.
The Map Reduce class MissingValueCounter counts missing fields row wise or column wise. When row wise the output is as follows
row content or row key,count of missing fields
For column wise missing value counting, the output is as follows
column index,count of missing fields
There are several Map Reduce jobs for profiling for columns with numerical data. The map reduce class NumericalAttrStats calculates min, max, mean, std deviation for specified attributes. A typical output line will be as follows, assuming partitioning with 2 attributes.
partId1,partId2,attr ordinal,sum,sum square,count,mean,variance,std dev,min,max
If the data is partitioned, we get the output for each partition id and attribute ordinal combination. The partitioning may be based on one or more attributes. The output consists of lot of intermediate results.
The Map Reduce class NumericalAttrMedian is similar to NumericalAttrStats. It calculates median and median absolute deviation. It runs in two rounds. In the first round median is calculated. In the second round median absolute deviation is calculated which is based on median. The output is as follows.
partId1,partId2,attr ordinal,median or median absolute deviation
The Map Reduce class NumericalAttrDistrStats calculates distribution or histogram. Bucket width is specified for each attribute through configuration. The output is as follows, with one record per attribute.
attr ordinal,value and frequency pairs,mode,25 percentile, 50 percentile, 75 percentile
To find outliers in numerical data, the Map reduce class NumericalAttrStats can be used to find mean and standard deviation and then the Map Reduce class ValidationChecker, appropriately configured for Zscore validator, can be used to find outliers.
For each validator violation found for any attribute in any record, generates output as below. The first line is the record where violation was found. The second line contains the offending attribute index or ordinal. The third line contains the type of validator that detected validation failure
record attr ordinal validator type
To remove the influence of existing outliers, a better option is to use NumericalAttrMedian to calculate median and median absolute deviation and then to use ValidationChecker.
The Map Reduce class CategoricalAttrDistrStats calculates histogram of categorical attribute values. It’s output is as follows.
attr ordinal,value and frequency pairs,entropy,mode
For categorical attributes, entropy provides a measure of deviation in the data. The mode is the categorical value with maximum frequency.
The Map Reduce class CategoricalAttrClassDistr calculates class or target attribute value distribution for each categorical attribute value. The output is as follows
partition IDs,attr ordinal,attr value,value and frequency pairs for class values,entropy,mode
The profiling done for String attributes is similar to what’s done for numerical attributes with NumericalAttrStats, except that we analyze string length. The Map Reduce class responsible is StringAttrStats. The output is as follows.
attr ordinal,sum,sum square,count,mean,variance,std dev,min,max
There is one record for each configured attribute.
Numerical Attribute Pair
The Map Reduce class NumericalCorrelation calculates correlation between two numerical attributes. The output is as follows
attr1 ordinal,attr2 ordinal,correlation
Categorical Attribute Pair
The Map Reduce class HeterogeneityReductionCorrelation calculates correlation between two categorical attributes. The output is as follows
Correlation calculation uses a contingency table and the algorithm can be based on either uncertainty coefficient or concentration coefficient.
Any Attribute Pair
Functional dependency analysis between two attributes applies to all data type except Text and Double. An example of functional dependency is that a mobile phone belongs to one user only.
This kind of analysis done with the Map Reduce class FunctionalDependencyAnalyzer will find whether the functional dependency between two attributes is violated i.e. the same mobile phone number belonging to multiple users.
attr,source attr value,target attr value
If there was violation, we would have seen multiple target attribute values in the output above.
The Map Reduce class MutualInformation can be used for feature engineering and specifically for feature sub set selection before building predictive analytic model. It involves all feature attributes and the class attribute.
It calculates some score based on mutual information. The score is higher when a feature attribute is weakly correlated with other feature attributes and strongly correlated with class attribute.
The MapReduce class DataTypeInferencer discovers data types of all columns or specified columns. The type are numeric integer, numeric float and string. The output is as follows
attr column ordinal, data type
It gives insight into data distribution in a multi dimensional attribute space. The distribution can also be used for detecting outliers.
The Map Reduce class RecordSimilarity can be used to find similarities between records. For a given record, all similar records are assigned a score reflecting similarities between records. It can be used, among other uses for deduplication.
Summary of Map Reduce Classes
Here is a summary of all the Map Reduce classes discussed so fa data profiling. They are categorized along two dimensions, how many attributes are involved and the attribute data type.
|Map Reduce Class||Comment||Attribute Type|
|ValueCounter||Counts specified values for attributes||Single attribute of any type|
|PatternMatchedValueCounter||Counts values based on regex pattern||Single attribute of any type|
|UniqueCounter||Counts unique values for attributes||Single attribute of any type|
|UniqueKeyAnalyzer||Verifies uniqueness of composite key||Single attribute of any type|
|NumericalAttrStats||Calculates mean, std dev etc||Single attribute of numerical type|
|NumericalAttrMedian||Caluclates median and median absolute deviation||Single attribute of numerical type|
|NumericalAttrDistrStats||Calculates histogram||Single attribute of numerical type|
|ValidationChecker||Finds outliers||Single attribute of numerical type|
|CategoricalAttrDistrStats||Calculates histogram||Single attribute of categorical type|
|StringAttrStats||Calculates mean, std dev etc for string length||Single attribute of string type|
|NumericalCorrelation||Correlation between attributes||Two attribute of numerical type|
|HeterogeneityReductionCorrelation||Correlation between attributes||Two attribute of categorical type|
|FunctionalDependencyAnalyzer||Verifies functional dependency||Two attributes of any type|
|MutualInformation||Feature correlation and relevance to class variable||All attributes|
|MutiVariateDistribution||Calculates multi variate distribution||All attributes of numerical and categorical type|
|RecordSimilarity||Similarity between records||All attributes of any type|
|MissingValueCounter||Counts missing values, row wise or column wise||All attributes of any type type|
|CategoricalAttrClassDistr||Class value distr for categorical attributes||Categorical attributes|
|DataTypeInferencer||Discovers data types||All attributes of any type type|
For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,