Data profiling is the process of examining data to learn about its important characteristics. It’s an important part of any ETL process, and it’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented various open source Hadoop based data profiling Map Reduce jobs. Most of them are in the project *chombo*; some are in other projects.

I will provide an overview of the various data profiling Hadoop Map Reduce implementations in this post. It complements my earlier post on Hadoop Map Reduce based data validation.

## What Constitutes Data Profiling

Broadly speaking, data profiling focuses on analyzing the following characteristics of data. Some of them are only relevant for certain data types. For example, it makes very little sense to do uniqueness analysis for real valued data.

Most of the analysis involves only one attribute. However, some analyses are based on pairs of attributes or multiple attributes.

- **Completeness**: How complete is the data? What percentage of records has missing or null values for some column?
- **Uniqueness**: How many unique values does a column have? Does a column that is supposed to be a unique key have all unique values?
- **Distribution**: What is the distribution of values (i.e. histogram) for a certain column?
- **Basic statistics**: For numerical data, what is the mean, standard deviation, minimum, maximum and median?
- **Pattern matching**: What patterns are matched by data values of a column? Are there data values that don’t match any of the expected patterns?
- **Outliers**: Are there outliers in the numerical data? You may be interested in removing them before further analysis.
- **Correlation**: What is the correlation between two given attributes? This kind of profiling may be important for feature analysis prior to building predictive models.
- **Functional dependency**: Is there a functional dependency between two attributes?

Next, we will explore the data profiling techniques and their Map Reduce implementations for various data types.

## Attribute of Any Data Type

Completeness analysis can be performed for any data type. We may be interested in counting missing values or null values for all columns of data. A column with too many missing values might be barred from use in further analysis.

The Map Reduce class *ValueCounter* counts the number of occurrences of specified values for an attribute. The values for an attribute are specified as a comma separated list, as follows:

values.{attributeOrdinal}=val1,val2,val3

A missing value can be specified either by having an empty entry in the comma separated list of values or by not providing any value for the parameter, as below:

values.{attributeOrdinal}=val1,,val2
values.{attributeOrdinal}=

The Map Reduce output consists of the attribute ordinal, the value and the count for that value, as follows. According to this output, the second column had 23 missing values:

2,,23

It’s useful for finding the distribution of different values. It can also be used to check membership of categorical attribute values.
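To give a feel for the shape of the computation, here is a minimal sketch of a value counting mapper. This is my own illustrative code, not chombo’s implementation; the class name, the hard coded attribute ordinal and the assumption of comma separated input are all mine. A standard summing reducer would then produce the (ordinal, value, count) output shown above.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of a value counting mapper: for each record it emits
// (attribute ordinal + value) as the key and 1 as the count.
public class ValueCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();
    // in the real job this would come from the values.{attributeOrdinal} configuration
    private final int attrOrdinal = 2;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split with limit -1 so trailing missing values are preserved as empty fields
        String[] fields = value.toString().split(",", -1);
        if (attrOrdinal < fields.length) {
            outKey.set(attrOrdinal + "," + fields[attrOrdinal]);
            context.write(outKey, ONE);
        }
    }
}
```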

The Map Reduce class *PatternMatchedValueCounter* will count the number of matches, given one or more regular expression patterns for a column. Its output is as follows:

attr ordinal,pattern,match count

If a field (e.g. phone number) is supposed to be in one of some specific number of formats, this Map Reduce class helps to identify any violation. It tells us the distribution between different formats as found in the data.
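The counting logic itself is straightforward. Here is a minimal single machine sketch, where the two phone number patterns are made up examples rather than anything from the chombo configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal sketch of per-pattern match counting for one column. The two
// phone number patterns are illustrative, not from the chombo configuration.
public class PatternMatchCounter {
    public static void main(String[] args) {
        String[] columnValues = {"415-555-1234", "(415) 555-1234", "5551234"};
        Pattern[] patterns = {
            Pattern.compile("\\d{3}-\\d{3}-\\d{4}"),
            Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}")
        };
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : columnValues) {
            for (Pattern p : patterns) {
                if (p.matcher(value).matches()) {
                    counts.merge(p.pattern(), 1, Integer::sum);
                }
            }
        }
        // values matching no pattern at all are the likely data quality issues
        counts.forEach((pattern, count) -> System.out.println(pattern + " -> " + count));
    }
}
```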

The Map Reduce class *UniqueCounter* finds unique values for specified attributes. It will output either the count of unique values or the actual unique values. The output is in one of the following two forms:

attr ordinal,unique value count

attr ordinal,value1,value2,value3

If a set of attributes together is supposed to be unique, the Map Reduce class *UniqueKeyAnalyzer* will find any violation of uniqueness for the set of attributes constituting the unique key. The output will contain records as below only if the attribute set is not found to be unique:

attr1 ordinal,attr2 ordinal,count

The output is generated only when count is greater than 1, for some unique key attribute combination.
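A single machine sketch of the same idea: build a composite key from the configured field ordinals, count occurrences, and report only keys seen more than once. The records and key ordinals here are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of composite key uniqueness checking. Records are comma
// separated; the key is built from a configured set of field ordinals.
public class UniqueKeyCheck {
    public static void main(String[] args) {
        String[] records = {"a,1,x", "b,2,y", "a,1,z"};
        int[] keyOrdinals = {0, 1};  // hypothetical composite key fields
        Map<String, Integer> keyCounts = new HashMap<>();
        for (String record : records) {
            String[] fields = record.split(",", -1);
            StringBuilder key = new StringBuilder();
            for (int ord : keyOrdinals) {
                key.append(fields[ord]).append(',');
            }
            keyCounts.merge(key.toString(), 1, Integer::sum);
        }
        // emit only violations, i.e. count greater than 1
        keyCounts.forEach((key, count) -> {
            if (count > 1) System.out.println(key + count);
        });
    }
}
```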

## Numerical Attribute

There are several Map Reduce jobs for profiling columns with numerical data. The Map Reduce class *NumericalAttrStats* calculates *min, max, mean and std deviation* for specified attributes. A typical output line is as follows, assuming partitioning with 2 attributes:

partId1,partId2,attr ordinal,sum,sum square,count,mean,variance,std dev,min,max

If the data is partitioned, we get the output for each combination of partition id and attribute ordinal. The partitioning may be based on one or more attributes. The output also contains intermediate results such as the sum and sum of squares.
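The statistics can be derived from just the count, the sum and the sum of squares, which is why those intermediate results appear in the output and why the computation distributes well. A minimal local sketch of the derivation:

```java
// Minimal sketch of deriving mean, variance and std deviation from the
// partial aggregates (count, sum, sum of squares) that each reducer sees.
public class NumericStats {
    public static void main(String[] args) {
        double[] values = {3.0, 7.0, 5.0, 9.0};
        double sum = 0, sumSq = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : values) {
            sum += v;
            sumSq += v * v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        int count = values.length;
        double mean = sum / count;
        double variance = sumSq / count - mean * mean;  // E[x^2] - (E[x])^2
        double stdDev = Math.sqrt(variance);
        System.out.println(sum + "," + sumSq + "," + count + "," + mean + ","
                + variance + "," + stdDev + "," + min + "," + max);
    }
}
```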

The Map Reduce class *NumericalAttrMedian* is similar to *NumericalAttrStats*. It calculates *median* and *median absolute deviation*. It runs in two rounds: in the first round the *median* is calculated, and in the second round the *median absolute deviation*, which depends on the *median*, is calculated. The output is as follows:

partId1,partId2,attr ordinal,median or median absolute deviation
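The two round structure mirrors the math: the median absolute deviation is the median of the absolute deviations from the median. A single machine sketch with both passes done locally:

```java
import java.util.Arrays;

// Minimal sketch of median and median absolute deviation (MAD). The real
// job computes these in two Map Reduce rounds; here both passes are local.
public class MedianMad {
    static double median(double[] sorted) {
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] values = {2.0, 9.0, 3.0, 5.0, 7.0};
        Arrays.sort(values);
        double med = median(values);             // round one: median
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - med);
        }
        Arrays.sort(deviations);
        double mad = median(deviations);         // round two: MAD
        System.out.println("median=" + med + " mad=" + mad);
    }
}
```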

The Map Reduce class *NumericalAttrDistrStats* calculates the distribution or histogram of values. The bucket width is specified for each attribute through configuration. The output is as follows, with one record per attribute:

attr ordinal,value and frequency pairs,mode,25 percentile, 50 percentile, 75 percentile
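Bucketing reduces each value to a bucket index, so mappers can emit (attribute, bucket) pairs and a summing reducer assembles the histogram. A local sketch of the bucketing, with a made up bucket width:

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of fixed width histogram bucketing for one attribute.
public class HistogramSketch {
    public static void main(String[] args) {
        double[] values = {1.2, 3.7, 4.1, 8.9, 4.4};
        double bucketWidth = 2.0;  // would come from per-attribute configuration
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (double v : values) {
            int bucket = (int) Math.floor(v / bucketWidth);
            histogram.merge(bucket, 1, Integer::sum);
        }
        // each bucket covers the value range [bucket*width, (bucket+1)*width)
        histogram.forEach((bucket, count) ->
                System.out.println(bucket * bucketWidth + " -> " + count));
    }
}
```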

To find outliers in numerical data, the Map Reduce class *NumericalAttrStats* can be used to find the mean and standard deviation, and then the Map Reduce class *ValidationChecker*, appropriately configured for the *Zscore* validator, can be used to find the outliers.

For each validation failure found for any attribute in any record, *ValidationChecker* generates output as below. The first line is the record where the violation was found, the second line contains the offending attribute index or ordinal, and the third line contains the type of validator that detected the validation failure:

record
attr ordinal
validator type

To remove the influence of existing outliers, a better option is to use *NumericalAttrMedian* to calculate the median and median absolute deviation and then to use *ValidationChecker*.
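A sketch of both checks follows: the classic z-score based on mean and standard deviation, and the robust variant based on median and median absolute deviation. The aggregate values are made up, and the 0.6745 consistency factor and threshold of 3 are common conventions, not necessarily the defaults used by *ValidationChecker*:

```java
// Minimal sketch of z-score based outlier flagging. The aggregates are made
// up stand-ins for NumericalAttrStats / NumericalAttrMedian output; the
// threshold and the 0.6745 MAD factor are common conventions.
public class ZscoreOutlier {
    public static void main(String[] args) {
        double mean = 50.0, stdDev = 5.0;    // as produced by NumericalAttrStats
        double median = 49.0, mad = 4.0;     // as produced by NumericalAttrMedian
        double threshold = 3.0;
        double value = 78.0;

        double zScore = (value - mean) / stdDev;
        double robustZScore = 0.6745 * (value - median) / mad;

        System.out.println("outlier by z-score: " + (Math.abs(zScore) > threshold));
        System.out.println("outlier by robust z-score: " + (Math.abs(robustZScore) > threshold));
    }
}
```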

## Categorical Attribute

The Map Reduce class *CategoricalAttrDistrStats* calculates the histogram of categorical attribute values. Its output is as follows:

attr ordinal,value and frequency pairs,entropy,mode

For categorical attributes, *entropy* provides a measure of dispersion in the data. The *mode* is the categorical value with the maximum frequency.
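Both quantities fall out of the value frequencies, with entropy computed as H = -sum over values of p * log2(p). A minimal sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of entropy and mode for a categorical column.
// Entropy H = -sum(p * log2(p)) over the value frequencies.
public class CategoricalEntropy {
    public static void main(String[] args) {
        String[] column = {"red", "blue", "red", "green", "red"};
        Map<String, Integer> freq = new HashMap<>();
        for (String v : column) freq.merge(v, 1, Integer::sum);

        double entropy = 0;
        String mode = null;
        int maxCount = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            double p = (double) e.getValue() / column.length;
            entropy -= p * Math.log(p) / Math.log(2);
            if (e.getValue() > maxCount) {
                maxCount = e.getValue();
                mode = e.getKey();
            }
        }
        System.out.println("entropy=" + entropy + " mode=" + mode);
    }
}
```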

## String Attribute

The profiling done for string attributes is similar to what’s done for numerical attributes with *NumericalAttrStats*, except that we analyze string length. The Map Reduce class responsible is *StringAttrStats*. The output is as follows:

attr ordinal,sum,sum square,count,mean,variance,std dev,min,max

There is one record for each configured attribute.

## Numerical Attribute Pair

The Map Reduce class *NumericalCorrelation* calculates the correlation between two numerical attributes. The output is as follows:

attr1 ordinal,attr2 ordinal,correlation
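The standard Pearson coefficient can be assembled from simple sums, which makes it a natural fit for a Map Reduce aggregation. A local sketch of the computation:

```java
// Minimal sketch of Pearson correlation between two numerical attributes,
// computed from sums so it maps naturally onto a distributed aggregation.
public class PearsonCorrelation {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 5, 4, 6};
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double cov = sumXY / n - (sumX / n) * (sumY / n);
        double stdX = Math.sqrt(sumX2 / n - (sumX / n) * (sumX / n));
        double stdY = Math.sqrt(sumY2 / n - (sumY / n) * (sumY / n));
        System.out.println("correlation=" + cov / (stdX * stdY));
    }
}
```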

## Categorical Attribute Pair

The Map Reduce class *HeterogeneityReductionCorrelation* calculates the correlation between two categorical attributes. The output is as follows:

attr1,attr2,correlation

The correlation calculation uses a contingency table, and the algorithm can be based on either the uncertainty coefficient or the concentration coefficient.
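As a concrete example of one of these measures, here is a sketch of the uncertainty coefficient U(Y|X) = (H(Y) - H(Y|X)) / H(Y), computed from a small made up contingency table. A value near 1 means that knowing X nearly determines Y:

```java
// Minimal sketch of the uncertainty coefficient U(Y|X) = (H(Y) - H(Y|X)) / H(Y)
// computed from a contingency table of two categorical attributes.
public class UncertaintyCoefficient {
    static double log2(double p) { return Math.log(p) / Math.log(2); }

    public static void main(String[] args) {
        // contingency[i][j] = count of (x value i, y value j); made up data
        int[][] contingency = {{20, 5}, {4, 21}};
        double total = 50;

        double hY = 0, hYGivenX = 0;
        int cols = contingency[0].length;
        for (int j = 0; j < cols; j++) {
            double colSum = 0;
            for (int[] row : contingency) colSum += row[j];
            if (colSum > 0) hY -= (colSum / total) * log2(colSum / total);
        }
        for (int[] row : contingency) {
            double rowSum = 0;
            for (int c : row) rowSum += c;
            for (int c : row) {
                if (c > 0) hYGivenX -= (c / total) * log2(c / rowSum);
            }
        }
        System.out.println("uncertainty coefficient=" + (hY - hYGivenX) / hY);
    }
}
```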

## Any Attribute Pair

Functional dependency analysis between two attributes applies to all data types except Text and Double. An example of functional dependency is that a mobile phone belongs to one user only.

This kind of analysis, done with the Map Reduce class *FunctionalDependencyAnalyzer*, will find whether the functional dependency between two attributes is violated, e.g. the same mobile phone number belonging to multiple users. The output is as follows:

attr,source attr value,target attr value

If there were a violation, we would see multiple target attribute values for the same source attribute value in the output above.
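The check reduces to grouping target values by source value and flagging any source value with more than one distinct target. A minimal sketch, using made up (phone, user) pairs:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of functional dependency checking: for each source value,
// collect the set of target values; more than one target value is a violation.
public class FunctionalDependencyCheck {
    public static void main(String[] args) {
        // (phone, user) pairs; the duplicate phone with two users is a violation
        String[][] pairs = {{"555-1234", "alice"}, {"555-9876", "bob"}, {"555-1234", "carol"}};
        Map<String, Set<String>> targetsBySource = new HashMap<>();
        for (String[] p : pairs) {
            targetsBySource.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
        }
        targetsBySource.forEach((source, targets) -> {
            if (targets.size() > 1) {
                System.out.println("violation: " + source + " -> " + targets);
            }
        });
    }
}
```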

## All Attributes

The Map Reduce class *MutualInformation* can be used for feature engineering, specifically for feature subset selection before building a predictive analytic model. It involves all feature attributes and the class attribute.

It calculates a score based on mutual information. The score is higher when a feature attribute is weakly correlated with the other feature attributes and strongly correlated with the class attribute.
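For discrete attributes, the pairwise building block is mutual information, I(X;Y) = sum over x,y of p(x,y) * log(p(x,y) / (p(x)p(y))); a score of the kind described combines such pairwise values across feature-feature and feature-class pairs. A sketch of the pairwise computation, with a made up joint count table:

```java
// Minimal sketch of mutual information between a discrete feature and the
// class attribute: I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) * p(y))).
public class MutualInfo {
    public static void main(String[] args) {
        // joint[i][j] = count of (feature value i, class value j); made up data
        int[][] joint = {{18, 2}, {3, 27}};
        double total = 50;

        double[] pX = new double[joint.length];
        double[] pY = new double[joint[0].length];
        for (int i = 0; i < joint.length; i++) {
            for (int j = 0; j < joint[0].length; j++) {
                pX[i] += joint[i][j] / total;
                pY[j] += joint[i][j] / total;
            }
        }
        double mi = 0;
        for (int i = 0; i < joint.length; i++) {
            for (int j = 0; j < joint[0].length; j++) {
                double pXY = joint[i][j] / total;
                if (pXY > 0) mi += pXY * Math.log(pXY / (pX[i] * pY[j])) / Math.log(2);
            }
        }
        System.out.println("mutual information=" + mi + " bits");
    }
}
```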

## Whole Record

The Map Reduce class *MultiVariateDistribution* can be used to calculate the multivariate distribution of records. The distribution includes only numerical and categorical attributes.

It gives insight into the data distribution in a multi dimensional attribute space. The distribution can also be used for detecting outliers.
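One way such a distribution can be built is by collapsing each record into a composite cell, bucketing numerical attributes and using categorical values directly, then counting records per cell; this is my own illustrative sketch, not necessarily how *MultiVariateDistribution* works internally:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a multivariate distribution: each record is mapped to a
// composite cell (bucketed numerical value + categorical value) and counted.
public class MultiVariateDistSketch {
    public static void main(String[] args) {
        // made up records: age (numerical), membership level (categorical)
        String[][] records = {{"34", "gold"}, {"37", "gold"}, {"62", "silver"}};
        double ageBucketWidth = 10.0;  // hypothetical bucket width
        Map<String, Integer> cellCounts = new HashMap<>();
        for (String[] rec : records) {
            int ageBucket = (int) (Double.parseDouble(rec[0]) / ageBucketWidth);
            String cell = ageBucket + "," + rec[1];
            cellCounts.merge(cell, 1, Integer::sum);
        }
        // cells with very low counts are candidate multivariate outliers
        cellCounts.forEach((cell, count) -> System.out.println(cell + " -> " + count));
    }
}
```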

## Multiple Records

The Map Reduce class *SameTypeSimilarity* can be used to find similarities between records. For a given record, each similar record is assigned a score reflecting the similarity between the two records. It can be used, among other things, for deduplication.
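A sketch of one plausible scoring scheme, averaging per-attribute similarities with a normalized difference for numerical attributes and exact match for categorical ones; the actual algorithm in *SameTypeSimilarity* may differ in its details:

```java
// Sketch of a plausible record similarity score: per-attribute similarities
// averaged across attributes. The scheme (normalized numeric difference,
// exact categorical match) is illustrative; SameTypeSimilarity may differ.
public class RecordSimilarity {
    public static void main(String[] args) {
        // made up records: (age, city)
        String[] rec1 = {"34", "london"};
        String[] rec2 = {"38", "london"};
        double ageRange = 80.0;  // assumed attribute range for normalization

        double ageSim = 1.0 - Math.abs(Double.parseDouble(rec1[0])
                - Double.parseDouble(rec2[0])) / ageRange;
        double citySim = rec1[1].equals(rec2[1]) ? 1.0 : 0.0;

        double similarity = (ageSim + citySim) / 2.0;
        System.out.println("similarity=" + similarity);  // near 1 means near duplicate
    }
}
```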

## Summary of Map Reduce Classes

Here is a summary of all the Map Reduce classes discussed so far for data profiling. They are categorized along two dimensions: how many attributes are involved and the attribute data type.

| Map Reduce Class | Comment | Attribute Type |
|------------------|---------|----------------|
| ValueCounter | Counts specified values for attributes | Single attribute of any type |
| PatternMatchedValueCounter | Counts values based on regex pattern | Single attribute of any type |
| UniqueCounter | Counts unique values for attributes | Single attribute of any type |
| UniqueKeyAnalyzer | Verifies uniqueness of composite key | Multiple attributes of any type |
| NumericalAttrStats | Calculates mean, std dev etc | Single attribute of numerical type |
| NumericalAttrMedian | Calculates median and median absolute deviation | Single attribute of numerical type |
| NumericalAttrDistrStats | Calculates histogram | Single attribute of numerical type |
| ValidationChecker | Finds outliers | Single attribute of numerical type |
| CategoricalAttrDistrStats | Calculates histogram | Single attribute of categorical type |
| StringAttrStats | Calculates mean, std dev etc for string length | Single attribute of string type |
| NumericalCorrelation | Correlation between attributes | Two attributes of numerical type |
| HeterogeneityReductionCorrelation | Correlation between attributes | Two attributes of categorical type |
| FunctionalDependencyAnalyzer | Verifies functional dependency | Two attributes of any type |
| MutualInformation | Feature correlation and relevance to class variable | All attributes |
| MultiVariateDistribution | Calculates multivariate distribution | All attributes of numerical and categorical type |
| SameTypeSimilarity | Similarity between records | All attributes of any type |
| MissingValueCounter | Counts missing values, row wise or column wise | All attributes of any type |

## Summing Up

We have taken a tour through all the data profiling Map Reduce classes available in my open source projects *chombo, avenir, beymani and sifarish* on GitHub. Most of them are from the project *chombo*.

There is a sample script to run all the Map Reduce jobs. A sample configuration file for all the Map Reduce jobs is also available.

