Profiling Big Data

Data profiling is the process of examining data to learn about important characteristics of data. It’s an important part of any ETL process. It’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented various open source Hadoop based data profiling Map Reduce jobs. Most of them are in the project chombo. Some are are in other projects.

I will provide an overview of the various data profiling Hadoop Map Reduce implementations in this post. It complements my earlier post on Hadoop Map Reduce based data validation.

What Constitutes Data Profiling

Broadly speaking, data profiling focuses on analyzing  the following characteristics of data. Some of them are only relevant for certain data types. for example, it makes very little sense to do uniqueness analysis for real valued data.

Most of the analysis involves only one attribute. However, some are based on pairs of attributes or multiple attributes.

  1. Completeness : How complete is the data? What percentage of records has missing or null values for some column
  2. Uniqueness : How many unique values does a column have? Does a column that is supposed to be unique key, have all unique values?
  3. Distribution : What is the distribution of values (i.e histogram) for a certain column?
  4. Basic statistics : For numerical data, what is the mean, standard deviation, minimum, maximum and median
  5. Pattern matching : What patterns are matched by data values of a column? Are there data values that don’t match any of the expected patterns
  6. Outliers : Are there outliers in the numerical data. You may be interested in removing them before further analysis.
  7. Correlation : What is the correlation between two given attributes? This kind of profiling may be important for feature analysis prior to building predictive models
  8. Functional dependency : Is there functional dependency between two attributes?

Next, we will explore various data profiling techniques including Map Reduce implementations for various data types.

Attribute of Any Data Type

Completeness analysis can be performed for any data type. We may be interested in counting missing values or null values for all columns of data. Perhaps, a column with too many missing values will be barred from use in further analysis.

The map reduce class ValueCounter counts number of occurrences of specified values for an attribute. The values for an attribute is specified as a list of values as follows


Missing value can be specified either by having a missing value in the coma separated list of values or by not having any value for the parameter as below


The map reduce output consists of attribute ordinal, value and the count for that value as follows. According to the output, the second column had 23 missing values.


It’s useful for finding the distribution between different values.It can also be used to check membership of categorical attributes.

The Map Reduce class PatternMatchedValueCounter will count number of matches, given one or more regular expression patterns for a column. It’s output is as follows

attr ordinal,pattern,match count

If a field (e.g. phone number) is supposed to be in one of some specific number of formats, this Map Reduce class helps to identify any violation. It tells us the distribution between different formats as found in the data.

The Map Reduce class UniqueCounter finds unique values for specified attributes. It will out either the count of unique values or the actual unique values. The output is as follows

attr ordinal,unique value count
attr ordinal,value1,value2,value3

If a set of attributes together is supposed to be unique, the Map Reduce class UniqueKeyAnalyzer, will find any violation of uniqueness of a set of attributes constituting the uniqye key. The output will contain records as below, only if the attribute set is not found to be  unique

attr1 ordinal,attr2 ordinal,count

The output is generated only when count is greater than 1, for some unique key attribute combination.

The Map Reduce class MissingValueCounter counts missing fields row wise or column wise. When row wise the output is as follows

row content or row key,count of missing fields

For column wise missing value counting, the output is as follows

column index,count of missing fields

Numerical Attribute

There are several Map Reduce jobs for profiling for columns with numerical data. The map reduce class NumericalAttrStats calculates min, max, mean, std deviation for specified attributes. A typical output line will be as follows, assuming partitioning with 2 attributes.

partId1,partId2,attr ordinal,sum,sum square,count,mean,variance,std dev,min,max

If the data is partitioned, we get the output for each partition id and attribute ordinal combination. The partitioning may be based on one or more attributes. The output consists of lot of intermediate results.

The Map Reduce class NumericalAttrMedian is similar to NumericalAttrStats. It calculates median and median absolute deviation. It runs in two rounds. In the first round median is calculated. In the second round median absolute deviation is calculated which is based on median.  The output is as follows.

partId1,partId2,attr ordinal,median or median absolute deviation

The Map Reduce class NumericalAttrDistrStats calculates distribution or histogram. Bucket width is specified for each attribute through configuration. The output is as follows, with one record per attribute.

attr ordinal,value and frequency pairs,mode,25 percentile, 50 percentile, 75 percentile

To find outliers in numerical data, the Map reduce class NumericalAttrStats can be used to find mean and standard deviation and then the Map Reduce class ValidationChecker, appropriately configured for Zscore validator,  can be used to find outliers.

For each validator violation found for any attribute in any record, generates output as below. The first line is the record where violation was found. The second line contains the offending attribute index or ordinal. The third line contains the type of validator that detected validation failure

attr ordinal
validator type

To remove the influence of existing outliers, a better option is to use NumericalAttrMedian to calculate median and median absolute deviation and  then to use ValidationChecker.

Categorical Attribute

The Map Reduce class CategoricalAttrDistrStats calculates histogram of categorical attribute values. It’s output is as follows.

attr ordinal,value and frequency pairs,entropy,mode

For categorical attributes, entropy provides a measure of  deviation in the data. The mode is the categorical value with maximum frequency.

The Map Reduce class CategoricalAttrClassDistr calculates class or target attribute value distribution for each categorical attribute value. The output is as follows

partition IDs,attr ordinal,attr value,value and frequency pairs for class values,entropy,mode

String Attribute

The profiling done for String attributes is similar to what’s done for numerical attributes with NumericalAttrStats, except that we analyze string length. The Map Reduce class responsible is StringAttrStats. The output is as follows.

attr ordinal,sum,sum square,count,mean,variance,std dev,min,max

There is one record for each configured attribute.

Numerical Attribute Pair

The Map Reduce class NumericalCorrelation calculates correlation between two  numerical attributes. The output is as follows

attr1 ordinal,attr2 ordinal,correlation

Categorical Attribute Pair

The Map Reduce class HeterogeneityReductionCorrelation calculates correlation between two categorical attributes. The output is as follows


Correlation calculation uses a contingency table and the algorithm can be based on either uncertainty coefficient or concentration coefficient.

Any Attribute Pair

Functional dependency analysis between two attributes applies to all data type except Text and Double. An example of functional dependency is that a mobile phone belongs to one user only.

This kind of analysis done with the Map Reduce class FunctionalDependencyAnalyzer  will find whether the functional dependency between two attributes is violated i.e. the same mobile phone number belonging to multiple users.

attr,source attr value,target attr value

If there was violation, we would have seen multiple target attribute values in the output above.

All Attributes

The Map Reduce class MutualInformation can be used for feature engineering and specifically for feature sub set selection before building predictive analytic model. It involves all feature attributes and the class attribute.

It calculates some score based on mutual information. The score is higher when a feature attribute is weakly correlated with other feature attributes and strongly correlated with class attribute.

The MapReduce class DataTypeInferencer discovers data types of all columns or specified columns. The type are numeric integer, numeric float and string. The output is as follows

attr column ordinal, data type

Whole Record

The Map Reduce class MultiVariateDistribution can be used to calculate multivariate distribution of records. The distribution includes only numerical and categorical attributes.

It gives insight into data distribution in a multi dimensional attribute space. The distribution can also be used for detecting outliers.

Multiple Records

The Map Reduce class  RecordSimilarity can be used to find similarities between records.  For a given record, all similar records are assigned a score reflecting similarities between records. It can be used, among other uses for deduplication.

Summary of Map Reduce Classes

Here is a summary of all the Map Reduce classes discussed so fa  data profiling. They are categorized along two dimensions, how many attributes are involved and the attribute data type.

Map Reduce Class Comment Attribute Type
ValueCounter Counts specified values for attributes Single attribute of any type
PatternMatchedValueCounter Counts values based on regex pattern Single attribute of any type
UniqueCounter Counts unique values  for attributes Single attribute of any type
UniqueKeyAnalyzer Verifies uniqueness of composite key Single attribute of any type
NumericalAttrStats Calculates mean, std dev etc Single attribute of numerical type
NumericalAttrMedian Caluclates median and median absolute deviation Single attribute of numerical type
NumericalAttrDistrStats Calculates histogram Single attribute of numerical type
ValidationChecker Finds outliers Single attribute of numerical type
CategoricalAttrDistrStats Calculates histogram Single attribute of categorical type
StringAttrStats Calculates mean, std dev etc for string length Single attribute of string type
NumericalCorrelation Correlation between attributes Two attribute of numerical type
HeterogeneityReductionCorrelation Correlation between attributes Two attribute of categorical type
FunctionalDependencyAnalyzer Verifies functional dependency Two attributes of any type
MutualInformation Feature correlation and relevance to class variable All attributes
MutiVariateDistribution Calculates multi variate distribution All attributes of numerical and categorical type
 RecordSimilarity Similarity between records All attributes of any type
MissingValueCounter Counts missing values, row wise or column wise All attributes of any type type
CategoricalAttrClassDistr Class value distr for categorical attributes Categorical attributes
DataTypeInferencer Discovers data types All attributes of any type type

Summing Up

We have taken a tour through all the data profiling map reduce classes available in my open source projects chomboavenir, beymani and sifarish in Github. Most of them are from the project chombo.

There is a sample script to run all the map reduce jobs. Sample configuration file for all the map reduce jobs is also available.


For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Data Profiling, data quality, ETL, Hadoop and Map Reduce and tagged , . Bookmark the permalink.

6 Responses to Profiling Big Data

  1. Are there any equivalents in spark?

  2. Pingback: Profiling Big Data | Mawazo – Big Data Cloud

  3. Pingback: Transforming Big Data | Mawazo

  4. Pingback: Data Type Auto Discovery with Spark | Mawazo

Leave a Reply to 2s Complement Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s