Validating Big Data


Data quality is a thorny issue in most Big Data projects. It has been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover the data validation features recently added to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable, commonly used validation functions is provided out of the box. I will use product data as a test case to show how it works.

Validation Functions

A close examination of the data validation functions typically used reveals that they can broadly be classified into two groups:

  1. A broad set of commonly used functions.
  2. A highly nuanced, custom set of functions, very closely tied to specific business rules in the domain of the application

In chombo, the first group is handled with a set of built-in validators available out of the box. The user simply configures a set of validators for each field in the data. Each validator is identified by a specific tag. The second group of validation functions is handled by allowing the user to write custom Java classes that derive from the base Validator class.

Common Out Of The Box Validators

Here is a list of validators available out of the box; more will be added in the future as needed. They are listed under different categories. The following validators are generic in nature and do not depend on the field data type.

Tag            Comment
notMissing     Ensures field is not missing
ensureInt      Ensures field value is integer
ensureLong     Ensures field value is long
ensureDouble   Ensures field value is double
ensureDate     Ensures field value is a date

The following validators are available for string fields. The pattern validator can be used for phone numbers, email addresses, dates, or any other data with a pattern.

Tag           Comment
minLength     Ensures field length is greater than or equal to specified length
maxLength     Ensures field length is less than or equal to specified length
exactLength   Ensures field length is equal to specified length
min           Ensures that field is lexicographically greater than or equal to specified string
max           Ensures that field is lexicographically less than or equal to specified string
pattern       Ensures that field matches given pattern

The following validators are for numeric fields. For zscoreBasedRange validation, there is another map reduce class that can be used to calculate the mean and standard deviation of all numeric fields. The file path for the stats is provided through the parameter stat.file.path.

For validation using more robust statistics based on the median and median absolute deviation, the validator robustZscoreBasedRange can be used. A minimal sketch of both range checks follows the table below.

Tag                      Comment
min                      Ensures field is greater than or equal to specified value
max                      Ensures field is less than or equal to specified value
zscoreBasedRange         Validates range based on mean and standard deviation
robustZscoreBasedRange   Validates range based on median and median absolute deviation
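
To make the two range checks concrete, here is a minimal sketch of the underlying logic, assuming a configurable maximum z-score threshold. It only illustrates the idea and is not chombo's actual implementation.

// Illustration of the idea behind zscoreBasedRange and robustZscoreBasedRange;
// in a real run the mean, standard deviation, median and MAD come from the
// stats file and the threshold from configuration.
public class RangeChecks {

    // zscoreBasedRange: reject values too many standard deviations from the mean
    static boolean zscoreValid(double value, double mean, double stdDev, double maxZscore) {
        return Math.abs(value - mean) / stdDev <= maxZscore;
    }

    // robustZscoreBasedRange: same idea with median and median absolute deviation (MAD)
    static boolean robustZscoreValid(double value, double median, double mad, double maxZscore) {
        return Math.abs(value - median) / mad <= maxZscore;
    }
}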

The following validators are for categorical fields. The only validator currently available is for a membership check.

Tag          Comment
membership   Ensures field value is a member of the categorical value set

The JSON schema file contains various metadata required by different validators; one example is the minimum value for the min validator. The same schema file holds the metadata used for our product data test case. The validators for a field are defined through configuration. Here is one example: validator.1=membership,notMissing. The left hand side contains the word validator followed by the ordinal of the field; the right hand side contains the list of validators to be applied to the field.
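
As a slightly fuller, hypothetical illustration for the product data test case (field ordinals are zero based), the per field configuration might look like the lines below; the validator choices are my own example, not a configuration shipped with chombo.

# hypothetical per field validator assignments for the product data test case
# ordinals: 0=product ID, 1=category, 2=brand, 3=model, 4=price, 5=quantity
validator.0=notMissing
validator.1=membership,notMissing
validator.2=membership,notMissing
validator.3=notMissing
validator.4=ensureInt,min,max
validator.5=ensureInt,min,max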

Customization

Customization can be done by creating Java validator classes extending a base class. It is also necessary to create a custom validator factory class implementing a factory interface. Instead of a custom validator factory class, the list of validator tags and the corresponding custom validator class names can be provided through configuration.

Since custom validators are expected to implement complex business logic, possibly requiring access to fields other than the one being validated, the whole record is passed to the validate() method of the classes extending Validator.
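
As a rough sketch, a custom validator might look like the class below. The constructor, the exact signature of validate() and the business rule are assumptions for illustration; the real base Validator class in chombo may differ.

// Hypothetical custom validator: flags records where a price is present
// but the quantity is not a positive integer. The base class constructor
// and the validate() signature are assumed; check chombo for the real API.
public class QuantityConsistencyValidator extends Validator {

    public QuantityConsistencyValidator(String tag) {
        super(tag);
    }

    // the whole comma separated record is passed in, so the rule can
    // look at fields other than the one being validated
    public boolean validate(String record) {
        String[] fields = record.split(",", -1);
        String price = fields[4];
        String quantity = fields[5];
        if (price.isEmpty()) {
            return true;
        }
        try {
            return Integer.parseInt(quantity) > 0;
        } catch (NumberFormatException ex) {
            return false;
        }
    }
}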

Validation Map Reduce

The map reduce class is very simple, involving only mappers. For every record, it applies all the validators configured for each field. All validation violations are summarized and written to a file in HDFS. The input data used is product data with the following fields; the fields category and brand are categorical. A simplified sketch of the mapper logic follows the field list.

  1. product ID
  2. category
  3. brand
  4. model
  5. price
  6. quantity
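
Here is a condensed, hypothetical sketch of what such a mapper-only validation job might look like; it is not the actual chombo class, and the loading of configured validators, the schema handling and the violation report output are omitted or simplified.

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified, hypothetical validation mapper; the real chombo implementation
// builds validators from configuration and writes a separate violation report.
public class ValidationMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // field ordinal -> validators configured for that field, built in setup()
    private Map<Integer, List<Validator>> fieldValidators = new HashMap<>();
    private boolean filterInvalidRecords;

    @Override
    protected void setup(Context context) {
        filterInvalidRecords = context.getConfiguration()
            .getBoolean("filter.invalid.records", false);
        // building fieldValidators from the configured tags is omitted here
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        boolean valid = true;
        for (Map.Entry<Integer, List<Validator>> entry : fieldValidators.entrySet()) {
            for (Validator validator : entry.getValue()) {
                // built in validators check a single field value; custom ones
                // would receive the whole record instead
                if (!validator.validate(fields[entry.getKey()])) {
                    valid = false;
                    // recording the field ordinal and validator tag for the
                    // violation report is omitted here
                }
            }
        }
        if (valid || !filterInvalidRecords) {
            context.write(NullWritable.get(), value);
        }
    }
}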

Original records are output as-is. If the configuration parameter filter.invalid.records is set to true, the invalid records are filtered out of the output. The validation report is written to a file in HDFS at the path specified through the parameter invalid.data.file.path. Here is some sample output.

958YJY4QIE,laptop,asus,8946WB,888,52
field:4
max  
63M9YMVU99,wearables,fitbit,IZ8KH4,123,155
field:5
max  
DV5VTR1U4R,,apple,7NC446,158,100
field:1
membership  notMissing  

For each record that was found to be invalid, the report contains the following:

  1. actual record
  2. list of field ordinals
  3. for each field, list of tags for validators that resulted in data being invalid

Configurations can be provided either through a properties file or through a Typesafe Config (HOCON) configuration file. If using HOCON, the configuration file needs to be placed in HDFS at the path defined by the configuration parameter validator.config.file.path. An example HOCON configuration file is available.

Configuration

Configuration is defined in several ways depending on the level of detail. The properties file only defines high level configuration, e.g., which validators should be used for which field.

Many of the validators use metadata from the JSON schema file as configuration parameters; which attributes are used depends on the validator type. You may want to look inside the classes StringValidator and NumericalValidator to find out how they are used.
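
As an illustration only, the schema entry for the numeric price field might carry attributes along these lines; the attribute names here are my own guesses, so the actual chombo schema files should be consulted for the exact format.

{
  "name" : "price",
  "ordinal" : 4,
  "type" : "int",
  "min" : 1,
  "max" : 1000,
  "maxZscore" : 3.0
}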

Some validators require complex objects to be passed as part of a context. More details on configuration can be found in the tutorial document, with the link provided below.

Summing Up

To try the exercise in this post, you can refer to the tutorial for the details of the steps. I have provided a set of validators that I think are commonly used. To add other validators, you can either send me a git pull request or file an issue on GitHub.

