Data quality is a thorny issue in most Big Data projects. It has been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation features that have been added recently to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable, common validation functions is provided out of the box. I will use product data as a test case to show how it works.
A close examination of the data validation functions typically used reveals that they can broadly be classified into two groups:
- A broad set of commonly used functions.
- A highly nuanced, custom set of functions, closely tied to specific business rules in the application domain.
In chombo, the first group is handled with a set of built-in validators available out of the box. The user simply configures a set of validators for each field in the data. Each validator is identified by a specific tag. The second group of validation functions is handled by allowing the user to write custom Java classes which derive from the base Validator class.
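To illustrate the derivation pattern, here is a minimal sketch of a custom validator. The base class shape, method signature, and field ordinals below are assumptions for illustration; the actual chombo Validator API may differ.

```java
// Hypothetical sketch: the actual chombo Validator base class and its
// method signature may differ. This only illustrates the pattern of
// deriving a custom validator tied to a business rule.
abstract class Validator {
    protected final int ordinal;   // index of the field this validator checks

    Validator(int ordinal) {
        this.ordinal = ordinal;
    }

    // The whole record is passed in, so a validator can consult other fields.
    abstract boolean validate(String[] record);
}

// Example business rule (made up): a sale price must not exceed the list price.
class SalePriceNotAboveListPrice extends Validator {
    private final int listPriceOrdinal;

    SalePriceNotAboveListPrice(int ordinal, int listPriceOrdinal) {
        super(ordinal);
        this.listPriceOrdinal = listPriceOrdinal;
    }

    @Override
    boolean validate(String[] record) {
        double sale = Double.parseDouble(record[ordinal]);
        double list = Double.parseDouble(record[listPriceOrdinal]);
        return sale <= list;
    }
}
```

Because the whole record is available, a rule like this can compare two fields of the same record, which a purely per-field validator could not do.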
Common Out Of The Box Validators
Here is a list of validators available out of the box. More will be added in the future as needed. They are listed under different categories. The following validators are generic in nature and do not depend on the field data type.
|Validator|Description|
|---------|-----------|
|notMissing|Ensures field is not missing|
|ensureInt|Ensures field value is integer|
|ensureLong|Ensures field value is long|
|ensureDouble|Ensures field value is double|
|ensureDate|Ensures field value is a date|
The following validators are available for string fields. The pattern validator can be used for phone numbers, email addresses, dates or any other data with a pattern.
|Validator|Description|
|---------|-----------|
|minLength|Ensures field length is greater than or equal to specified length|
|maxLength|Ensures field length is less than or equal to specified length|
|exactLength|Ensures field length is equal to specified length|
|min|Ensures field is lexicographically greater than or equal to specified string|
|max|Ensures field is lexicographically less than or equal to specified string|
|pattern|Ensures field matches given pattern|
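At its core, a pattern check amounts to a regular expression match on the raw field value. Here is a minimal sketch of that idea; the phone number regex is only an example for illustration, not something taken from chombo.

```java
import java.util.regex.Pattern;

// Minimal sketch of the core of a pattern check: the raw field value is
// matched against a configured regex. The US-style phone number regex
// used in the test is just an example.
public class PatternCheck {
    public static boolean matches(String fieldValue, String regex) {
        return Pattern.matches(regex, fieldValue);
    }
}
```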
The following validators are for numeric fields. For zscoreBasedRange validation, there is another map reduce class that can be used to calculate the mean and standard deviation of all numeric fields. The file path for the stats is provided through the parameter stat.file.path.
For validation using more robust statistics based on the median and median absolute deviation, the validator robustZscoreBasedRange can be used.
|Validator|Description|
|---------|-----------|
|min|Ensures field is greater than or equal to specified value|
|max|Ensures field is less than or equal to specified value|
|zscoreBasedRange|Validates range based on mean and standard deviation|
|robustZscoreBasedRange|Validates range based on median and median absolute deviation|
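The statistical idea behind the two range validators can be sketched as follows. This shows the underlying math only, not chombo's actual implementation; the method names and the 1.4826 consistency factor are my own choices for the sketch.

```java
// Sketch of the statistics behind the two range validators; not chombo's
// actual implementation. A value passes if its (robust) z-score is within
// a configured maximum.
public class ZscoreRange {
    // Classic z-score: distance from the mean in units of standard deviation.
    public static boolean inRange(double value, double mean, double stdDev,
                                  double maxZscore) {
        return Math.abs(value - mean) / stdDev <= maxZscore;
    }

    // Robust variant: the median replaces the mean, and the median absolute
    // deviation (scaled by ~1.4826 so it is comparable to the standard
    // deviation for normally distributed data) replaces the std deviation.
    // Outliers in the data barely move these statistics, unlike mean/stddev.
    public static boolean inRobustRange(double value, double median, double mad,
                                        double maxZscore) {
        return Math.abs(value - median) / (1.4826 * mad) <= maxZscore;
    }
}
```

The robust variant matters because a few extreme outliers can inflate the mean and standard deviation enough that the outliers themselves pass a plain z-score check.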
The following validators are for categorical fields. The only validator currently available is for a membership check.
|Validator|Description|
|---------|-----------|
|membership|Ensures field value is a member of the categorical value set|
The JSON schema file contains various metadata required by different validators, for example the minimum value for the min validator. This is the metadata used for our product data test case. The validators for a field are defined through configuration. Here is one example: validator.1=membership,notMissing. The left hand side contains the word validator followed by the ordinal of the field. The right hand side contains the list of validators to be applied to the field.
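To make this concrete, here is a hypothetical properties fragment for the product data. Only the validator.n=tag,... pattern comes from the post; the specific ordinals and validator choices below are assumptions for illustration.

```
# validators per field ordinal (hypothetical example)
validator.1=membership,notMissing
validator.4=zscoreBasedRange
```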
Customization can be done by creating Java validator classes extending a base class. It is also necessary to create a custom validator factory class implementing a factory interface. Alternatively, instead of a custom validator factory class, a list of validator tags and the corresponding custom validator class names can be provided through configuration.
Since custom validators are expected to implement complex business logic, possibly requiring access to fields other than the one being validated, the whole record is passed to the validate() method of the classes extending Validator.
Validation Map Reduce
The map reduce class is very simple, involving only mappers. For every record, it applies all the configured validators for each field. All validation violations are summarized and written to a file in HDFS. The input data used is product data with the following fields. The fields category and brand are categorical.
- product ID
Original records are output as is. If the configuration parameter filter.invalid.records is set to true, invalid records are filtered out of the output. The validation report is written to a file in HDFS, as specified through the parameter invalid.data.file.path. Here is some sample output.
```
958YJY4QIE,laptop,asus,8946WB,888,52 field:4 max
63M9YMVU99,wearables,fitbit,IZ8KH4,123,155 field:5 max
DV5VTR1U4R,,apple,7NC446,158,100 field:1 membership notMissing
```
For each record found to be invalid, the following output is generated:
- actual record
- list of field ordinals
- for each such field, the list of tags of the validators that flagged the data as invalid
Configuration can be provided either through a properties file or through a Typesafe HOCON configuration file. If using HOCON, the configuration file needs to be placed in HDFS, at the location defined by the configuration parameter validator.config.file.path. An example HOCON configuration file is available.
Configuration is defined in several ways depending on the level of detail. The properties file only defines high level configuration, e.g., which validator should be used for which field.
Many of the validators use metadata from the JSON schema file as configuration parameters. Depending on the validator type, different metadata attributes are used. You may want to look inside the classes StringValidator and NumericalValidator to find out how they are used.
Some validators require complex objects to be passed as part of a context. More details on configuration can be found in the tutorial document, with a link provided below.
To try the exercise in this post, you can refer to the tutorial for the details of the steps. I have provided a set of validators that I think are commonly used. To have other validators added, you can either send me a git pull request or file an issue on GitHub.