Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation features that have recently been added to my OSS project chombo, which runs on Hadoop and Storm. A set of easily configurable common validation functions is provided out of the box. I will use product data as a test case to show how it works.
A close examination of the data validation functions typically used reveals that they can be broadly classified into two groups:
- A broad set of commonly used functions.
- A highly nuanced set of custom functions, closely tied to specific business rules in the domain of the application.
In chombo, the first group is handled with a set of built-in validators available out of the box. The user simply configures a set of validators for each field in the data. Each validator is identified by a specific tag. The second group of validation functions is handled by allowing the user to write custom Java classes that derive from the base Validator class.
Common Out Of The Box Validators
Here is a list of validators available out of the box. More will be added in future as needed. They are listed under different categories. The following validators are generic in nature and do not depend on the field data type.
|Tag|Description|
|---|---|
|notMissing|Ensures field is not missing|
|ensureInt|Ensures field value is integer|
|ensureLong|Ensures field value is long|
|ensureDouble|Ensures field value is double|
|ensureDate|Ensures field value is a date|
The following validators are available for string fields. The pattern validator can be used for phone numbers, email addresses, dates or any other data that follows a pattern.
|Tag|Description|
|---|---|
|minLength|Ensures field length is greater than or equal to specified length|
|maxLength|Ensures field length is less than or equal to specified length|
|exactLength|Ensures field length is equal to specified length|
|min|Ensures that field is greater than or equal to specified string value|
|max|Ensures that field is less than or equal to specified string value|
|pattern|Ensures that field matches given regex pattern|
|preDefinedPattern|Ensures that field matches the selected out of the box regex pattern|
The validator preDefinedPattern supports many common regular expression patterns out of the box, so that the user does not have to define them. The different attribute types supported are
The regex pattern desired should be indicated by selecting one of the names from the list and providing that as part of the validator configuration.
If the regex pattern for these fields does not meet your requirement, you should use the more generic pattern validator and provide your own regex pattern through configuration.
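To make the string checks concrete, here is a minimal sketch of what length and pattern validation amount to. It is written in Python for brevity and is not chombo's code (chombo is Java). The function names and the phone number regex are hypothetical, and the sketch assumes the whole field must match the pattern, which may differ from how chombo applies it.

```python
import re

def min_length(value, limit):
    """Field length must be greater than or equal to limit."""
    return len(value) >= limit

def max_length(value, limit):
    """Field length must be less than or equal to limit."""
    return len(value) <= limit

def matches_pattern(value, regex):
    """Assumption: the entire field must match, so fullmatch is used."""
    return re.fullmatch(regex, value) is not None

# Hypothetical user-supplied pattern for a US style phone number.
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

print(matches_pattern("408-555-1234", PHONE_PATTERN))
print(matches_pattern("408-555-12345", PHONE_PATTERN))
```

The first call accepts the value and the second rejects it, because the extra digit makes the field longer than the pattern allows.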
The following validators are for numeric fields. For zscoreBasedRange validation, there is another map reduce class that can be used to calculate the mean and standard deviation of all numeric fields. The file path for the stats is provided through the parameter stat.file.path.
For validation using more robust statistics based on median and median absolute deviation, the validator robustZscoreBasedRange can be used.
|Tag|Description|
|---|---|
|min|Ensures field is greater than or equal to specified value|
|max|Ensures field is less than or equal to specified value|
|zscoreBasedRange|Validates range based on mean and std deviation|
|robustZscoreBasedRange|Validates range based on median and median absolute deviation|
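The difference between the two z-score based validators is easiest to see with numbers. The following Python sketch is not taken from chombo. The function names, the threshold of 3 and the sample values are all assumptions, and the 0.6745 scaling constant is the conventional one for a robust z-score based on median absolute deviation (MAD), which chombo may or may not use.

```python
import statistics

def zscore_in_range(value, mean, std_dev, max_zscore):
    """True when value is within max_zscore standard deviations of the mean."""
    if std_dev == 0:
        return value == mean
    return abs(value - mean) / std_dev <= max_zscore

def robust_zscore_in_range(value, median, mad, max_zscore):
    """Same check, using median and median absolute deviation (MAD)."""
    if mad == 0:
        return value == median
    # 0.6745 rescales MAD so the score is comparable to an ordinary
    # z-score when the data is normally distributed.
    return 0.6745 * abs(value - median) / mad <= max_zscore

# Hypothetical values for one numeric field; 900 is an obvious outlier.
values = [120, 125, 130, 128, 122, 900]
mean = statistics.mean(values)
std_dev = statistics.pstdev(values)
median = statistics.median(values)
mad = statistics.median([abs(v - median) for v in values])

print(zscore_in_range(900, mean, std_dev, 3.0))
print(robust_zscore_in_range(900, median, mad, 3.0))
```

With this data the outlier inflates both the mean and the standard deviation, so its ordinary z-score is only about 2.2 and the first check passes. The median and MAD are barely affected by the outlier, so the robust check rejects it.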
The following validators are for date fields. Date value can be either formatted or in epoch time.
|Tag|Description|
|---|---|
|min|Ensures field is later than or equal to specified date value|
|max|Ensures field is earlier than or equal to specified date value|
The following validators are for categorical fields. The only validator currently available is for membership check.
|Tag|Description|
|---|---|
|membership|Ensures field value is a member of the categorical value set|
All the validators covered so far apply to specified columns in the data. However, sometimes the validation logic spans the whole record and involves multiple columns simultaneously. For example, when one column has a certain value, the value of another column may be constrained in some way. Here are the row wise validators supported.
|Tag|Description|
|---|---|
|notMissingGroup|Among a group of columns, at least one of them needs to have a non-empty value|
|valueDependency|When a column has a certain value, the value in another column is constrained, e.g. it must be a member of a set of values for categorical data|
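The logic behind the two row wise validators can be sketched as follows. This is a Python illustration of the idea only, not chombo's implementation. The function signatures are hypothetical, field ordinals are assumed to be zero based, and the rule tying the apple brand to a set of categories is made up for the example.

```python
def not_missing_group(record, ordinals):
    """At least one of the listed columns must have a non-empty value."""
    return any(record[i].strip() != "" for i in ordinals)

def value_dependency(record, source_ordinal, source_value,
                     target_ordinal, allowed_values):
    """When the source column holds source_value, the target column
    must be a member of allowed_values; otherwise nothing is checked."""
    if record[source_ordinal] != source_value:
        return True
    return record[target_ordinal] in allowed_values

# A record with an empty category (ordinal 1) and brand apple (ordinal 2).
record = "DV5VTR1U4R,,apple,7NC446,158,100".split(",")

print(not_missing_group(record, [1, 2]))
print(value_dependency(record, 2, "apple", 1, {"laptop", "cell phone"}))
```

The group check passes because the brand is present even though the category is empty. The dependency check fails because the brand triggers the rule and the empty category is not in the allowed set.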
The JSON schema file contains various metadata required for different validators. One such example is the minimum value for min validator. This is the metadata used for our product data test case.
The validators for a field may be defined through property configuration file. Here is one example: validator.1=membership,notMissing. The left hand side contains the word validator followed by the ordinal of the field. The right hand side contains the list of validators to be applied to the field.
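As an illustration of how such property lines map fields to validator tags, here is a small Python sketch. It is not chombo's parser. The function name is hypothetical, and the second and third property lines in the example are made up to show that only keys of the form validator followed by a numeric ordinal are picked up.

```python
def parse_validator_properties(lines):
    """Return a dict of field ordinal -> list of validator tags."""
    config = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        prefix, _, ordinal = key.strip().partition(".")
        # Skip anything that is not validator.<ordinal>
        if prefix != "validator" or not ordinal.isdigit():
            continue
        config[int(ordinal)] = [t.strip() for t in value.split(",") if t.strip()]
    return config

properties = [
    "validator.1=membership,notMissing",
    "validator.4=min,max",                # hypothetical extra line
    "filter.invalid.records=true",        # not a validator entry, skipped
]
print(parse_validator_properties(properties))
```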
Validator configuration may also be provided through hconf configuration file. For complex validators, this is the only option.
Customization can be done by creating Java validator classes that extend a base class. It is also necessary to create a custom validator factory class implementing a factory interface. Instead of a custom validator factory class, a list of validator tags and the corresponding custom validator class names can be provided through configuration.
Since custom validators are expected to implement complex business logic, possibly requiring access to fields other than the one being validated, the whole record is passed to the validate() method of the classes that extend Validator.
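The design can be pictured with a small sketch: a base class, a custom subclass that looks at more than one field of the record, and a tag to class mapping that plays the role of the factory. The sketch is in Python and is purely conceptual. Chombo's real base class and factory are Java and their signatures are not reproduced here. Every class name, method name and tag below is hypothetical, as is the business rule that the value in field 5 must not exceed the value in field 4.

```python
class Validator:
    """Hypothetical base class: one instance per validator tag and field."""
    def __init__(self, tag, ordinal):
        self.tag = tag
        self.ordinal = ordinal

    def is_valid(self, record):
        # The whole record is passed in, not just the field being validated.
        raise NotImplementedError

class NotGreaterThanOtherField(Validator):
    """Made-up business rule comparing two numeric fields of a record."""
    def __init__(self, tag, ordinal, other_ordinal):
        super().__init__(tag, ordinal)
        self.other_ordinal = other_ordinal

    def is_valid(self, record):
        return float(record[self.ordinal]) <= float(record[self.other_ordinal])

# Tag to class mapping, standing in for a custom factory or for
# tag/class name pairs supplied through configuration.
CUSTOM_VALIDATORS = {"notGreaterThanOther": NotGreaterThanOtherField}

def create_validator(tag, ordinal, **params):
    return CUSTOM_VALIDATORS[tag](tag, ordinal, **params)

validator = create_validator("notGreaterThanOther", 5, other_ordinal=4)
print(validator.is_valid("958YJY4QIE,laptop,asus,8946WB,888,52".split(",")))
print(validator.is_valid("63M9YMVU99,wearables,fitbit,IZ8KH4,123,155".split(",")))
```

Passing the whole record keeps the interface the same for single field and multi field rules, at the cost of each validator having to know the ordinals it cares about.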
Validation Map Reduce
The map reduce class is very simple, involving only mappers. For every record, it applies all the validators for each field. All validation violations are summarized and written to a file in HDFS. The input data used is product data with the following fields. The fields category and brand are categorical.
- product ID
Original records are output as is. If the configuration parameter filter.invalid.records is set to true, the invalid records are filtered out from the output. The validation report is written to a file in HDFS as specified through the parameter invalid.data.file.path. Here is some sample output.
958YJY4QIE,laptop,asus,8946WB,888,52 field:4 max
63M9YMVU99,wearables,fitbit,IZ8KH4,123,155 field:5 max
DV5VTR1U4R,,apple,7NC446,158,100 field:1 membership notMissing
For each record that was found to be invalid, it generates the following output
- actual record
- list of field ordinals
- for each field, list of tags for validators that resulted in data being invalid
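If you want to post-process the validation report, a line in this format can be split back into the record and its violations. The Python sketch below is my reading of the sample output rather than a documented format. It assumes the record itself never contains the text " field:" and that validator tags are separated by spaces. The function name is hypothetical.

```python
def parse_report_line(line):
    """Split a report line into (record fields, {ordinal: [validator tags]})."""
    record_part, *field_parts = line.strip().split(" field:")
    violations = {}
    for part in field_parts:
        ordinal, *tags = part.split()
        violations[int(ordinal)] = tags
    return record_part.split(","), violations

line = "DV5VTR1U4R,,apple,7NC446,158,100 field:1 membership notMissing"
record, violations = parse_report_line(line)
print(record)
print(violations)
```

For this line the record comes back with an empty second field, and the violations map field 1 to the tags membership and notMissing.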
Configuration can be provided either through a properties file or through a Typesafe HCONF configuration file. If using HCONF, the configuration file needs to be placed in HDFS at the location defined in the configuration parameter validator.config.file.path. An example HCONF configuration file is available.
Configuration is defined in several ways depending on the level of details. The properties file only defines high level configuration e.g., what validator should be used for what field.
Many of the validators will use metadata from JSON schema file as configuration parameters. Depending on the validator type, some of them will be used. You may want to look inside the classes StringValidator and NumericalValidator to find out how they are used.
Configuration for most validators can be provided through the properties file. However, some validators require complex objects to be passed as part of a context. The configuration for such validators is provided through an hconf file, which is the format for Typesafe configuration. It is very similar to JSON.
You can use either properties file or hconf file for validator configurations. More details on configuration can be found in the tutorial document with link provided below.
To try the exercise in this post, you can refer to the tutorial for the details of the steps. I have provided a set of validators that I think are commonly used. To add other validators, you can either send me a pull request or file an issue on GitHub.
For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.