Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data that is mostly grossly invalid. It can save a lot of pain and headache to do some simple sanity checks before feeding the data into a complex processing pipeline. If you suspect that the data is mostly invalid, the validation checks can be performed on a fraction of the data by sub sampling. We will go through a Spark job that does what I just described. It's part of my open source project chombo. A Hadoop based implementation is also available.
A solution for more rigorous data validation, with many out of the box data validators, is also available in chombo.
Simple Validation Check
Only some basic validation checks are performed, as listed below. For more rigorous validation checks, the other solution I alluded to should be used, which provides many out of the box field validators.
- Count of number of fields
- Data type checks for individual fields
- Check for missing fields
The field data types supported are as follows. They are provided as a list in the configuration.
For String data there is no validation check. For Date type data, the validation check is performed only when a date formatter is provided through configuration.
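The per-field rules above can be illustrated with a minimal sketch in plain Python. This is not chombo's implementation: the function name `field_is_valid`, the type labels "string", "int", "float" and "date", and the choice to treat an empty value as a missing field are all assumptions made for illustration.

```python
from datetime import datetime


def field_is_valid(value, data_type, date_format=None):
    """Return True if the raw string value passes the check for data_type.

    The type labels used here are hypothetical, not chombo's configuration values.
    """
    if value == "":
        return False  # assumption: an empty value counts as a missing field
    if data_type == "string":
        return True  # no validation check for string data
    if data_type == "int":
        try:
            int(value)
            return True
        except ValueError:
            return False
    if data_type == "float":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if data_type == "date":
        if date_format is None:
            return True  # no date formatter configured, so no check
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            return False
    raise ValueError("unsupported data type: " + data_type)
```

For example, `field_is_valid("YOKF9B05", "int")` is false, while `field_is_valid("2016-10-07 04:49:33", "date")` is true because no date format was supplied.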
The Spark implementation essentially consists of a set of transformations, including maps and filters. The output consists of all the invalid records found.
The first test case I ran is one where the number of fields in the data is different from what is expected. When this validation check fails, individual fields are not checked for data type. An invalid record is prepended with a special marker, which can be configured. Here is some sample output:
[xx]H1J8R77KG9UH,ZX9I4F1A11,9MUMVX,2016-10-07 04:49:33,7WZ2515J,1,57.98,UKP
[xx]62CHT3GI57W4,LLP0312UT3,7M5WGU,2016-10-07 05:09:30,PH9HAA8H,5,140.31,UKP
[xx]62CHT3GI57W4,LLP0312UT3,7M5WGU,2016-10-07 05:09:30,GMLX96AE,2,24.77,UKP
[xx]62CHT3GI57W4,LLP0312UT3,7M5WGU,2016-10-07 05:09:30,0X8QEOP6,2,175.75,UKP
[xx]PM0NLARU1ID5,7YC1MER2R5,27BMH0,2016-10-07 05:15:13,X8XNH0W8,5,51.55,UKP
[xx]0SA68KV2KWI2,E6USN4249V,27BMH0,2016-10-07 05:32:30,1C489W9J,5,227.39,UKP
[xx]3MRE3KH35111,Y66J97ZCP5,9MUMVX,2016-10-07 06:12:04,RF5XOLWS,4,325.60,UKP
The second test case is one where a field has an invalid data type across all rows. Here is some sample output. For the 5th field, an integer is expected, but a string is found. The invalid field is prepended with a special marker, which is configurable.
614CQ578HW31,PG236OUMWF,7M5WGU,2016-10-06 22:46:45,[x]YOKF9B05,4,287.89,UKP
R5B6W9T1UN97,DDX49MRTGE,27BMH0,2016-10-06 22:49:26,[x]D2SDU96O,2,109.06,UKP
KQNMTM1I48V3,L907379HO9,27BMH0,2016-10-06 23:38:39,[x]5E7TF3FP,2,194.02,UKP
KWYUJJ2GLXMC,MIC85UA0O5,H1I9CI,2016-10-06 23:55:16,[x]X3P533R3,1,95.56,UKP
O7434RLB1DUZ,PGGWL6LUMC,27BMH0,2016-10-07 00:16:56,[x]Y4E74ZO3,2,47.27,UKP
ID4FQST20DBY,D9BB5VRORC,154YYM,2016-10-07 00:27:23,[x]G29AKWRI,5,294.79,UKP
OQXM5FSQABQ3,95210UERPF,H1I9CI,2016-10-07 00:41:29,[x]55HE23U5,4,72.85,UKP
H2AE7Y29N5C4,AQ77JS8H7H,27BMH0,2016-10-07 01:58:00,[x]59G6576G,4,171.08,UKP
B3B95AF39J5R,0U5TTFJ1KZ,27BMH0,2016-10-07 02:01:51,[x]3ME7I3T1,2,182.41,UKP
B3B95AF39J5R,0U5TTFJ1KZ,27BMH0,2016-10-07 02:01:51,[x]QUA61H5Z,3,292.30,UKP
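The marking behavior in the two test cases can be sketched in plain Python as a map followed by a filter. This is only an illustration under stated assumptions, not the actual Spark job: `validate_record`, `is_int`, `non_empty` and the `schema` list are hypothetical names, and the default markers simply mirror the "[xx]" and "[x]" seen in the sample output.

```python
def is_int(value):
    """True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def non_empty(value):
    """True if the field is present (not an empty string)."""
    return value != ""


def validate_record(line, checks, delim=",", record_marker="[xx]", field_marker="[x]"):
    """Return the marked record if it is invalid, or None if it is valid."""
    fields = line.split(delim)
    if len(fields) != len(checks):
        # Wrong field count: mark the whole record and skip the field checks.
        return record_marker + line
    marked = [v if check(v) else field_marker + v for v, check in zip(fields, checks)]
    return delim.join(marked) if marked != fields else None


# Hypothetical schema: 8 fields, with integers expected in the 5th and 6th.
schema = [non_empty] * 4 + [is_int, is_int, non_empty, non_empty]
lines = [
    "614CQ578HW31,PG236OUMWF,7M5WGU,2016-10-06 22:46:45,YOKF9B05,4,287.89,UKP",
    "614CQ578HW31,PG236OUMWF,7M5WGU,2016-10-06 22:46:45,12345,4,287.89,UKP",
]
# Map every line to a validation result, then keep only the invalid ones.
invalid = [r for r in map(lambda l: validate_record(l, schema), lines) if r is not None]
```

Here `invalid` holds only the first record, with its 5th field marked. An empty result corresponds to the "no output" case for valid data.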
Output is generated only when invalid data is found. Lack of output indicates valid data.
When it's known that the majority of the data might be invalid, all records need not be processed and the data can be sub sampled. Validation checks are then performed on the sub sampled data.
If it's anticipated that at least a fraction x of the data is invalid, then the sub sampling fraction should be set to (1 - x) or higher using the parameter sample.fraction, since a sample larger than the valid fraction of the data must contain at least one invalid record. However, if your guess is incorrect, no invalid data may be found in the sub sample, although the complete data set may have many invalid records.
Setting the sub sampling fraction is optional. If you are not sure about the quality of the data, it may be best not to set the sub sampling fraction parameter. All records are processed when this parameter is not set.
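To make the sub sampling step concrete, here is a minimal sketch in plain Python that keeps each record independently with a given probability, similar in spirit to sampling without replacement on a Spark RDD. The function name `sub_sample` and its `seed` argument are assumptions for illustration; how chombo actually applies sample.fraction is not shown here.

```python
import random


def sub_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    The sample size is therefore only approximately fraction * len(records).
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


# Hypothetical data: 1000 records, of which the last 900 are invalid (x = 0.9).
records = ["valid"] * 100 + ["invalid"] * 900
sample = sub_sample(records, fraction=0.1, seed=42)
found_invalid = any(r == "invalid" for r in sample)
```

Validation would then run only on `sample`. Because the sampling is random, a small fraction combined with an overestimated x can yield a sample with no invalid records at all, which is the risk described above.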
We have gone through a Spark based solution for simple data validation and sanity check. If you want to execute the use case, you could follow the tutorial. If you prefer Hadoop, here is the Hadoop based implementation for the same solution.