Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data that is mostly grossly invalid. It can save a lot of pain and headache if we perform some simple sanity checks before feeding the data into a complex processing pipeline. If you suspect that the data is mostly invalid, the validation checks can be performed on a fraction of the data by sub sampling. We will go through a Spark job that does what I just described. It’s part of my open source project chombo. A Hadoop based implementation is also available.

A solution for more rigorous data validation, with many out-of-the-box data validators, is also available in chombo.

Simple Validation Check

Only some basic validation checks are performed, as listed below. For more rigorous validation checks, the other solution I alluded to should be used; it provides many out-of-the-box field validators.

  1. Count of number of fields
  2. Data type checks for individual fields
  3. Check for missing fields
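
As an illustration, the three checks above can be sketched in plain Python (a hypothetical helper of my own, not chombo’s actual code); the schema here is simply the list of expected field types, one entry per field:

```python
def check_record(line, schema, delim=","):
    """Return a list of problems found in one delimited record.
    schema is the list of expected field types, one per field."""
    fields = line.split(delim)
    if len(fields) != len(schema):
        # on a field count mismatch, per-field checks are skipped
        return ["field count %d, expected %d" % (len(fields), len(schema))]
    # a blank or whitespace-only field counts as missing
    return ["missing field at index %d" % i
            for i, f in enumerate(fields) if not f.strip()]
```

A record that passes all checks yields an empty list, e.g. `check_record("a,b,c", ["string"] * 3)` returns `[]`.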

The supported field data types are as follows. The expected types are provided as a list in the configuration.

  1. String
  2. Integer
  3. Long
  4. Double
  5. Date

For String data there is no validation check. For Date type data, the validation check is performed only when a date formatter is provided through the configuration.
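
As a rough sketch of the per-type checks just described (my own illustration; chombo’s validators and type names may differ):

```python
from datetime import datetime

def is_valid(value, field_type, date_format=None):
    """True if the field value conforms to its declared type."""
    if field_type == "string":
        return True                      # strings are never rejected
    if field_type in ("int", "long"):
        try:
            int(value)                   # Python ints cover both ranges
            return True
        except ValueError:
            return False
    if field_type == "double":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if field_type == "date":
        if date_format is None:
            return True                  # no formatter configured: skip
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            return False
    return False                         # unknown type declaration
```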


The Spark implementation essentially consists of a set of transformations, including maps and filters. The output consists of all the invalid records found.
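
A plain-Python stand-in for that map/filter chain (the actual job uses Spark RDD transformations; the `marker` argument mirrors the configurable invalid-record marker):

```python
def find_invalid(lines, expected_fields, marker="[xx]", delim=","):
    """Keep only records with the wrong field count,
    each prepended with the invalid-record marker."""
    bad = filter(lambda ln: len(ln.split(delim)) != expected_fields, lines)
    return [marker + ln for ln in bad]
```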

The first test case I ran is one where the number of fields in the data differs from what is expected. When this kind of validation check fails, the individual fields are not checked for data type validity. An invalid record is prepended with a special marker, which is configurable. Here is some sample output.

[xx]H1J8R77KG9UH,ZX9I4F1A11,9MUMVX,2016-10-07 04:49:33,7WZ2515J,1,57.98,UKP
[xx]62CHT3GI57W4,LLP0312UT3,7M5WGU,2016-10-07 05:09:30,PH9HAA8H,5,140.31,UKP
[xx]62CHT3GI57W4,LLP0312UT3,7M5WGU,2016-10-07 05:09:30,GMLX96AE,2,24.77,UKP
[xx]62CHT3GI57W4,LLP0312UT3,7M5WGU,2016-10-07 05:09:30,0X8QEOP6,2,175.75,UKP
[xx]PM0NLARU1ID5,7YC1MER2R5,27BMH0,2016-10-07 05:15:13,X8XNH0W8,5,51.55,UKP
[xx]0SA68KV2KWI2,E6USN4249V,27BMH0,2016-10-07 05:32:30,1C489W9J,5,227.39,UKP
[xx]3MRE3KH35111,Y66J97ZCP5,9MUMVX,2016-10-07 06:12:04,RF5XOLWS,4,325.60,UKP

The second test case involves a field that has an invalid data type across all rows. Here is some sample output. For the 5th field, an integer is expected, but a string is found. The invalid field is prepended with a special marker, which is configurable.

614CQ578HW31,PG236OUMWF,7M5WGU,2016-10-06 22:46:45,[x]YOKF9B05,4,287.89,UKP
R5B6W9T1UN97,DDX49MRTGE,27BMH0,2016-10-06 22:49:26,[x]D2SDU96O,2,109.06,UKP
KQNMTM1I48V3,L907379HO9,27BMH0,2016-10-06 23:38:39,[x]5E7TF3FP,2,194.02,UKP
KWYUJJ2GLXMC,MIC85UA0O5,H1I9CI,2016-10-06 23:55:16,[x]X3P533R3,1,95.56,UKP
O7434RLB1DUZ,PGGWL6LUMC,27BMH0,2016-10-07 00:16:56,[x]Y4E74ZO3,2,47.27,UKP
ID4FQST20DBY,D9BB5VRORC,154YYM,2016-10-07 00:27:23,[x]G29AKWRI,5,294.79,UKP
OQXM5FSQABQ3,95210UERPF,H1I9CI,2016-10-07 00:41:29,[x]55HE23U5,4,72.85,UKP
H2AE7Y29N5C4,AQ77JS8H7H,27BMH0,2016-10-07 01:58:00,[x]59G6576G,4,171.08,UKP
B3B95AF39J5R,0U5TTFJ1KZ,27BMH0,2016-10-07 02:01:51,[x]3ME7I3T1,2,182.41,UKP
B3B95AF39J5R,0U5TTFJ1KZ,27BMH0,2016-10-07 02:01:51,[x]QUA61H5Z,3,292.30,UKP
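
Field-level marking can be sketched the same way (a hypothetical helper; for brevity only the integer check is implemented here):

```python
FIELD_MARKER = "[x]"    # configurable in the actual job

def type_ok(value, ftype):
    """Minimal checker for this illustration: only int is validated."""
    if ftype == "int":
        try:
            int(value)
            return True
        except ValueError:
            return False
    return True

def mark_invalid_fields(line, schema, delim=","):
    """Prepend the marker to each field failing its type check."""
    marked = [v if type_ok(v, t) else FIELD_MARKER + v
              for v, t in zip(line.split(delim), schema)]
    return delim.join(marked)
```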

Output is generated only when invalid data is found; a lack of output indicates valid data.

Sub Sampling

When it’s known that the majority of the data might be invalid, not all records need to be processed, and the data can be sub sampled. Validation checks are then performed on the sub sampled data.

If it’s anticipated that at least a fraction x of the data is invalid, then the sub sampling fraction should be set to (1 - x) or higher using the parameter sample.fraction. However, if your guess is incorrect, no invalid data may be found in the sub sample, even though the complete data set may contain many invalid records.

Setting the sub sample fraction is optional. If you are not sure about the quality of the data, it may be best not to set the sub sample fraction parameter. All records are processed when this parameter is not set.
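
The effect of the sample.fraction parameter can be illustrated with Bernoulli sampling, which is what Spark’s without-replacement RDD sample does (a plain-Python sketch, not the actual job code):

```python
import random

def sub_sample(lines, fraction, seed=42):
    """Bernoulli sampling: each record is kept independently with
    probability `fraction`, analogous to RDD.sample(False, fraction)."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]
```

Because the sample is random, the returned size only approximates fraction * len(lines); with a mostly invalid data set, even a small fraction should surface invalid records.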

Wrapping Up

We have gone through a Spark based solution for simple data validation and sanity checking. If you want to execute the use case, you can follow the tutorial. If you prefer Hadoop, here is the Hadoop based implementation of the same solution.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop and Spark deployments on the cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
This entry was posted in ETL, Hadoop and Map Reduce, Spark. Bookmark the permalink.

