Tag Archives: data validation

Pluggable Rule Driven Data Validation with Spark

Data validation is an essential component in any ETL data pipeline. As we all know most Data Engineers and Scientist spend most of their time cleaning and preparing their data before they can even get to the core processing of … Continue reading

Posted in Big Data, Data Science, ETL, Spark | Tagged , | 2 Comments

Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data, where most of the data is just grossly invalid. It might save lot of pain and headache, if we could do some simple sanity checks before feeding … Continue reading

Posted in ETL, Hadoop and Map Reduce, Spark | Tagged | 1 Comment

Validating Big Data

Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half  of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation … Continue reading

Posted in Big Data, data quality, ETL, Hadoop and Map Reduce | Tagged , | 11 Comments