Category Archives: data quality

Profiling Big Data

Data profiling is the process of examining data to learn its important characteristics. It’s an important part of any ETL process, and it’s often necessary before embarking on any serious analytic work. I have implemented … Continue reading

Posted in Big Data, Data Profiling, data quality, ETL, Hadoop and Map Reduce | 5 Comments

Anomaly Detection with Robust Zscore

Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and the mean, normalized by the standard deviation. A … Continue reading
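As a rough illustration of the Zscore definition in this excerpt, here is a minimal Python sketch. It covers only the plain Zscore as defined above, not the robust variant named in the post title, which the excerpt does not describe. The function names `zscores` and `anomalies`, the use of the population standard deviation, and the default threshold are all assumptions made for illustration, not the post's implementation.

```python
from statistics import mean, pstdev


def zscores(values):
    """Return |x - mean| / standard deviation for each value."""
    mu = mean(values)
    # Population standard deviation; an assumption, the excerpt does not
    # say whether the population or sample form is intended.
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so no value deviates from the mean.
        return [0.0 for _ in values]
    return [abs(x - mu) / sigma for x in values]


def anomalies(values, threshold=3.0):
    """Return the values whose Zscore exceeds the threshold.

    The threshold is a hypothetical tuning parameter.
    """
    return [x for x, z in zip(values, zscores(values)) if z > threshold]


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(zscores(data))
    print(anomalies(data, threshold=1.5))
```

With a small data set like this one, a lower threshold than the default is needed before any value is flagged; the right cutoff depends on the data.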

Posted in Anomaly Detection, Big Data, data quality, Data Science, Hadoop and Map Reduce | 6 Comments

Validating Big Data

Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation … Continue reading

Posted in Big Data, data quality, ETL, Hadoop and Map Reduce | 8 Comments