Category Archives: ETL

JSON to Relational Mapping with Spark

If there is one data format that’s ubiquitous, it’s JSON. Whether you are calling an API or exporting data from some system, the format is most likely to be JSON these days. However, many databases cannot handle JSON and you … Continue reading
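The post describes mapping JSON into relational rows with Spark; as a minimal sketch of the core idea in plain Python (the function name `flatten_record` and the dotted-key convention are illustrative assumptions, not taken from the post), nested objects can be flattened into a single-level row:

```python
def flatten_record(record, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys,
    mimicking how nested JSON maps onto flat relational columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            flat.update(flatten_record(value, name + "."))
        else:
            flat[name] = value
    return flat

row = flatten_record({"id": 1, "addr": {"city": "NYC", "zip": "10001"}})
# row == {"id": 1, "addr.city": "NYC", "addr.zip": "10001"}
```

In Spark the same flattening is typically expressed by selecting nested fields into top-level columns; arrays would additionally need to be exploded into separate rows.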

Posted in Big Data, ETL, Spark | Tagged , | Leave a comment

Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data sets where most of the data is grossly invalid. It might save a lot of pain and headache if we could do some simple sanity checks before feeding … Continue reading
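A minimal sketch of the kind of cheap pre-flight check the post argues for, written in plain Python rather than Spark (the function name, the `required_fields` parameter, and the 10% invalid-row cap are illustrative assumptions):

```python
def sanity_check(rows, min_rows=1, required_fields=("id",), max_bad_ratio=0.1):
    """Cheap sanity checks before running the full pipeline:
    enough rows, and the fraction of rows missing required fields
    stays below a configured cap."""
    if len(rows) < min_rows:
        return False
    bad = sum(
        1 for r in rows
        if any(r.get(f) is None for f in required_fields)
    )
    return bad / len(rows) <= max_bad_ratio
```

On a real Spark job the same counts would come from aggregations over a sampled RDD or DataFrame, so the check stays cheap relative to the pipeline it guards.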

Posted in ETL, Hadoop and Map Reduce, Spark | Tagged | Leave a comment

Transforming Big Data

This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Considering the fact that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant … Continue reading
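The cleaning and munging step the post refers to can be sketched as applying per-field transformation functions to each record; this plain-Python version (the helper name `apply_transforms` and the sample fields are illustrative assumptions, not the post's API) shows the shape of the idea:

```python
def apply_transforms(row, transforms):
    """Apply per-field transformation functions to a record,
    leaving missing or null fields untouched."""
    out = dict(row)
    for field, fn in transforms.items():
        if out.get(field) is not None:
            out[field] = fn(out[field])
    return out

clean = apply_transforms(
    {"name": "  Alice ", "age": "42"},
    {"name": str.strip, "age": int},  # trim whitespace, parse the number
)
# clean == {"name": "Alice", "age": 42}
```

In a Hadoop or Spark job the same table of field-to-function rules would drive a map over every record.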

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | Tagged , | 2 Comments

Profiling Big Data

Data profiling is the process of examining data to learn about its important characteristics. It’s an important part of any ETL process, and it’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented … Continue reading
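As a minimal sketch of what per-column profiling computes (the function name and the particular statistics chosen here are illustrative assumptions; the post's Hadoop implementation would compute these as distributed aggregations):

```python
def profile_column(values):
    """Summarize one column: row count, null count,
    distinct values, and the observed min/max."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

stats = profile_column([1, None, 2, 2])
# stats == {"count": 4, "nulls": 1, "distinct": 2, "min": 1, "max": 2}
```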

Posted in Big Data, Data Profiling, data quality, ETL, Hadoop and Map Reduce | Tagged , | 5 Comments

Validating Big Data

Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation … Continue reading
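Rule-driven validation of the kind the post covers can be sketched as checking each field against a list of predicates; this plain-Python version (the `validate_row` name and the rule-table layout are illustrative assumptions, not the post's actual configuration format) returns the failures per row:

```python
def validate_row(row, rules):
    """Check a record against per-field rules; each rule is a
    (predicate, message) pair. Returns the list of failures."""
    errors = []
    for field, checks in rules.items():
        for predicate, message in checks:
            if not predicate(row.get(field)):
                errors.append((field, message))
    return errors

rules = {
    "age": [
        (lambda v: v is not None, "missing"),
        (lambda v: v is None or 0 <= v <= 150, "out of range"),
    ],
}
# validate_row({"age": 200}, rules) -> [("age", "out of range")]
```

In a Hadoop or Spark pipeline, rows whose error list is non-empty would typically be routed to a rejects output rather than dropped silently.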

Posted in Big Data, data quality, ETL, Hadoop and Map Reduce | Tagged , | 8 Comments

Data Quality Control With Outlier Detection

For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent on data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for … Continue reading
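In contrast to statically defined rules, an outlier detector learns what "normal" looks like from the data itself. A minimal sketch using the classic z-score criterion (one common outlier-detection technique; the post does not necessarily use this exact method, and the threshold is an illustrative assumption):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds
    `threshold` standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]
```

A production quality-control job would compute the mean and deviation on trusted historical data and then score incoming batches against them, rather than against the batch itself.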

Posted in Big Data, Data Science, ETL, Hadoop and Map Reduce, Internet of Things, Outlier Detection, Statistics | Tagged , , , , | 1 Comment

Bulk Insert, Update and Delete in Hadoop Data Lake

A Hadoop Data Lake, unlike a traditional data warehouse, does not enforce schema on write and serves as a repository of data in different formats from various sources. If the data collected in a data lake is immutable, it simply accumulates in an append-only … Continue reading
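When the data is mutable, applying bulk inserts, updates, and deletes on HDFS usually means rewriting files with the change records merged in. A minimal sketch of that merge logic in plain Python (the `merge_changes` name and the change-record layout with `key`/`op`/`value` fields are illustrative assumptions):

```python
def merge_changes(existing, changes):
    """Merge insert/update/delete change records into a keyed
    snapshot, as when rewriting a data lake partition."""
    merged = dict(existing)
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            merged.pop(key, None)       # drop the record if present
        else:                            # insert or update
            merged[key] = change["value"]
    return merged

snapshot = merge_changes(
    {"a": 1, "b": 2},
    [
        {"key": "a", "op": "update", "value": 10},
        {"key": "c", "op": "insert", "value": 3},
        {"key": "b", "op": "delete"},
    ],
)
# snapshot == {"a": 10, "c": 3}
```

On Hadoop this merge would run as a join between the existing partition and the change set, writing the result to a new set of files that replaces the old partition.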

Posted in Big Data, ETL, Hadoop and Map Reduce, Hive | Tagged , , , | 10 Comments