Tag Archives: ETL

Removing Duplicates from Order Data Using Spark

If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process and parallel cluster processing with Hadoop or Spark becomes … Continue reading

Posted in Big Data, Data Science, ETL, Spark | Tagged , | Leave a comment

Processing Missing Values with Hadoop

Missing values are just part of life in the data processing world. In most cases you can not simply ignore the missing values as it may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing … Continue reading

Posted in Big Data, Data Profiling, Data Science, ETL, Hadoop and Map Reduce | Tagged , , | Leave a comment

Transforming Big Data

This is a sequel to my earlier posts on Hadoop based ETL covering validation and profiling. Considering  the fact that in most data projects more than 50% of the time is spent on  data cleaning and munging, I have added significant … Continue reading

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | Tagged , | 3 Comments

Profiling Big Data

Data profiling is the process of examining data to learn about important characteristics of data. It’s an important part of any ETL process. It’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented … Continue reading

Posted in Big Data, Data Profiling, data quality, ETL, Hadoop and Map Reduce | Tagged , | 5 Comments

Validating Big Data

Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half  of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation … Continue reading

Posted in Big Data, data quality, ETL, Hadoop and Map Reduce | Tagged , | 8 Comments

Data Quality Control With Outlier Detection

For many Big Data projects, it has been reported  that significant part of the time, sometimes up to 70-80% of time,  is spent in data cleaning and preparation. Typically, in most ETL tools,  you define constraints and rules statically for … Continue reading

Posted in Big Data, Data Science, ETL, Hadoop and Map Reduce, Internet of Things, Outlier Detection, Statistics | Tagged , , , , | 1 Comment

Folding, Cross Validation with Map Reduce

In my current web log data mining project using Hadoop, I am trying to build a predictive model for predicting certain attributes of the web site visitor.  I have number of ETL (Extract, Transform and Load)tasks to prepare the data … Continue reading

Posted in Hadoop and Map Reduce | Tagged , , , | 3 Comments