Tag Archives: ETL

Missing Value Imputation with Restricted Boltzmann Machine Neural Network

Missing values are a common problem in many real-world data sets. There are various techniques for imputing missing values. We will use a kind of neural network called an RBM to impute them. Restricted Boltzmann Machines (RBMs) are stochastic …

Posted in Data Science, Deep Learning, ETL, Machine Learning, Python

Bulk Mutation in an Integration Data Lake with Spark

Data lakes act as repositories of data from various sources, possibly in different formats. They can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File …

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark

Data Type Auto Discovery with Spark

In the life of a Data Scientist, it’s not uncommon to run into a data set with little or no knowledge about the data. You may be interested in learning about such data with missing metadata through …

Posted in Big Data, Data Profiling, Data Science, Scala, Spark

Data Normalization with Spark

Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless …
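
To make "bringing all the attribute values within some desired range" concrete, here is a minimal sketch of min-max scaling, one common normalization technique. The excerpt does not say which method the post uses, so the choice of min-max scaling, the function name `min_max_normalize`, and the sample values are all illustrative assumptions; the sketch is plain Python rather than Spark.

```python
from typing import List


def min_max_normalize(values: List[float], lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Rescale values linearly so the smallest maps to lo and the largest to hi.

    Hypothetical helper for illustration only; not taken from the post.
    """
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values are identical, so there is no spread to rescale.
        return [lo for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


# Two made-up attributes on very different scales.
ages = [20.0, 30.0, 40.0]
incomes = [20000.0, 50000.0, 80000.0]

# After scaling, both attributes fall in the same [0, 1] range.
scaled_ages = min_max_normalize(ages)
scaled_incomes = min_max_normalize(incomes)
```

After scaling, both attributes span the same range, so neither dominates a distance or gradient computation merely because of its units.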

Posted in Big Data, Data Science, ETL, Machine Learning, Spark

Removing Duplicates from Order Data Using Spark

If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process, and parallel cluster processing with Hadoop or Spark becomes …

Posted in Big Data, Data Science, ETL, Spark | 2 Comments

Processing Missing Values with Hadoop

Missing values are just part of life in the data processing world. In most cases you cannot simply ignore the missing values, as doing so may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing …

Posted in Big Data, Data Profiling, Data Science, ETL, Hadoop and Map Reduce | 2 Comments

Transforming Big Data

This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Considering the fact that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant …

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | 4 Comments