Tag Archives: ETL

Duplicate Data Detection with Neural Network and Contrastive Learning

Posted on July 21, 2021 by Pranab

Duplicate data is a ubiquitous problem in the data world. It often appears when data from different silos are consolidated. It could be an issue in an analytic project based on data aggregated from various sources. The training data for … Continue reading →

Posted in AI, Data Science, Deep Learning, ETL, Python, PyTorch | Tagged dedup, duplicate, ETL | 3 Comments

Missing Value Imputation with Restricted Boltzmann Machine Neural Network

Posted on August 26, 2019 by Pranab

Missing values are a common problem in many real-world data sets. There are various techniques for imputing missing values. We will use a kind of neural network called a Restricted Boltzmann Machine (RBM) for imputing missing values. RBMs are stochastic … Continue reading →

Posted in Data Science, Deep Learning, ETL, Machine Learning, Python | Tagged ETL, missing value imputation, RBM, scikit | Leave a comment

Bulk Mutation in an Integration Data Lake with Spark

Posted on November 19, 2018 by Pranab

Data lakes act as repositories of data from various sources, possibly in different formats. They can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File … Continue reading →

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark | Tagged data integration, data lake, ETL, Hadoop, HDFS | 1 Comment

Data Type Auto Discovery with Spark

Posted on January 10, 2018 by Pranab

In the life of a Data Scientist, it’s not uncommon to run into a data set with little or no knowledge about the data. You may be interested in learning about such data with missing metadata through … Continue reading →

Posted in Big Data, Data Profiling, Data Science, Scala, Spark | Tagged data type auto discovery, ETL | Leave a comment

Data Normalization with Spark

Posted on December 5, 2017 by Pranab

Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless … Continue reading →

Posted in Big Data, Data Science, ETL, Machine Learning, Spark | Tagged ETL, normalization, spark | Leave a comment

Removing Duplicates from Order Data Using Spark

Posted on November 7, 2017 by Pranab

If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process, and parallel cluster processing with Hadoop or Spark becomes … Continue reading →

Posted in Big Data, Data Science, ETL, Spark | Tagged dedup, ETL | 2 Comments

Processing Missing Values with Hadoop

Posted on July 26, 2017 by Pranab

Missing values are just part of life in the data processing world. In most cases you cannot simply ignore the missing values, as they may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing … Continue reading →

Posted in Big Data, Data Profiling, Data Science, ETL, Hadoop and Map Reduce | Tagged ETL, imputation, missing value | 2 Comments

Transforming Big Data

Posted on November 17, 2015 by Pranab

This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant … Continue reading →

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | Tagged data transformation, ETL | 4 Comments

Profiling Big Data

Posted on September 22, 2015 by Pranab

Data profiling is the process of examining data to learn about important characteristics of data. It’s an important part of any ETL process. It’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented … Continue reading →

Posted in Big Data, Data Profiling, data quality, ETL, Hadoop and Map Reduce | Tagged data profiling, ETL | 6 Comments

Validating Big Data

Posted on July 28, 2015 by Pranab

Data quality is a thorny issue in most Big Data projects. It’s been reported that more than half of the time spent in Big Data projects goes towards data cleansing and preparation. In this post, I will cover data validation … Continue reading →

Posted in Big Data, data quality, ETL, Hadoop and Map Reduce | Tagged data validation, ETL | 11 Comments
