Category Archives: Data Transformation

Handling Categorical Feature Variables in Machine Learning using Spark

Posted on March 19, 2018 by Pranab

Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real-world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms … Continue reading →

Posted in Big Data, Data Science, Data Transformation, ETL, Scala, Spark | Tagged binary dummy variable, categorical variable, training data set | 1 Comment

Combating High Cardinality Features in Supervised Machine Learning

Posted on October 9, 2017 by Pranab

A typical training data set for real-world machine learning problems has a mixture of different types of data, including numerical and categorical. Many machine learning algorithms cannot handle categorical variables. Even for those that can, categorical data can pose a serious problem … Continue reading →

Posted in Big Data, Data Science, Data Transformation, ETL, Hadoop and Map Reduce, Predictive Analytic | Tagged categorical attributes, data pre processing, high cardinality, supply chain | 5 Comments

Transforming Big Data

Posted on November 17, 2015 by Pranab

This is a sequel to my earlier posts on Hadoop-based ETL covering validation and profiling. Considering that in most data projects more than 50% of the time is spent on data cleaning and munging, I have added significant … Continue reading →

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | Tagged data transformation, ETL | 4 Comments
