Category Archives: ETL

Combating High Cardinality Features in Supervised Machine Learning

Typical training data set for real world machine learning problems has mixture of different types of data including numerical and categorical. Many machine learning algorithms can not handle categorical variables. Those that can, categorical data can pose a serious problem … Continue reading

Posted in Big Data, Data Science, Data Transformation, ETL, Hadoop and Map Reduce, Predictive Analytic | Tagged , , , | Leave a comment

Handling Rare Events and Class Imbalance in Predictive Modeling for Machine Failure

Most supervised Machine Learning algorithms face difficulty when there is class imbalance in the training data i.e., amount of data belonging one class heavily outnumber the other class. However, there are may real life problems where we encounter this situation e.g., … Continue reading

Posted in Big Data, Data Science, ETL, Hadoop and Map Reduce | Tagged , , , , | Leave a comment

Processing Missing Values with Hadoop

Missing values are just part of life in the data processing world. In most cases you can not simply ignore the missing values as it may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing … Continue reading

Posted in Big Data, Data Profiling, Data Science, ETL, Hadoop and Map Reduce | Tagged , , | Leave a comment

JSON to Relational Mapping with Spark

If there one data format that’s ubiquitous, it’s JSON. Whether  you are calling an API, or exporting data from some system, the format is most likely to be JSON these days. However many databases can not handle  JSON and you … Continue reading

Posted in Big Data, ETL, Spark | Tagged , | Leave a comment

Simple Sanity Checks for Data Correctness with Spark

Sometimes when running a complex data processing pipeline with Hadoop or Spark, you may encounter data, where most of the data is just grossly invalid. It might save lot of pain and headache, if we could do some simple sanity checks before feeding … Continue reading

Posted in ETL, Hadoop and Map Reduce, Spark | Tagged | Leave a comment

Transforming Big Data

This is a sequel to my earlier posts on Hadoop based ETL covering validation and profiling. Considering  the fact that in most data projects more than 50% of the time is spent on  data cleaning and munging, I have added significant … Continue reading

Posted in Big Data, Data Transformation, ETL, Hadoop and Map Reduce | Tagged , | 3 Comments

Profiling Big Data

Data profiling is the process of examining data to learn about important characteristics of data. It’s an important part of any ETL process. It’s often necessary to do data profiling before embarking on any serious analytic work. I have implemented … Continue reading

Posted in Big Data, Data Profiling, data quality, ETL, Hadoop and Map Reduce | Tagged , | 5 Comments