Author Archives: Pranab

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay Area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early-stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and about green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NoSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed-form solutions.

Combating High Cardinality Features in Supervised Machine Learning

A typical training data set for real-world machine learning problems has a mixture of different types of data, including numerical and categorical. Many machine learning algorithms cannot handle categorical variables. For those that can, categorical data can pose a serious problem … Continue reading

Posted in Big Data, Data Science, Data Transformation, ETL, Hadoop and Map Reduce, Predictive Analytic

Handling Rare Events and Class Imbalance in Predictive Modeling for Machine Failure

Most supervised Machine Learning algorithms face difficulty when there is class imbalance in the training data, i.e., when the data belonging to one class heavily outnumbers the data belonging to the other class. However, there are many real-life problems where we encounter this situation, e.g., … Continue reading

Posted in Big Data, Data Science, ETL, Hadoop and Map Reduce

Measuring Campaign Effectiveness for an Online Service on Spark

Measuring campaign effectiveness is critical for any company to justify the marketing money being spent. Consider a company providing a free online service on signup. It’s critical for the company to convert those users so that they subscribe to a paid … Continue reading

Posted in Big Data, Data Science, Marketing Analytic, Spark

Processing Missing Values with Hadoop

Missing values are just part of life in the data processing world. In most cases you cannot simply ignore the missing values, as doing so may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing … Continue reading

Posted in Big Data, Data Profiling, Data Science, ETL, Hadoop and Map Reduce

Project Assignment Optimization with Simulated Annealing on Spark

Optimizing the assignment of people to projects is a very complex problem, and classical optimization techniques are not very useful. The topic of this post is a project assignment optimization problem where people should be assigned to projects in a way that will … Continue reading

Posted in Data Science, Optimization, Spark | 1 Comment

Mining Seasonal Products from Sales Data

The other day someone asked me how to include products with seasonal demand in recommendations based on collaborative filtering or some other technique. The solution to the problem involves two steps. The first step is to identify products with seasonal … Continue reading

Posted in Big Data, Data Mining, Data Science, eCommerce, Map Reduce, Recommendation Engine

Predicting Call Hangup in Customer Service Calls with Decision Tree and Random Forest

When customers hang up after a long wait in a call, it’s money wasted for the company. Moreover, it leaves the customer with a poor experience. It would be nice if we could predict in real time while the customer … Continue reading

Posted in Big Data, Customer Service, Hadoop and Map Reduce, Machine Learning, Predictive Analytic | 2 Comments