Category Archives: Data Science

Leave One Out Encoding for Categorical Feature Variables on Spark

Categorical feature variables is a thorny issue for many supervised Machine Learning algorithms. Many learning algorithms can not handle categorical feature variables. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with … Continue reading

Posted in Big Data, Data Science, ETL, Spark | Tagged | 1 Comment

Auto Training and Parameter Tuning for a ScikitLearn based Model for Leads Conversion Prediction

This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in … Continue reading

Posted in Data Science, Machine Learning, Python, ScikitLearn, Supervised Learning | Tagged , , , | Leave a comment

Predicting CRM Lead Conversion with Gradient Boosting using ScikitLearn

Sales leads are are generally managed and nurtured in CRM systems. It will be nice if we could predict the likelihood of any lead converting to an actual deal. This could be very beneficial in many ways e.g. proactively  providing … Continue reading

Posted in Data Science, Machine Learning, Optimization, Python, ScikitLearn | Tagged , , , | 1 Comment

Handling Categorical Feature Variables in Machine Learning using Spark

Categorical features variables i.e. features variables with fixed set of unique values  appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms … Continue reading

Posted in Big Data, Data Science, Data Transformation, ETL, Scala, Spark | Tagged , , | Leave a comment

Optimizing Discount Price for Perishable Products with Thompson Sampling using Spark

For retailers, stocking perishable products is a risky business. If a product doesn’t sell completely by the expiry date, then the remaining inventory has to be discarded and loss be taken for those items. Retailers will do whatever is necessary … Continue reading

Posted in AI, Big Data, Data Science, Reinforcement Learning, Scala, Spark | Tagged , | 2 Comments

Data Type Auto Discovery with Spark

In the life of a Data Scientist, it’s not uncommon to run into a data set with no knowledge or very little knowledge about the data. You may be interested in learning about such data with missing meta data  through … Continue reading

Posted in Big Data, Data Profiling, Data Science, Scala, Spark | Tagged , | Leave a comment

Data Normalization with Spark

Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless … Continue reading

Posted in Big Data, Data Science, ETL, Machine Learning, Spark | Tagged , , | Leave a comment