This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in my open source project avenir.
The auto training logic used here does not depend on any particular supervised learning algorithm, so it can be applied to any of them.
The framework around ScikitLearn used here facilitates building predictive models without having to write Python code. I will be adding other supervised learning algorithms to this framework. Continue reading
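To make the idea of algorithm-independent parameter tuning concrete, here is a minimal sketch of an exhaustive grid search in plain Python. It is an illustration only: the post may use a different search strategy, and `grid_search`, `train_and_score` and `toy_score` are hypothetical names, not part of avenir or ScikitLearn.

```python
from itertools import product


def grid_search(train_and_score, param_grid):
    """Try every parameter combination and return the best one with its score.

    train_and_score is a hypothetical hook: it takes a dict of parameters and
    returns a validation score where higher is better. Any learning algorithm
    can sit behind it, which is what makes the search algorithm independent.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in objective for illustration only. A real one would train a
    # model with these parameters and return a validation metric.
    lr_penalty = (params["learning_rate"] - 0.1) ** 2
    size_penalty = (params["n_estimators"] - 200) ** 2 / 1e6
    return -lr_penalty - size_penalty


grid = {"learning_rate": [0.05, 0.1, 0.2], "n_estimators": [100, 200, 400]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

Because the search only sees a scoring function, swapping in another learning algorithm means changing `train_and_score` and the grid, not the search loop.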
Sales leads are generally managed and nurtured in CRM systems. It would be nice if we could predict the likelihood of any lead converting to an actual deal. Such a prediction is useful in several ways, e.g. for proactively providing special care to weak leads and for projecting future revenue.
In this post we will go over a predictive modeling solution built on the Python ScikitLearn Machine Learning library. We will be using Gradient Boosted Trees (GBT), which is a powerful and popular supervised learning algorithm.
During the course of this blog, we will also see Continue reading
Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms are Logistic Regression, Support Vector Machine (SVM) and any Regression algorithm.
In this post we will go over a Spark based solution to alleviate the problem. The solution implementation can be found in Continue reading
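The excerpt does not say which technique the post uses. As a point of reference, one widely used workaround is one-hot encoding, which replaces a categorical column with one 0/1 column per category so that numeric algorithms can consume it. The sketch below is plain Python rather than Spark, and `one_hot_encode` and the sample records are hypothetical.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one binary column per category.

    rows is a list of dicts. Returns the encoded rows and the sorted list
    of categories that were found in the data.
    """
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator field per category.
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded, categories


# Hypothetical sample records, for illustration only.
leads = [
    {"source": "web", "visits": 3},
    {"source": "referral", "visits": 1},
    {"source": "web", "visits": 7},
]
encoded_leads, found = one_hot_encode(leads, "source")
for row in encoded_leads:
    print(row)
```

One-hot encoding grows the feature count with the number of categories, which is one reason other encodings are sometimes preferred for high cardinality variables.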
For retailers, stocking perishable products is a risky business. If a product doesn’t sell out by its expiry date, the remaining inventory has to be discarded and the loss absorbed. Retailers will do whatever is necessary to avoid being stuck with unsold units of a perishable product.
In this post, we apply a particular type of Multi-Armed Bandit algorithm called Thompson Sampling to solve the problem. The solution is implemented on Spark and available Continue reading
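For readers new to Thompson Sampling, here is a minimal Beta-Bernoulli sketch in plain Python rather than Spark. It assumes each arm is a discount level and the reward is simply whether a unit sold. Those modeling choices, the class name `BetaThompsonSampler` and the sale probabilities are all hypothetical, and the post's actual formulation may differ.

```python
import random


class BetaThompsonSampler:
    """Thompson Sampling for arms with binary (success or failure) rewards."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per arm. Beta(1, 1) is a uniform prior.
        self.stats = {arm: [1, 1] for arm in arms}

    def select(self):
        # Draw one plausible success rate per arm from its posterior
        # and play the arm with the highest draw.
        draws = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, success):
        # A success increments alpha, a failure increments beta.
        if success:
            self.stats[arm][0] += 1
        else:
            self.stats[arm][1] += 1


# Hypothetical sale probability per discount level, for simulation only.
true_sale_rate = {"0%": 0.2, "10%": 0.5, "25%": 0.8}
sampler = BetaThompsonSampler(true_sale_rate, seed=42)
pulls = {arm: 0 for arm in true_sale_rate}
simulator = random.Random(7)
for _ in range(1000):
    arm = sampler.select()
    pulls[arm] += 1
    sampler.update(arm, simulator.random() < true_sale_rate[arm])
print(pulls)
```

Sampling from the posterior, instead of always picking the arm with the best average so far, is what lets the algorithm keep exploring uncertain arms while gradually concentrating on the ones that perform well.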
In the life of a Data Scientist, it’s not uncommon to run into a data set about which you have little or no knowledge. You may want to learn about such data, which lacks metadata, with the help of tools instead of going through the tedious process of manually perusing the data and trying to make sense of it.
In this post we will go through a Spark based implementation that automatically discovers the data types of the various fields in a data set. The implementation is available in my OSS project chombo.
Data type discovery is only one of the ways Continue reading
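As a rough illustration of what data type discovery involves, the sketch below guesses a type for each column by attempting progressively looser parses of every value. It is plain Python, not the Spark implementation in chombo, it handles only integers, floats and strings, and the function names are hypothetical.

```python
def infer_value_type(value):
    """Classify one raw string as 'empty', 'int', 'float' or 'string'."""
    text = value.strip()
    if text == "":
        return "empty"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        # Note: Python also accepts strings such as 'nan' and 'inf' here.
        float(text)
        return "float"
    except ValueError:
        return "string"


def infer_column_types(rows):
    """Return the most general type seen in each column of a list of rows."""
    generality = {"empty": 0, "int": 1, "float": 2, "string": 3}
    column_types = []
    for column in zip(*rows):
        value_types = [infer_value_type(value) for value in column]
        column_types.append(max(value_types, key=generality.get))
    return column_types


# Hypothetical records with no header or schema, for illustration only.
records = [
    ["1", "2.5", "abc"],
    ["2", "3", "x"],
    ["", "4.0", "y"],
]
print(infer_column_types(records))
```

A practical version would also need to recognize dates, booleans and identifiers, and would likely tolerate a small fraction of non-conforming values instead of widening the type on the first mismatch.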
Data normalization is a required data preparation step for many Machine Learning algorithms. These algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, these algorithms don’t behave correctly.
In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by a lack of normalization.
The Spark based implementation is available in my open source project chombo. Continue reading
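Two of the most common normalization techniques are min-max scaling and z-score standardization. The sketch below shows both in plain Python for a single attribute. It is not the chombo implementation, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values into the range [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant column carries no information, so map it to low.
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


def z_score_scale(values):
    """Center values on zero mean and scale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values, for illustration only.
incomes = [10, 20, 30]
print(min_max_scale(incomes))
print(z_score_scale(incomes))
```

Min-max scaling guarantees a bounded range but is sensitive to outliers, while z-score scaling is less distorted by a single extreme value but produces unbounded output.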
If you work with data, there is a high probability that you have run into duplicate data in your data set. Removing duplicates in Big Data is a computationally intensive process, and parallel cluster processing with Hadoop or Spark becomes a necessity. In this post we will focus on deduplication based on exact match, whether on the whole record or on a set of specified key fields. Deduplication can also be performed based on fuzzy matching. We will address deduplication for flat, record oriented data only.
The Spark based implementation is available Continue reading
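The exact match logic is easy to see on a single machine. The sketch below keeps the first record seen for each key, where the key is either the whole record or a chosen set of field positions. It is plain Python, not the Spark implementation, and `deduplicate` and the sample records are hypothetical.

```python
def deduplicate(records, key_fields=None):
    """Drop exact duplicates, keeping the first record seen for each key.

    records is a list of field lists. key_fields is an optional list of
    field positions. When it is None the whole record is the key.
    """
    seen = set()
    unique = []
    for record in records:
        if key_fields is None:
            key = tuple(record)
        else:
            key = tuple(record[i] for i in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical customer records, for illustration only.
customers = [
    ["c1", "Ann", "NY"],
    ["c2", "Bob", "CA"],
    ["c1", "Ann", "NY"],
    ["c1", "Ann", "TX"],
]
print(deduplicate(customers))
print(deduplicate(customers, key_fields=[0]))
```

On a cluster the same idea is usually expressed by grouping records on the key and keeping one record per group, since a single in-memory set of seen keys does not scale to Big Data.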