Data validation is an essential component in any ETL data pipeline. As we all know most Data Engineers and Scientist spend most of their time cleaning and preparing their data before they can even get to the core processing of the data.
In this post we will go over a pluggable rule driven data validation solution implemented on Spark. Earlier I had posted about the same solution implemented on Hadoop. This post can be considered as a sequel to the earlier post. The solution is available Continue reading
Query expansion is a process of reformulating a query to improve query results and to be more specific to improve the recall for a query. Topic modeling is an Natural Language Processing (NLP) technique to discover hidden topics or concepts in documents. We will be going through a Query Expansion technique based on Topic Modeling.
The solution is based on Latent Dirichlet Allocation (LDA) algorithm as implemented python gensim library. LDA is a Continue reading
Categorical feature variables is a thorny issue for many supervised Machine Learning algorithms. Many learning algorithms can not handle categorical feature variables. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with Spark. It’s a recent algorithm and popular in Kaggle. This algorithm is particularly useful for high cardinality categorical features.
The Spark implementation of the encoding algorithms can be found Continue reading
This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in my open source project avenir.
The auto training logic as used here is independent of any particular supervised learning algorithm and applicable for any learning algorithm.
The frame work around ScikitLearn, used here facilitates building predictive models without having to write python code. I will be adding other supervised learning algorithms to this framework Continue reading
Sales leads are are generally managed and nurtured in CRM systems. It will be nice if we could predict the likelihood of any lead converting to an actual deal. This could be very beneficial in many ways e.g. proactively providing special care for weak leads and for projecting future revenue .
In this post we will go over a predictive modeling solution built on Python ScikitLearn Machine Learning Library. We will be using Gradient Boosted Tree(GBT) which is a powerful and popular supervised learning algorithm.
During the course of this blog, we will also see Continue reading
Categorical features variables i.e. features variables with fixed set of unique values appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms are Logistic Regression, Support Vector Machine (SVM) and any Regression algorithm.
In this post we will go over a Spark based solution to alleviate the problem. The solution implementation can be found in Continue reading
For retailers, stocking perishable products is a risky business. If a product doesn’t sell completely by the expiry date, then the remaining inventory has to be discarded and loss be taken for those items. Retailers will do whatever is necessary to avert such a situation i.e being stuck with unsold items for a perishable product.
In this post, we apply a particular type of Multi Arm Bandit algorithm called Thompson Sampling to solve the problem. The solution is implemented on Spark and available Continue reading