Building a SciKitLearn Random Forest Model and Tuning Parameters without Writing Python Code

Random Forest is a supervised learning algorithm that can be used for classification and regression. In this article we go through the process of training a Random Forest model, including automatic parameter tuning, without writing any Python code. We will use patient medical data to predict heart disease as an example use case.

The implementation is available in the open source project avenir on GitHub. Extensive use of configuration parameters enables the end user to apply the solution without writing Python code.
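For readers who prefer to see the underlying idea directly in code, here is a minimal sketch of a Random Forest with grid-searched parameters in Scikitlearn. The file name heart.csv and the disease label column are hypothetical; the avenir solution wraps this kind of logic behind configuration instead.

```python
# Minimal sketch: Random Forest with grid-searched hyperparameters.
# File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("heart.csv")                          # hypothetical patient data set
X, y = df.drop(columns=["disease"]), df["disease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```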

Continue reading
Posted in Data Science, Machine Learning, Python, ScikitLearn

Model Drift Detection with Kolmogorov-Smirnov Statistic on Spark

In retail, you may be using various business solutions based on product demand data, e.g. inventory management or tracking how a newly introduced product performs over time. The buying behavior model may change with time, rendering those solutions ineffective. It may be necessary to periodically tune the system, i.e. detect any drift in the behavior model. If the drift is significant, appropriate changes should be made to the business solutions that depend on the demand distribution.

In this post we will find out how to use the Kolmogorov-Smirnov statistic (KS statistic) to measure product demand model drift. The Spark implementation of the KS statistic is available in my open source projects chombo and avenir. The implementation relies heavily on configuration and is application agnostic.
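As a small single-machine illustration of the statistic itself (the post's solution computes it at scale on Spark), the sketch below compares a synthetic baseline demand sample with a recent one using scipy; the distributions and significance threshold here are made up.

```python
# Two-sample KS statistic as a simple drift check between a reference demand
# window and a recent demand window. All data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline_demand = rng.gamma(shape=2.0, scale=50.0, size=5000)   # reference window
current_demand = rng.gamma(shape=2.5, scale=55.0, size=5000)    # recent window

stat, p_value = ks_2samp(baseline_demand, current_demand)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")

# A large statistic (small p-value) suggests the demand distribution has drifted
# enough that downstream solutions, e.g. inventory management, should be revisited.
if p_value < 0.05:
    print("significant drift detected")
```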

Continue reading
Posted in Data Science, Machine Learning, Spark, Statistics

Evaluation of Time Series Predictability with Kaboudan Metric using Prophet

You might be getting ready to build a time series forecasting model using a state of the art LSTM network. Before you proceed, you may want to pause and ask yourself whether your time series is inherently predictable at all, i.e. whether it’s even possible to build a forecast model.

In this post, we will learn how to calculate the Kaboudan metric, which indicates how predictable a time series is. The Python implementation is available in my open source project avenir. Any forecasting model can be used to calculate this metric; I have used Prophet from Facebook.
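For orientation, here is a rough single-machine sketch of the idea, assuming one common formulation of the Kaboudan metric (one minus the ratio of forecast error on the original series to forecast error on a block-shuffled copy) and a daily, gap-free series. The function names are illustrative, not the avenir API.

```python
# Sketch of the Kaboudan metric, assuming the formulation
#   eta = 1 - SSE(forecast of original series) / SSE(forecast of block-shuffled series)
# Values near 1 suggest a predictable series, values near 0 a series close to noise.
import numpy as np
import pandas as pd
from prophet import Prophet   # older installs: from fbprophet import Prophet

def forecast_sse(df, horizon):
    """Fit Prophet on all but the last `horizon` points and return the forecast SSE.
    Assumes a daily, gap-free series so forecast dates align with the held-out rows."""
    train, test = df.iloc[:-horizon], df.iloc[-horizon:]
    m = Prophet()
    m.fit(train)
    future = m.make_future_dataframe(periods=horizon)
    pred = m.predict(future)["yhat"].iloc[-horizon:].to_numpy()
    return float(np.sum((test["y"].to_numpy() - pred) ** 2))

def kaboudan_metric(df, horizon=30, block_size=10, seed=42):
    """df has Prophet's 'ds' and 'y' columns; the series is block-shuffled for the denominator."""
    rng = np.random.default_rng(seed)
    blocks = [df["y"].to_numpy()[i:i + block_size] for i in range(0, len(df), block_size)]
    rng.shuffle(blocks)
    shuffled = df.copy()
    shuffled["y"] = np.concatenate(blocks)[: len(df)]
    sse_orig = forecast_sse(df, horizon)
    sse_shuf = forecast_sse(shuffled, horizon)
    return max(0.0, 1.0 - sse_orig / sse_shuf)
```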

Continue reading
Posted in Python, Time Series Analytic

Contextual Data Completeness Metric Computation on Spark

Data quality is critical for the healthy operation of any data driven enterprise. There are various kinds of data quality metrics. In this post, the focus will be on the completeness of data. Data quality from a completeness point of view will be expressed as metrics at different levels of granularity. Additionally, data completeness will be assessed within the context of a consuming process, and hence it is contextual.

The Spark implementation is available in my open source project chombo. The implementation is agnostic to the specific data set; it is configuration and metadata driven.
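As a small taste of what a column-level completeness metric looks like, here is a minimal PySpark sketch. The chombo implementation is far more general; the file name and the empty-string rule below are just assumptions for illustration.

```python
# Minimal PySpark illustration of column-level completeness: fraction of values
# per column that are neither null nor empty strings. File path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("completeness").getOrCreate()
df = spark.read.option("header", True).csv("customer_orders.csv")

total = df.count()
completeness = df.select([
    (F.count(F.when(F.col(c).isNotNull() & (F.col(c) != ""), c)) / total).alias(c)
    for c in df.columns
])
completeness.show()

# In the contextual setting, only the columns required by a given consuming
# process would be included, possibly with per-column weights.
```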

Continue reading
Posted in Big Data, Data Science, ETL, Spark

Machine Learning Model Interpretation and Prescriptive Analytic with Lime

Machine learning model interpretability is the degree to which a human can comprehend the reasons behind the prediction made by a model. Interpretability may be required for various reasons, e.g. meeting compliance requirements or gaining insight in high stakes situations such as medical diagnosis. In this post we will show how to use the lime Python library to interpret a Random Forest based loan approval machine learning predictive model.

The implementation is available in my open source project avenir. To make it easier, a wrapper class around lime has been created, so that lime can be used without any Python coding by defining all relevant parameters in a properties configuration file.
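For comparison, this is roughly what direct use of the lime API looks like for a tabular Random Forest model; the wrapper in avenir hides these calls behind the properties file. The loan data and feature names below are synthetic and purely illustrative.

```python
# Explaining a single Random Forest prediction with lime on synthetic loan data.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["income", "credit_score", "loan_amount", "years_employed"]  # illustrative
X_train = rng.normal(size=(500, len(feature_names)))
y_train = (X_train[:, 1] + 0.5 * X_train[:, 0] > 0).astype(int)   # synthetic approve/deny labels

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["denied", "approved"],
    mode="classification",
)
explanation = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(explanation.as_list())   # per-feature contributions for this single prediction
```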

Continue reading
Posted in Data Science, Machine Learning, Python

Automated Machine Learning with Hyperopt and Scikitlearn without Writing Python Code

The most challenging part of building a supervised machine learning model is the optimization of algorithm selection, feature selection and algorithm specific hyperparameter value selection that yields the best performing model. Undertaking such a task manually is not feasible, unless the model is very simple.

The purpose of Automated Machine Learning (AutoML) tools is to democratize machine learning by automating this optimization process. In this post we will use one such AutoML tool called Hyperopt along with Scikitlearn and show how to choose the optimum Scikitlearn classification algorithm, feature subset and associated hyperparameters for the algorithm. The solution is available in my open source project avenir on GitHub.
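To make the idea concrete, here is a hand-rolled sketch of the kind of search the wrapper automates: a Hyperopt TPE search over both the choice of classifier and its hyperparameters, on synthetic data. The actual avenir solution also covers feature selection and many more algorithms; everything below is a simplified assumption.

```python
# Joint search over classifier choice and hyperparameters with Hyperopt.
import numpy as np
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

space = hp.choice("classifier", [
    {"type": "rf",
     "n_estimators": hp.choice("n_estimators", [100, 200, 500]),
     "max_depth": hp.choice("max_depth", [None, 5, 10])},
    {"type": "svc",
     "C": hp.loguniform("C", np.log(0.01), np.log(100.0))},
])

def objective(params):
    if params["type"] == "rf":
        model = RandomForestClassifier(n_estimators=params["n_estimators"],
                                       max_depth=params["max_depth"], random_state=42)
    else:
        model = SVC(C=params["C"], random_state=42)
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    return {"loss": -score, "status": STATUS_OK}   # Hyperopt minimizes, so negate accuracy

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print("best configuration (as choice indices):", best)
```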

Continue reading
Posted in Data Science, Machine Learning, Python, ScikitLearn, Supervised Learning

Time Series Trend and Seasonality Component Decomposition with STL on Spark

You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into it. You may also be interested in decomposition to separate out the remainder component for anomaly detection. We will allude to various real life use cases later in this post. We will go through a time series decomposition solution using the Seasonal and Trend decomposition using Loess (STL) algorithm as implemented on Spark.

The Spark implementation is available in my open source project ruscello. The implementation is agnostic to any problem domain or data set because, as in my other Spark implementations, it is metadata and configuration driven. We will use eCommerce product sales data to showcase the STL implementation.
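For readers who want to experiment locally before going to Spark, here is a small single-machine STL sketch using statsmodels on synthetic daily sales data; the ruscello solution applies the same decomposition at scale, and the series below is entirely made up.

```python
# Single-machine STL decomposition of a synthetic daily product-sales series.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=365, freq="D")
sales = (100 + 0.1 * np.arange(365)                       # slow upward trend
         + 15 * np.sin(2 * np.pi * np.arange(365) / 7)    # weekly seasonality
         + rng.normal(0, 5, 365))                         # remainder / noise
series = pd.Series(sales, index=dates)

result = STL(series, period=7).fit()
# result.trend, result.seasonal and result.resid hold the three components;
# unusually large residuals are candidates for anomaly detection.
print(result.resid.abs().nlargest(5))
```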

Continue reading
Posted in Anomaly Detection, Big Data, Data Science, ETL, Spark, Time Series Analytic