Machine learning model interpretability is the degree to which a human can comprehend the reasons behind a prediction made by a model. Interpretability may be required for various reasons, e.g. meeting compliance requirements or gaining insight in high-stakes situations such as medical diagnosis. In this post we will show how to use the lime Python library to interpret a Random Forest based loan approval predictive model.
The implementation is available in my open source project avenir. To make it easier, a wrapper class around lime has been created, so that lime can be used without any Python coding, by defining all relevant parameters in a properties configuration file.
The most challenging part of building a supervised machine learning model is the optimization across algorithm selection, feature selection and algorithm-specific hyperparameter selection that yields the best performing model. Undertaking such a task manually is not feasible unless the model is very simple.
The purpose of Automated Machine Learning (AutoML) tools is to democratize Machine Learning by automating this optimization process. In this post we will use one such AutoML tool, Hyperopt, along with scikit-learn, and show how to choose the optimum scikit-learn classification algorithm, feature subset and associated hyperparameters for the algorithm. The solution is available in my open source project avenir on github.
You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into your time series. You may also be interested in decomposition as a way to separate out the remainder component for anomaly detection. We will touch on various real-life use cases later in this post. We will go through a time series decomposition solution based on the Seasonal and Trend decomposition using Loess (STL) algorithm, as implemented on Spark.
The Spark implementation is available in my open source project ruscello. The implementation is agnostic of any problem domain or data set because, as in my other Spark implementations, it is metadata and configuration driven. We will use eCommerce product sales data to showcase the STL implementation.
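To make the idea of trend, seasonal and remainder components concrete, here is a minimal sketch in plain Python. It is not STL and not the ruscello Spark implementation: it swaps in the simpler classical additive decomposition (centered moving average for the trend, per-phase averages for seasonality), with no Loess smoothing. The function name `decompose_additive`, the weekly pattern and the synthetic sales numbers are all made up for illustration.

```python
from statistics import mean


def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + remainder."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles an odd period")
    n, half = len(series), period // 2
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")

    # Trend: centered moving average over one full period.
    # It is undefined for the first and last `half` points.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = mean(series[i - half:i + half + 1])

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so that the seasonal component sums to zero over a period.
    by_phase = [[] for _ in range(period)]
    for i in range(half, n - half):
        by_phase[i % period].append(series[i] - trend[i])
    phase_mean = [mean(values) for values in by_phase]
    offset = mean(phase_mean)
    seasonal = [phase_mean[i % period] - offset for i in range(n)]

    # Remainder: whatever trend and seasonality do not explain.
    remainder = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic daily sales (made-up numbers): linear growth plus a weekly cycle.
weekly = [0, -2, -1, 1, 3, 4, -5]
sales = [100 + 0.5 * i + weekly[i % 7] for i in range(70)]
sales[30] += 50  # inject one unusual day

trend, seasonal, remainder = decompose_additive(sales, period=7)

# The day with the largest absolute remainder is the anomaly candidate.
spike_day = max(
    range(len(remainder)),
    key=lambda i: abs(remainder[i]) if remainder[i] is not None else -1.0,
)
print("largest remainder on day", spike_day)
```

The remainder is the part relevant for anomaly detection: once trend and seasonality are removed, an unusual day stands out far more clearly than it does in the raw series.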
Missing values are a common problem in many real-world data sets. There are various techniques for imputing missing values. We will use a kind of neural network called a Restricted Boltzmann Machine (RBM). RBMs are stochastic neural networks used for probabilistic graphical modeling. We will use a customer survey data set with missing income fields to show how to use an RBM to impute missing values.
The Python implementation is available in my open source project avenir on github. It provides a user-friendly wrapper around the RBM implementation in the scikit-learn Python ML library. It allows you to use an RBM through appropriate settings in a properties configuration file. There is very little coding involved, except to call the train and prediction API.
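As a rough illustration of how an RBM can fill in missing fields, the sketch below uses `BernoulliRBM` from scikit-learn directly, not the avenir wrapper. It assumes the data has already been encoded as binary features, trains on complete rows, and then runs Gibbs sampling on a row with a gap while clamping the observed entries. The helper `impute_with_rbm`, the toy data and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM


def impute_with_rbm(rbm, row, n_steps=50, seed=0):
    """Estimate missing (NaN) binary entries of one row by clamped Gibbs sampling."""
    rng = np.random.default_rng(seed)
    missing = np.isnan(row)
    # Start the missing entries from random bits; keep the observed ones.
    v = np.where(missing, rng.integers(0, 2, size=row.shape), row).astype(float)
    totals = np.zeros_like(v)
    for _ in range(n_steps):
        # One Gibbs step: sample hidden units, then resample visible units.
        v = rbm.gibbs(v.reshape(1, -1)).astype(float).ravel()
        v[~missing] = row[~missing]  # clamp the observed entries
        totals += v
    # Average of the samples approximates P(entry = 1 | observed entries).
    return np.where(missing, totals / n_steps, row)


# Toy training data (made up): two customer profiles with a little noise.
data_rng = np.random.default_rng(42)
profiles = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
X_train = profiles[data_rng.integers(0, 2, size=200)]
flip = data_rng.random(X_train.shape) < 0.05
X_train = np.abs(X_train - flip)

rbm = BernoulliRBM(n_components=4, learning_rate=0.05, n_iter=50, random_state=0)
rbm.fit(X_train)

row = np.array([1.0, 1.0, np.nan, 0.0, 0.0, 0.0])
filled = impute_with_rbm(rbm, row)
print(filled)
```

The imputed entry is a probability between 0 and 1; thresholding it, or taking the most probable category of a one-hot group, would give a hard value for a field such as an income bracket.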
Categorical variables are ubiquitous in data. They pose a serious problem in many Data Science analysis processes. For example, many supervised Machine Learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like One Hot Encoding are not feasible, because every distinct value adds one more column.
In this post we will go through a technique called Feature Hashing for encoding high cardinality categorical variables as implemented on Spark. We will showcase the solution with a use case from mobile advertisement.
You can find the Spark implementation in my open source github project avenir. As with my other Spark implementations, the solution is metadata driven and agnostic of any specific application or data set.
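The core of feature hashing fits in a few lines. The sketch below is plain Python, not the Spark implementation in avenir: each `field=value` pair is hashed to one of a fixed number of buckets, and a second bit of the hash picks a sign so that collisions tend to cancel rather than pile up. The function name `hash_features`, the bucket count and the sample ad record are hypothetical.

```python
import hashlib


def hash_features(record, num_buckets=16):
    """Encode categorical fields as a fixed-length vector using the hashing trick."""
    vector = [0.0] * num_buckets
    for field, value in record.items():
        # Include the field name so equal values in different fields differ.
        token = f"{field}={value}".encode("utf-8")
        # md5 is used only as a stable hash, not for security.
        digest = hashlib.md5(token).digest()
        index = int.from_bytes(digest[:8], "big") % num_buckets
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vector[index] += sign
    return vector


# Hypothetical mobile ad impression with high cardinality fields.
impression = {
    "app_id": "a91f",
    "device_model": "pixel-7",
    "publisher": "news-site-204",
}
encoded = hash_features(impression, num_buckets=16)
print(encoded)
```

The vector length depends only on `num_buckets`, not on how many distinct values each field has, which is what makes the technique usable where One Hot Encoding is not. The trade-off is that collisions lose some information, so the bucket count is a tuning parameter.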
There are many techniques for time series anomaly detection. In this post, the focus is on sequence-based anomaly detection of time series data with a Markov Chain. The technique will be illustrated with a use case involving data from a health monitoring device. Anomaly detection is critical for this kind of health monitoring data, since an anomaly may indicate a potentially harmful health condition.
The Spark implementation is available in my open source project beymani on github. The complete solution also uses my other open source projects avenir and chombo. As with all my other open source Spark implementations, it is agnostic of any specific application. Generous use of configuration and metadata makes that possible.
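The basic mechanics of Markov Chain based sequence anomaly detection can be sketched in plain Python. This is not the beymani Spark implementation: it assumes the readings have already been discretized into a few states, estimates state transition probabilities from normal data, and scores a window by its average negative log transition probability. The state labels, training sequences and function names are made up for illustration.

```python
import math


def fit_transitions(sequences, states, alpha=1.0):
    """Estimate transition probabilities with additive (Laplace) smoothing."""
    counts = {a: {b: alpha for b in states} for a in states}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in states:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in states}
    return probs


def anomaly_score(window, probs):
    """Average negative log probability of the transitions in a window."""
    pairs = list(zip(window, window[1:]))
    return -sum(math.log(probs[a][b]) for a, b in pairs) / len(pairs)


# Hypothetical discretized readings: L = low, N = normal, H = high.
states = ["L", "N", "H"]
normal_data = [
    "NNNNLNNNNHNNNNNNLNNN",
    "NNHNNNNNNLNNNNHNNNNN",
]
probs = fit_transitions(normal_data, states)

typical = anomaly_score("NNNNN", probs)
unusual = anomaly_score("LHLHL", probs)
print(f"typical window: {typical:.3f}, unusual window: {unusual:.3f}")
```

A window is flagged when its score exceeds a chosen threshold. The smoothing constant `alpha` keeps transitions that never occurred in training from producing an infinite score.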
In text summarization, we create a summary of the original content that is coherent and captures its salient points. Text summarization has various important uses. One we face almost every day is the text snippet shown in search engine results. That snippet is essentially a summary. Our decision whether to click on an item in the search results is largely driven by the title and the summary of the content.
In this post we will go through 6 unsupervised extractive text summarization algorithms that have been implemented in Python and are part of my open source project avenir on github.
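To show what "unsupervised extractive" means in practice, here is a minimal sketch of one simple approach: score each sentence by the average frequency of its non-stop words and keep the top few in their original order. It is not taken from avenir and is not necessarily one of the six algorithms covered; the stop word list, function names and sample text are assumptions for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
    "of", "on", "or", "that", "the", "this", "to", "we", "with",
}


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences=2):
    """Return the highest scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]


document = (
    "Search engines show a short snippet under every result. "
    "The snippet is a summary of the page. "
    "A good summary helps people decide which result to open. "
    "Some pages also load slowly."
)
summary = summarize(document, num_sentences=2)
print(" ".join(summary))
```

Because the sentences are copied verbatim from the source, the method is extractive, and because the scores come only from word counts in the document itself, no labeled training data is needed.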