You might be getting ready to build a time series forecasting model using a state-of-the-art LSTM network. Before you proceed, you may want to pause and ask yourself whether your time series is inherently predictable at all, i.e., whether it is even possible to build a forecasting model.
In this post, we will learn how to calculate a metric called the Kaboudan metric, which indicates how predictable a time series is. The Python implementation is available in my open source project avenir. Any forecasting model can be used to calculate this metric; I have used Prophet from Facebook.
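To make the idea concrete, here is a minimal sketch of such a predictability score computed with Prophet, assuming a metric of the form 1 − SSE(original) / SSE(shuffled) with block shuffling; the actual avenir implementation may differ in details such as the shuffling scheme and error measure, and the file name, horizon and daily frequency below are hypothetical.

```python
# Minimal sketch of a Kaboudan-style predictability score using Prophet.
# Assumes eta = 1 - SSE(original) / SSE(block-shuffled); values near 1 mean
# predictable, values near 0 mean effectively unpredictable.
import numpy as np
import pandas as pd
from prophet import Prophet   # older installs: from fbprophet import Prophet


def forecast_sse(df, horizon):
    """Fit Prophet on all but the last `horizon` points and return the
    sum of squared out-of-sample forecast errors (assumes daily data)."""
    train, test = df.iloc[:-horizon], df.iloc[-horizon:]
    model = Prophet()
    model.fit(train)
    future = model.make_future_dataframe(periods=horizon)
    forecast = model.predict(future).iloc[-horizon:]
    return float(np.sum((test["y"].values - forecast["yhat"].values) ** 2))


def block_shuffle(values, block_size, seed=42):
    """Shuffle the series in blocks to destroy its temporal structure."""
    rng = np.random.default_rng(seed)
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    rng.shuffle(blocks)
    return np.concatenate(blocks)


def kaboudan_metric(df, horizon=30, block_size=10):
    sse_orig = forecast_sse(df, horizon)
    shuffled = df.copy()
    shuffled["y"] = block_shuffle(df["y"].values, block_size)
    sse_shuf = forecast_sse(shuffled, horizon)
    return max(0.0, 1.0 - sse_orig / sse_shuf)


# df must have Prophet's expected columns: 'ds' (dates) and 'y' (values)
# df = pd.read_csv("sales.csv", parse_dates=["ds"])
# print(kaboudan_metric(df))
```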
Data quality is critical for the healthy operation of any data driven enterprise. There are various kinds of data quality metrics. In this post, the focus will be on the completeness of data. Data quality from a completeness point of view will be expressed as metrics at different levels of granularity. Additionally, data completeness will be assessed within the context of a consuming process and is therefore contextual.
The Spark implementation is available in my open source project chombo. The implementation is agnostic to the specific data set; it is configuration and metadata driven.
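As an illustration of the kind of metrics involved, the PySpark sketch below computes column level and record level completeness for a required set of columns. This is only an illustration, not the chombo implementation; the file path and column names are hypothetical.

```python
# Minimal PySpark sketch of column- and record-level completeness metrics.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("completeness").getOrCreate()
df = spark.read.csv("customer.csv", header=True)

# columns required by the consuming process (this is what makes it contextual)
required_cols = ["cust_id", "income", "zip_code"]

total = df.count()

# column-level completeness: fraction of non-null, non-empty values per column
col_completeness = df.select([
    (F.count(F.when(F.col(c).isNotNull() & (F.trim(F.col(c)) != ""), c)) / total)
    .alias(c) for c in required_cols
])
col_completeness.show()

# record-level completeness: fraction of rows with all required fields present
complete_rows = df.na.drop(subset=required_cols).count()
print("record level completeness:", complete_rows / total)
```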
Machine learning model interpretability is the degree to which a human can comprehend the reasons behind a prediction made by a model. Interpretability may be required for various reasons, e.g. meeting compliance requirements or gaining insight in high stakes situations such as medical diagnosis. In this post we will show how to use the lime Python library to interpret a Random Forest based loan approval machine learning model.
The implementation is available in my open source project avenir. To make it easier, a wrapper class around lime has been created, so that lime can be used without any Python coding, by defining all relevant parameters in a properties configuration file.
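The sketch below shows roughly what happens under the hood, calling lime directly against a Random Forest classifier rather than going through the avenir wrapper; the loan feature names and the synthetic training data are hypothetical stand-ins for the real data set.

```python
# Minimal sketch of explaining a Random Forest loan approval model with lime.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["income", "loan_amount", "credit_score", "debt_ratio"]
class_names = ["denied", "approved"]

# stand-in training data; in practice load the real loan data set
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, len(feature_names)))
y_train = (X_train[:, 0] + X_train[:, 2] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=class_names,
    mode="classification",
)

# explain one loan application: which features pushed it toward approval or denial
instance = X_train[0]
explanation = explainer.explain_instance(instance, model.predict_proba, num_features=4)
for feature, weight in explanation.as_list():
    print(feature, weight)
```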
The most challenging part of building a supervised machine learning model is the optimization over algorithm selection, feature selection and algorithm specific hyperparameter values that yields the best performing model. Undertaking such a task manually is not feasible unless the model is very simple.
The purpose of Automated Machine Learning (AutoML) tools is to democratize machine learning by automating this optimization process. In this post we will use one such AutoML tool called Hyperopt along with scikit-learn, and show how to choose the optimum scikit-learn classification algorithm, feature subset and associated hyperparameters for the algorithm. The solution is available in my open source project avenir on GitHub.
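The sketch below illustrates the core idea of joint algorithm and hyperparameter selection with Hyperopt; the search space is deliberately tiny (two algorithms, a few hyperparameters, synthetic data) compared to the avenir solution, which also searches over feature subsets.

```python
# Minimal sketch of AutoML-style search over algorithms and hyperparameters
# with Hyperopt and scikit-learn.
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# search space spanning two algorithms and their hyperparameters
space = hp.choice("classifier", [
    {
        "algo": "random_forest",
        "n_estimators": hp.choice("n_estimators", [50, 100, 200]),
        "max_depth": hp.choice("max_depth", [3, 5, 10, None]),
    },
    {
        "algo": "svm",
        "C": hp.loguniform("C", np.log(0.01), np.log(100)),
        "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1)),
    },
])


def objective(params):
    if params["algo"] == "random_forest":
        model = RandomForestClassifier(
            n_estimators=params["n_estimators"], max_depth=params["max_depth"]
        )
    else:
        model = SVC(C=params["C"], gamma=params["gamma"])
    score = cross_val_score(model, X, y, cv=5).mean()
    # hyperopt minimizes, so return negative accuracy as the loss
    return {"loss": -score, "status": STATUS_OK}


trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```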
You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into your time series. You may also be interested in decomposition to separate out the remainder component for anomaly detection. We will allude to various real life use cases later in this post. We will go through a time series decomposition solution using the Seasonal and Trend decomposition using Loess (STL) algorithm, as implemented on Spark.
The Spark implementation is available in my open source project ruscello. The implementation is agnostic to any particular problem domain or data set because, as with my other Spark implementations, it is metadata and configuration driven. We will use eCommerce product sales data to showcase the STL implementation.
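For a single series, the decomposition itself can be illustrated with the STL implementation in statsmodels, as in the sketch below. This is only an illustration of what STL produces, not the ruscello Spark code; the file name, column names and weekly seasonal period are hypothetical.

```python
# Minimal single-series illustration of STL decomposition with statsmodels.
import pandas as pd
from statsmodels.tsa.seasonal import STL

# daily sales for one product, indexed by date
sales = pd.read_csv("product_sales.csv", parse_dates=["date"], index_col="date")["quantity"]

result = STL(sales, period=7, robust=True).fit()

decomposed = pd.DataFrame({
    "trend": result.trend,          # slowly varying level / trend
    "seasonal": result.seasonal,    # repeating weekly pattern
    "remainder": result.resid,      # residual; large values flag anomalies
})
print(decomposed.head())
```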
Missing values are a common problem in many real world data sets. There are various techniques for imputing missing values. We will use a kind of neural network called a Restricted Boltzmann Machine (RBM) for imputing missing values. RBMs are stochastic neural networks used for probabilistic graphical modeling. We will use a customer survey data set with missing income fields to show how to use an RBM to impute missing values.
The Python implementation is available in my open source project avenir on GitHub. It provides a user friendly wrapper around the RBM implementation in the scikit-learn Python ML library. It allows you to use RBM through appropriate settings in a properties configuration file. There is very little coding involved, except calls to the train and prediction APIs.
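The sketch below illustrates the underlying idea directly with scikit-learn's BernoulliRBM, using clamped Gibbs sampling to fill in a one-hot encoded income field; the stand-in data and column layout are hypothetical, and the avenir wrapper may realize the imputation differently.

```python
# Minimal sketch of RBM-based imputation with scikit-learn's BernoulliRBM.
# Assumes the survey fields have been one-hot encoded into binary columns.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X_complete = rng.integers(0, 2, size=(400, 12)).astype(float)  # stand-in one-hot rows
income_cols = np.array([8, 9, 10, 11])            # one-hot group for the income field
observed_cols = np.setdiff1d(np.arange(X_complete.shape[1]), income_cols)

# train the RBM on records that have no missing fields
rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=50, random_state=0)
rbm.fit(X_complete)


def impute_income(row, n_gibbs=200):
    """Clamped Gibbs sampling: keep observed bits fixed, resample the missing ones."""
    v = row.copy().reshape(1, -1)
    v[0, income_cols] = 0.5                       # neutral start for the unknown field
    votes = np.zeros(len(income_cols))
    for _ in range(n_gibbs):
        v = rbm.gibbs(v).astype(float)            # one Gibbs step over all visible units
        v[0, observed_cols] = row[observed_cols]  # clamp the observed fields
        votes += v[0, income_cols]
    return int(np.argmax(votes))                  # most frequently sampled income bucket


row_with_missing_income = X_complete[0]
print("imputed income bucket:", impute_income(row_with_missing_income))
```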
Categorical variables are ubiquitous in data. They pose a serious problem in many data science analysis processes. For example, many supervised machine learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like one hot encoding are not feasible.
In this post we will go through a technique called feature hashing for encoding high cardinality categorical variables, as implemented on Spark. We will showcase the solution with a use case from mobile advertisement.
You can find the Spark implementation in my open source GitHub project avenir. As with my other Spark implementations, the solution is metadata driven and agnostic of any specific application or data set.
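The sketch below shows the hashing trick itself using scikit-learn's FeatureHasher rather than the avenir Spark implementation; the mobile advertisement field names and values are hypothetical. Hash collisions are possible, but with enough hash buckets they usually have little effect on downstream model quality.

```python
# Minimal sketch of feature hashing for high-cardinality categorical variables.
from sklearn.feature_extraction import FeatureHasher

# each record: high-cardinality categorical fields encoded as "name=value" strings
records = [
    ["device_model=SM-G950", "site_id=85f751fd", "app_id=ecad2386"],
    ["device_model=iPhone9", "site_id=1fbe01fe", "app_id=92f5800b"],
]

# hash every categorical value into a fixed-size vector, regardless of cardinality
hasher = FeatureHasher(n_features=2 ** 12, input_type="string")
X = hasher.transform(records)

print(X.shape)   # (2, 4096) sparse matrix, ready for a downstream ML model
```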