Missing Value Imputation with Restricted Boltzmann Machine Neural Network

Missing value is a common problem in many real world data set. There are various techniques for imputing missing values. We will use a kind of Neural Network called RBM for imputing missing values. Restricted Boltzmann Machine (RBM) are stochastic neural network used for probabilistic graphical modeling. We will use a customer survey data set with missing income fields to show how to use RBM to impute missing values.

The Python implementation is available in my open source project avenir on github. It provides a user friendly wrapper around RBM implementation in scikit Python ML library. It allow you to use RBM by appropriate settings in a property configuration files. There is very little coding involved except to call the train and prediction API.

Continue reading
Posted in Data Science, Deep Learning, ETL, Machine Learning, Python | Tagged , , , | Leave a comment

Encoding High Cardinality Categorical Variables with Feature Hashing on Spark

Categorical variables are ubiquitous in data. They pose a serious problem in many Data Science analysis processes. For example, many supervised Machine Learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like One Hot Encoding is not feasible.

In this post we will go through a technique called Feature Hashing for encoding high cardinality categorical variables as implemented on Spark. We will showcase the solution with a use case from mobile advertisement.

You can find the Spark implementation in my open source github project avenir. As with my other Spark implementation, the solution is meta data driven and agnostic of any specific application or data set.

Continue reading
Posted in Big Data, Data Science, ETL, Scala, Spark | Tagged , , | Leave a comment

Time Series Sequence Anomaly Detection with Markov Chain on Spark

There are many techniques for time series anomaly detection. In this post, the focus is on sequence based anomaly detection of time series data with Markov Chain. The technique will be elucidated with a use case involving data from a health monitoring device. Anomaly detection is critical for this kind of health monitoring data, since it may indicate potential harmful health condition.

The spark implementation is available in my open source project beymani on github. The complete solution also uses my other open source projects avenir and chombo. As with all my other open source Spark implementation, it is agnostic of any specific application. Generous use of configuration and meta data enables us to do that.

Continue reading
Posted in Anomaly Detection, Big Data, Data Science, Machine Learning, Outlier Detection, Scala, Spark | Tagged , , , , | Leave a comment

Six Unsupervised Extractive Text Summarization Techniques Side by Side

In text summarization, we create a summary of the original content that is coherent and captures the salient points in the original content. There are various important usages of text summarization. Something we face almost every day is the text snippet that is shown in the search engine results. That snippet is essentially a summary. Our decision of whether to click on an items in the search result is largely driven by the title and the summary of the content.

In this post we will go through 6 unsupervised extractive text summarization algorithms that have been implemented in Python and is part of my open source project avenir in github.

Continue reading
Posted in Data Science, NLP, Python, Text Analytic, Text Mining | Tagged , | Leave a comment

Synthetic Training Data Generation for Machine Learning Classification Problems using Ancestral Sampling

Access to good training data set is a serious impediment to building supervised Machine Learning models. Such data is scarce and when available, the quality of the data set may be questionable. Even if good quality data set is available, you may prefer to use synthetic data for various reasons, we will allude to later.

In this post we will go through an Ancestral Sampling based solution for generating synthetic training data. The implementation can easily be adopted for other classification problem. The ancestral sampling python implementation along with sample code on how to use it is available Continue reading

Posted in Python, Statistics, Supervised Learning | Tagged , , | Leave a comment

Elastic Search or Solr Search Result Quality Evaluation with NCDG Metric on Spark

You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your search engine results are satisfying the users and meeting their information needs.

If you have access to relevance feedback data from users, there are various search result relevance metrics that could be calculated. In this post the focus is on computing a metric called Normalized Cumulative Discounted Gain (NCDG) on Spark.

Two kinds of data are necessary to compute NCDG, 1)search engine result as queries are executed and 2)relevance feedback from users who interact with the search results. The Spark implementation of NCDG is available Continue reading

Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark | Tagged , , , | Leave a comment

Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components in most ETL process. It is well known, that in most data projects, more than 50% of the time in spent in data pre processing. In my earlier blog, a Hadoop based data transformation  solution with a plugin framework was discussed exhaustively.

This is a companion article , where we will go through a data transformation implementation on Spark. The Spark implementation is part of my open source Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | Tagged , | 1 Comment