Lack of access to a good training data set is a serious impediment to building supervised Machine Learning models. Such data is scarce and, when available, its quality may be questionable. Even if a good quality data set is available, you may prefer to use synthetic data for various reasons, which we will allude to later.
In this post we will go through an Ancestral Sampling based solution for generating synthetic training data. The implementation can easily be adapted for other classification problems. The ancestral sampling Python implementation, along with sample code showing how to use it, is available Continue reading
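As a minimal sketch of the idea (not the project's actual API), ancestral sampling for classification data draws the class label from its prior first, then draws each feature from a distribution conditioned on that label. The prior and conditional distributions below are invented purely for illustration.

```python
import random

def ancestral_sample(prior, cond_dists, n, seed=42):
    """Generate n synthetic (features, label) records by ancestral sampling.

    prior: {label: probability}
    cond_dists: {label: {feature: {value: probability}}}
    """
    rng = random.Random(seed)
    labels = list(prior)
    records = []
    for _ in range(n):
        # Step 1: sample the root node (the class label) from its prior.
        label = rng.choices(labels, weights=[prior[l] for l in labels])[0]
        # Step 2: sample each feature conditioned on the sampled label.
        rec = {}
        for feat, dist in cond_dists[label].items():
            vals = list(dist)
            rec[feat] = rng.choices(vals, weights=[dist[v] for v in vals])[0]
        records.append((rec, label))
    return records

# Hypothetical loan-approval style distributions, for illustration only.
prior = {"approved": 0.7, "denied": 0.3}
cond = {
    "approved": {"income": {"high": 0.6, "low": 0.4}},
    "denied":   {"income": {"high": 0.2, "low": 0.8}},
}
data = ancestral_sample(prior, cond, 1000)
```

With a large enough sample, the label frequencies in the synthetic data approach the configured prior, which is what makes the generated set usable as a stand-in training set.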
You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your search engine results are satisfying the users and meeting their information needs?
If you have access to relevance feedback data from users, various search result relevance metrics can be calculated. In this post the focus is on computing a metric called Normalized Discounted Cumulative Gain (NDCG) on Spark.
Two kinds of data are necessary to compute NDCG: 1) search engine results as queries are executed and 2) relevance feedback from users who interact with those results. The Spark implementation of NDCG is available Continue reading
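The core computation can be sketched in a few lines (this is a plain Python illustration of the metric itself, not the Spark implementation): gains from graded relevance feedback are discounted by the log of the rank, then normalized by the ideal ordering.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 2)  # rank is 0-based, hence +2
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """NDCG: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance feedback for one query's top 4 results.
scores = [3, 2, 3, 0]
```

A perfectly ordered result list scores 1.0; the closer the observed ordering is to ideal, the closer the metric gets to 1.0.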
Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark
Tagged enterprise search, NCDG, relevance feedback, search performance
Data transformation is one of the key components of most ETL processes. It is well known that in most data projects more than 50% of the time is spent on data pre-processing. In my earlier blog, a Hadoop based data transformation solution with a plugin framework was discussed exhaustively.
This is a companion article, where we will go through a data transformation implementation on Spark. The Spark implementation is part of my open source Continue reading
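The plugin idea can be sketched in plain Python (the actual project's framework and APIs differ; the registry, transformer names, and config below are all made up for illustration): transformers are registered by name, and a per-column config chains them over each row.

```python
# Hypothetical plugin registry for field-level transformers.
TRANSFORMERS = {}

def transformer(name):
    """Decorator that registers a field-level transformer under a name."""
    def register(fn):
        TRANSFORMERS[name] = fn
        return fn
    return register

@transformer("trim")
def trim(value):
    # Strip surrounding whitespace from a field.
    return value.strip()

@transformer("upper")
def to_upper(value):
    # Normalize a field to upper case.
    return value.upper()

def transform_row(row, spec):
    """Apply the configured transformer chain to each column of a row."""
    out = []
    for i, value in enumerate(row):
        for name in spec.get(i, []):  # spec: column index -> transformer names
            value = TRANSFORMERS[name](value)
        out.append(value)
    return out

# Hypothetical config: column 0 is trimmed then upper cased, column 1 upper cased.
row = ["  alice ", "nyc"]
spec = {0: ["trim", "upper"], 1: ["upper"]}
result = transform_row(row, spec)
```

In a Spark job, `transform_row` would simply be applied inside a `map` over the input RDD or DataFrame rows, with the spec broadcast to the workers.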
The most challenging phase in a supervised Machine Learning pipeline is parameter tuning. There are many parameters, each with a range of values. The so called grid search is a brute force approach that tries all possible combinations of parameter values, looking for the combination that gives the smallest test error.
Most supervised Machine Learning algorithms involve a significant number of parameters, so grid search is not practical. To put things in perspective, if there are 10 parameters and each can take on 5 possible values, there will be 5^10 (almost 10 million) possible combinations to try.
In this post we demonstrate that with a stochastic optimization technique called Simulated Annealing, a near optimal solution can be found in significantly fewer iterations. The implementation Continue reading
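A minimal sketch of simulated annealing over a discrete parameter grid looks like this (the parameter grid and error surface below are toy stand-ins, not the post's actual tuning problem): a random neighbor is proposed each step, improvements are always accepted, and worse moves are accepted with a probability that shrinks as the temperature cools.

```python
import math
import random

def anneal(params, error_fn, steps=2000, t0=1.0, cooling=0.995, seed=7):
    """Simulated annealing over a discrete parameter grid.

    params: {name: [candidate values]}; error_fn maps a config dict to a loss.
    """
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in params.items()}
    cur_err = error_fn(current)
    best, best_err = dict(current), cur_err
    temp = t0
    for _ in range(steps):
        # Propose a neighbor: change one randomly chosen parameter.
        cand = dict(current)
        k = rng.choice(list(params))
        cand[k] = rng.choice(params[k])
        err = error_fn(cand)
        # Always accept improvements; accept worse moves with probability
        # exp(-delta / temp), which shrinks as the temperature cools.
        if err < cur_err or rng.random() < math.exp((cur_err - err) / temp):
            current, cur_err = cand, err
        if cur_err < best_err:
            best, best_err = dict(current), cur_err
        temp *= cooling
    return best, best_err

# Toy error surface standing in for test error; minimum at depth=3, lr=0.1.
grid = {"depth": [1, 2, 3, 4, 5], "lr": [0.01, 0.1, 0.5, 1.0]}
err = lambda c: (c["depth"] - 3) ** 2 + (c["lr"] - 0.1) ** 2
best, best_err = anneal(grid, err)
```

The key contrast with grid search: the number of `error_fn` evaluations is fixed by `steps`, not by the size of the combinatorial grid, which is what makes the approach viable for 10 parameters with 5 values each.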
Many Machine Learning models are based on certain assumptions about the data. For example, Z-Score based anomaly detection assumes that the data has a normal distribution. Your Machine Learning model will only be as good as how well those assumptions hold. In this post, we will go over a Spark based implementation of the Chi Square test for checking an assumed distribution of a data set.
The implementation is available in my open source project chombo on github. Like all my other Spark projects, Continue reading
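The statistic at the heart of the test is simple to sketch in plain Python (this is not the chombo implementation; the bin counts below are invented): observed counts per bin are compared against the counts expected under the assumed distribution.

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square goodness-of-fit statistic over binned counts."""
    assert len(observed) == len(expected)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical binned data: observed counts vs. counts expected if the
# data followed the assumed distribution (numbers invented for illustration).
observed = [18, 55, 52, 25]
expected = [20, 50, 50, 30]
stat = chi_square_stat(observed, expected)
# The statistic is compared to the chi-square critical value for
# len(observed) - 1 = 3 degrees of freedom at the chosen significance level
# (about 7.81 at the 0.05 level), so these counts would not reject the
# assumed distribution.
```

On Spark, the per-bin terms are naturally computed in parallel and summed with a reduce.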
There are many benefits to autocorrelation analysis of time series data, as we will discuss in detail later. It allows us to gain important insights into the nature of the time series. Cycle detection is one of them. To put things in context, we will use cycle detection on energy usage time series data as an example to demonstrate the usefulness of autocorrelation.
The Spark implementation is available in my open source project ruscello on github. This project has Continue reading
Data lakes act as a repository for data from various sources, possibly in different formats. A data lake can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of the Hadoop Distributed File System (HDFS), which is append only. HDFS is essentially a WORM file system, i.e. Write Once, Read Many times.
In an integration scenario, however, your source data streams may contain updates and deletes. This post is about performing updates and deletes in an HDFS backed data lake. The Spark based solution is available Continue reading
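Since HDFS files cannot be edited in place, one common approach is to merge the existing snapshot with the incoming changes and write out a fresh snapshot. A minimal sketch of that merge in plain Python (not the post's Spark solution; keys, ops, and records below are invented):

```python
def merge(base, changes):
    """Merge a base snapshot with incremental changes into a new snapshot.

    base: {key: record}; changes: list of (op, key, record) with op in
    {"upsert", "delete"}. Because HDFS is append only, the merged result
    is written out as a new snapshot rather than edited in place.
    """
    merged = dict(base)
    for op, key, record in changes:
        if op == "delete":
            merged.pop(key, None)
        else:  # upsert: insert a new key or overwrite the existing record
            merged[key] = record
    return merged

# Hypothetical base snapshot and change stream.
base = {"u1": {"city": "NYC"}, "u2": {"city": "LA"}}
changes = [("upsert", "u2", {"city": "SF"}),
           ("delete", "u1", None),
           ("upsert", "u3", {"city": "SEA"})]
snapshot = merge(base, changes)
```

On Spark, the same effect is typically achieved by keying both data sets on the record key, joining or cogrouping them, and letting the change record win, then writing the result to a new HDFS directory.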