Author Archives: Pranab

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay Area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NoSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed-form solution.

Synthetic Training Data Generation for Machine Learning Classification Problems using Ancestral Sampling

Access to a good training data set is a serious impediment to building supervised Machine Learning models. Such data is scarce and, when available, the quality of the data set may be questionable. Even if a good quality data set is available, … Continue reading

Posted in Python, Statistics, Supervised Learning

Elastic Search or Solr Search Result Quality Evaluation with NDCG Metric on Spark

You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your … Continue reading

Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark

Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components in most ETL processes. It is well known that in most data projects, more than 50% of the time is spent in data preprocessing. In my earlier blog, a Hadoop based … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark

Supervised Machine Learning Parameter Search and Tuning with Simulated Annealing

The most challenging phase in a supervised Machine Learning pipeline is parameter tuning. There are many parameters, each with a range of values. The so-called grid search is a brute force approach that tries all possible combinations of values for the … Continue reading

Posted in Machine Learning, Python, ScikitLearn, Supervised Learning

Normal Distribution Fitness Test with Chi Square on Spark

Many Machine Learning models are based on certain assumptions made about the data. For example, in ZScore based anomaly detection, it is assumed that the data has a normal distribution. Your Machine Learning model will be as good as how those … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark, Statistics

Time Series Seasonal Cycle Detection with Auto Correlation on Spark

There are many benefits of auto correlation analysis on time series data, as we will discuss in detail later. It allows us to gain important insights into the nature of the time series data. Cycle detection is one … Continue reading

Posted in Big Data, Correlation, Spark, Statistics, Time Series Analytic

Bulk Mutation in an Integration Data Lake with Spark

Data lakes act as a repository of data from various sources, possibly of different formats. They can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File … Continue reading

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark