Author Archives: Pranab

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay Area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early-stage startups, large corporations, and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green, sustainable living. My technical interest areas are Big Data, Distributed Processing, NoSQL databases, Machine Learning, and Programming Languages. I am fascinated by problems that don't have neat closed-form solutions.

Time Series Seasonal Cycle Detection with Auto Correlation on Spark

There are many benefits of auto correlation analysis on time series data, as we will discuss in detail later. It gives us important insights into the nature of the time series data. Cycle detection is one … Continue reading

Posted in Big Data, Correlation, Spark, Statistics, Time Series Analytic

Bulk Mutation in an Integration Data Lake with Spark

Data lakes act as a repository for data from various sources, possibly in different formats. They can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File … Continue reading

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark

Learning Alarm Threshold from User Feedback using Decision Tree on Spark

Alarm fatigue is a phenomenon in which someone exposed to a large number of alarms becomes desensitized to them and starts ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Outlier Detection, Spark

Contextual Outlier Detection with Statistical Modeling on Spark

Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM, … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark

Pluggable Rule Driven Data Validation with Spark

Data validation is an essential component of any ETL data pipeline. As we all know, most data engineers and data scientists spend most of their time cleaning and preparing their data before they can even get to the core processing of … Continue reading

Posted in Big Data, Data Science, ETL, Spark

Improving Elastic Search Query Result with Query Expansion using Topic Modeling

Query expansion is the process of reformulating a query to improve its results and, more specifically, to improve its recall. Topic modeling is a Natural Language Processing (NLP) technique to discover hidden topics or concepts … Continue reading

Posted in elastic search, NLP, Python, Solr, Text Analytic, Text Mining, Topic Modeling

Leave One Out Encoding for Categorical Feature Variables on Spark

Categorical feature variables are a thorny issue for many supervised Machine Learning algorithms. Many learning algorithms cannot handle categorical feature variables. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with … Continue reading

Posted in Big Data, Data Science, ETL, Spark