Category Archives: Big Data

Elastic Search or Solr Search Result Quality Evaluation with NDCG on Spark

You have built an enterprise search engine with Elastic Search or Solr. You have tweaked all the knobs in the search engine to get the best possible quality for the search results. But how do you know how well your … Continue reading

Posted in Big Data, Data Science, elastic search, Log Analysis, Scala, Search Analytic, Solr, Spark
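
For readers unfamiliar with the metric in the title, here is a minimal plain-Python sketch of Normalized Discounted Cumulative Gain using the common log2 rank discount. It is not the post's Spark implementation, which this excerpt does not show. The function names and the graded relevance lists are hypothetical, chosen only for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the ranking as returned, divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    # A result list with no relevant documents gets a score of 0 by convention here.
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades, in the order the engine returned the documents.
perfect_order = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

A score of 1.0 means the engine returned documents in the best possible order for the given grades; lower scores mean relevant documents were ranked too low.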

Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components in most ETL processes. It is well known that in most data projects, more than 50% of the time is spent in data preprocessing. In my earlier blog, a Hadoop based … Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark

Normal Distribution Fitness Test with Chi Square on Spark

Many Machine Learning models are based on certain assumptions made about the data. For example, in z-score based anomaly detection, it is assumed that the data has a normal distribution. Your Machine Learning model will be as good as how those … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark, Statistics
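
As a rough illustration of the idea in the title, the sketch below computes a chi-square goodness-of-fit statistic against a normal distribution fitted to the data. It is a plain-Python sketch under stated assumptions, not the post's Spark implementation: the function name, the choice of ten equal-probability bins, and the 5% significance level are all assumptions made here.

```python
import bisect
import math
from statistics import NormalDist


def chi_square_normal_stat(data, num_bins=10):
    """Chi-square statistic of data against a normal fitted by sample mean and std.

    Bins are equal-probability intervals of the fitted normal, so every bin
    has the same expected count. Assumes the data is not constant.
    """
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fitted = NormalDist(mean, std)
    # Interior bin edges at the 1/num_bins, 2/num_bins, ... quantiles.
    edges = [fitted.inv_cdf(i / num_bins) for i in range(1, num_bins)]
    observed = [0] * num_bins
    for x in data:
        observed[bisect.bisect_right(edges, x)] += 1
    expected = n / num_bins
    return sum((o - expected) ** 2 / expected for o in observed)


# With 10 bins and 2 estimated parameters there are 10 - 1 - 2 = 7 degrees
# of freedom; 14.067 is the chi-square critical value for 7 degrees of
# freedom at the 5% level.
CRITICAL_VALUE_DF7_ALPHA_05 = 14.067
```

A statistic above the critical value suggests rejecting the normality assumption at the chosen level, in which case a z-score based detector may not be a good fit for the data.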

Time Series Seasonal Cycle Detection with Auto Correlation on Spark

There are many benefits of auto correlation analysis on time series data, as we will discuss in detail later. It allows us to gain important insights into the nature of the time series data. Cycle detection is one … Continue reading

Posted in Big Data, Correlation, Spark, Statistics, Time Series Analytic | 1 Comment
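
To make the cycle detection idea concrete, here is a minimal plain-Python sketch that picks the lag with the highest sample auto correlation as the candidate cycle length. It is only an illustration, not the post's Spark implementation; the function names, the `min_lag` parameter, and the synthetic sine series are assumptions introduced here.

```python
import math


def autocorrelation(series, lag):
    """Sample auto correlation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series carries no cycle information
    covariance = sum((series[i] - mean) * (series[i + lag] - mean)
                     for i in range(n - lag))
    return covariance / variance


def dominant_cycle(series, max_lag, min_lag=2):
    """Lag in [min_lag, max_lag] with the highest auto correlation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(series, lag))


# Hypothetical series with a cycle of 12 time steps.
series = [math.sin(2 * math.pi * t / 12) for t in range(240)]
cycle = dominant_cycle(series, max_lag=30)
```

Taking the plain maximum is the simplest choice. On noisy data it may be more robust to look for the first clear peak or to require the correlation to exceed a threshold.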

Bulk Mutation in an Integration Data Lake with Spark

Data lakes act as repositories of data from various sources, possibly in different formats. They can be used to build data warehouses or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File … Continue reading

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark

Learning Alarm Threshold from User Feedback using Decision Tree on Spark

Alarm fatigue is a phenomenon where someone exposed to a large number of alarms becomes desensitized to them and starts ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Outlier Detection, Spark

Contextual Outlier Detection with Statistical Modeling on Spark

Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM, … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark | 1 Comment