Plugin Framework Based Data Transformation on Spark

Data transformation is one of the key components in most ETL process. It is well known, that in most data projects, more than 50% of the time in spent in data pre processing. In my earlier blog, a Hadoop based data transformation  solution with a plugin framework was discussed exhaustively.

This is a companion article , where we will go through a data transformation implementation on Spark. The Spark implementation is part of my open source Continue reading

Posted in Big Data, Data Science, ETL, Scala, Spark | Tagged , | Leave a comment

Supervised Machine Learning Parameter Search and Tuning with Simulated Annealing

The most challenging phase in supervised Machine Learning pipeline is parameter tuning. There are many parameters, each with a range of values. The so called grid search is brute force approach that tries all possible combinations of values for the parameters looking for a combination of values that gives smallest test error.

Most supervised  Machine Learning algorithm involves a significant number of parameters and grid search is not practical. To put things in perspective, if there 10 parameters and if each can take on 5 possible values there will 510 possible combinations to try.

In this post we demonstrate that with stochastic optimization technique called Simulated Annealing, near optimal solution can be found with significantly less number of iterations. The implementation Continue reading

Posted in Machine Learning, Python, ScikitLearn, Supervised Learning | Tagged , , | Leave a comment

Normal Distribution Fitness Test with Chi Square on Spark

Many Machine Learning models is based on certain assumptions made about the data. For example, in ZScore based  anomaly detection, it is  assumed that the data has normal distribution. Your Machine Learning model will be as good as how those assumptions hold true. In this post, we will go over a Spark based implementation of Chi Square test for the assumptions of some distribution of the data set

The implementation  is available in my open source chombo on github. Like my all other Spark projects,  Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark, Statistics | Tagged , | Leave a comment

Time Series Seasonal Cycle Detection with Auto Correlation on Spark

There are may benefits of auto correlation analysis on time series data, as we will be alluding to in detail later. It allows us to gain important insights on the nature of the time series data. Cycle detection is one of them. To put things in context, we will use cycle detection for energy usage time series data as an example to demonstrate the usefulness of auto correlation.

The Spark implementation is available in my open source project ruscello on github. This project has  Continue reading

Posted in Big Data, Correlation, Spark, Statistics, Time Series Analytic | Tagged , , | 1 Comment

Bulk Mutation in an Integration Data Lake with Spark

Data lakes act as repository of data from various sources, possibly of different formats. It can be used to build data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File (HDFS), which is append only. HDFS is essentially WORM file system i.e. Write Once and Read Many Times.

In an integration scenario, however your source data streams may have updates and deletes. This post is about performing updates and deletes in an HDFS backed data lake. The Spark based solution is available Continue reading

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark | Tagged , , , , | Leave a comment

Learning Alarm Threshold from User Feedback using Decision Tree on Spark

Alarm fatigue is a phenomena where some one is exposed to large number of alarms, become desensitized to them and start ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. This kind of sensory overload can happen with monitoring systems in various domains, e.g computer systems and network, industrial monitoring systems and medical patient monitoring systems.

Typically alarm flooding happens when alarm threshold levels are not set properly. How do we know what the proper alarm threshold level should be. That is the problem we will be addressing in this post. Assuming user feedback is available for alarms, we will use supervised Machine Learning to learn new threshold. The solution is available Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Outlier Detection, Spark | Tagged , , , , | Leave a comment

Contextual Outlier Detection with Statistical Modeling on Spark

Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM, may be considered anomalous between 10 PM and 6 AM. In this case, the context is the hour of the day.

In this post, we will go through some contextual outlier detection techniques based on statistical modeling of the data. The Spark based implementation is available Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark | Tagged , , | 1 Comment