Data lakes act as repository of data from various sources, possibly of different formats. It can be used to build data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File (HDFS), which is append only. HDFS is essentially WORM file system i.e. Write Once and Read Many Times.
In an integration scenario, however your source data streams may have updates and deletes. This post is about performing updates and deletes in an HDFS backed data lake. The Spark based solution is available Continue reading
Alarm fatigue is a phenomena where some one is exposed to large number of alarms, become desensitized to them and start ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. This kind of sensory overload can happen with monitoring systems in various domains, e.g computer systems and network, industrial monitoring systems and medical patient monitoring systems.
Typically alarm flooding happens when alarm threshold levels are not set properly. How do we know what the proper alarm threshold level should be. That is the problem we will be addressing in this post. Assuming user feedback is available for alarms, we will use supervised Machine Learning to learn new threshold. The solution is available Continue reading
Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM, may be considered anomalous between 10 PM and 6 AM. In this case, the context is the hour of the day.
In this post, we will go through some contextual outlier detection techniques based on statistical modeling of the data. The Spark based implementation is available Continue reading
Data validation is an essential component in any ETL data pipeline. As we all know most Data Engineers and Scientist spend most of their time cleaning and preparing their data before they can even get to the core processing of the data.
In this post we will go over a pluggable rule driven data validation solution implemented on Spark. Earlier I had posted about the same solution implemented on Hadoop. This post can be considered as a sequel to the earlier post. The solution is available Continue reading
Query expansion is a process of reformulating a query to improve query results and to be more specific to improve the recall for a query. Topic modeling is an Natural Language Processing (NLP) technique to discover hidden topics or concepts in documents. We will be going through a Query Expansion technique based on Topic Modeling.
The solution is based on Latent Dirichlet Allocation (LDA) algorithm as implemented python gensim library. LDA is a Continue reading
Categorical feature variables is a thorny issue for many supervised Machine Learning algorithms. Many learning algorithms can not handle categorical feature variables. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with Spark. It’s a recent algorithm and popular in Kaggle. This algorithm is particularly useful for high cardinality categorical features.
The Spark implementation of the encoding algorithms can be found Continue reading
This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in my open source project avenir.
The auto training logic as used here is independent of any particular supervised learning algorithm and applicable for any learning algorithm.
The frame work around ScikitLearn, used here facilitates building predictive models without having to write python code. I will be adding other supervised learning algorithms to this framework Continue reading