Category Archives: Spark

Bulk Mutation in an Integration Data Lake with Spark

Data lakes act as a repository for data from various sources, possibly in different formats. They can be used to build a data warehouse or to perform other data analysis activities. Data lakes are generally built on top of Hadoop Distributed File … Continue reading

Posted in Big Data, Data Warehouse, eCommerce, ETL, Spark

Learning Alarm Threshold from User Feedback using Decision Tree on Spark

Alarm fatigue is a phenomenon in which someone exposed to a large number of alarms becomes desensitized to them and starts ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Outlier Detection, Spark

Contextual Outlier Detection with Statistical Modeling on Spark

Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM, … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark

Pluggable Rule Driven Data Validation with Spark

Data validation is an essential component of any ETL data pipeline. As we all know, most data engineers and data scientists spend most of their time cleaning and preparing their data before they can even get to the core processing of … Continue reading

Posted in Big Data, Data Science, ETL, Spark

Leave One Out Encoding for Categorical Feature Variables on Spark

Categorical feature variables are a thorny issue for many supervised Machine Learning algorithms. Many learning algorithms cannot handle categorical feature variables. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with … Continue reading

Posted in Big Data, Data Science, ETL, Spark

Handling Categorical Feature Variables in Machine Learning using Spark

Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real-world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms … Continue reading

Posted in Big Data, Data Science, Data Transformation, ETL, Scala, Spark

Optimizing Discount Price for Perishable Products with Thompson Sampling using Spark

For retailers, stocking perishable products is a risky business. If a product doesn’t sell out by the expiry date, the remaining inventory has to be discarded and a loss taken on those items. Retailers will do whatever is necessary … Continue reading

Posted in AI, Big Data, Data Science, Reinforcement Learning, Scala, Spark