Category Archives: Data Science

Learning Alarm Threshold from User Feedback using Decision Tree on Spark

Alarm fatigue is a phenomena where some one is exposed to large number of alarms, become desensitized to them and start ignoring them. It’s been reported that security professionals ignore 32% of alarms because they are thought to be false. … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Outlier Detection, Spark | Tagged , , , , | Leave a comment

Contextual Outlier Detection with Statistical Modeling on Spark

Sometimes an outlier is defined with respect to a context. Whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM, … Continue reading

Posted in Anomaly Detection, Big Data, Data Science, Spark | Tagged , , | 1 Comment

Pluggable Rule Driven Data Validation with Spark

Data validation is an essential component in any ETL data pipeline. As we all know most Data Engineers and Scientist spend most of their time cleaning and preparing their data before they can even get to the core processing of … Continue reading

Posted in Big Data, Data Science, ETL, Spark | Tagged , | 2 Comments

Leave One Out Encoding for Categorical Feature Variables on Spark

Categorical feature variables is a thorny issue for many supervised Machine Learning algorithms. Many learning algorithms can not handle categorical feature variables. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with … Continue reading

Posted in Big Data, Data Science, ETL, Spark | Tagged | 1 Comment

Auto Training and Parameter Tuning for a ScikitLearn based Model for Leads Conversion Prediction

This is a sequel to my last blog on CRM leads conversion prediction using Gradient Boosted Trees as implemented in ScikitLearn. The focus of this blog is automatic training and parameter tuning for the model. The implementation is available in … Continue reading

Posted in Data Science, Machine Learning, Python, ScikitLearn, Supervised Learning | Tagged , , , | Leave a comment

Predicting CRM Lead Conversion with Gradient Boosting using ScikitLearn

Sales leads are are generally managed and nurtured in CRM systems. It will be nice if we could predict the likelihood of any lead converting to an actual deal. This could be very beneficial in many ways e.g. proactively¬† providing … Continue reading

Posted in Data Science, Machine Learning, Optimization, Python, ScikitLearn | Tagged , , , | 2 Comments

Handling Categorical Feature Variables in Machine Learning using Spark

Categorical features variables i.e. features variables with fixed set of unique values¬† appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms … Continue reading

Posted in Big Data, Data Science, Data Transformation, ETL, Scala, Spark | Tagged , , | Leave a comment