Find Out How Well Your Machine Learning Model is Calibrated

If your machine learning model predicts the probability of a target, as is common in classification tasks, how much confidence can you place in that predicted probability? If you need to make a critical decision based on the prediction, in the medical domain for example, will you feel confident making it? Calibration is a metric that tells you whether your model's probabilities are trustworthy. Although it is a critical metric, it is not discussed very often. It has been observed that large, complex neural networks, while more accurate, tend to be more poorly calibrated.

In this post we will walk through an example of a neural network model for heart disease prediction and find out whether the model is well calibrated. Recalibrating a poorly calibrated model is a separate issue and is not the topic of this post, but you can follow the citations in the post to learn about it. The Python code is in my GitHub repository avenir.
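As a taste of what the post covers, a common calibration check is the expected calibration error (ECE): bin predictions by confidence and compare each bin's average predicted probability against its observed positive rate. Below is a minimal NumPy sketch; the function name and toy data are illustrative, not taken from the avenir code.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted average gap between predicted confidence and
    observed accuracy, computed per probability bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if lo == 0.0:
            mask |= probs == 0.0          # include exact zeros in the first bin
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap      # weight by fraction of samples in bin
    return ece

# toy case: predicting 0.8 for a group that is positive 80% of the time
print(round(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2), 6))  # → 0.0
```

A well calibrated model yields an ECE near zero; a reliability diagram plots the same per-bin gaps visually.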

Continue reading
Posted in Data Science, Machine Learning, mlops, Python | Tagged | Leave a comment

Duplicate Data Detection with Neural Network and Contrastive Learning

Duplicate data is a ubiquitous problem in the data world. It often appears when data from different silos is consolidated, and it can be an issue in any analytics project based on data aggregated from various sources. The training data for a machine learning model may also contain duplicates which, unless removed, will adversely impact model performance. In this post we will go through a simple feed forward neural network based solution for finding duplicates. It is applicable to any structured data, whether a relational table or JSON.

The solution is available in the open source project avenir. PyTorch has been used for the neural network model; it could easily be reimplemented with TensorFlow.
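To illustrate the idea behind contrastive training, the loss below pulls embeddings of duplicate record pairs together and pushes non-duplicate pairs apart by a margin. This is a sketch of the standard contrastive loss in NumPy, not the avenir/PyTorch implementation; in a trained network the embeddings would come from the feed forward encoder applied to record features.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, is_duplicate, margin=1.0):
    """Contrastive loss on a pair of record embeddings: pull duplicate
    pairs together, push non-duplicate pairs at least `margin` apart."""
    d = np.linalg.norm(emb_a - emb_b)
    if is_duplicate:
        return 0.5 * d ** 2                      # penalize any distance
    return 0.5 * max(0.0, margin - d) ** 2       # penalize closeness only

# duplicate records embedded identically incur zero loss
a = np.array([0.2, 0.9])
print(contrastive_loss(a, a, True))    # → 0.0
print(contrastive_loss(a, a, False))   # → 0.5  (identical non-duplicates)
```

At inference time, record pairs whose embedding distance falls below the margin are flagged as likely duplicates.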

Continue reading
Posted in AI, Data Science, Deep Learning, ETL, Python, PyTorch | Tagged , , | Leave a comment

eCommerce Order Processing System Monitoring with Isolation Forest Based Anomaly Detection on Spark

Timely delivery of orders is critical for customer satisfaction in any retail eCommerce business. It is even more critical for time bound, guaranteed delivery orders. Retail eCommerce businesses generally use order processing workflow systems, which are state machines where state transitions happen after some automated or manual action.

The primary concern in such systems is delay in state transitions, which may eventually cause delivery delays for customers. In this post we will go through the application of a machine learning based multivariate anomaly detection algorithm called Isolation Forest for detecting unusual delays in order processing. The Spark based Isolation Forest implementation is available in my open source project beymani on GitHub.

Continue reading
Posted in Anomaly Detection, Data Science, eCommerce, Scala, Spark | Tagged , , , | Leave a comment

Data Driven Causal Relationship Discovery with Python Example Code

You may find two variables A and B strongly correlated, but how do you know whether A causes B or B causes A? Whatever the causal direction, causality manifests as correlation. Discovering causal relationships is important for many problems, but unlike correlation, causality is not easy to discover. In this post we will go through a technique called the Additive Noise Method. We will use product sale cannibalization as an example, i.e. whether the introduction of a new product is causing the plummeting sales of an existing competing product. The example Python code can be found in my open source project avenir on GitHub.
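The intuition behind the Additive Noise Method is that in the true causal direction the regression residuals are independent of the input, while in the reverse direction they are not. The sketch below uses a deliberately crude independence proxy (comparing residual spread across the two input halves) on synthetic data; a real implementation would use a proper independence test such as HSIC. Everything here is illustrative, not the avenir code.

```python
import numpy as np

def anm_direction(x, y, degree=3):
    """Additive Noise Method sketch: regress each way with a polynomial
    and prefer the direction whose residual spread looks independent of
    the input (roughly equal spread across the two input halves)."""
    def dependence(cause, effect):
        resid = effect - np.polyval(np.polyfit(cause, effect, degree), cause)
        low = cause <= np.median(cause)
        # heteroscedastic residuals (unequal spread) suggest the wrong direction
        return abs(np.log(resid[low].std() / resid[~low].std()))
    return "x->y" if dependence(x, y) < dependence(y, x) else "y->x"

# synthetic cause-effect pair: y is a noisy nonlinear function of x
rng = np.random.default_rng(7)
x = rng.uniform(0.2, 1.0, 2000)
y = x ** 2 + rng.normal(0.0, 0.05, 2000)
direction = anm_direction(x, y)
```

In the forward fit the residuals are just the additive noise, so their spread is uniform; fitting x on y leaves residuals whose spread varies with y, which betrays the wrong direction.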

Continue reading
Posted in Data Science, Machine Learning, Python | Tagged , | Leave a comment

Unsupervised Concept Drift Detection Techniques for Machine Learning Models with Examples in Python

Concept drift is a serious operational issue for deployed machine learning models. Please refer to my earlier post for an introduction and the relevant concepts. Unsupervised drift detection techniques, while always applicable to unsupervised models, are frequently effective for supervised machine learning models as well. Supervised machine learning is essentially about finding the conditional distribution P(y|x). For supervised models, a change in P(x) is often accompanied by a change in P(y|x), so P(x) is used as a proxy for detecting a change in P(y|x). However, when P(y|x) changes without an accompanying change in P(x), these techniques will fail.

We will go through a set of unsupervised drift detection algorithms in this post. Finally, we will detect drift in a retail customer churn prediction model using the nearest neighbor count algorithm. The Python implementation is available in my open source project beymani on GitHub.
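The nearest neighbor count idea can be sketched simply: pool the reference and current data windows and check how often a point's nearest neighbors come from its own window. If the two windows are drawn from the same distribution, neighbors mix freely; under drift, each window's points cluster among themselves. The function below is an illustrative O(n²) NumPy version, not the beymani implementation.

```python
import numpy as np

def nn_drift_score(ref, cur, k=5):
    """Pool reference and current windows; for each point, count how many
    of its k nearest neighbors come from its own window. With equal-size
    windows and no drift the score stays near 0.5; it approaches 1.0 as
    the two windows separate."""
    X = np.vstack([ref, cur])
    window = np.array([0] * len(ref) + [1] * len(cur))
    same = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        neighbors = np.argsort(d)[:k]
        same += int((window[neighbors] == window[i]).sum())
    return same / (k * len(X))

# drifted current window: same shape, shifted mean
rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, (100, 2))
current = rng.normal(5.0, 1.0, (100, 2))
score = nn_drift_score(reference, current)
```

In practice the score is compared against its no-drift sampling distribution (e.g. via permutation) to decide whether the deviation from 0.5 is significant.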

Continue reading
Posted in Data Science, Machine Learning, mlops, Python | Tagged , | Leave a comment

Robustness Measurement of Machine Learning Models with Examples in Python

While all the focus during training is on maximizing model accuracy, not enough attention is paid to model robustness. You may have a perfectly trained model with high accuracy, but how confident are you in that accuracy? It may not be stable: it may vary across different regions of the feature space, or the model may be very sensitive to moderately out of distribution data following production deployment.

The focus of this post is an overview of various robustness metrics, followed by some results for one particular metric. The implementation is available in my open source GitHub repository avenir.
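One simple robustness metric in this spirit compares accuracy on clean inputs against accuracy on noise-perturbed inputs. The sketch below is illustrative, with a stand-in fixed-threshold "model" rather than a trained one; the function and parameter names are my own, not from avenir.

```python
import numpy as np

def perturbation_robustness(predict, X, y, noise_scale=0.1, trials=20, seed=0):
    """Ratio of accuracy under Gaussian input jitter to clean accuracy.
    Values near 1.0 mean predictions are stable under small perturbations."""
    rng = np.random.default_rng(seed)
    clean_acc = (predict(X) == y).mean()
    noisy_acc = np.mean([
        (predict(X + rng.normal(0.0, noise_scale, X.shape)) == y).mean()
        for _ in range(trials)
    ])
    return noisy_acc / clean_acc

# stand-in 'model': a fixed decision boundary instead of a trained network
predict = lambda X: (X[:, 0] > 0.0).astype(int)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = predict(X)
robustness = perturbation_robustness(predict, X, y)
```

Only points near the decision boundary flip under jitter, so a robust model keeps the ratio close to one; repeating the measurement in different regions of the feature space also exposes accuracy instability.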

Continue reading
Posted in AI, Data Science, Machine Learning, Python, PyTorch | Tagged | Leave a comment

Detecting and Measuring Human Bias in Machine Learning Models

Any machine learning model used to make decisions about humans may be biased, because the data used to train the model may be tainted with human bias. A model trained with biased data will exhibit the same bias when used for making predictions. Examples of such decisions include loan approval, recruitment and crime prediction. Such biased model behavior may violate anti discrimination laws in many countries. Proper steps are necessary to detect such bias, remove it and comply with regulatory requirements.

The focus of this post is detecting and measuring human bias according to various metrics. The Python implementation is available in my open source GitHub repository avenir.
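One widely used metric of this kind is disparate impact: the ratio of favorable-outcome rates between the unprivileged and privileged groups, with the "four-fifths rule" commonly flagging ratios below 0.8. A minimal sketch with invented predictions and group labels, not the avenir implementation:

```python
def disparate_impact(preds, groups):
    """Ratio of positive-outcome rates, unprivileged over privileged.
    The 'four-fifths rule' commonly flags ratios below 0.8 as biased."""
    def rate(g):
        selected = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(selected) / len(selected)
    return rate("unprivileged") / rate("privileged")

# toy loan decisions: 80% approvals for one group, 40% for the other
preds  = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
groups = ["privileged"] * 5 + ["unprivileged"] * 5
print(disparate_impact(preds, groups))   # → 0.5, well below the 0.8 threshold
```

Other metrics covered by the same machinery, such as statistical parity difference, compare the two rates by subtraction rather than division.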

Continue reading
Posted in AI, Machine Learning, Python | Tagged , , | Leave a comment

Customer Service Quality Monitoring with AutoEncoder based Anomalous Case Detection

Most companies put a lot of effort into ensuring superb customer service. They want to resolve customer issues as quickly as possible, leaving customers with a positive experience. It has been said that one negative customer service experience can obliterate loyalty to a company built over many years. Machine learning can play a significant role in improving the quality of customer service.

In this post we go through a solution for detecting anomalous customer service cases using an AutoEncoder. An anomalous case will in many instances represent poor customer service. The solution is based on a PyTorch implementation of an AutoEncoder. I have implemented a Python wrapper class around the PyTorch AutoEncoder which, along with a configuration file, makes it easier to use. The solution is available in my open source GitHub project avenir.

Continue reading
Posted in Data Science, Machine Learning, Outlier Detection, Python, PyTorch | Tagged , , | Leave a comment

Concept Drift Detection Techniques with Python Implementation for Supervised Machine Learning Models

Concept drift is a serious problem for machine learning models deployed in production. It occurs when a significant change in the underlying data generation process causes a significant shift in the posterior distribution P(y|x), and it is manifested as a significant increase in the error rate of deployed models. To mitigate the risk, it is critical to monitor the performance of deployed models and detect any concept drift; unless drift is detected and a model retrained with recent data is deployed, your model may become ineffective in production. One recent example of the detrimental effect of concept drift, as reported in the media, is the worsening performance of many deployed machine learning models resulting from significant changes in customer behavior due to the coronavirus.

In this post, we will go through some techniques for supervised concept drift detection. We will also go through the Python implementation of the algorithms, along with results for an algorithm called the Early Drift Detection Method (EDDM). The Python implementation is available in my open source GitHub repo for anomaly detection, beymani.
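EDDM's core idea is to track the distance, in samples, between consecutive misclassifications: under drift, errors arrive closer together and the running level mean + 2·std falls well below the best level seen so far. Below is a simplified sketch of that logic; the threshold, warm-up count and synthetic error stream are illustrative choices, not the beymani implementation.

```python
class SimpleEDDM:
    """Simplified EDDM sketch: track running mean/std of the distance
    (in samples) between consecutive misclassifications and signal drift
    when mean + 2*std drops well below the best level seen so far,
    i.e. when errors start arriving much closer together."""

    def __init__(self, threshold=0.9, min_errors=30):
        self.threshold, self.min_errors = threshold, min_errors
        self.i = 0           # samples seen
        self.last_err = 0    # index of the previous error
        self.n_err = 0
        self.mean = 0.0      # Welford running stats of error distances
        self.m2 = 0.0
        self.best = 0.0      # highest (mean + 2*std) level observed

    def update(self, is_error):
        """Feed True for a misclassification; returns True on drift."""
        self.i += 1
        if not is_error:
            return False
        dist = self.i - self.last_err
        self.last_err = self.i
        self.n_err += 1
        delta = dist - self.mean
        self.mean += delta / self.n_err
        self.m2 += delta * (dist - self.mean)
        level = self.mean + 2.0 * (self.m2 / self.n_err) ** 0.5
        self.best = max(self.best, level)
        return (self.n_err >= self.min_errors
                and level < self.threshold * self.best)

# synthetic stream: an error every 20 samples, then every 2 after drift
detector, drift_at = SimpleEDDM(), None
for i in range(1, 1401):
    is_error = (i % 20 == 0) if i <= 1000 else (i % 2 == 0)
    if detector.update(is_error) and drift_at is None:
        drift_at = i
```

Because it watches error spacing rather than the raw error rate, EDDM is particularly suited to gradual drift; the original method uses a second, lower threshold to separate a warning zone from confirmed drift.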

Continue reading
Posted in Data Science, Machine Learning, Python, Statistics | Tagged , , , | 1 Comment

Meeting Schedule Optimization with Genetic Algorithm in Python

There are many complex real world optimization problems for which it is not possible to obtain the exact best solution efficiently with a reasonable amount of computing resources; the solution search space for such problems is often combinatorially explosive. For such problems, heuristic optimization is the only pragmatic option: a heuristic optimization algorithm with significantly reduced computational cost is used when a sub optimal solution is acceptable.

In this post we will go through a solution for meeting schedule optimization with a Genetic Algorithm (GA) in Python. For this seemingly innocuous problem, the search space may contain trillions of candidate solutions. I have implemented a set of heuristic optimization algorithms, including GA, available in my open source GitHub repository avenir. The implementations are reusable and agnostic to any specific problem.
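The GA loop itself is short: encode a candidate as a slot assignment per meeting, score it by participant conflicts, and evolve the population with elitism, one-point crossover and point mutation. Below is a self-contained toy sketch; the meetings, participants and parameters are invented, and this is not the reusable avenir implementation.

```python
import random

def conflicts(assign, meetings):
    """Count pairs of meetings sharing a participant in the same time slot."""
    return sum(1 for i in range(len(meetings)) for j in range(i + 1, len(meetings))
               if assign[i] == assign[j] and meetings[i] & meetings[j])

def ga_schedule(meetings, n_slots, pop_size=40, gens=200, seed=1):
    """Evolve slot assignments: elitism + one-point crossover + mutation."""
    rng = random.Random(seed)
    fitness = lambda a: conflicts(a, meetings)
    pop = [[rng.randrange(n_slots) for _ in meetings] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:
            break                                    # conflict-free schedule found
        nxt = pop[: pop_size // 4]                   # elitism: keep the best quarter
        while len(nxt) < pop_size:
            a, b = rng.sample(pop[: pop_size // 2], 2)
            cut = rng.randrange(1, len(meetings))
            child = a[:cut] + b[cut:]                # one-point crossover
            if rng.random() < 0.3:                   # point mutation
                child[rng.randrange(len(meetings))] = rng.randrange(n_slots)
            nxt.append(child)
        pop = nxt
    return min(pop, key=fitness)

# each meeting is represented by the set of its participants
meetings = [{"ann", "bob"}, {"bob", "cara"}, {"ann", "dev"}, {"cara", "dev"}]
best = ga_schedule(meetings, n_slots=3)
```

With realistic constraints (availability windows, room capacity, priorities) the fitness function grows, but the evolutionary loop stays the same, which is what makes the approach problem-agnostic.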

Continue reading
Posted in Data Science, Optimization, Python | Tagged , | Leave a comment