Robustness Measurement of Machine Learning Models with Examples in Python

While most of the attention goes to maximizing accuracy when training a machine learning model, not enough is paid to model robustness. You may have a perfectly trained model with high accuracy, but how confident are you about that accuracy? The accuracy may not be stable: it may vary across different regions of the feature space. Or the model may be very sensitive to moderately out-of-distribution data after production deployment.

This post gives an overview of various robustness metrics and then shows some results for one particular metric. The implementation is available in my open source GitHub repository avenir.
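To make the idea of accuracy varying across regions of the feature space concrete, here is a minimal sketch. It is not the metric implemented in avenir; the function names, the toy data and the two-region split are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import pstdev


def accuracy_by_region(features, y_true, y_pred, region_of):
    """Group samples by region and return the accuracy within each region."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for x, actual, predicted in zip(features, y_true, y_pred):
        region = region_of(x)
        counts[region] += 1
        hits[region] += int(actual == predicted)
    return {region: hits[region] / counts[region] for region in counts}


def accuracy_spread(per_region):
    """Summarize how much accuracy differs between regions."""
    values = list(per_region.values())
    return {"min": min(values), "max": max(values), "std": pstdev(values)}


# Toy data (hypothetical): one numeric feature, binary labels.
features = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0]


def region_of(x):
    # Hypothetical partition of the feature space into two regions.
    return "low" if x < 0.5 else "high"


per_region = accuracy_by_region(features, y_true, y_pred, region_of)
spread = accuracy_spread(per_region)
print(per_region, spread)
```

The overall accuracy on this toy data is 0.75, which hides the fact that every mistake falls in one region. A large spread between regions is one simple signal that the headline accuracy is not stable.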

Posted in AI, Data Science, Machine Learning, Python, PyTorch

Detecting and Measuring Human Bias in Machine Learning Models

Any machine learning model used for making decisions about humans may be biased, because the data used to train the model may be tainted with human bias. A model trained with biased data will exhibit the same bias when used for making predictions. Loan approval, recruitment and crime prediction are some examples of decisions where such bias arises. Such biased behavior of models may be in violation of various anti-discrimination laws in many countries. Proper steps are necessary to detect such bias, remove it and comply with regulatory requirements.

The focus of this post is detecting and measuring human bias according to various metrics. The Python implementation is available in my open source GitHub repository avenir.
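As an illustration of what a bias metric can look like, the sketch below computes two widely used group fairness measures: demographic parity difference and disparate impact ratio. The excerpt does not say which metrics the post covers, so these two, the function names and the toy loan data are assumptions.

```python
def positive_rate(preds, groups, group):
    """Fraction of members of `group` that received a positive prediction."""
    selected = [p for p, g in zip(preds, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_parity_difference(preds, groups, protected, reference):
    """Reference group positive rate minus protected group positive rate."""
    return positive_rate(preds, groups, reference) - positive_rate(preds, groups, protected)


def disparate_impact_ratio(preds, groups, protected, reference):
    """Protected group positive rate divided by reference group positive rate."""
    return positive_rate(preds, groups, protected) / positive_rate(preds, groups, reference)


# Toy loan approval predictions (hypothetical): 1 = approved, 0 = rejected.
preds = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

parity_gap = demographic_parity_difference(preds, groups, protected="B", reference="A")
impact_ratio = disparate_impact_ratio(preds, groups, protected="B", reference="A")
print(round(parity_gap, 3), round(impact_ratio, 3))
```

A parity gap near 0 and an impact ratio near 1 indicate that both groups are approved at similar rates. In this toy data group A is approved 80% of the time and group B only 20% of the time.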

Posted in AI, Machine Learning, Python

Customer Service Quality Monitoring with AutoEncoder based Anomalous Case Detection

Most companies put a lot of effort into ensuring superb customer service. They want to resolve customer issues as quickly as possible, leaving customers with a positive experience. It’s been said that one negative experience with customer service can obliterate loyalty to a company built over many years. Machine learning can play a significant role in improving the quality of customer service.

In this post we go through a solution for detecting anomalous customer service cases using an AutoEncoder. An anomalous customer service case will in many cases represent poor customer service. The solution is based on a PyTorch implementation of AutoEncoder. I have implemented a Python wrapper class around the PyTorch AutoEncoder. This, along with a configuration file, makes the PyTorch AutoEncoder easier to use. The solution is available in my open source GitHub project avenir.

Posted in Data Science, Machine Learning, Outlier Detection, Python, PyTorch

Concept Drift Detection Techniques with Python Implementation for Supervised Machine Learning Models

Concept drift is a serious problem for machine learning models deployed in production. Concept drift occurs when there is a significant change in the underlying data generation process, causing a significant shift in the posterior distribution p(y|x). It manifests as a significant increase in error rates for deployed models. To mitigate the risk, it is critical to monitor the performance of deployed models and detect any concept drift. If drift goes undetected and no model retrained on recent data is deployed, concept drift may render your model ineffective in production. One recent example of the detrimental effect of concept drift, as reported in the media, is the worsening performance of many deployed machine learning models as a result of significant changes in customer behavior due to the coronavirus.

In this post, we will go through some techniques for supervised concept drift detection. We will also go through a Python implementation of the algorithms, along with results using an algorithm called Early Drift Detection Method (EDDM). The Python implementation is available in beymani, my open source GitHub repo for anomaly detection.
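For orientation, here is a minimal sketch of the EDDM idea. It is not the beymani implementation. EDDM tracks the distance, in number of samples, between consecutive prediction errors; when errors start arriving closer together, the mean distance plus two standard deviations falls relative to its historical maximum. The class name, the method name and the status strings are assumptions, and the thresholds (0.95 for warning, 0.90 for drift, at least 30 errors) are the commonly cited defaults.

```python
import math


class EDDM:
    """Sketch of the Early Drift Detection Method over a stream of error flags."""

    def __init__(self, warning_level=0.95, drift_level=0.90, min_errors=30):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_errors = min_errors
        self.n_samples = 0
        self.n_errors = 0
        self.last_error_at = 0
        self.mean = 0.0       # running mean of distances between errors
        self.m2 = 0.0         # running sum of squared deviations (Welford)
        self.max_level = 0.0  # largest value of mean + 2 * std seen so far
        self.status = "stable"

    def update(self, is_error):
        """Feed one prediction outcome and return 'stable', 'warning' or 'drift'."""
        self.n_samples += 1
        if not is_error:
            return self.status  # statistics only change when an error occurs

        distance = self.n_samples - self.last_error_at
        self.last_error_at = self.n_samples
        self.n_errors += 1

        # Welford update of the mean and variance of the error distances.
        delta = distance - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (distance - self.mean)
        std = math.sqrt(self.m2 / self.n_errors)

        level = self.mean + 2.0 * std
        self.max_level = max(self.max_level, level)

        if self.n_errors < self.min_errors:
            self.status = "stable"
            return self.status

        ratio = level / self.max_level
        if ratio < self.drift_level:
            self.status = "drift"
        elif ratio < self.warning_level:
            self.status = "warning"
        else:
            self.status = "stable"
        return self.status


# Synthetic stream: one error every 10 samples, then one error every 2 samples.
stream = [i % 10 == 9 for i in range(400)] + [i % 2 == 1 for i in range(300)]
detector = EDDM()
statuses = [detector.update(err) for err in stream]
first_drift = statuses.index("drift") if "drift" in statuses else None
print(first_drift)
```

In a real deployment the detector would typically be reset, and the model retrained, once drift is signaled. This sketch leaves that step to the caller.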

Posted in Data Science, Machine Learning, Python, Statistics

Meeting Schedule Optimization with Genetic Algorithm in Python

There are many complex real world optimization problems for which it’s not possible to obtain the exact best solution efficiently with a reasonable amount of computing resources. Often the solution search space for such problems is combinatorially explosive. For such problems, heuristic optimization is the only pragmatic option. A heuristic optimization algorithm, with its significantly reduced computational cost, is used when a suboptimal solution is acceptable.

In this post we will go through a solution for meeting schedule optimization with Genetic Algorithm (GA) in Python. For this seemingly innocuous problem, the search space may have trillions of solutions to explore. I have implemented a set of heuristic optimization algorithms, including GA, available in my open source GitHub repository avenir. The implementations are reusable and agnostic to any specific problem.
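The following is a small, self-contained sketch of how a genetic algorithm can be applied to a scheduling problem of this kind. It is not the avenir implementation. The meetings, the participants, the number of time slots, the cost function and all GA settings are hypothetical and chosen only to keep the example short.

```python
import random

# Hypothetical meetings and their participants.
MEETINGS = {
    "design": {"ann", "bob"},
    "budget": {"bob", "cat"},
    "hiring": {"ann", "cat"},
    "standup": {"ann", "bob", "cat", "dan"},
    "review": {"dan", "eve"},
    "planning": {"eve", "ann"},
}
NUM_SLOTS = 4  # hypothetical number of available time slots


def cost(schedule):
    """Count participant clashes: people booked into two meetings in one slot."""
    names = list(MEETINGS)
    conflicts = 0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if schedule[i] == schedule[j]:
                conflicts += len(MEETINGS[names[i]] & MEETINGS[names[j]])
    return conflicts


def evolve(pop_size=30, generations=50, mutation_rate=0.1, seed=0):
    """Run a simple GA; a schedule is a list holding one slot index per meeting."""
    rng = random.Random(seed)
    n = len(MEETINGS)
    population = [[rng.randrange(NUM_SLOTS) for _ in range(n)] for _ in range(pop_size)]
    history = []  # best cost at the start of each generation
    for _ in range(generations):
        population.sort(key=cost)
        history.append(cost(population[0]))
        next_gen = [population[0][:]]  # elitism: keep the best schedule unchanged
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            parent_a = min(rng.sample(population, 3), key=cost)
            parent_b = min(rng.sample(population, 3), key=cost)
            # One-point crossover.
            cut = rng.randrange(1, n)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: occasionally move a meeting to a random slot.
            child = [
                rng.randrange(NUM_SLOTS) if rng.random() < mutation_rate else slot
                for slot in child
            ]
            next_gen.append(child)
        population = next_gen
    best = min(population, key=cost)
    history.append(cost(best))
    return best, history


best, history = evolve()
print(dict(zip(MEETINGS, best)), "conflicts:", cost(best))
```

Because the best schedule of each generation is carried over unchanged, the best cost never gets worse from one generation to the next. A realistic cost function would also account for room availability, preferred hours and similar constraints.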

Posted in Data Science, Optimization, Python

Causal Inference with Deep Learning using Manufacturing Supply Chain Optimization as an Example

Machine learning has been very successful at using observational data to build predictive models, but it does not go far enough for causal inference. We humans use cause and effect to learn about the world. In causal inference, statistical tools are used to analyze cause and effect. In causal analysis, our goal is to set a variable to a specific value and find the resulting outcome in another variable, which aids decision making. This is traditionally done through a Randomized Controlled Trial or A/B testing. However, in many real life cases A/B testing is not feasible or is too expensive. In this post we will discuss a solution for causal inference with deep learning models. We will use a manufacturing supply chain as an example, where our goal will be to gain insight into how to reduce back orders to optimize profit.

The causal inference analysis in this post is based on a causal graphical model and do-calculus. The implementation, based on PyTorch, is available in my open source project avenir on GitHub.
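For readers new to do-calculus, one standard result is the backdoor adjustment formula. It expresses the effect of an intervention using only observational quantities, provided a set of variables Z blocks every backdoor path from X to Y in the causal graph. The symbols X, Y and Z are generic here, and whether the post's supply chain graph relies on this particular identity is an assumption.

```latex
P\bigl(Y = y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y = y \mid X = x,\, Z = z\bigr)\, P\bigl(Z = z\bigr)
```

The conditional term on the right can be estimated from observational data, for example with a trained predictive model, and then averaged over the observed distribution of Z.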

Posted in Data Science, Deep Learning, Machine Learning, Python, PyTorch

Time Series Change Point Detection with Two Sample Statistic on Spark with Application for Retail Sales Data

The goal of change point detection is to detect the times when statistically significant and sustained changes happen in a time series. It has a wide range of applications in various domains including retail, medical, IoT, finance, business and meteorology. In this post we will go through a solution, implemented on Spark, that uses a nonparametric two sample statistic to identify change points. Retail eCommerce sales data will be used as an example to showcase the solution. Abrupt changes in sales can occur for various reasons, e.g. cannibalization by a competing product or a sudden increase in sales due to a wrongly posted sale price.
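To illustrate the general approach, here is a single machine Python sketch, not the Spark implementation in beymani. The excerpt does not name the two sample statistic used, so the Kolmogorov-Smirnov statistic is an assumption, as are the function names, the window size and the synthetic sales series.

```python
def ks_statistic(sample_a, sample_b):
    """Two sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


def change_point_scores(series, window):
    """Compare the window before each time index with the window starting at it."""
    return {
        t: ks_statistic(series[t - window:t], series[t:t + window])
        for t in range(window, len(series) - window + 1)
    }


# Synthetic daily sales (hypothetical): the level jumps from about 100 to about 140.
sales = [100 + (i % 5) for i in range(30)] + [140 + (i % 5) for i in range(30)]

scores = change_point_scores(sales, window=10)
best_t = max(scores, key=scores.get)
print(best_t, scores[best_t])
```

The index with the highest score is the most likely change point. In practice a significance threshold on the statistic would decide whether any change is reported at all.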

The implementation is part of my OSS project beymani on GitHub. As with all my projects, the implementation, ...

Posted in Anomaly Detection, Big Data, Data Science, Scala, Spark, Time Series Analytic

Predicting Individual Viral Infection using Contact Data with LSTM Neural Network

With Covid-19 ravaging the world, a lot of people are exploring ways AI and ML can help in combating the spread of the virus and infection. A virus like Covid-19 poses a complex socioeconomic and public health problem, and the solutions cut across many disciplines. In this post, the focus is on a very specific problem related to testing. Typical testing policies, like testing en masse or testing whoever wants to get tested, are not very effective. Wouldn’t it be great if we could predict the probability of viral infection for anyone based on recent contact history and then use the result to judiciously decide who should be tested?

In this post we will go through a solution where personal contact data, treated as a sequence, is used along with an infection label to train a Long Short Term Memory (LSTM) network. The trained model can be deployed to predict the probability of infection for anyone with contact data. This prediction could be used by healthcare authorities to select the people who should be tested in a data driven way. The Python implementation ...

Posted in Data Science, Deep Learning, Machine Learning, Python, PyTorch

Semantic Search with Pre Trained Neural Transformer Model using Document, Sentence and Token Level Embedding

Some time ago I worked on an enterprise search project, where we were tasked with improving the performance of an enterprise Solr search deployment. We recommended various improvements based on classic NLP techniques. One of the items on the agenda was semantic search based on a deep learning language model. Unfortunately we never got to it because of time and budgetary constraints.

Recently I got a chance to experiment with the BERT pre trained Transformer model for semantic search. I experimented with various similarity algorithms for query and document vector embeddings. I will share my findings in this post, along with suggestions on how to integrate it with Solr or ElasticSearch to boost performance. The Python script is available ...

Posted in AI, Deep Learning, elastic search, Machine Learning, NLP, Python, PyTorch, Search

Learn about Your Data with about Seventy Data Exploration Functions All in One Python Class

It’s a costly mistake to jump straight into building machine learning models before getting good insight into your data. I have made the mistake and paid the price. Since then I have resolved to learn as much as possible about the data before taking the next step. While exploring data, I always found myself using multiple Python libraries and doing a plethora of imports for various Python modules.

That experience motivated me to consolidate all common Python data exploration functions in one Python class to make them easier to use. As an added feature I have also provided a workspace like interface, with which you can register multiple data sets under a user provided name for each data set. You can refer to the data sets by name and perform various operations. The Python implementation is available ...

Posted in Data Science, Python, Statistics