Tabular Data Column Semantic Type Identification with Contrastive Deep Learning

When data is aggregated from various sources in a dynamic environment where the data format may change without notice, identifying the semantic type of each column is a challenging problem. In this post, semantic type identification of data columns is framed as a classification problem with manually engineered column features. Instead of the usual SoftMax label probabilities, we will use contrastive learning. The solution is available in my OSS GitHub repo whakapai. It’s also available as a Python package called torvik.
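
To make the contrastive idea concrete, here is a minimal sketch of a pairwise contrastive loss on column embeddings. The excerpt does not say which contrastive loss the post uses, so the margin based form below, the function name and the toy embeddings are all assumptions for illustration.

```python
import math

def contrastive_loss(emb_a, emb_b, same_type, margin=1.0):
    """Pairwise contrastive loss on two column embeddings (hypothetical helper).

    Columns of the same semantic type are pulled together (squared distance);
    columns of different types are pushed apart until they are at least
    `margin` away from each other.
    """
    d = math.dist(emb_a, emb_b)
    if same_type:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy embeddings, made up for illustration
loss_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_type=True)
loss_far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_type=False)
loss_too_close = contrastive_loss([0.0, 0.0], [0.6, 0.0], same_type=False)
```

With this kind of loss, a new column can be typed by comparing its embedding with embeddings of columns whose type is known, rather than by reading off a SoftMax probability.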

Posted in Data Science, Deep Learning, Python, PyTorch

Feature Selection with Information Theory Based Techniques in Python.

Feature selection is the process of selecting the most relevant subset of features from a given set of features for a supervised machine learning problem. There are many techniques for feature selection. In this post we will use 4 information theory based feature selection algorithms. This post is not about feature engineering, which is the construction of new features from a given set of features. The implementation is available in the daexp module of my Python package matumizi. The GitHub repo is whakapai.
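
As a rough illustration of the information theoretic approach, the sketch below ranks discrete features by their mutual information with the target. This is the simplest relevance only ranking and is not necessarily one of the 4 algorithms in the post; it does not use the matumizi API, and the function names and toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

def rank_features(features, target):
    """Return (name, score) pairs sorted by decreasing mutual information."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data, made up for illustration
target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "copy_of_target": [0, 0, 1, 1, 0, 1, 0, 1],  # fully informative
    "noise": [0, 1, 0, 1, 1, 0, 0, 1],           # independent of the target
}
ranking = rank_features(features, target)
```

More elaborate information theoretic methods also penalize redundancy between the selected features, which this sketch ignores.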

Posted in Data Science, Machine Learning, Python

Discovering Subject Matter Experts from Email Communication Data using Graph Convolution Network.

A deep learning model architecture aligns with a specific structure of the data, e.g. RNN or LSTM for linear data like text, and CNN for grid data like images. The data structures in these cases are specialized kinds of graph structure. Linear data like text is a linear graph and grid data like an image is a grid graph. A Graph Neural Network (GNN) is very powerful because it can process data with an arbitrary graph structure. Data with a generic graph structure abounds in real life, e.g. social networks and paper citation graphs. In this post, we will find out how a GNN can be used to discover subject matter experts from email communication data.

We will use a type of GNN called Graph Convolution Network (GCN) for the solution. A no code GCN implementation based on PyTorch is available in my GitHub repo whakapai. It’s also available as part of a Python package on TestPyPI.
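
For readers new to GCN, here is a minimal pure Python sketch of a single graph convolution layer using the common symmetric normalization with self loops. It is not the PyTorch implementation from the repo; the function names and the three person email graph are assumptions for illustration.

```python
import math

def normalized_adjacency(adj):
    """Add self loops, then scale each entry by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, features, weights):
    """One GCN layer: ReLU(normalized adjacency x features x weights)."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]

# Hypothetical email graph: person 0 and person 2 each exchange mail with person 1
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0], [1.0], [1.0]]  # one input feature per person
weights = [[1.0]]                 # 1 x 1 weight matrix (learned in practice)
hidden = gcn_layer(adj, features, weights)
```

Each output row mixes a node's own features with those of its neighbors, which is how information about who communicates with whom reaches the node level prediction.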

Posted in Deep Learning, Machine Learning, Python, PyTorch

Gig Economy Workforce Scheduling with Reinforcement Learning

Gig economy workers typically work on contract, potentially temporary, and are called to work on an as-needed basis. Some examples are delivery services, app based taxi services, content creation and low level administrative work. A company may have a pool of gig workers. On a given day, based on the demand forecast, it might need a certain number of workers. How does it decide which workers to call from the pool so that the choice is most beneficial to the company? It’s a complex decision making problem. We are going to find out in this post how a type of Reinforcement Learning (RL) called Multi Arm Bandit (MAB) can effectively solve this decision making problem.
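
As a rough sketch of how a bandit could drive this choice, the code below treats each worker as an arm and calls the k workers with the highest UCB1 scores each day. The excerpt does not say which MAB algorithm the post uses, so UCB1, the Bernoulli reward (shift completed satisfactorily or not), the worker names and their quality values are all assumptions.

```python
import math
import random

def select_workers(counts, rewards, day, k):
    """Pick the k workers with the highest UCB1 scores; untried workers go first."""
    def score(worker):
        if counts[worker] == 0:
            return float("inf")
        mean = rewards[worker] / counts[worker]
        return mean + math.sqrt(2.0 * math.log(day) / counts[worker])
    return sorted(counts, key=score, reverse=True)[:k]

def simulate(true_quality, days, k, seed=0):
    """Simulate daily scheduling and return how often each worker was called."""
    rng = random.Random(seed)
    counts = {w: 0 for w in true_quality}
    rewards = {w: 0.0 for w in true_quality}
    for day in range(1, days + 1):
        for worker in select_workers(counts, rewards, day, k):
            # Reward is 1 if the simulated shift went well, else 0
            reward = 1.0 if rng.random() < true_quality[worker] else 0.0
            counts[worker] += 1
            rewards[worker] += reward
    return counts

# Hypothetical pool: probability that each worker completes a shift satisfactorily
pool = {"ann": 0.9, "bob": 0.8, "cy": 0.3, "di": 0.2}
calls = simulate(pool, days=1000, k=2)
```

The exploration bonus shrinks as a worker is called more often, so the schedule gradually concentrates on the better workers while still occasionally checking on the others.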

The Python implementation is available in my OSS GitHub repository avenir. The use case for the solution is a fictitious food delivery service.

Posted in Machine Learning, multi arm bandit, Python, Reinforcement Learning

Out of Distribution Data Detection in Deployed Machine Learning Models

If a deployed machine learning model encounters out of distribution data, it should either reject it or delegate it to a human reviewer for further investigation and decision making. A sample is out of distribution (OOD) when it is generated by a distribution different from the distribution of the training data. For high stakes applications such as finance or medicine, it’s critical to detect OOD data. Out of distribution data is related to data drift, except that data drift signifies a permanent shift in the data distribution. Out of distribution data is closer to the outlier or anomalous data problem. Outlier detection aims to detect samples that are markedly different from most of the data.

There are many techniques for OOD detection. In this post we will go through an OOD detection technique based on the nearest neighbor algorithm applied to the latent data of a deep learning model. The implementation is available in my OSS GitHub repo avenir.
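
The nearest neighbor idea can be sketched in a few lines, assuming the latent vectors have already been extracted from a hidden layer of the model. A sample is flagged as OOD when its distance to the k-th nearest training latent exceeds a threshold taken from the training data itself. This is not the avenir implementation; the function names, the quantile rule and the toy 2-D latents are assumptions.

```python
import math

def knn_distance(point, reference, k):
    """Distance from point to its k-th nearest neighbor in reference."""
    dists = sorted(math.dist(point, r) for r in reference)
    return dists[k - 1]

def fit_threshold(train_latents, k, quantile=0.95):
    """Leave-one-out k-NN distances on training latents; return a high quantile."""
    scores = []
    for i, p in enumerate(train_latents):
        others = train_latents[:i] + train_latents[i + 1:]
        scores.append(knn_distance(p, others, k))
    scores.sort()
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]

def is_ood(latent, train_latents, k, threshold):
    """Flag a sample whose k-NN distance is larger than the threshold."""
    return knn_distance(latent, train_latents, k) > threshold

# Toy 2-D latent vectors on a small grid, made up for illustration
train = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
threshold = fit_threshold(train, k=3)
inside = is_ood((0.22, 0.18), train, k=3, threshold=threshold)
far_away = is_ood((2.0, 2.0), train, k=3, threshold=threshold)
```

In practice the latent space has many more dimensions and an approximate nearest neighbor index would replace the brute force search used here.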

Posted in AI, Machine Learning, mlops, Outlier Detection, Python, PyTorch

Remedial Action Recommendation with Machine Learning and Genetic Algorithm

Prescriptive analytics sits at the top of a three tier analytics pyramid. The bottom layers are descriptive and predictive analytics. Prescriptive analytics entails action recommendations, based on the results of descriptive and predictive analytics, which if executed will have a positive business impact. As an illustrative example, after a machine learning model has predicted that a customer is very likely to churn in the near future, the business might be interested in getting some remedial action recommendations which, if implemented, will prevent the churn.

In this post we will go through a solution for remedial action based on predictive Machine Learning (ML) and Genetic Algorithm (GA), using loan approval as an example. Following the rejection of a loan application by the ML model, the bank may be interested in a set of remedial action recommendations for the applicant, so that the negative outcome can be turned around to a positive one. The implementation is available in my OSS GitHub repo avenir.

Posted in AI, Data Science, Deep Learning, Machine Learning, Optimization, Python, PyTorch

Conformal Prediction for a Neural Regression Model

When a deployed machine learning model makes a prediction, should we accept the prediction at face value or question its reliability? For certain critical applications such as medicine and aviation, where some decision making follows the prediction, it may be too risky to accept a prediction unless there is high confidence associated with it.

How do you associate a confidence value with the model prediction? It’s tempting to use the predicted probability of a classifier as a measure of confidence. However, the predicted class probability is often poorly calibrated and is not a reliable measure of confidence. This is where conformal prediction enters the picture. It enables us to associate a confidence level with the model prediction. For a decision making system, e.g. deciding whether to treat a patient based on the model prediction for a chest X-ray, it’s critical to have a level of confidence associated with the model prediction.
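
For a regression model, the basic recipe can be sketched with split conformal prediction: compute absolute residuals on a held out calibration set, take a finite sample corrected quantile, and widen each new prediction by that amount. The excerpt does not say which conformal variant the post uses, so this variant, the function names and the calibration numbers are assumptions.

```python
import math

def conformal_quantile(cal_predictions, cal_targets, alpha):
    """Residual quantile giving roughly (1 - alpha) coverage on new data."""
    residuals = sorted(abs(y - p) for p, y in zip(cal_predictions, cal_targets))
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: interval is unbounded
        return float("inf")
    return residuals[rank - 1]

def prediction_interval(prediction, q):
    """Symmetric interval around a point prediction."""
    return prediction - q, prediction + q

# Hypothetical calibration set: the model predicts 5.0 every time, targets vary
offsets = (-1.0, 0.5, -0.25, 2.0, 1.5, -0.75, 0.1, -1.25, 0.3, -0.6)
cal_predictions = [5.0] * len(offsets)
cal_targets = [5.0 + d for d in offsets]

q = conformal_quantile(cal_predictions, cal_targets, alpha=0.1)
low, high = prediction_interval(7.0, q)
```

The coverage guarantee relies on the calibration data and future data being exchangeable, and the interval width is the same for every input in this simple form.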

Posted in AI, Machine Learning, Python

Machine Learning Model Performance Robustness Based on Local Neighborhood Performance

After training a machine learning model we generally test the model with a validation data set. We calculate accuracy or some other performance metric. This metric is global, i.e. based on the whole validation data set. How do you know how robust your model is? One measure of robustness is the standard deviation or confidence interval of the performance metric calculated for various local neighborhoods of the test data. For a robust model, the standard deviation or confidence interval of the local performance metric should be low.
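
One way to sketch this measure, assuming neighborhoods are defined by the k nearest validation samples in feature space, is to compute accuracy inside each neighborhood and then take the standard deviation across neighborhoods. How the post actually forms neighborhoods is not stated in this excerpt; the function names and toy data below are assumptions.

```python
import math
import statistics

def local_accuracies(features, correct, k):
    """Accuracy within the k nearest samples around each validation sample."""
    accs = []
    for center in features:
        nearest = sorted(range(len(features)),
                         key=lambda j: math.dist(center, features[j]))[:k]
        accs.append(sum(correct[j] for j in nearest) / k)
    return accs

def robustness(features, correct, k):
    """Mean and standard deviation of the local accuracies."""
    accs = local_accuracies(features, correct, k)
    return statistics.mean(accs), statistics.pstdev(accs)

# Toy 1-D validation set; 1 marks a correct prediction, 0 an incorrect one
features = [(float(i),) for i in range(8)]
even_model = [1] * 8                      # equally good everywhere
uneven_model = [1, 1, 1, 1, 0, 0, 0, 0]   # good in one region, bad in another

_, even_std = robustness(features, even_model, k=3)
_, uneven_std = robustness(features, uneven_model, k=3)
```

A model with a good global accuracy can still show a large spread here when its errors are concentrated in one region of the feature space.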

In this post, we will use a neural network model for loan approval and investigate the robustness of the model. The implementation can be found in my OSS GitHub repo avenir.

Posted in AI, Machine Learning, Performance, Python, PyTorch

Class Separation based Machine Learning Model Performance Metric

The output of a binary classifier is typically the predicted probability of some class. The real valued probability is converted to a binary value based on some probability threshold. For a well trained model, the predicted probability values should be clustered around 0 and 1. The class separation metric will have a large value in such cases, which is a desirable property of the model.

Posted in Data Science, Machine Learning, mlops

Deep Learning based Anomaly Detection for Data with Temporal and Spatial Correlation

Anomaly detection for data with both temporal and spatial correlation is a complex problem, and most of the solutions are based on deep learning. This post contains a high level overview of various deep learning based anomaly detection solutions for spatio-temporal data.

Posted in Anomaly Detection, Data Science, Deep Learning, Machine Learning