Concept Drift Detection Techniques with Python Implementation for Supervised Machine Learning Models


Concept drift is a serious problem for machine learning models deployed in production. It occurs when there is a significant change in the underlying data generating process, causing a significant shift in the posterior distribution p(y|x). Concept drift manifests as a significant increase in error rates for deployed models. To mitigate the risk, it is critical to monitor the performance of deployed models and detect any concept drift. If drift is not detected and a model retrained with recent data is not deployed, your model may become ineffective in production. One recent example of the detrimental effect of concept drift, as reported in the media, is the worsening performance of many deployed machine learning models as a result of significant changes in customer behavior due to the coronavirus pandemic.

In this post, we will go through some techniques for supervised concept drift detection. We will also go through a Python implementation of the algorithms, along with results from one of them, the Early Drift Detection Method (EDDM). The Python implementation is available in my open source GitHub repo for anomaly detection called beymani.

Concept Drift

The root cause of concept drift is non-stationarity of data, i.e., change in the statistical properties of data with the passage of time. The harsh reality is that for most real world problems, data is not stationary. In other words, the data your model encounters for prediction post deployment may be significantly different, from a statistical point of view, from the data that was used for training the model.

Concept drift can be of the following types with respect to supervised machine learning problems, i.e., classification and regression problems. A small simulation contrasting the three cases follows the list.

  • Change in p(y | x), i.e., the posterior distribution changes without a change in p(x). This is called real drift or concept drift. It signifies some fundamental change in the underlying data generating process, and model retraining is recommended.
  • Change in p(x) only is called virtual drift, also known as covariate drift or data drift. However, a change in p(x) may be accompanied by a change in p(y | x), if not globally then at least locally. If there is a significant change in the distribution tail, model retraining may be helpful. If the model is purely causal, p(y|x) will be independent of p(x) for the true underlying process.
  • Change in p(y) is called label drift. Often data drift will cause label drift. In some cases of label drift, model retraining is desirable. Suppose the training data was unbalanced and some corrections were made for the imbalance. Now if the data in production is more balanced, the model can be improved with retraining.
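
Here is a minimal simulation sketch contrasting the three cases on a one-dimensional feature; the distributions, threshold rule and shift sizes are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# virtual drift (data drift): p(x) shifts, the labeling rule p(y|x) is unchanged
x_before = rng.normal(0.0, 1.0, 1000)
x_after = rng.normal(1.5, 1.0, 1000)        # feature distribution has shifted
concept = lambda x: (x > 0.5).astype(int)   # same labeling rule in both periods

# real drift (concept drift): p(x) unchanged, the labeling rule shifts
x = rng.normal(0.0, 1.0, 1000)
y_old = (x > 0.5).astype(int)               # old concept
y_new = (x > -0.5).astype(int)              # decision boundary has moved

# label drift: p(y) shifts; here the concept change also changes p(y)
print("p(y=1) old:", y_old.mean(), "new:", y_new.mean())
```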

It should be emphasized that the model prediction is a function of the posterior distribution p(y | x) and not the distribution itself. For classification, the prediction is the mode of p(y | x), aka the MAP estimate. For regression it is the mean of the distribution. So a drift, i.e., a change in p(y | x), may not influence the prediction of the model.

Depending upon how concept change occurs with time, there are 4 types of concept drift as follows.

  • Abrupt. The concept changes within a short time. The transition from the current concept to the new one is abrupt. Also known as concept shift. It is the easiest kind of concept drift to detect.
  • Gradual. A new concept replaces the old concept gradually over time. There are intermediate concepts, which can be instances of either the old concept or the new one.
  • Incremental. An existing concept incrementally changes to a new concept. There are intermediate concepts defined by the old and the new concept that belong to neither.
  • Recurring. An old concept reoccurs after some time. The transitions could be any combination of the aforementioned 3 types.

In the real world, concept drift is more complex. The transition from the current concept to the next could be any combination of the 3 types of transitions described above. Gradual and incremental concept drifts are more difficult to detect, especially in the presence of noise.

In the production environment, there are multiple ways to respond to concept drift and adapt to the new environment, as follows.

  • Static model. Do nothing and keep using the same trained model, with the assumption that there is no concept drift. In most cases this is a naive approach.
  • Blindly update model. There is no proactive drift detection. Assuming concept drift is present, models are periodically retrained with recent data. Without drift detection in place, it is difficult to estimate the right interval for retraining and redeployment.
  • Update model. There is proactive drift detection. Only when drift is detected is a new model trained with recent data to replace the old model.
  • Training with weighted data. When a new model is trained, instead of discarding old training data, weight the data inversely proportional to its age.
  • Model ensemble. The static model is left as is. New models are trained to correct for the mistakes of the older models, as in boosting.
  • Incrementally update model. With a granular machine learning model like a decision tree, only the negatively impacted portion of the model could be retrained. Appropriate drift detection techniques should be able to identify local regions in the feature space negatively impacted by drift.
  • Online learning. As new data is absorbed, the model is continuously updated, and as a result it is always adapting to changes in the data distribution. Most machine learning algorithms work in batch mode, so this approach will work only with learning algorithms capable of incremental learning, one instance of data at a time.

We will be discussing multiple algorithms for supervised concept drift detection. All the algorithms calculate a statistic as a measure of drift, updating it as the prediction output is processed. Please refer to this paper for an excellent survey of various supervised and unsupervised concept drift detection algorithms. The paper also has the citations for all the algorithms discussed here.

In some cases the critical or threshold values of the statistic are provided. For other algorithms where they are not provided, an empirical distribution has to be created by Monte Carlo simulation for estimating p values, as follows (a sketch appears after the list).

  • Calculate the statistic according to the algorithm from past runs of the detector.
  • Make sure some of the runs involved concept drift, so that there will be enough data in the complete range of the statistic.
  • Use the collected values to create the empirical distribution.
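
A minimal sketch of this idea is below; the function name and the sample history values are mine, purely for illustration.

```python
import numpy as np

def empirical_p_value(stat_history, current_stat):
    """Estimate the p value of the current drift statistic from an
    empirical distribution built from past detector runs. The history
    should include runs with and without drift so the full range of
    the statistic is represented."""
    stats = np.asarray(stat_history)
    # fraction of past statistic values at least as extreme
    return float((stats >= current_stat).mean())

# hypothetical usage: statistic values collected from past runs
history = [0.21, 0.25, 0.27, 0.31, 0.45, 0.62, 0.70]
print(empirical_p_value(history, 0.65))   # small p value suggests drift
```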

For many of the algorithms, the statistic is calculated over all the data since drift last occurred. Assuming that the drift detector runs periodically, e.g., once a day, and not continuously, the statistic calculated by the algorithm is checkpointed when the run ends. Before the next run, the checkpointed statistic is restored, and it gets updated as a fresh set of predictions is processed.
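
Here is a minimal sketch of that checkpointing pattern, assuming a simple error-count statistic; the file name and the sample error stream are placeholders, and beymani handles this differently in its details.

```python
import json
import os

CHECKPOINT = "drift_stat.json"   # hypothetical checkpoint file path

def restore_state(default):
    # restore the checkpointed statistic before the run starts
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as fh:
            return json.load(fh)
    return default

def checkpoint_state(state):
    # save the statistic when the run ends
    with open(CHECKPOINT, "w") as fh:
        json.dump(state, fh)

# one run: restore, update with the day's 0/1 prediction errors, save
state = restore_state({"errors": 0, "count": 0})
todays_errors = [0, 1, 0, 0, 1, 0]   # placeholder for the day's error stream
for error in todays_errors:
    state["errors"] += error
    state["count"] += 1
checkpoint_state(state)
```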

All the algorithms are lagging indicators of drift, because drift is detected only after enough post-drift data has been processed.

Drift Detection Method (DDM)

This is one of the earliest and simplest methods. It is based on the prediction error rate. Each incoming data instance is treated as a Bernoulli trial variable indicating whether an error occurred in the model prediction, so the incoming data is a sequence with a binomial distribution.

The algorithm tracks the error probability (p) and the std deviation (s) of the binomial error estimate, and records the values pmin and smin at the point where (p + s) reaches its minimum. Drift is considered to be present when (p + s) exceeds the sum of the minimum error probability and a multiple of the minimum std deviation, i.e., when (p + s) is greater than (pmin + 3 * smin). The recommended multiplying factor is 3. A minimal sketch follows.
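
This is a minimal sketch of DDM over a 0/1 error stream fed one instance at a time; the class name and defaults are my own choices, not the beymani implementation.

```python
import math

class DDM:
    """Minimal sketch of the Drift Detection Method over a 0/1 error stream."""

    def __init__(self, drift_factor=3.0):
        self.drift_factor = drift_factor   # recommended multiplying factor is 3
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def add(self, error):
        """Process one prediction outcome (1 = error); return True on drift."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                  # running error probability
        s = math.sqrt(p * (1.0 - p) / self.n)     # std deviation of the estimate
        if p + s < self.p_min + self.s_min:       # track the minimum of (p + s)
            self.p_min, self.s_min = p, s
        # drift when (p + s) > (pmin + 3 * smin)
        return p + s > self.p_min + self.drift_factor * self.s_min
```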

Early Drift Detection Method (EDDM)

This is similar to DDM, except that it is considered better for gradual drift. It is based on the mean (m) and std deviation (s) of the distance between two consecutive errors.

It tracks when (m + 2 * s) reaches its maximum value and saves the values as mmax and smax. When the ratio (m + 2 * s) / (mmax + 2 * smax) drops below a threshold, drift is considered to have occurred. The recommended value of the threshold is 0.9. A minimal sketch follows.
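
A minimal sketch of EDDM, again over a 0/1 error stream; the running mean and variance of the error distances use Welford's update, and the warm-up handling is simplified compared to the published algorithm and the beymani version.

```python
import math

class EDDM:
    """Minimal sketch of the Early Drift Detection Method. It tracks the
    distance (number of instances) between consecutive errors."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.num_errors = 0
        self.last_error_at = 0
        self.position = 0
        self.mean = 0.0        # running mean of error distances
        self.var_acc = 0.0     # Welford variance accumulator
        self.max_level = 0.0   # maximum of (m + 2 * s)

    def add(self, error):
        """Process one prediction outcome (1 = error); return True on drift."""
        self.position += 1
        if not error:
            return False
        distance = self.position - self.last_error_at
        self.last_error_at = self.position
        self.num_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.num_errors
        self.var_acc += delta * (distance - self.mean)
        std = math.sqrt(self.var_acc / self.num_errors) if self.num_errors > 1 else 0.0
        level = self.mean + 2.0 * std
        self.max_level = max(self.max_level, level)
        # drift when (m + 2s) / (mmax + 2smax) drops below the threshold
        return self.max_level > 0 and level / self.max_level < self.threshold
```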

Fast Hoeffding Drift Detection Method (FHDDM)

With a sliding window, the probability of correct prediction is calculated, while the maximum probability value is tracked.

When the correct prediction probability drops below the maximum and the difference in probabilities exceeds a threshold defined by the Hoeffding inequality, drift is considered to have occurred. The threshold is sqrt(ln(1 / δ) / (2 * n)), where δ is a low probability value, e.g., 0.2, and n is the window size. A minimal sketch follows.
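
A minimal sketch of FHDDM, assuming a stream of 1 (correct) / 0 (error) prediction outcomes; the parameter defaults are illustrative.

```python
import math
from collections import deque

class FHDDM:
    """Minimal sketch of the Fast Hoeffding Drift Detection Method."""

    def __init__(self, window_size=100, delta=0.2):
        self.window = deque(maxlen=window_size)
        self.n = window_size
        # Hoeffding bound: eps = sqrt(ln(1/delta) / (2 * n))
        self.eps = math.sqrt(math.log(1.0 / delta) / (2.0 * window_size))
        self.p_max = 0.0

    def add(self, correct):
        """Process one outcome (1 = correct prediction); return True on drift."""
        self.window.append(correct)
        if len(self.window) < self.n:
            return False
        p = sum(self.window) / self.n       # probability of correct prediction
        self.p_max = max(self.p_max, p)
        # drift when the correct rate drops eps below its maximum
        return self.p_max - p > self.eps
```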

Paired Learner (PL)

A stable learner is trained on all data and a recent learner is trained on recent data only. A counter is incremented when the stable learner makes an error in prediction but the recent learner does not, and decremented in the opposite case. When the count rises above a threshold, drift is considered to have occurred. A minimal sketch follows.
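
A minimal sketch of the counting logic, assuming sklearn-style fitted models with a predict method; the sliding-window retraining of the recent learner is assumed to happen outside this class, and the threshold value is an assumption.

```python
class PairedLearner:
    """Minimal sketch of the Paired Learner drift detection scheme."""

    def __init__(self, stable, recent, threshold=10):
        self.stable = stable      # model trained on all data
        self.recent = recent      # model trained on recent data only
        self.threshold = threshold
        self.count = 0

    def add(self, x, y):
        """Process one labeled instance; return True on drift."""
        stable_wrong = self.stable.predict([x])[0] != y
        recent_wrong = self.recent.predict([x])[0] != y
        if stable_wrong and not recent_wrong:
            self.count += 1       # recent learner captures the new concept
        elif recent_wrong and not stable_wrong:
            self.count = max(0, self.count - 1)
        return self.count > self.threshold
```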

This technique is computationally expensive, as you have to keep training newer models.

Shuffling and Resampling (SR)

The data set is split into train and test sets, with the split located where drift is assumed to occur. A model is trained on the training data and the error rate on the test set is calculated. Then the data is shuffled, split into train and test sets, a model trained and the error calculated on the test set. The average error rate is calculated over multiple such shuffles. If the difference between the ordered data error rate and the average shuffled data error rate is above a threshold, then drift has occurred. A minimal sketch follows.
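
A minimal sketch, assuming X and y are numpy arrays and train_and_error is a user-supplied function that fits a model and returns the test error rate; the number of shuffles and the threshold are assumptions.

```python
import numpy as np

def sr_drift_test(X, y, split, train_and_error, num_shuffles=20, threshold=0.05):
    """Minimal sketch of the shuffling and resampling drift test. split is
    the index where drift is hypothesized to occur."""
    # error rate with the data in time order, split at the drift point
    ordered_err = train_and_error(X[:split], y[:split], X[split:], y[split:])
    rng = np.random.default_rng(0)
    shuffled_errs = []
    for _ in range(num_shuffles):
        idx = rng.permutation(len(y))       # destroy the time ordering
        Xs, ys = X[idx], y[idx]
        shuffled_errs.append(
            train_and_error(Xs[:split], ys[:split], Xs[split:], ys[split:]))
    # drift if the ordered split is worse than shuffled splits by a margin
    return ordered_err - float(np.mean(shuffled_errs)) > threshold
```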

This technique is also computationally expensive, as multiple models need to be trained for every point in time where drift may have occurred.

Exponentially Weighted Moving Average Concept Drift Detection (ECDD)

An exponentially weighted moving average (EWMA) forecast is used. The forecast mean and standard deviation are calculated on a continuous basis. When the forecast exceeds the sum of the mean and some multiple of the std deviation, drift is considered to have occurred. For classification problems, the EWMA forecast is made for the error stream treated as Bernoulli variables. A minimal sketch follows.
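
A minimal sketch over a 0/1 error stream; the weight lam and the control limit multiple are assumed values, whereas the published method picks the limit from tables for a target false positive rate.

```python
import math

class ECDD:
    """Minimal sketch of EWMA based drift detection over a 0/1 error stream."""

    def __init__(self, lam=0.2, limit=3.0):
        self.lam = lam        # EWMA weight
        self.limit = limit    # multiple of the std deviation
        self.n = 0
        self.p = 0.0          # overall error rate estimate
        self.z = 0.0          # EWMA of the error stream

    def add(self, error):
        """Process one prediction outcome (1 = error); return True on drift."""
        self.n += 1
        self.p += (error - self.p) / self.n
        self.z = (1.0 - self.lam) * self.z + self.lam * error
        # variance of the EWMA statistic for a Bernoulli stream
        var = (self.p * (1.0 - self.p) * self.lam / (2.0 - self.lam)
               * (1.0 - (1.0 - self.lam) ** (2 * self.n)))
        return self.n > 1 and self.z > self.p + self.limit * math.sqrt(var)
```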

Response Distribution (RD)

Even when feedback after prediction is not available, the prediction output, i.e., class probability for classification or a real number for regression, can be used for detecting concept drift. It is very unlikely for p(y|x) to change without a change in p(y). For example, for classification, the class probability distribution is typically bimodal, with two peaks close to probabilities 1 and 0 for a certain class label.

If this response distribution significantly deviates from the corresponding distribution based on the validation data from when the model was trained, then there is a strong possibility of concept drift. The deviation could be based on KL divergence or the absolute difference between the 2 distributions. A minimal sketch follows.
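
This is a minimal sketch using KL divergence over histograms of predicted class probabilities; the bin count, smoothing constant and any threshold on the divergence are assumptions.

```python
import numpy as np

def response_distribution_shift(val_probs, prod_probs, bins=20, eps=1e-6):
    """Minimal sketch: KL divergence between the validation-time and
    production-time distributions of predicted class probabilities."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(val_probs, bins=edges)
    q, _ = np.histogram(prod_probs, bins=edges)
    p = (p + eps) / (p.sum() + eps * bins)   # smooth to avoid zero bins
    q = (q + eps) / (q.sum() + eps * bins)
    return float(np.sum(p * np.log(p / q)))

# hypothetical usage: a large divergence suggests drift
# kl = response_distribution_shift(val_probs, prod_probs)
# drift = kl > 0.1
```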

Feature Distribution

This is another technique, based on the features only, for drift detection in the absence of response feedback. A change in p(x) will generally be accompanied by a change in p(y|x).

Any multivariate unsupervised drift detection technique can be used to detect a shift in p(x), as in the sketch below.
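
As a simple stand-in for a full multivariate detector, here is a per-feature two-sample Kolmogorov-Smirnov test between training and production data, assuming both are numpy arrays; the significance level is an assumption.

```python
from scipy.stats import ks_2samp

def feature_drift(X_train, X_prod, alpha=0.01):
    """Minimal sketch: per-feature two-sample KS test for a shift in p(x);
    a genuinely multivariate detector could be substituted."""
    drifted = []
    for j in range(X_train.shape[1]):
        stat, p_value = ks_2samp(X_train[:, j], X_prod[:, j])
        if p_value < alpha:
            drifted.append(j)     # feature j shows a significant shift
    return drifted
```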

Results for EDDM

Some of the drift detection algorithms discussed above are implemented in a Python class; we will use EDDM. The driver code for using the class is also available, and the prediction data is simulated with Python code.

It is run twice. In the first run there is no drift, and the calculated statistic at the end of the run is checkpointed. For every run, the checkpointed statistic is restored first and saved at the end. Drift is present in the prediction data for the second run. Please refer to the tutorial document for details. The result shows drift halfway through the prediction data.
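
As an illustration of the two-run flow, here is a hypothetical driver built on the EDDM and checkpointing sketches from earlier in this post; the checkpoint file name and error rates are made up, and the actual beymani driver differs in its details.

```python
# hypothetical two-run flow reusing the EDDM class sketched earlier
import pickle
import numpy as np

rng = np.random.default_rng(1)

def run(detector, errors):
    for i, e in enumerate(errors):
        if detector.add(int(e)):
            print("drift detected at instance", i)
            break
    return detector

# first run: no drift, steady 10% error rate; checkpoint at the end
det = run(EDDM(), rng.random(1000) < 0.1)
with open("eddm.ckpt", "wb") as fh:
    pickle.dump(det, fh)

# second run: restore the checkpoint; error rate jumps to 30% halfway through
with open("eddm.ckpt", "rb") as fh:
    det = pickle.load(fh)
errors = np.concatenate([rng.random(500) < 0.1, rng.random(500) < 0.3])
run(det, errors)
```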

Drift for Regression Model

All the algorithms discussed so far are for drift detection in classification models, although according to the title of this post the topic is drift detection for supervised models in general. Unlike classification, there is a paucity of solutions in the area of drift detection for regression models.

One solution that is readily available is to take the regression error, which is a real number, and apply an unsupervised drift detection technique to the error data, as in the sketch below. In a future post I will cover unsupervised drift detection techniques.
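
A minimal sketch of that idea, using a two-sample KS test between a reference window and a recent window of regression errors; the test choice and significance level are assumptions, and any unsupervised detector could be used instead.

```python
from scipy.stats import ks_2samp

def regression_error_drift(ref_errors, recent_errors, alpha=0.01):
    """Minimal sketch: treat regression errors (y - y_hat) as a real valued
    stream and test whether their recent distribution has shifted."""
    stat, p_value = ks_2samp(ref_errors, recent_errors)
    return p_value < alpha   # True suggests drift in the error distribution
```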

Ensemble and Hierarchy of Drift Detectors

For any of the ensemble and hierarchy based drift detectors mentioned in the survey paper, the aggregation functions provided in the Python implementation can be used. The ensemble and hierarchy based drift detection algorithms are as follows.

  • Linear Four Rates (LFR)
  • Selective Detector Ensemble (eDetector)
  • Drift Detection Ensemble (DDE)
  • Hierarchical Linear Four Rates (HLFR)
  • Hierarchical Hypothesis Testing (Request and Reverify)

For ensemble based detectors, a consensus level is specified, e.g., any, all or majority. For hierarchical detectors, only after drift is detected by a detector at some level are the next level detectors used to validate the result. The individual detectors are based on any of the drift detection algorithms mentioned. A minimal aggregation sketch follows.
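
Here is a minimal sketch of consensus aggregation over per-detector boolean drift decisions; the function name and consensus labels are mine, mirroring the levels described above.

```python
def aggregate_detections(detections, consensus="majority"):
    """Minimal sketch: combine boolean drift decisions from an ensemble
    of detectors according to the specified consensus level."""
    if consensus == "any":
        return any(detections)
    if consensus == "all":
        return all(detections)
    return sum(detections) > len(detections) / 2   # majority vote
```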

Wrapping Up

All the algorithms discussed are for supervised drift detection, i.e., actual feedback for the prediction is available and gets used along with the prediction to generate error data. In many cases the feedback may not be available, or if available it may arrive too late to be useful.

For those cases of missing or late feedback, you could use unsupervised drift detection techniques to detect change in the feature distribution p(x). A change in p(x) will generally be accompanied by a change in the posterior distribution p(y | x). As mentioned earlier, a drift in p(y | x) does not necessarily translate to a change in prediction, whether for classification or regression.

In a future post I will cover unsupervised drift detection, which is for unsupervised models. In this area there is a wider range of solutions available.

