Handling Rare Events and Class Imbalance in Predictive Modeling for Machine Failure


Most supervised Machine Learning algorithms face difficulty when there is class imbalance in the training data i.e., amount of data belonging one class heavily outnumber the other class. However, there are may real life problems where we encounter this situation e.g., fraud, customer churn and machine failure. There are various techniques to address this thorny problem of class imbalance.

In this post we will go over a technique based on oversampling o the minority class data called Synthetic Minority Over-sampling Technique (SMOTE). We will go into the details of a Hadoop based implementation using machine failure data as the use case.

The implementation can be found in my OSS projects avenir. The ETL map reduce  jobs necessary for pre processing the data before running the SMOTE algorithm come from another OSS project of mine chombo.

Class Imbalance

Most classification algorithms are adversely affected by class imbalance, with some exceptions.Decision trees are relatively more tolerant of class imbalance. Most supervised learning algorithms can tolerate small amount of class imbalance. When the imbalance is excessive, some corrective action is necessary.

As outlined in the article cited above and this reddit post, here are the techniques to handle class imbalance.

  1. Collect more data
  2. Use different performance metric
  3. Resample data set
  4. Generate synthetic data (SMOTE)
  5. Adaptive synthetic over sampling (Borderline-SMOTE and ADA-SYN)
  6. Sampling with data cleaning techniques
  7. Cluster-based sampling method (cluster-based oversampling – CBO)
  8. Integration of sampling and boosting (SMOTEBoost, DataBoost-IM)
  9. Over / under sampling with jittering (JOUS-Boost)
  10. Cost-Sensitive Dataspace Weighting with Adaptive Boosting (AdaC1, AdaC2, and AdaC3)
  11. Cost-Sensitive Decision Trees
  12. Cost-Sensitive Neural Networks
  13. Cost-Sensitive Bayesian Classifiers
  14. Cost-Sensitive SVMs
  15. Auto associator
  16. The Mahalanobis-Taguchi System (MTS)
  17. Use different classification algorithm
  18. Use cost based performance metric.
  19. Use different perspective e.g. anomaly detection
  20. Be more creative e.g. one class classifier

Most of the techniques follow a sampling based or miss-classification cost based approach. Some of them classification algorithms have been extended to account for oversampling. So those techniques are classification algorithm specific. We are going to use the 4th technique in the list above

Synthetic Minority Over Sampling Technique

With over sampling based approaches, we generate additional data for the minority class until we the 2 classes are equally represented in the data set. The steps of the SMOTE algorithm are as follows

  1. For each minority class record, find nearest neighbors with same class
  2. Randomly sample to select a neighbor
  3. For each field, interpolate between the source record and neighbor based on a random interpolation point  between the 2 values from the 2 records
  4. Repeat steps 2 and 3 depending on the amount of imbalance

The SMOTE algorithm introduces randomness in generating the synthetic data in 2 ways as follows.

  1. Random selection among k nearest neighbors
  2. Random field wise interpolation between source record and selected neighboring records

Before running the algorithm, we need to perform these 4 additional pre-processing steps. The second step is an optimization step to cut down the amount of processing.

  1. Normalize data
  2. Remove majority class records
  3. Find distance between each pair of records
  4. Find the nearest neighbors for minority class records

Implementation of the solution involves 5 map reduce jobs chained together. The output of one goes as input to the next Map Reduce job. Next we will discuss them in details.

Machine Failure Data

Our use case involves failure data for some machine e.g motor. The data has an ID field, 8 feature attribute fields and 1 class attribute field as follows

  1. ID
  2. age
  3. time since last maintenance
  4. number of break downs
  5. first spectral frequency
  6. first spectral frequency amplitude
  7. second spectral frequency
  8. second spectral frequency amplitude
  9. failure status

The last field is the class attribute. The majority class will correspond to a machine not having failed. The majority class constitutes about 95% of the data in the data set we will be using.

The data is collected for multiple machines of same type. The data could be generated every 24 hours. The whole training data set may contain 1 month’s data. A predictive model can be built with training data over a 1 month period. Data collected over a 24 hours period could be used to predict any impending failure of any machine, using the predictive model.

There are 4 fields related to frequency spectral density. When operating under normal conditions, rotating machinery exhibit a frequency spectral density of vibration amplitudes.

When a failure in impending, the frequency spectral density generally undergoes a change. The frequency spectral density is characterized in the data with the two highest frequency components.

Normalization

Numerical fields in the data may have different scales and the data may have categorical data also. For accurate distance calculation, the different fields should have uniform influence on the distance.  To enforce this,  we need to normalize the numerical fields.

Normalization is performed by the Map Reduce class Normalizer. Different normalization techniques are supported

  1. min max
  2. zscore
  3. center
  4. unit sum

We have use  zscore. With zscore,  optionally we could eliminate outliers also. We have not chosen that option. Here is some sample output data after normalization

02VVMR5389V9,0.509,-0.069,-0.174,-0.171,0.290,-0.169,0.132,-1
03959MAKKQG0,-1.036,0.345,-0.174,-0.227,-0.602,-0.200,0.124,-1
049L6715S91K,0.538,-0.210,-0.174,-0.225,0.541,-0.235,-0.386,-1
04B76LFIBG5C,0.772,-0.391,-0.174,-0.124,-0.298,-0.188,-0.084,-1
05I3CLC3BWV4,-0.418,0.401,-0.174,-0.210,-0.406,-0.211,-0.534,-1
05UHQNN3E188,0.289,0.023,-0.174,-0.085,0.013,-0.201,0.188,-1

Filtering Out Majority Class Records

Since we don’t need the majority class records any more, in this step we filter them out using a Map Reduce job Projection. Filtering out the majority class records will drastically reduce the amount of computation necessary, especially for inter record distance calculation which is the next step.

Projection is essentially a Map Reduce implementation of SQL select query. Fields to be projected and the where clause expressions are provided through configuration parameters. Here is some sample output

4I73YYBTP6O9,-0.378,-0.339,-0.174,1.429,0.365,1.328,-0.689,1
586F3XKT7C50,-0.915,0.064,-0.174,1.513,-0.371,1.337,0.374,1
5EY01F8R6923,1.169,1.014,-0.174,1.348,-0.007,1.439,-0.768,1
5L050AM3UT37,0.180,-0.674,-0.174,1.321,0.714,1.315,0.714,1
5P582FH4B361,0.479,0.663,-0.174,1.374,-0.605,1.380,-0.142,1
67NA1AF891PK,-0.358,-0.219,-0.174,1.395,0.551,1.331,0.095,1
6A9T8TE7J4MR,0.712,0.797,-0.174,1.219,0.124,1.367,0.777,1

Inter Record Distance

This is a computationally intensive task since it calculates distance between all possible record pairs. All the field meta data is provided through a JSON files. Another JSON file provides various distance calculation related parameters.

The implementation is in the Map Reduce class RecordSimilarity. Data is hashed into buckets and the buckets are then Cartesian joined. Number of reducer calls is b2, where b is the number of buckets, which is a configurable parameter.

It supports the following field types. Weights could be assigned to different fields to control influence of different fields in the final distance. For text fields, various algorithms are supported.

  1. numeric
  2. categorical
  3. text
  4. geo location.

Following algorithms  are supported for distance calculation. Each of these could be applied to a subset of fields and then the results could be further aggregated for the whole record. Or we could apply one algorithm for all the fields.

  1. eulidean
  2. manhattan
  3. minkowski
  4. categorical

More details on distance calculation can be found in an earlier post. Here is some sample output. Each line contains 2 IDs, 2 records and distance between the records, optionally scaled.

N43AB51QH6VZ,XJY0DNT468M1,N43AB51QH6VZ,-0.138,-0.545,-0.174,1.354,-1.298,1.417,-0.115,1,XJY0DNT468M1,0.031,-0.063,-0.174,1.422,0.041,1.330,0.712,1,236
N43AB51QH6VZ,AJ5DG8AMYT0C,N43AB51QH6VZ,-0.138,-0.545,-0.174,1.354,-1.298,1.417,-0.115,1,AJ5DG8AMYT0C,-0.468,-0.178,-0.174,1.270,0.799,1.332,-0.284,1,309
N43AB51QH6VZ,6A9T8TE7J4MR,N43AB51QH6VZ,-0.138,-0.545,-0.174,1.354,-1.298,1.417,-0.115,1,6A9T8TE7J4MR,0.712,0.797,-0.174,1.219,0.124,1.367,0.777,1,330
FV1A44G7O273,4I73YYBTP6O9,FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,4I73YYBTP6O9,-0.378,-0.339,-0.174,1.429,0.365,1.328,-0.689,1,198
FV1A44G7O273,2TBHM907133H,FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,2TBHM907133H,0.018,0.359,1.725,1.263,0.237,1.383,0.830,1,268
FV1A44G7O273,XJY0DNT468M1,FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,XJY0DNT468M1,0.031,-0.063,-0.174,1.422,0.041,1.330,0.712,1,232

Nearest Neighbors

In this step we find the nearest neighbors for records. The Map Reduce class implementation is TopMatchesByClass. There two options for selecting nearest neighbors as below

  1. Nearest by count
  2. Nearest by distance.

If nearest count option is chosen, the number of neighbors needs to be specified. With the nearest by distance option, the distance needs to be specified. In this case the number of neighbors selected will be variable. Here is some sample output, Each line contains a record followed by k nearest records, first one being the closest.

FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,P18202UL2887,0.345,0.193,0.775,1.305,-0.044,1.369,-0.034,1,OE6F3P5Z62LI,0.828,0.049,0.775,1.233,-0.238,1.362,-0.139,1,S0UQVER64O67,0.305,0.230,1.725,1.338,-0.235,1.370,-0.207,1,T3WZNFQXAE94,0.482,0.221,-0.174,1.433,-0.130,1.259,-0.481,1
G1B5YC6O3D6T,0.279,0.174,1.725,1.280,0.136,1.298,-0.294,1,S0UQVER64O67,0.305,0.230,1.725,1.338,-0.235,1.370,-0.207,1,P18202UL2887,0.345,0.193,0.775,1.305,-0.044,1.369,-0.034,1,2TBHM907133H,0.018,0.359,1.725,1.263,0.237,1.383,0.830,1,OE6F3P5Z62LI,0.828,0.049,0.775,1.233,-0.238,1.362,-0.139,1
G7XT56FX5SI6,-0.307,0.566,-0.174,1.250,-0.522,1.384,0.298,1,M2X590PETV5W,-0.507,0.551,-0.174,1.302,-0.235,1.303,0.303,1,16OHRFGSD022,0.127,0.787,-0.174,1.468,-0.258,1.329,0.359,1,9LKXXF504M38,-0.351,0.970,-0.174,1.271,0.021,1.344,0.282,1,XBCQ6830PHKA,-0.142,0.570,-0.174,1.404,-0.034,1.277,-0.142,1
I3T24R45E75N,-0.622,1.202,0.775,1.373,0.146,1.325,-0.097,1,9LKXXF504M38,-0.351,0.970,-0.174,1.271,0.021,1.344,0.282,1,COW5AHO53PUI,-0.149,1.242,-0.174,1.425,-0.205,1.336,-0.060,1,XBCQ6830PHKA,-0.142,0.570,-0.174,1.404,-0.034,1.277,-0.142,1,M2X590PETV5W,-0.507,0.551,-0.174,1.302,-0.235,1.303,0.303,1

Choosing the correct neighborhood size for synthetic sample generation can be tricky. If it’s too small, the generated samples may not provide enough coverage through the feature space.

If the neighborhood too large, depending on the complexity of the distribution of the records belonging to the 2 classes in the feature space,  there is the risk of creating a sample in regions with  predominantly majority class records.

Synthetic Minority Class Sample Generation

This is the last step of the workflow, which generates the synthetic minority class samples. The algorithm was described earlier.The Map reduce implementation is in the ClassBasedOverSampler class.

Distribution of the neighbors for sampling purpose by default is uniform. It also supports exponential distribution of the neighbors. With exponential distribution, neighbors that are closer will have higher probability. So they will tend to be sampled more. With exponential distribution a rejection samling based technique is used. We have used the uniform distribution option.

One of the parameters that need to be specified is the multiplier for oversampling. it can be calculated as below

m = round(p/q) – 1   where
m = multiplier
p = number of majority class records
q = number of minority class records

Here is some sample output. All the feature attribute values are generated by interpolation. Class attribute value remains as is. The ID is generated from the ID of the source record and the neighbor record.

00CAR5W29RTP,0.734,-0.480,-0.174,1.294,0.243,1.323,0.403,1
9Y700493NCV8,0.333,-0.401,-0.174,1.449,0.193,1.310,0.534,1
69ZD949KIPJN,0.910,-0.471,-0.174,1.351,0.304,1.280,0.586,1
0049IRC1WU91,0.659,-0.382,-0.174,1.352,-0.272,1.275,0.608,1
7IK66P93MJNC,0.985,-0.415,-0.174,1.354,0.296,1.308,0.441,1
00493CB97Y9J,0.984,-0.616,-0.174,1.353,0.218,1.290,0.518,1
004D9RJNCB69,0.853,-0.463,-0.174,1.354,0.272,1.310,0.485,1

When the output generated from this step is combined with the original data set we get a class balanced data set.

Pitfalls

Based on available data we are generating additional data. It’s critical that additional data generated does not change the underlying distribution of the data significantly. This may happen if the class boundaries are complex and there are multiple mutually exclusive regions in the feature space belong to some class.

To alleviate this problem, the neighborhood size for nearest neighbors should selected carefully. Other wise you end up generating minority class record in  a region of majority class in the feature space.

Final Words

This is yet another example where you set out to build a predictive model and immediately get swamped with burgeoning list of  data munging tasks, before you can even start.

My goal was to build an SVM based predictive model for machine failures and I immediately realized the data was highly unbalanced. That triggered a chain of ETL data munging tasks to get the data ready.

There are many techniques to combat class imbalance problems. We have used an oversampling based technique called SMOTE. We have gone through the details Map Reduce workflow using machine failure data as an use case. Details of the execution steps can be found in this tutorial document.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Data Science, ETL, Hadoop and Map Reduce and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s