Most supervised Machine Learning algorithms face difficulty when there is class imbalance in the training data i.e., amount of data belonging one class heavily outnumber the other class. However, there are may real life problems where we encounter this situation e.g., fraud, customer churn and machine failure. There are various techniques to address this thorny problem of class imbalance.
In this post we will go over a technique based on oversampling o the minority class data called Synthetic Minority Over-sampling Technique (SMOTE). We will go into the details of a Hadoop based implementation using machine failure data as the use case.
The implementation can be found in my OSS projects avenir. The ETL map reduce jobs necessary for pre processing the data before running the SMOTE algorithm come from another OSS project of mine chombo.
Most classification algorithms are adversely affected by class imbalance, with some exceptions.Decision trees are relatively more tolerant of class imbalance. Most supervised learning algorithms can tolerate small amount of class imbalance. When the imbalance is excessive, some corrective action is necessary.
As outlined in the article cited above and this reddit post, here are the techniques to handle class imbalance.
- Collect more data
- Use different performance metric
- Resample data set
- Generate synthetic data (SMOTE)
- Adaptive synthetic over sampling (Borderline-SMOTE and ADA-SYN)
- Sampling with data cleaning techniques
- Cluster-based sampling method (cluster-based oversampling – CBO)
- Integration of sampling and boosting (SMOTEBoost, DataBoost-IM)
- Over / under sampling with jittering (JOUS-Boost)
- Cost-Sensitive Dataspace Weighting with Adaptive Boosting (AdaC1, AdaC2, and AdaC3)
- Cost-Sensitive Decision Trees
- Cost-Sensitive Neural Networks
- Cost-Sensitive Bayesian Classifiers
- Cost-Sensitive SVMs
- Auto associator
- The Mahalanobis-Taguchi System (MTS)
- Use different classification algorithm
- Use cost based performance metric.
- Use different perspective e.g. anomaly detection
- Be more creative e.g. one class classifier
Most of the techniques follow a sampling based or miss-classification cost based approach. Some of them classification algorithms have been extended to account for oversampling. So those techniques are classification algorithm specific. We are going to use the 4th technique in the list above
Synthetic Minority Over Sampling Technique
With over sampling based approaches, we generate additional data for the minority class until we the 2 classes are equally represented in the data set. The steps of the SMOTE algorithm are as follows
- For each minority class record, find nearest neighbors with same class
- Randomly sample to select a neighbor
- For each field, interpolate between the source record and neighbor based on a random interpolation point between the 2 values from the 2 records
- Repeat steps 2 and 3 depending on the amount of imbalance
The SMOTE algorithm introduces randomness in generating the synthetic data in 2 ways as follows.
- Random selection among k nearest neighbors
- Random field wise interpolation between source record and selected neighboring records
Before running the algorithm, we need to perform these 4 additional pre-processing steps. The second step is an optimization step to cut down the amount of processing.
- Normalize data
- Remove majority class records
- Find distance between each pair of records
- Find the nearest neighbors for minority class records
Implementation of the solution involves 5 map reduce jobs chained together. The output of one goes as input to the next Map Reduce job. Next we will discuss them in details.
Machine Failure Data
Our use case involves failure data for some machine e.g motor. The data has an ID field, 8 feature attribute fields and 1 class attribute field as follows
- time since last maintenance
- number of break downs
- first spectral frequency
- first spectral frequency amplitude
- second spectral frequency
- second spectral frequency amplitude
- failure status
The last field is the class attribute. The majority class will correspond to a machine not having failed. The majority class constitutes about 95% of the data in the data set we will be using.
The data is collected for multiple machines of same type. The data could be generated every 24 hours. The whole training data set may contain 1 month’s data. A predictive model can be built with training data over a 1 month period. Data collected over a 24 hours period could be used to predict any impending failure of any machine, using the predictive model.
There are 4 fields related to frequency spectral density. When operating under normal conditions, rotating machinery exhibit a frequency spectral density of vibration amplitudes.
When a failure in impending, the frequency spectral density generally undergoes a change. The frequency spectral density is characterized in the data with the two highest frequency components.
Numerical fields in the data may have different scales and the data may have categorical data also. For accurate distance calculation, the different fields should have uniform influence on the distance. To enforce this, we need to normalize the numerical fields.
Normalization is performed by the Map Reduce class Normalizer. Different normalization techniques are supported
- min max
- unit sum
We have use zscore. With zscore, optionally we could eliminate outliers also. We have not chosen that option. Here is some sample output data after normalization
02VVMR5389V9,0.509,-0.069,-0.174,-0.171,0.290,-0.169,0.132,-1 03959MAKKQG0,-1.036,0.345,-0.174,-0.227,-0.602,-0.200,0.124,-1 049L6715S91K,0.538,-0.210,-0.174,-0.225,0.541,-0.235,-0.386,-1 04B76LFIBG5C,0.772,-0.391,-0.174,-0.124,-0.298,-0.188,-0.084,-1 05I3CLC3BWV4,-0.418,0.401,-0.174,-0.210,-0.406,-0.211,-0.534,-1 05UHQNN3E188,0.289,0.023,-0.174,-0.085,0.013,-0.201,0.188,-1
Filtering Out Majority Class Records
Since we don’t need the majority class records any more, in this step we filter them out using a Map Reduce job Projection. Filtering out the majority class records will drastically reduce the amount of computation necessary, especially for inter record distance calculation which is the next step.
Projection is essentially a Map Reduce implementation of SQL select query. Fields to be projected and the where clause expressions are provided through configuration parameters. Here is some sample output
4I73YYBTP6O9,-0.378,-0.339,-0.174,1.429,0.365,1.328,-0.689,1 586F3XKT7C50,-0.915,0.064,-0.174,1.513,-0.371,1.337,0.374,1 5EY01F8R6923,1.169,1.014,-0.174,1.348,-0.007,1.439,-0.768,1 5L050AM3UT37,0.180,-0.674,-0.174,1.321,0.714,1.315,0.714,1 5P582FH4B361,0.479,0.663,-0.174,1.374,-0.605,1.380,-0.142,1 67NA1AF891PK,-0.358,-0.219,-0.174,1.395,0.551,1.331,0.095,1 6A9T8TE7J4MR,0.712,0.797,-0.174,1.219,0.124,1.367,0.777,1
Inter Record Distance
This is a computationally intensive task since it calculates distance between all possible record pairs. All the field meta data is provided through a JSON files. Another JSON file provides various distance calculation related parameters.
The implementation is in the Map Reduce class RecordSimilarity. Data is hashed into buckets and the buckets are then Cartesian joined. Number of reducer calls is b2, where b is the number of buckets, which is a configurable parameter.
It supports the following field types. Weights could be assigned to different fields to control influence of different fields in the final distance. For text fields, various algorithms are supported.
- geo location.
Following algorithms are supported for distance calculation. Each of these could be applied to a subset of fields and then the results could be further aggregated for the whole record. Or we could apply one algorithm for all the fields.
More details on distance calculation can be found in an earlier post. Here is some sample output. Each line contains 2 IDs, 2 records and distance between the records, optionally scaled.
N43AB51QH6VZ,XJY0DNT468M1,N43AB51QH6VZ,-0.138,-0.545,-0.174,1.354,-1.298,1.417,-0.115,1,XJY0DNT468M1,0.031,-0.063,-0.174,1.422,0.041,1.330,0.712,1,236 N43AB51QH6VZ,AJ5DG8AMYT0C,N43AB51QH6VZ,-0.138,-0.545,-0.174,1.354,-1.298,1.417,-0.115,1,AJ5DG8AMYT0C,-0.468,-0.178,-0.174,1.270,0.799,1.332,-0.284,1,309 N43AB51QH6VZ,6A9T8TE7J4MR,N43AB51QH6VZ,-0.138,-0.545,-0.174,1.354,-1.298,1.417,-0.115,1,6A9T8TE7J4MR,0.712,0.797,-0.174,1.219,0.124,1.367,0.777,1,330 FV1A44G7O273,4I73YYBTP6O9,FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,4I73YYBTP6O9,-0.378,-0.339,-0.174,1.429,0.365,1.328,-0.689,1,198 FV1A44G7O273,2TBHM907133H,FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,2TBHM907133H,0.018,0.359,1.725,1.263,0.237,1.383,0.830,1,268 FV1A44G7O273,XJY0DNT468M1,FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,XJY0DNT468M1,0.031,-0.063,-0.174,1.422,0.041,1.330,0.712,1,232
In this step we find the nearest neighbors for records. The Map Reduce class implementation is TopMatchesByClass. There two options for selecting nearest neighbors as below
- Nearest by count
- Nearest by distance.
If nearest count option is chosen, the number of neighbors needs to be specified. With the nearest by distance option, the distance needs to be specified. In this case the number of neighbors selected will be variable. Here is some sample output, Each line contains a record followed by k nearest records, first one being the closest.
FV1A44G7O273,0.133,-0.302,0.775,1.237,-0.459,1.348,-0.468,1,P18202UL2887,0.345,0.193,0.775,1.305,-0.044,1.369,-0.034,1,OE6F3P5Z62LI,0.828,0.049,0.775,1.233,-0.238,1.362,-0.139,1,S0UQVER64O67,0.305,0.230,1.725,1.338,-0.235,1.370,-0.207,1,T3WZNFQXAE94,0.482,0.221,-0.174,1.433,-0.130,1.259,-0.481,1 G1B5YC6O3D6T,0.279,0.174,1.725,1.280,0.136,1.298,-0.294,1,S0UQVER64O67,0.305,0.230,1.725,1.338,-0.235,1.370,-0.207,1,P18202UL2887,0.345,0.193,0.775,1.305,-0.044,1.369,-0.034,1,2TBHM907133H,0.018,0.359,1.725,1.263,0.237,1.383,0.830,1,OE6F3P5Z62LI,0.828,0.049,0.775,1.233,-0.238,1.362,-0.139,1 G7XT56FX5SI6,-0.307,0.566,-0.174,1.250,-0.522,1.384,0.298,1,M2X590PETV5W,-0.507,0.551,-0.174,1.302,-0.235,1.303,0.303,1,16OHRFGSD022,0.127,0.787,-0.174,1.468,-0.258,1.329,0.359,1,9LKXXF504M38,-0.351,0.970,-0.174,1.271,0.021,1.344,0.282,1,XBCQ6830PHKA,-0.142,0.570,-0.174,1.404,-0.034,1.277,-0.142,1 I3T24R45E75N,-0.622,1.202,0.775,1.373,0.146,1.325,-0.097,1,9LKXXF504M38,-0.351,0.970,-0.174,1.271,0.021,1.344,0.282,1,COW5AHO53PUI,-0.149,1.242,-0.174,1.425,-0.205,1.336,-0.060,1,XBCQ6830PHKA,-0.142,0.570,-0.174,1.404,-0.034,1.277,-0.142,1,M2X590PETV5W,-0.507,0.551,-0.174,1.302,-0.235,1.303,0.303,1
Choosing the correct neighborhood size for synthetic sample generation can be tricky. If it’s too small, the generated samples may not provide enough coverage through the feature space.
If the neighborhood too large, depending on the complexity of the distribution of the records belonging to the 2 classes in the feature space, there is the risk of creating a sample in regions with predominantly majority class records.
Synthetic Minority Class Sample Generation
This is the last step of the workflow, which generates the synthetic minority class samples. The algorithm was described earlier.The Map reduce implementation is in the ClassBasedOverSampler class.
Distribution of the neighbors for sampling purpose by default is uniform. It also supports exponential distribution of the neighbors. With exponential distribution, neighbors that are closer will have higher probability. So they will tend to be sampled more. With exponential distribution a rejection samling based technique is used. We have used the uniform distribution option.
One of the parameters that need to be specified is the multiplier for oversampling. it can be calculated as below
Here is some sample output. All the feature attribute values are generated by interpolation. Class attribute value remains as is. The ID is generated from the ID of the source record and the neighbor record.
00CAR5W29RTP,0.734,-0.480,-0.174,1.294,0.243,1.323,0.403,1 9Y700493NCV8,0.333,-0.401,-0.174,1.449,0.193,1.310,0.534,1 69ZD949KIPJN,0.910,-0.471,-0.174,1.351,0.304,1.280,0.586,1 0049IRC1WU91,0.659,-0.382,-0.174,1.352,-0.272,1.275,0.608,1 7IK66P93MJNC,0.985,-0.415,-0.174,1.354,0.296,1.308,0.441,1 00493CB97Y9J,0.984,-0.616,-0.174,1.353,0.218,1.290,0.518,1 004D9RJNCB69,0.853,-0.463,-0.174,1.354,0.272,1.310,0.485,1
When the output generated from this step is combined with the original data set we get a class balanced data set.
Based on available data we are generating additional data. It’s critical that additional data generated does not change the underlying distribution of the data significantly. This may happen if the class boundaries are complex and there are multiple mutually exclusive regions in the feature space belong to some class.
To alleviate this problem, the neighborhood size for nearest neighbors should selected carefully. Other wise you end up generating minority class record in a region of majority class in the feature space.
This is yet another example where you set out to build a predictive model and immediately get swamped with burgeoning list of data munging tasks, before you can even start.
My goal was to build an SVM based predictive model for machine failures and I immediately realized the data was highly unbalanced. That triggered a chain of ETL data munging tasks to get the data ready.
There are many techniques to combat class imbalance problems. We have used an oversampling based technique called SMOTE. We have gone through the details Map Reduce workflow using machine failure data as an use case. Details of the execution steps can be found in this tutorial document.