Adoption of *eLearning* or Learning Management Systems (*LMS*) has increased significantly within academic and business world. In some cases, depending on the content and the *eLearning* system being used, high drop out rates have been reported as a serious problem. Here is an article form the *Journal of Online Learning and Teaching* on this topic. In this post I will use K Nearest Neighbor (*KNN*) classification technique implemented on *Hadoop* to predict partially through a course whether a student is likely to drop out eventually. *KNN* is a popular and reliable classification technique.

The input features consist of various signals based on the engagement level of the student with the *eLearning* system and the student’s performance so far. With the identification of students who are likely to drop out, the instructors can be more vigilant and intervene to prevent attrition.

The solution is available in my OSS project *avenir*, which is a collection machine learning algorithms implemented on* Hadoop *and* Storm*. The distance calculation is done using *sifarish*, which is another OSS project of mine for personalization and recommendation implemented on *Hadoop* and *Storm*.

## K Nearest Neighbor

Conceptually KNN is very simple. The training data consists of students engagement data (input or feature variables) from past classes and whether the student eventually completed the course or not (class or output variable). The basic KNN algorithm is as follows.

*For any test point (i.e. a student about whom we want to make prediction), we find the K nearest neighbors. Each student is represented as a data point in a multi dimensional feature hyper space. Similarity is defined as the distance between data points using Euclidian and some other distance metric.**Each of the neighbors belongs to one class or another. In our example, the class is a binary variable indicating whether a student completed or dropped out. We simply take the majority class among the neighbors and declare the test point to belong to that class.*

Unlike other classification techniques, *KNN* does not build any predictive model and It operates directly on the data. It’s also known as local classification, example based classification etc. Here is a summary of pros and cons of KNN

Pros | Cons |

Simple and does not require learning and model building | Computationally intensive |

Works well even for small training data set | Does not work well for high dimensional data |

Works well even when different class populations are not linearly separable and decision hyper plane is complex |

## Improvised KNN

There are various improvisations of the basic *KNN*. Here are some them. The list should not be considered as exhaustive. Some of the of the techniques towards the end of the list have their origins in the academic world and they may not be well established in the business world.

*Make the vote by a neighbor inversely proportionally to it’s distance or square of distance from the test point. The idea of making a neighbor further away less influential is intuitive**The same idea, except that instead of linear, we apply a kernel function centered on the neighbor. The nature of the kernel function decides how the influence of a neighbor decays as the distance from the test point grows.**Use distance metric other than Euclidian e.g. Minkowski, Manhattan etc.Different distance metric choices are offered in sifarish. But for this example, I have use Euclidian.**Use different weights for different attributes when calculating distance. This feature is available in sifarish, but I have not used it. The weight could be inverse of the standard deviation of the class conditioned attribute values. The weight could also be learnt iteratively from the training data set.**Apply class conditional weighting on the vote of a neighbor. This approach being used in this post**Use traditional techniques to find a neighborhood and then apply other distance metric techniques based on class distribution within the neighborhood to recalculate distances**.**Use traditional techniques to find a neighborhood and then apply local discriminant analysis and recompute distances**Weights associated with feature are computed in the neighborhood of the test point by means of a local feature relevance factor*

## Input

Our input for student engagement with *eLearning* consists of eight attributes as below. We are assuming that the *eLearning* system can generate all these signals. I have not done any feature analysis on them. Perhaps with feature analysis I could get rid of attributes that redundant with respect to other attributes and / or not relevant to the output.

*Time spent with content**Time spent in discussion with others**Time spent with organizer**Number of emails sent and received**Number of chat messages**Score is tests**Score in assignments**Number of pages book marked*

These are all numerical variables. The class attribute is a binary variable , where P implies that the student completed the course and it’s F otherwise.

## Feature Analysis

As I mentioned, feature analysis should be have been done on the input feature variables. It would have benefited us in two ways. This entropy and mutual information based analysis is available in *avenir*. Here is an earlier post on the topic.

Essentially it assigns score to each attribute based how uncorrelated an attribute is to other attributes and how correlated it is to the class variable. Based on the score, the top n attributes could be selected for inclusion in the prediction model

## Class Imbalance and Overlap

These are some thorny issues faced by any machine learning classification technique. We have imbalance in data when the distribution of data conditioned on the class attribute value is highly skewed. In other words, there is dis proportionally higher number of data points belonging to one class when the class values are binary. Many real worlds data sets tend to have high class imbalance, including our *eLearning* data set. Students dropping out will be small percentage of the total population.

Class imbalance negatively impacts many machine learning techniques resulting in poor performance in predicting the the minority class. For KNN, generally imbalance is not problematic especially when the data set is large and the classes are separable.

The hyper plane separating classes is not very crisp in many data sets resulting in what is known as the class overlap problem. When the data points belonging to one class encroach deeply into the region occupied by data points belonging to the other class, we have class overlap problem. Most classification techniques can not withstand the combined force of class class imbalance and overlap and *KNN* is no exception.

## Class Conditional Probability Weighting

I have used a technique called class conditional weighting (*CCW*) to combat class imbalance and overlap. Details are available here. The vote by a neighbor is weighted by class conditional probability of the data point.

*y = arg max Σ I(y _{i} = c) p(x_{i} | y_{i}) where*

*y = class of test data point*

*I = Indicator function with value 1 if training data point belongs to class c*

*p = probability of training data point x*

_{i}conditioned on class c*the sum is taken over data points belonging to class c1 and c2*

As we can see even with imbalance and overlap as the training data points belonging to class c1 encroach into the region belonging to c2, it’s class conditional probability will be lower and hence will carry less weight.

## Map Reduce Workflow

Multiple map reduce jobs are orchestrated for KNN prediction as listed below. The first MR *SameTypeSimilarity* is from my Recommendation and Personalization engine project *sifarish* and it finds distance between objects with multiple attributes. It’s used in sifarish for for content based recommendation.

It works in two modes. If provided with one data set, it computes distance between each pair of data points. If provided with two sets, it finds pair wise distance between the two data sets. We are using it in the two data sets mode.

Input | Map Reduce | Output |

training and test data | SameTypeSimilarty |
finds distance between training data points and test data points |

training data | BayesianDistribution |
finds class conditional probability for training data attributes |

training data | BayesianPredictor |
finds class conditional probability for training data vector |

output of the first and previous job | FeatureCondProbJoiner |
joins training test data distance with conditional probability for training data vector |

output of the previous job | NearestNeighbor |
predicts class of test data |

*BayesianDistribution* is from the bayesian package of *avenir*. It calculates feature attribute class conditional probability, feature prior probability and class prior probability. BayesianPredictor is also from the same package. It makes class prediction for Naive Bayes classification. But we are only interested in feature class conditional probability output from this MR.

*FeatureCondProbJoiner* joins training and test data distances with the feature class conditional probability of the training data in preparation for the round of processing for *KNN* prediction.

*NearestNeighbor* makes the class prediction. Additionally, it provides some additional output which reflect the quality of the prediction.

## Input and Output

Input to *SameTypeSimilarity* consists of two data sets, one for training and the other for validation. They are generated by a python script. Each input attribute is assumed to have normal distribution with specified mean and standard deviation. Data is generated using a technique called rejection sampling.

The data generating script with logic is such that there is class imbalance which is natural for this data set. It also deliberately introduces large class overlap to make classification more challenging. Here is some sample input.

1242987,375,28,59,8,69,55,197,52,1,P 1375258,307,39,52,13,69,33,97,32,0,F 1545012,155,44,55,6,38,52,33,33,10,F 1416276,203,78,26,14,10,10,79,47,14,P 1212312,536,83,23,0,30,61,79,72,20,P 1259055,191,70,2,1,75,21,136,84,25,P 1241411,166,113,36,14,95,100,172,75,9,P

The first field is the student ID and the last field is the class attribute with the value P signifying completion of course and F otherwise. The remaining fields are the feature attributes.

Here is some sample output from the final map reduce.The first field is the student ID. The field before the last is the actual class attribute value and the last one is the predicted class attribute value. In between we have some secondary output that contains scores for different class attribute values.

P1973128,P,1.563226260137538E-17,P,P P1974227,F,5.094622163781374E-18,P,1.3913280652371146E-17,P,P P1977663,P,4.499679277419704E-18,F,P P1979022,F,6.229582705344719E-18,P,1.2507761942439003E-18,F,F P1981268,F,7.90091893283086E-19,P,3.9097666365799436E-18,F,P P1984678,F,9.797444241127851E-19,P,1.385907383740041E-17,P,P P1987694,F,3.8811744913728846E-18,P,2.7169442275428592E-17,P,P

We have number of miss classifications and they are false negative i.e. student who has actually dropped out is predicted as having completed the course. In our example, false negatives are costlier than false positives, because it’s more important to detect a failing student.

## Improving Classification Accuracy

There are several options worth trying to improve the classification accuracy, as below. I will come back with another post when I have tried some of these options. Some of them are general practices followed and independent of the particular classification algorithm.

♦ The foll0wing techniques are based on changing how the classification decisions are made. With these approaches, we introduce a bias in the decision boundary.

*Modify the decision threshold. If the classification is based on probability or some score, the decision boundary does not have to be at the halfway point. In our example we might classify as dropped out if odds(dropOut) > 0.8**With cost based classification, we assign misclassification cost and then choose a classification that minimizes the cost. In our example, we could assign a hisgher cost for false negative*

♦ With data driven approach we manipulate the training data as follows. These approaches are typically used to address the class imbalance problem.

- I
*ncrease the training data size* *Over sample the minority class data or under sample the majority class data in case of imbalance*

♦ If the data model is high dimensional, we could also focus on the data model and reduce feature set as below

*Reduce features by removing features**irrelevant*to the class attribute

*Remove features that redundant with respect to other features*

♦ Finally, we have techniques that modify the basic classification algorithm as below. The goal is improve the correctness of prediction.

*Try different values for neighborhood count (K). Lower values will cause large error due to variance. Higher values will cause large error due to bias.*

*Use inverse distance to neighbor weighting in neighbor’s vote**Use inverse of attribute class conditioned std deviation as weight in attribute distance calculation aka Class Dependent Mahalanobis distance**Use some of the other variations of KNN listed earlier*

As I mentioned before, I deliberately introduced large class overlap to produce poor quality result. I am interested in finding out which of these techniques is effective in improving the prediction quality.

## KD Tree

The algorithmic complexity of our brute force way to find nearest neighbor is *O(m x n)*, where m is the number of training samples and n is the number of test samples. In other words, for each test sample, we scan through all training samples. You might say that’s not very smart and there is a better way which is *KD* tree.

*KD* tree is a binary search tree, extended for multi dimensional space. If we build a *KD* tree with training samples, the order of complexity becomes *O(log(m) x n)*. However, *Hadoop* is not at all suitable for implementing the iterative logic to construct and navigate such trees. Instead, we have relied on the brute force approach leveraging the horizontal scalability of *Hadoop*.

## Wrapping Up

This is still work in progress, As I mentioned I will try some of the techniques listed above to improves the prediction accuracy. I have prepared the data set in a way that makes the classification extremely challenging.

This tutorial document contains step by step instructions on how to run the example in this post.

Pingback: Location and Time Based Service | Mawazo