Nearest Flunking Neighbors

Adoption of eLearning or Learning Management Systems (LMS) has increased significantly within academic and business world. In some cases, depending on the content  and the eLearning system being used, high drop out rates have been reported as a serious problem. Here is an article form the Journal of Online Learning and Teaching on this topic. In this post I will use K Nearest Neighbor (KNN) classification technique implemented on Hadoop  to predict partially  through a course whether a student is likely to drop out eventually. KNN is a popular and reliable classification technique.

The input features consist of various signals based on the  engagement level of the student  with the eLearning system and the student’s performance so far. With the identification of students who are likely to drop  out, the instructors can be more vigilant and intervene to prevent attrition.

The solution is available in my OSS project avenir, which is a collection machine learning algorithms implemented on Hadoop and Storm. The distance calculation is done using  sifarish, which is another OSS project of mine for personalization and recommendation implemented on Hadoop and Storm.

K Nearest Neighbor

Conceptually KNN is very simple. The training data consists of students engagement data (input or feature variables)  from past classes  and whether the student eventually completed the course or not (class or output variable). The basic KNN algorithm is as follows.

  1. For any test point (i.e. a student about whom we want to make prediction), we find the K nearest neighbors. Each student is represented as a data point in a multi dimensional feature hyper space. Similarity is defined as the distance between data points using Euclidian and some other distance metric.
  2. Each of the neighbors belongs to one class or another. In our example, the class is a binary variable indicating whether a student completed or dropped out. We simply take the majority class among the neighbors and declare the test point to belong to that class.

Unlike other classification techniques, KNN does not build any predictive model and It operates directly on the data. It’s also known as local classification, example based classification etc. Here is a summary of pros and cons of KNN

Pros Cons
Simple and does not require learning and model building Computationally intensive
Works well even  for small training data set Does not work well for high dimensional data
Works well even when different class  populations are not linearly separable and  decision hyper plane is complex

Improvised KNN

There are various improvisations of the basic KNN. Here are some them. The list should not be considered as exhaustive. Some of the of the techniques towards the end of the list have their origins in the academic  world and they may not be well established in the business world.

  • Make the vote by a neighbor inversely proportionally to it’s distance or square of distance from the test point. The idea of making a neighbor further away less influential is intuitive
  • The same idea, except that instead of linear, we apply a kernel function centered on the neighbor. The nature of the kernel function decides how the influence of a neighbor decays as the distance from the test point grows.
  • Use distance metric other than Euclidian e.g. Minkowski, Manhattan etc.Different distance metric choices are offered in sifarish. But for this example, I have use Euclidian.
  • Use different weights for different attributes when calculating distance. This feature is available in sifarish, but I have not used it. The weight could be inverse of the standard deviation of the class conditioned attribute values. The weight could also be learnt iteratively from the training data set.
  • Apply class conditional weighting on the vote of a neighbor. This approach being used in this post
  • Use traditional techniques to find a  neighborhood and then apply other distance metric techniques based on class distribution within the  neighborhood to recalculate distances
  • .Use traditional techniques to find a  neighborhood and then apply local discriminant analysis and recompute distances
  • Weights associated with feature are computed  in the neighborhood of the test point by means of a local feature relevance factor


Our input for student engagement with eLearning consists of eight attributes as below. We are assuming that the eLearning system can generate all these signals. I have not done any feature analysis on them. Perhaps with feature analysis I could get rid of attributes that redundant with respect to other attributes and / or not relevant to the output.

  1. Time spent with content
  2. Time spent in discussion with others
  3. Time spent with organizer
  4. Number of emails sent and received
  5. Number of chat messages
  6. Score is tests
  7. Score in assignments
  8. Number of pages book marked

These are all numerical variables.  The class attribute is a binary variable , where P implies that the student completed the course and it’s F otherwise.

Feature Analysis

As I mentioned, feature analysis should be have been done on the input feature variables. It would have benefited us in two ways. This entropy and mutual information  based analysis is available in avenir. Here is an earlier post on the topic.

Essentially it  assigns score to each attribute based how uncorrelated an attribute is to other attributes and how correlated it is to the class variable. Based on the score, the top n attributes could be selected for inclusion in the prediction model

Class Imbalance and Overlap

These are some thorny issues faced by any machine learning  classification technique. We have imbalance in data when the distribution of data conditioned  on the class attribute value is highly skewed. In other words, there is dis proportionally higher number of data points belonging to one class when  the class values are binary.  Many real worlds data sets tend to have high class imbalance, including our eLearning data set. Students dropping out will be small percentage of the total population.

Class imbalance negatively impacts many  machine learning techniques resulting in poor performance in predicting the the minority class. For KNN, generally imbalance is not problematic especially when the data set is large  and the classes are separable.

The  hyper plane  separating classes is not very crisp in many data sets resulting in what is known as the class overlap problem. When the data points belonging to one class encroach deeply into the region occupied by data points belonging to the other class, we have class overlap problem. Most classification techniques can not withstand the combined force of class class imbalance and overlap and KNN is no exception.

Class Conditional Probability Weighting

I have used a technique called class conditional weighting (CCW) to combat class imbalance and overlap. Details are available here. The vote by a neighbor is weighted by class conditional probability of the data point.

y = arg max Σ I(yi = c) p(xi | yi) where
y = class of test data point
I = Indicator function with value 1 if training data point belongs to class c
p = probability of training data point xi conditioned on class c
the sum is taken over data points belonging to class c1 and c2

As we can see even with imbalance and overlap as the training data points belonging to class c1 encroach into the region belonging to c2, it’s class conditional probability will be lower and hence will carry less weight.

Map Reduce Workflow

Multiple map reduce jobs are orchestrated for KNN prediction as listed below. The first MR SameTypeSimilarity  is from my Recommendation and Personalization engine project  sifarish and it finds distance between objects with multiple attributes. It’s used in sifarish for for content based recommendation.

It works in two modes. If provided with one data set, it computes distance between each pair of data points. If provided with two sets, it finds pair wise distance  between the two data sets. We are using it in the two data sets  mode.

Input Map Reduce Output
training and test data SameTypeSimilarty finds distance between training data points and test data points
training data BayesianDistribution finds class conditional probability for training data attributes
training data BayesianPredictor finds class conditional probability for training data vector
output of the first and previous job FeatureCondProbJoiner joins training test data distance with conditional probability for training data vector
output of the previous job NearestNeighbor predicts class of test data

BayesianDistribution is from the bayesian package of avenir. It calculates feature attribute class conditional probability, feature prior probability and class prior probability. BayesianPredictor is also from the same package. It makes class prediction for Naive Bayes classification. But we are only interested in feature class conditional probability output from this MR.

FeatureCondProbJoiner joins training and test data distances with the  feature class conditional probability of the training data in preparation for the round of processing for KNN prediction.

NearestNeighbor makes the class prediction. Additionally, it provides some additional output which reflect the quality of the prediction.

Input and Output

Input to SameTypeSimilarity consists of two data sets, one for training and the other for validation. They are generated by a python script. Each input attribute is assumed to have normal distribution with specified mean and standard deviation. Data is generated using a technique called rejection sampling.

The data generating script with logic is such that there is class imbalance which is natural for this data set. It also deliberately introduces large class overlap to make classification more challenging. Here is some sample input.


The first field is the student ID and the last field is the class attribute with the value P signifying completion of course and F otherwise. The remaining fields are the feature attributes.

Here is some sample output from the final map reduce.The first field is the student ID. The field before the last is the actual class attribute value and the last one is the predicted class attribute value. In between we have some secondary output that contains scores for different class attribute values.


We have number of miss classifications and they are false negative i.e. student who has actually dropped out is predicted as having completed the course. In our example, false negatives are costlier than false positives, because it’s more important to detect a failing student.

Improving Classification Accuracy

There are several options worth trying to improve the classification accuracy, as below. I will come back with another post when I have tried  some of these options. Some of them are general practices followed and independent of the particular classification algorithm.

♦ The foll0wing  techniques are based on changing how the classification decisions are made. With these approaches, we introduce a bias in the decision boundary.

  • Modify the decision threshold. If the classification is based on probability or some score, the decision boundary does not have to be at the halfway point. In our example we might classify as dropped out if odds(dropOut) > 0.8
  • With cost based classification, we assign misclassification cost and then choose a classification that minimizes the cost. In our example, we could assign a hisgher cost for false negative


♦ With data driven approach we manipulate the training data as follows. These approaches are typically used to address the class imbalance problem.

  • Increase the training data size
  • Over sample the minority class data or under sample the majority class data in case of imbalance


♦ If the data model is high dimensional, we could also focus on the data model and reduce feature set as below

  • Reduce features by removing  features irrelevant to the class attribute
  • Remove features that redundant with respect to other features


♦ Finally, we have techniques that modify the basic classification algorithm as below. The goal is improve the correctness of prediction.

  • Try different values for neighborhood count (K). Lower values will cause large error due to variance. Higher values will cause large error due to bias.
  • Use inverse distance to neighbor weighting in neighbor’s vote
  • Use inverse of attribute class conditioned std deviation as weight in attribute distance calculation aka Class Dependent Mahalanobis distance
  • Use some of the other variations of KNN listed earlier

As I mentioned before, I deliberately introduced large class overlap to produce poor quality result. I am interested in finding out which of these techniques is effective in improving the prediction quality.

KD Tree

The algorithmic complexity of our brute force way to find  nearest neighbor is O(m x n), where m is the number of training samples and n is the number of test samples. In other words, for each test sample, we scan through all training samples. You might say that’s not very smart and there is a better way which is KD tree.

KD tree is  a binary search tree, extended for multi dimensional space. If we build a KD tree with training samples, the order of complexity becomes O(log(m) x n). However, Hadoop is not at all suitable for implementing the iterative logic to construct and navigate such trees.  Instead, we have relied on the brute force approach leveraging the horizontal scalability of Hadoop.

Wrapping Up

This is still work in progress, As I mentioned I will try some of the techniques listed above to improves the prediction accuracy. I have prepared the data set in a way that makes the classification  extremely challenging.

This tutorial document contains step by step instructions on how to run the example in this post.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Data Mining, Data Science, Hadoop and Map Reduce, Predictive Analytic and tagged , , , , , . Bookmark the permalink.

1 Response to Nearest Flunking Neighbors

  1. Pingback: Location and Time Based Service | Mawazo

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s