When we hear about trending, Twitter trending immediately comes to mind. However, there are many other scenarios where such analysis is applicable. Some example use cases are: (1) the top 5 videos watched in the last 2 hours; (2) the top 10 news stories browsed in the last 15 minutes; (3) the top 10 products that users have interacted with in the last 12 hours; (4) a reading from a patient monitoring wearable exceeding some threshold more than 5 times in the last 10 minutes. Problems of this kind are also known as the heavy hitters problem.
These problems have three characteristics. First, we want the answer in real time, as soon as the data is available. Second, the answer does not need to be exact. Since we are interested in the ranking of the top n items by popularity, an approximate answer is acceptable as long as the error is small and within some bound. Third, since the computation is done in memory, the memory requirement of the algorithms should be within reasonable bounds. Continue reading
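As a rough illustration of the trade-off described above (bounded memory, approximate but bounded error), here is a minimal sketch of the Misra-Gries frequent items algorithm. It is one well known approach to heavy hitters and is not necessarily the one used in the full post. The function names `misra_gries` and `top_n` and the parameter `k` are chosen for this example only.

```python
def misra_gries(stream, k):
    """Track candidate heavy hitters using at most k - 1 counters.

    For a stream of N items, each returned count underestimates the
    true count by at most N / k, and every item whose true count
    exceeds N / k is guaranteed to be among the returned keys.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop
            # the ones that reach zero. The new item is discarded.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def top_n(counters, n):
    """Return the n candidates with the largest estimated counts."""
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Memory is fixed by `k` regardless of how many distinct items appear, which is the third characteristic listed above. The sketch handles a single batch of events. Restricting it to a time window such as "the last 2 hours" would need extra bookkeeping that is not shown here.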
Posted in Approximate Query, Big Data, Internet of Things, Real Time Processing, Storm
Tagged approximate query, heavy hitters, IoT, sketches, synopsis, trending, wearable
When I implemented a feature similarity based matching engine in my open source Personalization and Recommendation Engine sifarish, the goal was to address the cold start problem. It allowed me to do content or feature based recommendation for users with limited engagement.
But later on, it evolved into a more general purpose search and matching workhorse that was used to solve several other problems. In this post, I will show how to implement a location and time based search service by leveraging the location and time window attributes that were added to the matching engine. Continue reading
Posted in Big Data, Hadoop and Map Reduce, Mobile, Real Time Processing, Recommendation Engine, Search, Spark, Storm
Tagged contextual search, lbs, location based service, mobile
Adoption of eLearning or Learning Management Systems (LMS) has increased significantly in the academic and business worlds. In some cases, depending on the content and the eLearning system being used, high drop out rates have been reported as a serious problem. Here is an article from the Journal of Online Learning and Teaching on this topic. In this post I will use the K Nearest Neighbor (KNN) classification technique, implemented on Hadoop, to predict part way through a course whether a student is likely to drop out eventually. KNN is a popular and reliable classification technique.
The input features consist of various signals based on the engagement level of the student with the eLearning system and the student’s performance so far. With the identification of students who are likely to drop out, the instructors can be more vigilant Continue reading
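To make the KNN idea concrete, here is a minimal single machine sketch in Python. It is not the Hadoop implementation the post refers to. The three features (login frequency, fraction of content viewed, average quiz score, all scaled to the range 0 to 1), the labels, and the training rows are hypothetical values made up for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest rows.

    train is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical engagement data:
# (login frequency, fraction of content viewed, average quiz score)
train = [
    ((0.9, 0.8, 0.85), "completed"),
    ((0.8, 0.9, 0.70), "completed"),
    ((0.7, 0.6, 0.90), "completed"),
    ((0.2, 0.1, 0.40), "dropped"),
    ((0.1, 0.3, 0.30), "dropped"),
    ((0.3, 0.2, 0.50), "dropped"),
]
```

Calling `knn_predict(train, (0.15, 0.2, 0.35))` classifies a student with low engagement by looking at the three most similar past students. Because the distance is sensitive to scale, features measured in different units would need to be normalized first.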
We have all had the unfortunate experience of being pigeonholed by Personalization and Recommendation engines. When recommendations are based only on our past behavior, there is very little opportunity to explore. But our past actions are not always good predictors of our future behavior. At any given moment, our behavior is highly influenced by our mood, fleeting thoughts and contextual surroundings. There are several ways to improve the solution, e.g., by introducing novelty and diversity into the recommendation list. Even adding some items randomly to the recommendation list has been found to be effective.
I have recently added novelty in my open source Recommendation and Personalization engine sifarish. In this post I will go over the solution as implemented in sifarish. Continue reading
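The simplest idea mentioned above, adding some random items to the recommendation list, can be sketched as follows. This is only an illustration of that one idea and is not how novelty is implemented in sifarish. The function name `inject_random_items` and its parameters are hypothetical.

```python
import random


def inject_random_items(ranked, catalog, num_random, rng=None):
    """Replace the tail of a ranked list with random catalog items.

    ranked is the recommendation list, best item first. catalog is
    every item that could be recommended. The random items are drawn
    only from catalog items that are not already in ranked.
    """
    rng = rng or random.Random()
    already_ranked = set(ranked)
    pool = [item for item in catalog if item not in already_ranked]
    # Never try to draw more items than are available or than fit.
    num_random = min(num_random, len(pool), len(ranked))
    kept = ranked[: len(ranked) - num_random]
    return kept + rng.sample(pool, num_random)
```

The list length stays the same, so the strongest recommendations keep their positions and only the weakest slots are given over to exploration. Passing a seeded `random.Random` makes the result reproducible for testing.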
We will be addressing two important issues faced by recommendation systems. First, how do we solve the cold start problem, i.e., provide recommendations for new users with very limited behavior data available? Second, even if we have a recommendation list for new users, how do we avoid presenting the same recommendation list repeatedly?
I will go over the details of how both of these problems have been solved in sifarish, my OSS recommendation engine. We calculate certain statistical parameters from user engagement historical signals and compute popularity for an item by combining those statistical parameters.
The second issue has to do with what is known as the “above the fold” issue. When presented with a long list, users will typically scan only the top few items. Continue reading
Making recommendations based on a user’s current behavior in a small time window is a powerful feature that has been added to sifarish recently. In this post I will go over the details of this feature. The real time feature has been added for social collaborative filtering based recommendations.
In our solution, although Storm is used for processing the real time user engagement event stream to find recommended items, Hadoop does a lot of the heavy lifting by computing the item correlation matrix from historical user engagement event data. Redis has been used Continue reading
Nobody likes hospital readmission soon after discharge, whether it’s the patient or the insurance company. Predictive analytic techniques have been used to predict the likelihood of hospital readmission, using various medical, personal and demographic input or feature attributes. However, some problems, including the one we are discussing, have a very large input feature set.
Before we plow ahead with building a predictive model, it’s important to pause and ask ourselves what features are really important, especially with a problem like this with a very large set of input features.
Machine learning algorithms generally work better if the dimensionality i.e. the number of feature attributes is lowered. One of the techniques for lowering the dimensionality is to select a subset of the original feature set, known as feature subset selection. Continue reading
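One common way to select a feature subset is a filter approach: score each feature by its mutual information with the class label and keep the highest scoring ones. The sketch below illustrates that approach under the assumption that all features are categorical. It is not necessarily the technique used in the full post, and the names `mutual_information` and `select_features` are chosen for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def select_features(rows, labels, num_features):
    """Rank features by mutual information with the class label.

    rows is a list of dicts mapping feature name to categorical value.
    Returns the top feature names and the score of every feature.
    """
    names = list(rows[0].keys())
    scores = {
        name: mutual_information([row[name] for row in rows], labels)
        for name in names
    }
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:num_features], scores
```

Scoring features one at a time is cheap, but it ignores redundancy: two highly correlated features can both score well even though keeping both adds little information.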