Counting Unique Mobile App Users with HyperLogLog

Continuing along the theme of real-time analytics with approximate algorithms, the focus this time is approximate cardinality estimation. To put the ideas in context, the use case we will be working with is counting the number of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the popularity of an app.

We will be using the HyperLogLog algorithm, which is available as a Java API in my open source project hoidla. The Storm implementation of the use case Continue reading
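
The core of the algorithm fits in a few dozen lines: hash each item, use the top bits of the hash to pick a register, and record the maximum position of the first set bit seen in the remaining bits. Below is a minimal, self-contained HyperLogLog sketch in Java for illustration; the class and method names are mine, not hoidla's actual API.

```java
import java.util.*;

// Minimal HyperLogLog sketch (illustrative, not hoidla's API).
// Uses 2^b registers; standard error is roughly 1.04 / sqrt(2^b).
public class HyperLogLog {
    private final int b;             // number of index bits
    private final int m;             // number of registers, m = 2^b
    private final byte[] registers;

    public HyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    // finalizer-style bit mixer so register index and rank look random
    private static int mix(int h) {
        h ^= h >>> 16; h *= 0x85EBCA6B;
        h ^= h >>> 13; h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    public void add(String item) {
        int h = mix(item.hashCode());
        int idx = h >>> (32 - b);                            // top b bits pick a register
        int rank = Integer.numberOfLeadingZeros(h << b) + 1; // position of first set bit
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    public long estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // bias correction constant
        double e = alpha * m * m / sum;
        if (e <= 2.5 * m && zeros > 0)              // small-range (linear counting) correction
            e = m * Math.log((double) m / zeros);
        return Math.round(e);
    }
}
```

With b = 12 the sketch uses only 4096 registers, yet counting tens of thousands of unique user ids lands within a few percent of the true cardinality, no matter how many times each user repeats.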

Posted in Approximate Query, Big Data, Mobile, Real Time Processing, Storm | Tagged , , | Leave a comment

Tracking Web Site Bounce Rate in Real Time

The bounce rate for a page on a web site is the proportion of sessions containing only that page. This post will show how to calculate bounce rate in real time with Storm using web log data. We will use a sampling and window based algorithm called Biased Reservoir Sampling, which comes from my OSS project hoidla, a collection of stream processing algorithms. The Storm implementation is part of my web analytics OSS project visitante.
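
As a rough illustration of the idea, here is a minimal biased reservoir sampler in Java following Aggarwal's memory-less bias scheme: the fuller the reservoir, the more likely a new item overwrites a random existing slot, so older items decay away exponentially and the sample stays biased toward recent traffic. Class and method names are illustrative, not hoidla's actual API.

```java
import java.util.*;

// Biased reservoir sampling (memory-less exponential bias).
// Illustrative sketch; hoidla's actual classes may differ.
public class BiasedReservoir<T> {
    private final int maxSize;
    private final List<T> reservoir = new ArrayList<>();
    private final Random rng;

    public BiasedReservoir(int maxSize, long seed) {
        this.maxSize = maxSize;
        this.rng = new Random(seed);
    }

    public void add(T item) {
        double fillFraction = (double) reservoir.size() / maxSize;
        if (rng.nextDouble() < fillFraction) {
            // replace a random slot: older items decay exponentially
            reservoir.set(rng.nextInt(reservoir.size()), item);
        } else {
            reservoir.add(item);  // reservoir still growing
        }
    }

    public List<T> sample() { return reservoir; }
}
```

To estimate the bounce rate of a page, feed each completed session into the sampler and report the fraction of sampled sessions that contain only that page; the exponential bias makes the estimate track recent behavior rather than all of history.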

We will see later how this analysis can feed into a reinforcement learning based approach to web site optimization. To be more specific, consider multiple competing home pages for a web site. Continue reading

Posted in Big Data, Optimization, Real Time Processing, Reinforcement Learning, Storm, Web Analytic | Tagged , | Leave a comment

Realtime Trending Analysis with Approximate Algorithms

When we hear about trending, Twitter trending immediately comes to mind. However, there are many other scenarios where such analysis is applicable. Some example use cases are: (1) the top 5 videos watched in the last 2 hours, (2) the top 10 news stories browsed in the last 15 minutes, (3) the top 10 products that users have interacted with in the last 12 hours, and (4) a reading from a patient monitoring wearable exceeding some threshold more than 5 times in the last 10 minutes. These are also known as the heavy hitters problem.

These problems have three characteristics. First, we want the answer in real time, as soon as the data is available. Second, the answer does not need to be exact. Since we are interested in the ranking of the top n items by popularity, an approximate answer is acceptable as long as the error is small and within some bound. Third, since the computation is done in memory, the memory requirement of the algorithms should be within reasonable bounds. Continue reading
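
One standard bounded-memory answer to the heavy hitters problem is the Space-Saving algorithm: keep a fixed number of counters, and when an untracked item arrives with no counter free, evict the smallest counter and let the new item inherit its count plus one. The sketch below is illustrative and not necessarily the algorithm used in the Storm implementation.

```java
import java.util.*;
import java.util.stream.Collectors;

// Space-Saving heavy hitters sketch: approximate top-k in fixed memory.
// Illustrative; not the hoidla implementation.
public class SpaceSaving {
    private final int capacity;
    private final Map<String, Long> counts = new HashMap<>();

    public SpaceSaving(int capacity) { this.capacity = capacity; }

    public void add(String item) {
        Long c = counts.get(item);
        if (c != null) { counts.put(item, c + 1); return; }    // already tracked
        if (counts.size() < capacity) { counts.put(item, 1L); return; }
        // evict the minimum counter; the new item inherits min + 1, which
        // may over-estimate its true count but never under-estimates it
        String minKey = null;
        long minVal = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() < minVal) { minVal = e.getValue(); minKey = e.getKey(); }
        }
        counts.remove(minKey);
        counts.put(item, minVal + 1);
    }

    public List<String> top(int n) {
        return counts.entrySet().stream()
            .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Even with only a handful of counters, genuinely frequent items (the top videos, stories, or products) survive the churn of one-off items and come out at the top of the ranking.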

Posted in Approximate Query, Big Data, Internet of Things, Real Time Processing, Storm | Tagged , , , , , , | Leave a comment

Location and Time Based Service

When I implemented the feature similarity based matching engine in my open source Personalization and Recommendation Engine sifarish, it was to address the cold start problem. It allowed me to do content or feature based recommendation for users with limited engagement.

But later on, it evolved into a more general purpose search and matching engine workhorse that was used to solve several other problems. In this post, I will show how to implement a location and time based search service by leveraging the location and time window attributes that were added to the matching engine. Continue reading
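
As a hypothetical sketch of the kind of matching involved (the names and signatures below are mine, not sifarish's API): an item matches a query when it lies within a distance radius of the query point and the query time falls inside the item's time window.

```java
// Hypothetical location + time window matcher; illustrative only.
public class LocTimeMatcher {
    static final double EARTH_RADIUS_KM = 6371.0;

    // great-circle distance between two lat/lon points (haversine formula)
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // match if within maxKm and the query time falls in the item's window
    static boolean matches(double qLat, double qLon, long qTime,
                           double iLat, double iLon, long winStart, long winEnd,
                           double maxKm) {
        return distanceKm(qLat, qLon, iLat, iLon) <= maxKm
            && qTime >= winStart && qTime <= winEnd;
    }
}
```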

Posted in Big Data, Hadoop and Map Reduce, Mobile, Real Time Processing, Recommendation Engine, Search, Spark, Storm | Tagged , , , | Leave a comment

Nearest Flunking Neighbors

Adoption of eLearning or Learning Management Systems (LMS) has increased significantly in both the academic and business worlds. In some cases, depending on the content and the eLearning system being used, high drop-out rates have been reported as a serious problem. Here is an article from the Journal of Online Learning and Teaching on this topic. In this post I will use the K Nearest Neighbor (KNN) classification technique, implemented on Hadoop, to predict partway through a course whether a student is likely to drop out eventually. KNN is a popular and reliable classification technique.

The input features consist of various signals based on the student's engagement level with the eLearning system and the student's performance so far. With the identification of students who are likely to drop out, the instructors can be more vigilant Continue reading
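
To make the technique concrete, here is a minimal in-memory KNN classifier in Java; the Hadoop version distributes the distance computation across mappers, but the per-query logic is the same. The feature names in the usage below (logins per week, average quiz score) are hypothetical examples of engagement and performance signals.

```java
import java.util.*;

// Minimal k-nearest-neighbor classifier (illustrative, in-memory).
public class Knn {
    private final List<double[]> features = new ArrayList<>();
    private final List<Integer> labels = new ArrayList<>();

    public void train(double[] x, int label) {
        features.add(x);
        labels.add(label);
    }

    public int predict(double[] x, int k) {
        // sort training indices by distance to the query point
        Integer[] idx = new Integer[features.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(features.get(i), x)));
        // majority vote among the k nearest neighbors
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++)
            votes.merge(labels.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(),
            Map.Entry.comparingByValue()).getKey();
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

Trained on labeled past students, e.g. hypothetical (loginsPerWeek, avgQuizScore) vectors with label 1 for dropped out and 0 for completed, the classifier votes among the k most similar students to flag a current student as at risk.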

Posted in Big Data, Data Mining, Hadoop and Map Reduce, Predictive Analytic | Tagged , , , , , | 1 Comment

Novelty in Personalization

We all have had the unfortunate experience of being pigeonholed by personalization and recommendation engines. Recommendations are based on our past behavior, leaving very little opportunity to explore. But our past actions are not always good predictors of our future behavior. At any given moment, our behavior is highly influenced by our mood, fleeting thoughts, and contextual surroundings. There are several ways to improve the solution, e.g., by introducing novelty and diversity into the recommendation list. Even adding some items randomly to the recommendation list has been found to be effective.
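
One common way to introduce novelty (shown here as an illustrative sketch, not necessarily sifarish's exact formulation) is to blend each candidate's predicted relevance with its self-information, -log2(popularity), so rarely seen items get a boost in the final ranking.

```java
import java.util.*;

// Novelty-aware re-ranking sketch: blend relevance with self-information.
// Illustrative only; not sifarish's actual scoring.
public class NoveltyReRanker {
    // popularity: fraction of users who have interacted with each item
    public static List<String> rerank(Map<String, Double> relevance,
                                      Map<String, Double> popularity,
                                      double noveltyWeight) {
        List<String> items = new ArrayList<>(relevance.keySet());
        items.sort((a, b) -> Double.compare(
            blended(b, relevance, popularity, noveltyWeight),
            blended(a, relevance, popularity, noveltyWeight)));
        return items;
    }

    private static double blended(String item, Map<String, Double> rel,
                                  Map<String, Double> pop, double w) {
        // -log2(popularity): the rarer the item, the higher its novelty
        double novelty = -Math.log(pop.get(item)) / Math.log(2);
        return (1 - w) * rel.get(item) + w * novelty;
    }
}
```

With the novelty weight at zero this degenerates to a plain relevance sort; turning it up lets an obscure but reasonably relevant item displace a slightly more relevant item that everyone has already seen.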

I have recently added novelty to my open source Recommendation and Personalization engine sifarish. In this post I will go over the solution as implemented in sifarish. Continue reading

Posted in Big Data, Data Mining, Hadoop and Map Reduce, Personalization, Recommendation Engine | Tagged , | Leave a comment

Popularity Shaken

We will be addressing two important issues faced by recommendation systems. First, how do you solve the cold start problem, i.e., provide recommendations for new users with very limited behavior data available? Second, even if we have a recommendation list for new users, how do we avoid presenting the same recommendation list repeatedly?

I will go over the details of how both of these problems have been solved in sifarish, my OSS recommendation engine. We calculate certain statistical parameters from historical user engagement signals and compute an item's popularity by combining those statistical parameters.

Why Dithering

The second issue has to do with what is known as the "above the fold" issue. When presented with a long list, users will typically scan only the top few items of the list. Continue reading
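
A widely used dithering scheme (shown here as an illustrative sketch, not necessarily sifarish's exact formulation) perturbs the log of each item's rank with Gaussian noise and re-sorts, so repeat visits see a slightly different ordering while the best items mostly stay near the top of the fold.

```java
import java.util.*;

// Dithering sketch: score = log(rank) + Gaussian noise, then re-sort.
// Illustrative only; not sifarish's actual scheme.
public class Dithering {
    // epsilon >= 1 controls how much the list is shuffled; epsilon = 1 means none
    public static List<String> dither(List<String> ranked, double epsilon, Random rng) {
        double sd = Math.sqrt(Math.log(epsilon));  // noise standard deviation
        final Map<String, Double> noisy = new HashMap<>();
        for (int r = 0; r < ranked.size(); r++)
            noisy.put(ranked.get(r), Math.log(r + 1) + rng.nextGaussian() * sd);
        List<String> out = new ArrayList<>(ranked);
        out.sort(Comparator.comparingDouble(noisy::get));
        return out;
    }
}
```

Because the noise is applied to log(rank), adjacent items deep in the list swap freely while the gap between ranks 1 and 2 is large enough that the top items rarely fall out of view; each page load dithers the list afresh.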

Posted in Big Data, Hadoop and Map Reduce, Recommendation Engine, Storm | Tagged , , | 2 Comments