In the Internet of Things (IoT) and a connected-everything world, analytics of sensor generated data is mostly about real time analytics of time series data. In this post, I will address a use case involving detecting outliers in sensor generated data with Spark Streaming. Outliers are data points that deviate significantly from most of the data. The implementation is part of my new open source project ruscello, implemented in Scala. The project is focused on real time analytics of sensor data, with various IoT use cases in mind.
The specific use we will be addressing in this post Continue reading
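As a rough illustration of what "deviating significantly from most of the data" can mean for a sensor stream, here is a minimal Python sketch of a sliding window z-score test. It is not taken from ruscello and does not use Spark Streaming; the function name `detect_outliers`, the window size, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from collections import deque
import statistics

def detect_outliers(readings, window=10, threshold=3.0):
    """Flag readings that lie more than `threshold` standard deviations
    from the mean of the preceding `window` readings.
    Hypothetical helper for illustration; not part of ruscello."""
    history = deque(maxlen=window)   # most recent `window` readings
    outliers = []
    for index, value in enumerate(readings):
        # Only score a reading once a full window of history exists.
        if len(history) == window:
            mean = statistics.mean(history)
            std = statistics.pstdev(history)
            if std > 0 and abs(value - mean) / std > threshold:
                outliers.append((index, value))
        history.append(value)
    return outliers

# Simulated temperature readings with one spike at index 20.
readings = [20.0, 20.5, 19.5, 20.2, 19.8] * 4 + [35.0, 20.1, 19.9]
print(detect_outliers(readings))
```

The spike inflates the window's standard deviation once it enters the history, so the readings right after it are scored leniently. A production detector would probably need a more robust statistic, such as the median absolute deviation.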
One of the nagging problems in personalized recommendation systems is the crowding of items with the same attribute values in the recommendation list. For example, if you happen to like a certain artist, songs by that artist will tend to flood the recommended song list in a music recommendation system. The flooding happens because of the accuracy of the collaborative filtering algorithm. But that accuracy may not always generate the best outcome. The only criterion of success for a recommender is whether or not the user has clicked on one or more items in the recommendation impression list.
As I mentioned in my earlier post, getting a recommendation list through Collaborative Filtering (CF) machine learning algorithms goes only part of the way towards providing an effective solution. Various post processing mechanisms are required to modify and improve the recommendation list Continue reading
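One simple way to picture such a post processing step is a cap on how many items sharing an attribute value may appear near the top of the list. The Python sketch below is only an illustration under that assumption. It is not the mechanism from the post or from sifarish, and the names `diversify`, `attribute_of`, and `max_per_value` are hypothetical.

```python
from collections import defaultdict

def diversify(ranked_items, attribute_of, max_per_value=2):
    """Re-order a ranked list so that at most `max_per_value` items with
    the same attribute value (e.g. artist) appear before the rest.
    Items over the cap are moved to the end, keeping their relative order.
    Hypothetical heuristic for illustration only."""
    seen = defaultdict(int)
    kept, deferred = [], []
    for item in ranked_items:
        value = attribute_of[item]
        if seen[value] < max_per_value:
            seen[value] += 1
            kept.append(item)
        else:
            deferred.append(item)
    return kept + deferred

# Songs ranked by predicted rating; artist A dominates the raw list.
ranked = ["s1", "s2", "s3", "s4", "s5"]
artist = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "A"}
print(diversify(ranked, artist))  # ['s1', 's2', 's4', 's3', 's5']
```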
The basic recommendation output, consisting of the tuple (user, item, predicted rating), is easy to obtain from any Collaborative Filtering (CF) based Recommendation and Personalization engine, including sifarish. It’s been reported that applying various post processing logic to the basic CF output improves the quality of recommendation results more than tweaking the core machine learning algorithm does. We can think of these post processing units as plugins that apply to the basic recommendation output.
Many of those have been implemented in sifarish. They are based not on rigorous machine learning algorithms, but on simple intuition and heuristics. In this post I will go through one of them that was implemented recently. The actual rating of an item by a user is considered in conjunction with the predicted rating to derive a net rating. Continue reading
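The excerpt does not say how the two ratings are combined. Purely as an assumption for illustration, the Python sketch below blends them with a weighted average and falls back to the predicted rating when the user has not rated the item. The function `net_rating` and the `actual_weight` parameter are hypothetical and should not be read as sifarish's formula.

```python
def net_rating(predicted, actual=None, actual_weight=0.5):
    """Blend a predicted rating with the user's actual rating, if any.
    The weighted average is an assumed combination rule, chosen only
    to illustrate the idea of a net rating."""
    if actual is None:
        # No explicit feedback from the user: keep the CF prediction.
        return predicted
    return actual_weight * actual + (1.0 - actual_weight) * predicted

# (user, item, predicted rating) tuples, as produced by a CF engine.
predictions = [("u1", "i1", 2.0), ("u1", "i2", 4.0)]
actual_ratings = {("u1", "i1"): 4.0}   # u1 has rated only i1

for user, item, predicted in predictions:
    actual = actual_ratings.get((user, item))
    print(user, item, net_rating(predicted, actual))
```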
Continuing along the theme of real time analytics with approximate algorithms, the focus this time is approximate cardinality estimation. To put the ideas in context, the use case we will be working with is counting the number of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the popularity of an app.
We will be using the HyperLogLog algorithm, which is available in my open source project hoidla as a Java API. The Storm implementation of the use case Continue reading
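To give a feel for how HyperLogLog estimates a unique count in a small, fixed amount of memory, here is a compact Python sketch of the textbook algorithm. It is not the hoidla Java API. The class name, the choice of SHA-1 as the hash, and the default of 4096 registers are assumptions made for the example.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not the hoidla API)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Standard bias correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64 bit hash value
        index = h >> (64 - self.p)                   # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        # Rank: position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            # Small range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for _ in range(3):                      # each user shows up several times
    for user_id in range(10000):
        hll.add(f"user-{user_id}")
print(round(hll.count()))               # close to 10000, not exact
```

With 4096 registers the expected relative error is about 1.6 percent, and the sketch stays the same size however many users are seen.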
The bounce rate for a page in a web site is the proportion of sessions with only that page in the session. This post will show how to calculate bounce rate in real time with Storm using web log data. We will use a sampling and window based algorithm called Biased Reservoir Sampling, which is from my OSS project hoidla, a collection of stream processing algorithms. The Storm implementation is part of my web analytics OSS project visitante.
We will see later how this analysis can feed into a reinforcement learning based approach for web site optimization. To be more specific, consider multiple competing home pages for a web site. Continue reading
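The bounce rate definition given above can be made concrete with a small batch computation. The Python sketch below leaves out the sampling and the Storm topology. It assumes that each session is an ordered list of page names and that the denominator for a page is the number of sessions that start on that page. The function name `bounce_rates` is hypothetical.

```python
from collections import Counter

def bounce_rates(sessions):
    """Per-page bounce rate: single-page sessions on a page divided by
    sessions that start on that page (assumed denominator)."""
    entrances = Counter()
    bounces = Counter()
    for pages in sessions:
        if not pages:
            continue                    # ignore empty sessions
        landing = pages[0]
        entrances[landing] += 1
        if len(pages) == 1:
            bounces[landing] += 1       # visitor left after one page
    return {page: bounces[page] / entrances[page] for page in entrances}

sessions = [
    ["home"],
    ["home", "pricing"],
    ["home"],
    ["blog", "home"],
    ["pricing"],
]
rates = bounce_rates(sessions)
print(rates)
```

Here two of the three sessions that start on the home page end there, so its bounce rate is 2/3.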
When we hear about trending, Twitter trending immediately comes to mind. However, there are many other scenarios where such analysis is applicable. Some example use cases are 1. Top 5 videos watched in the last 2 hours. 2. Top 10 news stories browsed in the last 15 minutes. 3. Top 10 products that users have interacted with in the last 12 hours. 4. Count of some reading from a patient monitoring wearable exceeding some threshold more than 5 times in the last 10 minutes. These problems are also known as the heavy hitters problem.
These problems have three characteristics. First, we want the answer in real time, as soon as the data is available. Second, the answer does not need to be exact. Since we are interested in the ranking of the top n items in terms of popularity, an approximate answer is acceptable as long as the error is small and within some bound. Third, since the computation is done in memory, the memory requirement of the algorithms should be within reasonable bounds. Continue reading
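The three characteristics above (single pass, bounded error, bounded memory) are shared by several well known heavy hitter algorithms. As one example, and not necessarily the algorithm used in the post, here is a Python sketch of the Misra-Gries summary. It keeps at most k - 1 counters and is guaranteed to retain every item that occurs more than n / k times in a stream of n items. The function name is hypothetical.

```python
def misra_gries(stream, k):
    """Misra-Gries frequent items summary with at most k - 1 counters.
    Each reported count underestimates the true count by at most n / k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 100 video views: two popular videos plus a long tail of one-off views.
views = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
summary = misra_gries(views, k=4)
print(summary)
```

The counts are approximate, but the memory used depends only on k, not on the number of distinct items in the stream.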
Posted in Approximate Query, Big Data, Internet of Things, Real Time Processing, Storm
Tagged approximate query, heavy hitters, IoT, sketches, synopsis, trending, wearable
When I implemented the feature similarity based matching engine in my open source Personalization and Recommendation Engine sifarish, it was to address the cold start problem. It allowed me to do content or feature based recommendation for users with limited engagement.
But later on, it evolved into a more general purpose search and matching workhorse, which was used to solve several other problems. In this post, I will show how to implement a location and time based search service by leveraging location and time window attributes that were added to the matching engine. Continue reading
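To make the idea of matching on location and time window attributes concrete, here is a small Python sketch that filters items by great-circle distance and by whether the query time falls inside each item's time window. It is an assumption-laden illustration, not sifarish's matching logic. The record fields (`lat`, `lon`, `start`, `end`) and the function names are hypothetical.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    radius = 6371.0                     # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

def search(items, lat, lon, when, radius_km):
    """Names of items within radius_km whose time window contains `when`."""
    return [
        item["name"]
        for item in items
        if haversine_km(lat, lon, item["lat"], item["lon"]) <= radius_km
        and item["start"] <= when <= item["end"]
    ]

items = [
    {"name": "nearby-open", "lat": 37.7800, "lon": -122.4100,
     "start": datetime(2024, 1, 1, 9), "end": datetime(2024, 1, 1, 17)},
    {"name": "far-open", "lat": 34.0500, "lon": -118.2400,
     "start": datetime(2024, 1, 1, 9), "end": datetime(2024, 1, 1, 17)},
    {"name": "nearby-closed", "lat": 37.7800, "lon": -122.4100,
     "start": datetime(2024, 1, 2, 9), "end": datetime(2024, 1, 2, 17)},
]
result = search(items, 37.7749, -122.4194, datetime(2024, 1, 1, 12), 5.0)
print(result)  # ['nearby-open']
```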
Posted in Big Data, Hadoop and Map Reduce, Mobile, Real Time Processing, Recommendation Engine, Search, Spark, Storm
Tagged contextual search, lbs, location based service, mobile