The basic input for sifarish or any other collaborative filtering based recommendation engine is user rating of items. However, explicit ratings by users are not always available. Even when they are available, it is known that users with extreme views are the ones most likely to rate items explicitly. So the rating data, even when available, may be biased and not very reliable.
However, user click stream data is generally available. The type of engagement a user has with an item (e.g., browsing the product description, placing the item in the shopping cart) reflects the level of interest the user has in the item. Based on this intuition, it’s possible to map engagement events to an implicit rating.
Applying this kind of heuristic is a viable option when there is a paucity of explicit rating data or when such data is deemed not very reliable. sifarish provides a preprocessing map reduce job to estimate implicit rating. In this post, we will go over the details of this map reduce job with an example.
Mapping User Engagement Events
In our example, we consider 7 positive event types, listed in decreasing order of user interest, followed by 4 negative event types, as below.
Event Type | Description |
1 | Purchased item |
2 | Joined checkout |
3 | Placed item in shopping cart |
4 | Placed item in wish list |
5 | Browsed item from search result |
6 | Browsed item from recommendation list |
7 | Browsed item |
-1 | Returned item |
-2 | Left checkout |
-3 | Removed item from shopping cart |
-4 | Removed item from wish list |
The user rating is a function of the event type and the number of occurrences of that event type. If there are multiple event types associated with an item, the rating associated with each event type is calculated and the highest rating among them is selected.
For a given event type, rating increases asymptotically with increasing number of occurrences up to a threshold rating value.
Estimating Implicit Rating
The map reduce implementation for implicit rating is here. Some sample input data is as follows, which can easily be generated by pre-processing raw click stream data.
0I3GQ6SETOIR,1595e19b-01c1-48a6-835c-e7d55902417e,929BBU0001,6,1397403852
UGU2IS4VW6SC,6238c407-377e-4b02-b0ea-be90dbd5b199,YVY412FGW4,6,1397403868
HW0WP38NWV2V,b73c6b09-390c-4494-b893-b6e265332ade,SQAG41CKO1,7,1397403886
TKQFZM0WCM84,5dcd1252-071e-4299-8d57-f6b0a61fd795,93R93SYKQ5,4,1397403903
The fields are 1. user ID 2. session ID 3. item ID 4. event type 5. time stamp. The time stamp is included in the input so that time stamped rating data can be generated. One of the features of sifarish is time sensitive recommendation, which requires time stamped rating data.
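As a minimal sketch of the record layout just described, one input line could be parsed as follows. The EngagementEvent type and parse_event function are hypothetical names used only for illustration; they are not part of sifarish.

```python
from typing import NamedTuple


class EngagementEvent(NamedTuple):
    """One click stream record, in the field order described above."""
    user_id: str
    session_id: str
    item_id: str
    event_type: int
    timestamp: int


def parse_event(line: str) -> EngagementEvent:
    """Split a comma separated record into typed fields."""
    user_id, session_id, item_id, event_type, timestamp = line.strip().split(",")
    return EngagementEvent(user_id, session_id, item_id,
                           int(event_type), int(timestamp))


event = parse_event(
    "0I3GQ6SETOIR,1595e19b-01c1-48a6-835c-e7d55902417e,929BBU0001,6,1397403852"
)
print(event.item_id, event.event_type)
```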
The mapper output key is user ID and item ID. It is secondary sorted by event type. On the reducer side, only the event data corresponding to the most engaging event is processed and the rest is ignored.
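The grouping and selection described above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not the sifarish map reduce code. It assumes that a smaller positive event type code means higher engagement (as in the event table), that negative events have already been dealt with, and that the latest time stamp of the selected event type is the one kept. The sample tuples are made up.

```python
from collections import defaultdict

# (user ID, item ID, event type, time stamp); made-up values for illustration
events = [
    ("U1", "I1", 7, 1397403800),
    ("U1", "I1", 5, 1397403850),
    ("U1", "I1", 5, 1397403900),
    ("U1", "I2", 3, 1397403950),
]


def most_engaging_events(events):
    """Group by (user, item) and keep only the most engaging event type."""
    grouped = defaultdict(list)
    for user_id, item_id, event_type, timestamp in events:
        grouped[(user_id, item_id)].append((event_type, timestamp))

    result = {}
    for key, values in grouped.items():
        values.sort()  # stands in for the secondary sort by event type
        top_type = values[0][0]
        top_times = [ts for ev, ts in values if ev == top_type]
        # (event type, occurrence count, latest time stamp)
        result[key] = (top_type, len(top_times), max(top_times))
    return result


summary = most_engaging_events(events)
for key, value in summary.items():
    print(key, value)
```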
The event type to rating mapping meta data is provided through a JSON as below. The event types are as described earlier.
{
  "eventScores" : [
    { "eventType" : 1, "description" : "purchased", "scores" : [100] },
    { "eventType" : 2, "description" : "joined checkout", "scores" : [85] },
    { "eventType" : 3, "description" : "placed in shopping cart", "scores" : [60] },
    { "eventType" : 4, "description" : "placed in wishlist", "scores" : [40] },
    { "eventType" : 5, "description" : "browsed from search result", "scores" : [25,32,38,43,47] },
    { "eventType" : 6, "description" : "browsed for recommendation list", "scores" : [15,21,26,30,33] },
    { "eventType" : 7, "description" : "browsed", "scores" : [5,12,17,21,24] },
    { "eventType" : -1, "description" : "returned" },
    { "eventType" : -2, "description" : "left checkout" },
    { "eventType" : -3, "description" : "removed from shopping cart" },
    { "eventType" : -4, "description" : "removed from wish list" }
  ]
}
The scores field provides the mapping between event occurrence count and rating. As the count increases, the rating reaches a limiting value.
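To make the count to rating lookup concrete, here is a rough sketch that uses a subset of the metadata above. It assumes that the n-th entry of scores is the rating for n occurrences and that any count beyond the end of the list gets the last (limiting) value. The function names rating_for and implicit_rating are hypothetical and not taken from sifarish.

```python
import json

# Subset of the scoring metadata shown above
config = json.loads("""
{ "eventScores" : [
  { "eventType" : 1, "description" : "purchased", "scores" : [100] },
  { "eventType" : 5, "description" : "browsed from search result",
    "scores" : [25,32,38,43,47] },
  { "eventType" : 7, "description" : "browsed", "scores" : [5,12,17,21,24] },
  { "eventType" : -3, "description" : "removed from shopping cart" }
] }
""")

scores_by_type = {e["eventType"]: e.get("scores", [])
                  for e in config["eventScores"]}


def rating_for(event_type, count):
    """Rating for one event type, clamped to the last score in the list."""
    scores = scores_by_type.get(event_type, [])
    if not scores or count < 1:
        return 0
    return scores[min(count, len(scores)) - 1]


def implicit_rating(counts):
    """Highest rating over all event types seen for a user and item."""
    return max((rating_for(t, c) for t, c in counts.items()), default=0)


print(rating_for(5, 3))                # third entry for event type 5
print(rating_for(5, 9))                # count past the list end, clamped
print(implicit_rating({7: 5, 5: 1}))   # best of the two event types
```

With these assumptions, a single purchase already outranks any number of browse events, since the browse scores level off well below 100.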
Here is some sample output. The fields are 1. user ID 2. item ID 3. rating 4. time stamp 5. most engaging event type 6. event count. The last two fields are optional output, controlled through a configuration parameter.
000R1I1QK4R62,512YL4KC6W,5,1397434130,7,1
000R1I1QK4R62,7A0JLOLVQ2,25,1397803873,5,1
000R1I1QK4R62,7DFGDFU026,100,1397864143,1,1
000R1I1QK4R62,FOD39Y2FTT,15,1397436814,6,1
000R1I1QK4R62,GZF5UQ75N9,25,1397647743,5,1
000R1I1QK4R62,J4LWGR23OI,40,1397645120,4,1
000R1I1QK4R62,QBALZ21R1E,40,1397858317,4,1
000R1I1QK4R62,SC4604N2XQ,5,1397445978,7,1
000W425HZ6JL4,4POOZEJ4HN,60,1397854330,3,1
Negative Events
Some events have negative values, indicating negative actions on the part of the user, e.g., removing an item from the shopping cart. While processing the event sequence in the reducer for a user and an item, all the negative events are identified.
For each such negative event, a corresponding positive event is removed from the event sequence, before calculating rating.
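A small sketch of this cancellation step is shown below. It assumes that a negative event type -k cancels one occurrence of the positive event type k (for example, -3 cancels 3), which matches the event table but is not confirmed against the sifarish source. It also ignores event order. The function name is hypothetical.

```python
from collections import Counter


def cancel_negative_events(event_types):
    """Return remaining positive event counts after cancellation.

    Each negative event type -k removes one occurrence of the
    positive event type k, if any is left.
    """
    counts = Counter(t for t in event_types if t > 0)
    for t in event_types:
        if t < 0 and counts[-t] > 0:
            counts[-t] -= 1
    return {t: c for t, c in counts.items() if c > 0}


# Two cart additions, one removal, and one browse
remaining = cancel_negative_events([3, 3, -3, 7])
print(remaining)
```

The remaining counts can then be fed into the rating calculation as if the cancelled events had never happened.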
Wrapping Up
We have gone through a simple heuristic based process to convert click stream data to implicit rating. Beyond recommendation, the implicit rating can potentially be used for other purposes. One example is targeted personalized marketing.
To run the example, please refer to the Implicit Rating Predictor section of this tutorial document.
For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.
Hi Pranab, my name is Archie and I am working in a German company. I am tasked with building a recommendation engine that takes as input an item’s position in search and its click-through rate to deliver a score for the item that I can later use for sorting in search. I think I can map an item position to user engagement events and then use sifarish to get a rating. Do you think this is the best way to go about my task?
Archie, you can use sifarish. You could model your events as follows. Let’s say your search results are broken into pages, each page containing 10 items. As an example, the events in increasing order of affinity could be 1) item in 3rd page or later 2) item in 2nd page 3) item in 1st page 4) item clicked, irrespective of location in search result. For any item, events later in my list will supersede earlier events. That’s the logic of the map reduce.
For each such event you could define scores by number of occurrences, as shown in the sample JSON. Then you run the implicit rating generator map reduce.
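One hypothetical way to encode that scheme as event types is sketched below. The numbering is an assumption, chosen so that a smaller code means stronger engagement, as in the event table of the post. The function name and the page size default are for illustration only.

```python
def search_event_type(position, clicked, page_size=10):
    """Map a 1-based search result position to a hypothetical event type.

    1 = clicked, 2 = shown on 1st page, 3 = shown on 2nd page,
    4 = shown on 3rd page or later.
    """
    if clicked:
        return 1
    page = (position - 1) // page_size + 1
    if page == 1:
        return 2
    if page == 2:
        return 3
    return 4


print(search_event_type(5, clicked=False))
print(search_event_type(25, clicked=False))
print(search_event_type(40, clicked=True))
```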
I am looking for some source code/documentation for RedisSpout.withTupleFields but not finding any. Could you point me pls to the API ?
Shob
Check my project sifarish in github
Hi..
I see that you have manually come up with weights, i.e.
{ "eventType" : 1, "description" : "purchased", "scores" : [100] },
{ "eventType" : 2, "description" : "joined checkout", "scores" : [85] },
Now { "eventType" : 1, "description" : "purchased" } could just as well have a score of 95 or 90.
How can we select optimal weights for each event type?
Vij,
The weights are up to you, based on heuristics. Weights increase with the affinity of the event to conversion. For example, “joined checkout” will have a higher weight than “browsed product”.
Hi Pranab
I totally agree, “joined checkout” will have a higher weight than “browsed product”,
but higher by how much? What I am thinking is whether we can derive these weights using cross validation or grid search. Pointers to any such resources will be useful.
Vij
As I said, these are input configuration parameters. You set them to whatever you like, subject to the guidelines I provided.
These are not machine learning parameters for a prediction problem. Cross validation and grid search don’t make any sense here.
Hi, thanks for writing such a useful blog.
I tried to run ./brec.sh to generate events, following the guide’s “./brec.sh genHistEvent ” command, but when I ran “./brec.sh genHistEvent 1000 100 10”, I got “./brec.sh: line 58: $5: ambiguous redirect”. I would greatly appreciate your help.
@Natsu You need to provide the name of the file where you want to save the output as the last argument. So there will be an additional command line argument. The tutorial is incorrect. I will correct it and check it in.
Hi, I also tried this after reading through the script and guessing what $5 is, but another error came up.
Thanks for your attention to my humble question, and sorry for not showing gratitude before asking a further question.
[root@quickstart resource]# ./brec.sh genHistEvent 1000 3000 10 10
generating historical event data
./engage.rb:139: undefined method `uuid' for SecureRandom:Module (NoMethodError)
from ./engage.rb:133:in `upto'
from ./engage.rb:133
[root@quickstart resource]# vim brec.sh
[1]+ Stopped vim brec.sh
[root@quickstart resource]# ./brec.sh genHistEvent 1000 3000 10 somefile
generating historical event data
./engage.rb:139: undefined method `uuid' for SecureRandom:Module (NoMethodError)
from ./engage.rb:133:in `upto'
from ./engage.rb:133
[root@quickstart resource]# fg
vim brec.sh
[1]+ Stopped vim brec.sh
[root@quickstart resource]# ./brec.sh genHistEvent 1000 3000 10 9
generating historical event data
./engage.rb:139: undefined method `uuid' for SecureRandom:Module (NoMethodError)
from ./engage.rb:133:in `upto'
from ./engage.rb:133
It may have something to do with your Ruby version. I have responded in github. Please use github for this issue.
What ?