Association mining solves many real life problems e.g., frequent items bought together, songs frequently listened together in one session etc. Apriori is a popular algorithm for mining frequent items sets. In this post, we will go over a Hadoop based implementation of Apriori available in my open source project avenir. Frequent items mining results can also be used for collaborative filtering based recommendation and rule mining.
We will use retail sales data with a twist. Our interest will be to mine products frequently bought together from sales data during some special events e.g. natural disaster or some holiday season.
Quick Note on Apriori
Apriori is a bottom up iterative algorithm where we find frequent 1 item sets and then iteratively build on it to find frequent 2 item sets and so forth. Essentially, we generate candidate item set of length n from frequent items sets of length (n-1).
We need to get some terminology taken care of. Support is the fraction of transactions in which an item set appears. Only when the support is above some user defined threshold, the item set is considered to be frequent.
The algorithm can be terminated in two ways, we can either can set a maximum frequent item set size or continue until we don’t find any frequent item set meeting the threshold support criteria.
Apripori algorithm is known to have combinatorial explosion for higher order frequent items. It may generate vast number of item sets which are doomed to fail the support threshold test. We are using an improved Apriori algorithm which generates items sets which have a much better chance of surviving the support threshold test.
The scenario we are going analyse involves a storm or other predictable natural disaster and we are interested in finding items frequently bought together during such times. The retailer could the mine historical data during such disasters in past. The result could be used to replenish stock and to alter the store layout during some predicted and upcoming natural event.
One item frequent item is useful for gauging demand and multiple item frequent item set for planning store layout. An etailer could use such knowledge to pre pick items in warehouse and store them in a holding area to facilitate faster delivery.
The sales transaction data consists of the following fields. There may be other fields before the item list, which will be ignored.
- Transaction ID
- Time stamp
- List of item IDs
Here is some sample data consisting of the fields listed above. the list of itemID starts from the 3rd field.
455L42I5G72A,1446864190,7174K247ZK,042H34WPFA,BGPME786Z7,UZPMUFI85K,389T1OUZHO,Y68360O5Y5,4323PML3V0,9Y0ZPLJ49J,B2UD3DFYX0,6N20FG3GFZ,O0QQ3Z4W2K V5Q4ZVG824BZ,1446864343,GDBNUEK017,J9MTMS866P,E06H61A603,B13PKLB66N,1U0WVD8Z44,K5S71X6NFX,TDY9ZAV1NC BCLCXJ52U6NP,1446864360,00GK7WX87I,T7C88LWU0Q,L0JPV5V3J4,4ML81L5B7W,G02RXOMRV8 JKY3EVLIP4IS,1446864559,GDBNUEK017,J9MTMS866P,E97V9QHL6S 7W820F921M77,1446864584,P6B0WW1AI0,LYK3359MVA,X631ZEHBG4,J9K46N0XR7,O0NCVO3D8T,F33AFKZR1R,C62SLG5FF3,2B5J0CHQ46,3VT3LTBFN1
Temporal Filtering Map Reduce
The first step is processing is filtering the data based on the time range of interest. The Map Reduce class TemporalFilter which consists only of a mapper does the job. The time range of interest is specified through configuration parameter.
Various other timing related filtering parameters can be specified e.g. weekend days, month of year etc. There are various other choices for time range and seasonal cycles. The output of this Map Reduce will be a subset of the original data, filtered according to the specified date time range parameter.
Frequent Item Set Map Reduce
The Apriori based implementation is in the Map Reduce class FrequentItemsApriori. The mapper generates candidate item sets and associated transactions. The item set is the mapper output key and the set of transactions is the value. Emitting the transaction set as the mapper value enables us to perform the optimization on Apriori as mentioned earlier.
The reducer applies minimum support threshold and retains only those items sets with support above the threshold. The Map Reduce job is run repeatedly each time generating frequent items of size 1 more than the previous item set as follows
- Generate 1 item frequent items sets and the associated transactions
- Generate n item item candidate item sets based on (n-1) item frequent item sets.
- Generate n item item frequent item sets from n item item candidate item sets
Here are the details of generating n item candidate items sets based on (n-1) item frequent item sets, which the step 2 above.
- For each transaction
- For each (n -1) item frequent items set
- For each item in the transaction under consideration
- If the item does not belong to the frequent item set and the current transaction is one of the transactions associated with the frequent item set, then generate a n item candidate items set by adding the current item to the (n-1) frequent item set.
The intuition behind the candidate generation is that only the transactions that have (n-1) item frequent item sets are capable of generating potential n item frequent item sets. This optimization prevents unnecessary candidate item set generation.
Here is the output for 1 item, 2 items and 3 items frequent item set output. We stopped our iteration with items sets of 3 items. The last field is the support for the items set.
042H34WPFA,0.189 7174K247ZK,0.186 BGPME786Z7,0.182 LYK3359MVA,0.205 P6B0WW1AI0,0.202 X631ZEHBG4,0.199 042H34WPFA,7174K247ZK,0.172 042H34WPFA,BGPME786Z7,0.170 7174K247ZK,BGPME786Z7,0.171 LYK3359MVA,P6B0WW1AI0,0.191 LYK3359MVA,X631ZEHBG4,0.187 P6B0WW1AI0,X631ZEHBG4,0.187 042H34WPFA,7174K247ZK,BGPME786Z7,0.163 LYK3359MVA,P6B0WW1AI0,X631ZEHBG4,0.181
As expected, we get fewer frequent items sets as the the size of the item set goes up. The support level also goes down with increasing item set size.
Frequent items results could be leveraged to mine rules from data. A rule has an antecedent and consequent, not necessarily implying any causal relationship between them.
To mine rules, a frequent item set (z) in split into two non empty subsets x and y. Depending on the size of x there are many possible ways of splitting into sub sets and all are considered for potential rules.
The candidate rule x -> y becomes a rule if x -> y has high confidence and it meets the minimum confidence threshold requirement. Confidence is defined as below.
The value of confidence is always less than 1. It’s value will be high when x and y co occur most.
Collaborative Filtering based Recommendation
Frequent item sets can be used in item based collaborative filtering. Inter item correlation is a critical input for the collaborative filtering algorithm.Generally Pearson or Cosine correlation is used.
Support values for 2 item frequent sets could be considered as a measure of the correlation between the 2 items and serves as an alternative to Pearson or Cosine correlation.
We have gone through a Hadoop Map Reduce implementation of Apriori algorithm for frequent item set. A tutorial document for running the example in this post is available.