Association Mining with Improved Apriori Algorithm

Association mining solves many real life problems e.g., frequent items bought together, songs frequently listened together in one session etc. Apriori is a popular algorithm for mining frequent items sets. In this post, we will go over a Hadoop based implementation of Apriori available in my open source project avenir. Frequent items mining results can also be used for collaborative filtering based recommendation and rule mining.

We will use retail sales data with a twist. Our interest will be to mine products frequently bought together from sales data during some special events e.g. natural disaster or some holiday season.

Quick Note on Apriori

Apriori is a bottom up iterative algorithm where we find frequent 1 item sets and then iteratively build on it to find frequent 2 item sets and so forth. Essentially, we generate candidate item set of length n from frequent items sets of length (n-1).

We need to get some terminology taken care of. Support is the fraction of transactions in which an item set appears. Only when the support is above some user defined threshold, the item set is considered to be frequent.

sup(x) = Σ t(x) / Σ t
t(x) = transaction that contains item set x
t = transaction

The algorithm can be terminated in two ways, we can either can set a maximum frequent item set size or continue until we don’t find any frequent item set meeting the threshold support criteria.

Apripori algorithm is known to have combinatorial explosion for higher order frequent items. It may generate vast number of item sets which are doomed to fail the support threshold test. We are using an improved Apriori algorithm which generates items sets which have a much better chance of surviving the support threshold test.

Retail Sales

The scenario we are going analyse involves a storm or other predictable natural disaster and we are interested in finding items frequently bought together during such times. The retailer could the mine historical data during such disasters in past. The result could be used to replenish stock and to alter the store layout during some predicted and upcoming natural event.

One item frequent item is useful for gauging demand and multiple item frequent item set for planning store layout. An etailer could use such knowledge to pre pick items in warehouse and store them in a holding area to facilitate faster delivery.

The sales transaction data consists of the following fields. There may be other fields before the item list, which will be ignored.

Transaction ID
Time stamp
List of item IDs

Here is some sample data consisting of the fields listed above. the list of itemID starts from the 3rd field.

455L42I5G72A,1446864190,7174K247ZK,042H34WPFA,BGPME786Z7,UZPMUFI85K,389T1OUZHO,Y68360O5Y5,4323PML3V0,9Y0ZPLJ49J,B2UD3DFYX0,6N20FG3GFZ,O0QQ3Z4W2K
V5Q4ZVG824BZ,1446864343,GDBNUEK017,J9MTMS866P,E06H61A603,B13PKLB66N,1U0WVD8Z44,K5S71X6NFX,TDY9ZAV1NC
BCLCXJ52U6NP,1446864360,00GK7WX87I,T7C88LWU0Q,L0JPV5V3J4,4ML81L5B7W,G02RXOMRV8
JKY3EVLIP4IS,1446864559,GDBNUEK017,J9MTMS866P,E97V9QHL6S
7W820F921M77,1446864584,P6B0WW1AI0,LYK3359MVA,X631ZEHBG4,J9K46N0XR7,O0NCVO3D8T,F33AFKZR1R,C62SLG5FF3,2B5J0CHQ46,3VT3LTBFN1

Temporal Filtering Map Reduce

The first step is processing is filtering the data based on the time range of interest. The Map Reduce class TemporalFilter which consists only of a mapper does the job. The time range of interest is specified through configuration parameter.

Various other timing related filtering parameters can be specified e.g. weekend days, month of year etc. There are various other choices for time range and seasonal cycles. The output of this Map Reduce will be a subset of the original data, filtered according to the specified date time range parameter.

Frequent Item Set Map Reduce

The Apriori based implementation is in the Map Reduce class FrequentItemsApriori. The mapper generates candidate item sets and associated transactions. The item set is the mapper output key and the set of transactions is the value. Emitting the transaction set as the mapper value enables us to perform the optimization on Apriori as mentioned earlier.

The reducer applies minimum support threshold and retains only those items sets with support above the threshold. The Map Reduce job is run repeatedly each time generating frequent items of size 1 more than the previous item set as follows

Generate 1 item frequent items sets and the associated transactions
Generate n item item candidate item sets based on (n-1) item frequent item sets.
Generate n item item frequent item sets from n item item candidate item sets

Here are the details of generating n item candidate items sets based on (n-1) item frequent item sets, which the step 2 above.

For each transaction
For each (n -1) item frequent items set
For each item in the transaction under consideration
If the item does not belong to the frequent item set and the current transaction is one of the transactions associated with the frequent item set, then generate a n item candidate items set by adding the current item to the (n-1) frequent item set.

The intuition behind the candidate generation is that only the transactions that have (n-1) item frequent item sets are capable of generating potential n item frequent item sets. This optimization prevents unnecessary candidate item set generation.

Here is the output for 1 item, 2 items and 3 items frequent item set output. We stopped our iteration with items sets of 3 items. The last field is the support for the items set.

042H34WPFA,0.189
7174K247ZK,0.186
BGPME786Z7,0.182
LYK3359MVA,0.205
P6B0WW1AI0,0.202
X631ZEHBG4,0.199


042H34WPFA,7174K247ZK,0.172
042H34WPFA,BGPME786Z7,0.170
7174K247ZK,BGPME786Z7,0.171
LYK3359MVA,P6B0WW1AI0,0.191
LYK3359MVA,X631ZEHBG4,0.187
P6B0WW1AI0,X631ZEHBG4,0.187

042H34WPFA,7174K247ZK,BGPME786Z7,0.163
LYK3359MVA,P6B0WW1AI0,X631ZEHBG4,0.181

As expected, we get fewer frequent items sets as the the size of the item set goes up. The support level also goes down with increasing item set size.

Rule Mining

Frequent items results could be leveraged to mine rules from data. A rule has an antecedent and consequent, not necessarily implying any causal relationship between them.

To mine rules, a frequent item set (z) in split into two non empty subsets x and y. Depending on the size of x there are many possible ways of splitting into sub sets and all are considered for potential rules.

The candidate rule x -> y becomes a rule if x -> y has high confidence and it meets the minimum confidence threshold requirement. Confidence is defined as below.

conf(x -> y) = Σ t(x,y) / Σ t(x)
t(x,y) = transaction that contains item set x and y
t(x) = transaction that contains item set x

The value of confidence is always less than 1. It’s value will be high when x and y co occur most.

Collaborative Filtering based Recommendation

Frequent item sets can be used in item based collaborative filtering. Inter item correlation is a critical input for the collaborative filtering algorithm.Generally Pearson or Cosine correlation is used.

Support values for 2 item frequent sets could be considered as a measure of the correlation between the 2 items and serves as an alternative to Pearson or Cosine correlation.

Summing Up

We have gone through a Hadoop Map Reduce implementation of Apriori algorithm for frequent item set. A tutorial document for running the example in this post is available.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.

View all posts by Pranab →