Association Mining with Improved Apriori Algorithm


Association mining solves many real life  problems e.g., frequent items bought together, songs frequently listened together in one session etc. Apriori is a popular algorithm for mining frequent items sets. In this post, we will go over a Hadoop based implementation of Apriori available in my open source project avenir.  Frequent items mining results can also be used for collaborative filtering based recommendation and rule mining.

We will use retail sales data with a twist. Our interest will be to mine products frequently bought together from sales data during some special events e.g. natural disaster or some holiday season.

Quick Note on Apriori

Apriori is a bottom up iterative algorithm where we find frequent 1 item sets and then iteratively build on it to find frequent 2 item sets and so forth. Essentially, we generate candidate item set of length n from frequent items sets of length (n-1).

We need to get some terminology taken care of. Support is the fraction of transactions in which an item set appears. Only when the support is above some user defined threshold, the item set is considered to be frequent.

sup(x) = Σ t(x) / Σ t
t(x) = transaction that contains item set x
t = transaction

The algorithm can be terminated in two ways, we can either can set a maximum frequent item set size or continue until we don’t find any frequent item set meeting the threshold support criteria.

Apripori algorithm is known to have combinatorial explosion for higher order frequent items. It may generate vast number of item sets which are doomed to fail the support threshold test. We are using an improved Apriori algorithm which generates items sets which have a much better chance of surviving the support threshold test.

Retail Sales

The scenario we are going analyse involves a storm or other predictable natural disaster and we are interested in finding items frequently bought together during such times. The retailer could the mine historical data during such disasters in past. The result could be used to replenish stock and to alter the store layout during some predicted and upcoming  natural event.

One item frequent item is useful for gauging demand and multiple item frequent item set for planning store layout. An etailer  could use such knowledge to pre pick items in warehouse and store them in a holding area to facilitate faster delivery.

The sales transaction data consists of the following fields. There may be other fields before the item list, which will be ignored.

  1. Transaction ID
  2. Time stamp
  3. List of item  IDs

Here is some sample data consisting of the fields listed above. the list of itemID starts from the 3rd field.

455L42I5G72A,1446864190,7174K247ZK,042H34WPFA,BGPME786Z7,UZPMUFI85K,389T1OUZHO,Y68360O5Y5,4323PML3V0,9Y0ZPLJ49J,B2UD3DFYX0,6N20FG3GFZ,O0QQ3Z4W2K
V5Q4ZVG824BZ,1446864343,GDBNUEK017,J9MTMS866P,E06H61A603,B13PKLB66N,1U0WVD8Z44,K5S71X6NFX,TDY9ZAV1NC
BCLCXJ52U6NP,1446864360,00GK7WX87I,T7C88LWU0Q,L0JPV5V3J4,4ML81L5B7W,G02RXOMRV8
JKY3EVLIP4IS,1446864559,GDBNUEK017,J9MTMS866P,E97V9QHL6S
7W820F921M77,1446864584,P6B0WW1AI0,LYK3359MVA,X631ZEHBG4,J9K46N0XR7,O0NCVO3D8T,F33AFKZR1R,C62SLG5FF3,2B5J0CHQ46,3VT3LTBFN1

 

Temporal Filtering Map Reduce

The first step is processing is filtering the data based on the time range of interest.  The Map Reduce class TemporalFilter which consists only of a mapper does the job. The time range of interest is specified through configuration parameter.

Various other timing related filtering parameters can be specified e.g. weekend days, month of year etc. There are  various other choices for time range and seasonal cycles. The output of this Map Reduce will be a subset of the original data, filtered according to the specified date time range parameter.

Frequent Item Set Map Reduce

The Apriori based implementation is in the Map Reduce class FrequentItemsApriori. The mapper generates candidate item sets and associated transactions. The item set is the mapper output key and the set of transactions is the value. Emitting the transaction set as the mapper value enables us to perform the optimization on Apriori as mentioned earlier.

The reducer applies minimum support threshold and retains only those  items sets with support above the threshold. The Map Reduce job is run repeatedly each time generating frequent items of size 1 more than the previous item set as follows

  1. Generate 1 item frequent items sets and the associated transactions
  2. Generate n item item candidate item sets based on (n-1) item frequent item sets.
  3. Generate n item item frequent item sets from n item item candidate item sets

 

Here are the details of generating n item candidate items sets based on (n-1) item frequent item sets, which the step 2 above.

  1. For each transaction
  2. For each (n -1) item frequent items set
  3. For each item in the transaction under consideration
  4. If the item does not belong to the frequent item set and the current transaction is one of the transactions associated with the frequent item set, then generate a n item candidate items set by adding the current item to the (n-1) frequent item set.

 

The intuition behind the candidate generation is that only the transactions that have (n-1) item frequent item sets are capable of generating potential n item frequent item sets. This optimization prevents unnecessary candidate item set  generation.

Here is the output for 1 item, 2 items and 3 items frequent item set output. We stopped our iteration with items sets of 3 items. The last field is the support for the items set.

042H34WPFA,0.189
7174K247ZK,0.186
BGPME786Z7,0.182
LYK3359MVA,0.205
P6B0WW1AI0,0.202
X631ZEHBG4,0.199


042H34WPFA,7174K247ZK,0.172
042H34WPFA,BGPME786Z7,0.170
7174K247ZK,BGPME786Z7,0.171
LYK3359MVA,P6B0WW1AI0,0.191
LYK3359MVA,X631ZEHBG4,0.187
P6B0WW1AI0,X631ZEHBG4,0.187

042H34WPFA,7174K247ZK,BGPME786Z7,0.163
LYK3359MVA,P6B0WW1AI0,X631ZEHBG4,0.181

As expected, we get fewer frequent items sets as the the size of the item set goes up. The support level also goes down with increasing item set size.

Rule Mining

Frequent items results could be leveraged to mine rules from data. A rule has an antecedent and consequent, not necessarily implying any causal relationship between them.

To mine rules, a frequent item set (z)  in split into two non empty subsets x and y. Depending on the size of x there are many possible ways of splitting into sub sets and all are considered for potential rules.

The candidate rule x -> y becomes a rule if x -> y has high confidence and it meets the minimum confidence threshold requirement. Confidence is defined as below.

conf(x -> y) = Σ t(x,y) / Σ t(x)
t(x,y) = transaction that contains item set x and y
t(x) = transaction that contains item set x

The value of confidence is always less than 1. It’s value will be high when x and y co occur most.

Collaborative Filtering based Recommendation

Frequent item sets can be used in item based collaborative filtering. Inter item correlation is a critical input for the collaborative filtering algorithm.Generally Pearson or Cosine correlation is used.

Support values for 2 item frequent sets could be considered as a measure of the correlation between the 2 items and serves as an alternative to Pearson or Cosine correlation.

Summing Up

We have gone through a Hadoop Map Reduce implementation of Apriori algorithm for frequent item set. A tutorial document for running the example in this post is available.

 

 

 

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Association Mining, Big Data, Data Mining, Data Science, Hadoop and Map Reduce, Marketing Analytic, Rule Mining and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s