The topic of this post is of interest to any online retailer. Shopping cart abandonment is dreaded by online stores, where it is more common than in brick-and-mortar stores. In this post I will discuss Hadoop-based checkout abandonment analysis using my open source web analytics project visitante, hosted on GitHub. As we will see, some other important metrics are also generated as part of this analysis.
A user may abandon a shopping cart for various reasons, including feeling uncomfortable about the price tag, wanting to compare prices on other sites, or simply being lazy and deciding to come back later to complete the transaction.
The input to the analysis is W3C compliant web server log files. Details can be found in my earlier web analytic post. The analysis is performed with two MR jobs: SessionSummarizer and UserSessionSummary. The basic idea is to collect all page visits by session, with the page visits for each session sorted by visit time.
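The grouping and sorting idea can be illustrated outside Hadoop with a minimal Python sketch. This is not the visitante implementation: the function name, the tuple layout `(session_id, visit_time, url)`, and the sample visits are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def group_visits_by_session(visits):
    """Group (session_id, visit_time, url) tuples by session.

    Returns a dict mapping each session ID to its page URLs,
    ordered by visit time.
    """
    sessions = defaultdict(list)
    for session_id, visit_time, url in visits:
        sessions[session_id].append((visit_time, url))
    # Sort each session's visits chronologically before dropping the timestamps.
    return {sid: [url for _, url in sorted(pages)]
            for sid, pages in sessions.items()}

# Hypothetical page visits, deliberately out of order.
visits = [
    ("S1", 30, "/addToCart/A1"),
    ("S2", 10, "/product/B7"),
    ("S1", 10, "/product/A1"),
    ("S1", 20, "/search"),
]
pages_by_session = group_visits_by_session(visits)
```

In a real MR job the framework does the grouping and, with a composite key, the sorting; here both are done in memory for clarity.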
We also define the page sequence for the checkout process through a configuration parameter. A page URL in the sequence can also be specified as a regular expression.
For example, you may have introduced a marketing campaign such as “Deal of the day”, and you may want to track the follow-through from the link to the eventual purchase.
We essentially compare the checkout page sequence with the page sequence for a session. The following 3 scenarios are possible with respect to the match.
|Checkout flow not entered||There is no intersection between the checkout sequence and the page visit sequence, i.e., no interaction with the checkout process|
|Checkout flow entered but not completed||The page visit sequence contains only part of the checkout sequence. Considered to be an abandoned checkout|
|Checkout flow entered and completed||The checkout sequence is fully contained within the page visit sequence. Considered to be a successful checkout|
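The three scenarios above can be sketched in Python as follows. This is a simplified illustration, not the code used in visitante. The checkout URL patterns (other than `/addToCart/`, which appears in the logs) and the status names are hypothetical, and the sketch assumes the checkout flow is always entered at its first page and that the steps are visited in order.

```python
import re

# Hypothetical checkout sequence; each step is a regular expression.
CHECKOUT_SEQUENCE = [r"/addToCart/\w+", r"/checkout", r"/payment", r"/orderConfirm"]

def checkout_status(session_pages, checkout_sequence=CHECKOUT_SEQUENCE):
    """Classify a session's time-ordered page list against the checkout sequence."""
    patterns = [re.compile(p) for p in checkout_sequence]
    matched = 0  # number of checkout steps matched so far, in order
    for page in session_pages:
        if matched < len(patterns) and patterns[matched].fullmatch(page):
            matched += 1
    if matched == 0:
        return "NOT_ENTERED"   # no checkout page reached
    if matched == len(patterns):
        return "COMPLETED"     # every checkout step visited in order
    return "ABANDONED"         # entered the flow but stopped part way

browsing = checkout_status(["/product/X1", "/search"])
abandoned = checkout_status(["/product/X1", "/addToCart/X1", "/checkout"])
completed = checkout_status(["/addToCart/X1", "/checkout", "/payment", "/orderConfirm"])
```

Matching the steps in order means other pages can be interleaved with the checkout pages without affecting the result.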
SessionSummarizer Map Reduce
This map reduce job takes log data as input and groups it by session ID. The mapper output key consists of the session ID and page visit time. The job also computes the checkout flow status. Its output consists of the following.
Here is some sample output.
2BTZOCY42DT15ANB,F9YCR287DXG8,3,1336303827000,229,/addToCart/NJYQ7UCV,0,http://www.google.com
2BV28U2RS5C9RENW,6BJVC65DUTV3,2,1337915930000,194,/product/ZND9R09Z,0,http://www.myhealth.com
2BV5RUHP3DAOCGL5,TQW6X3BDIYM1,5,1337808408000,296,/product/NBGFWW81,0,http://www.google.com
2BVPDAP53JZAC8MB,CVEDAZMAO78R,2,1337915166000,80,/product/4HOK5MOI,0,http://www.myhealth.com
2BW2ZITX4SGB7A0S,3331234N1ZKC,4,1338037974000,310,/addToCart/LFN69GLC,1,http://www.myhealth.com
2BWY4042Z2EU011J,3K22BXB2V00V,2,1337291518000,84,/product/SR3ON54E,0,http://www.google.com
2BZI4VZM4XT43K7V,0YFI0U10Z4LM,2,1336174755000,192,/product/DXM3ZF0N,0,http://www.google.com
2C11GC023MC12JFA,MP22NW0DLW34,7,1336499993000,524,/addToCart/GDCSVGP4,0,http://www.google.com
2C290I13A1USIR2C,84E0M6FH9VQF,6,1336486745000,399,/signup,1,http://www.myhealth.com
2C2FBN3AK6EO3ZJP,0TQMHGX5AZOS,7,1336086821000,493,/product/CG7T015N,2,http://www.facebook.com
2C301FAZ29FQWQJH,7BGYHD344SQN,5,1337224728000,446,/product/TKS93ALL,0,http://www.myhealth.com
2C3M0XHOCK3S1AXH,Z25ZQE2NZM3W,5,1336015288000,197,/product/V8SS47S8,0,http://www.facebook.com
For the flow status (7th field in the output), the values are as follows. In the sample output above, there is only one case of checkout completion.
Once we have this data, we could try to correlate it with other parameters (e.g., referrer URL, time spent in the session, products viewed) to get more insight into the checkout process.
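As a rough illustration of such a correlation, the sketch below tallies the flow status per referrer over a few of the sample records. It relies only on the flow status being the 7th field and on the referrer URL being the last field; the function name is hypothetical, and the sketch does not interpret the status codes.

```python
from collections import Counter

# Four records taken from the sample SessionSummarizer output.
SAMPLE = """\
2BTZOCY42DT15ANB,F9YCR287DXG8,3,1336303827000,229,/addToCart/NJYQ7UCV,0,http://www.google.com
2BW2ZITX4SGB7A0S,3331234N1ZKC,4,1338037974000,310,/addToCart/LFN69GLC,1,http://www.myhealth.com
2C290I13A1USIR2C,84E0M6FH9VQF,6,1336486745000,399,/signup,1,http://www.myhealth.com
2C2FBN3AK6EO3ZJP,0TQMHGX5AZOS,7,1336086821000,493,/product/CG7T015N,2,http://www.facebook.com
"""

def status_by_referrer(lines):
    """Count sessions per (referrer, flow status) pair."""
    counts = Counter()
    for line in lines:
        # Split into at most 8 fields so commas inside the referrer URL survive.
        fields = line.strip().split(",", 7)
        if len(fields) < 8:
            continue  # skip blank or malformed records
        flow_status, referrer = fields[6], fields[7]
        counts[(referrer, flow_status)] += 1
    return counts

counts = status_by_referrer(SAMPLE.splitlines())
```

On the full output, the same tally would show which referrers bring visitors who enter the checkout flow and which bring mostly browsers.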
Although our focus is on checkout abandonment, some other important and insightful metrics can be derived. Here are some examples:
- Bounce rate, i.e., the fraction of sessions that end after the landing page
- Average session duration
- Site penetration, i.e., average number of pages visited per session
- User visit time distribution in a 24 hour period
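A minimal sketch of how these metrics might be computed from the session summary records is shown below. The field meanings are assumptions inferred from the sample output, not confirmed by the job's documentation: the 3rd field is taken as the number of pages visited, the 4th as the session start time in epoch milliseconds, and the 5th as the time spent in the session. The hour of day is computed in UTC, which is also an assumption.

```python
from collections import Counter
from datetime import datetime, timezone

# Four records taken from the sample SessionSummarizer output.
SAMPLE = """\
2BTZOCY42DT15ANB,F9YCR287DXG8,3,1336303827000,229,/addToCart/NJYQ7UCV,0,http://www.google.com
2BV28U2RS5C9RENW,6BJVC65DUTV3,2,1337915930000,194,/product/ZND9R09Z,0,http://www.myhealth.com
2BV5RUHP3DAOCGL5,TQW6X3BDIYM1,5,1337808408000,296,/product/NBGFWW81,0,http://www.google.com
2BVPDAP53JZAC8MB,CVEDAZMAO78R,2,1337915166000,80,/product/4HOK5MOI,0,http://www.myhealth.com
"""

def session_metrics(lines):
    """Compute session-level metrics; returns None if there are no valid records."""
    page_counts, durations, hours = [], [], Counter()
    for line in lines:
        fields = line.strip().split(",", 7)
        if len(fields) < 8:
            continue  # skip blank or malformed records
        page_counts.append(int(fields[2]))   # assumed: pages visited
        start_ms = int(fields[3])            # assumed: session start, epoch ms
        durations.append(int(fields[4]))     # assumed: time spent in session
        start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
        hours[start.hour] += 1
    n = len(page_counts)
    if n == 0:
        return None
    return {
        # A bounce is a session with a single page visit.
        "bounce_rate": sum(1 for p in page_counts if p == 1) / n,
        "avg_duration": sum(durations) / n,
        "avg_pages": sum(page_counts) / n,
        "sessions_by_hour": dict(hours),
    }

metrics = session_metrics(SAMPLE.splitlines())
```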
UserSessionSummary Map Reduce
This map reduce job takes the output of the previous job and generates user-level data by aggregating all sessions for a user. The data is grouped by user ID. The mapper key consists of the user ID and session start time. The job also provides information about whether a user converted. The output consists of the following.
The conversion status is 1 if the user completed the checkout flow and made a purchase in one of the sessions. Here is some sample output.
1CKWW5KM8QB5,http://www.facebook.com,0,4,239,0,5,320297000
1CNO4M4WQ4CG,http://www.google.com,0,3,260,0,7,327496833
1D0SZ6WUTMHJ,http://www.google.com,0,4,381,0,2,89114000
1D27D0FEL156,http://www.myhealth.com,0,4,293,0,5,234500000
1D2J3PODJ04U,http://www.myhealth.com,0,5,397,0,6,426026800
1D2MCNFC4X11,http://www.google.com,0,3,213,0,4,572164666
1D3M2VUC1OZI,http://www.google.com,0,8,509,0,3,309378000
1D4JFK1ARITD,http://www.facebook.com,4,3,268,1,6,474538200
1D5S1H834P4V,http://www.facebook.com,0,1,39,0,4,616613333
1D5SX8SWY202,http://www.google.com,0,2,148,0,4,595993333
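The user-level aggregation, including the conversion status, can be sketched in Python as below. This is only an illustration of the idea, not the UserSessionSummary code. The input tuples are hypothetical, the output contains just two illustrative fields rather than the job's actual record layout, and the status code 2 for a completed checkout is an assumption inferred from the session sample output.

```python
from collections import defaultdict

COMPLETED = 2  # assumed status code for a completed checkout

def summarize_users(sessions):
    """Aggregate (user_id, session_start_ms, flow_status) tuples per user."""
    by_user = defaultdict(list)
    for user_id, start_ms, flow_status in sessions:
        by_user[user_id].append((start_ms, flow_status))
    summary = {}
    for user_id, user_sessions in by_user.items():
        user_sessions.sort()  # order the user's sessions by start time
        # The user is converted if any session completed the checkout flow.
        converted = any(status == COMPLETED for _, status in user_sessions)
        summary[user_id] = {
            "num_sessions": len(user_sessions),
            "converted": 1 if converted else 0,
        }
    return summary

# Hypothetical sessions: U1 browses first and buys later, U2 abandons.
sessions = [
    ("U1", 3000, 2),
    ("U1", 1000, 0),
    ("U2", 2000, 1),
]
summary = summarize_users(sessions)
```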
In the output above we see one case of checkout completion. Some important metrics that can be derived from this data are as follows.
This analysis relies on the presence of cookies. However, cookies can be disabled by the user, or the user may clear the cookies stored on the client machine. We also lose track of a user's context when the user switches from one machine to another.
In a future post I will discuss some of the Hadoop-based predictive analytic techniques that have been implemented in visitante for web click stream data.