The topic of this post is of interest to any online retailer. Shopping cart abandonment is dreaded by online stores, and it is more common online than in brick-and-mortar stores. In this post I will discuss Hadoop-based checkout abandonment analysis, using my open source web analytics project visitante, hosted on GitHub. As we will see, some other important metrics are also generated as part of this analysis.
A user may abandon a shopping cart for various reasons: feeling uncomfortable about the price tag, wanting to compare prices on other sites, or simply being lazy and deciding to come back later to complete the transaction.
Solution
The input to the analysis is W3C compliant web server log files. Details can be found in my earlier web analytic post. The analysis is performed with two map reduce (MR) jobs, SessionSummarizer and UserSessionSummary. The basic idea is to collect all page visits by session, with the page visits for each session sorted by visit time.
We also define the page sequence for the checkout process through a configuration parameter, as shown below. A page URL in the sequence can also be specified as a regular expression.
flow.sequence=/shoppingCart,/checkOut,/signin,/signup,/billing,/confirmShipping,/placeOrder
Although our focus is on the checkout process, the solution discussed here could be used for detecting any predefined sequence of pages in a session.
For example, you may have introduced a “Deal of the day” marketing campaign and want to track follow-through from the link to the eventual purchase.
We essentially compare the checkout page sequence with the page sequence for a session. The following three scenarios are possible with respect to the match.
| Checkout flow not entered | There is no intersection between the checkout sequence and the page visit sequence, i.e., no interaction with the checkout process |
| Checkout flow entered but not completed | The page visit sequence contains only part of the checkout sequence. Considered to be an abandoned checkout |
| Checkout flow entered and completed | The checkout sequence is fully contained within the page visit sequence. Considered to be a successful checkout |
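The three scenarios can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the visitante implementation: the function name `flow_status`, the numeric codes, and the choice to require the flow pages in flow order for a completed checkout are all assumptions of this sketch. Each entry of the configured sequence is treated as a regular expression that must match the whole URL.

```python
import re

# Status codes used in this sketch (hypothetical names).
NOT_ENTERED, ABANDONED, COMPLETED = 0, 1, 2


def flow_status(visited_pages, flow_patterns):
    """Classify a time-ordered list of page URLs against a checkout flow."""
    compiled = [re.compile(p) for p in flow_patterns]

    # Entered: at least one visited page matches some page of the flow.
    entered = any(rx.fullmatch(page) for page in visited_pages for rx in compiled)
    if not entered:
        return NOT_ENTERED

    # Completed: every flow step shows up in the session, in flow order
    # (other pages may be interleaved). This ordering rule is an assumption.
    step = 0
    for page in visited_pages:
        if step < len(compiled) and compiled[step].fullmatch(page):
            step += 1
    return COMPLETED if step == len(compiled) else ABANDONED


flow = ("/shoppingCart,/checkOut,/signin,/signup,/billing,"
        "/confirmShipping,/placeOrder").split(",")

browsing_only = ["/product/A", "/addToCart/A"]
dropped_out = ["/product/A", "/shoppingCart", "/checkOut", "/signin"]
purchased = ["/product/A"] + flow

print(flow_status(browsing_only, flow))
print(flow_status(dropped_out, flow))
print(flow_status(purchased, flow))
```

A stricter or looser definition of "fully contained" (for example, treating sign in and sign up as alternatives) would only change the completion check, not the overall structure.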
SessionSummarizer Map Reduce
This map reduce job takes log data as input and groups it by session ID. The mapper output key consists of the session ID and the page visit time, which allows the page visits for a session to be sorted by visit time. The job also computes the checkout flow status. Its output consists of the following:
- User ID (or Cookie)
- Session ID
- Number of pages
- Session start time
- Time spent in the session
- Last page in the session
- Checkout flow status
- Referrer for the session if any
Here is some sample output.
2BTZOCY42DT15ANB,F9YCR287DXG8,3,1336303827000,229,/addToCart/NJYQ7UCV,0,http://www.google.com
2BV28U2RS5C9RENW,6BJVC65DUTV3,2,1337915930000,194,/product/ZND9R09Z,0,http://www.myhealth.com
2BV5RUHP3DAOCGL5,TQW6X3BDIYM1,5,1337808408000,296,/product/NBGFWW81,0,http://www.google.com
2BVPDAP53JZAC8MB,CVEDAZMAO78R,2,1337915166000,80,/product/4HOK5MOI,0,http://www.myhealth.com
2BW2ZITX4SGB7A0S,3331234N1ZKC,4,1338037974000,310,/addToCart/LFN69GLC,1,http://www.myhealth.com
2BWY4042Z2EU011J,3K22BXB2V00V,2,1337291518000,84,/product/SR3ON54E,0,http://www.google.com
2BZI4VZM4XT43K7V,0YFI0U10Z4LM,2,1336174755000,192,/product/DXM3ZF0N,0,http://www.google.com
2C11GC023MC12JFA,MP22NW0DLW34,7,1336499993000,524,/addToCart/GDCSVGP4,0,http://www.google.com
2C290I13A1USIR2C,84E0M6FH9VQF,6,1336486745000,399,/signup,1,http://www.myhealth.com
2C2FBN3AK6EO3ZJP,0TQMHGX5AZOS,7,1336086821000,493,/product/CG7T015N,2,http://www.facebook.com
2C301FAZ29FQWQJH,7BGYHD344SQN,5,1337224728000,446,/product/TKS93ALL,0,http://www.myhealth.com
2C3M0XHOCK3S1AXH,Z25ZQE2NZM3W,5,1336015288000,197,/product/V8SS47S8,0,http://www.facebook.com
The flow status is the 7th field in the output. In the sample output above, there is only one case of checkout completion. The possible values are as follows:
- Not entered (0)
- Entered and not completed (1)
- Entered and completed (2)
Once we have this data, we could try to correlate it with other parameters (e.g., referrer URL, time spent in the session, products viewed) to get more insight into the checkout process.
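To make the grouping step concrete, here is a minimal single-process sketch in Python of the kind of summary SessionSummarizer produces. It is not the visitante code. The input tuple layout, the function name `summarize_sessions`, the `status_fn` hook, taking the referrer from the first hit, and reporting time spent in seconds are all assumptions of this sketch, and the records in the usage example are made up.

```python
from itertools import groupby
from operator import itemgetter


def summarize_sessions(records, status_fn=lambda pages: 0):
    """Summarize page visits per session.

    records: iterable of (user_id, session_id, visit_time_ms, url, referrer).
    status_fn: maps the ordered page list to a flow status; the default is
    a placeholder that always reports "not entered".
    """
    # Sorting on (session ID, visit time) plays the role of the composite
    # mapper key: visits end up grouped by session and ordered by time.
    ordered = sorted(records, key=itemgetter(1, 2))
    summaries = []
    for session_id, group in groupby(ordered, key=itemgetter(1)):
        visits = list(group)
        first, last = visits[0], visits[-1]
        pages = [v[3] for v in visits]
        summaries.append((
            first[0],                      # user ID (or cookie)
            session_id,                    # session ID
            len(pages),                    # number of pages
            first[2],                      # session start time (epoch ms)
            (last[2] - first[2]) // 1000,  # time spent, seconds (assumed unit)
            pages[-1],                     # last page in the session
            status_fn(pages),              # checkout flow status
            first[4],                      # referrer of the first hit (assumed)
        ))
    return summaries


# Made-up log records, deliberately out of order.
records = [
    ("U1", "S1", 1336303927000, "/addToCart/X", "http://www.example.com/product/X"),
    ("U2", "S2", 1336303800000, "/product/Y", "http://www.google.com"),
    ("U1", "S1", 1336303827000, "/product/X", "http://www.google.com"),
]
for row in summarize_sessions(records):
    print(",".join(str(field) for field in row))
```

In the actual Hadoop job the sort happens in the shuffle phase rather than in memory, which is what makes the approach workable for large logs.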
Checkout abandonment is an important metric. It is the ratio of the number of sessions that abandoned the checkout process to the total number of sessions that entered the checkout process.
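As a small worked example, the ratio can be computed from the flow status field of the session summary records. The rows below are copied from the sample output shown earlier; the function name `abandonment_rate` is made up for this sketch, and splitting on commas assumes that no field contains a comma.

```python
def abandonment_rate(session_rows):
    """Abandoned checkouts (status 1) divided by entered checkouts (status 1 or 2)."""
    # The flow status is the 7th comma-separated field, i.e. index 6.
    statuses = [int(row.split(",")[6]) for row in session_rows]
    entered = sum(1 for s in statuses if s in (1, 2))
    abandoned = sum(1 for s in statuses if s == 1)
    # With no session entering the checkout flow the ratio is undefined.
    return abandoned / entered if entered else None


rows = [
    "2BTZOCY42DT15ANB,F9YCR287DXG8,3,1336303827000,229,/addToCart/NJYQ7UCV,0,http://www.google.com",
    "2BW2ZITX4SGB7A0S,3331234N1ZKC,4,1338037974000,310,/addToCart/LFN69GLC,1,http://www.myhealth.com",
    "2C290I13A1USIR2C,84E0M6FH9VQF,6,1336486745000,399,/signup,1,http://www.myhealth.com",
    "2C2FBN3AK6EO3ZJP,0TQMHGX5AZOS,7,1336086821000,493,/product/CG7T015N,2,http://www.facebook.com",
]
# Three of these sessions entered the checkout flow and two of them abandoned it.
print(abandonment_rate(rows))
```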
Although our focus is on checkout abandonment, some other important and insightful metrics can be derived. Here are some examples:
- Bounce rate, i.e., fraction of sessions that end after the landing page
- Average session duration
- Site penetration, i.e., average number of pages visited per session
- User visit time distribution in a 24-hour period
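These session-level metrics follow directly from the page count, start time, and time spent fields. Below is a minimal sketch in Python; the function name `session_metrics`, the input tuple layout, treating a one-page session as a bounce, and bucketing start times by UTC hour are assumptions of this sketch. The usage example takes its three tuples from the first three sample records shown earlier.

```python
from collections import Counter
from datetime import datetime, timezone


def session_metrics(sessions):
    """sessions: non-empty list of (num_pages, start_time_ms, time_spent)."""
    n = len(sessions)
    # A bounce is taken to be a session with a single page view.
    bounces = sum(1 for pages, _, _ in sessions if pages == 1)
    # Count sessions by hour of day of the session start (UTC assumed).
    by_hour = Counter(
        datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc).hour
        for _, start_ms, _ in sessions
    )
    return {
        "bounce_rate": bounces / n,
        "avg_duration": sum(spent for _, _, spent in sessions) / n,
        "avg_pages": sum(pages for pages, _, _ in sessions) / n,
        "sessions_by_hour": dict(by_hour),
    }


sample = [
    (3, 1336303827000, 229),
    (2, 1337915930000, 194),
    (5, 1337808408000, 296),
]
print(session_metrics(sample))
```

For a real site the hour buckets would more likely use the store's local time zone than UTC.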
UserSessionSummary Map Reduce
This map reduce job takes the output of the previous job and generates user-level data by aggregating all sessions for a user. The data is grouped by user ID. The mapper key consists of the user ID and the session start time. The job also reports whether the user converted. The output consists of the following:
- User ID (or Cookie)
- Referrer for the first session
- Number of sessions before conversion
- Average number of pages visited per session
- Average time spent per session
- Conversion status
- Number of sessions
- Average time gap between sessions
The conversion status is 1 if the user completed the checkout flow and made a purchase in one of the sessions. Here is some sample output:
1CKWW5KM8QB5,http://www.facebook.com,0,4,239,0,5,320297000
1CNO4M4WQ4CG,http://www.google.com,0,3,260,0,7,327496833
1D0SZ6WUTMHJ,http://www.google.com,0,4,381,0,2,89114000
1D27D0FEL156,http://www.myhealth.com,0,4,293,0,5,234500000
1D2J3PODJ04U,http://www.myhealth.com,0,5,397,0,6,426026800
1D2MCNFC4X11,http://www.google.com,0,3,213,0,4,572164666
1D3M2VUC1OZI,http://www.google.com,0,8,509,0,3,309378000
1D4JFK1ARITD,http://www.facebook.com,4,3,268,1,6,474538200
1D5S1H834P4V,http://www.facebook.com,0,1,39,0,4,616613333
1D5SX8SWY202,http://www.google.com,0,2,148,0,4,595993333
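The aggregation that produces user-level records of this shape can be sketched for a single user as follows. This is an illustration, not the visitante code: the function name `summarize_user`, the input tuple layout, integer averages, measuring gaps between session start times, and reporting 0 sessions before conversion for users who never convert are assumptions of this sketch.

```python
def summarize_user(user_id, sessions):
    """sessions: non-empty list of
    (start_time_ms, num_pages, time_spent, flow_status, referrer)."""
    # Sorting by start time mirrors the (user ID, session start time) key.
    ordered = sorted(sessions, key=lambda s: s[0])
    n = len(ordered)

    # Position of the first session with a completed checkout (status 2).
    first_conv = next((i for i, s in enumerate(ordered) if s[3] == 2), None)
    converted = 1 if first_conv is not None else 0
    sessions_before = first_conv if converted else 0

    # Gaps between consecutive session start times, in milliseconds.
    gaps = [later[0] - earlier[0] for earlier, later in zip(ordered, ordered[1:])]
    avg_gap = sum(gaps) // len(gaps) if gaps else 0

    return (
        user_id,                          # user ID (or cookie)
        ordered[0][4],                    # referrer for the first session
        sessions_before,                  # sessions before conversion
        sum(s[1] for s in ordered) // n,  # average pages per session
        sum(s[2] for s in ordered) // n,  # average time spent per session
        converted,                        # conversion status
        n,                                # number of sessions
        avg_gap,                          # average time gap between sessions
    )


# Made-up sessions for one user, deliberately out of order.
sessions = [
    (2000, 4, 100, 2, "http://www.google.com"),
    (0, 2, 50, 0, "http://www.facebook.com"),
    (1000, 3, 60, 1, "http://www.google.com"),
]
print(summarize_user("U1", sessions))
```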
In the output above we see one case of checkout completion. Some important metrics that can be derived from this data are as follows:
- Conversion rate, i.e., percentage of unique users converting
- Average number of visits before conversion
- Average number of visits per month
- Average time gap between visits, which is indicative of customer loyalty
- Average number of purchases per year, which is also a good metric for customer loyalty
- Average time gap between purchases
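The first two metrics in this list can be read straight off the user-level records. Here is a minimal sketch in Python; the function name `conversion_metrics` is made up, the field positions follow the output layout described above, and the three rows are copied from the sample output. Purchase-based metrics would need purchase dates, which are not part of this output.

```python
def conversion_metrics(user_rows):
    """Return (conversion rate in percent, average sessions before conversion)."""
    users = [row.split(",") for row in user_rows]
    # Field 6 (index 5) is the conversion status; field 3 (index 2) is the
    # number of sessions before conversion.
    converted = [u for u in users if u[5] == "1"]
    rate_pct = 100.0 * len(converted) / len(users)
    avg_before = (
        sum(int(u[2]) for u in converted) / len(converted) if converted else None
    )
    return rate_pct, avg_before


rows = [
    "1D3M2VUC1OZI,http://www.google.com,0,8,509,0,3,309378000",
    "1D4JFK1ARITD,http://www.facebook.com,4,3,268,1,6,474538200",
    "1D5S1H834P4V,http://www.facebook.com,0,1,39,0,4,616613333",
]
rate_pct, avg_before = conversion_metrics(rows)
print(rate_pct, avg_before)
```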
About Cookie
This analysis relies on the presence of cookies. However, cookies can be disabled by the user, or the user may clear the existing cookies on the client machine. We will also lose track of the user context when a user switches from one machine to another.
Going Forward
In a future post, I will discuss some of the Hadoop-based predictive analytic techniques that have been implemented in visitante for web clickstream data.