Big Web Checkout Abandonment


The topic of this post is of interest to any online retailer. Shopping cart abandonment is dreaded by online stores, and it is more common online than in brick-and-mortar stores. In this post I will discuss Hadoop-based checkout abandonment analysis using my open source web analytics project visitante, hosted on GitHub. As we will see, some other important metrics are also generated as part of this analysis.

A user may abandon a shopping cart for various reasons, including feeling uncomfortable about the price tag, wanting to compare prices on other sites, or simply being lazy and deciding to come back later to complete the transaction.

Solution

The input to the analysis is W3C compliant web server log files. Details can be found in my earlier web analytic post. The analysis is performed with two MapReduce (MR) jobs: SessionSummarizer and UserSessionSummary. The basic idea is to collect all page visits by session, with the page visits for each session sorted by visit time.

We also define the page sequence for the checkout process through a configuration parameter, as below. A page URL in the sequence could also be specified as a regular expression.

flow.sequence=/shoppingCart,/checkOut,/signin,/signup,/billing,/confirmShipping,/placeOrder
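
As a rough illustration of how such a parameter could be consumed, the Python sketch below splits the flow.sequence value into steps and treats each step as a regular expression. The function names parse_flow and step_index are hypothetical and not part of visitante, and matching a step against the whole URL path is an assumption.

```python
import re

def parse_flow(value):
    """Split a comma-separated flow.sequence value into compiled patterns."""
    return [re.compile(step.strip()) for step in value.split(",") if step.strip()]

def step_index(patterns, url):
    """Return the position of the first flow step matching url, or -1."""
    for i, pattern in enumerate(patterns):
        # Assumption: a step must match the entire URL path.
        if pattern.fullmatch(url):
            return i
    return -1

flow = parse_flow(
    "/shoppingCart,/checkOut,/signin,/signup,/billing,/confirmShipping,/placeOrder"
)
```

With this sketch, step_index(flow, "/billing") gives the position of the billing page in the flow, while a product page that is not part of the flow gives -1.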

Although our focus is on the checkout process, the solution discussed here could be used for detecting any predefined sequence of pages in a session.

For example, you may have introduced a “Deal of the day” marketing campaign and want to track follow-through from the link to the eventual purchase.

We essentially compare the checkout page sequence with the page sequence for a session. The following three scenarios are possible with respect to the match.

  • Checkout flow not entered: There is no intersection between the checkout sequence and the page visit sequence. No interaction with the checkout process.
  • Checkout flow entered but not completed: The page visit sequence contains the checkout sequence, but only partially. Considered to be an abandoned checkout.
  • Checkout flow entered and completed: The checkout sequence is fully contained within the page visit sequence. Considered to be a successful checkout.
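
The three cases can be sketched in Python as follows. This is a minimal illustration, not the visitante implementation. It assumes exact URL matching, treats any visit to a flow page as entering the flow, and treats the flow as completed only when every configured page appears in order within the session. The function name flow_status is hypothetical; the numeric codes follow the flow status values in the SessionSummarizer output.

```python
NOT_ENTERED, ABANDONED, COMPLETED = 0, 1, 2

def flow_status(visited, flow):
    """Classify a time-ordered list of visited URLs against a checkout flow."""
    flow_pages = set(flow)
    if not any(url in flow_pages for url in visited):
        return NOT_ENTERED
    # Subsequence test: each flow page must be found, in order, in the
    # remaining part of the session (the iterator is consumed as we go).
    remaining = iter(visited)
    if all(step in remaining for step in flow):
        return COMPLETED
    return ABANDONED

# Shortened, hypothetical flow used only for illustration.
example_flow = ["/shoppingCart", "/checkOut", "/billing", "/placeOrder"]
status = flow_status(["/shoppingCart", "/checkOut", "/product/A"], example_flow)
```

Here the session reaches /checkOut but never /placeOrder, so it is classified as an abandoned checkout.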

SessionSummarizer Map Reduce

This map reduce job takes log data as input and groups it by session ID. The mapper output key consists of session ID and page visit time. The job also computes the checkout flow status. Its output consists of the following.

  1. User ID (or Cookie)
  2. Session ID
  3. Number of pages
  4. Session start time
  5. Time spent in the session
  6. Last page in the session
  7. Checkout flow status
  8. Referrer for the session if any

Here is some sample output.

2BTZOCY42DT15ANB,F9YCR287DXG8,3,1336303827000,229,/addToCart/NJYQ7UCV,0,http://www.google.com
2BV28U2RS5C9RENW,6BJVC65DUTV3,2,1337915930000,194,/product/ZND9R09Z,0,http://www.myhealth.com
2BV5RUHP3DAOCGL5,TQW6X3BDIYM1,5,1337808408000,296,/product/NBGFWW81,0,http://www.google.com
2BVPDAP53JZAC8MB,CVEDAZMAO78R,2,1337915166000,80,/product/4HOK5MOI,0,http://www.myhealth.com
2BW2ZITX4SGB7A0S,3331234N1ZKC,4,1338037974000,310,/addToCart/LFN69GLC,1,http://www.myhealth.com
2BWY4042Z2EU011J,3K22BXB2V00V,2,1337291518000,84,/product/SR3ON54E,0,http://www.google.com
2BZI4VZM4XT43K7V,0YFI0U10Z4LM,2,1336174755000,192,/product/DXM3ZF0N,0,http://www.google.com
2C11GC023MC12JFA,MP22NW0DLW34,7,1336499993000,524,/addToCart/GDCSVGP4,0,http://www.google.com
2C290I13A1USIR2C,84E0M6FH9VQF,6,1336486745000,399,/signup,1,http://www.myhealth.com
2C2FBN3AK6EO3ZJP,0TQMHGX5AZOS,7,1336086821000,493,/product/CG7T015N,2,http://www.facebook.com
2C301FAZ29FQWQJH,7BGYHD344SQN,5,1337224728000,446,/product/TKS93ALL,0,http://www.myhealth.com
2C3M0XHOCK3S1AXH,Z25ZQE2NZM3W,5,1336015288000,197,/product/V8SS47S8,0,http://www.facebook.com
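
The grouping and sorting that produce records like these can be sketched in plain Python, outside Hadoop. Everything here is an assumption made for illustration: the input tuple layout, the function name summarize_sessions, the classify callback that returns the flow status, time spent being reported in seconds, and the referrer being taken from the first page visit of the session.

```python
from itertools import groupby
from operator import itemgetter

def summarize_sessions(visits, classify):
    """visits: iterable of (user_id, session_id, time_ms, url, referrer)."""
    # Sort on (session ID, visit time), mirroring the composite mapper key.
    ordered = sorted(visits, key=itemgetter(1, 2))
    summaries = []
    for session_id, group in groupby(ordered, key=itemgetter(1)):
        hits = list(group)
        first, last = hits[0], hits[-1]
        urls = [hit[3] for hit in hits]
        summaries.append((
            first[0],                       # user ID (or cookie)
            session_id,
            len(hits),                      # number of pages
            first[2],                       # session start time (ms)
            (last[2] - first[2]) // 1000,   # time spent (s), an assumption
            last[3],                        # last page in the session
            classify(urls),                 # checkout flow status
            first[4] or "",                 # referrer, if any
        ))
    return summaries
```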

For the flow status (7th field in the output), the values are as follows. In the sample output above, there is only one case of checkout completion.

  • 0: Not entered
  • 1: Entered and not completed
  • 2: Entered and completed

Once we have this data, we could try to correlate it with other parameters (e.g., referrer URL, time spent in the session, products viewed) to get more insight into the checkout process.

Checkout abandonment is an important metric. It’s the ratio of the number of sessions that abandoned the checkout process to the total number of sessions that entered the checkout process.
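
As a small worked example, the ratio can be computed directly from the flow status field of the session records. The function name abandonment_rate is hypothetical, and splitting on commas assumes that no field contains a comma. The records are four of the sample lines shown earlier.

```python
from collections import Counter

def abandonment_rate(statuses):
    """Abandoned sessions divided by sessions that entered the checkout flow."""
    counts = Counter(statuses)
    abandoned, completed = counts[1], counts[2]
    entered = abandoned + completed
    return abandoned / entered if entered else None

# Flow status is the 7th comma-separated field of each session record.
records = [
    "2BW2ZITX4SGB7A0S,3331234N1ZKC,4,1338037974000,310,/addToCart/LFN69GLC,1,http://www.myhealth.com",
    "2C290I13A1USIR2C,84E0M6FH9VQF,6,1336486745000,399,/signup,1,http://www.myhealth.com",
    "2C2FBN3AK6EO3ZJP,0TQMHGX5AZOS,7,1336086821000,493,/product/CG7T015N,2,http://www.facebook.com",
    "2C301FAZ29FQWQJH,7BGYHD344SQN,5,1337224728000,446,/product/TKS93ALL,0,http://www.myhealth.com",
]
rate = abandonment_rate(int(r.split(",")[6]) for r in records)
```

Two of the three sessions that entered the flow did not complete it, so the abandonment rate for these records is 2/3.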

Although our focus is on checkout abandonment, some other important and insightful metrics can be derived. Here are some examples.

  • Bounce rate, i.e., fraction of sessions that end after the landing page
  • Average session duration
  • Site penetration i.e., average number of pages visited per session 
  • User visit time distribution in a 24 hour period 
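
These metrics could be derived from the session records along the following lines. This is a sketch under stated assumptions: the name session_metrics is hypothetical, a bounce is taken to be a session with exactly one page, time spent is assumed to be in seconds, and the hour of day is computed in UTC from the session start time in milliseconds.

```python
from collections import Counter
from datetime import datetime, timezone

def session_metrics(records):
    """records: list of (pages, start_ms, seconds_spent) tuples, one per session."""
    n = len(records)
    if n == 0:
        return None
    # Count sessions per hour of day (UTC is an assumption).
    hours = Counter(
        datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc).hour
        for _, start_ms, _ in records
    )
    return {
        "bounce_rate": sum(1 for pages, _, _ in records if pages == 1) / n,
        "avg_duration": sum(secs for _, _, secs in records) / n,
        "avg_pages": sum(pages for pages, _, _ in records) / n,
        "sessions_by_hour": dict(hours),
    }
```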

UserSessionSummary Map Reduce

This map reduce job takes the output of the previous job and generates user-level data by aggregating all sessions for a user. The data is grouped by user ID. The mapper key consists of user ID and session start time. The output, which also indicates whether the user converted, consists of the following.

  1. User ID (or Cookie)
  2. Referrer for the first session
  3. Number of sessions before conversion
  4. Average number of pages visited per session
  5. Average time spent per session
  6. Conversion status
  7. Number of sessions
  8. Average time gap between sessions

The conversion status is 1 if the user completed the checkout flow and made a purchase in one of the sessions. Here is some sample output.

1CKWW5KM8QB5,http://www.facebook.com,0,4,239,0,5,320297000
1CNO4M4WQ4CG,http://www.google.com,0,3,260,0,7,327496833
1D0SZ6WUTMHJ,http://www.google.com,0,4,381,0,2,89114000
1D27D0FEL156,http://www.myhealth.com,0,4,293,0,5,234500000
1D2J3PODJ04U,http://www.myhealth.com,0,5,397,0,6,426026800
1D2MCNFC4X11,http://www.google.com,0,3,213,0,4,572164666
1D3M2VUC1OZI,http://www.google.com,0,8,509,0,3,309378000
1D4JFK1ARITD,http://www.facebook.com,4,3,268,1,6,474538200
1D5S1H834P4V,http://www.facebook.com,0,1,39,0,4,616613333
1D5SX8SWY202,http://www.google.com,0,2,148,0,4,595993333
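
A user-level record of this shape could be built from a user's sessions roughly as follows. This is only a sketch: the function name summarize_user and the input tuple layout are hypothetical, averages are truncated to integers, the time gap is measured between consecutive session start times in milliseconds, and "sessions before conversion" is taken to be the number of sessions preceding the first completed checkout. The sketch expects at least one session per user.

```python
def summarize_user(user_id, sessions):
    """sessions: list of (start_ms, pages, seconds_spent, flow_status, referrer)."""
    ordered = sorted(sessions, key=lambda s: s[0])
    n = len(ordered)
    # Index of the first session that completed checkout (status 2), if any.
    first_purchase = next((i for i, s in enumerate(ordered) if s[3] == 2), None)
    converted = first_purchase is not None
    gaps = [b[0] - a[0] for a, b in zip(ordered, ordered[1:])]
    return (
        user_id,
        ordered[0][4],                          # referrer for the first session
        first_purchase if converted else 0,     # sessions before conversion
        sum(s[1] for s in ordered) // n,        # average pages per session
        sum(s[2] for s in ordered) // n,        # average time per session
        1 if converted else 0,                  # conversion status
        n,                                      # number of sessions
        sum(gaps) // len(gaps) if gaps else 0,  # average gap between sessions
    )
```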

In the output above we see one case of checkout completion. Some important metrics that can be derived from this data are as follows.

  • Conversion rate i.e., percentage of unique users converting
  • Average number of visits before conversion
  • Average number of visits per month
  • Average time gap between visits, which is indicative of customer loyalty
  • Average number of purchases per year, which is also a good metric for customer loyalty
  • Average time gap between purchases
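
The first two of these metrics can be computed from the user records as in the sketch below. The function name conversion_metrics is hypothetical, and splitting on commas assumes that no field contains a comma. The records are four of the sample lines shown earlier.

```python
def conversion_metrics(user_records):
    """user_records: CSV lines in the UserSessionSummary output layout."""
    rows = [line.split(",") for line in user_records]
    # Conversion status is the 6th field; sessions before conversion the 3rd.
    converted = [r for r in rows if r[5] == "1"]
    rate = len(converted) / len(rows) if rows else None
    avg_before = (
        sum(int(r[2]) for r in converted) / len(converted) if converted else None
    )
    return rate, avg_before

sample = [
    "1D3M2VUC1OZI,http://www.google.com,0,8,509,0,3,309378000",
    "1D4JFK1ARITD,http://www.facebook.com,4,3,268,1,6,474538200",
    "1D5S1H834P4V,http://www.facebook.com,0,1,39,0,4,616613333",
    "1D5SX8SWY202,http://www.google.com,0,2,148,0,4,595993333",
]
rate, avg_before = conversion_metrics(sample)
```

For these four records, one user converted, after four earlier sessions.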

About Cookie

This analysis relies on the presence of a cookie. However, cookies can be disabled by the user, or the user may clear the cookies present on the client machine. We also lose track of a user's context when the user switches from one machine to another.

Going Forward

In a future post I will discuss some of the Hadoop based predictive analytic techniques that have been implemented in visitante for web click stream data.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Data Mining and Programming languages. I am fascinated by problems that don't have a neat closed-form solution.