Big Web Checkout Abandonment

Shopping cart abandonment is a concern for every online retailer, and it is more common in online stores than in brick and mortar stores. In this post I will discuss Hadoop based checkout abandonment analysis using my open source web analytic project visitante, hosted on github. As we will see, some other important metrics are also generated as part of this analysis.

A user may abandon a shopping cart for various reasons, including feeling uncomfortable about the price tag, wanting to compare prices on other sites, or simply being lazy and deciding to come back later to complete the transaction.


The input to the analysis is W3C compliant web server log files. Details can be found in my earlier web analytic post. The analysis is performed with two MR jobs, SessionSummarizer and UserSessionSummary. The basic idea is to collect all page visits by session, with the page visits for each session sorted by visit time.
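The grouping idea can be pictured outside Hadoop with a few lines of Python. This is only a sketch, not the visitante code: the tuple layout (session ID, visit time, URL) and the function name are assumptions made for illustration, and parsing of the actual log format is left out.

```python
from collections import defaultdict

def group_visits_by_session(visits):
    """Group (session_id, visit_time, url) tuples by session.

    The tuple layout is hypothetical; real log records carry more fields.
    """
    sessions = defaultdict(list)
    for session_id, visit_time, url in visits:
        sessions[session_id].append((visit_time, url))
    # Sort each session's visits by visit time, so the page sequence
    # for a session can be read off in order.
    return {sid: sorted(pages) for sid, pages in sessions.items()}

visits = [
    ("s1", 30, "/cart"),
    ("s2", 5, "/home"),
    ("s1", 10, "/home"),
]
grouped = group_visits_by_session(visits)
```

In the map reduce setting the same effect comes from the choice of key, so no in-memory sort of this kind would be needed.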

We also define the page sequence for the checkout process through a configuration parameter. A page URL in the sequence can also be specified as a regular expression.


Although our focus is on the checkout process, the solution discussed here could be used to detect any predefined sequence of pages in a session.

For example, you may have introduced a marketing campaign “Deal of the day” and you may want to track follow through from the link to the eventual purchase.

We essentially compare the checkout page sequence with the page sequence for a session. The following 3 scenarios are possible with respect to the match.

  • Checkout flow not entered: There is no intersection between the checkout sequence and the page visit sequence. No interaction with the checkout process.
  • Checkout flow entered but not completed: The page visit sequence contains the checkout sequence, but only partially. Considered to be an abandoned checkout.
  • Checkout flow entered and completed: The checkout sequence is fully contained within the page visit sequence. Considered to be a successful checkout.
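The three scenarios can be sketched in Python as follows. This is an illustration under my own assumptions rather than the logic in visitante: the checkout URLs are made up, each pattern must match the whole URL, a visit to any checkout page counts as entering the flow, and the flow counts as completed when all steps appear in order within the session (not necessarily back to back).

```python
import re

NOT_ENTERED, ABANDONED, COMPLETED = 0, 1, 2

def checkout_flow_status(pages, checkout_patterns):
    """Classify a time ordered list of page URLs against a checkout sequence."""
    compiled = [re.compile(p) for p in checkout_patterns]
    # Any visit to a checkout page counts as entering the flow.
    entered = any(rx.fullmatch(url) for url in pages for rx in compiled)
    if not entered:
        return NOT_ENTERED
    # Walk the session once and advance through the checkout steps in order.
    step = 0
    for url in pages:
        if step < len(compiled) and compiled[step].fullmatch(url):
            step += 1
    return COMPLETED if step == len(compiled) else ABANDONED

# Hypothetical checkout sequence; the last step is a regular expression.
checkout = ["/cart", "/checkout/shipping", "/checkout/payment", r"/order/confirm/\d+"]

browsing = checkout_flow_status(["/home", "/product/12"], checkout)
abandoned = checkout_flow_status(["/home", "/cart", "/checkout/shipping"], checkout)
completed = checkout_flow_status(
    ["/cart", "/checkout/shipping", "/checkout/payment", "/order/confirm/981"],
    checkout,
)
```

A stricter variant could require the steps to be contiguous; which rule is right depends on how the site lets users move in and out of the checkout pages.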

SessionSummarizer Map Reduce

This map reduce takes log data as input and groups data by session ID. The mapper output key consists of session ID and page visit time. It also computes the checkout flow status. Its output consists of the following.

  1. User ID (or Cookie)
  2. Session ID
  3. Number of pages
  4. Session start time
  5. Time spent in the session
  6. Last page in the session
  7. Checkout flow status
  8. Referrer for the session if any

Here is some sample output.


For the flow status (7th field in the output), the values are as follows. In the sample output above, there is only one case of checkout completion.

  1. Not entered (0)
  2. Entered and not completed (1)
  3. Entered and completed (2)

Once we have this data, we could try to correlate it with other parameters (e.g., referrer URL, time spent in the session, products viewed) to get more insight into the checkout process.

Checkout abandonment is an important metric. It’s the ratio of the number of sessions that abandoned the checkout process to the total number of sessions that entered the checkout process.
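As a small worked example, the ratio can be computed from the flow status values listed earlier (0, 1 and 2). The function name and the sample values are made up for illustration.

```python
def checkout_abandonment_rate(flow_statuses):
    """Abandoned sessions divided by sessions that entered the checkout flow."""
    abandoned = sum(1 for s in flow_statuses if s == 1)
    completed = sum(1 for s in flow_statuses if s == 2)
    entered = abandoned + completed
    # The rate is undefined when no session entered the checkout flow.
    return abandoned / entered if entered else None

# Made up flow status values, one per session.
statuses = [0, 0, 1, 1, 1, 2, 0, 1]
rate = checkout_abandonment_rate(statuses)
```

Here 5 sessions entered the checkout flow and 4 of them abandoned it, so the rate is 4/5. Sessions with status 0 do not enter the calculation at all.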

Although our focus is on checkout abandonment, some other important and insightful metrics can be derived. Here are some examples

  • Bounce rate, i.e., fraction of sessions that end after the landing page
  • Average session duration
  • Site penetration i.e., average number of pages visited per session 
  • User visit time distribution in a 24 hour period 
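These metrics could be derived from the session level output along the following lines. It is a sketch under assumptions of my own: each session is reduced to a (number of pages, start time, duration) tuple, the start time is taken to be epoch seconds, hours are bucketed in UTC, and a bounce is a session with exactly one page.

```python
from collections import Counter
from datetime import datetime, timezone

def session_metrics(sessions):
    """sessions: list of (num_pages, start_epoch_sec, duration_sec) tuples."""
    total = len(sessions)
    if total == 0:
        return None
    # A bounce is taken to be a session with a single page visit.
    bounces = sum(1 for pages, _, _ in sessions if pages == 1)
    # Count session starts per hour of day (UTC).
    hours = Counter(
        datetime.fromtimestamp(start, tz=timezone.utc).hour
        for _, start, _ in sessions
    )
    return {
        "bounce_rate": bounces / total,
        "avg_duration_sec": sum(d for _, _, d in sessions) / total,
        "avg_pages": sum(p for p, _, _ in sessions) / total,
        "visits_by_hour": dict(hours),
    }

# Made up sessions for illustration.
sessions = [
    (1, 0, 0),       # single page session starting at 00:00 UTC
    (5, 3600, 300),  # starts at 01:00 UTC
    (3, 3700, 120),  # also within the 01:00 hour
    (1, 7200, 0),    # starts at 02:00 UTC
]
metrics = session_metrics(sessions)
```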

UserSessionSummary Map Reduce

This map reduce takes the output of the previous map reduce and generates user level data by aggregating all sessions for a user. The data is grouped by user ID. The mapper key consists of user ID and session start time. It also provides information about conversion of a user. The output consists of the following.

  1. User ID (or Cookie)
  2. Referrer for the first session
  3. Number of sessions before conversion
  4. Average number of pages visited per session
  5. Average time spent per session
  6. Conversion status
  7. Number of sessions
  8. Average time gap between sessions
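The aggregation for a single user might look like the sketch below. It is not the visitante implementation: the dictionary keys are hypothetical, a completed checkout (flow status 2) is used as the conversion signal, and I assume that the number of sessions before conversion is left empty for users who never convert.

```python
def summarize_user(user_id, sessions):
    """sessions: non-empty list of dicts with keys
    start, pages, duration, status, referrer (hypothetical layout)."""
    if not sessions:
        raise ValueError("at least one session is required")
    ordered = sorted(sessions, key=lambda s: s["start"])
    n = len(ordered)
    # Position of the first session with a completed checkout, if any.
    first_conv = next((i for i, s in enumerate(ordered) if s["status"] == 2), None)
    # Gaps between consecutive session start times.
    gaps = [b["start"] - a["start"] for a, b in zip(ordered, ordered[1:])]
    return {
        "user_id": user_id,
        "first_referrer": ordered[0]["referrer"],
        "sessions_before_conversion": first_conv,
        "avg_pages": sum(s["pages"] for s in ordered) / n,
        "avg_duration": sum(s["duration"] for s in ordered) / n,
        "converted": 1 if first_conv is not None else 0,
        "num_sessions": n,
        "avg_gap": sum(gaps) / len(gaps) if gaps else None,
    }

# Made up sessions for one user, deliberately out of time order.
user_sessions = [
    {"start": 1000, "pages": 4, "duration": 200, "status": 1, "referrer": "search"},
    {"start": 100, "pages": 2, "duration": 60, "status": 0, "referrer": "ad"},
    {"start": 2500, "pages": 6, "duration": 400, "status": 2, "referrer": None},
]
summary = summarize_user("u1", user_sessions)
```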

The conversion status is 1 if the user completed the checkout flow and made a purchase in one of the sessions. Here is some sample output.


In the output above we see one case of checkout completion. Some important metrics that can be derived from this data are as follows.

  • Conversion rate i.e., percentage of unique users converting
  • Average number of visits before conversion
  • Average number of visits per month
  • Average time gap between visits, which is indicative of customer loyalty
  • Average number of purchases per year, which is also a good metric for customer loyalty
  • Average time gap between purchases
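The first two metrics in the list can be computed from the user level output as sketched below. The (conversion status, sessions before conversion) tuple layout is an assumption for illustration, and the sample values are made up.

```python
def conversion_metrics(users):
    """users: list of (converted, sessions_before_conversion) tuples."""
    total = len(users)
    # Keep the session counts of converted users only.
    before = [count for converted, count in users if converted == 1]
    return {
        "conversion_rate_pct": 100.0 * len(before) / total if total else None,
        "avg_visits_before_conversion": sum(before) / len(before) if before else None,
    }

# Made up user level records: two of four users converted.
users = [(1, 2), (0, None), (0, None), (1, 4)]
conv = conversion_metrics(users)
```

The remaining metrics in the list need purchase timestamps or a longer observation window, which this small example does not model.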

About Cookie

This analysis relies on the presence of a cookie. However, cookies can be disabled by the user, or the user may clear existing cookies on the client machine. We also lose track of the user context when a user switches from one machine to another.

Going Forward

In a future post I will discuss some of the Hadoop based predictive analytic techniques that have been implemented in visitante for web click stream data.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.