Big Web Analytic


I started a Hadoop based web analytics open source project some time ago. Recently I did some work on it and decided to blog about the development I did on the project. The project is called visitante and it's available on github. Its goal is twofold. First, there is a set of MR jobs for various descriptive analytics metrics, e.g., bounce rate, checkout abandonment etc. I find the blog site of Avinash Kaushik to be the best resource for web analytics. It's better than reading a book. I am implementing many of the metrics defined in this post of Avinash. Second, there will be a set of MR jobs for predictive analytics on web log data, e.g., prediction of user conversion and making product recommendations.

In this post I will start off with some simple session based analytics. In future follow-up posts I will address more complex metrics, including predictive metrics.

Log Input

The input to most of the MR jobs in visitante is W3C compliant web server log data. Some of the MR jobs will consume the raw log data and some will consume the output of other MR jobs. Here is some sample data. It happens to be from an IIS web server.

2012-04-27  00:07:40  __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5  /product/N19C4MX1 http://www.healthyshopping.com/product/T0YJZ1QH
2012-04-27  00:08:21  __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G  /product/FPR477BM http://www.google.com
2012-04-27  00:08:24  __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5  /product/NL0ZJO2L http://www.healthyshopping.com/product/N19C4MX1
2012-04-27  00:09:31  __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5  /addToCart/NL0ZJO2L /product/NL0ZJO2L
2012-04-27  00:09:35  __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G  /addToCart/FPR477BM /product/FPR477BM
2012-04-27  00:09:45  __RequestVerificationToken_Lw__=UJAQ1TQWAGVL;+.ASPXAUTH=C142FL33KKCV603E  /product/7Y4FP655 http://www.twitter.com

The input consists of the following fields. The cookie field contains userID and sessionID.

  1. Date
  2. Time
  3. Cookie containing sessionID and userID
  4. Page URL
  5. Referrer URL

The goal of the MR jobs is to produce structured tabular data. Once we have tabular data, we can wrap it in a Hive table and generate whatever aggregate metrics we are interested in through Hive queries.

Session Extractor Map Reduce

This MapReduce job processes log data, grouping by sessionID with secondary sorting on visit time. The reducer output contains the data for each session, with the pages visited in the session in the order of visit. It generates the following output:

  1. SessionID
  2. UserID
  3. Session start date time (unix epoch time)
  4. Page visit time
  5. Visited page URL
  6. Time spent on a page
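The grouping and secondary sort this MR job performs can be sketched conceptually in HiveQL (the raw_log table and column names here are hypothetical, purely to illustrate the idea; the project itself does this in Java MR):

```sql
-- Conceptual HiveQL equivalent of the MR grouping and secondary sort:
-- all rows for a session go to one reducer, ordered by visit time
SELECT session_id, user_id, visit_time, page_url
FROM raw_log
DISTRIBUTE BY session_id
SORT BY session_id, visit_time;
```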

You can think of this MR job as an ETL process. Interesting metrics can be derived by encapsulating the output in a Hive table.

Hive in Action

To wrap the output in a Hive table, you can either take the data as-is in the HDFS output directory and define a Hive external table over it, or use a Hive managed table and load the data into the managed table's warehouse location with the Hive load command.
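As a sketch, an external table over the session extractor output could look like the following. The table name, column names and HDFS path are illustrative, not taken from the project:

```sql
-- External table over the session extractor MR output; names are illustrative
CREATE EXTERNAL TABLE visit_session (
  session_id    STRING,
  user_id       STRING,
  session_start BIGINT,   -- unix epoch time
  visit_time    STRING,
  page_url      STRING,
  time_on_page  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/visitante/session/output';
```

With an external table, dropping the table leaves the MR output files in HDFS untouched, which fits an ETL pipeline where the files are managed outside Hive.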

You could partition the data by visit date if you are planning to do date specific aggregate queries, e.g., the most popular page on a given date. The details of Hive DDL can be found in the Hive wiki.

Partitioning in Hive gives you an index-like feature, as in an RDBMS. It can give a significant boost to Hive query performance. In your processing pipeline you could process each 24 hours' worth of log data after midnight and store the result in a new Hive date partition.
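A minimal sketch of that nightly flow, again with illustrative names and paths:

```sql
-- Managed table partitioned by visit date; one partition per day
CREATE TABLE visit_session_daily (
  session_id    STRING,
  user_id       STRING,
  session_start BIGINT,
  visit_time    STRING,
  page_url      STRING,
  time_on_page  INT
)
PARTITIONED BY (visit_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- After the nightly MR run, load that day's output into a new partition
LOAD DATA INPATH '/user/hadoop/visitante/session/output/2012-04-27'
OVERWRITE INTO TABLE visit_session_daily
PARTITION (visit_date = '2012-04-27');
```

A query that filters on visit_date then reads only the matching partition directories instead of scanning the whole table.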

Sometimes there may be a mismatch between the actual data and the Hive schema. For example, I have the session start time as a long (unix epoch time), while your Hive table may have date and time as strings. The mismatch could be handled in the following way:

  1. Load the data into a Hive staging table.
  2. Define a Hive view on the staging table.
  3. The view should match the target Hive table.
  4. In the view, use a UDF to map unix epoch time to date and time strings.
  5. Read from the view and insert overwrite into the target Hive table.
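The steps above can be sketched like this, using Hive's built-in from_unixtime UDF (table and column names are illustrative):

```sql
-- 1. Staging table matching the raw MR output (epoch time as BIGINT)
CREATE TABLE visit_session_stage (
  session_id    STRING,
  user_id       STRING,
  session_start BIGINT,
  page_url      STRING,
  time_on_page  INT
);

-- 2-4. View matching the target table, mapping epoch time to strings
CREATE VIEW visit_session_view AS
SELECT
  session_id,
  user_id,
  from_unixtime(session_start, 'yyyy-MM-dd') AS start_date,
  from_unixtime(session_start, 'HH:mm:ss')   AS start_time,
  page_url,
  time_on_page
FROM visit_session_stage;

-- 5. Populate the target table from the view
INSERT OVERWRITE TABLE visit_session_target
SELECT * FROM visit_session_view;
```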

How About Some Metrics

There are various metrics you could calculate by executing Hive queries on the data. Here are some examples:

  1. Most popular page in a day or month
  2. Distribution of visit time by hour of day
  3. Distribution of visits by referrer
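For instance, the first two metrics could be queried roughly as follows, against the illustrative partitioned table described earlier:

```sql
-- Most popular page on a given day
SELECT page_url, COUNT(*) AS visits
FROM visit_session_daily
WHERE visit_date = '2012-04-27'
GROUP BY page_url
ORDER BY visits DESC
LIMIT 1;

-- Distribution of visits by hour of day
SELECT hour(from_unixtime(session_start)) AS hour_of_day, COUNT(*) AS visits
FROM visit_session_daily
GROUP BY hour(from_unixtime(session_start));
```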

Wrapping Up

In future posts I will talk about some more complex metrics. It will get really interesting when we get into predictive analytics, applying data mining algorithms to web log data using Hadoop.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Data Mining and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
This entry was posted in Big Data, Hadoop and Map Reduce, Hive, Web Analytic.

8 Responses to Big Web Analytic

  1. Manish Malhotra says:

    Hi Pranab,

    Quick question: for the analytics queries you are using MR. Any specific reason to use MR rather than Hive or Pig to implement the business logic?

    Thanks for your time.

    Regards,
    Manish

  2. Pranab says:

    Manish
    I am using Java MR as an ETL process to analyze log data and then using Hive on the MR output. For unstructured or semi-structured data with complex parsing logic, Java MR works better. Once the data is in a tabular, structured form, using Hive on it works great. That is the approach I have taken.

  3. Sambit Tripathy says:

    Hi Pranab,

    Well I liked your idea of having “visitante” Big Web Analytic. It will be some serious stuff if you go ahead, can be used heavily by interactive analytic applications.

    I am not an expert in analytic domain, yet somewhere I feel if its a case where a subsets of the given data with given a filtering condition retrieves data from HBase( I assume there is no aggregation of data). So logically if can create a sort of lexical parser which can parse the user queries and hit HBase and get the results and perform aggregation then:

    1. For simple data this may be a better performing application( I dont like the level of I/O involved in hive case).

    2. I am not sure if we create contingency tables with the web log data, if that is the case HBase can store the sparse data.

    Summary: These are the first few things came to my mind after going through this, I will go through the visitante code. So not sure if we can have a Hbase thing alongwith hive.

    Thanks and Regards,
    Sambit Tripathy

  4. Pranab says:

    Sambit

    Sure. HBase can play a role. The other day, I saw someone building an analytics solution where the facts are in Hive and the dimensions in HBase. Dimensions are essentially different business entities, and they were already in HBase in their case.

    Pranab

  5. Pingback: Big Web Analytic | BigDataCloud.com

  6. Pingback: Big Web Checkout Abandonment | Mawazo

  7. saeed says:

    Thanks. Is there any UDF or library for data mining, like Mahout, available in Pig or Hive?
