I had started on a Hadoop based web analytic open source project some time ago. Recently I did some work on it and decided blog about the development I did on the the project. The project is called visitante and it’s available on github. It’s goal is two fold. First, there are a set of MR jobs for various descriptive analytic metric e.g., bounce rate, checkout abandonment etc. I find the blog site of Avinash Kaushik to be the best resource for web analytic. It’s better than reading a book. I am implementing many of the metrics defined in this post of Avinash. Second, I will have a set of MR jobs for predictive analytic on web log data e.g., prediction of user conversion, making product recommendation.
In this post I will start off with some simple session based analytic. In follow up posts in future I will address more complex metrics, including predictive metrics.
The input to to most of MR jobs in visitante is W3C compliant web server log data. Some of the MR jobs will the consume the raw log data and some will consume output of other MR jobs. Here is some sample data. It happens to be for IIS web server.
2012-04-27 00:07:40 __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5 /product/N19C4MX1 http://www.healthyshopping.com/product/T0YJZ1QH 2012-04-27 00:08:21 __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G /product/FPR477BM http://www.google.com 2012-04-27 00:08:24 __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5 /product/NL0ZJO2L http://www.healthyshopping.com/product/N19C4MX1 2012-04-27 00:09:31 __RequestVerificationToken_Lw__=3GQ426510U4H;+.ASPXAUTH=DJ4XNH6EMMW5CCC5 /addToCart/NL0ZJO2L /product/NL0ZJO2L 2012-04-27 00:09:35 __RequestVerificationToken_Lw__=2XXW0J4N117Q;+.ASPXAUTH=X6DUSPR2R53VZ53G /addToCart/FPR477BM /product/FPR477BM 2012-04-27 00:09:45 __RequestVerificationToken_Lw__=UJAQ1TQWAGVL;+.ASPXAUTH=C142FL33KKCV603E /product/7Y4FP655 http://www.twitter.com
The input consists of the following fields. The cookie field contains userID and sessionID.
- Cookie containing sessionID and userID
- Page URL
- Referrer URL
The goal of the MR jobs is to to produce structured tabular data. Once we have tabular data, we can wrap them in a Hive table and generate whatever aggregate metrics we are interested in through Hive queries.
Session Extractor Map Reduce
This MapReduce processes log data and does grouping by sessionID with secondary sorting on visit time . In the reducer output we will get data for each session with pages visited in the session in the order of the visit. It generates as output the following
- Session start date time (unix epoch time)
- Page visit time
- Visited page URL
- Time spent on a page
You can think of this MR job as an ETL process. Interesting metric can be found by encapsulating the output with a Hive table
Hive in Action
You could partition the data by visit date, if you are planning to do date specific aggregate queries, e.g. most popular page on a given data. The details about Hive DDL can be found here in Hive Wiki.
Partitioning in Hive gives you an index like feature as in RDBMS. It can give a significant boost to the Hive query performance. In your processing pipe line you could process every 24 hour worth of log data after midnight and store the result into a new Hive date partition.
Sometimes there may be mismatch between the actual data and Hive schema. For example, I have session start time as a long (unix epoch time). However your Hive table may have date and time as string. The mismatch could be handled in the following way
- Load data into a Hive staging table.
- Define a Hive view on the staging table.
- The view should match the target Hive table.
- In the view use UDF to map unix epoch time to date and time as String
- Read from the view and insert over write to the target Hive table
How About Some Metrics
There are various metric you could calculate executing Hive queries on the data. Here are some examples
|Most popular page in a day or month|
|Distribution of visit time by hour of day|
|Distribution of visits by referrer|
In future posts I will talk about some more complex metrics. It will get really interesting when we get into predictive analytic, applying data mining algorithm on web log data using Hadoop. Here is a tutorial document with instructions on how to run this use case.