Analyzing vast amounts of machine-generated unstructured or semi-structured data is Hadoop's forte. Many of us have gone through the exercise of searching log files, most likely with grep, for some pattern and then looking at the surrounding log lines as context to better understand some incident or event reported in the log.
This effort may be necessary as part of troubleshooting a problem or to gain insight from log data about some significant event. The Hadoop-based solution presented in this post can be thought of as grep on steroids.
When dealing with large volumes of log data, manual searching is not viable. One of my earlier posts outlined a Hadoop-based solution for this problem. In this post, the focus will be on the implementation, which is part of my open source project visitante.
A log can be loosely defined as a chronologically ordered set of unstructured or semi-structured records, generated continuously by some computing device to describe the evolving internal state of some software.
There are two kinds of log data: with session and without session. If the log reflects the behavior or actions of a human, e.g., a web server log, there is usually a session associated with it.
A session is associated with a finite time span and some user. In these logs, a log record can span multiple lines, e.g., an app server log with a multi-line exception stack dump. We will further assume that the first line of a record contains a date-time field.
Logs generated by many devices, e.g., network devices, have no notion of a session; they are simply a set of chronologically ordered records.
Our Map Reduce implementation can handle both types of log data. The example we will work through is a web server log, which has sessions.
The Map Reduce implementation uses a mapper and a reducer for logs with sessions, and a mapper only for logs without sessions. For session-based logs, the mapper emits a composite key of sessionID and timeStamp. Including the time stamp in the key chronologically sorts the log records within a session through secondary sorting.
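To make the composite-key idea concrete, here is a minimal sketch in Python (not the actual Java MapReduce code in visitante; all names are illustrative). Sorting on (sessionID, timeStamp) and then grouping on sessionID mimics what Hadoop's secondary sort accomplishes with a composite key, a partitioner on sessionID, and a grouping comparator.

```python
# Sketch of secondary sorting: order records by the composite key
# (session_id, timestamp) so each session's records arrive chronologically.
from itertools import groupby
from operator import itemgetter

def group_sessions(records):
    """records: list of (session_id, timestamp, line) tuples.
    Returns {session_id: [line, ...]} with lines in time order."""
    ordered = sorted(records, key=itemgetter(0, 1))  # composite key sort
    return {sid: [line for _, _, line in recs]
            for sid, recs in groupby(ordered, key=itemgetter(0))}

records = [
    ("s1", "2016-01-15 22:00:13", "GET /product/69FEZCNC"),
    ("s2", "2016-01-15 21:59:43", "GET /product/K152052J"),
    ("s1", "2016-01-15 21:59:05", "GET /product/1E3E3KR1"),
]
sessions = group_sessions(records)
```

In Hadoop itself the sort happens in the shuffle phase, so the reducer simply receives each session's records already in chronological order.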
Multiple regular expressions can be specified through configuration. For each regex and each record in a session, we apply pattern matching. If a match is found, we output the following:
- Event name
- Set of records prior to the matched record
- Matched record
- A set of records following the matched record
The records before and after the matched record serve as the context. The number of such records on each side can be specified through configuration parameters.
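The reducer logic just described can be sketched as follows, in Python for brevity (the real implementation is Java MapReduce; function and parameter names here are illustrative). Given a session's ordered records, each configured regex is applied to each record, and a match emits the event name plus the surrounding context.

```python
# Sketch of the per-session matching logic: for each configured regex,
# emit (event, preceding context, matched line, following context).
import re

def match_with_context(lines, patterns, before=2, after=2):
    """lines: a session's records in chronological order.
    patterns: {event_name: regex} configured externally."""
    results = []
    for event, regex in patterns.items():
        pat = re.compile(regex)
        for i, line in enumerate(lines):
            if pat.search(line):
                results.append((event,
                                lines[max(0, i - before):i],   # records before
                                line,                          # matched record
                                lines[i + 1:i + 1 + after]))   # records after
    return results

session = ["/product/A", "/addToCart/A", "/myAccount",
           "/remFromCart/A", "/product/B"]
out = match_with_context(session,
                         {"removedFromCart": r"/remFromCart/"},
                         before=2, after=1)
```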
For log data without a session, no reducer is required. The mapper simply goes through the log lines, captures a set of lines in a buffer, and then applies pattern matching as done in the reducer logic for session-based log data.
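The map-only, buffered variant can be sketched like this (again a Python illustration under assumed names, not the visitante code). A bounded buffer holds the preceding lines, and each match stays open until enough trailing lines have been collected.

```python
# Sketch of map-only scanning for sessionless logs: a bounded buffer of
# prior lines, plus trailing-context collection for each open match.
import re
from collections import deque

def scan_stream(lines, regex, before=2, after=2):
    pat = re.compile(regex)
    prior = deque(maxlen=before)   # sliding buffer of preceding lines
    open_matches, done = [], []
    for line in lines:
        still_open = []
        for ctx_before, matched, ctx_after in open_matches:
            ctx_after.append(line)           # collect trailing context
            if len(ctx_after) == after:
                done.append((ctx_before, matched, ctx_after))
            else:
                still_open.append((ctx_before, matched, ctx_after))
        open_matches = still_open
        if pat.search(line):
            open_matches.append((list(prior), line, []))
        prior.append(line)
    done.extend(open_matches)  # matches too close to the end of the stream
    return done

hits = scan_stream(["a", "b", "ERROR x", "c", "d"], r"ERROR",
                   before=2, after=1)
```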
Shopping Cart Example
When an item is removed from a shopping cart, the retailer may be interested in gaining some insight from the incident. The context surrounding the event, when a user removes an item from the shopping cart, can be analyzed manually to gain those insights. We used a pattern to match the remove-from-shopping-cart request as captured in the web server log. Here are some potential findings from analyzing the surrounding log lines. Whatever the finding, it provides valuable insight into user behavior.
- A long pause with the possible implication that the user may have gone to another site for better price
- Browsing other products, implying that the user may replace an existing item in the shopping cart with a newly browsed item
Here are some sample log lines from the input. The log data is in IIS web server log format, from a fictitious online health food and supplement store.
```
2016-1-15 21:59:05 __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD /product/1E3E3KR1 http://www.healthyshopping.com/product/B3W8EJQY
2016-1-15 21:59:36 __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD /search/multivitamin http://www.healthyshopping.com/product/1E3E3KR1
2016-1-15 21:59:43 __RequestVerificationToken_Lw__=UIXITDK6TP;+.ASPXAUTH=SYEZXMP6T0NO74HL /product/K152052J http://www.healthyshopping.com/addToCart/FDXPB4DG
2016-1-15 22:00:13 __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD /product/69FEZCNC http://www.healthyshopping.com/search/multivitamin
2016-1-15 22:01:28 __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD /addToCart/69FEZCNC /product/69FEZCNC
2016-1-15 22:02:51 __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD /product/8L307T2Z http://www.healthyshopping.com
```
The fields in each log record, as seen in the samples above, are as follows. Incidentally, the log output for most web servers is configurable; various other fields besides those in the list below are available.
- Date time stamp
- Cookie (serving as the session identifier)
- Requested URL
- Referrer URL
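A sketch of parsing one such line into its fields, assuming the whitespace-separated layout seen in the samples above (a real IIS log's layout is configurable, so this is illustrative only):

```python
# Parse one sample log line into (timestamp, cookie, url, referrer),
# assuming the whitespace-separated field layout shown in the samples.
from datetime import datetime

def parse_line(line):
    date, time, cookie, url, referrer = line.split()
    return {
        "timestamp": datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        "cookie": cookie,
        "url": url,
        "referrer": referrer,
    }

rec = parse_line(
    "2016-1-15 21:59:05 "
    "__RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD "
    "/product/1E3E3KR1 http://www.healthyshopping.com/product/B3W8EJQY")
```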
Here is some sample output showing the log lines surrounding a remove-from-shopping-cart event. Here we find that the user browsed an item before removing an existing item from the shopping cart, and browsed another item after the removal.
```
event: removedFromCart
2016-1-15 20:15:20 __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0 /addToCart/X243NR0I /product/X243NR0I
2016-1-15 20:16:49 __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0 /myAccount http://www.healthyshopping.com/addToCart/X243NR0I
2016-1-15 20:17:06 __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0 /product/JBBFRERB http://www.healthyshopping.com/myAccount
2016-1-15 20:18:19 __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0 /remFromCart/X243NR0I http://www.healthyshopping.com/product/JBBFRERB
2016-1-15 20:20:08 __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0 /product/5KW4AKFN http://www.healthyshopping.com/remFromCart/X243NR0I
```
Correlating with Multiple Log Files
In some situations, it is necessary to merge-sort log files from multiple servers in a cluster before applying the pattern matching Map Reduce shown here. Consider logs generated by a Cassandra NoSQL database cluster.
For troubleshooting certain problems associated with global cluster behavior, it may be necessary to merge-sort the log files from the various servers in the Cassandra cluster, because behavior on one machine may be influenced by behavior on other machines. Without log consolidation, we cannot get the holistic view necessary to troubleshoot complex problems related to cluster behavior. In a future post I will report on a Map Reduce solution for merge-sorting multiple log files.
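The merge-sort idea can be sketched outside of Hadoop as an n-way merge keyed on each line's leading timestamp; this Python illustration (with made-up server log lines) assumes each per-server log is already chronologically ordered, which is the usual case:

```python
# N-way merge of per-server log streams, keyed on the leading timestamp.
# Each stream must already be sorted, which log files normally are.
import heapq
from datetime import datetime

def ts(line):
    """Parse the leading 'YYYY-M-D HH:MM:SS' timestamp of a log line."""
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")

def merge_logs(*streams):
    return list(heapq.merge(*streams, key=ts))

server_a = ["2016-1-15 21:59:05 node1 gc pause",
            "2016-1-15 22:01:10 node1 compaction"]
server_b = ["2016-1-15 22:00:00 node2 hinted handoff"]
merged = merge_logs(server_a, server_b)
```

A distributed Map Reduce version follows the same principle, with the timestamp carried in the shuffle key so the sort happens cluster-wide.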
Valuable insights can be gained from simple analysis of log data, as shown in this post. The tutorial for the example used in this post is available.
For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on the cloud, including installation, configuration, and testing.