Detecting Incidents with Context from Log Data

Analyzing vast amounts of machine-generated unstructured or semi-structured data is Hadoop's forte. Many of us have gone through the exercise of searching log files, most likely with grep, for some pattern and then looking at the surrounding log lines as context to better understand some incident or event reported in the log.

This effort could be necessary as part of troubleshooting some problem, or to gain insight from log data about some significant event. The Hadoop-based solution presented in this post can be thought of as grep on steroids.

When dealing with large volumes of log data, manual searching is not viable. One of my earlier posts outlined a Hadoop-based solution for this problem. In this post, the focus will be on the implementation, which is part of my open source project visitante.

Log Data

A log can be loosely defined as a set of unstructured or semi-structured records, chronologically ordered and generated continuously by some computing device, describing the evolving internal state of some software.

There are two kinds of log data: with sessions and without sessions. If the log reflects the behavior or actions of some human, e.g., a web server log, there is usually a session associated with it.

A session is associated with a finite time span and some user. In these logs, a log record can span multiple lines, e.g., an app server log with a multi-line exception stack dump. We will further assume that the first line of a record contains a date time field.

Logs generated by many devices, e.g., network devices, don't have any notion of session. Such a log is simply a set of records, chronologically ordered.

Our Map Reduce implementation can handle both types of log data. The example we will work through is for a web server log, which has sessions.

Map Reduce

The Map Reduce implementation uses a mapper and a reducer for logs with sessions, and a mapper only for logs without sessions. For logs with sessions, the mapper uses the sessionID and the timestamp as a composite key. The timestamp in the key enables secondary sorting, so the log records within a session arrive at the reducer chronologically ordered.
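
The effect of the composite key can be sketched outside Hadoop in a few lines of Python. This is only an illustrative stand-in for the shuffle phase (not the visitante code), assuming each record is a (sessionID, timestamp, log line) tuple:

```python
# Illustrative sketch (not the visitante implementation) of how the
# (sessionID, timestamp) composite key yields chronologically sorted
# records per session.

def map_records(records):
    # records: iterable of (session_id, timestamp, log_line) tuples
    for session_id, timestamp, line in records:
        yield (session_id, timestamp), line

def shuffle_and_sort(mapped):
    # Hadoop sorts on the full composite key but groups on sessionID
    # alone; we simulate that with a plain sort followed by grouping.
    by_session = {}
    for (session_id, timestamp), line in sorted(mapped):
        by_session.setdefault(session_id, []).append(line)
    return by_session

records = [
    ("s1", "2016-01-15 22:00:13", "/product/69FEZCNC"),
    ("s1", "2016-01-15 21:59:05", "/product/1E3E3KR1"),
    ("s2", "2016-01-15 21:59:43", "/product/K152052J"),
]
sessions = shuffle_and_sort(map_records(records))
```

Each reducer call then sees one session's records in time order, which is what the pattern matching step relies on.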

Multiple regular expressions can be specified through configuration. For each regex and each record of a session, we apply pattern matching. If a match is found, we output the following:

  1. Event name
  2. A set of records preceding the matched record
  3. The matched record
  4. A set of records following the matched record


The records preceding and following the matched record serve as the context. The number of such records before and after the match can be specified through configuration parameters.
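
In plain Python, the matching and context extraction could look like the following sketch; `num_before` and `num_after` are hypothetical names standing in for the configuration parameters:

```python
import re

def find_events(lines, patterns, num_before=2, num_after=2):
    # For each configured regex, emit the matched record together with
    # the surrounding records as context. lines: one session's records,
    # already chronologically sorted.
    events = []
    for name, regex in patterns.items():
        for i, line in enumerate(lines):
            if re.search(regex, line):
                events.append({
                    "event": name,
                    "before": lines[max(0, i - num_before):i],
                    "matched": line,
                    "after": lines[i + 1:i + 1 + num_after],
                })
    return events

session = [
    "/addToCart/X243NR0I",
    "/myAccount",
    "/product/JBBFRERB",
    "/remFromCart/X243NR0I",
    "/product/5KW4AKFN",
]
events = find_events(session, {"removedFromCart": r"/remFromCart/"})
```

The slicing clamps the context window at the session boundaries, so a match near the start or end of a session simply gets a shorter context.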

For log data without sessions, no reducer is required. The mapper simply goes through the log lines, captures a set of lines in a buffer, and then applies the same pattern matching done in the reducer logic for log data with sessions.
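
For the sessionless case, the same matching can be done in a single pass with a bounded buffer. Again a rough sketch, not the actual mapper:

```python
import re
from collections import deque

def scan_stream(lines, regex, num_before=2, num_after=1):
    # Single-pass scan: keep only the last num_before lines in memory,
    # and keep collecting trailing lines for each match until its
    # "after" window is full.
    before = deque(maxlen=num_before)
    matches = []
    pending = []  # matches still collecting their trailing context
    for line in lines:
        for m in pending:
            m["after"].append(line)
        pending = [m for m in pending if len(m["after"]) < num_after]
        if re.search(regex, line):
            m = {"before": list(before), "matched": line, "after": []}
            matches.append(m)
            if num_after > 0:
                pending.append(m)
        before.append(line)
    return matches

matches = scan_stream(["a", "b", "ERROR x", "c", "d"], r"ERROR")
```

Because the buffer is bounded, memory use stays constant regardless of log size, which suits a streaming mapper.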

Shopping Cart Example

When an item is removed from a shopping cart, the retailer may be interested in gaining some insight from the incident. The context surrounding the event, when a user removes an item from the shopping cart, is manually analyzed to gain those insights. We used a pattern to match the remove-from-cart request as captured in the web server log. Here are some potential findings from analyzing the log lines in the surrounding context. Whatever the finding is, it provides valuable insight into user behavior.

  • A long pause, possibly implying that the user went to another site for a better price
  • Browsing other products, suggesting that the user may replace an existing item in the shopping cart with a newly browsed item


Here are some sample log lines from the input. The data is from the IIS web server logs of a fictitious online health food and supplement store.

2016-1-15  21:59:05  __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD  /product/1E3E3KR1
2016-1-15  21:59:36  __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD  /search/multivitamin
2016-1-15  21:59:43  __RequestVerificationToken_Lw__=UIXITDK6TP;+.ASPXAUTH=SYEZXMP6T0NO74HL  /product/K152052J
2016-1-15  22:00:13  __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD  /product/69FEZCNC
2016-1-15  22:01:28  __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD  /addToCart/69FEZCNC /product/69FEZCNC
2016-1-15  22:02:51  __RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD  /product/8L307T2Z

The fields in the log are as follows. Incidentally, the content of the log output for most web servers is configurable, and various other fields besides those in the list below are available.

  1. Date time stamp
  2. Cookie
  3. URL
  4. Referrer URL
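
As a rough illustration, a line in the sample format above could be split into those fields as follows. The field positions match the samples in this post, not necessarily every IIS log configuration:

```python
def parse_line(line):
    # Fields are separated by runs of whitespace; the referrer URL is
    # optional (absent when the request has no referrer).
    parts = line.split()
    record = {
        "date": parts[0],
        "time": parts[1],
        "cookie": parts[2],
        "url": parts[3],
        "referrer": parts[4] if len(parts) > 4 else None,
    }
    # The cookie serves as the session identifier for the composite key
    record["session_id"] = record["cookie"]
    return record

line = ("2016-1-15  22:01:28  "
        "__RequestVerificationToken_Lw__=PFLN3HOA60;+.ASPXAUTH=T0KW3N1S30QOALBD  "
        "/addToCart/69FEZCNC /product/69FEZCNC")
rec = parse_line(line)
```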


Here is some sample output showing the log lines surrounding a remove-from-cart event. Here we find that the user browsed some item before removing an existing item from the shopping cart, and browsed another item after the removal.

event: removedFromCart
2016-1-15  20:15:20  __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0  /addToCart/X243NR0I /product/X243NR0I
2016-1-15  20:16:49  __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0  /myAccount
2016-1-15  20:17:06  __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0  /product/JBBFRERB
2016-1-15  20:18:19  __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0  /remFromCart/X243NR0I
2016-1-15  20:20:08  __RequestVerificationToken_Lw__=9IRRD9D08U;+.ASPXAUTH=9P1V6M8WI43333B0  /product/5KW4AKFN


Correlating with Multiple Log Files

In some situations, it is necessary to merge sort log files from multiple servers in a cluster before applying the pattern matching Map Reduce shown here. Consider logs generated by a Cassandra NOSQL database cluster.

For troubleshooting certain problems associated with global cluster behavior, it may be necessary to merge sort multiple log files from various servers in the Cassandra cluster, because in a cluster, behavior on one machine may be influenced by behavior on other machines. Without log consolidation, we cannot get the holistic view necessary to troubleshoot complex problems related to cluster behavior. In a future post, I will report on a Map Reduce solution for merge sorting multiple log files.
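
Conceptually, the consolidation is an n-way merge keyed on each line's timestamp. As a single-machine stand-in for that Map Reduce solution, Python's `heapq.merge` captures the idea (assuming each file is already sorted and each line starts with a zero-padded timestamp, so string comparison matches time order):

```python
import heapq

def merge_logs(*log_files):
    # Each log file is already chronologically sorted; merge them into
    # one stream ordered by the leading 19-character timestamp
    # ("YYYY-MM-DD HH:MM:SS") of each line.
    return list(heapq.merge(*log_files, key=lambda line: line[:19]))

node1 = ["2016-01-15 21:59:05 node1 gc pause",
         "2016-01-15 22:00:13 node1 compaction"]
node2 = ["2016-01-15 21:59:43 node2 hinted handoff"]
merged = merge_logs(node1, node2)
```

`heapq.merge` never materializes the inputs, so the same pattern scales to large files read lazily, line by line.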

Summing Up

Valuable insights can be gained from even simple analysis of log data, as shown in this post. The tutorial for the example used in this post is available.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on the cloud, including installation, configuration and testing.






About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.
This entry was posted in Big Data, Hadoop and Map Reduce, Log Analysis, and Web Analytics.
