Counting Unique Mobile App Users with HyperLogLog


Continuing along the theme of real time analytic with approximate algorithms, the  focus this time is approximate cardinality estimation. To put the ideas in a context, the use case we will be working with is for counting number of unique users for a mobile app. Analyzing the trend of such unique counts, reveal valuable insights into the popularity of an app.

We will be using  HyperLogLog algorithm which is available in my open source project hoidla as a Java API. The storm implementation of the use case is available in my other open source web analytic project visitante.

HyperLogLog

Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be estimated using memory proportional to cardinality size.However, when the cardinality is very high (e.g., IP address, phone number), such a naive approach is not pragmatic. Various approximate probabilistic algorithms can be used for cardinality estimation.

HyperLogLog is based on analyzing some bit patterns in the hashed value of an item. We look at the length of the sequence of most significant zero bits. The maximum length among such zero bit sequences from all the hashed values is indicative of the unique item count.

In reality, to improve the quality of the result, multiple independent hash functions are used and the longest sequence zero bits resulting from each hash function is used to produce the final averaged unique count value.

Instead of using multiple hash functions, we will use a technique called stochastic averaging. We set aside a sequence of significant bits of the hash for buckets. From the remaining bits we find the sequence of most significant zero bits. For example, to use 256 buckets we use the most significant 8 bits for buckets and the remaining 24 bits to find the sequence of most significant zero bits. For each bucket, we maintain a count of maximum length of sequence of zero bits.

Small Cardinality

The HyperLogLog algorithm as described in the original paper does not work well for small cardinality. As suggested in the paper, when the unique count falls below a threshold an algorithm based on probabilistic properties of random allocations.

The correction for  small cardinality is included in the implementation in hoidla. However, if the knowledge of small cardinality is not known a priori, you could do simple hash based counting instead of HyperLogLog.

Mobile App Usage Data

By instrumenting the  SDK calls of the app, usage data is created with the following 5 fields. The phone number is used as an identifier for the user.

  1. Date
  2. Time
  3. Session ID
  4. Phone area code
  5. Phone number

As we will see later, the phone area code is used to partition the data in the Storm implementation. The data is fed to storm through a message queue. Here is some sample data

2014-11-16 02:17:48 c080f1fa-6d79-11e4-aa9d-c42c030f8af1 310 (310)6121967
2014-11-16 02:17:50 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:50 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:52 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:54 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:56 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:56 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:57 d5f8956b-6d79-11e4-891d-c42c030f8af1 213 (213)7703334

Storm Topology

The storm topology architecture consists of  a spout and two bolts. The spout reads usage  from a message queue. A simple message queue abstraction available in my open source project chombo  is used. It facilitates usage of any message queue. I  have used Redis.

The data emitted by the spout is field grouped on area code. It’s essentially hash partitioning on the area code. All the data for the same are code is processed by the same bolt instance of  UniqueVisitorCounterBolt. Each bolt instance maintains an instance of HyperLogLog object. When a new tuple arrives, it”s processed by the HyperLogLog object.

When the UniqueVisitorCounterBolt receives a tick tuple, it obtains the unique count from the HyperLogLog object and emits the tuple (boltID,  uniqueCount).

The tuple emitted by  UniqueVisitorCounterBolt  is processed by the  UniqueVisitorAggregatorBolt, of which there is only instance. As the name suggests it aggregates the counts, which is simply summing up the unique counts from the predecessor bolt layer. The result is written to a message queue, which any client application can consume for further trend analysis of unique user count data.

The output is simply a the tuple (currentTime, uniqueCount).  Here is some sample output. As new records are processed, the unique count grows. Unique count is always monotonically increasing.

1416191997120  76
1416192007121  76
1416192017123  77
1416192027584  78
1416192037125  78
1416192047126  78
1416192057126  78
1416192067128  79
1416192077128  79

Temporal Reference

The unique count has a temporal reference point for time series data. The counting is with respect to some point in past. In case of of a mobile app, it will be the launch date of the app.

Although generally, the temporal reference does not change once set, some times it may be necessary to change  it. There is a mechanism to clear the counter and start counting with  a clean slate.

A simple publish subscribe mechanism is used to dispatch commands to bolt instances. A simple pub sub interface is available in chombo, with implementation for different messaging provider. I have used Redis. On receipt of a tick tuple, theUniqueVisitorCounterBolt bolt  fetches  the command if any from the pub-sub system. Then the HyperLogLog counter is cleared.

Summing Up

We have gone through a exercise of using probabilistic counting algorithm for approximate unique count estimation. The HyperLogLog algorithm has been used for network traffic data analysis and query planner in databases. Step by step instruction to run this use case in available in this tutorial document.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Approximate Query, Big Data, Data Science, Mobile, Real Time Processing, Storm and tagged , , . Bookmark the permalink.

One Response to Counting Unique Mobile App Users with HyperLogLog

  1. Pingback: Alarm Flooding Control with Event Clustering Using Spark Streaming | Mawazo

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s