## Counting Unique Mobile App Users with HyperLogLog

Continuing the theme of real-time analytics with approximate algorithms, the focus this time is approximate cardinality estimation. To put the ideas in context, the use case we will work with is counting the number of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the popularity of an app.

We will be using the HyperLogLog algorithm, which is available as a Java API in my open source project hoidla. The Storm implementation of the use case is available in visitante, my other open source web analytics project.

## HyperLogLog

Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be computed exactly using memory proportional to the cardinality. However, when the cardinality is very high (e.g., IP addresses or phone numbers), such a naive approach is not pragmatic. Various approximate probabilistic algorithms can be used for cardinality estimation instead.

HyperLogLog is based on analyzing bit patterns in the hashed value of an item. We look at the length of the run of most significant zero bits. The maximum length among such zero bit runs over all the hashed values is indicative of the unique item count.
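The intuition can be demonstrated with a few lines of Java. This is an illustrative sketch, not the hoidla implementation; CRC32 merely stands in for the hash function. Among n distinct hashed items, the longest run of leading zero bits tends to be close to log2(n).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrates the core observation behind HyperLogLog: among n distinct
// hashed items, the longest run of leading zero bits is close to log2(n).
// CRC32 stands in for the hash function; this is not the hoidla code.
public class LeadingZeroIntuition {
    public static int leadingZeros(String item) {
        CRC32 crc = new CRC32();
        crc.update(item.getBytes(StandardCharsets.UTF_8));
        // keep the low 32 bits of the CRC and count leading zero bits
        return Integer.numberOfLeadingZeros((int) crc.getValue());
    }

    public static int maxZeroRun(int numItems) {
        int maxRun = 0;
        for (int i = 0; i < numItems; ++i) {
            maxRun = Math.max(maxRun, leadingZeros("user-" + i));
        }
        return maxRun;
    }

    public static void main(String[] args) {
        // log2(100000) is about 17, so the longest run lands near there
        System.out.println(maxZeroRun(100000));
    }
}
```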

In reality, to improve the quality of the result, multiple independent hash functions are used, and the longest run of zero bits resulting from each hash function is used to produce a final averaged unique count.

Instead of using multiple hash functions, we will use a technique called stochastic averaging. We set aside a sequence of significant bits of the hash for bucket selection. From the remaining bits we find the run of most significant zero bits. For example, to use 256 buckets with a 32-bit hash, we use the most significant 8 bits for the bucket index and the remaining 24 bits to find the run of most significant zero bits. For each bucket, we maintain the maximum zero-run length observed.
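The scheme above can be sketched as a minimal Java class, assuming a 32-bit hash, 256 buckets, and the standard harmonic-mean estimator; the constants and MD5-based hashing here are illustrative choices, not the hoidla implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Minimal HyperLogLog sketch with stochastic averaging: the top 8 hash
// bits select one of 256 registers, the remaining 24 bits supply the
// zero-run length. Illustrative only; not the hoidla implementation.
public class SimpleHyperLogLog {
    private static final int BUCKET_BITS = 8;
    private static final int NUM_BUCKETS = 1 << BUCKET_BITS;   // 256
    private final byte[] registers = new byte[NUM_BUCKETS];

    // 32-bit hash taken from the first 4 bytes of an MD5 digest
    private static int hash(String item) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(item.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                 | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void add(String item) {
        int h = hash(item);
        int bucket = h >>> (32 - BUCKET_BITS);   // most significant 8 bits
        int rest = h << BUCKET_BITS;             // remaining 24 bits, left aligned
        // rank = length of the leading zero run plus one
        int rank = Integer.numberOfLeadingZeros(rest) + 1;
        int maxRank = 32 - BUCKET_BITS + 1;      // all 24 remaining bits zero
        if (rank > maxRank) rank = maxRank;
        if (rank > registers[bucket]) registers[bucket] = (byte) rank;
    }

    // raw harmonic-mean estimate, without the small-cardinality correction
    // that the original paper recommends
    public double estimate() {
        double alpha = 0.7213 / (1 + 1.079 / NUM_BUCKETS);
        double sum = 0.0;
        for (byte r : registers) sum += Math.pow(2.0, -r);
        return alpha * NUM_BUCKETS * NUM_BUCKETS / sum;
    }

    // restart counting from a clean slate
    public void clear() { Arrays.fill(registers, (byte) 0); }
}
```

With 256 buckets the expected relative error is roughly 1.04 / sqrt(256), i.e., about 6.5 percent, using only 256 bytes of register state.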

## Small Cardinality

The HyperLogLog algorithm as described in the original paper does not work well for small cardinality. As suggested in the paper, when the unique count falls below a threshold, an algorithm based on the probabilistic properties of random allocations is used instead.

The correction for small cardinality is included in the implementation in hoidla. However, if it is known a priori that the cardinality will remain small, you could do simple hash based counting instead of HyperLogLog.
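The small-range correction the paper suggests is linear counting: with m registers of which z are still empty, the cardinality is estimated as m * ln(m / z). A minimal sketch of that formula, shown here for illustration rather than as the hoidla code:

```java
// Linear counting: the small-cardinality fallback suggested in the
// HyperLogLog paper. With m registers of which z are still empty,
// the estimate is m * ln(m / z). Requires at least one empty register.
public class LinearCounting {
    public static double estimate(int numBuckets, int emptyBuckets) {
        return numBuckets * Math.log((double) numBuckets / emptyBuckets);
    }
}
```

For example, with 256 registers of which half are still empty, the estimate is 256 * ln(2), about 177 unique items.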

## Mobile App Usage Data

Instrumenting the SDK calls of the app produces usage data with the following five fields. The phone number is used as the identifier for the user.

1. Date
2. Time
3. Session ID
4. Phone area code
5. Phone number

As we will see later, the phone area code is used to partition the data in the Storm implementation. The data is fed to Storm through a message queue. Here is some sample data:

```
2014-11-16 02:17:48 c080f1fa-6d79-11e4-aa9d-c42c030f8af1 310 (310)6121967
2014-11-16 02:17:50 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:50 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:52 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:54 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:56 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:56 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:57 d5f8956b-6d79-11e4-891d-c42c030f8af1 213 (213)7703334
```
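A record in this layout can be split into its five fields with a few lines of Java. The class name and field names here are illustrative, chosen to match the field list above:

```java
// Parses one usage record into its five fields: date, time, session ID,
// phone area code, and phone number. The whitespace-delimited layout
// assumed here matches the sample data above.
public class UsageRecord {
    public final String date, time, sessionId, areaCode, phoneNumber;

    public UsageRecord(String line) {
        String[] f = line.trim().split("\\s+");
        date = f[0];
        time = f[1];
        sessionId = f[2];
        areaCode = f[3];
        phoneNumber = f[4];
    }
}
```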

## Storm Topology

The Storm topology consists of a spout and two bolts. The spout reads usage data from a message queue. A simple message queue abstraction available in my open source project chombo is used; it facilitates usage of any message queue implementation. I have used Redis.

The data emitted by the spout is field grouped on the area code, which is essentially hash partitioning on the area code. All the data for the same area code is processed by the same instance of UniqueVisitorCounterBolt. Each bolt instance maintains an instance of a HyperLogLog object. When a new tuple arrives, it is processed by the HyperLogLog object.
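The effect of field grouping can be simulated in plain Java: hashing the grouping field modulo the number of bolt instances always routes the same area code to the same instance. This is a sketch of the routing idea, not Storm's actual internal partitioning code.

```java
// Simulates field grouping on area code: hash partitioning that routes
// all tuples with the same area code to the same bolt instance.
// Illustrative only; not Storm's internal implementation.
public class AreaCodePartitioner {
    public static int boltFor(String areaCode, int numBolts) {
        // Math.floorMod guards against negative hashCode() values
        return Math.floorMod(areaCode.hashCode(), numBolts);
    }
}
```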

When the UniqueVisitorCounterBolt receives a tick tuple, it obtains the unique count from the HyperLogLog object and emits the tuple (boltID, uniqueCount).

The tuple emitted by UniqueVisitorCounterBolt is processed by the UniqueVisitorAggregatorBolt, of which there is only one instance. As the name suggests, it aggregates the counts, which is simply summing up the unique counts from the predecessor bolt layer. The result is written to a message queue, which any client application can consume for further trend analysis of unique user count data.
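The aggregation step is a plain sum, and it is worth noting why that is sound here: field grouping by area code sends each phone number to exactly one counter bolt, so the per-bolt user sets are disjoint and no user is double counted. A minimal sketch:

```java
// Sums per-bolt unique counts. Valid here because field grouping by
// area code makes the per-bolt user sets disjoint, so no user is
// counted by more than one bolt instance.
public class UniqueCountAggregator {
    public static long aggregate(long[] perBoltCounts) {
        long total = 0;
        for (long count : perBoltCounts) {
            total += count;
        }
        return total;
    }
}
```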

The output is simply the tuple (currentTime, uniqueCount). Here is some sample output. As new records are processed, the unique count grows; it is monotonically non-decreasing.

```
1416191997120  76
1416192007121  76
1416192017123  77
1416192027584  78
1416192037125  78
1416192047126  78
1416192057126  78
1416192067128  79
1416192077128  79
```

## Temporal Reference

The unique count in the time series is relative to a temporal reference point; the counting is with respect to some point in the past. In the case of a mobile app, that will typically be the launch date of the app.

Although the temporal reference generally does not change once set, sometimes it may be necessary to change it. There is a mechanism to clear the counter and start counting with a clean slate.

A simple publish-subscribe mechanism is used to dispatch commands to bolt instances. A simple pub-sub interface is available in chombo, with implementations for different messaging providers; I have used Redis. On receipt of a tick tuple, the UniqueVisitorCounterBolt fetches the pending command, if any, from the pub-sub system. If a clear command has been received, the HyperLogLog counter is cleared.

## Summing Up

We have gone through an exercise of using a probabilistic counting algorithm for approximate unique count estimation. The HyperLogLog algorithm has been used for network traffic data analysis and in database query planners. Step-by-step instructions to run this use case are available in this tutorial document.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop and Spark deployments on the cloud, including installation, configuration, and testing.