Continuing along the theme of real time analytics with approximate algorithms, the focus this time is approximate cardinality estimation. To put the ideas in context, the use case we will work with is counting the number of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the popularity of an app.
We will be using the HyperLogLog algorithm, which is available as a Java API in my open source project hoidla. The Storm implementation of the use case is available in my other open source web analytics project visitante.
Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be computed exactly using memory proportional to the cardinality. However, when the cardinality is very high (e.g., IP addresses, phone numbers), such a naive approach is not pragmatic. Various approximate probabilistic algorithms can be used instead for cardinality estimation.
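To make the memory cost of the naive approach concrete, here is a minimal sketch: keep every distinct item in a hash set. Memory grows linearly with the cardinality, which is exactly what makes this impractical for very high cardinality streams. The class and method names are illustrative, not from hoidla.

```java
import java.util.HashSet;
import java.util.Set;

// Naive exact cardinality: memory grows linearly with the number of unique items.
class NaiveCardinality {
    static int exactCount(Iterable<String> items) {
        Set<String> unique = new HashSet<>();
        for (String item : items) {
            unique.add(item);    // every distinct item occupies memory
        }
        return unique.size();
    }
}
```

For a few thousand users this is fine; for tens of millions of phone numbers or IP addresses the set itself becomes the bottleneck, which is where probabilistic estimators earn their keep.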
HyperLogLog is based on analyzing bit patterns in the hashed value of an item. Specifically, we look at the length of the run of most significant zero bits. The maximum length among such zero bit runs across all the hashed values is indicative of the unique item count.
In principle, to improve the quality of the result, multiple independent hash functions are used, and the longest run of zero bits resulting from each hash function is averaged to produce the final unique count estimate.
Instead of using multiple hash functions, we use a technique called stochastic averaging. We set aside a number of the most significant bits of the hash to select a bucket. From the remaining bits we find the length of the run of most significant zero bits. For example, to use 256 buckets with a 32 bit hash, we use the most significant 8 bits to select the bucket and the remaining 24 bits to find the run of most significant zero bits. For each bucket, we maintain the maximum zero run length observed.
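The bucketing step can be sketched in a few lines. This assumes a 32 bit hash and 256 buckets as in the example above; the class and field names are illustrative and not the hoidla API.

```java
// Sketch of stochastic averaging with 256 buckets over a 32-bit hash.
class BucketSketch {
    static final int BUCKET_BITS = 8;               // 2^8 = 256 buckets
    static final int NUM_BUCKETS = 1 << BUCKET_BITS;
    final int[] maxZeroRun = new int[NUM_BUCKETS];  // longest zero run seen per bucket

    void add(int hash) {
        int bucket = hash >>> (32 - BUCKET_BITS);   // most significant 8 bits pick the bucket
        int rest = hash << BUCKET_BITS;             // remaining 24 bits, left aligned
        // length of the run of most significant zero bits in the remaining bits
        int zeroRun = (rest == 0) ? 32 - BUCKET_BITS : Integer.numberOfLeadingZeros(rest);
        maxZeroRun[bucket] = Math.max(maxZeroRun[bucket], zeroRun);
    }
}
```

Because each item lands in exactly one bucket, the 256 buckets behave like 256 independent estimators over disjoint substreams, which is what replaces the multiple hash functions.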
The HyperLogLog algorithm as described in the original paper does not work well for small cardinality. As suggested in the paper, when the unique count falls below a threshold, an algorithm based on the probabilistic properties of random allocations is used instead.
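The small range correction in the paper is linear counting, which estimates cardinality from the number of still empty buckets using the random allocation (balls in bins) argument. A sketch of the formula follows; this is illustrative, not the hoidla code.

```java
// Linear counting correction for small cardinality:
// estimate = m * ln(m / v), where m is the total number of buckets
// and v is the number of buckets that are still empty.
class LinearCounting {
    static double estimate(int m, int v) {
        return m * Math.log((double) m / v);
    }
}
```

Intuitively, when few items have been seen, most buckets are empty and the fraction of empty buckets carries more information than the zero run lengths do.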
The correction for small cardinality is included in the hoidla implementation. However, if it is known a priori that the cardinality will remain small, you could do simple hash based counting instead of HyperLogLog.
Mobile App Usage Data
By instrumenting the SDK calls of the app, usage data is created with the following fields. The phone number is used as an identifier for the user.
- Time stamp
- Session ID
- Phone area code
- Phone number
As we will see later, the phone area code is used to partition the data in the Storm implementation. The data is fed to Storm through a message queue. Here is some sample data:
2014-11-16 02:17:48 c080f1fa-6d79-11e4-aa9d-c42c030f8af1 310 (310)6121967
2014-11-16 02:17:50 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:50 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:52 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:54 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:56 d0997c05-6d79-11e4-a360-c42c030f8af1 408 (408)4937187
2014-11-16 02:17:56 cbd4af2e-6d79-11e4-b7e1-c42c030f8af1 339 (339)8242149
2014-11-16 02:17:57 d5f8956b-6d79-11e4-891d-c42c030f8af1 213 (213)7703334
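Splitting a record on whitespace is enough to recover the fields. A minimal illustrative parse, with field positions assumed from the sample above:

```java
// Parses one usage record into (timeStamp, sessionID, areaCode, phoneNumber).
// Field positions are assumed from the sample data; not the visitante parser.
class UsageRecordParser {
    static String[] parse(String record) {
        String[] parts = record.split("\\s+");
        String timeStamp = parts[0] + " " + parts[1];   // date and time rejoined
        return new String[] { timeStamp, parts[2], parts[3], parts[4] };
    }
}
```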
The Storm topology consists of a spout and two bolts. The spout reads usage records from a message queue. A simple message queue abstraction available in my open source project chombo is used, which facilitates the use of any message queue. I have used Redis.
The data emitted by the spout is field grouped on the area code, which is essentially hash partitioning on the area code. All the data for the same area code is processed by the same instance of UniqueVisitorCounterBolt. Each bolt instance maintains an instance of a HyperLogLog object. When a new tuple arrives, it's processed by the HyperLogLog object.
When the UniqueVisitorCounterBolt receives a tick tuple, it obtains the unique count from the HyperLogLog object and emits the tuple (boltID, uniqueCount).
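The two code paths of the counter bolt can be sketched in plain Java, stripped of the Storm plumbing. The real bolt delegates to the hoidla HyperLogLog object; a HashSet stands in here so the sketch is self-contained and runnable, and all names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of UniqueVisitorCounterBolt's logic: data tuples update the counter,
// tick tuples read it out. HashSet stands in for the HyperLogLog object.
class CounterBoltSketch {
    private final Set<String> counter = new HashSet<>();  // stand-in for HyperLogLog
    private final int boltId;

    CounterBoltSketch(int boltId) { this.boltId = boltId; }

    // data tuple path: add the user identifier (phone number) to the counter
    void onDataTuple(String phoneNumber) {
        counter.add(phoneNumber);
    }

    // tick tuple path: emit (boltID, uniqueCount) downstream
    long[] onTickTuple() {
        return new long[] { boltId, counter.size() };
    }
}
```

With the real HyperLogLog in place of the set, the per-bolt memory stays bounded regardless of how many distinct phone numbers the partition sees.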
The tuple emitted by UniqueVisitorCounterBolt is processed by the UniqueVisitorAggregatorBolt, of which there is only one instance. As the name suggests, it aggregates the counts by simply summing the unique counts from the predecessor bolt layer. The result is written to a message queue, which any client application can consume for further trend analysis of the unique user count data.
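The aggregation logic amounts to remembering the latest count per upstream bolt and summing. A sketch under that assumption; the class is illustrative, not the visitante bolt:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the aggregator: keep the latest count reported by each upstream
// bolt instance (i.e., each area code partition) and return the sum.
class UniqueCountAggregator {
    private final Map<Integer, Long> latestCounts = new HashMap<>();

    long update(int boltId, long uniqueCount) {
        latestCounts.put(boltId, uniqueCount);      // replace the stale count for this bolt
        long total = 0;
        for (long c : latestCounts.values()) {
            total += c;                             // sum across all partitions
        }
        return total;
    }
}
```

Summing per-partition counts is exact here because the field grouping guarantees the partitions are disjoint: a given phone number only ever reaches one counter bolt.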
The output is simply the tuple (currentTime, uniqueCount). Here is some sample output. As new records are processed, the unique count grows monotonically.
1416191997120 76
1416192007121 76
1416192017123 77
1416192027584 78
1416192037125 78
1416192047126 78
1416192057126 78
1416192067128 79
1416192077128 79
The unique count has a temporal reference point for time series data. The counting is with respect to some point in the past. In the case of a mobile app, it will typically be the launch date of the app.
Although the temporal reference generally does not change once set, sometimes it may be necessary to change it. There is a mechanism to clear the counter and start counting with a clean slate.
A simple publish subscribe mechanism is used to dispatch commands to bolt instances. A simple pub-sub interface is available in chombo, with implementations for different messaging providers. I have used Redis. On receipt of a tick tuple, the UniqueVisitorCounterBolt bolt fetches the command, if any, from the pub-sub system. If a clear command is found, the HyperLogLog counter is cleared.
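The command handling can be sketched as follows. The channel interface, the "clear" command string, and the use of a set as the counter stand-in are all assumptions for illustration; the chombo pub-sub API is not shown here.

```java
import java.util.Set;

// Hypothetical pub-sub channel: poll() returns the pending command, or null.
interface CommandChannel {
    String poll();
}

// Sketch of command handling on a tick tuple: poll the channel and clear
// the counter on a "clear" command. A Set stands in for the HyperLogLog object.
class CommandHandler {
    static boolean handleTick(CommandChannel channel, Set<String> counter) {
        String command = channel.poll();
        if ("clear".equals(command)) {
            counter.clear();          // restart counting with a clean slate
            return true;
        }
        return false;
    }
}
```

Polling only on tick tuples keeps command handling off the hot path, at the cost of the clear taking effect with up to one tick interval of delay.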
We have gone through an exercise of using a probabilistic counting algorithm for approximate unique count estimation. The HyperLogLog algorithm has been used for network traffic data analysis and in query planners in databases. Step by step instructions to run this use case are available in this tutorial document.