HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the ''exact'' cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, but can only approximate the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of more than 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm.


Terminology

In the original paper by Flajolet ''et al.'' and in related literature on the count-distinct problem, the term "cardinality" is used to mean the number of distinct elements in a data stream with repeated elements. However, in the theory of multisets the term refers to the sum of multiplicities of each member of a multiset. This article uses Flajolet's definition for consistency with the sources.


Algorithm

The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is ''n'', an estimate for the number of distinct elements in the set is 2^''n''.

In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the observation above.

The simple estimate of cardinality obtained this way has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these per-subset estimates into an estimate of the cardinality of the whole set.
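The single-observable estimator described above can be sketched in a few lines of Python. The SHA-1-based 32-bit hash and the item names are assumptions made for the example; any well-mixed hash function would do:

```python
import hashlib

def h32(value: str) -> int:
    """Illustrative 32-bit hash (an assumption): first 4 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")

def leading_zeros(x: int, bits: int = 32) -> int:
    """Number of leading zero bits in the bits-wide representation of x."""
    return bits - x.bit_length()

def crude_estimate(items) -> int:
    """If the maximum number of leading zeros observed is n, estimate the
    number of distinct elements as 2**n."""
    n = max(leading_zeros(h32(v)) for v in items)
    return 2 ** n
```

A single run of this estimator can easily be off by a factor of two or more, which is precisely the variance that the register-splitting and harmonic mean of the full algorithm are introduced to tame.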


Operations

HyperLogLog has three main operations: ''add'' to add a new element to the set, ''count'' to obtain the cardinality of the set, and ''merge'' to obtain the union of two sets. Some derived operations can be computed using the inclusion–exclusion principle, such as the cardinality of the intersection or the cardinality of the difference between two HyperLogLogs, by combining the merge and count operations.

The data of the HyperLogLog is stored in an array M of m counters (or "registers") that are initialized to 0. The array M initialized from a multiset S is called the ''HyperLogLog sketch'' of S.


Add

The ''add'' operation consists of computing the hash of the input datum v with a hash function h, getting the first b bits (where b is \log_2(m)), and adding 1 to them to obtain the address of the register to modify. With the remaining bits compute \rho(w), which returns the position of the leftmost 1, where the leftmost position is 1 (in other words: the number of leading zeros plus 1). The new value of the register will be the maximum between the current value of the register and \rho(w):

: \begin{align} x &:= h(v) \\ j &:= 1 + \langle x_1 x_2 \ldots x_b \rangle_2 \\ w &:= x_{b+1} x_{b+2} \ldots \\ M[j] &:= \max(M[j], \rho(w)) \end{align}
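A sketch of the add step in Python, with b = 4 index bits (so m = 16 registers) and an illustrative SHA-1-based 32-bit hash as an assumption; note that the register index is 0-based in code where the formula above uses 1 + \langle\cdot\rangle_2:

```python
import hashlib

B = 4                    # index bits, so m = 2**B = 16 registers
M = 2 ** B
HASH_BITS = 32
W_BITS = HASH_BITS - B   # bits left over for rho(w)

def h32(value: str) -> int:
    # illustrative hash (an assumption): first 4 bytes of SHA-1
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")

def rho(w: int, bits: int) -> int:
    """Position of the leftmost 1-bit in the bits-wide word w,
    i.e. the number of leading zeros plus one."""
    return bits - w.bit_length() + 1

def add(registers: list, value: str) -> None:
    x = h32(value)
    j = x >> W_BITS                  # first B bits choose the register
    w = x & ((1 << W_BITS) - 1)      # remaining bits feed rho
    registers[j] = max(registers[j], rho(w, W_BITS))

registers = [0] * M
for v in ("apple", "banana", "apple"):   # duplicates leave the sketch unchanged
    add(registers, v)
```

Because the register only keeps a maximum, adding the same element twice has no effect, which is what makes the sketch insensitive to duplicates.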


Count

The ''count'' algorithm consists of computing the harmonic mean of the m registers, and using a constant to derive an estimate E of the count:

: Z = \left(\sum_{j=1}^{m} 2^{-M[j]}\right)^{-1}

: \alpha_m = \left(m \int_0^\infty \left(\log_2\left(\frac{2+u}{1+u}\right)\right)^m \,du\right)^{-1}

: E = \alpha_m m^2 Z

The intuition is that, n being the unknown cardinality of S, each subset M_j will have n/m elements. Then \max_{x \in M_j} \rho(x) should be close to \log_2(n/m). The harmonic mean of the quantities 2^{M[j]} is mZ, which should be near n/m. Thus, m^2 Z should be approximately n. Finally, the constant \alpha_m is introduced to correct a systematic multiplicative bias present in m^2 Z due to hash collisions.
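The raw count step (without the small- and large-range corrections discussed under practical considerations) can be sketched as follows, using the closed-form approximation of \alpha_m rather than the integral:

```python
def alpha(m: int) -> float:
    """Closed-form approximation of the bias-correction constant alpha_m."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)

def raw_count(registers: list) -> float:
    """Raw HyperLogLog estimate E = alpha_m * m**2 * Z, where Z is the
    reciprocal of the sum of 2**(-M[j]) over all registers."""
    m = len(registers)
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha(m) * m * m * z
```

On an all-zero sketch this returns about \alpha_m m rather than 0, which illustrates why the small-range correction below is needed.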


Practical considerations

The constant \alpha_m is not simple to calculate, and can be approximated with the formula

: \alpha_m \approx \begin{cases} 0.673, & \text{if } m=16; \\ 0.697, & \text{if } m=32; \\ 0.709, & \text{if } m=64; \\ \frac{0.7213}{1+\frac{1.079}{m}}, & \text{if } m\ge128. \end{cases}

The HyperLogLog technique, though, is biased for small cardinalities below a threshold of \frac{5}{2}m. The original paper proposes using a different algorithm for small cardinalities known as Linear Counting. In the case where the estimate provided above is less than the threshold (E < \frac{5}{2}m), the alternative calculation can be used:
# Let V be the count of registers equal to 0.
# If V = 0, use the standard HyperLogLog estimator E above.
# Otherwise, use Linear Counting: E^\star = m\log\left(\frac{m}{V}\right)

Additionally, for very large cardinalities approaching the limit of the size of the registers (E > \frac{2^{32}}{30} for 32-bit registers), the cardinality can be estimated with:

: E^\star = -2^{32}\log\left(1-\frac{E}{2^{32}}\right)

With the above corrections for lower and upper bounds, the error can be estimated as \sigma = 1.04/\sqrt{m}.
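Putting the corrections together, a self-contained sketch of the full estimator (the 5/2·m and 2^32/30 thresholds are the ones stated above; the hash width is a parameter):

```python
import math

def estimate(registers: list, hash_bits: int = 32) -> float:
    """HyperLogLog estimate with the small-range (Linear Counting) and
    large-range corrections described above."""
    m = len(registers)
    alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
    e = alpha * m * m / sum(2.0 ** -r for r in registers)
    if e <= 2.5 * m:                      # small range: try Linear Counting
        v = registers.count(0)            # number of empty registers
        if v != 0:
            return m * math.log(m / v)
    elif e > (2 ** hash_bits) / 30:       # large range correction
        return -(2 ** hash_bits) * math.log(1.0 - e / 2 ** hash_bits)
    return e
```

On an empty sketch every register is 0, so Linear Counting yields m·log(m/m) = 0, the correct answer, where the raw estimator would report roughly \alpha_m m.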


Merge

The ''merge'' operation for two HLLs (\mathit{hll}_1, \mathit{hll}_2) consists of obtaining the maximum for each pair of registers j = 1 \ldots m:

: \mathit{hll}_\text{union}[j] = \max(\mathit{hll}_1[j], \mathit{hll}_2[j])
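In Python the merge is a register-wise maximum; both sketches must use the same number of registers and the same hash function (the snippet can only check the former):

```python
def merge(hll_1: list, hll_2: list) -> list:
    """Union sketch: the element-wise maximum of the two register arrays."""
    if len(hll_1) != len(hll_2):
        raise ValueError("sketches must have the same number of registers")
    return [max(a, b) for a, b in zip(hll_1, hll_2)]
```

The merged sketch is identical to what adding every element of both streams to a single sketch would have produced, which is why merge is lossless for unions.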


Complexity

To analyze the complexity, the data streaming (\epsilon, \delta) model is used, which analyzes the space necessary to get a 1 \pm \epsilon approximation with a fixed success probability 1 - \delta. The relative error of HLL is 1.04/\sqrt{m} and it needs O(\epsilon^{-2} \log\log n + \log n) space, where n is the set cardinality and m is the number of registers (each usually less than one byte in size).

The ''add'' operation depends on the size of the output of the hash function. As this size is fixed, we can consider the running time for the add operation to be O(1). The ''count'' and ''merge'' operations depend on the number of registers m and have a theoretical cost of O(m). In some implementations (Redis) the number of registers is fixed, and the cost is considered to be O(1) in the documentation.


HLL++

The HyperLogLog++ algorithm proposes several improvements to the HyperLogLog algorithm to reduce memory requirements and increase accuracy in some ranges of cardinalities:

* A 64-bit hash function is used instead of the 32 bits used in the original paper. This reduces the hash collisions for large cardinalities, allowing the large range correction to be removed.
* Some bias is found for small cardinalities when switching from Linear Counting to the HLL counting. An empirical bias correction is proposed to mitigate the problem.
* A sparse representation of the registers is proposed to reduce memory requirements for small cardinalities, which can be later transformed to a dense representation if the cardinality grows.


Streaming HLL

When the data arrives in a single stream, the Historic Inverse Probability or martingale estimator significantly improves the accuracy of the HLL sketch and uses 36% less memory to achieve a given error level. This estimator is provably optimal for any duplicate-insensitive approximate distinct counting sketch on a single stream. The single-stream scenario also leads to variants in the HLL sketch construction. HLL-TailCut+ uses 45% less memory than the original HLL sketch, but at the cost of being dependent on the data insertion order and not being able to merge sketches.


