The Flajolet–Martin algorithm is an
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream (the
count-distinct problem
In computer science, the count-distinct problem
(also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements.
This is a well-known proble ...
). The algorithm was introduced by
Philippe Flajolet and
G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by
Marianne Durand and
Philippe Flajolet, and "
HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by
Philippe Flajolet et al.
In their 2010 article "An optimal algorithm for the distinct elements problem",
Daniel M. Kane,
Jelani Nelson and
David P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimal ''O''(1) update and reporting times.
The algorithm
Assume that we are given a
hash function
A hash function is any Function (mathematics), function that can be used to map data (computing), data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a ...
that maps input
to integers in the range
multiset
In mathematics, a multiset (or bag, or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements. The number of instances given for each element is called the ''multiplicity'' of ...
M is as follows:
# Initialize a bit-vector BITMAP to be of length
L and contain all 0s.
# For each element
x in
M:
## Calculate the index
i = \rho(\mathrm(x)).
## Set
\mathrm = 1.
# Let
R denote the smallest index
i such that
\mathrm = 0.
# Estimate the cardinality of
M as
2^R / \phi, where
\phi \approx 0.77351.
The idea is that if
n is the number of distinct elements in the multiset
M, then
\mathrm /math> is accessed approximately n/2 times, \mathrm /math> is accessed approximately n/4 times and so on. Consequently, if i \gg \log_2 n, then \mathrm /math> is almost certainly 0, and if i \ll \log_2 n, then \mathrm /math> is almost certainly 1. If i \approx \log_2 n, then \mathrm /math> can be expected to be either 1 or 0.
The correction factor \phi \approx 0.77351 is found by calculations, which can be found in the original article.
Improving accuracy
A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution has been to run the algorithm multiple times with
k different hash functions and combine the results from the different runs. One idea is to take the mean of the
k results together from each hash function, obtaining a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use the
median
The median of a set of numbers is the value separating the higher half from the lower half of a Sample (statistics), data sample, a statistical population, population, or a probability distribution. For a data set, it may be thought of as the “ ...
, which is less prone to be influences by outliers. The problem with this is that the results can only take form
2^R / \phi, where
R is integer. A common solution is to combine both the mean and the median: Create
k \cdot l hash functions and split them into
k distinct groups (each of size
l). Within each group use the mean for aggregating together the
l results, and finally take the median of the
k group estimates as the final estimate.
The 2007
HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses the
harmonic mean
In mathematics, the harmonic mean is a kind of average, one of the Pythagorean means.
It is the most appropriate average for ratios and rate (mathematics), rates such as speeds, and is normally only used for positive arguments.
The harmonic mean ...
to combine them into an estimate for the original cardinality.
[
]
See also
* Streaming algorithm
In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically one-pass algorithm, just one. These algorithms are desi ...
* HyperLogLog
References
Additional sources
*
{{DEFAULTSORT:Flajolet-Martin algorithm
Algorithms