Count-distinct Problem

In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.


Formal definition

: Instance: Consider a stream of elements x_1, x_2, \ldots, x_s with repetitions. Let n denote the number of distinct elements in the stream, and let the set of distinct elements be \{e_1, e_2, \ldots, e_n\} .
: Objective: Find an estimate \widehat{n} of n using only m storage units, where m \ll n .

An example of an instance for the cardinality estimation problem is the stream a, b, a, c, d, b, d . For this instance, n = \left|\{a, b, c, d\}\right| = 4 .


Naive solution

The naive solution to the problem is as follows:
* Initialize a counter, c, to zero: c \leftarrow 0 .
* Initialize an efficient dictionary data structure D, such as a hash table or a search tree, in which insertion and membership queries can be performed quickly.
* For each element x_i , issue a membership query. If x_i is not a member of D ( x_i \notin D ), insert x_i into D and increase the counter by one ( c \leftarrow c + 1 ); otherwise do nothing.
* Output n = c .

As long as the number of distinct elements is not too big, D fits in main memory and an exact answer can be retrieved. However, this approach does not scale for bounded storage, or if the computation performed for each element x_i should be minimized. In such a case, several streaming algorithms have been proposed that use a fixed number of storage units.
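As an illustration of the naive approach, here is a minimal Python sketch (a hypothetical helper, not part of the original article) that uses a built-in set as the dictionary structure D:

def count_distinct_exact(stream):
    seen = set()  # the dictionary structure D
    c = 0         # counter of distinct elements seen so far
    for x in stream:
        if x not in seen:   # membership query
            seen.add(x)     # insertion
            c += 1
    return c

# The example stream from the formal definition has 4 distinct elements.
print(count_distinct_exact(["a", "b", "a", "c", "d", "b", "d"]))  # 4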


Streaming algorithms

To handle the bounded storage constraint, streaming algorithms use randomization to produce a non-exact estimation of the distinct number of elements, n . State-of-the-art estimators hash every element e_j into a low-dimensional data sketch using a hash function, h(e_j) . The different techniques can be classified according to the data sketches they store.


Min/max sketches

Min/max sketches store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. present a max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm.

The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element e_j is associated with a uniform random variable h(e_j) \sim U(0,1) , the expected minimum value of h(e_1), h(e_2), \ldots, h(e_n) is 1/(n+1) . The hash function guarantees that h(e_j) is identical for all the appearances of e_j . Thus, the existence of duplicates does not affect the value of the extreme order statistics.

There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation describes the Flajolet–Martin algorithm, a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff.
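To make the intuition concrete, the following is a minimal, hypothetical min-sketch estimator in Python (not one of the cited estimators): it maps each element to a pseudo-uniform value in (0, 1] with a hash function, keeps only the minimum, and inverts the expected-minimum formula above. A single minimum has very high variance; practical estimators such as HyperLogLog aggregate many such observations.

import hashlib

def pseudo_uniform_hash(e):
    # Deterministic pseudo-uniform value in (0, 1] derived from a hash of e,
    # so repeated occurrences of e map to the same value and do not change the minimum.
    digest = hashlib.sha256(str(e).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

def min_sketch_estimate(stream):
    m = min(pseudo_uniform_hash(e) for e in stream)  # the sketch: a single stored value
    # The expected minimum of n independent U(0,1) values is 1/(n+1); invert to estimate n.
    return 1 / m - 1

print(min_sketch_estimate(["a", "b", "a", "c", "d", "b", "d"]))  # a (noisy) estimate of 4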


Bottom-''m'' sketches

Bottom-''m'' sketches are a generalization of min sketches, which maintain the m minimal values, where m \geq 1 . See Cosma et al. for a theoretical overview of count-distinct estimation algorithms, and Metwally for a practical overview with comparative simulation results.
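A common bottom-''m'' estimator (often called the KMV, or k-minimum-values, estimator) returns (m-1)/v_m , where v_m is the m-th smallest hashed value. The Python sketch below is a hypothetical illustration, not taken from the cited surveys; it reuses the pseudo_uniform_hash function defined in the min-sketch example above.

import heapq

def bottom_m_estimate(stream, m):
    # Maintain the m smallest distinct hashed values with a max-heap of size m
    # (values are negated because heapq implements a min-heap).
    heap = []     # negated hashed values
    kept = set()  # hashed values currently in the sketch, used to skip duplicates
    for e in stream:
        v = pseudo_uniform_hash(e)  # defined in the min-sketch example above
        if v in kept:
            continue
        if len(heap) < m:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:                          # smaller than the current m-th smallest
            evicted = -heapq.heappushpop(heap, -v)  # evict the largest kept value
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < m:
        return len(heap)             # fewer than m distinct values seen: the count is exact
    return (m - 1) / (-heap[0])      # KMV estimate from the m-th smallest hashed value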


Python implementation of Knuth's CVM algorithm

from random import uniform

def algorithm_d(stream, s: int):
    m = len(stream)  # We assume that this is given to us in advance.
    t = -1  # Note that Knuth indexes the stream from 1.
    p = 1
    a = 0
    buffer = []
    while t < (m - 1):
        t += 1
        a = stream[t]
        u = uniform(0, 1)
        # Remove any pair for the current element from the buffer.
        buffer = list(filter(lambda x: x[1] != a, buffer))
        if u < p:
            if len(buffer) < s:
                buffer.append((u, a))
            else:
                # Buffer is full: compare u with the largest stored value.
                buffer = sorted(buffer)
                if u > buffer[-1][0]:
                    p = u
                else:
                    p = buffer[-1][0]
                    buffer.pop()
                    buffer.append((u, a))
    return len(buffer) / p
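A possible usage sketch, assuming the function above (the output varies between runs, since the algorithm is randomized):

estimate = algorithm_d(["a", "b", "a", "c", "d", "b", "d"], s=3)
print(estimate)  # an estimate of the 4 distinct elements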


CVM algorithm

Compared to other approximation algorithms for the count-distinct problem, the CVM algorithm (named by Donald Knuth after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM algorithm provides an unbiased estimator for the number of distinct elements in a stream, in addition to the standard (ε-δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth.

 Initialize p \leftarrow 1
 Initialize max buffer size s , where s \geq 1
 Initialize an empty buffer B
 For each element a_t in data stream A of size n do:
     If (a_t, u') is in B for some u' , remove it from B
     Choose u uniformly at random from [0, 1)
     If u < p then
         If |B| < s then
             Insert (a_t, u) into B
         Else
             Let (a', u') be the pair in B such that u' = \max \{ u'' : (a'', u'') \in B \}   /* (a', u') whose u' is maximum in B */
             If u > u' then
                 p \leftarrow u
             Else
                 Replace (a', u') with (a_t, u)
                 p \leftarrow u'
 Return |B| / p

The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, which adds a while loop to ensure B is reduced:

 Initialize p \leftarrow 1
 Initialize max buffer size s , where s \geq 1
 Initialize an empty buffer B
 For each element a_t in data stream A of size n do:
     If (a_t, u') is in B for some u' , remove it from B
     Choose u uniformly at random from [0, 1)
     If u < p then
         Insert (a_t, u) into B
         While |B| = s do
             Remove every element (a', u') of B with u' \geq \frac{p}{2}
             p \leftarrow \frac{p}{2}
 Return |B| / p
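For comparison with the implementation of the first variant given earlier, here is a hedged Python sketch of the while-loop variant as reconstructed above (a sketch under those assumptions, not a reference implementation):

from random import uniform

def cvm_with_halving(stream, s: int):
    p = 1.0
    buffer = {}  # maps each sampled element to its random value u
    for a in stream:
        buffer.pop(a, None)          # remove any previous pair for this element
        u = uniform(0, 1)
        if u < p:
            buffer[a] = u
            while len(buffer) == s:  # buffer full: thin it out and halve p
                buffer = {e: v for e, v in buffer.items() if v < p / 2}
                p /= 2
    return len(buffer) / p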


Weighted count-distinct problem

In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights. Formally,
: Instance: A stream of weighted elements x_1, x_2, \ldots, x_s with repetitions, and an integer m . Let n be the number of distinct elements, namely n = \left|\{x_1, x_2, \ldots, x_s\}\right| , and let these elements be \{e_1, e_2, \ldots, e_n\} . Finally, let w_j be the weight of e_j .
: Objective: Find an estimate \widehat{w} of w = \sum_{j=1}^{n} w_j using only m storage units, where m \ll n .

An example of an instance for the weighted problem is: a(3), b(4), a(3), c(2), d(3), b(4), d(3) . For this instance, e_1=a, e_2=b, e_3=c, e_4=d , the weights are w_1=3, w_2=4, w_3=2, w_4=3 , and \sum_{j=1}^{4} w_j = 12 .

As an application example, x_1, x_2, \ldots, x_s could be IP packets received by a server. Each packet belongs to one of n IP flows e_1, e_2, \ldots, e_n . The weight w_j can be the load imposed by flow e_j on the server. Thus, \sum_{j=1}^{n} w_j represents the total load imposed on the server by all the flows to which packets x_1, x_2, \ldots, x_s belong.


Solving the weighted count-distinct problem

Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem. For example, the weighted estimator proposed by Cohen et al. can be obtained when the continuous max sketches estimator is extended to solve the weighted problem. In particular, the HyperLogLog algorithm can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all known algorithms for the weighted problem.
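To illustrate how an extreme order statistics sketch generalizes to weights, the following hypothetical Python sketch uses the standard exponential-minimum trick (in the spirit of continuous sketches, not the exact estimator of the cited papers): each distinct element e_j with weight w_j is mapped deterministically to values distributed as \mathrm{Exp}(w_j) , so the minimum over distinct elements is \mathrm{Exp}(w) -distributed, and (k-1) / \sum_{i=1}^{k} M_i over k independent repetitions is an unbiased estimate of w .

import hashlib
import math

def weighted_min_sketch_estimate(stream, k=64):
    # stream: iterable of (element, weight) pairs; repeated elements carry the same weight.
    def pseudo_uniform(e, i):
        # Deterministic pseudo-uniform value in (0, 1] for element e and repetition i.
        digest = hashlib.sha256(f"{i}:{e}".encode()).digest()
        return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

    minima = [math.inf] * k
    for e, w in stream:
        for i in range(k):
            x = -math.log(pseudo_uniform(e, i)) / w  # Exp(w)-distributed, via the inverse CDF
            if x < minima[i]:
                minima[i] = x
    # Each minimum is Exp(total weight)-distributed, so (k-1)/sum(minima) is unbiased.
    return (k - 1) / sum(minima)

stream = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
print(weighted_min_sketch_estimate(stream))  # fluctuates around the true total weight, 12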


See also

* Count–min sketch
* Streaming algorithm
* Maximum likelihood
* Minimum-variance unbiased estimator

