A counting Bloom filter is a generalized

data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...

Bloom filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...

, that is used to test whether a count number of a given element is smaller than a given threshold when a sequence of elements is given. As a generalized form of

false positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...

matches are possible, but

false negatives A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...

are not – in other words, a query returns either "possibly bigger or equal than the threshold" or "definitely smaller than the threshold".

Algorithm description

Most of the parameters are defined same with

, such as ''m, k. m'' is the number of counters in counting Bloom filter, which is expansion of ''m'' bits in Bloom filter. An ''empty counting Bloom filter'' is a ''m'' counters, all set to 0. Similar to Bloom filter, there must also be ''k'' different

hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...

s defined, each of which

maps A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes. Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Althoug ...

or hashes some set element to one of the ''m'' counter array positions, generating a uniform random distribution. It is also similar that ''k'' is a constant, much smaller than ''m'', which is proportional to the number of elements to be added. The main generalization of Bloom filter is adding an element. To ''add'' an element, feed it to each of the ''k'' hash functions to get ''k'' array positions and ''increment'' the counters 1 at all these positions. To ''query'' for an element with a threshold ''θ'' (test whether the count number of an element is smaller than ''θ''), feed it to each of the ''k'' hash functions to get ''k'' counter positions. If any of the counters at these positions is less than ''θ'', the count number of element is definitely less than ''θ'' – if it were more and equal, then all the corresponding counters would have been greater or equal to ''θ''. If all are greater or equal to ''θ'', then either the count is really greater or equal to ''θ'', or the counters have by chance been greater or equal to ''θ''. If all are greater or equal to θ even though the count is smaller than ''θ'', this circumstance is defined as

. This also should be minimized like Bloom filter. About hashing problem and advantages, see

. A counting Bloom filter is essentially the same data structure as

count–min sketch In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear s ...

es, but are used differently.

Potential for false negatives

Several implementations of counting bloom filters allow for deletion, by decrementing each of the ''k'' counters for a given input. This will introduce the probability of false negatives during a query if the deleted input has not previously been inserted into the filter. Gou ''et al.'' present the problem in great detail, and provide heuristics for the parameters ''m'', ''k'', and ''n'' which minimize the probability of false negatives.

Probability of false positives

The same assumptions in

, which hash functions make insertions uniform random, is also assumed here. In the ''m'' pots, ''kn'' balls are inserted randomly. So the probability of one of counter in counting Bloom filter counts ''l'' is

b(l, kn, \frac) = (\frac)^l (1 - \frac)^

, where ''b'' is binomial distribution. By the way, counting Bloom filter determines an element is greater or equal to ''θ'' when the corresponding ''k'' counters are greater or equal to ''θ.'' Therefore, the probability that counting Bloom filter determines an element is greater or equal to ''θ'' is

p_(\theta, k, n, m) = (1 - \sum\limits_ b(l, kn, \frac))^k

. This is different from formal definition of

in counting Bloom filter. However, following the assumption in Bloom filter, above probability is defined as false positive of counting Bloom filter. If ''θ''=1, the equation becomes false positive of Bloom filter.

Optimal number of hash functions

For large but fixed ''n'' and ''m'', the false positive decreases from ''k''=1 to a point defined

k_

, and increases from

k_

to positive infinity.
Kim et al. (2019)
shows numerical values of

k_

within

1\leq\theta\leq30

. For

\theta > 30

they suggested using the floor or ceiling of

\frac (0.2037\theta + 0.9176)

References

{{reflist Hash-based data structures