A counting Bloom filter is a generalized
data structure
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...
of
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
, that is used to test whether a count number of a given
element is smaller than a given threshold when a sequence of elements is given. As a generalized form of
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
,
false positive
A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...
matches are possible, but
false negatives
A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...
are not – in other words, a query returns either "possibly bigger or equal than the threshold" or "definitely smaller than the threshold".
Algorithm description
Most of the parameters are defined same with
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
, such as ''m, k. m'' is the number of counters in counting Bloom filter, which is expansion of ''m'' bits in Bloom filter. An ''empty counting Bloom filter'' is a ''m'' counters, all set to 0. Similar to Bloom filter, there must also be ''k'' different
hash function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...
s defined, each of which
maps
A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes.
Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Althoug ...
or hashes some set element to one of the ''m'' counter array positions, generating a uniform random distribution. It is also similar that ''k'' is a constant, much smaller than ''m'', which is proportional to the number of elements to be added.
The main generalization of Bloom filter is adding an element. To ''add'' an element, feed it to each of the ''k'' hash functions to get ''k'' array positions and ''increment'' the counters 1 at all these positions.
To ''query'' for an element with a threshold ''θ'' (test whether the count number of an element is smaller than ''θ''), feed it to each of the ''k'' hash functions to get ''k'' counter positions. If any of the counters at these positions is less than ''θ'', the count number of element is definitely less than ''θ'' – if it were more and equal, then all the corresponding counters would have been greater or equal to ''θ''. If all are greater or equal to ''θ'', then either the count is really greater or equal to ''θ'', or the counters have by chance been greater or equal to ''θ''. If all are greater or equal to θ even though the count is smaller than ''θ'', this circumstance is defined as
false positive
A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...
. This also should be minimized like Bloom filter.
About hashing problem and advantages, see
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
.
A counting Bloom filter is essentially the same data structure as
count–min sketch
In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear s ...
es, but are used differently.
Potential for false negatives
Several implementations of counting bloom filters allow for deletion, by decrementing each of the ''k'' counters for a given input. This will introduce the probability of false negatives during a query if the deleted input has not previously been inserted into the filter. Gou ''et al.'' present the problem in great detail, and provide heuristics for the parameters ''m'', ''k'', and ''n'' which minimize the probability of false negatives.
Probability of false positives
The same assumptions in
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
, which hash functions make insertions uniform random, is also assumed here. In the ''m'' pots, ''kn'' balls are inserted randomly. So the probability of one of counter in counting Bloom filter counts ''l'' is
,
where ''b'' is binomial distribution. By the way, counting Bloom filter determines an element is greater or equal to ''θ'' when the corresponding ''k'' counters are greater or equal to ''θ.'' Therefore, the probability that counting Bloom filter determines an element is greater or equal to ''θ'' is
.
This is different from formal definition of
false positive
A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...
in counting Bloom filter. However, following the assumption in Bloom filter, above probability is defined as false positive of counting Bloom filter. If ''θ''=1, the equation becomes false positive of Bloom filter.
Optimal number of hash functions
For large but fixed ''n'' and ''m'', the false positive decreases from ''k''=1 to a point defined
, and increases from
to positive infinity.
Kim et al. (2019)shows numerical values of
within
. For
they suggested using the floor or ceiling of
.
References
{{reflist
Hash-based data structures