A Bloom filter is a space-efficient
probabilistic
data structure
, conceived by
Burton Howard Bloom
in 1970, that is used to test whether an
element is a member of a
set.
False positive matches are possible, but
false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with the
counting Bloom filter
variant); the more items added, the larger the probability of false positives.
Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free
hashing techniques were applied. He gave the example of a
hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient
core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.
More generally, fewer than 10
bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.
Algorithm description

An ''empty Bloom filter'' is a bit array of ''m'' bits, all set to 0. There must also be ''k'' different hash functions defined, each of which maps or hashes some set element to one of the ''m'' array positions, generating a uniform random distribution. Typically, ''k'' is a small constant which depends on the desired false positive rate ''ε'', while ''m'' is proportional to ''k'' and the number of elements to be added.
To ''add'' an element, feed it to each of the ''k'' hash functions to get ''k'' array positions. Set the bits at all these positions to 1.
To ''query'' for an element (test whether it is in the set), feed it to each of the ''k'' hash functions to get ''k'' array positions. If ''any'' of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, ''or'' the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
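The add and query procedures can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, the use of SHA-256, and the trick of simulating several hash functions by prefixing a seed to the item are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m bits and k salted hash functions (illustrative)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: k "different" hash functions are simulated by
        # prefixing the item with the seed values 0, 1, ..., k - 1.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False means "definitely not in set"; True means "possibly in set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=7)
bf.add("apple")
print("apple" in bf)   # an added element is always reported as present
print("durian" in bf)  # usually False; True here would be a false positive
```

Because a query only reads bits and an insertion only sets them, an element that has been added can never be reported as absent.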
The requirement of designing ''k'' different independent hash functions can be prohibitive for large ''k''. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass ''k'' different initial values (such as 0, 1, ..., ''k'' − 1) to a hash function that takes an initial value, or add (or append) these values to the key. For larger ''m'' and/or ''k'', independence among the hash functions can be relaxed with negligible increase in false positive rate. (Specifically, the indices can be derived effectively using enhanced double hashing and triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.)
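As a rough sketch of the index-derivation idea, the following Python function produces ''k'' positions from two base hash values. The function name, the choice of SHA-256, the slicing of one digest into two 64-bit values, and the exact recurrence (one common formulation of enhanced double hashing) are assumptions for illustration.

```python
import hashlib


def enhanced_double_hashing(item: bytes, k: int, m: int) -> list[int]:
    """Derive k array positions in [0, m) from two base hash values."""
    digest = hashlib.sha256(item).digest()
    # Assumption: the two base hashes are disjoint 64-bit slices of one
    # wide digest, in the spirit of slicing a wide hash into bit fields.
    x = int.from_bytes(digest[:8], "big") % m
    y = int.from_bytes(digest[8:16], "big") % m
    positions = [x]
    for i in range(1, k):
        x = (x + y) % m  # plain double-hashing step
        y = (y + i) % m  # "enhanced" step: the increment itself changes
        positions.append(x)
    return positions


print(enhanced_double_hashing(b"example-key", k=5, m=1000))
```

Only one digest is computed per element regardless of ''k'', which is the practical appeal of this family of schemes.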
Removing an element from this simple Bloom filter is impossible because there is no way to tell which of the bits it maps to should be cleared. Although setting any one of those bits to zero suffices to remove the element, it would also remove any other elements that happen to map onto that bit. Since the simple algorithm provides no way to determine whether any other elements have been added that affect the bits for the element to be removed, clearing any of the bits would introduce the possibility of false negatives.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter.
It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
Space and time advantages

While risking false positives, Bloom filters have a substantial space advantage over other data structures for representing sets, such as
self-balancing binary search trees,
tries,
hash tables, or simple
arrays or
linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception since they can share storage between elements with equal prefixes). However, Bloom filters do not store the data items at all, and a separate solution must be provided for the actual storage. Linked structures incur an additional linear space overhead for pointers. A Bloom filter with a 1% error and an optimal value of ''k'', in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.
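As a quick check of these two figures, assume the standard sizing relation for an optimally configured filter, ''m''/''n'' = −ln ''ε'' / (ln 2)<sup>2</sup>, which holds under the usual independence approximation. For ''ε'' = 0.01:
: <math>\frac{m}{n} = \frac{\ln 100}{(\ln 2)^2} \approx \frac{4.605}{0.4805} \approx 9.6 \text{ bits per element,}</math>
and each further tenfold reduction of ''ε'' adds
: <math>\frac{\ln 10}{(\ln 2)^2} \approx \frac{2.303}{0.4805} \approx 4.8 \text{ bits per element.}</math>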
However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with ''k'' = 1.
Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, ''O''(''k''), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse
hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its lookups are independent and can be parallelized.
To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when ''k'' = 1. If ''k'' = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (''k'' greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (''k'' and ''m'') are chosen well, about half of the bits will be set, and these will be apparently random, minimizing redundancy and maximizing information content.
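The "about half" figure can be checked with a short calculation, assuming the usual approximation that a given bit is still 0 after ''n'' insertions with probability <math>e^{-kn/m}</math>, and the choice ''k'' = (''m''/''n'') ln 2 that minimizes the false positive probability:
: <math>e^{-kn/m} = e^{-\left(\frac{m}{n}\ln 2\right)\frac{n}{m}} = e^{-\ln 2} = \frac{1}{2}.</math>
Under these assumptions each bit is 0 or 1 with equal probability, which is the setting in which a bit array carries the most information per bit.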
Probability of false positives

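Assume that a hash function selects each array position with equal probability. If ''m'' is the number of bits in the array, the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is
: <math>1 - \frac{1}{m}.</math>
If ''k'' is the number of hash functions and there is no significant correlation between them, then the probability that the bit is not set to 1 by any of the hash functions is
: <math>\left(1 - \frac{1}{m}\right)^k.</math>
We can use the well-known identity for <math>e^{-1}</math>
: <math>\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^m = \frac{1}{e}</math>
to conclude that, for large ''m'',
: <math>\left(1 - \frac{1}{m}\right)^k = \left(\left(1 - \frac{1}{m}\right)^m\right)^{k/m} \approx e^{-k/m}.</math>
If we have inserted ''n'' elements, the probability that a certain bit is still 0 is
: <math>\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m};</math>
the probability that it is 1 is therefore
: <math>1 - \left(1 - \frac{1}{m}\right)^{kn} \approx 1 - e^{-kn/m}.</math>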
Now test membership of an element that is not in the set. Each of the ''k'' array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is often given as
: <math>\varepsilon = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k.</math>
This is not strictly correct as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation we have that the probability of false positives decreases as ''m'' (the number of bits in the array) increases, and increases as ''n'' (the number of inserted elements) increases.
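To make the dependence on ''m'' and ''n'' concrete, the sketch below evaluates the commonly quoted expression (1 − (1 − 1/''m'')<sup>''kn''</sup>)<sup>''k''</sup> and its large-''m'' approximation (1 − e<sup>−''kn''/''m''</sup>)<sup>''k''</sup>. The function names and sample parameters are illustrative assumptions, and both functions inherit the independence assumption discussed above.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(k*n))^k, assuming independent bit probabilities."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def false_positive_rate_approx(m: int, n: int, k: int) -> float:
    """Large-m approximation (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Illustrative parameters: the rate falls as m grows and rises as n grows.
for m, n in [(10_000, 1_000), (20_000, 1_000), (10_000, 2_000)]:
    print(m, n, false_positive_rate(m, n, 7), false_positive_rate_approx(m, n, 7))
```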
The true probability of a false positive, without assuming independence, is
: <math>\frac{1}{m^{k(n+1)}} \sum_{i=1}^{m} i^k \, i! \binom{m}{i} \left\{ {kn \atop i} \right\}</math>
where the <math>\left\{ {kn \atop i} \right\}</math> denote Stirling numbers of the second kind.
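For small parameters the exact probability can be evaluated directly. The sketch below assumes the exact expression is the sum over ''i'' of ''i''<sup>''k''</sup> · ''i''! · C(''m'', ''i'') · S(''kn'', ''i''), divided by ''m''<sup>''k''(''n''+1)</sup>. Here ''i'' counts the distinct bits set after ''kn'' hash evaluations and S denotes a Stirling number of the second kind. The function names are illustrative, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from math import comb, factorial


def stirling2_row(n: int) -> list[int]:
    """Return [S(n, 0), ..., S(n, n)] using S(a, b) = b*S(a-1, b) + S(a-1, b-1)."""
    row = [1]  # S(0, 0) = 1
    for a in range(1, n + 1):
        new = [0] * (a + 1)
        for b in range(1, a + 1):
            prev_same = row[b] if b < len(row) else 0
            new[b] = b * prev_same + row[b - 1]
        row = new
    return row


def exact_false_positive(m: int, n: int, k: int) -> Fraction:
    """Exact false positive probability for m bits, n items, k hash functions."""
    s = stirling2_row(k * n)
    # S(kn, i) is zero for i > kn, so the sum can stop at min(m, kn).
    total = sum(
        i**k * factorial(i) * comb(m, i) * s[i]
        for i in range(1, min(m, k * n) + 1)
    )
    return Fraction(total, m ** (k * (n + 1)))


m, n, k = 8, 2, 2  # deliberately tiny, illustrative parameters
exact = exact_false_positive(m, n, k)
independent = (1 - (1 - Fraction(1, m)) ** (k * n)) ** k
print(float(exact), float(independent))
```

A hand-checkable case is ''m'' = 2, ''n'' = 1, ''k'' = 2: the two hash evaluations hit the same bit with probability 1/2 (false positive probability 1/4) or different bits with probability 1/2 (false positive probability 1), giving 5/8 in total.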
An alternative analysis arriving at the same approximation without the assumption of independence is given by Mitzenmacher and Upfal. After all ''n'' items have been added to the Bloom filter, let ''q'' be the fraction of the ''m'' bits that are set to 0. (That is, the number of bits still set to 0 is ''qm''.) Then, when testing membership of an element not in the set, for the array position given by any of the ''k'' hash functions, the probability that the bit is found set to 1 is <math>1 - q</math>. So the probability that all ''k'' hash functions find their bit set to 1 is <math>(1 - q)^k</math>. Further, the expected value of ''q'' is the probability that a given array position is left untouched by each of the ''k'' hash functions for each of the ''n'' items, which is (as above)
: <math>E[q] = \left(1 - \frac{1}{m}\right)^{kn}.</math>
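It is possible to prove, without the independence assumption, that ''q'' is very strongly concentrated around its expected value. In particular, from the Azuma–Hoeffding inequality, they prove that
: <math>\Pr\left(\left|q - E[q]\right| \ge \frac{\lambda}{m}\right) \le 2\exp\left(-\frac{2\lambda^2}{kn}\right).</math>
Because of this, we can say that the exact probability of false positives is
: <math>\sum_{t} \Pr(q = t)\,(1 - t)^k \approx (1 - E[q])^k = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k</math>
as before.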
Optimal number of hash functions
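The number of hash functions, ''k'', must be a positive integer. Putting this constraint aside, for a given ''m'' and ''n'', the value of ''k'' that minimizes the false positive probability is
: <math>k = \frac{m}{n} \ln 2.</math>
The required number of bits, ''m'', given ''n'' (the number of inserted elements) and a desired false positive probability ''ε'' (and assuming the optimal value of ''k'' is used) can be computed by substituting the optimal value of ''k'' in the probability expression above:
: <math>\varepsilon = \left(1 - e^{-\left(\frac{m}{n} \ln 2\right) \frac{n}{m}}\right)^{\frac{m}{n} \ln 2}</math>
which can be simplified to:
: <math>\ln \varepsilon = -\frac{m}{n} (\ln 2)^2.</math>
This results in:
: <math>m = -\frac{n \ln \varepsilon}{(\ln 2)^2}.</math>
So the optimal number of bits per element is
: <math>\frac{m}{n} = -\frac{\log_2 \varepsilon}{\ln 2} \approx -1.44 \log_2 \varepsilon</math>
with the corresponding number of hash functions ''k'' (ignoring integrality):
: <math>k = -\frac{\ln \varepsilon}{\ln 2} = -\log_2 \varepsilon.</math>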
This means that for a given false positive probability ''ε'', the length of a Bloom filter ''m'' is proportionate to the number of elements being filtered ''n'' and the required number of hash functions only depends on the target false positive probability ''ε''.
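A small sizing helper shows how these relations might be used in practice. It assumes ''m'' = −''n'' ln ''ε'' / (ln 2)<sup>2</sup> and ''k'' = (''m''/''n'') ln 2. The function name and the decision to round ''m'' up and ''k'' to the nearest positive integer are illustrative choices, not part of the derivation.

```python
import math


def optimal_parameters(n: int, epsilon: float) -> tuple[int, int]:
    """Return (m, k): bit count and hash count for n items at target rate epsilon."""
    m = math.ceil(-n * math.log(epsilon) / (math.log(2) ** 2))
    # k must be a positive integer in practice; rounding is an assumption here.
    k = max(1, round(m / n * math.log(2)))
    return m, k


for eps in (0.1, 0.01, 0.001):
    m, k = optimal_parameters(1_000, eps)
    print(f"epsilon={eps}: m={m} bits ({m / 1_000:.2f} per element), k={k}")
```

Doubling ''n'' at a fixed ''ε'' doubles ''m'' (up to rounding) but leaves ''k'' unchanged, matching the observation that ''k'' depends only on the target false positive probability.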
The formula <math>\left(1 - e^{-kn/m}\right)^k</math> is approximate for three reasons. First, and of least concern, it approximates <math>1 - \frac{1}{m}</math> as <math>e^{-1/m}</math>, which is a good asymptotic approximation (i.e., which holds as ''m'' → ∞). Second, of more concern, it assumes that during the membership test the event that one tested bit is set to 1 is independent of the event that any other tested bit is set to 1. Third, of most concern, it assumes that ''k'' is fortuitously integral.
Goel and Gupta, however, give a rigorous upper bound that makes no approximations and requires no assumptions. They show that the false positive probability for a finite Bloom filter with ''m'' bits (''m'' > 1), ''n'' elements, and ''k'' hash functions is at most
: <math>\varepsilon \le \left(1 - e^{-\frac{k(n + 0.5)}{m - 1}}\right)^k.</math>
This bound can be interpreted as saying that the approximate formula <math>\left(1 - e^{-kn/m}\right)^k</math> can be applied at a penalty of at most half an extra element and at most one fewer bit.
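The "half an extra element, one fewer bit" reading translates directly into code. The sketch below assumes the bound is the approximate formula (1 − e<sup>−''kn''/''m''</sup>)<sup>''k''</sup> evaluated with ''n'' + 0.5 elements and ''m'' − 1 bits. The function names and sample parameters are illustrative.

```python
import math


def approximate_rate(m: float, n: float, k: int) -> float:
    """Approximate false positive rate (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def goel_gupta_bound(m: int, n: int, k: int) -> float:
    """Upper bound: the approximate formula with n + 0.5 elements and m - 1 bits."""
    if m <= 1:
        raise ValueError("the bound requires m > 1")
    return approximate_rate(m - 1, n + 0.5, k)


m, n, k = 9_586, 1_000, 7  # illustrative parameters
print(approximate_rate(m, n, k), goel_gupta_bound(m, n, k))
```

For filters of realistic size the two values are close, so the penalty for using the rigorous bound is small.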
Approximating the number of items in a Bloom filter
The number of items in a Bloom filter can be approximated with the following formula,
: <math>n^* = -\frac{m}{k} \ln\left[1 - \frac{X}{m}\right],</math>
where
''n''* is an estimate of the number of items in the filter,
''m'' is the length (size) of the filter,
''k'' is the number of hash functions, and
''X'' is the number of bits set to one.
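A direct transcription of this estimator, assuming the formula ''n''* = −(''m''/''k'') ln(1 − ''X''/''m''), might look as follows. The function name is illustrative, and the explicit check for a saturated filter is an added assumption, since the logarithm diverges when every bit is set.

```python
import math


def estimate_item_count(m: int, k: int, x: int) -> float:
    """Estimate n* = -(m / k) * ln(1 - X / m) from the number of set bits X."""
    if x >= m:
        raise ValueError("filter is saturated; the estimate diverges")
    return -(m / k) * math.log(1.0 - x / m)


# Illustrative round trip: with m = 10000 and k = 7, inserting n = 1000 items
# is expected to set about m * (1 - e^(-k*n/m)) bits.
m, k, n = 10_000, 7, 1_000
expected_set_bits = round(m * (1.0 - math.exp(-k * n / m)))
print(expected_set_bits, estimate_item_count(m, k, expected_set_bits))
```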
The union and intersection of sets
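Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets, and Bloom filters can be used to approximate both. Consider two Bloom filters of length ''m'' built with the same ''k'' hash functions from sets ''A''* and ''B''*, and let |''A''| and |''B''| be the number of bits set to one in each filter. The item counts can be estimated as
: <math>n(A^*) = -\frac{m}{k} \ln\left[1 - \frac{|A|}{m}\right]</math>
and
: <math>n(B^*) = -\frac{m}{k} \ln\left[1 - \frac{|B|}{m}\right].</math>
The size of their union can be estimated as
: <math>n(A^* \cup B^*) = -\frac{m}{k} \ln\left[1 - \frac{|A \cup B|}{m}\right],</math>
where
<math>|A \cup B|</math> is the number of bits set to one in either of the two Bloom filters. Finally, the intersection can be estimated as
: <math>n(A^* \cap B^*) = n(A^*) + n(B^*) - n(A^* \cup B^*),</math>
using the three formulas together.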
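The three estimates can be tried out on a toy example. The sketch assumes the estimator −(''m''/''k'') ln(1 − ''X''/''m''), with ''X'' the number of set bits, applied to each filter and to the bitwise OR of the two. It uses Python integers as bitmaps, with salted SHA-256 standing in for the ''k'' hash functions. `M`, `K`, `bitmap`, and `estimate` are hypothetical names chosen for the example.

```python
import hashlib
import math

M, K = 4096, 4  # illustrative filter length and number of hash functions


def bitmap(items) -> int:
    """Bloom filter contents as a Python int used as an M-bit bitmap."""
    bits = 0
    for item in items:
        for seed in range(K):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return bits


def estimate(bits: int) -> float:
    """-(M / K) * ln(1 - X / M), where X is the number of set bits."""
    x = bin(bits).count("1")
    return -(M / K) * math.log(1.0 - x / M)


a = bitmap(range(0, 300))    # 300 items
b = bitmap(range(200, 500))  # 300 items, 100 of them shared with a
n_a, n_b = estimate(a), estimate(b)
n_union = estimate(a | b)    # bits set in either filter
n_intersection = n_a + n_b - n_union
print(round(n_a), round(n_b), round(n_union), round(n_intersection))
```

The true sizes here are 300, 300, 500, and 100. The intersection estimate is the least reliable of the four because it combines the errors of the other three.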
Interesting properties
* Unlike a standard
hash table using
open addressing for
collision resolution, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure "filling up". However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point ''all'' queries yield a positive result. With open addressing hashing, false positives are never produced, but performance steadily deteriorates until it approaches linear search.
*
Union and
intersection of Bloom filters with the same size and set of hash functions can be implemented with
bitwise
OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersect operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets.
* Some kinds of
superimposed code can be seen as a Bloom filter implemented with physical
edge-notched cards. An example is
Zatocoding, invented by
Calvin Mooers in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category.
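The union and intersection properties in the list above can be checked on a toy filter. In this sketch a Python integer serves as the bit array and salted SHA-256 stands in for the hash functions. `M`, `K`, and `bitmap` are hypothetical names, and the parameters are arbitrary.

```python
import hashlib

M, K = 256, 3  # illustrative filter length and number of hash functions


def bitmap(items) -> int:
    """Bloom filter contents as a Python int used as an M-bit bitmap."""
    bits = 0
    for item in items:
        for seed in range(K):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return bits


a_items = {"ant", "bee", "cat"}
b_items = {"cat", "dog"}
a, b = bitmap(a_items), bitmap(b_items)

# Union: OR-ing the filters equals the filter built from the union of the sets.
union_is_lossless = (a | b) == bitmap(a_items | b_items)

# Intersection: AND-ing keeps every bit of the from-scratch intersection filter,
# but may keep extra bits as well, hence the weaker guarantee.
scratch = bitmap(a_items & b_items)
intersection_is_superset = ((a & b) & scratch) == scratch

print(union_is_lossless, intersection_is_superset)
```

Both checks hold by construction: OR sets exactly the bits set by any element of either set, while AND can retain a bit that was set by different elements in the two filters.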
Examples
*
Fruit flies use a modified version of Bloom filters to detect novelty of odors, with additional features including similarity of novel odor to that of previously experienced examples, and time elapsed since previous experience of the same odor.
*The servers of
Akamai Technologies, a
content delivery provider, use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.
* Google
Bigtable,
Apache HBase,
Apache Cassandra, and
PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
* The
Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).
* Microsoft
Bing (search engine) uses multi-level hierarchical Bloom filters for its search index,
BitFunnel. Bloom filters provided lower cost than the previous Bing index, which was based on
inverted files.
* The
Squid
Web Proxy
Cache uses Bloom filters for cache digests.
*
Bitcoin
used Bloom filters to speed up wallet synchronization until privacy vulnerabilities with the implementation of Bloom filters were discovered.
* The
Venti archival storage system uses Bloom filters to detect previously stored data.
* The
SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.
* The
Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called Bloom join in the database literature).
* The
Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.
*
Medium uses Bloom filters to avoid recommending articles a user has previously read.
*
Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
*
Grafana Tempo uses Bloom filters to improve query performance by storing a Bloom filter for each backend block. These are accessed on each query to determine the blocks containing data that meets the supplied search criteria.
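The cache-admission idea from the Akamai example (cache an object only on its second request) can be sketched as follows. This is a toy illustration under stated assumptions, not a description of any production system: the class name `SecondHitCache`, its parameters, the in-memory dictionary standing in for a disk cache, and the salted SHA-256 hashing are all hypothetical.

```python
import hashlib


class SecondHitCache:
    """Toy cache that admits an object only on its second request."""

    def __init__(self, m: int = 1 << 16, k: int = 4) -> None:
        self.m, self.k = m, k
        self.seen = 0    # Bloom filter bitmap of keys requested at least once
        self.cache = {}  # stands in for the (expensive) disk cache

    def _mask(self, key: str) -> int:
        mask = 0
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % self.m)
        return mask

    def get(self, key: str, fetch):
        if key in self.cache:
            return self.cache[key]
        body = fetch(key)
        mask = self._mask(key)
        if (self.seen & mask) == mask:
            # Seen before (or a false positive): admit to the cache.
            self.cache[key] = body
        else:
            # First request: remember the key but do not cache the object.
            self.seen |= mask
        return body


origin_calls = []
cache = SecondHitCache()
for _ in range(3):
    cache.get("/logo.png", lambda key: origin_calls.append(key) or key.upper())
print(len(origin_calls), "/logo.png" in cache.cache)
```

A false positive in the filter only causes an object to be cached one request early, so the filter's errors affect efficiency and never correctness.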
Alternatives
Classic Bloom filters use <math>1.44 \log_2(1/\varepsilon)</math> bits of space per inserted key, where <math>\varepsilon</math> is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only <math>\log_2(1/\varepsilon)</math> per key. Hence Bloom filters use 44% more space than an equivalent optimal data structure. Pagh et al. provide an optimal-space data structure. Moreover, their data structure has constant locality of reference independent of the false positive rate, unlike Bloom filters, where a lower false positive rate <math>\varepsilon</math> leads to a greater number of memory accesses per query, <math>\log(1/\varepsilon)</math>. Also, it allows elements to be deleted without a space penalty, unlike Bloom filters. The same improved properties of optimal space usage, constant locality of reference, and the ability to delete elements are also provided by the cuckoo filter, an open source implementation of which is available.
A probabilistic structure based on hash tables, hash compaction, has also been described, and it has been identified as significantly more accurate than a Bloom filter when each is configured optimally. Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time.
Some variants of Bloom filters that are either faster or use less space than classic Bloom filters have also been studied. The basic idea of the fast variant is to locate the ''k'' hash values associated with each key into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory cache misses. The proposed variants, however, have the drawback of using about 32% more space than classic Bloom filters.
The space efficient variant relies on using a single hash function that generates for each key a value in the range