A rolling hash (also known as recursive hashing or rolling checksum) is a
hash function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...
where the input is hashed in a window that moves through the input.
A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a
moving average
In statistics, a moving average (rolling average or running average) is a calculation to analyze data points by creating a series of averages of different subsets of the full data set. It is also called a moving mean (MM) or rolling mean and is ...
function can be computed much more quickly than other low-pass filters.
One of the main applications is the
Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the
rsync
rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like operatin ...
program, which uses a checksum based on Mark Adler's
adler-32 as its rolling hash. Low Bandwidth Network Filesystem (LBFS) uses a
Rabin fingerprint The Rabin fingerprinting scheme is a method for implementing fingerprints using polynomials over a finite field. It was proposed by Michael O. Rabin.
Scheme
Given an ''n''-bit message ''m''0,...,''m''n-1, we view it as a polynomial of degree ''n''- ...
as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.
At best, rolling hash values are
pairwise independent[Daniel Lemire, Owen Kaser: Recursive ''n''-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698–710, 2010. arXiv:0705.4676.] or strongly
universal
Universal is the adjective for universe.
Universal may also refer to:
Companies
* NBCUniversal, a media and entertainment company
** Universal Animation Studios, an American Animation studio, and a subsidiary of NBCUniversal
** Universal TV, a ...
. They cannot be
3-wise independent, for example.
Polynomial rolling hash
The
Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:
:
,
where
is a constant, and
are the input characters (but this function is not a
Rabin fingerprint The Rabin fingerprinting scheme is a method for implementing fingerprints using polynomials over a finite field. It was proposed by Michael O. Rabin.
Scheme
Given an ''n''-bit message ''m''0,...,''m''n-1, we view it as a polynomial of degree ''n''- ...
, see below).
In order to avoid manipulating huge
values, all math is done
modulo . The choice of
and
is critical to get good hashing; in particular, the modulus
is typically a prime number. See
linear congruential generator
A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generat ...
for more discussion.
Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum
by
. Shifting all characters by one position to the right requires dividing the entire sum
by
. Note that in modulo arithmetic,
can be chosen to have a
multiplicative inverse
In mathematics, a multiplicative inverse or reciprocal for a number ''x'', denoted by 1/''x'' or ''x''−1, is a number which when multiplied by ''x'' yields the multiplicative identity, 1. The multiplicative inverse of a fraction ''a''/''b ...
by which
can be multiplied to get the result of the division without actually performing a division.
Rabin fingerprint
The
Rabin fingerprint The Rabin fingerprinting scheme is a method for implementing fingerprints using polynomials over a finite field. It was proposed by Michael O. Rabin.
Scheme
Given an ''n''-bit message ''m''0,...,''m''n-1, we view it as a polynomial of degree ''n''- ...
is another hash, which also interprets the input as a polynomial, but over the
Galois field
In mathematics, a finite field or Galois field (so-named in honor of Évariste Galois) is a field that contains a finite number of elements. As with any field, a finite field is a set on which the operations of multiplication, addition, subt ...
GF(2)
(also denoted \mathbb F_2, or \mathbb Z/2\mathbb Z) is the finite field of two elements (GF is the initialism of ''Galois field'', another name for finite fields). Notations and \mathbb Z_2 may be encountered although they can be confused with ...
. Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to
CRC32). The hash is the result of the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.
Because it shares the same author as the Rabin–Karp string search algorithm, which is often explained with another, simpler rolling hash, and because this simpler rolling hash is also a polynomial, both rolling hashes are often mistaken for each other. The backup softwar
resticuses a Rabin fingerprint for splitting files, with blob size varying between and .
Cyclic polynomial
Hashing by cyclic polynomial—sometimes called Buzhash—is also simple, but it has the benefit of avoiding multiplications, using
barrel shifts instead. It is a form of
tabulation hashing: it presumes that there is some hash function
from characters to integers in the interval
. This hash function might be simply an array or a hash table mapping characters to random integers. Let the function
be a cyclic binary rotation (or
circular shift
In combinatorial mathematics, a circular shift is the operation of rearranging the entries in a tuple, either by moving the final entry to the first position, while shifting all other entries to the next position, or by performing the inverse op ...
): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g.,
. Let
be the bitwise
exclusive or
Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false).
It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...
. The hash values are defined as
:
where the multiplications by powers of two can be implemented by binary shifts. The result is a number in