In computer science, a family of hash functions is said to be ''k''-independent, ''k''-wise independent or ''k''-universal if selecting a function at random from the family guarantees that the hash codes of any designated ''k'' keys are independent random variables (see the precise mathematical definitions below). Such families allow good average-case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many ''k''-independent families have been proposed.


Background

The goal of hashing is usually to map keys from some large domain (universe) U into a smaller range, such as m bins (labelled [m] = \{0, \dots, m-1\}). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an independent uniform random choice in [m], the number of keys per bin could be analyzed using the Chernoff bound. A deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for ''rehashing'': sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function ''randomly'' from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of ''k''-independent hashing, introduced by Wegman and Carter in 1981, strengthens the guarantee of random behavior to families of ''k'' designated keys, and adds a guarantee on the uniform distribution of hash codes.


Definitions

The strictest definition, introduced by Wegman and Carter under the name "strongly universal_k hash family", is the following. A family of hash functions H = \{h : U \to [m]\} is k-independent if for any k distinct keys (x_1, \dots, x_k) \in U^k and any k hash codes (not necessarily distinct) (y_1, \dots, y_k) \in [m]^k, we have:

: \Pr_{h \in H} \left[ h(x_1)=y_1 \land \cdots \land h(x_k)=y_k \right] = m^{-k}

This definition is equivalent to the following two conditions:
# for any fixed x \in U, as h is drawn randomly from H, h(x) is uniformly distributed in [m];
# for any fixed, distinct keys x_1, \dots, x_k \in U, as h is drawn randomly from H, the hash codes h(x_1), \dots, h(x_k) are independent random variables.

Often it is inconvenient to achieve the perfect joint probability of m^{-k} due to rounding issues. One may instead define a (\mu, k)-independent family to satisfy:

: for all distinct (x_1, \dots, x_k) \in U^k and all (y_1, \dots, y_k) \in [m]^k, ~~\Pr_{h \in H} \left[ h(x_1)=y_1 \land \cdots \land h(x_k)=y_k \right] \le \mu / m^k

Observe that, even if \mu is close to 1, the h(x_i) are no longer independent random variables, which is often a problem in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a k-independent family, which allows black-box use of the independence properties.
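As a concrete illustration (an example of mine, not part of the article), the classic affine family h_{a,b}(x) = (ax + b) mod p over Z_p, with range m = p, is strongly 2-universal, and the defining equality Pr[h(x_1)=y_1 and h(x_2)=y_2] = 1/m^2 can be checked exhaustively for a small prime:

```python
from itertools import product

# H = { h_{a,b}(x) = (a*x + b) mod p : a, b in Z_p } over U = [m] = Z_p.
# 2-independence demands that for any distinct keys x1 != x2 and any
# hash codes y1, y2:  Pr[h(x1) = y1 and h(x2) = y2] = 1 / m^2.
p = 5  # small prime so the check can be exhaustive

def count_matches(x1, x2, y1, y2):
    """Number of (a, b) pairs with h(x1) = y1 and h(x2) = y2."""
    return sum(1 for a, b in product(range(p), repeat=2)
               if (a * x1 + b) % p == y1 and (a * x2 + b) % p == y2)

# |H| = p^2 functions, so probability 1/m^2 means exactly one (a, b) matches:
# the linear system a*x1 + b = y1, a*x2 + b = y2 has a unique solution mod p.
for x1, x2, y1, y2 in product(range(p), repeat=4):
    if x1 != x2:
        assert count_matches(x1, x2, y1, y2) == 1
print("family is 2-independent: each (y1, y2) pattern has probability 1/m^2")
```

The same exhaustive check with three keys would fail: the affine family is 2-independent but not 3-independent.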


Techniques


Polynomials with random coefficients

The original technique for constructing ''k''-independent hash functions, given by Carter and Wegman, was to select a large prime number p, choose k random numbers modulo p, and use these numbers as the coefficients of a polynomial of degree k − 1 whose values modulo p are used as the value of the hash function. All polynomials of the given degree modulo p are equally likely, and any polynomial is uniquely determined by any k-tuple of argument–value pairs with distinct arguments, from which it follows that any k-tuple of distinct arguments is equally likely to be mapped to any k-tuple of hash values. In general the polynomial can be evaluated in any finite field. Besides the fields modulo a prime, a popular choice is the field of size 2^n, which supports fast finite field arithmetic on modern computers. This was the approach taken by Daniel Lemire and Owen Kaser for CLHash.


Tabulation hashing

Tabulation hashing is a technique for mapping keys to hash values by partitioning each key into bytes, using each byte as the index into a table of random numbers (with a different table for each byte position), and combining the results of these table lookups by a bitwise exclusive or operation. Thus, it requires more randomness in its initialization than the polynomial method, but avoids possibly-slow multiplication operations. It is 3-independent but not 4-independent. Variations of tabulation hashing can achieve higher degrees of independence by performing table lookups based on overlapping combinations of bits from the input key, or by applying simple tabulation hashing iteratively.
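For 32-bit keys split into four bytes, simple tabulation can be sketched as follows (a minimal illustration, not a production implementation):

```python
import random

# Simple tabulation hashing: one table of 256 random 32-bit words per byte
# position; the hash is the XOR of the four table entries selected by the
# key's bytes. Initialization needs 4 * 256 random words, but each lookup
# avoids multiplications entirely.
BYTES = 4
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(BYTES)]

def tab_hash(key):
    h = 0
    for i in range(BYTES):
        byte = (key >> (8 * i)) & 0xFF  # extract the i-th byte of the key
        h ^= TABLES[i][byte]            # table lookup combined by XOR
    return h
```

Because each lookup touches a small fixed-size table, the tables typically stay resident in cache, which is a large part of the method's practical speed.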


Independence needed by different types of collision resolution

The notion of ''k''-independence can be used to differentiate between collision resolution strategies in hash tables, according to the level of independence required to guarantee constant expected time per operation.

For instance, hash chaining takes constant expected time even with a 2-independent family of hash functions, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve events concerning two keys, 2-independence is sufficient to ensure that the sum has the same value it would have for a truly random hash function.

Double hashing is another method that requires only a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both are 2-independent, this method gives constant expected time per operation.

On the other hand, linear probing, a simpler form of open addressing where the step size is always one, requires 5-independence. It can be guaranteed to work in constant expected time per operation with a 5-independent hash function, and there exist 4-independent hash functions for which it takes logarithmic time per operation.

For cuckoo hashing the required ''k''-independence is not known as of 2021. In 2009 it was shown that O(\log n)-independence suffices, and that at least 6-independence is needed. Another approach is to use tabulation hashing, which is not 6-independent, but was shown in 2012 to have other properties sufficient for cuckoo hashing. A third approach from 2014 is to slightly modify the cuckoo hash table with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.


Other applications

Daniel Kane, Jelani Nelson and David Woodruff improved the Flajolet–Martin algorithm for the distinct elements problem in 2010. To give an \varepsilon approximation to the correct answer, they need a \tfrac{\log(1/\varepsilon)}{\log\log(1/\varepsilon)}-independent hash function. The count sketch algorithm for dimensionality reduction requires two hash functions, one 2-independent and one 4-independent. The Karloff–Zwick algorithm for the MAX-3SAT problem can be implemented with 3-independent random variables. The MinHash algorithm can be implemented using a \log\tfrac{1}{\varepsilon}-independent hash function, as was proven by Piotr Indyk in 1999.Indyk, Piotr. "A small approximately min-wise independent family of hash functions." Journal of Algorithms 38.1 (2001): 84–90.


References


Further reading

* {{cite book |last1=Motwani |first1=Rajeev |last2=Raghavan |first2=Prabhakar |title=Randomized Algorithms |url=https://archive.org/details/randomizedalgori00motw_066 |url-access=limited |publisher=Cambridge University Press |year=1995 |isbn=978-0-521-47465-8 |page=221}}