Overview
In a hash table, a hash function takes a key as an input, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed-length, like an integer, or variable-length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them. A hash function may be considered to perform three functions: *Convert variable-length keys into fixed-length (usually machine-word-length or less) values, by folding them by words or other units using a parity-preserving operator like ADD or XOR, *Scramble the bits of the key so that the resulting values are uniformly distributed over the keyspace, and *Map the key values into ones less than or equal to the size of the table. A good hash function satisfies two basic properties: it should be very fast to compute, and it should minimize duplication of output values ( collisions). Hash functions rely on generating favorableHash tables
Hash functions are used in conjunction with hash tables to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code, which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, then some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or be added to the table in some other location by a specified procedure. That procedure depends on the structure of the hash table. In ''chained hashing'', each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain. Chains may be kept in random order and searched linearly, or in serial order, or as a self-ordering list by frequency to speed up access. In ''open address hashing'', the table is probed starting from the occupied slot in a specified manner, usually by linear probing, quadratic probing, or double hashing until an open slot is located or the entire table is probed (overflow). Searching for the item follows the same procedure until the item is located, an open slot is found, or the entire table has been searched (item not in table).Specialized uses
Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items. Hash functions are an essential ingredient of the Bloom filter, a space-efficient probabilisticProperties
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the sameTesting and measurement
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by theEfficiency
In data storage and retrieval applications, the use of a hash function is a trade-off between search time and data storage space. If search time were unbounded, then a very compact unordered linear list would be the best medium; if storage space were unbounded, then a randomly accessible structure indexable by the key-value would be very large and very sparse, but very fast. A hash function takes a finite amount of time to map a potentially large keyspace to a feasible amount of storage space searchable in a bounded amount of time regardless of the number of keys. In most applications, the hash function should be computable with minimum latency and secondarily in a minimum number of instructions. Computational complexity varies with the number of instructions required and latency of individual instructions, with the simplest being the bitwise methods (folding), followed by the multiplicative methods, and the most complex (slowest) are the division-based methods. Because collisions should be infrequent, and cause a marginal delay but are otherwise harmless, it is usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions. Division-based implementations can be of particular concern because a division requires multiple cycles on nearly all processor microarchitectures. Division (Universality
A ''universal hashing'' scheme is a randomized algorithm that selects a hash function among a family of such functions, in such a way that the probability of a collision of any two distinct keys is , where is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function.Applicability
A hash function that allows only certain table sizes or strings only up to a certain length, or cannot accept a seed (i.e. allow double hashing) is less useful than one that does. A hash function is applicable in a variety of situations. Particularly within cryptography, notable applications include: * Integrity checking: Identical hash values for different files imply equality, providing a reliable means to detect file modifications. * Key derivation: Minor input changes result in a random-looking output alteration, known as the diffusion property. Thus, hash functions are valuable for key derivation functions. * Message authentication codes (MACs): Through the integration of a confidential key with the input data, hash functions can generate MACs ensuring the genuineness of the data, such as in HMACs. * Password storage: The password's hash value does not expose any password details, emphasizing the importance of securely storing hashed passwords on the server. * Signatures: Message hashes are signed rather than the whole message.Deterministic
A hash procedure must be deterministic—for a given input value, it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, because the address may change during execution (as may happen on systems that use certain methods of garbage collection), although sometimes rehashing of the item is possible. The determinism is in the context of the reuse of the function. For example, Python adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts in addition to the input to be hashed. The Python hash ( SipHash) is still a valid hash function when used within a single run, but if the values are persisted (for example, written to disk), they can no longer be treated as valid hash values, since in the next run the random value might differ.Defined range
It is often desirable that the output of a hash function have fixed size (but see below). If, for example, the output is constrained to 32-bit integer values, then the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches. Producing fixed-length output from variable-length input can be accomplished by breaking the input data into chunks of specific size. Hash functions used for data searches use some arithmetic expression that iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.Variable range
In many applications, the range of hash values may be different for each run of the program or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters—the input data , and the number of allowed hash values. A common solution is to compute a fixed hash function with a very large range (say, to ), divide the result by , and use the division'sVariable range with minimal movement (dynamic hash function)
When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table. A hash function that will relocate the minimum number of records when the table is resized is desirable. What is needed is a hash function (where is the key being hashed and is the number of allowed hash values) such that with probability close to . Linear hashing and spiral hashing are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property. Extendible hashing uses a dynamic hash function that requires space proportional to to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to to compute the value of have been invented. A hash function with minimal movement is especially useful inData normalization
In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.Hashing integer data types
There are several common algorithms for hashing integers. The method giving the best distribution is data-dependent. One of the simplest and most common methods in practice is the modulo division method.Identity hash function
If the data to be hashed is small enough, then one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this '' identity'' hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value. The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, inInteger
and 32-bit floating-point Float
objects can simply use the value directly, whereas the 64-bit integer Long
and 64-bit floating-point Double
cannot.
Other types of data can also use this hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in Trivial hash function
If the keys are uniformly or sufficiently uniformly distributed over the key space, so that the key values are essentially random, then they may be considered to be already "hashed". In this case, any number of any bits in the key may be extracted and collated as an index into the hash table. For example, a simple hash function might mask off the least significant bits and use the result as an index into a hash table of size .Mid-squares
A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. For example, if the input is and the hash table size , then squaring the key produces , so the hash code is taken as the middle 4 digits of the 17-digit number (ignoring the high digit) 8750. The mid-squares method produces a reasonable hash code if there is not a lot of leading or trailing zeros in the key. This is a variant of multiplicative hashing, but not as good because an arbitrary key is not a good multiplier.Division hashing
A standard technique is to use a modulo function on the key, by selecting a divisor which is a prime number close to the table size, so . The table size is usually a power of 2. This gives a distribution from . This gives good results over a large number of key sets. A significant drawback of division hashing is that division requires multiple cycles on most modern architectures (including x86) and can be 10 times slower than multiplication. A second drawback is that it will not break up clustered keys. For example, the keys 123000, 456000, 789000, etc. modulo 1000 all map to the same address. This technique works well in practice because many key sets are sufficiently random already, and the probability that a key set will be cyclical by a large prime number is small.Algebraic coding
Algebraic coding is a variant of the division method of hashing which uses division by a polynomial modulo 2 instead of an integer to map bits to bits. In this approach, , and we postulate an th-degree polynomial . A key can be regarded as the polynomial . The remainder using polynomial arithmetic modulo 2 is . Then . If is constructed to have or fewer non-zero coefficients, then keys which share fewer than bits are guaranteed to not collide. is a function of , , and (the last of which is a divisor of ) and is constructed from theUnique permutation hashing
Unique permutation hashing has a guaranteed best worst-case insertion time.Multiplicative hashing
Standard multiplicative hashing uses the formula , which produces a hash value in . The value is an appropriately chosen value that should be relatively prime to ; it should be large, and its binary representation a random mix of 1s and 0s. An important practical special case occurs when and are powers of 2 and is the machine word size. In this case, this formula becomes . This is special because arithmetic modulo is done by default in low-level programming languages and integer division by a power of 2 is simply a right-shift, so, in C, for example, this function becomesFibonacci hashing
Zobrist hashing
Tabulation hashing, more generally known as ''Zobrist hashing'' after Albert Zobrist, is a method for constructing universal families of hash functions by combining table lookup with XOR operations. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys). Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64×12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation) and XORing them together (the starting value could be 0 (the identity value for XOR) or a random seed). The resulting value was reduced by modulo, folding, or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position. Later, the method was extended to hashing integers by representing each byte in each of 4 possible positions in the word by a unique 32-bit random number. Thus, a table of 28×4 random numbers is constructed. A 32-bit hashed integer is transcribed by successively indexing the table with the value of each byte of the plain text integer and XORing the loaded values together (again, the starting value can be the identity value or a random seed). The natural extension to 64-bit integers is by use of a table of 28×8 64-bit random numbers. This kind of function has some nice theoretical properties, one of which is called ''3-tuple independence'', meaning that every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values.Customized hash function
A hash function can be designed to exploit existing entropy in the keys. If the keys have leading or trailing zeros, or particular fields that are unused, always zero or some other constant, or generally vary little, then masking out only the volatile bits and hashing on those will provide a better and possibly faster hash function. Selected divisors or multipliers in the division and multiplicative schemes may make more uniform hash functions if the keys are cyclic or have other redundancies.Hashing variable-length data
When the data values are long (or variable-length) character strings—such as personal names, web page addresses, or mail messages—their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string—and depends on each character in a different way.Middle and ends
Simplistic hash functions may add the first and last characters of a string along with the length, or form a word-size hash from the middle 4 characters of a string. This saves iterating over the (potentially long) string, but hash functions that do not hash on all characters of a string can readily become linear due to redundancies, clustering, or other pathologies in the key set. Such strategies may be effective as a custom hash function if the structure of the keys is such that either the middle, ends, or other fields are zero or some other invariant constant that does not differentiate the keys; then the invariant parts of the keys can be ignored.Character folding
The paradigmatic example of folding by characters is to add up the integer values of all the characters in the string. A better idea is to multiply the hash total by a constant, typically a sizable prime number, before adding in the next character, ignoring overflow. Using exclusive-or instead of addition is also a plausible alternative. The final operation would be a modulo, mask, or other function to reduce the word value to an index the size of the table. The weakness of this procedure is that information may cluster in the upper or lower bits of the bytes; this clustering will remain in the hashed result and cause more collisions than a proper randomizing hash. ASCII byte codes, for example, have an upper bit of 0, and printable strings do not use the last byte code or most of the first 32 byte codes, so the information, which uses the remaining byte codes, is clustered in the remaining bits in an unobvious manner. The classic approach, dubbed the PJW hash based on the work of Peter J. Weinberger atWord length folding
Modern microprocessors will allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32-bit or 64-bit integers and hashing/accumulating these "wide word" integer values by means of arithmetic operations (e.g. multiplication by constant and bit-shifting). The final word, which may have unoccupied byte positions, is filled with zeros or a specified randomizing value before being folded into the hash. The accumulated hash code is reduced by a final modulo or other operation to yield an index into the table.Radix conversion hashing
Analogous to the way an ASCII or EBCDIC character string representing a decimal number is converted to a numeric quantity for computing, a variable-length string can be converted as . This is simply a polynomial in aRolling hash
In some applications, such as substring search, one can compute a hash function for every -character substring of a given -character string by advancing a window of width characters along the string, where is a fixed integer, and . The straightforward solution, which is to extract such a substring at every character position in the text and compute separately, requires a number of operations proportional to . However, with the proper choice of , one can use the technique of rolling hash to compute all those hashes with an effort proportional to where is the number of occurrences of the substring. The most familiar algorithm of this type is Rabin-Karp with best and average case performance and worst case (in all fairness, the worst case here is gravely pathological: both the text string and substring are composed of a repeated single character, such as ="AAAAAAAAAAA", and ="AAA"). The hash function used for the algorithm is usually the Rabin fingerprint, designed to avoid collisions in 8-bit character strings, but other suitable hash functions are also used.Fuzzy hash
Perceptual hash
Analysis
Worst case results for a hash function can be assessed two ways: theoretical and practical. The theoretical worst case is the probability that all keys map to a single slot. The practical worst case is the expected longest probe sequence (hash function + collision resolution method). This analysis considers uniform hashing, that is, any key will map to any particular slot with probability , a characteristic of universal hash functions. While Knuth worries about adversarial attack on real time systems, Gonnet has shown that the probability of such a case is "ridiculously small". His representation was that the probability of of keys mapping to a single slot is , where is the load factor, .History
The term ''hash'' offers a natural analogy with its non-technical meaning (to chop up or make a mess out of something), given how hash functions scramble their input data to derive their output. In his research for the precise origin of the term, Donald Knuth notes that, while Hans Peter Luhn ofSee also
Notes
References
External links