In computing, a hash table, also known as a hash map, is a data structure that implements an associative array or dictionary, an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', also called a ''hash code'', into an array of ''buckets'' or ''slots'', from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash ''collisions'', where the hash function generates the same index for more than one key. Such collisions are typically accommodated in some way. In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation (Charles E. Leiserson, ''Amortized Algorithms, Table Doubling, Potential Method'', Lecture 13, MIT 6.046J/18.410J Introduction to Algorithms, Fall 2005).

Hashing is an example of a space-time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.

In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

History

The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining. Open addressing was later proposed by A. D. Linh, building on Luhn's paper. Around the same time, Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler. Open addressing with linear probing is credited to Amdahl, although Ershov independently had the same idea. The term "open addressing" was coined by W. Wesley Peterson in his article discussing the problem of search in large files.

The first published work on hashing with chaining is credited to Arnold Dumey. A theoretical analysis of linear probing was submitted originally by Konheim and Weiss.

Overview

An associative array stores a set of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of unique keys. In the hash table implementation of associative arrays, an array $A$ of length $m$ is partially filled with $n$ elements, where $m \ge n$. A value $x$ gets stored at an index location $A[h(x)]$, where $h$ is a hash function and $h(x) < m$. Under reasonable assumptions, hash tables have better time complexity bounds on search, delete, and insert operations in comparison to self-balancing binary search trees.

Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.

Load factor

A ''load factor'' $\alpha$ is a critical statistic of a hash table, and is defined as follows:

$$\text{load factor}\ (\alpha) = \frac{n}{k},$$

where
* $n$ is the number of entries occupied in the hash table.
* $k$ is the number of buckets.

The performance of the hash table deteriorates in relation to the load factor $\alpha$. Therefore a hash table is resized or ''rehashed'' if the load factor $\alpha$ approaches 1. A table is also resized if the load factor drops below $\alpha_{\max}/4$, where $\alpha_{\max}$ is the maximum load factor the table tolerates before growing. Acceptable figures of load factor $\alpha$ include 0.6 and 0.75.
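The resizing rule described above can be sketched in a few lines of Python. This is only an illustration: the function names, the default maximum load factor of 0.75, and the `min_buckets` floor are assumptions made for the example, not part of any particular implementation.

```python
def load_factor(n_entries: int, n_buckets: int) -> float:
    """Load factor alpha = n / k (entries divided by buckets)."""
    return n_entries / n_buckets


def resize_decision(n_entries: int, n_buckets: int,
                    alpha_max: float = 0.75, min_buckets: int = 8) -> str:
    """Return 'grow', 'shrink' or 'keep' for a hypothetical table.

    alpha_max and min_buckets are illustrative assumptions.
    """
    alpha = load_factor(n_entries, n_buckets)
    if alpha > alpha_max:
        return "grow"      # too crowded: rehash into more buckets
    if alpha < alpha_max / 4 and n_buckets > min_buckets:
        return "shrink"    # too empty: rehash into fewer buckets
    return "keep"


print(resize_decision(13, 16))  # grow   (alpha = 0.8125)
print(resize_decision(8, 16))   # keep   (alpha = 0.5)
print(resize_decision(2, 16))   # shrink (alpha = 0.125)
```

Shrinking only below a quarter of the maximum, rather than just below the maximum, leaves a gap between the two thresholds so that a table does not oscillate between growing and shrinking.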

Hash function

A hash function $h : U \rightarrow \{0, \ldots, m-1\}$ maps the universe $U$ of keys to array indices or slots within the table, so that $h(x) \in \{0, \ldots, m-1\}$ for each $x \in S$, where $S \subseteq U$ is the set of keys being stored and the table size $m$ is typically much smaller than the size of $U$. The conventional implementations of hash functions are based on the ''integer universe assumption'' that all elements of the table stem from the universe $U = \{0, \ldots, u-1\}$, where the bit length of $u$ is confined within the word size of a computer architecture.

A perfect hash function $h$ is defined as an injective function such that each element $x$ in $S$ maps to a unique value in $\{0, \ldots, m-1\}$. A perfect hash function can be created if all the keys are known ahead of time.

Integer universe assumption

Hashing schemes used under the ''integer universe assumption'' include hashing by division, hashing by multiplication, universal hashing, dynamic perfect hashing, and static perfect hashing. Of these, hashing by division is the most commonly used.

Hashing by division

The scheme in hashing by division is as follows:

$$h(x) = M \bmod n,$$

where $M$ is the hash digest of $x \in S$ and $n$ is the size of the table.
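As a minimal sketch of the division scheme, the Python snippet below reduces an integer digest modulo the table size. The function name `hash_by_division` and the sample keys are illustrative assumptions; the keys are used directly as their own digests.

```python
def hash_by_division(digest: int, table_size: int) -> int:
    """Map an integer hash digest M to a slot index via M mod n."""
    return digest % table_size


# Keys that share a common factor with the table size pile up in few slots.
keys = [10, 20, 30, 40]
print([hash_by_division(k, 10) for k in keys])  # [0, 0, 0, 0]
print([hash_by_division(k, 11) for k in keys])  # [10, 9, 8, 7]
```

The second line of output hints at why a prime table size is often preferred with this scheme: keys with a regular pattern are less likely to share a factor with the modulus.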

Hashing by multiplication

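The scheme in hashing by multiplication is as follows:

$$h(x) = \lfloor n \bigl((M A) \bmod 1\bigr) \rfloor,$$

where $M$ is the hash digest of $x$, $n$ is the size of the table, and $A$ is a real-valued constant with $0 < A < 1$; the expression $(MA) \bmod 1$ denotes the fractional part of $MA$. An advantage of hashing by multiplication is that the choice of the table size $n$ is not critical. Although any value of $A$ produces a hash function, Donald Knuth suggests using the golden ratio, in the form $A = (\sqrt{5} - 1)/2 \approx 0.618$.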
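The multiplication scheme can be sketched as follows. This is a floating-point illustration only, assuming the constant $A = (\sqrt{5} - 1)/2$ (the fractional form of the golden ratio); the function name is hypothetical, and production implementations usually work with fixed-width integer arithmetic instead of floats.

```python
import math

# Assumed constant: fractional form of the golden ratio, about 0.618.
A = (math.sqrt(5) - 1) / 2


def hash_by_multiplication(digest: int, table_size: int, a: float = A) -> int:
    """Compute floor(n * ((M * A) mod 1)) for digest M and table size n."""
    fractional_part = (digest * a) % 1.0   # always in [0, 1)
    return math.floor(table_size * fractional_part)


# The result always falls in range(table_size), whatever the table size.
for size in (10, 16, 1000):
    indices = [hash_by_multiplication(k, size) for k in range(1, 200)]
    assert all(0 <= i < size for i in indices)
```

Because the fractional part is scaled by the table size, no divisibility relationship between the keys and the table size matters here, which is the sense in which the size is "not critical".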

Choosing a hash function

Uniform distribution of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.

For open addressing schemes, the hash function should also avoid ''clustering'', the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.

''K''-independent hashing offers a way to prove that a certain hash function does not have bad keysets for a given type of hash table. A number of ''K''-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since ''K''-independence can prove that a hash function works, one can then focus on finding the fastest possible such hash function.
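The empirical uniformity check mentioned above can be sketched by computing Pearson's chi-squared statistic over the bucket counts. The helper name `chi_squared_statistic` is an assumption for this example, and the snippet only computes the statistic; comparing it against a critical value for $m - 1$ degrees of freedom is left to a statistics library or table.

```python
from collections import Counter


def chi_squared_statistic(indices, table_size):
    """Pearson's chi-squared statistic of bucket counts against a uniform fit."""
    counts = Counter(indices)
    expected = len(indices) / table_size   # same expected count per bucket
    return sum((counts.get(bucket, 0) - expected) ** 2 / expected
               for bucket in range(table_size))


uniform = [k % 8 for k in range(800)]   # every bucket hit exactly 100 times
skewed = [0] * 800                      # every key lands in bucket 0
print(chi_squared_statistic(uniform, 8))  # 0.0
print(chi_squared_statistic(skewed, 8))   # 5600.0
```

A statistic near zero means the observed counts match a uniform distribution closely, while a large value signals that some buckets are heavily over-used.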

Collision resolution

A search algorithm that uses hashing consists of two parts. The first part is computing a hash function which transforms the search key into an array index. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case and is impossible to guarantee for data that is not known in advance. Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.

Separate chaining

In separate chaining, the process involves building a linked list with key–value pairs for each search array index. The collided items are chained together through a single linked list, which can be traversed to access the item with a unique search key. Collision resolution through chaining with linked lists is a common method of implementing hash tables. Let $T$ and $x$ be the hash table and the node respectively; the operations are as follows:

 Chained-Hash-Insert(''T'', ''k'')
   ''insert'' ''x'' ''at the head of linked list'' ''T''[''h''(''k'')]
 Chained-Hash-Search(''T'', ''k'')
   ''search for an element with key'' ''k'' ''in linked list'' ''T''[''h''(''k'')]
 Chained-Hash-Delete(''T'', ''k'')
   ''delete'' ''x'' ''from the linked list'' ''T''[''h''(''k'')]

If the elements are comparable either numerically or lexically, and are inserted into the list so as to maintain the total order, unsuccessful searches terminate faster.
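The three chained operations can be sketched in Python as shown below. The class and method names are illustrative assumptions, the table has a fixed number of buckets (no resizing), and Python's built-in `hash` stands in for the hash function $h$. Unlike the bare pseudocode, `insert` first scans the chain so that an existing key is updated rather than duplicated.

```python
class Node:
    """One link of a bucket's chain."""
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node


class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [None] * n_buckets   # each bucket is a chain head

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        i = self._index(key)
        node = self.buckets[i]
        while node is not None:             # update if the key already exists
            if node.key == key:
                node.value = value
                return
            node = node.next
        self.buckets[i] = Node(key, value, self.buckets[i])  # insert at head

    def search(self, key):
        node = self.buckets[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None

    def delete(self, key):
        i = self._index(key)
        prev, node = None, self.buckets[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.buckets[i] = node.next   # unlink the head
                else:
                    prev.next = node.next         # unlink an inner node
                return True
            prev, node = node, node.next
        return False


table = ChainedHashTable()
for k in (1, 9, 17):            # all three collide in bucket 1
    table.insert(k, f"value-{k}")
print(table.search(9))          # value-9
```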

Other data structures for separate chaining

If the keys are ordered, it could be efficient to use "self-organizing" concepts such as using a self-balancing binary search tree, through which the theoretical worst case could be brought down to $O(\log n)$, although it introduces additional complexities.

In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to be a guaranteed $O(1)$ in the worst case. In this technique, the buckets of $k$ entries are organized as perfect hash tables with $k^2$ slots, providing constant worst-case lookup time and low amortized time for insertion. A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.

Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.

Caching and locality of reference

The linked list of a separate chaining implementation may not be cache-conscious because of poor spatial locality (locality of reference): when the nodes of the linked list are scattered across memory, the list traversal during insert and search may entail CPU cache inefficiencies.

In cache-conscious variants, a dynamic array, found to be more cache-friendly, is used in the place where a linked list or self-balancing binary search tree is usually deployed for collision resolution through separate chaining, since the contiguous allocation pattern of the array could be exploited by hardware mechanisms such as cache prefetchers and the translation lookaside buffer, resulting in reduced access time and memory consumption.

Open addressing

Open addressing is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through probing. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some ''probe sequence'', until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search.

Well-known probe sequences include:
* Linear probing, in which the interval between probes is fixed (usually 1).
* Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.
* Double hashing, in which the interval between probes is computed by a secondary hash function.

The performance of open addressing may be slower compared to separate chaining, since the probe sequence lengthens as the load factor $\alpha$ approaches 1. Probing results in an infinite loop if the load factor reaches 1, that is, when the table is completely filled. The average cost of linear probing depends on the hash function's ability to distribute the elements uniformly throughout the table to avoid clustering, since the formation of clusters would result in increased search time.
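Open addressing with linear probing (interval 1) can be sketched as follows. The class name is hypothetical, the table has a fixed capacity, and two details are assumptions of this example rather than something the text prescribes: deleted entries are replaced by a tombstone marker so that later searches keep probing past them, and the probe loop is bounded by the table size so that a full table raises an error instead of looping forever.

```python
_EMPTY = object()     # slot never used
_DELETED = object()   # tombstone: slot was used, entry removed


class LinearProbingTable:
    def __init__(self, capacity=8):
        self.slots = [_EMPTY] * capacity

    def _probe(self, key):
        """Yield slot indices: the home slot, then each following slot once."""
        m = len(self.slots)
        start = hash(key) % m
        for step in range(m):
            yield (start + step) % m

    def insert(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                if first_free is None:
                    first_free = i
                break                           # key cannot be further along
            if slot is _DELETED:
                if first_free is None:
                    first_free = i              # remember a reusable slot
            elif slot[0] == key:
                self.slots[i] = (key, value)    # update existing key
                return
        if first_free is None:
            raise OverflowError("table is full")
        self.slots[first_free] = (key, value)

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                return None                     # unused slot: unsuccessful search
            if slot is not _DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                return False
            if slot is not _DELETED and slot[0] == key:
                self.slots[i] = _DELETED
                return True
        return False
```

Without the tombstone, deleting an entry in the middle of a run of occupied slots would make entries stored after it unreachable, because a search stops at the first unused slot.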

Caching and locality of reference

Since the slots are located in successive locations, linear probing could lead to better utilization of the CPU cache due to locality of reference, resulting in reduced memory latency.

Other collision resolution techniques based on open addressing

Coalesced hashing

Coalesced hashing is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table. The algorithm is ideally suited for fixed memory allocation. A collision in coalesced hashing is resolved by identifying the largest-indexed empty slot in the hash table and inserting the colliding value into that slot. The bucket is also linked to the inserted node's slot, which contains its colliding hash address.

Cuckoo hashing

Cuckoo hashing is a form of open addressing collision resolution technique which guarantees $O(1)$ worst-case lookup complexity and constant amortized time for insertions. Collisions are resolved by maintaining two hash tables, each having its own hash function: the occupant of a collided slot is replaced with the given item, and the displaced element is moved into the other hash table. The process continues until every key has its own spot in the empty buckets of the tables. If the procedure enters an infinite loop, which is detected by maintaining a threshold loop counter, both hash tables are rehashed with newer hash functions and the procedure continues.
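The displacement procedure can be sketched as below. All names are hypothetical. Several details are assumptions of this example: the two hash functions are derived by hashing a (seed, key) pair with Python's built-in `hash`, the kick threshold `MAX_KICKS` is arbitrary, and a rehash also doubles the table size in addition to choosing new functions so that the sketch is guaranteed to make progress.

```python
import random


class CuckooHashTable:
    MAX_KICKS = 32   # assumed threshold for detecting a displacement loop

    def __init__(self, capacity=8, seed=0):
        self._rng = random.Random(seed)
        self._capacity = capacity
        self._seeds = self._new_seeds()
        self._tables = [[None] * capacity for _ in range(2)]

    def _new_seeds(self):
        return (self._rng.getrandbits(32), self._rng.getrandbits(32))

    def _index(self, which, key):
        return hash((self._seeds[which], key)) % self._capacity

    def search(self, key):
        for which in (0, 1):                      # at most two probes
            entry = self._tables[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def _place(self, entry):
        """Insert by displacement; return the homeless entry on failure."""
        which = 0
        for _ in range(self.MAX_KICKS):
            i = self._index(which, entry[0])
            entry, self._tables[which][i] = self._tables[which][i], entry
            if entry is None:
                return None                       # found an empty slot
            which = 1 - which                     # evicted entry tries the other table
        return entry

    def _rehash(self):
        entries = [e for t in self._tables for e in t if e is not None]
        while True:
            self._capacity *= 2                   # assumption: also grow
            self._seeds = self._new_seeds()       # new hash functions
            self._tables = [[None] * self._capacity for _ in range(2)]
            if all(self._place(e) is None for e in entries):
                return

    def insert(self, key, value):
        for which in (0, 1):                      # update an existing key
            i = self._index(which, key)
            entry = self._tables[which][i]
            if entry is not None and entry[0] == key:
                self._tables[which][i] = (key, value)
                return
        entry = (key, value)
        while True:
            entry = self._place(entry)
            if entry is None:
                return
            self._rehash()                        # then retry the homeless entry
```

Lookup inspects exactly one slot in each table, which is where the constant worst-case lookup bound comes from; all the variable cost sits in insertion.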

Hopscotch hashing

Hopscotch hashing is an open addressing based algorithm which combines the elements of cuckoo hashing, linear probing and chaining through the notion of a ''neighbourhood'' of buckets: the subsequent buckets around any given occupied bucket, also called a "virtual" bucket. The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in concurrent settings, and is thus well suited for implementing a resizable concurrent hash table. The neighbourhood characteristic of hopscotch hashing guarantees that the cost of finding the desired item in any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to place an item into its neighbourhood, with a possible cost involved in displacing other items.

Each bucket within the hash table includes an additional "hop-information": an ''H''-bit bit array indicating the relative distance of the item which was originally hashed into the current virtual bucket within ''H''-1 entries. Let $k$ and $B_k$ be the key to be inserted and the bucket to which the key is hashed, respectively; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is preserved:
* If $B_k$ is empty, the element is inserted, and the leftmost bit of the bitmap is set to 1.
* If $B_k$ is not empty, linear probing is used to find an empty slot in the table, and the bitmap of the bucket is updated, followed by the insertion.
* If the empty slot is not within the range of the ''neighbourhood'', i.e. ''H''-1, subsequent swap and hop-info bit array manipulation of each bucket is performed in accordance with its neighbourhood invariant properties.

Robin Hood hashing

Robin Hood hashing is an open addressing based collision resolution algorithm; collisions are resolved by favouring the element that is farthest from its "home location", i.e. the bucket to which the item was hashed: an element with a longer ''probe sequence length'' (PSL) displaces one with a shorter PSL. Although Robin Hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items on the buckets, i.e. it deals with cluster formation in the hash table. Each node within a hash table that uses Robin Hood hashing should be augmented to store an extra PSL value. Let $x$ be the key to be inserted, $x.psl$ be the (incremental) PSL of $x$, $T$ be the hash table and $j$ be the index; the insertion procedure is as follows:
* If $x.psl \le T[j].psl$: the iteration goes into the next bucket without attempting an external probe.
* If $x.psl > T[j].psl$: insert the item $x$ into the bucket $j$; swap $x$ with $T[j]$, and let the displaced item be $x'$; continue the probe from the $(j+1)$-st bucket to insert $x'$; repeat the procedure until every element is inserted.
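The insertion rule above can be sketched in Python as follows. The class name and the slot layout (a small `[key, value, psl]` list per occupied slot) are assumptions of this example; the table has a fixed capacity, deletion is omitted, and the probe interval is 1.

```python
class RobinHoodTable:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity   # each slot: None or [key, value, psl]
        self.count = 0

    def insert(self, key, value):
        if self.count >= len(self.slots):
            raise OverflowError("table is full")
        m = len(self.slots)
        j = hash(key) % m                # home location
        entry = [key, value, 0]          # PSL starts at 0 in the home slot
        while True:
            resident = self.slots[j]
            if resident is None:
                self.slots[j] = entry
                self.count += 1
                return
            if resident[0] == entry[0]:
                resident[1] = entry[1]   # key already present: update value
                return
            if entry[2] > resident[2]:
                # The probing entry is farther from home: it takes the slot
                # and the displaced resident continues the probe.
                self.slots[j], entry = entry, resident
            j = (j + 1) % m
            entry[2] += 1

    def search(self, key):
        m = len(self.slots)
        j = hash(key) % m
        for psl in range(m):
            resident = self.slots[j]
            # A resident closer to home than our probe distance means the key
            # would have displaced it, so the key is absent.
            if resident is None or resident[2] < psl:
                return None
            if resident[0] == key:
                return resident[1]
            j = (j + 1) % m
        return None


table = RobinHoodTable(8)
for k in (0, 8, 1, 16):
    table.insert(k, f"v{k}")
print([(s[0], s[2]) for s in table.slots if s is not None])
# [(0, 0), (8, 1), (16, 2), (1, 2)]
```

In the example, key 16 arrives at slot 2 with a PSL of 2 and displaces key 1, whose PSL there was only 1; key 1 then moves one slot further. The early exit in `search` relies on the same ordering.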

Dynamic resizing

Repeated insertions cause the number of entries in a hash table to grow, which consequently increases the load factor; to maintain the amortized $O(1)$ performance of the lookup and insertion operations, a hash table is dynamically resized and the items of the table are ''rehashed'' into the buckets of the new hash table. The items cannot simply be copied over, because a different table size results in a different hash value due to the modulo operation. If a hash table becomes "too empty" after deleting some elements, resizing may be performed to avoid excessive memory usage.

Resizing by moving all entries

Generally, a new hash table with a size double that of the original hash table gets allocated privately and every item in the original hash table gets moved to the newly allocated one by computing the hash values of the items followed by the insertion operation. Rehashing is computationally expensive despite its simplicity.
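A minimal sketch of all-at-once rehashing, assuming a chained table represented as a list of Python lists of (key, value) pairs; the function name `rehash_all` is hypothetical.

```python
def rehash_all(buckets):
    """Move every item into a freshly allocated table of twice the size."""
    new_buckets = [[] for _ in range(2 * len(buckets))]
    for bucket in buckets:
        for key, value in bucket:
            # The index must be recomputed: it depends on the table size.
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets


old = [[], [(1, "a"), (5, "b"), (9, "c")], [], []]   # 4 buckets, all keys in bucket 1
new = rehash_all(old)
print(new[1], new[5])   # [(1, 'a'), (9, 'c')] [(5, 'b')]
```

Keys 1, 5 and 9 all map to index 1 modulo 4, but modulo 8 key 5 moves to index 5, which is why the items cannot simply be copied slot for slot.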

Alternatives to all-at-once rehashing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually, both to avoid a storage blip (typically at 50% of the new table's size) during rehashing and to avoid memory fragmentation that triggers heap compaction due to deallocation of the large memory blocks used by the old hash table. In such a case, the rehashing operation is done incrementally by extending the prior memory block allocated for the old hash table such that the buckets of the hash table remain unaltered.

A common approach for amortized rehashing involves maintaining two hash functions $h_\text{old}$ and $h_\text{new}$. The process of rehashing a bucket's items in accordance with the new hash function is termed ''cleaning'', which is implemented through the command pattern by encapsulating operations such as $\mathrm{Add}(\mathrm{key})$, $\mathrm{Get}(\mathrm{key})$ and $\mathrm{Delete}(\mathrm{key})$ in a $\mathrm{Lookup}(\mathrm{key}, \text{command})$ wrapper such that each element in the bucket gets rehashed. The procedure is as follows:
* Clean the $\mathrm{Table}[h_\text{old}(\mathrm{key})]$ bucket.
* Clean the $\mathrm{Table}[h_\text{new}(\mathrm{key})]$ bucket.
* The ''command'' gets executed.
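The clean-then-execute procedure can be sketched as follows. Everything here is an illustrative assumption: the class and function names, the use of chained buckets stored as Python lists, doubling as the growth policy, and a `dirty` set that records which old buckets have not been cleaned yet.

```python
class IncrementalTable:
    def __init__(self, size=4):
        self.table = [[] for _ in range(size)]
        self.old_size = size   # modulus used by h_old
        self.dirty = set()     # old buckets not yet cleaned

    def h_old(self, key):
        return hash(key) % self.old_size

    def h_new(self, key):
        return hash(key) % len(self.table)

    def start_resize(self):
        if self.dirty:
            raise RuntimeError("previous resize still in progress")
        self.old_size = len(self.table)
        self.dirty = set(range(self.old_size))
        # Extend the existing bucket array in place; old buckets stay put.
        self.table.extend([] for _ in range(self.old_size))

    def _clean(self, b):
        """Rehash the items of bucket b with h_new, once."""
        if b not in self.dirty:
            return
        items, self.table[b] = self.table[b], []
        for key, value in items:
            self.table[self.h_new(key)].append((key, value))
        self.dirty.discard(b)

    def lookup(self, key, command):
        self._clean(self.h_old(key))
        self._clean(self.h_new(key))
        return command(self.table[self.h_new(key)], key)


def add(value):
    """Build an Add command that stores `value` under the looked-up key."""
    def command(bucket, key):
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
    return command


def get(bucket, key):
    """Get command: return the value stored under key, or None."""
    for k, v in bucket:
        if k == key:
            return v
    return None
```

Each operation pays only for cleaning the one or two buckets it touches, so the cost of the resize is spread over subsequent operations instead of being paid in a single pause.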

Linear hashing

Linear hashing is an implementation of the hash table which enables the table to grow or shrink dynamically, one bucket at a time.

Performance

The performance of a hash table is dependent on the hash function's ability to generate quasi-random numbers ($\sigma$) for entries in the hash table, where $K$, $n$ and $h(x)$ denote the key, the number of buckets and the hash function, such that $\sigma = h(K)\ \%\ n$. If the hash function generates the same $\sigma$ for distinct keys ($K_1 \ne K_2,\ h(K_1) = h(K_2)$), this results in a ''collision'', which can be dealt with in several ways. The constant time complexity ($O(1)$) of operations in a hash table is presupposed on the condition that the hash function doesn't generate colliding indices; thus, the performance of the hash table depends directly on the chosen hash function's ability to disperse the indices. However, construction of such a hash function is practically infeasible; for that reason, implementations depend on case-specific collision resolution techniques to achieve higher performance.

Applications

Associative arrays

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays.

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm), although B-trees are more popular in these applications.

Caches

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.
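The overwrite-on-collision policy can be sketched as a fixed-size cache with one entry per slot. The class name `OverwritingCache` and its methods are hypothetical, and Python's built-in `hash` stands in for the hash function.

```python
class OverwritingCache:
    """Fixed-size cache: a colliding insert simply replaces the old entry."""

    def __init__(self, n_slots=8):
        self.slots = [None] * n_slots

    def put(self, key, value):
        # No collision resolution: the newest item wins the slot.
        self.slots[hash(key) % len(self.slots)] = (key, value)

    def get(self, key):
        entry = self.slots[hash(key) % len(self.slots)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None   # cache miss: fetch from the slower medium instead


cache = OverwritingCache(8)
cache.put(1, "one")
cache.put(9, "nine")     # 9 collides with 1 in slot 1 and evicts it
print(cache.get(1))      # None
print(cache.get(9))      # nine
```

Losing an entry is acceptable here because the cache is only an accelerator: a miss costs a slower lookup in the backing store, not a wrong answer.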

Sets

Hash tables can be used in the implementation of the set data structure, which can store unique values without any particular order; a set is typically used for testing the membership of a value in the collection, rather than for element retrieval.

Transposition table

A transposition table is a complex hash table which stores information about each section that has been searched.

Implementations

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules.

In JavaScript, every value except for 7 "primitive" data types is called an "object", which uses either integers, strings, or guaranteed-unique "symbol" primitive values as keys for a hash map. ECMAScript 6 also added Map and Set data structures. C++11 includes unordered_map in its standard library for storing keys and values of arbitrary types. The Java programming language includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections. Python's built-in dict implements a hash table in the form of a type. Ruby's built-in Hash uses the open addressing model from Ruby 2.4 onwards. The Rust programming language includes HashMap and HashSet as part of the Rust Standard Library.

External links

NIST entry on hash tables
Pat Morin
MIT's Introduction to Algorithms: Hashing 1, MIT OCW lecture video
MIT's Introduction to Algorithms: Hashing 2, MIT OCW lecture video
