
In computer science, a hash collision or hash clash is when two distinct pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function, which takes a data input and returns a fixed-length string of bits.
Although hash algorithms, especially cryptographic hash algorithms, have been created with the intent of being collision resistant, they can still sometimes map different data to the same hash (by virtue of the pigeonhole principle). Malicious users can take advantage of this to mimic, access, or alter data.
Due to the possible negative applications of hash collisions in data management and computer security (in particular, cryptographic hash functions), collision avoidance has become an important topic in computer security.
Background
Hash collisions can be unavoidable, depending on the number of objects in a set and whether the bit string they are mapped to is long enough. For a set of ''n'' objects, if ''n'' is greater than |''R''|, where ''R'' is the range of the hash value, the probability that there will be a hash collision is 1, meaning it is guaranteed to occur.
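The pigeonhole argument can be illustrated with a deliberately tiny hash. The sketch below is only an illustration: `tiny_hash` and `find_collision` are hypothetical helpers, and the 4-bit range (16 possible values) is an assumption chosen to keep the example small.

```python
def tiny_hash(data: bytes, bits: int = 4) -> int:
    """Toy hash (hypothetical): sum of the bytes, reduced to `bits` bits."""
    return sum(data) % (1 << bits)


def find_collision(items, hash_fn):
    """Return the first pair of distinct items with equal hash values, or None."""
    seen = {}
    for item in items:
        h = hash_fn(item)
        if h in seen and seen[h] != item:
            return seen[h], item
        seen[h] = item
    return None


# 17 distinct inputs but only 16 possible hash values:
# at least two inputs must share a value.
inputs = [bytes([i]) for i in range(17)]
pair = find_collision(inputs, tiny_hash)
```

With 17 inputs and a range of 16 values, `find_collision` cannot return `None`, whatever hash function is passed in.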
Another reason hash collisions are likely at some point in time stems from the idea of the birthday paradox in mathematics. This problem looks at the probability that, in a set of ''n'' randomly chosen people, at least two share the same birthday. This idea has led to what has been called the birthday attack. The premise of this attack is that it is difficult to find a birthday that matches your own or any other specific birthday, but the probability of finding ''any'' two people with matching birthdays is much higher. Bad actors can use this approach to make it simpler for them to find hash values that collide with any other hash value, rather than searching for a specific value.
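The birthday effect can be checked numerically. The following sketch assumes hash values (or birthdays) are drawn uniformly at random from a range of size `r`; the function name `collision_probability` is illustrative.

```python
def collision_probability(n: int, r: int) -> float:
    """Probability that at least two of n uniform random draws from r values coincide."""
    if n > r:
        return 1.0  # pigeonhole principle: a collision is guaranteed
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th draw must avoid the i values already taken.
        p_all_distinct *= (r - i) / r
    return 1.0 - p_all_distinct


p_birthday = collision_probability(23, 365)   # roughly 0.507
p_16bit = collision_probability(300, 2**16)   # a few hundred draws from 65,536 values
```

Under this assumption, 23 people already give a better-than-even chance of a shared birthday, and a few hundred random inputs are enough to make a collision in a 16-bit hash about as likely as not.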
The impact of collisions depends on the application. When hash functions and fingerprints are used to identify similar data, such as homologous DNA sequences or similar audio files, the functions are designed so as to ''maximize'' the probability of collision between distinct but similar data, using techniques like locality-sensitive hashing. Checksums, on the other hand, are designed to minimize the probability of collisions between similar inputs, without regard for collisions between very different inputs.
Instances where bad actors attempt to create or find hash collisions are known as collision attacks.
In practice, security-related applications use cryptographic hash algorithms, which are designed to be long enough for random matches to be unlikely, fast enough that they can be used anywhere, and safe enough that it would be extremely hard to find collisions.
Collision resolution
Because hash collisions are inevitable, hash tables have mechanisms for dealing with them, known as collision resolution. Two of the most common strategies are open addressing and separate chaining. Cache-conscious collision resolution is another strategy that has been discussed for string hash tables.
Open addressing
In this method, each cell in the hash table is assigned one of three states: occupied, empty, or deleted. If a hash collision occurs, the table is probed to move the record to an alternate cell that is marked as empty. Several types of probing can be used with this method, including linear probing, double hashing, and quadratic probing.
Open addressing is also known as closed hashing.
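As a minimal sketch of open addressing with linear probing, the class below tracks the three cell states described above. The class name, the sentinel objects, the fixed capacity, and the use of Python's built-in `hash` are all assumptions made for illustration; a practical table would also resize.

```python
EMPTY = object()    # cell has never held a record
DELETED = object()  # cell held a record that was later removed


class LinearProbingTable:
    def __init__(self, capacity: int = 8):
        self.keys = [EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        """Yield cell indices from the home cell onward, wrapping around once."""
        start = hash(key) % len(self.keys)
        for step in range(len(self.keys)):
            yield (start + step) % len(self.keys)

    def put(self, key, value):
        free = None  # first reusable cell seen along the probe sequence
        for i in self._probe(key):
            k = self.keys[i]
            if k is EMPTY:
                if free is None:
                    free = i
                break  # key cannot be stored beyond an empty cell
            if k is DELETED:
                if free is None:
                    free = i
            elif k == key:
                self.values[i] = value  # key already present: overwrite
                return
        if free is None:
            raise RuntimeError("table is full")
        self.keys[free] = key
        self.values[free] = value

    def get(self, key, default=None):
        for i in self._probe(key):
            k = self.keys[i]
            if k is EMPTY:
                return default
            if k is not DELETED and k == key:
                return self.values[i]
        return default

    def delete(self, key) -> bool:
        for i in self._probe(key):
            k = self.keys[i]
            if k is EMPTY:
                return False
            if k is not DELETED and k == key:
                self.keys[i] = DELETED  # keep the probe chain intact
                self.values[i] = None
                return True
        return False
```

The deleted state matters because lookups stop at the first empty cell: if a removed record's cell were simply marked empty, records stored further along the same probe sequence would become unreachable.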
Separate chaining
This strategy allows more than one record to be "chained" to the cells of a hash table. If two records are directed to the same cell, both go into that cell as a linked list. This resolves the collision efficiently, since records with the same hash value can share a cell, but it has its disadvantages. Keeping track of so many lists adds overhead, and operations can become very slow when the lists grow long.
Separate chaining is also known as open hashing.
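A minimal sketch of separate chaining follows, with each cell holding the head of a singly linked list. The names `Node` and `ChainedTable`, the fixed capacity, and the use of Python's built-in `hash` are assumptions for illustration only.

```python
class Node:
    """One record in a cell's linked list."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedTable:
    def __init__(self, capacity: int = 8):
        self.cells = [None] * capacity  # each cell is the head of a chain

    def _index(self, key) -> int:
        return hash(key) % len(self.cells)

    def put(self, key, value):
        i = self._index(key)
        node = self.cells[i]
        while node is not None:
            if node.key == key:
                node.value = value  # key already present: overwrite
                return
            node = node.next
        # Colliding records are prepended to the chain in the same cell.
        self.cells[i] = Node(key, value, self.cells[i])

    def get(self, key, default=None):
        node = self.cells[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return default

    def chain_length(self, i: int) -> int:
        count, node = 0, self.cells[i]
        while node is not None:
            count += 1
            node = node.next
        return count
```

Lookup cost grows with the length of the chain in the target cell, which is why long chains slow the table down.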
Cache-conscious collision resolution
Although much less used than the previous two, the cache-conscious collision resolution method was proposed in 2005.
It is similar in idea to separate chaining, although it does not technically involve chained lists. Instead of chained lists, the records that hash to the same cell are stored in a contiguous list of items. This is better suited to string hash tables; its usefulness for numeric values is still unknown.
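The contiguous layout can be sketched roughly as follows. This is not the published method's actual data layout, only an illustration of the idea: each cell is a single byte buffer holding length-prefixed strings back to back. The class name, the one-byte length prefix, and the toy byte-sum hash are assumptions, and Python will not show the cache behaviour that motivates the technique.

```python
class ContiguousBucketTable:
    """String set where each cell stores its strings in one contiguous buffer."""

    def __init__(self, slots: int = 8):
        self.slots = [bytearray() for _ in range(slots)]

    def _slot(self, data: bytes) -> bytearray:
        # Toy hash (assumption): byte sum modulo the number of cells.
        return self.slots[sum(data) % len(self.slots)]

    def _contains(self, data: bytes) -> bool:
        slot = self._slot(data)
        pos = 0
        while pos < len(slot):
            length = slot[pos]  # one-byte length prefix
            if slot[pos + 1 : pos + 1 + length] == data:
                return True
            pos += 1 + length   # skip to the next stored string
        return False

    def add(self, s: str) -> bool:
        """Store s; return False if it was already present."""
        data = s.encode("utf-8")
        if len(data) > 255:
            raise ValueError("string too long for a one-byte length prefix")
        if self._contains(data):
            return False
        slot = self._slot(data)
        slot.append(len(data))
        slot.extend(data)
        return True

    def __contains__(self, s: str) -> bool:
        return self._contains(s.encode("utf-8"))
```

Colliding strings such as "ab" and "ba" (equal byte sums under the toy hash) end up adjacent in one buffer, so a lookup scans consecutive memory rather than following pointers from node to node.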
See also
* List of hash functions