File verification is the process of using an algorithm for verifying the integrity of a computer file, usually by checksum. This can be done by comparing two files bit-by-bit, but that requires two copies of the same file and may miss systematic corruptions that occur in both copies. A more popular approach is to generate a hash of the copied file and compare it to the hash of the original file.
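
Both comparisons are easy to sketch in Python. The following minimal example (the two file names are hypothetical) uses the standard library's filecmp module for the bit-by-bit comparison and hashlib for the hash-based one:

    import filecmp
    import hashlib

    def sha256_of(path, chunk_size=65536):
        """Hash a file in chunks so large files need not fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                digest.update(chunk)
        return digest.hexdigest()

    # Bit-by-bit comparison: requires both copies to be present locally.
    same_bytes = filecmp.cmp("original.iso", "copy.iso", shallow=False)

    # Hash comparison: only the original's digest needs to be available.
    same_hash = sha256_of("copy.iso") == sha256_of("original.iso")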


Integrity verification

File integrity can be compromised, usually referred to as the file becoming corrupted. A file can become corrupted in a variety of ways: faulty storage media, errors in transmission, write errors during copying or moving, software bugs, and so on. Hash-based verification ensures that a file has not been corrupted by comparing the file's hash value to a previously calculated value. If these values match, the file is presumed to be unmodified. Due to the nature of hash functions, hash collisions may result in false positives, but the likelihood of collisions is often negligible with random corruption.
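
As a minimal sketch (the file name and the recorded digest are placeholders), hash-based integrity verification in Python amounts to recomputing the digest and comparing it with the stored one:

    import hashlib

    # Digest recorded earlier, when the file was known to be good
    # (placeholder value).
    EXPECTED = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

    digest = hashlib.sha256()
    with open("backup.tar", "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)

    if digest.hexdigest() == EXPECTED:
        print("file is presumed unmodified")
    else:
        print("file is corrupted, or the recorded digest is wrong")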


Authenticity verification

It is often desirable to verify that a file has not been modified in transmission or storage by untrusted parties, for example through modifications that insert malicious code such as viruses or backdoors. To verify authenticity, a classical hash function is not enough, as such functions are not designed to be collision resistant; it is computationally trivial for an attacker to construct a different file with the same hash as the original, so a malicious change to the file is not detected by the hash comparison. In cryptography, this is called a second preimage attack. For this purpose, cryptographic hash functions, which are designed to resist such attacks, are often employed. As long as the hash sums themselves cannot be tampered with, for example because they are communicated over a secure channel, the files can be presumed to be intact. Alternatively, digital signatures can be employed to assure tamper resistance.
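
The hash-based variant can be sketched in Python as follows (the file name and the published digest are hypothetical; the assumption is that the digest was obtained over a secure channel such as an authenticated HTTPS page). The standard library's hmac.compare_digest compares in constant time, so timing does not reveal where the strings first differ:

    import hashlib
    import hmac

    # Digest published by the distributor, obtained over a secure
    # channel (hypothetical value).
    PUBLISHED = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

    with open("installer.exe", "rb") as f:
        digest = hashlib.file_digest(f, "sha256")  # Python 3.11+

    # Constant-time comparison of the hex digests.
    authentic = hmac.compare_digest(digest.hexdigest(), PUBLISHED)
    print("authentic" if authentic else "MISMATCH: file or digest altered")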


File formats

A checksum file is a small file that contains the checksums of other files. There are a few well-known checksum file formats. Several utilities, such as md5deep, can use such checksum files to automatically verify an entire directory of files in one operation. The particular hash algorithm used is often indicated by the file extension of the checksum file. The ".sha1" file extension indicates a checksum file containing 160-bit SHA-1 hashes in sha1sum format. The ".md5" file extension, or a file named "MD5SUMS", indicates a checksum file containing 128-bit MD5 hashes in md5sum format. The ".sfv" file extension indicates a checksum file containing 32-bit CRC32 checksums in simple file verification format. The "crc.list" file indicates a checksum file containing 32-bit CRC checksums in brik format.

As of 2012, best practice recommendations are to use SHA-2 or SHA-3 to generate new file integrity digests, and to accept MD5 and SHA-1 digests for backward compatibility only if stronger digests are not available. The theoretically weaker SHA-1, the weaker MD5, and the much weaker CRC were previously in common use for file integrity checks (Garfinkel, Spafford & Schwartz, "Practical UNIX and Internet Security", p. 630).
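
A verifier for this family of formats fits in a few lines. The sketch below (file names are hypothetical) assumes the common "<hex digest>  <filename>" line layout used by sha1sum and md5sum, where an optional "*" before the name marks binary mode:

    import hashlib

    def verify_checksum_file(listing_path, algorithm="sha1"):
        """Check every '<digest>  <filename>' line and report the result."""
        with open(listing_path, encoding="utf-8") as listing:
            for line in listing:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                expected, name = line.split(None, 1)
                name = name.lstrip("*")  # '*' marks binary mode
                h = hashlib.new(algorithm)
                with open(name, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        h.update(chunk)
                status = "OK" if h.hexdigest() == expected.lower() else "FAILED"
                print(f"{name}: {status}")

    verify_checksum_file("release.sha1")
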
CRC checksums cannot be used to verify the authenticity of files, as CRC32 is not a collision-resistant hash function; even if the checksum file is not tampered with, it is computationally trivial for an attacker to replace a file with a different one having the same CRC digest as the original, meaning that a malicious change to the file is not detected by a CRC comparison.
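
Because the CRC32 output space is only 32 bits, a birthday search over random inputs is expected to find a collision after roughly 2^16 attempts. The toy demonstration below almost always finds two distinct inputs with the same CRC within 300,000 tries (forcing the CRC of a chosen target file is also possible because CRC is linear, but that construction is not shown here):

    import zlib
    from random import Random

    rng = Random(0)  # fixed seed so the run is reproducible
    seen = {}
    for _ in range(300_000):
        data = rng.getrandbits(128).to_bytes(16, "big")
        crc = zlib.crc32(data)
        if crc in seen and seen[crc] != data:
            print(f"collision: CRC {crc:#010x} is shared by "
                  f"{seen[crc].hex()} and {data.hex()}")
            break
        seen[crc] = data
    else:
        print("no collision in this sample; try more inputs")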


References

* Simson Garfinkel, Gene Spafford, Alan Schwartz. "Practical UNIX and Internet Security". p. 630.

See also

* Checksum
* Data deduplication