HOME

TheInfoList



OR:

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item (remove, as a computer file) to a much shorter bit string, its ''fingerprint'', that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes. This fingerprint may be used for data deduplication purposes. This is also referred to as ''file fingerprinting'', ''data fingerprinting'', or ''structured data fingerprinting''. Fingerprints are typically used to avoid the comparison and transmission of bulky data. For instance, a remove, web browser or proxy server can efficiently check whether a remote, by fetching only its fingerprint and comparing it with that of the previously fetched copy. Fingerprint functions may be seen as high-performance hash functions used to uniquely identify substantial blocks of data where cryptographic functions may be. Special algorithms exist for audio and video fingerprinting.


Properties


Virtual uniqueness

To serve its intended purposes, a fingerprinting algorithm must be able to capture the identity of a file with virtual certainty. In other words, the probability of a
collision In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great for ...
— two files yielding the same fingerprint — must be negligible, compared to the probability of other unavoidable causes of fatal errors (such as the system being destroyed by
war War is an armed conflict between the armed forces of states, or between governmental forces and armed groups that are organized under a certain command structure and have the capacity to sustain military operations, or between such organi ...
or by a
meteorite A meteorite is a rock (geology), rock that originated in outer space and has fallen to the surface of a planet or Natural satellite, moon. When the original object enters the atmosphere, various factors such as friction, pressure, and chemical ...
): say, 10−20 or less. This requirement is somewhat similar to that of a
checksum A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify dat ...
function, but is much more stringent. To detect accidental data corruption or transmission errors, it is sufficient that the checksums of the original file and any corrupted version will differ with near certainty, given some statistical model for the errors. In typical situations, this goal is easily achieved with 16- or 32-bit checksums. In contrast, file fingerprints need to be at least
64-bit In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit central processing units (CPU) and arithmetic logic units (ALU) are those that are based on processor registers, a ...
long to guarantee virtual uniqueness in large file systems (see
birthday attack A birthday attack is a bruteforce collision attack that exploits the mathematics behind the birthday problem in probability theory. This attack can be used to abuse communication between two or more parties. The attack depends on the higher likeli ...
). When proving the above requirement, one must take into account that files are generated by highly non-random processes that create complicated dependencies among files. For instance, in a typical business network, one usually finds many pairs or clusters of documents that differ only by minor edits or other slight modifications. A good fingerprinting algorithm must ensure that such "natural" processes generate distinct fingerprints, with the desired level of certainty.


Compounding

Computer files are often combined in various ways, such as concatenation (as in
archive file In computing, an archive file stores the content of one or more files, possibly compressed, with associated metadata such as file name, directory structure, error detection and correction information, commentary, compressed data archives, sto ...
s) or symbolic inclusion (as with the
C preprocessor The C preprocessor (CPP) is a text file processor that is used with C, C++ and other programming tools. The preprocessor provides for file inclusion (often header files), macro expansion, conditional compilation, and line control. Although ...
's directive). Some fingerprinting algorithms allow the fingerprint of a composite file to be computed from the fingerprints of its constituent parts. This "compounding" property may be useful in some applications, such as detecting when a program needs to be recompiled.


Algorithms


Rabin's algorithm

Rabin's fingerprinting algorithm is the prototype of the class. It is fast and easy to implement, allows compounding, and comes with a mathematically precise analysis of the probability of collision. Namely, the probability of two strings ''r'' and ''s'' yielding the same ''w''-bit fingerprint does not exceed max(, ''r'', ,, ''s'', )/2''w''-1, where , ''r'', denotes the length of ''r'' in bits. The algorithm requires the previous choice of a ''w''-bit internal "key", and this guarantee holds as long as the strings ''r'' and ''s'' are chosen without knowledge of the key. Rabin's method is not secure against malicious attacks. An adversarial agent can easily discover the key and use it to modify files without changing their fingerprint.


Cryptographic hash functions

Mainstream
cryptographic Cryptography, or cryptology (from "hidden, secret"; and ''graphein'', "to write", or '' -logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adversarial behavior. More gen ...
grade hash functions generally can serve as high-quality fingerprint functions, are subject to intense scrutiny from cryptanalysts, and have the advantage that they are believed to be safe against malicious attacks. A drawback of cryptographic hash algorithms such as
MD5 The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4, and was specified in 1992 as Request for Comments, RFC 1321. MD5 ...
and SHA is that they take considerably longer to execute than Rabin's fingerprint algorithm. They also lack proven guarantees on the collision probability. Some of these algorithms, notably
MD5 The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4, and was specified in 1992 as Request for Comments, RFC 1321. MD5 ...
, are no longer recommended for secure fingerprinting. They are still useful for error checking, where purposeful data tampering is not a primary concern.


Perceptual hashing


Application examples

NIST The National Institute of Standards and Technology (NIST) is an agency of the United States Department of Commerce whose mission is to promote American innovation and industrial competitiveness. NIST's activities are organized into physical s ...
distributes a software reference library, the American
National Software Reference Library The National Software Reference Library (NSRL), is a project of the National Institute of Standards and Technology (NIST) which maintains a repository of known software, file profiles and file signatures for use by law enforcement and other organiz ...
, that uses cryptographic hash functions to fingerprint files and map them to software products. The
HashKeeper HashKeeper is a database application of value primarily to those conducting forensic examinations of computers on a somewhat regular basis. Overview HashKeeper uses the MD5 file signature algorithm to establish unique numeric identifiers (has ...
database, maintained by the
National Drug Intelligence Center The United States National Drug Intelligence Center (NDIC), established in 1993, was a component of the U.S. Department of Justice and a member of the Intelligence Community. ThGeneral Counterdrug Intelligence Plan implemented in February 2000, ...
, is a repository of fingerprints of "known to be good" and "known to be bad" computer files, for use in law enforcement applications (e.g. analyzing the contents of seized disk drives).


Content similarity detection


See also

*
Acoustic fingerprint An acoustic fingerprint is a condensed digital summary, a digital fingerprint, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in a music database. Practical u ...
ing *
Automatic content recognition Automatic content recognition (ACR) is a technology used to identify content played on a media device or presented within a media file. Devices with ACR can allow for the collection of content consumption information automatically at the screen or ...
*
Canvas fingerprinting Canvas fingerprinting is one of a number of Device fingerprint, browser fingerprinting techniques for Web tracking, tracking online users that allow websites to identify and track visitors using the HTML5 canvas element instead of browser cookies o ...
*
Digital video fingerprinting Video fingerprinting or video hashing are a class of dimension reduction techniques in which a system identifies, extracts and then summarizes characteristic components of a video as a unique or a set of multiple '' perceptual hashes'' or ''finger ...
*
TCP/IP stack fingerprinting TCP/IP stack fingerprinting is the remote detection of the characteristics of a TCP/IP stack implementation. The combination of parameters may then be used to infer the remote machine's operating system (aka, OS fingerprinting), or incorporated ...
*
Device fingerprint A device fingerprint or machine fingerprint is information collected about the software and hardware of a remote computing device for the purpose of identification. The information is usually assimilated into a brief identifier using a fingerprint ...
*
Machine Identification Code Printer tracking dots, also known as printer steganography, DocuColor tracking dots, yellow dots, secret dots, or a machine identification code (MIC), is a digital watermarking, digital watermark which many color laser printing, laser printers a ...
*
Error correcting code In computing, telecommunication, information theory, and coding theory, forward error correction (FEC) or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels. The centra ...
*
Public key fingerprint In public-key cryptography, a public key fingerprint is a short sequence of bytes used to identify a longer public key. Fingerprints are created by applying a cryptographic hash function to a public key. Since fingerprints are shorter than the k ...
*
Randomizing function Random number generation is a process by which, often by means of a random number generator (RNG), a sequence of numbers or symbols is generated that cannot be reasonably predicted better than by random chance. This means that the particular ou ...
*
Usage share of web browsers The usage share of web browsers is the portion, often expressed as a percentage, of visitors to a group of web sites that use a particular web browser. Accuracy Measuring browser usage in the number of requests (page hits) made by each user a ...


References

{{Reflist Identifiers Fingerprinting algorithms