In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, a fingerprinting algorithm is a procedure that
maps an
arbitrarily large
data
In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
item (such as a computer
file) to a much shorter
bit
The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represented a ...
string, its fingerprint, that uniquely identifies the original data for all practical purposes
[A. Z. Broder. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993] just as human
fingerprint
A fingerprint is an impression left by the friction ridges of a human finger. The recovery of partial fingerprints from a crime scene is an important method of forensic science. Moisture and grease on a finger result in fingerprints on surfa ...
s uniquely identify people for practical purposes. This fingerprint may be used for
data deduplication purposes. This is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting.
Fingerprints are typically used to avoid the comparison and transmission of bulky data. For instance, a
web browser
A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used on ...
or
proxy server
In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource.
Instead of connecting directly to a server that can fulfill a requ ...
can efficiently check whether a remote file has been modified, by fetching only its fingerprint and comparing it with that of the previously fetched copy.
[Detecting duplicate and near-duplicate files. US Patent 6658423 Issued on December 2, 2003][ Brin, S. and Davis, J. and Garcia-Molina, H. (1995) ]
Copy Detection Mechanisms for Digital Documents
'' In: ACM International Conference on Management of Data (SIGMOD 1995), May 22–25, 1995, San Jose, California, fro
stanford.edu
Archived
18/08/2016. Retrieved 11/01/2019.[L. Fan, P. Cao, J. Almeida and A. Broder, Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol, IEEE/ACM Transactions on Networking, vol. 8, No. 3 (2000)][U. Manber, Finding Similar Files in a Large File System. Proceedings of the USENIX Winter Technical Conf. (1994)]
Fingerprint functions may be seen as high-performance
hash function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...
s used to uniquely identify substantial blocks of data where
cryptographic hash function
A cryptographic hash function (CHF) is a hash algorithm (a map of an arbitrary binary string to a binary string with fixed size of n bits) that has special properties desirable for cryptography:
* the probability of a particular n-bit output ...
s may be unnecessary.
Audio fingerprint
An acoustic fingerprint is a condensed digital summary, a fingerprint, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database.
Practical uses of ac ...
algorithms should not be confused with this type of fingerprint function.
Properties
Virtual uniqueness
To serve its intended purposes, a fingerprinting algorithm must be able to capture the identity of a file with virtual certainty. In other words, the probability of a
collision
In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...
— two files yielding the same fingerprint — must be negligible, compared to the probability of other unavoidable causes of fatal errors (such as the system being destroyed by
war or by a
meteorite
A meteorite is a solid piece of debris from an object, such as a comet, asteroid, or meteoroid, that originates in outer space and survives its passage through the atmosphere to reach the surface of a planet or moon. When the original object en ...
): say, 10
−20 or less.
This requirement is somewhat similar to that of a
checksum
A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify dat ...
function, but is much more stringent. To detect accidental data corruption or transmission errors, it is sufficient that the checksums of the original file and any corrupted version will differ with near certainty, given some statistical model for the errors. In typical situations, this goal is easily achieved with 16- or 32-bit checksums. In contrast, file fingerprints need to be at least
64-bit
In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit CPUs and ALUs are those that are based on processor registers, address buses, or data buses of that size. A comp ...
long to guarantee virtual uniqueness in large file systems (see
birthday attack).
When proving the above requirement, one must take into account that files are generated by highly non-random processes that create complicated dependencies among files. For instance, in a typical business network, one usually finds many pairs or clusters of documents that differ only by minor edits or other slight modifications. A good fingerprinting algorithm must ensure that such "natural" processes generate distinct fingerprints, with the desired level of certainty.
Compounding
Computer files are often combined in various ways, such as concatenation (as in
archive file
In computing, an archive file is a computer file that is composed of one or more files along with metadata. Archive files are used to collect multiple data files together into a single file for easier portability and storage, or simply to compre ...
s) or symbolic inclusion (as with the
C preprocessor's directive). Some fingerprinting algorithms allow the fingerprint of a composite file to be computed from the fingerprints of its constituent parts. This "compounding" property may be useful in some applications, such as detecting when a program needs to be recompiled.
Algorithms
Rabin's algorithm
Rabin's fingerprinting algorithm[M. O. Rabin Fingerprinting by random polynomials. Center for Research in Computing Technology Harvard University Report TR-15-81 (1981)] is the prototype of the class. It is fast and easy to implement, allows compounding, and comes with a mathematically precise analysis of the probability of collision. Namely, the probability of two strings ''r'' and ''s'' yielding the same ''w''-bit fingerprint does not exceed max(, ''r'', ,, ''s'', )/2
''w''-1, where , ''r'', denotes the length of ''r'' in bits. The algorithm requires the previous choice of a ''w''-bit internal "key", and this guarantee holds as long as the strings ''r'' and ''s'' are chosen without knowledge of the key.
Rabin's method is not secure against malicious attacks. An adversarial agent can easily discover the key and use it to modify files without changing their fingerprint.
Cryptographic hash functions
Mainstream
cryptographic
Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...
grade hash functions generally can serve as high-quality fingerprint functions, are subject to intense scrutiny from
cryptanalysts, and have the advantage that they are believed to be safe against malicious attacks.
A drawback of cryptographic hash algorithms such as
MD5 and
SHA is that they take considerably longer to execute than Rabin's fingerprint algorithm. They also lack proven guarantees on the collision probability. Some of these algorithms, notably
MD5, are no longer recommended for secure fingerprinting. They are still useful for error checking, where purposeful data tampering isn't a primary concern.
Fingerprinting and watermarking for relational databases
Fingerprinting and
digital watermarking for relational databases emerged as candidate solutions to provide
copyright protection,
tamper detection,
traitor tracing, and maintaining integrity of relational data. Many techniques have been proposed in the literature to address these purposes. A survey of the current
state-of-the-art
The state of the art (sometimes cutting edge or leading edge) refers to the highest level of general development, as of a device, technique, or scientific field achieved at a particular time. However, in some contexts it can also refer to a level ...
and a classification of the different approaches according to their intent, the way they express the fingerprint/watermark, the cover type, the granularity level, and their verifiability, is available.
Application examples
NIST
The National Institute of Standards and Technology (NIST) is an agency of the United States Department of Commerce whose mission is to promote American innovation and industrial competitiveness. NIST's activities are organized into physical sc ...
distributes a software reference library, the American
National Software Reference Library
The National Software Reference Library (NSRL), is a project of the National Institute of Standards and Technology (NIST) which maintains a repository of known software, file profiles and file signatures for use by law enforcement and other organiz ...
, that uses cryptographic hash functions to fingerprint files and map them to software products. The
HashKeeper
HashKeeper is a database application of value primarily to those conducting forensic examinations of computers on a somewhat regular basis.
Overview
HashKeeper uses the MD5 file signature algorithm to establish unique numeric identifiers ( ...
database, maintained by the
National Drug Intelligence Center
The United States National Drug Intelligence Center (NDIC), established in 1993, was a component of the U.S. Department of Justice and a member of the Intelligence Community. ThGeneral Counterdrug Intelligence Plan implemented in February 2000, ...
, is a repository of fingerprints of "known to be good" and "known to be bad" computer files, for use in law enforcement applications (e.g. analyzing the contents of seized disk drives).
See also
*
Acoustic fingerprint
An acoustic fingerprint is a condensed digital summary, a fingerprint, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database.
Practical uses of ...
ing
*
Automatic content recognition
*
Canvas fingerprinting
*
Digital video fingerprinting
Video fingerprinting or video hashing are a class of dimension reduction techniques in which a system identifies, extracts, and then summarizes characteristic components of a video as a unique or a set of multiple perceptual hashes, enabling that ...
*
TCP/IP stack fingerprinting
*
Device fingerprint
A device fingerprint or machine fingerprint is information collected about the software and hardware of a remote computing device for the purpose of identification. The information is usually assimilated into a brief identifier using a fingerprint ...
*
Machine Identification Code
*
Error correcting code
In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea is ...
*
Public key fingerprint
In public-key cryptography, a public key fingerprint is a short sequence of bytes used to identify a longer public key. Fingerprints are created by applying a cryptographic hash function to a public key. Since fingerprints are shorter than the ...
*
Randomizing function
Random number generation is a process by which, often by means of a random number generator (RNG), a sequence of numbers or symbols that cannot be reasonably predicted better than by random chance is generated. This means that the particular ou ...
*
Usage share of web browsers
References
{{Reflist
Identifiers
Fingerprinting algorithms