HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a fingerprinting algorithm is a procedure that
maps A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes. Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Although ...
an arbitrarily large
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
item (such as a computer file) to a much shorter
bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
string, its fingerprint, that uniquely identifies the original data for all practical purposesA. Z. Broder. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993 just as human
fingerprint A fingerprint is an impression left by the friction ridges of a human finger. The recovery of partial fingerprints from a crime scene is an important method of forensic science. Moisture and grease on a finger result in fingerprints on surfac ...
s uniquely identify people for practical purposes. This fingerprint may be used for
data deduplication In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amou ...
purposes. This is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting. Fingerprints are typically used to avoid the comparison and transmission of bulky data. For instance, a
web browser A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used o ...
or
proxy server In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource. Instead of connecting directly to a server that can fulfill a reques ...
can efficiently check whether a remote file has been modified, by fetching only its fingerprint and comparing it with that of the previously fetched copy.Detecting duplicate and near-duplicate files. US Patent 6658423 Issued on December 2, 2003 Brin, S. and Davis, J. and Garcia-Molina, H. (1995)
Copy Detection Mechanisms for Digital Documents
'' In: ACM International Conference on Management of Data (SIGMOD 1995), May 22–25, 1995, San Jose, California, fro
stanford.eduArchived
18/08/2016. Retrieved 11/01/2019.
L. Fan, P. Cao, J. Almeida and A. Broder, Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol, IEEE/ACM Transactions on Networking, vol. 8, No. 3 (2000)U. Manber, Finding Similar Files in a Large File System. Proceedings of the USENIX Winter Technical Conf. (1994) Fingerprint functions may be seen as high-performance
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...
s used to uniquely identify substantial blocks of data where
cryptographic hash function A cryptographic hash function (CHF) is a hash algorithm (a map of an arbitrary binary string to a binary string with fixed size of n bits) that has special properties desirable for cryptography: * the probability of a particular n-bit output ...
s may be unnecessary. Audio fingerprint algorithms should not be confused with this type of fingerprint function.


Properties


Virtual uniqueness

To serve its intended purposes, a fingerprinting algorithm must be able to capture the identity of a file with virtual certainty. In other words, the probability of a
collision In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...
— two files yielding the same fingerprint — must be negligible, compared to the probability of other unavoidable causes of fatal errors (such as the system being destroyed by
war War is an intense armed conflict between states, governments, societies, or paramilitary groups such as mercenaries, insurgents, and militias. It is generally characterized by extreme violence, destruction, and mortality, using regular o ...
or by a
meteorite A meteorite is a solid piece of debris from an object, such as a comet, asteroid, or meteoroid, that originates in outer space and survives its passage through the atmosphere to reach the surface of a planet or moon. When the original object ...
): say, 10−20 or less. This requirement is somewhat similar to that of a
checksum A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data ...
function, but is much more stringent. To detect accidental data corruption or transmission errors, it is sufficient that the checksums of the original file and any corrupted version will differ with near certainty, given some statistical model for the errors. In typical situations, this goal is easily achieved with 16- or 32-bit checksums. In contrast, file fingerprints need to be at least
64-bit In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit CPUs and ALUs are those that are based on processor registers, address buses, or data buses of that size. A ...
long to guarantee virtual uniqueness in large file systems (see
birthday attack A birthday attack is a type of cryptographic attack that exploits the mathematics behind the birthday problem in probability theory. This attack can be used to abuse communication between two or more parties. The attack depends on the higher likel ...
). When proving the above requirement, one must take into account that files are generated by highly non-random processes that create complicated dependencies among files. For instance, in a typical business network, one usually finds many pairs or clusters of documents that differ only by minor edits or other slight modifications. A good fingerprinting algorithm must ensure that such "natural" processes generate distinct fingerprints, with the desired level of certainty.


Compounding

Computer files are often combined in various ways, such as concatenation (as in archive files) or symbolic inclusion (as with the
C preprocessor The C preprocessor is the macro preprocessor for the C, Objective-C and C++ computer programming languages. The preprocessor provides the ability for the inclusion of header files, macro expansions, conditional compilation, and line control ...
's directive). Some fingerprinting algorithms allow the fingerprint of a composite file to be computed from the fingerprints of its constituent parts. This "compounding" property may be useful in some applications, such as detecting when a program needs to be recompiled.


Algorithms


Rabin's algorithm

Rabin's fingerprinting algorithmM. O. Rabin Fingerprinting by random polynomials. Center for Research in Computing Technology Harvard University Report TR-15-81 (1981) is the prototype of the class. It is fast and easy to implement, allows compounding, and comes with a mathematically precise analysis of the probability of collision. Namely, the probability of two strings ''r'' and ''s'' yielding the same ''w''-bit fingerprint does not exceed max(, ''r'', ,, ''s'', )/2''w''-1, where , ''r'', denotes the length of ''r'' in bits. The algorithm requires the previous choice of a ''w''-bit internal "key", and this guarantee holds as long as the strings ''r'' and ''s'' are chosen without knowledge of the key. Rabin's method is not secure against malicious attacks. An adversarial agent can easily discover the key and use it to modify files without changing their fingerprint.


Cryptographic hash functions

Mainstream
cryptographic Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or '' -logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adv ...
grade hash functions generally can serve as high-quality fingerprint functions, are subject to intense scrutiny from
cryptanalysts Cryptanalysis (from the Greek ''kryptós'', "hidden", and ''analýein'', "to analyze") refers to the process of analyzing information systems in order to understand hidden aspects of the systems. Cryptanalysis is used to breach cryptographic sec ...
, and have the advantage that they are believed to be safe against malicious attacks. A drawback of cryptographic hash algorithms such as MD5 and SHA is that they take considerably longer to execute than Rabin's fingerprint algorithm. They also lack proven guarantees on the collision probability. Some of these algorithms, notably MD5, are no longer recommended for secure fingerprinting. They are still useful for error checking, where purposeful data tampering isn't a primary concern.


Fingerprinting and watermarking for relational databases

Fingerprinting and
digital watermarking A digital watermark is a kind of marker covertly embedded in a noise-tolerant signal such as audio, video or image data. It is typically used to identify ownership of the copyright of such signal. "Watermarking" is the process of hiding digital in ...
for relational databases emerged as candidate solutions to provide
copyright protection A copyright is a type of intellectual property that gives its owner the exclusive right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time. The creative work may be in a literary, artistic, educatio ...
, tamper detection,
traitor tracing Traitor tracing schemes help trace the source of leaks when secret or proprietary data is sold to many customers. In a traitor tracing scheme, each customer is given a different personal decryption key. (Traitor tracing schemes are often combined ...
, and maintaining integrity of relational data. Many techniques have been proposed in the literature to address these purposes. A survey of the current
state-of-the-art The state of the art (sometimes cutting edge or leading edge) refers to the highest level of general development, as of a device, technique, or scientific field achieved at a particular time. However, in some contexts it can also refer to a level ...
and a classification of the different approaches according to their intent, the way they express the fingerprint/watermark, the cover type, the granularity level, and their verifiability, is available.


Application examples

NIST The National Institute of Standards and Technology (NIST) is an agency of the United States Department of Commerce whose mission is to promote American innovation and industrial competitiveness. NIST's activities are organized into physical sci ...
distributes a software reference library, the American
National Software Reference Library The National Software Reference Library (NSRL), is a project of the National Institute of Standards and Technology (NIST) which maintains a repository of known software, file profiles and file signatures for use by law enforcement and other organiz ...
, that uses cryptographic hash functions to fingerprint files and map them to software products. The
HashKeeper HashKeeper is a database application of value primarily to those conducting forensic examinations of computers on a somewhat regular basis. Overview HashKeeper uses the MD5 file signature algorithm to establish unique numeric identifiers (has ...
database, maintained by the National Drug Intelligence Center, is a repository of fingerprints of "known to be good" and "known to be bad" computer files, for use in law enforcement applications (e.g. analyzing the contents of seized disk drives).


See also

*
Acoustic fingerprint An acoustic fingerprint is a condensed digital summary, a fingerprint, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database. Practical uses of aco ...
ing * Automatic content recognition *
Canvas fingerprinting Canvas fingerprinting is one of a number of browser fingerprinting techniques for tracking online users that allow websites to identify and track visitors using the HTML5 canvas element instead of browser cookies or other similar means. The techni ...
* Digital video fingerprinting *
TCP/IP stack fingerprinting TCP/IP stack fingerprinting is the remote detection of the characteristics of a TCP/IP stack implementation. The combination of parameters may then be used to infer the remote machine's operating system (aka, OS fingerprinting), or incorporated ...
*
Device fingerprint A device fingerprint or machine fingerprint is information collected about the software and hardware of a remote computing device for the purpose of identification. The information is usually assimilated into a brief identifier using a fingerprinti ...
*
Machine Identification Code A Machine Identification Code (MIC), also known as printer steganography, yellow dots, tracking dots or secret dots, is a digital watermark which certain color laser printers and copiers leave on every printed page, allowing identification of th ...
* Error correcting code * Public key fingerprint *
Randomizing function Random number generation is a process by which, often by means of a random number generator (RNG), a sequence of numbers or symbols that cannot be reasonably predicted better than by random chance is generated. This means that the particular outc ...
*
Usage share of web browsers The usage share of web browsers is the portion, often expressed as a percentage, of visitors to a group of web sites that use a particular web browser. Accuracy Measuring browser usage in the number of requests (page hits) made by each use ...


References

{{Reflist Identifiers Fingerprinting algorithms