
Editing documents, program code, or any other data always risks introducing errors. By displaying the differences between two or more sets of data, file comparison tools can make computing simpler and more efficient, focusing attention on new data and ignoring what did not change. File comparison is generically known as a diff, after the Unix diff utility, and there is a range of ways to compare data sources and display the results. Some widely used file comparison programs are diff, cmp, FileMerge, WinMerge, Beyond Compare, and File Compare. Because understanding changes is important to writers of code or documents, many text editors and word processors include the functionality necessary to see the changes between different versions of a file or document.


Method types

The most efficient method of finding differences depends on the source data and the nature of the changes. One approach is to find the longest common subsequence between two files, then regard the non-common data as an insertion or a deletion. In 1978, Paul Heckel published an algorithm that identifies most moved blocks of text. This is used in the IBM History Flow tool. Other file comparison programs find block moves. Some specialized file comparison tools find the longest increasing subsequence between two files. The rsync protocol uses a rolling hash function to compare two files on two distant computers with low communication overhead. File comparison in word processors is typically at the word level, while comparison in most programming tools is at the line level. Byte-level or character-level comparison is useful in some specialized applications.
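The longest common subsequence approach can be illustrated with a minimal sketch. It assumes both files are already split into lists of lines; the function names `lcs_table` and `line_diff` are hypothetical and are not taken from any particular tool. Production tools generally use more memory-efficient algorithms than this quadratic table.

```python
def lcs_table(a, b):
    """Build a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(old, new):
    """Return (tag, line) pairs: ' ' for common, '-' for deleted, '+' for inserted."""
    table = lcs_table(old, new)
    i = j = 0
    result = []
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            # The line belongs to the common subsequence.
            result.append((" ", old[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Skipping the old line keeps the LCS at least as long: a deletion.
            result.append(("-", old[i]))
            i += 1
        else:
            # Otherwise the new line is treated as an insertion.
            result.append(("+", new[j]))
            j += 1
    # Whatever remains on either side is non-common data.
    result.extend(("-", line) for line in old[i:])
    result.extend(("+", line) for line in new[j:])
    return result


old_lines = ["alpha", "beta", "gamma", "delta"]
new_lines = ["alpha", "gamma", "delta", "epsilon"]
diff_result = line_diff(old_lines, new_lines)
for tag, line in diff_result:
    print(tag + " " + line)
```

In this example the common subsequence is alpha, gamma, delta, so beta is reported as a deletion and epsilon as an insertion.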


Display

The optimal way to display the results of a file comparison depends on many factors, including the type of source data. The fixed lines of programming code provide a clear unit of comparison. This does not work with documents, where adding a single word may cause the following lines to wrap differently without changing their content. The most popular ways to display changes are either a side-by-side view or a consolidated view that highlights insertions and deletions. In either view, code folding or text folding may be used for the sake of efficiency: the interface hides portions of the file that did not change and shows only the changes.
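A consolidated view that hides unchanged regions can be sketched with the `unified_diff` function from Python's standard `difflib` module. The sample data and the labels `old.txt` and `new.txt` are assumptions made for illustration; the `n=1` argument keeps one line of context around each change and omits the rest.

```python
import difflib

# Twenty numbered lines, with a single line edited in the new version.
old_text = ["line %d\n" % n for n in range(1, 21)]
new_text = list(old_text)
new_text[9] = "line 10 (edited)\n"

# n=1 keeps one unchanged line on each side of a change and hides the rest.
unified = list(
    difflib.unified_diff(old_text, new_text, fromfile="old.txt", tofile="new.txt", n=1)
)
print("".join(unified), end="")
```

Only the edited line and its immediate neighbours appear in the output, marked with `-` for the deleted text and `+` for the inserted text, while the other unchanged lines are left out.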


Reasoning

There are various reasons to use comparison tools, and the tools themselves use different approaches. To compare binary files, a tool may use byte-level comparison. To compare text files or computer programs, many tools use a side-by-side visual comparison. This gives the user the chance to choose which changes to keep or reject before merging the files into a new version. Alternatively, both versions may be kept as-is for later reference through some form of version control. File comparison is an important and integral part of file synchronization and backup. In backup methodologies, the issue of data corruption is important. There is rarely a warning before corruption occurs, which can make recovery difficult or impossible. Often, the problem is only apparent the next time someone tries to open a file. In this circumstance, a comparison tool can help to isolate the introduction of the problem.
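Byte-level comparison can be sketched as follows. The helper `first_difference` is a hypothetical name, and the sample data simulates corruption by flipping the bits of one byte in a copy of the original; a real tool would read the two files from disk instead.

```python
def first_difference(data_a, data_b):
    """Return the offset of the first differing byte, or None if the inputs are identical."""
    for offset, (x, y) in enumerate(zip(data_a, data_b)):
        if x != y:
            return offset
    if len(data_a) != len(data_b):
        # One input is a prefix of the other; they differ where the shorter one ends.
        return min(len(data_a), len(data_b))
    return None


original = bytes(range(16))
damaged = bytearray(original)
damaged[5] ^= 0xFF              # simulate corruption of the byte at offset 5
corrupted = bytes(damaged)

print(first_difference(original, corrupted))
```

Comparing a suspect file against a known-good backup copy in this way reports where the two first diverge, which can help narrow down when and where the damage was introduced.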


Historical uses

Prior to file comparison, machines existed to compare magnetic tapes or punch cards. The IBM 519 Card Reproducer could determine whether a deck of punched cards was equivalent to another. In 1957, John Van Gardner developed a system to compare the checksums of loaded sections of Fortran programs to debug compilation problems on the IBM 704.

