computational linguistics Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics ...

and

computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...

, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in

natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...

, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In

bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T. Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term ''Levenshtein distance'' is often used interchangeably with ''edit distance''.

Types of edit distance

Different types of edit distance allow different sets of string operations. For instance: * The Levenshtein distance allows deletion, insertion and substitution. * The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution. * The

Hamming distance In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chan ...

allows only substitution, hence, it only applies to strings of the same length. * The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters. * The

Jaro distance Jaro may refer to: *Jaro, Iloilo City, a district of Iloilo City, Philippines * Jaro, Indonesia, a subdistrict in Tabalong Regency, South Kalimantan *Jaro, Leyte, a municipality in the province of Leyte, Philippines * Jaro Medien (Jaro Media), a G ...

allows only transposition. Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA

sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Al ...

algorithms such as the

Smith–Waterman algorithm The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorit ...

, which make an operation's cost depend on where it is applied.

Formal definition and properties

Given two strings and on an alphabet (e.g. the set of

ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...

characters, the set of

byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...

s ..255 etc.), the edit distance is the minimum-weight series of edit operations that transforms into . One of the simplest sets of edit operations is that defined by Levenshtein in 1966: :Insertion of a single symbol. If = , then inserting the symbol produces . This can also be denoted ε→, using ε to denote the empty string. :Deletion of a single symbol changes to (→ε). :Substitution of a single symbol for a symbol ≠ changes to (→). In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum ''number'' of operations required to transform to . A more general definition associates non-negative weight functions _ins(), _del() and _sub(, ) with the operations. Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes into . For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice versa. Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost. Similarly, by only allowing substitutions (again at unit cost),

is obtained; this must be restricted to equal-length strings.

Jaro–Winkler distance In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro). ...

can be obtained from an edit distance where only transpositions are allowed.

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is: # kitten → sitten (substitute "s" for "k") # sitten → sittin (substitute "i" for "e") # sittin → sitting (insert "g" at the end) LCS distance (insertions and deletions only) gives a different distance and minimal edit script: # kitten → itten (delete "k" at 0) # itten → sitten (insert "s" at 0) # sitten → sittn (delete "e" at 4) # sittn → sittin (insert "i" at 4) # sittin → sitting (insert "g" at 6) for a total cost/distance of 5 operations.

Properties

Edit distance with non-negative cost satisfies the axioms of a

metric Metric or metrical may refer to: * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics In mathe ...

, giving rise to a

metric space In mathematics, a metric space is a set together with a notion of '' distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general setti ...

of strings, when the following conditions are met: * Every edit operation has positive cost; * for every operation, there is an inverse operation with equal cost. With these properties, the metric axioms are satisfied as follows: :(, ) = 0 if and only if a=b, since each string can be trivially transformed to itself using exactly zero operations. :(, ) > 0 when ≠ , since this would require at least one operation at non-zero cost. :(, ) = (, ) by equality of the cost of each operation and its inverse. :Triangle inequality: (, ) ≤ (, ) + (, ). Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature. Other useful properties of unit-cost edit distances include: * LCS distance is bounded above by the sum of lengths of a pair of strings. * LCS distance is an upper bound on Levenshtein distance. * For strings of the same length, Hamming distance is an upper bound on Levenshtein distance. Regardless of cost/weights, the following property holds of all edit distances: * When and share a common prefix, this prefix has no effect on the distance. Formally, when = and = , then (, ) = (, ). This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time.

Computation

The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964.

Common algorithm

Using Levenshtein's original operations, the (nonsymmetric) edit distance from

a = a_1\ldots a_m

b = b_1\ldots b_n

is given by

d_

, defined by the recurrence :

\begind_ &= \sum_^ w_\mathrm(a_), & & \quad  \text\; 1 \leq i \leq m \\
d_ &= \sum_^ w_\mathrm(b_), & & \quad \text\; 1 \leq j \leq n \\
d_ &= \begin d_ & \text\; a_ = b_\\ \min \begin d_ + w_\mathrm(a_)\\ d_ + w_\mathrm(b_) \\ d_ + w_\mathrm(a_, b_) \end & \text\; a_ \neq b_\end & & \quad  \text\; 1 \leq i \leq m, 1 \leq j \leq n.\end

This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization. The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a

dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. ...

algorithm that is commonly credited to Wagner and Fischer, although it has a history of multiple invention. After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at

d_

. This algorithm has a

time complexity In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by t ...

of Θ() where and are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also ; this can be improved to by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations. A linear-space solution to this problem is offered by Hirschberg's algorithm. A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.

Improved algorithms

Improving on the Wagner–Fisher algorithm described above, Ukkonen describes several variants, one of which takes two strings and a maximum edit distance , and returns . It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time , where and are the lengths of the strings. Space complexity is or , depending on whether the edit sequence needs to be read off. Further improvements by

Landau Landau ( pfl, Landach), officially Landau in der Pfalz, is an autonomous (''kreisfrei'') town surrounded by the Südliche Weinstraße ("Southern Wine Route") district of southern Rhineland-Palatinate, Germany. It is a university town (since 1990 ...

, Myers, and Schmid

give an time algorithm. For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is of Masek and Paterson having worst case runtime of O(nm/logn).

Applications

Edit distance finds applications in

computational biology Computational biology refers to the use of data analysis, mathematical modeling and Computer simulation, computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the ...

and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and

approximate string matching In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching ...

, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected. Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems. * Hirschberg's algorithm computes the optimal

alignment Alignment may refer to: Archaeology * Alignment (archaeology), a co-linear arrangement of features or structures with external landmarks * Stone alignment, a linear arrangement of upright, parallel megalithic standing stones Biology * Struc ...

of two strings, where optimality is defined as minimizing edit distance. *

Approximate string matching In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching ...

can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string , called the pattern, and a constant ; it then builds a deterministic finite state automaton that finds, in an arbitrary string , a substring whose edit distance to is at most (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the

bitap algorithm The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given patter ...

, also defined in terms of edit distance. * Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.

Language edit distance

A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a

formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of sym ...

. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and ''any'' string taken from a set of strings. More formally, for any language ''L'' and string ''x'' over an alphabet , the ''language edit distance'' d(''L'', ''x'') is given by

d(L,x) = \min_ d(x,y)

, where

d(x,y)

is the string edit distance. When the language ''L'' is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance. For less expressive families of grammars, such as the

regular grammar In theoretical computer science and formal language theory, a regular grammar is a grammar that is ''right-regular'' or ''left-regular''. While their exact definition varies from textbook to textbook, they all require that * all production rule ...

s, faster algorithms exist for computing the edit distance. Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.

References

{{Strings String metrics