Overview
The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

* insertion: ''cot'' → ''coat''
* deletion: ''coat'' → ''cot''
* substitution: ''coat'' → ''cost''

These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:

* insertion: ''co*t'' → ''coat''
* deletion: ''coat'' → ''co*t''
* substitution: ''coat'' → ''cost''

Some approximate matchers also treat ''transposition'', in which the positions of two letters in the string are swapped, as a primitive operation:

* transposition: ''cost'' → ''cots''

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is ''coil'', then ''foil'' differs by one substitution, ''coils'' by one insertion, ''oil'' by one deletion, and ''foal'' by two substitutions. If all operations count as a single unit of cost and the limit is set to one, ''foil'', ''coils'', and ''oil'' will count as matches while ''foal'' will not.

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers permit separate assignments of limits and weights to individual groups in the pattern.

Problem formulation and algorithms
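The single-unit cost model described above can be illustrated with a standard Levenshtein distance computation. The following is a minimal Python sketch (the function name and the match limit are illustrative choices, not part of any standard library):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming.

    Each insertion, deletion, and substitution counts as one unit of cost.
    """
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from a[:0] to every prefix of b
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n]

# Reproduce the ''coil'' example with a limit of one operation:
pattern = "coil"
for candidate in ["foil", "coils", "oil", "foal"]:
    d = edit_distance(pattern, candidate)
    print(candidate, d, "match" if d <= 1 else "no match")
```

Running this prints a distance of 1 for ''foil'', ''coils'', and ''oil'' and 2 for ''foal'', matching the example in the text.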
One possible definition of the approximate string matching problem is the following: Given a pattern string ''P'' of length ''m'' and a text string ''T'' of length ''n'', find a substring of ''T'' which, of all substrings of ''T'', has the smallest edit distance to the pattern ''P''.

A brute-force approach would be to compute the edit distance to ''P'' for all substrings of ''T'', and then choose the substring with the minimum distance. However, this algorithm would have the running time ''O''(''n''³''m'').

A better solution, proposed by Sellers, relies on dynamic programming. It uses an alternative formulation of the problem: for each position ''j'' in the text ''T'' and each position ''i'' in the pattern ''P'', compute the minimum edit distance between the first ''i'' characters of the pattern and any substring of ''T'' that ends at position ''j''. That is, for each position ''j'' in the text ''T'' and each position ''i'' in the pattern ''P'', go through all substrings of ''T'' ending at position ''j'', and determine which one of them has the minimal edit distance to the first ''i'' characters of the pattern ''P''. Write this minimal distance as ''E''(''i'', ''j''). After computing ''E''(''i'', ''j'') for all ''i'' and ''j'', we can easily find a solution to the original problem: it is the substring for which ''E''(''m'', ''j'') is minimal (''m'' being the length of the pattern ''P'').

Computing ''E''(''m'', ''j'') is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for ''E''(''m'', ''j''), the only differences being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used ''E''(''i'' − 1, ''j''), ''E''(''i'', ''j'' − 1) or ''E''(''i'' − 1, ''j'' − 1) in computing ''E''(''i'', ''j'').
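Sellers' reformulation can be sketched as follows. This is an illustrative Python implementation (the function name is ours); the only change from the plain Levenshtein recurrence is that row 0 is initialized to zeros, so a match may begin at any position in the text:

```python
def min_substring_distance(P, T):
    """Smallest edit distance between pattern P and any substring of T.

    E(i, j) = minimum edit distance between P[:i] and any substring of T
    ending at position j. Row 0 is all zeros: the empty pattern matches
    an empty substring at every position of T for free.
    """
    m, n = len(P), len(T)
    prev = [0] * (n + 1)            # E(0, j) = 0 for all j
    for i in range(1, m + 1):
        cur = [i] + [0] * n         # E(i, 0) = i (delete i pattern characters)
        for j in range(1, n + 1):
            cost = 0 if P[i - 1] == T[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # E(i-1, j) + 1
                         cur[j - 1] + 1,     # E(i, j-1) + 1
                         prev[j - 1] + cost)  # E(i-1, j-1) + cost
        prev = cur
    return min(prev)                # best E(m, j) over all end positions j
```

For example, `min_substring_distance("coil", "recoiling")` is 0, since ''coil'' occurs exactly as a substring. This sketch keeps only two rows and so returns the distance but not the substring itself; recovering the substring requires the saved computation path described in the text.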
In the array containing the ''E''(''x'', ''y'') values, we then choose the minimal value in the last row, let it be ''E''(''x''<sub>2</sub>, ''y''<sub>2</sub>), and follow the path of computation backwards, back to row number 0. If the field we arrived at was ''E''(0, ''y''<sub>1</sub>), then ''T''[''y''<sub>1</sub> + 1] ... ''T''[''y''<sub>2</sub>] is a substring of ''T'' with the minimal edit distance to the pattern ''P''.

Computing the ''E''(''x'', ''y'') array takes ''O''(''mn'') time with the dynamic programming algorithm, while the backwards-working phase takes ''O''(''n'' + ''m'') time.

Another recent idea is the similarity join. When matching involves a large scale of data, the ''O''(''mn'') dynamic programming algorithm cannot finish within a limited time, so the idea is to reduce the number of candidate pairs instead of computing the similarity of ''all'' pairs of strings. Widely used algorithms are based on filter-verification, hashing, locality-sensitive hashing (LSH), tries, and other greedy and approximation algorithms. Most of them are designed to fit some framework (such as MapReduce) to compute concurrently.

On-line versus off-line
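The backwards-working phase above can be sketched as follows. This illustrative Python version (the function name is ours) stores the full ''E'' table and, rather than saving the path explicitly during the forward pass, reconstructs it by checking which of the three predecessors could have produced each value, which is equivalent:

```python
def best_match(P, T):
    """Return (substring of T closest to P, its edit distance).

    Builds the full E table, picks the minimum in the last row (position
    y2), and walks the path of computation back to row 0 to find where
    the matching substring starts (position y1).
    """
    m, n = len(P), len(T)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = i                 # row 0 stays all zeros
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if P[i - 1] == T[j - 1] else 1
            E[i][j] = min(E[i - 1][j] + 1,
                          E[i][j - 1] + 1,
                          E[i - 1][j - 1] + cost)
    # End position y2: column with the minimal value in the last row.
    y2 = min(range(n + 1), key=lambda j: E[m][j])
    i, j = m, y2
    while i > 0:                    # walk back to row 0
        cost = 0 if j > 0 and P[i - 1] == T[j - 1] else 1
        if j > 0 and E[i][j] == E[i - 1][j - 1] + cost:
            i, j = i - 1, j - 1     # came from E(i-1, j-1)
        elif E[i][j] == E[i - 1][j] + 1:
            i -= 1                  # came from E(i-1, j)
        else:
            j -= 1                  # came from E(i, j-1)
    return T[j:y2], E[m][y2]        # T[y1+1 .. y2] and its distance
```

For example, `best_match("coil", "recoiling")` returns `("coil", 0)`. The forward pass takes ''O''(''mn'') time and the traceback ''O''(''n'' + ''m''), as stated above, though storing the whole table costs ''O''(''mn'') space.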
Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be processed before searching, but the text cannot; with off-line algorithms, the text is preprocessed (indexed) before searching.

Applications
Common applications of approximate matching include spell checking. With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application.

See also
* Concept search
* Jaro–Winkler distance
* Levenshtein distance
* Locality-sensitive hashing
* Metaphone
* Needleman–Wunsch algorithm
* Plagiarism detection
* Regular expressions for fuzzy and non-fuzzy matching
* Smith–Waterman algorithm
* Soundex
* String metric