
In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric (Matthew A. Jaro, 1989) proposed in 1990 by William E. Winkler. The Jaro–Winkler distance uses a prefix scale p that gives more favourable ratings to strings that match from the beginning, up to a set prefix length \ell. The higher the Jaro–Winkler distance between two strings, the less similar they are. The score is normalized so that 0 means an exact match and 1 means no similarity. The original paper defined the measure in terms of similarity, so the distance is the complement of that value (distance = 1 − similarity). Although often referred to as a ''distance metric'', the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.


Definition


Jaro similarity

The Jaro similarity sim_j of two given strings s_1 and s_2 is

: sim_j = \left\{ \begin{array}{l l} 0 & \text{if } m = 0\\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) & \text{otherwise} \end{array} \right.

where:
* |s_i| is the length of the string s_i;
* m is the number of ''matching characters'' (see below);
* t is the number of ''transpositions'' (see below).

The Jaro similarity score is 0 if the strings do not match at all, and 1 if they are an exact match.

In the first step, each character of s_1 is compared with the characters of s_2 to find its matches. Two characters from s_1 and s_2 respectively are considered ''matching'' only if they are the same and not farther than \left\lfloor\frac{\max(|s_1|, |s_2|)}{2}\right\rfloor - 1 characters apart. For example, the two nine-character strings FAREMVIEL and FARMVILLE have 8 matching characters. 'F', 'A' and 'R' are in the same position in both strings. 'M', 'V', 'I', 'E' and 'L' are at most three positions apart (the result of \left\lfloor\tfrac{\max(9, 9)}{2}\right\rfloor - 1). The first 'E' of FAREMVIEL has no match, because the only 'E' in FARMVILLE is more than three positions away from it.

If no matching characters are found, the strings are not similar and the algorithm terminates, returning a Jaro similarity score of 0. If at least one matching character is found, the next step is to find the number of transpositions. The number of transpositions t is the number of matching characters that are not in the same order in both strings, divided by two. In the example above, 'E' and 'L' are the matching characters that are not in the same order, so the number of transpositions is one.

Finally, plugging in the number of matching characters m = 8 and the number of transpositions t = 1, the Jaro similarity of FAREMVIEL and FARMVILLE is

: sim_j = \frac{1}{3}\left(\frac{8}{9} + \frac{8}{9} + \frac{8-1}{8}\right) \approx 0.88
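The matching and transposition steps described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name `jaro_similarity`, the greedy left-to-right pairing of characters, and the handling of empty strings are assumptions made for this example and are not taken from the original papers.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity of s1 and s2, following the definition above."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        # Assumption: two empty strings are identical; otherwise nothing matches.
        return 1.0 if len1 == len2 else 0.0

    # Equal characters match only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Assumption: pair each character with the first free match in range.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Matched characters, each list in the order of its own string.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    # Transpositions: out-of-order matched characters, divided by two.
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


if __name__ == "__main__":
    # Worked example from the text: m = 8, t = 1, similarity about 0.88.
    print(round(jaro_similarity("FAREMVIEL", "FARMVILLE"), 2))
```

The two boolean lists keep every character from being matched more than once, which is what makes the count m well defined when a letter is repeated, as 'L' is in FARMVILLE.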


Jaro–Winkler similarity

Jaro–Winkler similarity uses a prefix scale p that gives more favorable ratings to strings that match from the beginning, up to a set prefix length \ell. Given two strings s_1 and s_2, their Jaro–Winkler similarity sim_w is

: sim_w = sim_j + \ell p (1 - sim_j),

where:
* sim_j is the Jaro similarity of the strings s_1 and s_2;
* \ell is the length of the common prefix at the start of the strings, up to a maximum of 4 characters;
* p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum prefix length considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.

The Jaro–Winkler distance d_w is defined as d_w = 1 - sim_w.

Although often referred to as a ''distance metric'', the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom d(x,y) = 0 \leftrightarrow x = y.
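The prefix adjustment can be sketched in Python as follows. The function names and the choice to pass the Jaro similarity in as an argument are assumptions made for this illustration, and the default values simply restate p = 0.1 and the four-character prefix limit given above.

```python
def jaro_winkler_similarity(sim_j: float, s1: str, s2: str,
                            p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix bonus to an already computed Jaro similarity."""
    if p < 0 or p * max_prefix > 1:
        # With p above 1 / max_prefix the result could exceed 1.
        raise ValueError("p must lie between 0 and 1 / max_prefix")

    # ell: length of the common prefix, capped at max_prefix characters.
    ell = 0
    for a, b in zip(s1, s2):
        if a != b or ell == max_prefix:
            break
        ell += 1

    return sim_j + ell * p * (1.0 - sim_j)


def jaro_winkler_distance(sim_j: float, s1: str, s2: str,
                          p: float = 0.1, max_prefix: int = 4) -> float:
    """Distance d_w = 1 - sim_w."""
    return 1.0 - jaro_winkler_similarity(sim_j, s1, s2, p, max_prefix)


if __name__ == "__main__":
    # FAREMVIEL and FARMVILLE share the prefix "FAR" (ell = 3); their Jaro
    # similarity (8/9 + 8/9 + 7/8) / 3 comes from the worked example.
    sim_j = (8 / 9 + 8 / 9 + 7 / 8) / 3
    print(jaro_winkler_similarity(sim_j, "FAREMVIEL", "FARMVILLE"))
```

Because the bonus is proportional to 1 - sim_j, strings that are already very similar gain little, and identical strings stay at exactly 1.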


Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
* the Levenshtein distance allows deletion, insertion and substitution;
* the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
* the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
* the Hamming distance allows only substitution, hence, it only applies to strings of the same length.

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
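To make the difference in allowed operations concrete, here is a minimal Python sketch of two of the measures listed above, the Hamming distance and the Levenshtein distance with unit costs. The function names and the dynamic-programming formulation are choices made for this illustration; the other measures are not covered.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions that differ; substitution is the only operation."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions (unit cost)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y, free if equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming_distance("karolin", "kathrin"))
    print(levenshtein_distance("kitten", "sitting"))
```

Both functions return integer counts of edit operations, unlike the Jaro–Winkler distance, which is a normalized score between 0 and 1.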


See also

* Record linkage
* Census




External links


strcmp.c - Original C implementation by the author of the algorithm


Python implementation in the Natural Language Toolkit