HOME

TheInfoList



OR:

In mathematics and
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, a string metric (also known as a string similarity metric or string distance function) is a metric that measures
distance Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...
("inverse similarity") between two text strings for
approximate string matching In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching ...
or comparison and in fuzzy string searching. A requirement for a string ''metric'' (e.g. in contrast to string matching) is fulfillment of the
triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of degenerate triangles, bu ...
. For example, the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance. The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance). It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic,
token Token may refer to: Arts, entertainment, and media * Token, a game piece or counter, used in some games * The Tokens, a vocal music group * Tolkien Black, a recurring character on the animated television series ''South Park,'' formerly known as ...
, grammatical and character-based methods of statistical comparisons. String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection,
ontology merging {{Unreferenced, date=March 2013 Ontology merging defines the act of bringing together two conceptually divergent ontologies or the instance data associated to two ontologies. This is similar to work in database merging (schema matching). This merg ...
, DNA analysis, RNA analysis, image analysis, evidence-based
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
,
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...
data deduplication, data mining, incremental search, data integration, Malware Detection, and semantic knowledge integration.


List of string metrics

* Levenshtein distance, or its generalization edit distance * Damerau–Levenshtein distance * Sørensen–Dice coefficient * Block distance or L1 distance or City block distance *
Hamming distance In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of ''substitutions'' required to chang ...
* Jaro–Winkler distance *
Simple matching coefficient The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets. Given two objects, A and B, each with ''n'' binary attributes, SMC is defined as: : \begin \te ...
(SMC) *
Jaccard similarity The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is freque ...
or Jaccard coefficient or Tanimoto coefficient * Tversky index * Overlap coefficient *
Variational distance Variational may refer to: *Calculus of variations, a field of mathematical analysis that deals with maximizing or minimizing functionals * Variational method (quantum mechanics), a way of finding approximations to the lowest energy eigenstate or g ...
Sam's String Metrics - Computational Linguistics and Phonetics
/ref> * Hellinger distance or Bhattacharyya distance *
Information radius Information is an Abstraction, abstract concept that refers to that which has the power to Communication, inform. At the most fundamental level information pertains to the Interpretation (logic), interpretation of that which may be sensed. ...
( Jensen–Shannon divergence) * Skew divergence * Confusion probability * Tau metric, an approximation of the
Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fro ...
* Fellegi and Sunters metric (SFS) *
Maximal matches Maximal may refer to: *Maximal element, a mathematical definition * Maximal (Transformers), a faction of Transformers * Maximalism, an artistic style * Maximal set * ''Maxim'' (magazine), a men's magazine marketed as ''Maximal'' in several countrie ...
* Grammar-based distance * TFIDF distance metric


Selected string measures examples


References


External links


String Similarity Metrics for Information Integration
A fairly complete overview
Carnegie Mellon University open source library

StringMetric project
a Scala library of string metrics and phonetic algorithms
Natural project
a
JavaScript JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of Website, websites use JavaScript on the Client (computing), client side ...
natural language processing library which includes implementations of popular string metrics {{DEFAULTSORT:String Metric Metrics