Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names that sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

Philips later produced a new version of the algorithm, which he named Double Metaphone. Unlike the original algorithm, whose application is limited to English, this version takes into account spelling peculiarities of a number of other languages. In 2009 Philips released a third version, called Metaphone 3, which achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States; it was developed according to modern engineering standards against a test harness of prepared correct encodings.


Procedure

Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code. The following list summarizes most of the rules in the original implementation:

1. Drop duplicate adjacent letters, except for C.
2. If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.
3. Drop 'B' if after 'M' at the end of the word.
4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' and is at the end.
7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'.
8. Drop 'H' if after a vowel and not before a vowel.
9. 'CK' transforms to 'K'.
10. 'PH' transforms to 'F'.
11. 'Q' transforms to 'K'.
12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
14. 'V' transforms to 'F'.
15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
17. Drop 'Y' if not followed by a vowel.
18. 'Z' transforms to 'S'.
19. Drop all vowels unless it is the beginning.

This list does not constitute a complete description of the original Metaphone algorithm, and the algorithm cannot be coded correctly from it. Original Metaphone contained many errors and was superseded by Double Metaphone; both were in turn superseded by Metaphone 3, which corrects thousands of miscodings produced by the first two versions.

To implement Metaphone without purchasing a (source code) copy of Metaphone 3, the reference implementation of Double Metaphone can be used. Alternatively, version 2.1.3 of Metaphone 3, an earlier 2009 version without a number of encoding corrections made in the current version (2.5.4), has been made available under the terms of the BSD License via the OpenRefine project.
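
To give a feel for how rules of this kind operate, the following Python sketch applies only a handful of the simpler rules summarized above: the initial-letter drops, duplicate removal, 'B' after 'M' at the end, 'PH', 'Q', 'V', 'X', 'Z', and vowel removal. It is not the Metaphone algorithm, which (as noted) cannot be coded correctly from the summary. The function name `partial_metaphone` is hypothetical, and the sketch deliberately rejects words containing letters whose rules depend on context it does not implement (C, D, G, H, S, T, W, Y).

```python
VOWELS = set("AEIOU")
INITIAL_DROPS = ("KN", "GN", "PN", "AE", "WR")  # drop the first letter
SUBSTITUTE = {"Q": "K", "V": "F", "Z": "S"}     # context-free rules
UNCHANGED = set("FJKLMNR")                      # keep their own symbol


def partial_metaphone(word: str) -> str:
    """Encode `word` with a small subset of the summarized rules.

    Hypothetical helper, not a Metaphone implementation: it raises
    ValueError for letters whose rules are not implemented here.
    """
    w = word.upper()
    if not w.isalpha():
        raise ValueError(f"letters only: {word!r}")
    if w.startswith(INITIAL_DROPS):
        w = w[1:]
    out = []
    i = 0
    while i < len(w):
        ch = w[i]
        prev = w[i - 1] if i > 0 else ""
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if ch == prev and ch != "C":
            pass  # duplicate adjacent letter: dropped
        elif ch in VOWELS:
            if i == 0:
                out.append(ch)  # vowels survive only at the start
        elif ch == "B":
            if not (prev == "M" and i == len(w) - 1):
                out.append("B")  # 'B' is dropped after 'M' at the end
        elif ch == "P":
            if nxt == "H":
                out.append("F")  # 'PH' -> 'F'
                i += 1           # consume the 'H'
            else:
                out.append("P")
        elif ch == "X":
            out.append("S" if i == 0 else "KS")
        elif ch in SUBSTITUTE:
            out.append(SUBSTITUTE[ch])
        elif ch in UNCHANGED:
            out.append(ch)
        else:
            raise ValueError(f"no rule implemented for {ch!r}")
        i += 1
    return "".join(out)


print(partial_metaphone("phone"), partial_metaphone("fone"))  # FN FN
print(partial_metaphone("knife"))                             # NF
```

Within the subset it handles, the sketch shows the intended effect: differently spelled but similar-sounding inputs such as "phone" and "fone" receive the same key.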


Double Metaphone

The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal. It makes a number of fundamental design improvements over the original Metaphone algorithm.

It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT; both have XMT in common.

Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.
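
The way primary and secondary codes are used for matching can be sketched in a few lines of Python. The code pairs below are the ones quoted above for "Smith" and "Schmidt" and are hard-coded; in practice they would come from a Double Metaphone implementation. The helper names `shared_codes` and `build_index` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

# (primary, secondary) code pairs quoted in the text above. Hard-coded here;
# a real system would compute them with a Double Metaphone implementation.
CODES = {
    "Smith": ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
}


def shared_codes(a: str, b: str) -> set:
    """Return the codes two names have in common (empty set if none)."""
    return set(CODES[a]) & set(CODES[b])


def build_index(codes: dict) -> dict:
    """Map every primary or secondary code to the names that produce it."""
    index = defaultdict(set)
    for name, pair in codes.items():
        for code in pair:
            if code:  # skip an empty secondary code, should one occur
                index[code].add(name)
    return index


index = build_index(CODES)
print(shared_codes("Smith", "Schmidt"))  # {'XMT'}
print(sorted(index["XMT"]))              # ['Schmidt', 'Smith']
```

Indexing each name under both of its codes means a lookup by either code finds it, which is how the two surname variants end up in the same bucket.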


Metaphone 3

A professional version was released in October 2009, developed by the same author, Lawrence Philips. It is a commercial product sold as source code. Metaphone 3 further improves phonetic encoding of words in the English language, non-English words familiar to Americans, and first names and family names commonly found in the United States. In particular, it considerably improves the encoding of proper names. The author claims that, in general, it improves accuracy for all words from the approximately 89% of Double Metaphone to 98%.

Developers can also set switches in code to make the algorithm encode Metaphone keys (1) taking non-initial vowels into account and (2) encoding voiced and unvoiced consonants differently. This allows the result set to be focused more closely if the developer finds that the search results include too many words that do not resemble the search term closely enough.

Metaphone 3 is sold as C++, Java, C#, PHP, Perl, and PL/SQL source, with Ruby and Python wrappers accessing a Java jar; versions of Metaphone 3 for Spanish and German pronunciation are also available as Java and C# source. The latest revision of the Metaphone 3 algorithm is v2.5.4, released March 2015. The Metaphone3 Java source code for an earlier version, 2.1.3, which lacks a large number of encoding corrections made in version 2.5.4, was included as part of the OpenRefine project and is publicly viewable (OpenRefine source for Metaphone3, github.com, https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/Metaphone3.java, accessed 2 Nov 2020).


Common misconceptions

There are some misconceptions about the Metaphone algorithms that should be addressed. The following statements are true:

1. All of them are designed to address regular, "dictionary" words, not just names.
2. Metaphone algorithms do not produce exact phonetic representations of the input words and names; rather, the output is an intentionally approximate phonetic representation, according to this standard:
   * words that start with a vowel sound will have an 'A', representing any vowel, as the first character of the encoding (in Double Metaphone and Metaphone 3; original Metaphone just preserves the actual vowel),
   * vowels after an initial vowel sound will be disregarded and not encoded, and
   * voiced/unvoiced consonant pairs will be mapped to the same encoding. (Examples of voiced/unvoiced consonant pairs are D/T, B/P, Z/S, G/K, etc.)

This approximate encoding is necessary to account for the way English speakers vary their pronunciations and misspell or otherwise vary words and names they are trying to spell. Vowels, of course, are notoriously variable. British speakers often complain that Americans seem to pronounce 'T' the same as 'D'. Consider, also, that all English speakers often pronounce 'Z' where 'S' is spelled, almost always when a noun ending in a voiced consonant or a liquid is pluralized, for example "seasons", "beams", "examples", etc. Not encoding vowels after an initial vowel sound helps to group words where a vowel and a consonant may be transposed in the misspelling or alternative pronunciation.
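
The three conventions described above can be illustrated with a toy Python function. This is not any Metaphone version: it works letter by letter, treating letters as if they were sounds, and maps only the four consonant pairs named in the text. The function `approximate_key` and its two flags are hypothetical; the flags loosely mimic the idea of the Metaphone 3 switches for encoding non-initial vowels and for keeping voiced and unvoiced consonants distinct.

```python
VOWELS = set("AEIOU")
# Voiced consonant -> unvoiced partner, using the pairs named in the text.
UNVOICE = {"D": "T", "B": "P", "Z": "S", "G": "K"}


def approximate_key(word: str, encode_vowels: bool = False,
                    encode_exact: bool = False) -> str:
    """Toy key: initial vowel -> 'A', later vowels dropped, pairs merged.

    encode_vowels=True also writes 'A' for non-initial vowels.
    encode_exact=True keeps voiced and unvoiced consonants distinct.
    """
    letters = [ch for ch in word.upper() if ch.isalpha()]
    out = []
    for i, ch in enumerate(letters):
        if ch in VOWELS:
            if i == 0 or encode_vowels:
                out.append("A")  # 'A' stands for any vowel
        elif encode_exact:
            out.append(ch)
        else:
            out.append(UNVOICE.get(ch, ch))
    return "".join(out)


print(approximate_key("bad"), approximate_key("pat"))    # PT PT
print(approximate_key("from"), approximate_key("form"))  # FRM FRM
print(approximate_key("bad", encode_exact=True))         # BD
print(approximate_key("Adam", encode_vowels=True))       # ATAM
```

With the defaults, "from" and "form" collapse to the same key because the transposed vowel is ignored; turning on either flag narrows the matches, at the cost of missing more spelling variants.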


Metaphone of other languages

Metaphone is useful for English variants and other languages, having been preferred to Soundex in several Indo-European languages. On the other hand, rough phonetic encoding causes language dependency (or, within a language variant, dependency on the average speaker of that variant), mainly for non-English variants. Perhaps the first example of a stable adaptation of Metaphone to a language other than English was for Brazilian Portuguese: it originated in about 2008 as a database solution in the municipality of Várzea Paulista, Brazil, and evolved into the current metaphone-ptbr algorithm.


See also

* Caverphone
* New York State Identification and Intelligence System
* Match Rating Approach
* Approximate string matching


External links


* The Double Metaphone Search Algorithm, by Lawrence Phillips, June 1, 2000, Dr Dobb's (original article)


Metaphone algorithms for other languages


* Brazilian Portuguese in C: Metaphone for Brazilian Portuguese, in C with PHP and PostgreSQL port.
* Brazilian Portuguese in Java: Metaphone for Brazilian Portuguese, in Java.
* Spanish Metaphone in Python
* Double Metaphone algorithm for Bangla
* Russian Metaphone in Ruby
* Double Metaphone and Metaphone in JavaScript