Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, in part because it is a standard feature of popular database software such as IBM Db2, PostgreSQL, MySQL, SQLite, Ingres, MS SQL Server, Oracle, ClickHouse, Snowflake and SAP ASE. Improvements to Soundex are the basis for many modern phonetic algorithms.


History

Soundex was developed by Robert C. Russell and Margaret King Odell and patented in 1918 and 1922. A variation, American Soundex, was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the ''Communications'' and ''Journal of the Association for Computing Machinery'', and especially when described in Donald Knuth's ''The Art of Computer Programming''.

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. government. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".


American Soundex

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

# Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
# Replace consonants with digits as follows (after the first letter):
#* b, f, p, v → 1
#* c, g, j, k, q, s, x, z → 2
#* d, t → 3
#* l → 4
#* m, n → 5
#* r → 6
# If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h', 'w' or 'y' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
# If there are too few letters in the word to assign three numbers, append zeros until there are three numbers. If there are four or more numbers, retain only the first three.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261". "Tymczak" yields "T522" not "T520" (the letters 'z' and 'k' in the name are both coded as 2, since a vowel lies between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once, as 'P'), and "Honeyman" yields "H555".

The following algorithm is followed by most SQL implementations (excluding PostgreSQL):

# Save the first letter. Map all occurrences of a, e, i, o, u, y, h, w to zero (0).
# Replace all consonants (including the first letter) with digits, using the same mapping as in the first procedure.
# Replace all runs of adjacent identical digits with a single digit, and then remove all the zero (0) digits.
# If the saved letter's digit is the same as the resulting first digit, remove the digit (keep the letter).
# Append three zeros if the result contains fewer than three digits, then keep only the first letter and the three digits after it (this is the same as the final step of the first procedure).
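As a minimal sketch, the first procedure above (the one that yields "T522" for "Tymczak") might be written in Python as follows. The function name `american_soundex` and the `CODES` table are illustrative names, not part of any standard library. The sketch assumes that non-ASCII and non-letter characters are simply ignored, and it treats 'y' like 'h' and 'w', as the rule list above does; other implementations may differ on that point.

```python
# Digit groups taken from the rule list above; letters not listed get no digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        CODES[ch] = digit


def american_soundex(name: str) -> str:
    """Return a letter plus three digits, or "" if the name has no ASCII letters."""
    # Assumption: anything other than a-z (after lowercasing) is dropped.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""

    first = letters[0]
    digits = []
    # The duplicate rule also applies to the first letter, so start from its code.
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hwy":
            # Transparent letters: they do not separate two equal codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # A vowel has no code, so it resets prev and a repeated code
        # after the vowel is emitted again.
        prev = code

    # Pad with zeros, then keep the letter and the first three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


for example in ("Robert", "Rupert", "Rubin", "Tymczak", "Pfister"):
    print(example, american_soundex(example))
```

Under these assumptions the sketch reproduces the examples given above: "Robert" and "Rupert" both map to "R163", "Rubin" to "R150", "Tymczak" to "T522" and "Pfister" to "P236".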
The two algorithms above do not return the same results in all cases, primarily because they remove the vowels at different points. The first algorithm is used by most programming languages and the second is used by SQL. For example, "Tymczak" yields "T522" in the first algorithm, but "T520" in the algorithm used by SQL. Often, both algorithms generate the same code: both "Robert" and "Rupert" yield "R163", and "Honeyman" yields "H555". When designing an application that combines SQL and a programming language, the architect must decide whether to do all of the Soundex encoding in the SQL server or all of it in the programming language, so that codes produced by one algorithm are never compared with codes produced by the other. The MySQL implementation can return more than 4 characters.


Variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The New York State Identification and Intelligence System (NYSIIS) algorithm was introduced in 1970 as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

Daitch–Mokotoff Soundex (D–M Soundex) was developed in 1985 by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D–M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex", although the authors discourage the use of those names. The D–M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999. This algorithm is much more complex than Russell Soundex.

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm in 1990. Philips developed an improvement to Metaphone in 2000, which he called Double Metaphone. Double Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English. Philips created Metaphone 3 as a further revision in 2009 to provide a professional version that gives a much higher percentage of correct encodings for English words, non-English words familiar to Americans, and first and last names found in the United States. It also provides settings for more exact consonant and internal vowel matching, so that the programmer can tune the precision of matches more closely.


See also

* Cologne phonetics
* Match Rating Approach
* Levenshtein distance

