Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible with, but not canonically equivalent to, the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).


Sources of equivalence


Character duplication

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a ring diacritic above" is encoded as U+00C5 ("Å", a letter of the alphabet in Swedish and several other languages) or as U+212B (the angstrom sign "Å"). Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (such as "V" for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.
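The duplication described above can be inspected directly. The following is a minimal sketch that assumes Python's standard `unicodedata` module as the lookup tool (the article itself does not name an implementation); it shows that the angstrom sign carries an untagged, that is canonical, decomposition to the Swedish letter, so all four normal forms unify the two code points.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_ring = "\u00c5"         # LATIN CAPITAL LETTER A WITH RING ABOVE

# As raw strings the two code points differ.
assert angstrom_sign != a_ring

# The character database maps the sign to the letter with no
# compatibility tag, which marks the mapping as canonical.
print(unicodedata.decomposition(angstrom_sign))  # 00C5

# Every normal form therefore yields the same result for both.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert (unicodedata.normalize(form, angstrom_sign)
            == unicodedata.normalize(form, a_ring))
```

A comparison that normalizes both operands first treats the two spellings as the same character; a raw comparison does not.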


Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "ij").

For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are U+0303 (the combining tilde) and the Japanese diacritic dakuten (U+3099).

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; and character decomposition is the opposite process. In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.


Example
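As an illustrative sketch only, composition and decomposition can be tried out with Python's standard `unicodedata` module (an assumed tool, not one the article prescribes). The snippet decomposes and recomposes "ñ", and splits one Hangul syllable block into its conjoining jamo.

```python
import unicodedata

# "ñ" as one precomposed code point, and as base letter + combining tilde.
precomposed = "\u00f1"
decomposed = "n\u0303"

# Decomposition (NFD) splits the precomposed character;
# composition (NFC) merges the sequence back into one code point.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# A Hangul syllable block decomposes into conjoining jamo:
# U+D55C -> U+1112 (leading), U+1161 (vowel), U+11AB (trailing).
hangul = "\ud55c"
jamo = unicodedata.normalize("NFD", hangul)
print([f"U+{ord(c):04X}" for c in jamo])

# Composition restores the single syllable block.
assert unicodedata.normalize("NFC", jamo) == hangul
```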


Typographical non-interaction

Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
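A small sketch of this behaviour, assuming Python's standard `unicodedata` module and using a Latin example chosen purely for illustration: a dot below and an acute accent attach to different sides of the base letter, have different combining classes, and so may be stored in either order.

```python
import unicodedata

dot_below = "\u0323"  # COMBINING DOT BELOW, combining class 220
acute = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230

s1 = "a" + dot_below + acute
s2 = "a" + acute + dot_below

# The stored code point sequences differ...
assert s1 != s2

# ...but both orders normalize to the same sequence, so they are
# canonically equivalent.
assert unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)
assert unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
```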


Typographic conventions

Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
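The difference between the two equivalence notions for such characters can be sketched as follows, assuming Python's standard `unicodedata` module. The sample characters are chosen for illustration; canonical normalization (NFC) leaves each one untouched, while compatibility normalization (NFKC) replaces it with its plain counterpart.

```python
import unicodedata

# Compatibility character -> plain counterpart expected under NFKC.
samples = {
    "\ufb00": "ff",      # LATIN SMALL LIGATURE FF
    "\uff76": "\u30ab",  # HALFWIDTH KATAKANA LETTER KA -> KATAKANA LETTER KA
    "\uff21": "A",       # FULLWIDTH LATIN CAPITAL LETTER A
    "\u2082": "2",       # SUBSCRIPT TWO
    "\u00b2": "2",       # SUPERSCRIPT TWO
    "\u2460": "1",       # CIRCLED DIGIT ONE
}

for original, plain in samples.items():
    # Not canonically equivalent: NFC keeps the styled character.
    assert unicodedata.normalize("NFC", original) == original
    # Compatible: NFKC drops the styling and keeps the base meaning.
    assert unicodedata.normalize("NFKC", original) == plain
```

Note that the subscript and superscript digits both collapse to the same plain digit, which is exactly the loss of "added semantics" the paragraph warns about.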


Encoding errors

UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as other forms of normalization.
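One such lossy rule can be demonstrated with Python's built-in decoder, used here only as an example of a single piece of software; other decoders may apply different rules. With `errors="replace"`, every invalid byte becomes U+FFFD, so inputs that differed as bytes become indistinguishable as text.

```python
# Two byte strings that differ only in an invalid UTF-8 byte
# (0xFF and 0xFE can never appear in well-formed UTF-8).
bad1 = b"caf\xff"
bad2 = b"caf\xfe"
assert bad1 != bad2

# A strict decoder rejects the input outright.
try:
    bad1.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A replacing decoder maps both invalid bytes to U+FFFD,
# erasing the difference between the two inputs.
decoded1 = bad1.decode("utf-8", errors="replace")
decoded2 = bad2.decode("utf-8", errors="replace")
assert decoded1 == decoded2 == "caf\ufffd"
```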


Normalization

Text-processing software that implements Unicode string search and comparison must take into account the presence of equivalent code point sequences. Without this, users searching for a particular code point sequence would be unable to find text that looks identical but has a different, canonically equivalent, code point representation.
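A minimal sketch of such a search, assuming Python's standard `unicodedata` module; the helper name `contains` is hypothetical and exists only for this illustration. Both operands are brought to the same normal form before the substring test.

```python
import unicodedata

def contains(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring test after normalizing both operands to the same form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))

text = "Espan\u0303a"   # "n" followed by a combining tilde
query = "Espa\u00f1a"   # precomposed "ñ"

# A raw code point comparison misses the match.
assert query not in text

# Normalizing both sides first finds it.
assert contains(text, query)
```

This sketch compares code point sequences only; a production search would also need to decide whether a match may begin or end in the middle of a base-plus-combining-mark sequence.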


Algorithms

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 ("ffi"), Roman numerals like U+2168 ("Ⅸ") and even subscripts and superscripts, e.g. U+2075 ("⁵"), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 ("f") as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. The same applies when searching for the Latin letter "I" (U+0049) in the precomposed Roman numeral "Ⅸ" (U+2168). Similarly, the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation. In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.
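The effect of the equivalence criterion on search, and the compatibility formatting tags, can be checked with a short sketch. It assumes Python's standard `unicodedata` module, whose `decomposition` function returns the raw mapping from the character database, including any tag.

```python
import unicodedata

ffi = "\ufb03"          # LATIN SMALL LIGATURE FFI
roman_nine = "\u2168"   # ROMAN NUMERAL NINE
super_five = "\u2075"   # SUPERSCRIPT FIVE

# Canonical normalization leaves the ligature intact, so "f" is not found.
assert "f" not in unicodedata.normalize("NFC", ffi)

# Compatibility normalization expands it, so the same search succeeds.
assert "f" in unicodedata.normalize("NFKC", ffi)
assert "I" in unicodedata.normalize("NFKC", roman_nine)
assert unicodedata.normalize("NFKC", super_five) == "5"

# The formatting tag records what kind of information NFK discards.
print(unicodedata.decomposition(ffi))         # <compat> 0066 0066 0069
print(unicodedata.decomposition(super_five))  # <super> 0035
```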


Normal forms

The four Unicode normalization forms are NFD (canonical decomposition), NFC (canonical decomposition followed by canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility decomposition followed by canonical composition). All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm. However, they are not injective (they map different original glyphs and sequences to the same normalized sequence) and thus also not bijective (the original cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

The normal forms are not closed under string concatenation: joining two normalized strings does not always give a normalized string. For defective Unicode strings starting with a Hangul vowel or trailing conjoining jamo, concatenation can break composition.

A single character (other than a Hangul syllable block) that will get replaced by another under normalization can be identified in the Unicode tables by having a non-empty decomposition field that lacks a compatibility tag.
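The three properties discussed in this section (idempotence, non-injectivity, and non-closure under concatenation) can be checked with a brief sketch, assuming Python's standard `unicodedata` module. The sample strings are illustrative choices, not taken from the standard.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Idempotence: normalizing an already normalized string changes nothing.
sample = "\u212b n\u0303 \ufb03 \ud55c"
for form in FORMS:
    once = unicodedata.normalize(form, sample)
    assert unicodedata.normalize(form, once) == once

# Non-injectivity: two distinct inputs share one normalized result,
# so the original string cannot be recovered from the output.
nfc_sign = unicodedata.normalize("NFC", "\u212b")
nfc_letter = unicodedata.normalize("NFC", "\u00c5")
assert nfc_sign == nfc_letter == "\u00c5"

# Non-closure under concatenation: each part is in NFC on its own,
# but the joined string is not, because the pair composes to U+00E9.
left, right = "e", "\u0301"
assert unicodedata.normalize("NFC", left) == left
assert unicodedata.normalize("NFC", right) == right
joined = left + right
assert unicodedata.normalize("NFC", joined) != joined
```

In practice this means that a string built by concatenating normalized pieces may need to be normalized again at the join.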


Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are ''not'' considered equivalent.

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
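The reordering step can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the reference algorithm: it assumes the input is already fully decomposed, takes combining classes from Python's standard `unicodedata.combining`, and relies on Python's `sorted` being stable. The function name `canonical_reorder` is hypothetical.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stable-sort each run of combining characters by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Different classes (230 then 220): the marks are swapped.
assert canonical_reorder("a\u0301\u0323") == "a\u0323\u0301"

# Same class (230 and 230): stable sorting keeps the original order,
# so the two orders stay distinct.
assert canonical_reorder("e\u0301\u0302") == "e\u0301\u0302"
assert canonical_reorder("e\u0302\u0301") == "e\u0302\u0301"

# Only circumflex-then-acute composes to U+1EBF under NFC.
assert unicodedata.normalize("NFC", "e\u0302\u0301") == "\u1ebf"
assert unicodedata.normalize("NFC", "e\u0301\u0302") == "\u00e9\u0302"
```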


Errors due to normalization differences

When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Netatalk and Samba file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, leading to data loss. Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
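The failure mode can be imitated in miniature. The sketch below is a hypothetical model, not the behaviour of any real file server: a dictionary stands in for the server's file table, a decomposing client stands in for the system that alters names, and `canonical_key` is an invented helper. It assumes Python's standard `unicodedata` module.

```python
import unicodedata

def canonical_key(name: str, form: str = "NFC") -> str:
    """Hypothetical helper: one agreed normal form for name comparison."""
    return unicodedata.normalize(form, name)

# A server that stores the name exactly as first received (composed here).
server_files = {"r\u00e9sum\u00e9.txt": b"data"}

# A client that decomposes names before sending them back.
client_name = unicodedata.normalize("NFD", "r\u00e9sum\u00e9.txt")

# A code-point-for-code-point comparison no longer finds the file.
assert client_name not in server_files

# Comparing under one agreed form on both sides restores the match.
index = {canonical_key(n): n for n in server_files}
assert index[canonical_key(client_name)] == "r\u00e9sum\u00e9.txt"
```

Comparing under a shared form avoids the mismatch going forward, but it cannot tell which spelling a name originally had, which is why repairing already-altered data is hard.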


See also

* Complex text layout
* Diacritic
* IDN homograph attack
* ISO/IEC 14651
* Ligature (typography)
* Precomposed character
* The uconv tool can convert to and from NFC and NFD Unicode normalization forms.
* Unicode
* Unicode compatibility characters


References


Unicode Standard Annex #15: Unicode Normalization Forms


External links




Charlint - a character normalization tool written in Perl