HOME

TheInfoList



OR:

A precomposed character (alternatively composite character or decomposable character) is a
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
entity that can also be defined as a sequence of one or more other characters. A precomposed character may typically represent a letter with a
diacritical mark A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacriti ...
, such as ''é'' (Latin small letter ''e'' with
acute accent The acute accent (), , is a diacritic used in many modern written languages with alphabets based on the Latin, Cyrillic, and Greek scripts. For the most commonly encountered uses of the accent in the Latin and Greek alphabets, precomposed cha ...
). Technically, ''é'' (U+00E9) is a character that can be decomposed into an
equivalent Equivalence or Equivalent may refer to: Arts and entertainment *Album-equivalent unit, a measurement unit in the music industry *Equivalence class (music) *''Equivalent VIII'', or ''The Bricks'', a minimalist sculpture by Carl Andre *'' Equival ...
string of the base letter ''e'' (U+0065) and combining acute accent (U+0301). Similarly,
ligatures Ligature may refer to: * Ligature (medicine), a piece of suture used to shut off a blood vessel or other anatomical structure ** Ligature (orthodontic), used in dentistry * Ligature (music), an element of musical notation used especially in the me ...
are precompositions of their constituent letters or
grapheme In linguistics, a grapheme is the smallest functional unit of a writing system. The word ''grapheme'' is derived and the suffix ''-eme'' by analogy with ''phoneme'' and other names of emic units. The study of graphemes is called ''graphemics' ...
s. Precomposed characters are the legacy solution for representing many special letters in various
character set Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
s. In Unicode, they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.


Comparing precomposed and decomposed characters

In the following example, there is a common Swedish surname Åström written in the two alternative methods, the first one with a precomposed '' Å'' (U+00C5) and '' ö'' (U+00F6), and the second one using a decomposed base letter '' A'' (U+0041) with a combining
ring above A ring diacritic may appear above or below letters. It may be combined with some letters of the extended Latin alphabets in various contexts. Rings Distinct letter The character Å (å) is derived from an A with a ring. It is a distinct le ...
(U+030A) and an '' o'' (U+006F) with a combining diaeresis (U+0308). #Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D) #Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D) Except for the different colors, the two solutions are equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, as they are not included in all
font In metal typesetting, a font is a particular size, weight and style of a typeface. Each font is a matched set of type, with a piece (a "sort") for each glyph. A typeface consists of a range of such fonts that shared an overall design. In mod ...
s. To overcome the problems, some applications may simply attempt to replace the decomposed characters with the equivalent precomposed characters. With an incomplete font, however, precomposed characters may also be problematic – especially if they are more exotic, as in the following example (showing the reconstructed
Proto-Indo-European Proto-Indo-European (PIE) is the reconstructed common ancestor of the Indo-European language family. Its proposed features have been derived by linguistic reconstruction from documented Indo-European languages. No direct record of Proto-Indo- ...
word for "dog"): #ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E) #ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E) In some situations, the precomposed green k, u and o with diacritics may render as unrecognized characters, or their typographical appearance may be very different from the final letter n with no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics could not be recognized.
OpenType OpenType is a format for scalable computer fonts. It was built on its predecessor TrueType, retaining TrueType's basic structure and adding many intricate data structures for prescribing typographic behavior. OpenType is a registered trademark ...
has the ''ccmp'' "feature tag" to define glyphs that are compositions or decompositions involving combining characters.


Chinese characters

In theory, most
Chinese characters Chinese characters () are logograms developed for the writing of Chinese. In addition, they have been adapted to write other East Asian languages, and remain a key component of the Japanese writing system where they are known as ''kanji ...
as encoded by
Han unification Han unification is an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the Han characters of the so-called CJK languages into a single set of unified characters. Han characters are a feature ...
and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent radical and phonetic components with
Chinese character description languages The Chinese character description languages are several proposed languages to most accurately and completely describe Chinese (or CJK) characters and information such as their list of components, list of strokes (basic and complex), their order, a ...
. Such an approach could reduce the number of characters in the character set from tens of thousands to just a few thousand. On the other hand, a decomposed character set would introduce challenges for searching and editing software and require more bytes of encoding per document.


See also

*
List of precomposed Latin characters in Unicode This is a list of precomposed Latin characters in Unicode. Unicode typefaces may be needed for these to display correctly. Letters with diacritics Digraphs and ligatures * DZ, Dz, dz * DŽ, Dž, dž * ff * ffi * ffl * fi * fl * IJ, ij * ...
*
Dead key A dead key is a special kind of modifier key on a mechanical typewriter, or computer keyboard, that is typically used to attach a specific diacritic to a base letter. The dead key does not generate a (complete) character by itself, but modifies ...
*
Compose key A compose key (sometimes called multi key) is a key on a computer keyboard that indicates that the following (usually 2 or more) keystrokes trigger the insertion of an alternate character, typically a precomposed character or a symbol. For insta ...
*
Combining character In digital typography, combining characters are characters that are intended to modify other characters. The most common combining characters in the Latin script are the combining diacritical marks (including combining accents). Unicode also ...
*
Unicode equivalence Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting st ...
*
Complex text layout Complex text layout (CTL) or complex text rendering is the typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes. The term is used in the field of software internationalizatio ...
*
Unicode compatibility characters In Unicode and the UCS, a compatibility character is a character that is encoded solely to maintain round-trip convertibility with other, often older, standards. As the Unicode Glossary says: A character that would not have been encoded excep ...
*
Alphabetic Presentation Forms Alphabetic Presentation Forms is a Unicode block containing standard ligatures for the Latin, Armenian, and Hebrew scripts. Block History The following Unicode-related documents record the purpose and process of defining specific characters in ...
– (Unicode block) *
Arabic Presentation Forms-A Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. This block also allocates 32 noncharacters in Unicode, designed specifically ...
– (Unicode block) *
Arabic Presentation Forms-B Arabic Presentation Forms-B is a Unicode block encoding spacing forms of Arabic diacritics, and contextual letter forms. The special codepoint ZWNBSP is also here, which is only meant for a byte order mark The byte order mark (BOM) is a parti ...
– (Unicode block)


Sources

*The Unicode Standard, Version 5.2
Conformance
(see Section 3.7 for Decomposition). The Unicode Consortium, December 2009. *MSDN
Defining a Character Set
April 8, 2010. *Unicode Normalization Forms (Unicode® Standard Annex #15): http://unicode.org/reports/tr15/


External links



a derivative of the
FreeSerif GNU FreeFont (also known as Free UCS Outline Fonts) is a family of free OpenType, TrueType and WOFF vector fonts, implementing as much of the Universal Character Set (UCS) as possible, aside from the very large CJK Asian character set. The p ...
font with added declarations of precomposed characters. {{Unicode navigation Unicode