A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that can also be defined as a sequence of one or more other characters. A precomposed character may typically represent a letter with a diacritical mark, such as é (Latin small letter e with acute accent). Technically, é (U+00E9) is a character that can be decomposed into an equivalent string of the base letter e (U+0065) and combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes. Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode, they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.
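The decomposition of é described above can be checked with a short script. This is a minimal sketch that assumes Python 3 and its standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# Canonical decomposition mapping recorded in the Unicode Character Database,
# returned as a string of hexadecimal code points.
mapping = unicodedata.decomposition(precomposed)

# NFD normalization applies canonical decomposition to a whole string.
decomposed = unicodedata.normalize("NFD", precomposed)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]

# NFC normalization composes the sequence back into the precomposed form.
recomposed = unicodedata.normalize("NFC", decomposed)

print(mapping)                    # decomposition mapping of U+00E9
print(code_points)                # code points after NFD
print(recomposed == precomposed)  # round trip through NFD and NFC
```

The round trip through NFD and NFC returns the original single code point, which is what makes the two representations interchangeable for comparison purposes.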


Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in two alternative ways: first with a precomposed Å (U+00C5) and ö (U+00F6), and then with a decomposed base letter A (U+0041) followed by a combining ring above (U+030A) and an o (U+006F) followed by a combining diaeresis (U+0308).

# Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
# Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

The two spellings are equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, as they are not included in all fonts. To overcome these problems, some applications may simply attempt to replace the decomposed characters with the equivalent precomposed characters.

With an incomplete font, however, precomposed characters may also be problematic, especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for "dog"):

# ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
# ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

In some situations, the precomposed k, u and o with diacritics on the first line may render as unrecognized characters, or their typographical appearance may be very different from the final letter n with no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics are not recognized.

OpenType has the ccmp "feature tag" to define glyphs that are compositions or decompositions involving combining characters.
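The equivalence of the two spellings in each example can be sketched in code. The snippet below assumes Python 3 and its standard `unicodedata` module; the helper names `nfc` and `equivalent` are hypothetical and exist only for this illustration. It shows that the strings differ code point by code point but compare equal once both are normalized, which is roughly what an application does when it replaces decomposed sequences with precomposed characters.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)


def equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return nfc(a) == nfc(b)


# Code points copied from the two Åström spellings in the example.
astrom_precomposed = "\u00c5str\u00f6m"
astrom_decomposed = "A\u030astro\u0308m"

# Code points copied from the two Proto-Indo-European spellings.
dog_precomposed = "\u1e31\u1e77\u1e53n"
dog_decomposed = "k\u0301u\u032do\u0304\u0301n"

# A naive comparison sees different code point sequences...
print(astrom_precomposed == astrom_decomposed)
print(len(astrom_precomposed), len(astrom_decomposed))

# ...while a normalized comparison treats them as the same text.
print(equivalent(astrom_precomposed, astrom_decomposed))
print(equivalent(dog_precomposed, dog_decomposed))
```

Normalizing to NFC here is a design choice for the sketch; normalizing both sides to NFD would work equally well for comparison, as long as the same form is used on both strings.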


Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent radical and phonetic components with Chinese character description languages. Such an approach could reduce the number of characters in the character set from tens of thousands to just a few thousand. On the other hand, a decomposed character set would introduce challenges for searching and editing software and require more bytes of encoding per document.
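The storage cost mentioned above can be illustrated with a small sketch. It assumes Python 3 and uses Unicode's ideographic description characters (such as U+2FF0, the left-to-right layout operator) as one example of a description language. The table `IDS_TABLE` and the function `decompose` are hypothetical and hand-written for this illustration; they are not taken from any standard database or library.

```python
# Hypothetical, hand-written table mapping a character to a description
# sequence: a layout operator followed by the component characters.
IDS_TABLE = {
    "好": "⿰女子",  # U+597D -> U+2FF0 U+5973 U+5B50
    "明": "⿰日月",  # U+660E -> U+2FF0 U+65E5 U+6708
}


def decompose(text: str) -> str:
    """Replace each character found in IDS_TABLE with its description."""
    return "".join(IDS_TABLE.get(ch, ch) for ch in text)


single = "好"
described = decompose(single)

# One encoded character versus an operator plus two components.
print(len(single), len(described))

# Byte counts in UTF-8 for the same written character.
print(len(single.encode("utf-8")), len(described.encode("utf-8")))
```

In this sketch each decomposed character occupies three code points instead of one, so a document stored this way would grow accordingly, and a plain substring search for a component would also match inside unrelated composed characters.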


See also

* List of precomposed Latin characters in Unicode
* Dead key
* Compose key
* Combining character
* Unicode equivalence
* Complex text layout
* Unicode compatibility characters
* Alphabetic Presentation Forms (Unicode block)
* Arabic Presentation Forms-A (Unicode block)
* Arabic Presentation Forms-B (Unicode block)


Sources

* The Unicode Standard, Version 5.2, Conformance (see Section 3.7 for Decomposition). The Unicode Consortium, December 2009.
* MSDN, Defining a Character Set. April 8, 2010.
* Unicode Normalization Forms (Unicode® Standard Annex #15): http://unicode.org/reports/tr15/


External links



a derivative of the FreeSerif font with added declarations of precomposed characters.