A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that can also be defined as a sequence of one or more other characters. A precomposed character typically represents a letter with a diacritical mark, such as ''é'' (Latin small letter ''e'' with acute accent). Technically, ''é'' (U+00E9) is a character that can be decomposed into an equivalent string of the base letter ''e'' (U+0065) and combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes.
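The decomposition described above can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # base letter e (U+0065) + combining acute accent (U+0301)

# Compared code point by code point, the two strings are different...
assert precomposed != decomposed

# ...but canonical decomposition (NFD) and composition (NFC) convert between them.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0065', 'U+0301']
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00E9']
```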
Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode, they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.
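As one illustration of the legacy role of precomposed characters, ISO 8859-1 (Latin-1) assigns ''é'' a single byte but has no combining acute accent, so only the precomposed form can be stored in it. The sketch below uses Python's built-in `latin-1` codec; the variable names are illustrative.

```python
precomposed = "\u00e9"   # é as one code point
decomposed = "e\u0301"   # e followed by combining acute accent

# The precomposed letter fits into a single Latin-1 byte.
legacy_bytes = precomposed.encode("latin-1")
print(legacy_bytes)  # b'\xe9'

# The combining accent (U+0301) has no Latin-1 code, so encoding fails.
try:
    decomposed.encode("latin-1")
    decomposed_fits = True
except UnicodeEncodeError:
    decomposed_fits = False

print(decomposed_fits)  # False
```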
Comparing precomposed and decomposed characters
In the following example, the common Swedish surname Åström is written in two alternative ways: the first with a precomposed ''Å'' (U+00C5) and ''ö'' (U+00F6), and the second with a decomposed base letter ''A'' (U+0041) followed by a combining ring above (U+030A) and an ''o'' (U+006F) followed by a combining diaeresis (U+0308).
#
Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
#
Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)
The two forms are canonically equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, as they are not included in all fonts. To work around these problems, some applications simply attempt to replace the decomposed characters with the equivalent precomposed characters.
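The equivalence of the two spellings, and the replacement of decomposed sequences by precomposed characters, can be sketched with Unicode normalization. The helper names `code_points`, `canonically_equal` and `to_precomposed` are hypothetical; the sketch assumes NFC normalization is an acceptable stand-in for whatever replacement strategy a given application actually uses.

```python
import unicodedata

def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def to_precomposed(text):
    """Replace decomposed sequences with precomposed characters where they exist (NFC)."""
    return unicodedata.normalize("NFC", text)

precomposed = "\u00c5str\u00f6m"     # Å and ö as single code points
decomposed = "A\u030astro\u0308m"    # A + ring above, o + diaeresis

print(precomposed == decomposed)                  # False: raw code points differ
print(canonically_equal(precomposed, decomposed)) # True
print(code_points(to_precomposed(decomposed)))
# U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D
```

A plain `==` comparison treats the two spellings as different strings, so software that searches or sorts text generally needs to normalize both sides first.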
With an incomplete font, however, precomposed characters may also be problematic – especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for "dog"):
#
ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
#
ḱṷṓn (U+006B
U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)
In some situations, the precomposed ''k'', ''u'' and ''o'' with diacritics on the first line may render as unrecognized characters, or their typographical appearance may be very different from the final letter ''n'' with no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics are not recognized.
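The relationship between the two lines of this example can be inspected with Python's `unicodedata` module. The helpers `canonical_mapping` and `strip_marks` are hypothetical names introduced for this sketch. `strip_marks` only approximates what a reader sees when a renderer ignores combining marks and is not a model of any particular font or text engine.

```python
import unicodedata

precomposed = "\u1e31\u1e77\u1e53n"   # ḱṷṓn written with precomposed letters

def canonical_mapping(ch):
    """Return the one-step canonical decomposition of ch as U+XXXX strings."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; ignore those.
    if not mapping or mapping.startswith("<"):
        return []
    return [f"U+{cp}" for cp in mapping.split()]

# One-step mappings; U+1E53 first maps to U+014D (o with macron) plus U+0301.
steps = {f"U+{ord(ch):04X}": canonical_mapping(ch) for ch in precomposed}

# Full canonical decomposition applies the mappings recursively.
fully_decomposed = unicodedata.normalize("NFD", precomposed)

def strip_marks(text):
    """Drop combining marks, keeping only the base letters."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))

print(steps)
print(" ".join(f"U+{ord(c):04X}" for c in fully_decomposed))
print(strip_marks(precomposed))  # kuon
```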
OpenType has the ''ccmp'' "feature tag" to define glyphs that are compositions or decompositions involving combining characters.
Chinese characters
In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent radical and phonetic components with Chinese character description languages. Such an approach could reduce the number of characters in the character set from tens of thousands to just a few thousand. On the other hand, a decomposed character set would introduce challenges for searching and editing software and require more bytes of encoding per document. One particular challenge would be the many-to-many mapping between sets of decomposed components and precomposed characters: one precomposed character may be decomposed into several different sets of components, while one set of components could be composed into several different precomposed characters. There are also no strict requirements or constraints on the relative position of components within a character, on the variant forms and transformations (narrowing, widening, stretching, rotation, etc.) applied to components, or on the number of times each component occurs.
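The ambiguity and the size overhead can be illustrated with Unicode's Ideographic Description Characters (U+2FF0 onward), which are one such description language. The four-entry table below is a hand-written, hypothetical sample and not a real decomposition database; `IDS`, `OPERATORS` and `components` are names made up for this sketch.

```python
from collections import defaultdict

# Hypothetical sample table: character -> ideographic description sequence.
IDS = {
    "好": "⿰女子",  # left-to-right: 女 + 子
    "明": "⿰日月",  # left-to-right: 日 + 月
    "杏": "⿱木口",  # top-to-bottom: 木 over 口
    "呆": "⿱口木",  # top-to-bottom: 口 over 木
}

# Code points of the Ideographic Description Characters block (U+2FF0..U+2FFF).
OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x3000)}

def components(ids):
    """Return the unordered set of components, ignoring layout operators."""
    return frozenset(c for c in ids if c not in OPERATORS)

# Group characters by their component set to expose the ambiguity.
by_components = defaultdict(list)
for char, ids in IDS.items():
    by_components[components(ids)].append(char)

ambiguous = sorted(by_components[frozenset("木口")])
print(ambiguous)  # ['呆', '杏']: same components, different characters

# A description sequence also takes more bytes than the precomposed character.
print(len("好".encode("utf-8")), len(IDS["好"].encode("utf-8")))  # 3 9
```

Without the layout operator, the component set {木, 口} alone cannot distinguish 杏 from 呆, which is why a decomposed encoding would have to carry positional information as well.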
See also
*
List of precomposed Latin characters in Unicode
*
Dead key
*
Compose key
*
Combining character
*
Unicode equivalence
*
Complex text layout
*
Unicode compatibility characters
*
Alphabetic Presentation Forms – (Unicode block)
*
Arabic Presentation Forms-A – (Unicode block)
*
Arabic Presentation Forms-B – (Unicode block)
Sources
*The Unicode Standard, Version 5.2, Conformance (see Section 3.7 for Decomposition). The Unicode Consortium, December 2009.
*MSDN, Defining a Character Set. April 8, 2010.
*Unicode Normalization Forms (Unicode® Standard Annex #15): http://unicode.org/reports/tr15/
External links
a derivative of the
FreeSerif font with added declarations of precomposed characters.