A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that can also be defined as a sequence of one or more other characters. A precomposed character typically represents a letter with a diacritical mark, such as ''é'' (Latin small letter ''e'' with acute accent). Technically, ''é'' (U+00E9) is a character that can be decomposed into an equivalent string of the base letter ''e'' (U+0065) and combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes.

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode, they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.
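The decomposition described above can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only. It splits ''é'' into its base letter and combining mark, and shows that a ligature such as ''ﬁ'' (U+FB01) decomposes only under compatibility normalization (NFKD), not canonical normalization (NFD).

```python
import unicodedata

precomposed = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# NFD (canonical decomposition) yields the base letter followed by the combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The raw decomposition mapping is exposed as a string of hexadecimal code points.
mapping = unicodedata.decomposition(precomposed)
print(mapping)

# Ligatures carry a compatibility mapping, so NFD leaves them intact
# while NFKD replaces them with their constituent letters.
ligature = "\ufb01"  # ﬁ, LATIN SMALL LIGATURE FI
ligature_nfd = unicodedata.normalize("NFD", ligature)
ligature_nfkd = unicodedata.normalize("NFKD", ligature)
print(ligature_nfd == ligature, ligature_nfkd)
```

The distinction matters in practice: canonical forms (NFC, NFD) preserve ligatures, so code that needs to treat ''ﬁ'' as the letters ''f'' and ''i'' would have to opt into a compatibility form.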

Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in two alternative ways: the first uses the precomposed ''Å'' (U+00C5) and ''ö'' (U+00F6), and the second uses the decomposed base letter ''A'' (U+0041) with a combining ring above (U+030A) and an ''o'' (U+006F) with a combining diaeresis (U+0308).

#Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
#Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

The two solutions are equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, as they are not included in all fonts. To overcome these problems, some applications may simply attempt to replace the decomposed characters with the equivalent precomposed characters.

With an incomplete font, however, precomposed characters may also be problematic, especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for "dog"):

#ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
#ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

In some situations, the precomposed k, u and o with diacritics on the first line may render as unrecognized characters, or their typographical appearance may be very different from the final letter n with no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics are not recognized.
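The equivalence of the two spellings, and the replacement of decomposed sequences by precomposed characters, can be sketched with Python's standard `unicodedata` module. This is an illustrative sketch rather than a description of how any particular application behaves; the variable names are hypothetical.

```python
import unicodedata

# The two spellings of the Swedish surname, written with explicit code points.
precomposed = "\u00c5str\u00f6m"        # U+00C5 ... U+00F6 ...
decomposed = "A\u030astro\u0308m"       # U+0041 U+030A ... U+006F U+0308 ...

# A naive comparison sees two different code point sequences.
print(precomposed == decomposed)         # False
print(len(precomposed), len(decomposed)) # 6 8

# NFC composes combining sequences into precomposed characters where they exist;
# NFD does the reverse. After normalizing to a common form the strings match.
composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed)     # True
print(decomposed_again == decomposed)    # True

# The same holds for the Proto-Indo-European example, including the
# letter o carrying two combining marks (macron, then acute).
pie_precomposed = "\u1e31\u1e77\u1e53n"
pie_decomposed = "k\u0301u\u032do\u0304\u0301n"
pie_nfd = unicodedata.normalize("NFD", pie_precomposed)
pie_nfc = unicodedata.normalize("NFC", pie_decomposed)
print(pie_nfd == pie_decomposed, pie_nfc == pie_precomposed)  # True True
```

Software that compares, searches or sorts text would typically normalize both sides to the same form first, since equivalent strings otherwise fail a plain equality test.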

OpenType has the ''ccmp'' "feature tag" to define glyphs that are compositions or decompositions involving combining characters.

Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent radical and phonetic components with Chinese character description languages. Such an approach could reduce the number of characters in the character set from tens of thousands to just a few thousand. On the other hand, a decomposed character set would introduce challenges for searching and editing software and require more bytes of encoding per document.
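The encoding cost mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that a character is described with a Unicode ideographic description sequence: the character 好 (U+597D) written as the left-to-right operator ⿰ (U+2FF0) followed by its components 女 (U+5973) and 子 (U+5B50). Other description languages would give different numbers.

```python
import unicodedata

composed = "\u597d"            # 好 as a single encoded character
ids = "\u2ff0\u5973\u5b50"     # ⿰女子, an illustrative component description

# Compare the storage cost of the two representations in UTF-8.
composed_bytes = len(composed.encode("utf-8"))
ids_bytes = len(ids.encode("utf-8"))
print(composed_bytes, ids_bytes)  # 3 9

# Unicode defines no decomposition mapping for the unified ideograph,
# so normalization does not relate the two forms.
mapping = unicodedata.decomposition(composed)
print(repr(mapping))  # ''
```

In this sketch the described form takes three times as many bytes, and because normalization does not connect it to the single character, search and editing software would need its own logic to treat the two as the same.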

Sources

*The Unicode Standard, Version 5.2, Conformance (see Section 3.7 for Decomposition). The Unicode Consortium, December 2009.
*MSDN, Defining a Character Set. April 8, 2010.
*Unicode Normalization Forms (Unicode® Standard Annex #15): http://unicode.org/reports/tr15/

External links

a derivative of the FreeSerif font with added declarations of precomposed characters.
