Character reference overview
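An HTML or XML numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format:
&#''nnnn'';
or
&#x''hhhh'';
where ''nnnn'' is the code point in decimal form and ''hhhh'' is the code point in hexadecimal form. A character entity reference instead refers to a character by the name of an entity, in the format:
&''name'';
where ''name'' is the case-sensitive name of the entity. The semicolon is required.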
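As a small illustration of the three reference forms above, the sketch below expands them with the unescape function from Python's standard html module. The wrapper name decode_references and the sample strings are chosen here for illustration only; HTML named entities are assumed.

```python
import html


def decode_references(text: str) -> str:
    """Expand decimal, hexadecimal and named character references."""
    return html.unescape(text)


# Three spellings of the same character, U+20AC EURO SIGN.
decimal_form = decode_references("&#8364;")    # &#nnnn;  (decimal)
hex_form = decode_references("&#x20AC;")       # &#xhhhh; (hexadecimal)
named_form = decode_references("&euro;")       # &name;   (HTML entity name)

print(decimal_form == hex_form == named_form)
```

All three calls should yield the same one-character string, since each reference names the same code point.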
Planes
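The plane arithmetic covered in this section can be sketched in a few lines of Python. The function names plane_of and position_in_plane are hypothetical helpers, not part of any standard API; the sketch only assumes that a code point is an integer from 0 to 10FFFF (hexadecimal).

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # The low 16 bits give the position inside the plane; the rest is the plane.
    return code_point >> 16


def position_in_plane(code_point: int) -> int:
    """Return the offset (0x0000-0xFFFF) of code_point within its plane."""
    return code_point & 0xFFFF


for cp in (0x24321, 0x4321, 0x10A200):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, offset {position_in_plane(cp):04X}")
```

Shifting right by 16 bits is the same as dropping the four final hexadecimal digits, which is how the plane is read off a U+ value by eye.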
Unicode and ISO divide the set of code points into 17 planes, each capable of containing 65,536 distinct characters, or 1,114,112 in total. As of 2022 (Unicode 15.0), ISO and the Unicode Consortium have allocated characters and blocks in only seven of the 17 planes. The others remain empty and reserved for future use.
Most characters are currently assigned to the first plane: the ''Basic Multilingual Plane''. This eases the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets. The characters outside the first plane usually have very specialized or rare use.
Each plane corresponds to the value of the one or two hexadecimal digits (0–9, A–F) preceding the four final ones: hence U+24321 is in Plane 2, U+4321 is in Plane 0 (implicitly read U+04321), and U+10A200 would be in Plane 16 (hex 10 = decimal 16). Within one plane, the range of code points is hexadecimal 0000–FFFF, yielding a maximum of 65,536 code points. Planes restrict code points to a subset of that range.
Blocks
Unicode adds a block property to UCS that further divides each plane into separate blocks. Each block is a grouping of characters by their use, such as "mathematical operators" or "Hebrew script characters". When assigning characters to previously unassigned code points, the Consortium typically allocates entire blocks of similar characters: for example, all the characters belonging to the same script, or all similarly purposed symbols, are assigned to a single block. Blocks may also retain unassigned or reserved code points when the Consortium expects a block to require additional assignments. The first 256 code points in the UCS correspond with those of ISO 8859-1.
Categories
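The general category and subcategory discussed in this section can be inspected with Python's standard unicodedata module, which reports them as a two-letter code (first letter for the general category, second for the subcategory). The helper name general_category and the sample characters are illustrative choices, and the values reflect whichever Unicode version the running interpreter bundles.

```python
import unicodedata


def general_category(char: str) -> str:
    """Return the two-letter General Category code of a single character."""
    return unicodedata.category(char)


# A letter, a digit, a math symbol, punctuation, a space and a combining mark.
for sample in ("A", "1", "+", "!", " ", "\u0301"):
    print(f"U+{ord(sample):04X} -> {general_category(sample)}")
```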
Unicode assigns to every UCS character a ''general category'' and subcategory. The general categories are: letter, mark, number, punctuation, symbol, or control (in other words a formatting or non-graphical character). Types include:
* Modern, Historic, and Ancient Scripts. As of 2022 (Unicode 15.0), the UCS identifies 161 scripts that are, or have been, used throughout the world. Many more are in various approval stages for future inclusion in the UCS.
* International Phonetic Alphabet. The UCS devotes several blocks (over 300 characters) to characters for the International Phonetic Alphabet.
Special-purpose characters
Unicode codifies over a hundred thousand characters. Most of those represent graphemes for processing as linear text. Some, however, either do not represent graphemes, or, as graphemes, require exceptional treatment. Unlike the ASCII control characters and other characters included for legacy round-trip capabilities, these other special-purpose characters endow plain text with important semantics.
Some special characters can alter the layout of text, such as the zero-width joiner and zero-width non-joiner, while others do not affect text layout at all, but instead affect the way text strings are collated, matched or otherwise processed. Other special-purpose characters, such as the mathematical invisibles, generally have no effect on text rendering, though sophisticated text layout software may choose to subtly adjust spacing around them.
Unicode does not specify the division of labor between font and text layout software (or "engine") when rendering Unicode text. Because the more complex font formats, such as OpenType or Apple Advanced Typography, provide for contextual substitution and positioning of glyphs, a simple text layout engine might rely entirely on the font for all decisions of glyph choice and placement. In the same situation a more complex engine may combine information from the font with its own rules to achieve its own idea of best rendering. To implement all recommendations of the Unicode specification, a text engine must be prepared to work with fonts of any level of sophistication, since contextual substitution and positioning rules do not exist in some font formats and are optional in the rest. The fraction slash is an example: complex fonts may or may not supply positioning rules in the presence of the fraction slash character to create a fraction, while fonts in simple formats cannot.
Byte order mark
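The byte order mark checks covered in this section can be sketched with the BOM constants from Python's standard codecs module. The function name sniff_bom and the returned labels are hypothetical, and the sketch makes one loud assumption: when the stream starts with FF FE 00 00 it reports UTF-32LE, even though the same bytes could be UTF-16LE text beginning with a null character.

```python
import codecs
from typing import Optional

# Longer signatures are tested first, because FF FE 00 00 (UTF-32LE)
# begins with FF FE (UTF-16LE).
_SIGNATURES = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading byte order mark, or return None."""
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            return encoding
    return None  # No BOM: the encoding must be signaled some other way.


print(sniff_bom(b"\xfe\xff\x00A"))
print(sniff_bom(b"\xef\xbb\xbfhello"))
```

A None result is not an error: the specification does not require a BOM, so a caller would fall back to whatever other signaling is in use.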
When appearing at the head of a text file or stream, the byte order mark (BOM) U+FEFF hints at the encoding form and its byte order.
If the stream's first byte is 0xFE and the second 0xFF, then the stream's text is not likely to be encoded in UTF-8, since those bytes are invalid in UTF-8. It is also not likely to be UTF-16 in little-endian byte order, because 0xFE, 0xFF read as a 16-bit little-endian word would be U+FFFE, which is meaningless. The sequence also has no meaning in any arrangement of UTF-32 encoding, so, in summary, it serves as a fairly reliable indication that the text stream is encoded as UTF-16 in big-endian byte order. Conversely, if the first two bytes are 0xFF, 0xFE, then the text stream may be assumed to be encoded as UTF-16LE because, read as a 16-bit little-endian value, the bytes yield the expected 0xFEFF byte order mark. This assumption becomes questionable, however, if the next two bytes are both 0x00; either the text begins with a null character (U+0000), or the correct encoding is actually UTF-32LE, in which the full 4-byte sequence FF FE 00 00 is one character, the BOM.
The UTF-8 sequence corresponding to U+FEFF is 0xEF, 0xBB, 0xBF. This sequence has no meaning in other Unicode encoding forms, so it may serve to indicate that the stream is encoded as UTF-8.
The Unicode specification does not require the use of byte order marks in text streams. It further states that they should not be used in situations where some other method of signaling the encoding form is already in use.
Mathematical invisibles
Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted, such as in a two-dimensional index like ij. Invisible Times (U+2062) and Function Application (U+2061) are useful in mathematical text where the multiplication of terms or the application of a function is implied without any glyph indicating the operation. Unicode 5.1 also introduced the Invisible Plus character (U+2064), which may indicate that an integral number followed by a fraction should denote their sum, not their product.
Fraction slash
The standard form of a fraction built using the fraction slash is defined as follows: any sequence of one or more decimal digits (General Category = Nd), followed by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction should be displayed as a unit, such as ¾. If the displaying software is incapable of mapping the fraction to a unit, then it can also be displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + ZERO WIDTH SPACE + 3 + FRACTION SLASH + 4 is displayed as 1¾.
By following this Unicode recommendation, text processing systems yield sophisticated symbols from plain text alone. Here the presence of the fraction slash character instructs the layout engine to synthesize a fraction from all consecutive digits preceding and following the slash. In practice, results vary because of the complicated interplay between fonts and layout engines. Simple text layout engines tend not to synthesize fractions at all, and instead draw the glyphs as a linear sequence as described in the Unicode fallback scheme.
More sophisticated layout engines face two practical choices: they can follow Unicode's recommendation, or they can rely on the font's own instructions for synthesizing fractions. By ignoring the font's instructions, the layout engine can guarantee Unicode's recommended behavior. By following the font's instructions, the layout engine can achieve better typography because placement and shaping of the digits will be tuned to that particular font at that particular size. The problem with following the font's instructions is that the simpler font formats have no way to specify fraction synthesis behavior. Meanwhile, the more complex formats do not require the font to specify fraction synthesis behavior and therefore many do not.
Most fonts of complex formats can instruct the layout engine to replace a plain text sequence such as "1⁄2" with the precomposed "½" glyph. But because many of them will not issue instructions to synthesize fractions, a plain text string such as "221⁄225" may well render as 22½25 (with the ½ being the substituted precomposed fraction, rather than synthesized). In the face of problems like this, those who wish to rely on the recommended Unicode behavior should choose fonts known to synthesize fractions or text layout software known to produce Unicode's recommended behavior regardless of font.
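The standard form of a fraction described above (decimal digits, the fraction slash, decimal digits) can be located in plain text with a regular expression. This is only a sketch of the matching step, not of what any layout engine actually does; the names STANDARD_FRACTION and find_fractions are hypothetical. It relies on the fact that, for Python str patterns, \d matches any character whose General Category is Nd.

```python
import re

FRACTION_SLASH = "\u2044"

# One or more Nd digits, the fraction slash, one or more Nd digits.
STANDARD_FRACTION = re.compile(rf"\d+{FRACTION_SLASH}\d+")


def find_fractions(text: str) -> list:
    """Return every standard-form fraction-slash sequence found in text."""
    return STANDARD_FRACTION.findall(text)


# "1", ZERO WIDTH SPACE, "3", FRACTION SLASH, "4": the zero width space
# keeps the leading 1 out of the numerator.
mixed_number = "1\u200b3\u20444"
print(find_fractions(mixed_number))
```

A solidus (U+002F) is deliberately not matched, since "3/4" is only the linear fallback form.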
Bidirectional neutral formatting
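The directional type and the bidi-mirrored property covered in this section can be looked up with Python's standard unicodedata module. The wrapper names directional_type and is_mirrored are illustrative; the short codes returned (such as L, R, AL and ON for "other neutral") are the Bidi_Class abbreviations as reported by the interpreter's bundled Unicode data.

```python
import unicodedata


def directional_type(char: str) -> str:
    """Return the bidirectional class abbreviation of a single character."""
    return unicodedata.bidirectional(char)


def is_mirrored(char: str) -> bool:
    """Return True if the character has the bidi-mirrored property."""
    return bool(unicodedata.mirrored(char))


samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "left parenthesis": "(",
    "left-to-right mark": "\u200e",
    "right-to-left mark": "\u200f",
}
for label, char in samples.items():
    print(f"{label}: {directional_type(char)}, mirrored={is_mirrored(char)}")
```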
Writing direction is the direction glyphs are placed on the page in relation to forward progression of characters in the Unicode string. English and other languages of Latin script have left-to-right writing direction. Several major writing scripts, such as Arabic and Hebrew, have right-to-left writing direction. The Unicode specification assigns a ''directional type'' to each character to inform text processors how sequences of characters should be ordered on the page.
While lexical characters (that is, letters) are normally specific to a single writing script, some symbols and punctuation marks are used across many writing scripts. Unicode could have created duplicate symbols in the repertoire that differ only by directional type, but chose instead to unify them and assign them a neutral directional type. They acquire direction at render time from adjacent characters. Some of these characters also have a ''bidi-mirrored'' property indicating the glyph should be rendered in mirror-image when used in right-to-left text.
The render-time directional type of a neutral character can remain ambiguous when the mark is placed on the boundary between directional changes. To address this, Unicode includes characters that have strong directionality, have no glyph associated with them, and are ignorable by systems that do not process bidirectional text:
* Arabic letter mark (U+061C)
* Left-to-right mark (U+200E)
* Right-to-left mark (U+200F)
Surrounding a bidirectionally neutral character with left-to-right marks forces the character to behave as a left-to-right character, while surrounding it with right-to-left marks forces it to behave as a right-to-left character. The behavior of these characters is detailed in Unicode's Bidirectional Algorithm.
Bidirectional general formatting
While Unicode is designed to handle multiple languages, multiple writing systems and even text that flows either left-to-right or right-to-left with minimal author intervention, there are special circumstances where the mix of bidirectional text can become intricate, requiring more author control. For these circumstances, Unicode includes nine other characters to control the complex embedding of left-to-right text within right-to-left text and vice versa:
* Left-to-right embedding (U+202A)
* Right-to-left embedding (U+202B)
* Pop directional formatting (U+202C)
* Left-to-right override (U+202D)
* Right-to-left override (U+202E)
* Left-to-right isolate (U+2066)
* Right-to-left isolate (U+2067)
* First strong isolate (U+2068)
* Pop directional isolate (U+2069)
Interlinear annotation characters
* Interlinear Annotation Anchor (U+FFF9)
* Interlinear Annotation Separator (U+FFFA)
* Interlinear Annotation Terminator (U+FFFB)
Script-specific
* Prefixed format control
** Arabic Number Sign (U+0600)
** Arabic Sign Sanah (U+0601)
** Arabic Footnote Marker (U+0602)
** Arabic Sign Safha (U+0603)
** Arabic Sign Samvat (U+0604)
** Arabic Number Mark Above (U+0605)
** Arabic End of Ayah (U+06DD)
** Syriac Abbreviation Mark (U+070F)
** Arabic Pound Mark Above (U+0890)
** Arabic Piastre Mark Above (U+0891)
** Kaithi Number Sign (U+110BD)
** Kaithi Number Sign Above (U+110CD)
* Egyptian Hieroglyphs
** Egyptian Hieroglyph Vertical Joiner (U+13430)
** Egyptian Hieroglyph Horizontal Joiner (U+13431)
** Egyptian Hieroglyph Insert At Top Start (U+13432)
** Egyptian Hieroglyph Insert At Bottom Start (U+13433)
** Egyptian Hieroglyph Insert At Top End (U+13434)
** Egyptian Hieroglyph Insert At Bottom End (U+13435)
** Egyptian Hieroglyph Overlay Middle (U+13436)
** Egyptian Hieroglyph Begin Segment (U+13437)
** Egyptian Hieroglyph End Segment (U+13438)
** Egyptian Hieroglyph Insert At Middle (U+13439)
** Egyptian Hieroglyph Insert At Top (U+1343A)
** Egyptian Hieroglyph Insert At Bottom (U+1343B)
** Egyptian Hieroglyph Begin Enclosure (U+1343C)
** Egyptian Hieroglyph End Enclosure (U+1343D)
** Egyptian Hieroglyph Begin Walled Enclosure (U+1343E)
** Egyptian Hieroglyph End Walled Enclosure (U+1343F)
* Brahmi
** Brahmi Number Joiner (U+1107F)
* Brahmi-derived script dead-character formation (Virama and similar diacritics)
** Devanagari Sign Virama (U+094D)
** Bengali Sign Virama (U+09CD)
** Gurmukhi Sign Virama (U+0A4D)
** Gujarati Sign Virama (U+0ACD)
** Oriya Sign Virama (U+0B4D)
** Tamil Sign Virama (U+0BCD)
** Telugu Sign Virama (U+0C4D)
** Kannada Sign Virama (U+0CCD)
** Malayalam Sign Vertical Bar Virama (U+0D3B)
** Malayalam Sign Circular Virama (U+0D3C)
** Malayalam Sign Virama (U+0D4D)
** Sinhala Sign Al-Lakuna (U+0DCA)
** Thai Character Phinthu (U+0E3A)
** Thai Character Yamakkan (U+0E4E)
** Lao Sign Pali Virama (U+0EBA)
** Myanmar Sign Virama (U+1039)
** Tagalog Sign Virama (U+1714)
** Tagalog Sign Pamudpod (U+1715)
** Hanunoo Sign Pamudpod (U+1734)
** Khmer Sign Viriam (U+17D1)
** Khmer Sign Coeng (U+17D2)
** Tai Tham Sign Sakot (U+1A60)
** Tai Tham Sign Ra Haam (U+1A7A)
** Balinese Adeg Adeg (U+1B44)
** Sundanese Sign Pamaaeh (U+1BAA)
** Sundanese Sign Virama (U+1BAB)
** Batak Pangolat (U+1BF2)
** Batak Panongonan (U+1BF3)
** Syloti Nagri Sign Hasanta (U+A806)
** Syloti Nagri Sign Alternate Hasanta (U+A82C)
** Saurashtra Sign Virama (U+A8C4)
** Rejang Virama (U+A953)
** Javanese Pangkon (U+A9C0)
** Meetei Mayek Virama (U+AAF6)
** Kharoshthi Virama (U+10A3F)
** Brahmi Virama (U+11046)
** Brahmi Sign Old Tamil Virama (U+11070)
** Kaithi Sign Virama (U+110B9)
** Chakma Virama (U+11133)
** Sharada Sign Virama (U+111C0)
** Khojki Sign Virama (U+11235)
** Khudawadi Sign Virama (U+112EA)
** Grantha Sign Virama (U+1134D)
** Newa Sign Virama (U+11442)
** Tirhuta Sign Virama (U+114C2)
** Siddham Sign Virama (U+115BF)
** Modi Sign Virama (U+1163F)
** Takri Sign Virama (U+116B6)
** Ahom Sign Killer (U+1172B)
** Dogra Sign Virama (U+11839)
** Dives Akuru Sign Halanta (U+1193D)
** Dives Akuru Virama (U+1193E)
** Nandinagari Sign Virama (U+119E0)
** Zanabazar Square Sign Virama (U+11A34)
** Zanabazar Square Subjoiner (U+11A47)
** Soyombo Subjoiner (U+11A99)
** Bhaiksuki Sign Virama (U+11C3F)
** Masaram Gondi Sign Halanta (U+11D44)
** Masaram Gondi Virama (U+11D45)
** Gunjala Gondi Virama (U+11D97)
** Kawi Sign Killer (U+11F41)
** Kawi Conjoiner (U+11F42)
* Historical Viramas with other functions
** Tibetan Mark Halanta (U+0F84)
** Myanmar Sign Asat (U+103A)
** Limbu Sign Sa-I (U+193B)
** Meetei Mayek Apun Iyek (U+ABED)
** Chakma Maayyaa (U+11134)
* Mongolian Variation Selectors
** Mongolian Free Variation Selector One (U+180B)
** Mongolian Free Variation Selector Two (U+180C)
** Mongolian Free Variation Selector Three (U+180D)
** Mongolian Vowel Separator (U+180E)
* Generic Variation Selectors
** Variation Selector-1 through -16 (U+FE00–U+FE0F)
** Variation Selector-17 through -256 (U+E0100–U+E01EF)
* Tag characters (U+E0001 and U+E0020–U+E007F)
* Tifinagh
** Tifinagh Consonant Joiner (U+2D7F)
* Ogham
** Ogham Space Mark (U+1680)
* Ideographic
** Ideographic variation indicator (U+303E)
** Ideographic Description (U+2FF0–U+2FFB)
* Musical Format Control
** Musical Symbol Begin Beam (U+1D173)
** Musical Symbol End Beam (U+1D174)
** Musical Symbol Begin Tie (U+1D175)
** Musical Symbol End Tie (U+1D176)
** Musical Symbol Begin Slur (U+1D177)
** Musical Symbol End Slur (U+1D178)
** Musical Symbol Begin Phrase (U+1D179)
** Musical Symbol End Phrase (U+1D17A)
* Shorthand Format Control
** Shorthand Format Letter Overlap (U+1BCA0)
** Shorthand Format Continuing Overlap (U+1BCA1)
** Shorthand Format Down Step (U+1BCA2)
** Shorthand Format Up Step (U+1BCA3)
* Deprecated Alternate Formatting
** Inhibit Symmetric Swapping (U+206A)
** Activate Symmetric Swapping (U+206B)
** Inhibit Arabic Form Shaping (U+206C)
** Activate Arabic Form Shaping (U+206D)
** National Digit Shapes (U+206E)
** Nominal Digit Shapes (U+206F)
Others
* Object Replacement Character (U+FFFC)
* Replacement Character (U+FFFD)
Characters vs code points
The term "character" is not well defined, and what we are referring to most of the time is the grapheme. A grapheme is represented visually by its glyph.
Whitespace, joiners, and separators
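The count of 25 whitespace characters given in this section can be reproduced with a small table. The code points below are an assumption transcribed here from the White_Space property as of Unicode 15.0 and should be checked against the Unicode Character Database before being relied on; the names WHITE_SPACE and is_unicode_whitespace are hypothetical.

```python
import unicodedata

# Assumed membership of the White_Space property (Unicode 15.0).
WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0085]            # legacy control characters
    + [0x0020, 0x00A0, 0x1680]                  # space, no-break space, Ogham
    + [*range(0x2000, 0x200B)]                  # En Quad through Hair Space
    + [0x2028, 0x2029]                          # line and paragraph separators
    + [0x202F, 0x205F, 0x3000]                  # narrow no-break, math, ideographic
)


def is_unicode_whitespace(char: str) -> bool:
    """Return True if char is in the assumed White_Space table above."""
    return ord(char) in WHITE_SPACE


# Members whose General Category is a Separator (Zs, Zl or Zp).
separators = {
    cp for cp in WHITE_SPACE if unicodedata.category(chr(cp)).startswith("Z")
}
print(len(WHITE_SPACE), len(separators))
```

The six members that are not Separators are exactly the legacy controls U+0009 through U+000D and U+0085. Note that the Zero Width Space (U+200B) is not in the table.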
Unicode provides a list of characters it deems whitespace characters for interoperability support. Software implementations and other standards may use the term to denote a slightly different set of characters. For example, Java does not consider U+00A0 (No-Break Space) or U+0085 (Next Line) to be whitespace, even though Unicode does. Whitespace characters are characters typically designated for programming environments. Often they have no syntactic meaning in such programming environments and are ignored by the machine interpreters. Unicode designates the legacy control characters U+0009 through U+000D and U+0085 as whitespace characters, as well as all characters whose General Category property value is Separator. There are 25 whitespace characters in total as of Unicode 15.0.
Grapheme joiners and non-joiners
The zero-width joiner (U+200D) and zero-width non-joiner (U+200C) control the joining and ligation of glyphs. The joiner does not cause characters that would not otherwise join or ligate to do so, but when paired with the non-joiner these characters can be used to control the joining and ligating properties of the surrounding two joining or ligating characters. The Combining Grapheme Joiner (U+034F) is used to distinguish two base characters as one common base or digraph, mostly for underlying text processing, collation of strings, case folding and so on.
Word joiners and separators
The most common word separator is a space (U+0020). However, there are other word joiners and separators that also indicate a break between words and participate in line-breaking algorithms. The No-Break Space (U+00A0) also produces a baseline advance without a glyph, but inhibits rather than enables a line break. The Zero Width Space (U+200B) allows a line break but provides no space: in a sense joining, rather than separating, two words. Finally, the Word Joiner (U+2060) inhibits line breaks and also involves none of the white space produced by a baseline advance.
Other separators
* Line Separator (U+2028)
* Paragraph Separator (U+2029)
These provide Unicode with native paragraph and line separators independent of the legacy encoded control characters such as carriage return (U+000D), line feed (U+000A), and Next Line (U+0085). Unicode does not provide for other ASCII formatting control characters, which presumably then are not part of the Unicode plain text processing model. These legacy formatting control characters include Tab (U+0009), Line Tabulation or Vertical Tab (U+000B), and Form Feed (U+000C), which is also thought of as a page break.
Spaces
The space character (U+0020), typically input by the space bar on a keyboard, serves semantically as a word separator in many languages. For legacy reasons, the UCS also includes spaces of varying sizes that are compatibility equivalents for the space character. While these spaces of varying width are important in typography, the Unicode processing model calls for such visual effects to be handled by rich text, markup and other such protocols. They are included in the Unicode repertoire primarily to handle lossless roundtrip transcoding from other character set encodings. These spaces include:
# En Quad (U+2000)
# Em Quad (U+2001)
# En Space (U+2002)
# Em Space (U+2003)
# Three-Per-Em Space (U+2004)
# Four-Per-Em Space (U+2005)
# Six-Per-Em Space (U+2006)
# Figure Space (U+2007)
# Punctuation Space (U+2008)
# Thin Space (U+2009)
# Hair Space (U+200A)
# Medium Mathematical Space (U+205F)
Aside from the original ASCII space, the other spaces are all compatibility characters. In this context this means that they effectively add no semantic content to the text, but instead provide styling control. Within Unicode, this non-semantic styling control is often referred to as rich text and is outside the thrust of Unicode's goals. Rather than using different spaces in different contexts, this styling should instead be handled through intelligent text layout software.
Three other writing-system-specific word separators are:
* Mongolian Vowel Separator (U+180E)
* Ideographic Space (U+3000): behaves as an ideographic separator and is generally rendered as white space of the same width as an ideograph.
* Ogham Space Mark (U+1680): this character is sometimes displayed with a glyph and other times as only white space.
Line-break control characters
Several characters are designed to help control line breaks, either by discouraging them (no-break characters) or by suggesting them, such as the soft hyphen (U+00AD) (sometimes called the "shy hyphen"). Such characters, though designed for styling, are probably indispensable for the intricate types of line-breaking they make possible.
;Break inhibiting
# Non-breaking hyphen (U+2011)
# No-break space (U+00A0)
# Tibetan Mark Delimiter Tsheg Bstar (U+0F0C)
# Narrow no-break space (U+202F)
The break inhibiting characters are meant to be equivalent to a character sequence wrapped in the Word Joiner U+2060. However, the Word Joiner may also be placed before or after any character that would otherwise allow a line break, to inhibit that break.
;Break enabling
# Soft hyphen (U+00AD)
# Tibetan Mark Intersyllabic Tsheg (U+0F0B)
# Zero-width space (U+200B)
Both the break inhibiting and break enabling characters participate with other punctuation and whitespace characters to enable text imaging systems to determine line breaks within the Unicode Line Breaking Algorithm.
Types of code point
All code points given some kind of purpose or use are considered designated code points. A designated code point is either assigned to an abstract character or designated for some other purpose.
Assigned characters
The majority of code points in actual use have been assigned to abstract characters. This includes private-use characters, which, though not formally designated by the Unicode standard for a particular purpose, require a sender and recipient to have agreed in advance how they should be interpreted for meaningful information interchange to take place.
Private-use characters
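The figure of 137,468 private-use characters in this section can be checked arithmetically. The three ranges below are not listed in the text; they are stated here as the ranges of the three Private Use Area blocks (one in the BMP and one each in Planes 15 and 16). The names PRIVATE_USE_AREAS and is_private_use are hypothetical.

```python
# (first, last) code point of each Private Use Area, inclusive.
PRIVATE_USE_AREAS = (
    (0xE000, 0xF8FF),       # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A (Plane 15)
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B (Plane 16)
)


def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in one of the three PUA ranges."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_AREAS)


total_private_use = sum(last - first + 1 for first, last in PRIVATE_USE_AREAS)
print(total_private_use)
```

The two supplementary areas stop at FFFD rather than FFFF because the last two code points of every plane are set aside and never assigned.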
The UCS includes 137,468 private-use characters, which are code points for private use spread across three different blocks, each called a ''Private Use Area'' (PUA). The Unicode standard recognizes code points within PUAs as legitimate Unicode character codes, but does not assign them any (abstract) character. Instead, individuals, organizations, software vendors, operating system vendors, font vendors and communities of end-users are free to use them as they see fit. Within closed systems, characters in the PUA can operate unambiguously, allowing such systems to represent characters or glyphs not defined in Unicode. In public systems their use is more problematic, since there is no registry and no way to prevent several organizations from adopting the same code points for different purposes. One example of such a conflict is Apple's use of U+F8FF for the Apple logo, versus the ConScript Unicode Registry's different use of the same code point.
Surrogates
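The surrogate pair formula given in this section can be written out directly, together with its inverse. This is a sketch of the arithmetic only; the function names decode_surrogate_pair and encode_surrogate_pair are hypothetical, and real UTF-16 handling would normally be left to a library codec.

```python
from typing import Tuple


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000           # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = encode_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
```

Each surrogate carries 10 bits of the 20-bit offset, which is where the factor 400 (hexadecimal, 1024 decimal) in the formula comes from.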
The UCS uses surrogates to address characters outside the initial Basic Multilingual Plane without resorting to code units wider than 16 bits. There are 1024 "high" surrogates (D800–DBFF) and 1024 "low" surrogates (DC00–DFFF). By combining a pair of surrogates, the remaining characters in all the other planes can be addressed (1024 × 1024 = 1,048,576 code points in the other 16 planes). In UTF-16, they must always appear in pairs, as a high surrogate followed by a low surrogate, thus using 32 bits to denote one code point.
A surrogate pair denotes the code point
:10000₁₆ + (''H'' − D800₁₆) × 400₁₆ + (''L'' − DC00₁₆)
where ''H'' and ''L'' are the numeric values of the high and low surrogates respectively.
Since high surrogate values in the range DB80–DBFF always produce values in the Private Use planes, the high surrogate range can be further divided into (normal) high surrogates (D800–DB7F) and "high private use surrogates" (DB80–DBFF).
Isolated surrogate code points have no general interpretation; consequently, no character code charts or names lists are provided for this range. In the Python programming language, individual surrogate codes are used to embed undecodable bytes in Unicode strings.
Noncharacters
The unhyphenated term "noncharacter" refers to 66 code points (labeled <not a character>) permanently reserved for internal use, and therefore guaranteed never to be assigned to a character. Each of the 17 planes has its two ending code points set aside as noncharacters. So, the noncharacters are: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 code points in the BMP: U+FDD0..U+FDEF. Software implementations are therefore free to use these code points for internal use. One particularly useful example of a noncharacter is the code point U+FFFE. This code point has the reverse UTF-16/UCS-2 byte sequence of the byte order mark (U+FEFF). If a stream of text contains this noncharacter, this is a good indication the text has been interpreted with the incorrect endianness.
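The count of 66 reserved code points described above (two at the end of each of the 17 planes, plus U+FDD0..U+FDEF) can be verified with a short predicate. The names is_noncharacter and looks_byte_swapped are hypothetical, and the endianness check is only the heuristic the text mentions, not a complete encoding detector.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True for the 66 code points reserved for internal use."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    in_bmp_range = 0xFDD0 <= code_point <= 0xFDEF
    # The low 16 bits are FFFE or FFFF for the last two code points of a plane.
    ends_plane = (code_point & 0xFFFE) == 0xFFFE
    return in_bmp_range or ends_plane


def looks_byte_swapped(text: str) -> bool:
    """Heuristic: U+FFFE in decoded text suggests the wrong endianness."""
    return "\ufffe" in text


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```

For instance, decoding little-endian UTF-16 bytes that start with a BOM as if they were big-endian turns the BOM into U+FFFE, which looks_byte_swapped would flag.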
Versions of the Unicode standard from 3.1.0 to 6.3.0 claimed that noncharacters "should never be interchanged".
Reserved code points
All other code points, being those not designated, are referred to as being reserved. These code points may be assigned for a particular use in future versions of the Unicode standard.
Characters, grapheme clusters and glyphs
Whereas many other character sets assign a character for every possible glyph representation of the character, Unicode seeks to treat characters separately from glyphs. This distinction is not always unambiguous; however, a few examples will help illustrate it. Often two characters may be combined typographically to improve the readability of the text. For example, the three-letter sequence "ffi" may be treated as a single glyph. Other character sets would often assign a code point to this glyph in addition to the individual letters: "f" and "i". In addition, Unicode approaches
Compatibility characters
UCS includes thousands of characters that Unicode designates as compatibility characters. These are characters that were included in UCS in order to provide distinct code points for characters that other character sets differentiate, but would not be differentiated in the Unicode approach to characters. The chief reason for this differentiation was that Unicode makes a distinction between characters and glyphs. For example, when writing English in a
Character properties
Every character in Unicode is defined by a large and growing set of properties. Most of these properties are not part of the Universal Character Set. The properties facilitate text processing, including collation or sorting of text, identifying words, sentences and graphemes, rendering or imaging text and so on. Below is a list of some of the core properties. There are many others documented in the Unicode Character Database. Unicode provides an online database to interactively query the entire Unicode character repertoire by the various properties.
See also
* ConScript Unicode Registry
* Unicode compatibility characters
References
External links