Numeric character reference
   HOME

TheInfoList



OR:

A numeric character reference (NCR) is a common markup construct used in
SGML The Standard Generalized Markup Language (SGML; ISO 8879:1986) is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates": * Declarative: Markup should ...
and SGML-derived markup languages such as
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaS ...
and
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
. It consists of a short sequence of
character Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
s that, in turn, represents a single character. Since WebSgml,
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
and HTML 4, the code points of the
Universal Character Set The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), w ...
(UCS) of
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
are used. NCRs are typically used in order to represent characters that are not directly encodable in a particular document (for example, because they are international characters that do not fit in the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.


Examples

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s ß List of numeric character references for the printable
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
characters:


Discussion

Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding. Ideally, when the characters of a document utilizing a markup language are encoded for storage or transmission over a network as a sequence of bits, the encoding that is used will be one that supports representing each and every character in the document, if not in the whole of Unicode, directly as a particular bit sequence. Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on
ISO 8859 ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. ...
can only represent, at most, 256 unique characters as one 8-bit
byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...
each. Documents are rarely, in practice, ever allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism. The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or ''reference'', any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are ''character references''. Character references that are based on the referenced character's UCS or Unicode
code point In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
are called ''numeric'' character references. In HTML 4 and in all versions of
XHTML Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated. While HTML, prior ...
and XML, the code point can be expressed either as a
decimal The decimal numeral system (also called the base-ten positional numeral system and denary or decanary) is the standard system for denoting integer and non-integer numbers. It is the extension to non-integer numbers of the Hindu–Arabic numeral ...
(base 10) number or as a
hexadecimal In mathematics and computing, the hexadecimal (also base-16 or simply hex) numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, he ...
(base 16) number. The syntax is as follows: Character U+0026 (
ampersand The ampersand, also known as the and sign, is the logogram , representing the conjunction "and". It originated as a ligature of the letters ''et''—Latin for "and". Etymology Traditionally in English, when spelling aloud, any letter tha ...
), followed by character U+0023 (
number sign The symbol is known variously in English-speaking regions as the number sign, hash, or pound sign. The symbol has historically been used for a wide range of purposes including the designation of an ordinal number and as a ligatured abbreviati ...
), followed by one of the following choices: * one or more decimal digits zero (U+0030) through nine (U+0039); or * character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066); all followed by character U+003B (
semicolon The semicolon or semi-colon is a symbol commonly used as orthographic punctuation. In the English language, a semicolon is most commonly used to link (in a single sentence) two independent clauses that are closely related in thought. When a ...
). Older versions of HTML disallowed the hexadecimal syntax. The characters that comprise a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable. There is another kind of character reference called a ''
character entity reference Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
'', which allows a character to be referred to by a name instead of a number. (Naming a character creates a ''character
entity An entity is something that exists as itself, as a subject or as an object, actually or potentially, concretely or abstractly, physically or not. It need not be of material existence. In particular, abstractions and legal fictions are usually ...
''.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.


Restrictions

The Universal Character Set defined by ISO 10646 is the "document character set" of SGML, HTML 4, so by default, any character in such a document, and any character ''referenced'' in such a document, must be in the UCS. While the syntax of SGML does not prohibit references to invalid or unassigned code points, such as , SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters. Restrictions may also apply for other reasons. For example, in HTML 4, , which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference. As another example, €, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers – some of which interpret it as a reference to the character represented by code value 128 in the
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. ...
encoding for compatibility reasons. This character, "€", has to be represented as € in a standard-compliant HTML code. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like 𐀀 (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended. Markup languages also place restrictions on where character references can occur.


Compatibility issues

In the initial versions of
SGML The Standard Generalized Markup Language (SGML; ISO 8879:1986) is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates": * Declarative: Markup should ...
and
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaS ...
, numeric character references were interpreted in relationship to the document character encoding, rather than
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
. For Latin-script documents, numeric character references to characters between x80 and x9F in those documents will not be correct against
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
, and must be recoded. HTML standards prior to HTML 4 supported only Western Latin script documents: the treatment of character references above #7F may vary between applications and national conventions. For example, as mentioned above, the correct numeric character reference for the
Euro sign The euro sign () is the currency sign used for the euro, the official currency of the eurozone and unilaterally adopted by Kosovo and Montenegro. The design was presented to the public by the European Commission on 12 December 1996. It consists o ...
"€" U+20AC when using
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
is decimal € and hexadecimal €. However, if using tools supporting obsolete implementations of HTML, the reference € (Euro sign in the
CP-1252 Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. I ...
code page) or ¤ (Euro sign in ISO/IEC 8859-15) may work. As another example, if some text was created originally using the
MacRoman Mac OS Roman is a character encoding created by Apple Computer, Inc. for use by Macintosh computers. It is suitable for representing text in English and several other Western languages. Mac OS Roman encodes 256 characters, the first 128 of whi ...
character set, the left double quotation mark will be represented with code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or CP-1252, where this code point is occupied by the letter Ò. The correct numeric character reference for in HTML 4 and newer is “, because U+201C is its UCS code. In some systems, the named character reference “ may also be available.


See also

*
List of XML and HTML character entity references In SGML, HTML and XML documents, the logical constructs known as ''character data'' and ''attribute values'' consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series ...


References

{{DEFAULTSORT:Numeric Character Reference HTML Unicode XML