A code point, codepoint or code position is a particular position in a table, where the position has been assigned a meaning. The table may be one-dimensional (a column), two-dimensional (like cells in a spreadsheet), three-dimensional (sheets in a workbook), and so on, in any number of dimensions.
Technically, a code point is a unique position in a quantized n-dimensional space, where the position has been assigned a semantic meaning. The table has discrete (whole) and positive positions (1, 2, 3, 4, but not fractions).
Code points are used in a multitude of formal information processing and telecommunication standards.
[ETSI TS 101 773 (section 4), https://www.etsi.org/deliver/etsi_ts/101700_101799/101773/01.02.01_60/ts_101773v010201p.pdf] For example, ITU-T Recommendation T.35 contains a set of country codes for telecommunications equipment (originally fax machines) that allow equipment to indicate its country of manufacture or operation. In T.35, Argentina is represented by the code point 0x07, Canada by 0x20, Gambia by 0x41, and so on.
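As a minimal sketch, the T.35 example can be modelled as a lookup table keyed by code point. Only the three entries quoted above are included, and the names `T35_COUNTRY_CODES` and `country_for` are hypothetical, chosen for illustration rather than taken from any standard or library.

```python
from typing import Optional

# Partial table: only the three T.35 code points mentioned in the text.
T35_COUNTRY_CODES = {
    0x07: "Argentina",
    0x20: "Canada",
    0x41: "Gambia",
}


def country_for(code_point: int) -> Optional[str]:
    """Return the meaning assigned to a code point, or None if this
    partial table has no entry for it."""
    return T35_COUNTRY_CODES.get(code_point)


print(country_for(0x20))  # Canada
```

The position (0x20) carries no meaning on its own. The table is what assigns "Canada" to it.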
In character encoding
Code points are commonly used in character encoding, where a code point is a numerical value that maps to a specific character. In character encoding, code points usually represent a single grapheme (usually a letter, digit, punctuation mark, or whitespace) but sometimes represent symbols, control characters, or formatting. The set of all possible code points within a given encoding or character set makes up that encoding's ''codespace''.
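The mapping between characters and numerical values can be observed directly in Python, whose built-in `ord()` and `chr()` convert between a one-character string and its Unicode code point. The following is an illustrative sketch, and the variable names are arbitrary.

```python
# ord() maps a one-character string to its code point; chr() is the inverse.
letter = "A"
code_point = ord(letter)
print(hex(code_point))  # 0x41
print(chr(0x41))        # A

# Control characters occupy code points too.
print(ord("\n"))        # 10 (line feed)

# What a reader perceives as one grapheme may take more than one code point:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
print(len(decomposed))                     # 2
print([hex(ord(c)) for c in decomposed])   # ['0x65', '0x301']
```

The last case is why code points only "usually" correspond to a single grapheme.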
For example, the character encoding scheme ASCII comprises 128 code points in the range 0x00 to 0x7F, Extended ASCII comprises 256 code points in the range 0x00 to 0xFF, and Unicode comprises 1,114,112 code points in the range 0x0 to 0x10FFFF. The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
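The plane arithmetic can be checked with a short sketch. Each of the 17 planes holds 2^16 code points, so the plane number of a code point is simply its value shifted right by 16 bits. The helper `plane_of` is a hypothetical name used here for illustration.

```python
import sys

PLANE_COUNT = 17
PLANE_SIZE = 2 ** 16  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 to 16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


print(PLANE_COUNT * PLANE_SIZE)     # 1114112
print(sys.maxunicode == 0x10FFFF)   # True
print(plane_of(ord("A")))           # 0 (Basic Multilingual Plane)
print(plane_of(0x1F600))            # 1 (a supplementary plane)
```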
In Unicode
In Unicode, the particular sequence of bits used by an encoding is called a ''code unit''. In the UCS-4 encoding, every code point is encoded as a single 4-byte (octet) binary number, while in the UTF-8 encoding, different code points are encoded as sequences from one to four bytes long, forming a self-synchronizing code. See comparison of Unicode encodings for details.
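The difference between fixed-width and variable-width encodings can be sketched with Python's built-in codecs. Here `utf-32-be` stands in for UCS-4 (an assumption made for illustration: the two use the same 4-byte layout for every code point). The helper `code_point_start` is hypothetical and shows one consequence of self-synchronization: from any byte offset, the start of the enclosing UTF-8 sequence can be found by skipping back over continuation bytes.

```python
# U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")  # big-endian, no byte order mark
    print(f"U+{ord(ch):04X}: UTF-8 {len(utf8)} byte(s), UTF-32 {len(utf32)} bytes")


def code_point_start(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the UTF-8 sequence containing it."""
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


encoded = "a\u20acb".encode("utf-8")   # bytes: 61 E2 82 AC 62
print(code_point_start(encoded, 2))    # 1
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8 respectively, but always 4 bytes in UTF-32.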
Code points are normally assigned to abstract characters. An ''abstract'' character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.
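As a rough illustration, Python's standard `unicodedata` module reports the general category of any code point, which distinguishes code points assigned to abstract characters from those set aside for other purposes. The four code points below were picked as examples, and the notes in the dictionary are informal labels, not official names.

```python
import unicodedata

examples = {
    0x0041: "assigned to an abstract character (a Latin capital letter)",
    0xD800: "surrogate, reserved for use by UTF-16",
    0xE000: "private use, meaning left to private agreement",
    0x10FFFF: "noncharacter, never assigned to a character",
}

for cp, note in examples.items():
    # category() returns a two-letter general category such as Lu, Cs, Co or Cn.
    print(f"U+{cp:04X}  {unicodedata.category(chr(cp))}  {note}")
```

The categories printed are `Lu` (uppercase letter), `Cs` (surrogate), `Co` (private use) and `Cn` (not assigned).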
The distinction between a code point and the corresponding abstract character is not pronounced in Unicode but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
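A small sketch makes the distinction concrete: the same 8-bit code point can stand for different abstract characters depending on which code page is applied. The two code pages used here, `latin-1` and `cp437`, are simply examples that ship with Python's standard codecs.

```python
import unicodedata

raw = bytes([0xE9])  # one code point in a 256-position code space

as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")

# Same position, different abstract characters.
print(unicodedata.name(as_latin1))   # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(as_cp437))    # a different character under cp437
print(as_latin1 == as_cp437)         # False
```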
History
The concept of a code point dates to the earliest standards for digital information processing and digital telecommunications.
Code points are part of Unicode's solution to a difficult conundrum faced by character encoding developers in the 1980s. If they added more bits per character to accommodate larger character sets, that design decision would also constitute an unacceptable waste of then-scarce computing resources for Latin script users (who constituted the vast majority of computer users at the time), since those extra bits would always be zeroed out for such users.
The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.
See also
* Combining character
* Replacement character
* Text-based (computing)
* Unicode collation algorithm
External links
Codepoints.net, a site dedicated to all things characters, letters and Unicode