A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page. For example, Unicode is a code page that has several character encoding schemes (referred to as "transformation formats"), including UTF-8, UTF-16 and UTF-32, but which may or may not actually be accompanied by a CCSID number to indicate that this encoding is being used.
Difference between a code page and a CCSID
The terms code page and CCSID are often used interchangeably, even though they are not synonymous. A code page may be only part of what makes up a CCSID. The following definitions from IBM help to illustrate this point:
* A glyph is the actual physical pattern of pixels or ink that shows up on a display or printout.
* A character is a concept that covers all glyphs associated with a certain symbol. For instance, the letter "F" rendered in plain, bold, italic, underlined, colored, or differently fonted form produces different glyphs, but all of them use the same character. The various modifiers (bold, italic, underline, color, and font) do not change the fact that these glyphs represent the same character.
* A character set contains the characters necessary to allow a particular human to carry on a meaningful interaction with the computer. It does not specify how those characters are represented in a computer.
This level is the first one to separate characters into various alphabets (Latin, Arabic, Hebrew, Cyrillic, and so on) or ideographic groups (e.g., Chinese, Korean). It corresponds to a "character repertoire" in the Unicode encoding model.
* A code page represents a particular assignment of code point values to characters. It corresponds to a "coded character set" in the Unicode encoding model. A code point for a character is the computer's internal representation of that character in a given code page. Many characters are represented by different code points in different code pages. Certain character sets can be adequately represented with single-byte code pages (which have a maximum of 256 code points, hence a maximum of 256 characters), but many require more than that. Examples include JIS X 0208 and Unicode.
* An encoding scheme is the byte format of a code page. It maps code point values to sequences of one or more byte values in a computer. For example, UTF-8 and UTF-16BE are two encodings of the same Unicode code page; they differ only in how many bytes are needed to represent a particular Unicode character value, how that value is contained within those bytes, and how the presence of Unicode information is indicated. In IBM's character data representation architecture (CDRA), the encoding scheme is typically represented with an ESID (encoding scheme identifier). EUC and ISO-2022 are other examples of encoding schemes.
* A coded character set identifier (CCSID) contains all of the information necessary to assign and preserve the meaning and rendering of characters through various stages of processing and interchange. This information always includes at least one code page, but may include multiple code pages of differing byte-lengths. The CCSID also has an associated encoding scheme that governs how various code points are to be handled. This mechanism allows a program to recognize bidirectional orientation, character shaping (mainly of Arabic characters), and other complex encoding information.
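The distinction between a code page and an encoding scheme can be sketched with Python's built-in codecs. This is a minimal illustration rather than IBM's CDRA machinery: the codec names `cp037` (an EBCDIC code page), `latin-1`, `utf-8`, and `utf-16-be` are Python's own labels, chosen here as assumptions for the example, and they are not CCSIDs.

```python
# Illustration only: Python codec names stand in for code pages and
# encoding schemes; they are not CCSIDs.

# One character, two code pages: the code point assigned to "A" differs.
ebcdic_a = "A".encode("cp037")     # EBCDIC code page 37
latin_a = "A".encode("latin-1")    # ISO 8859-1
print(ebcdic_a.hex(), latin_a.hex())   # c1 41

# One code page (Unicode), two encoding schemes: the euro sign has the
# code point U+20AC in both, but the resulting byte sequences differ.
euro = "\u20ac"
utf8_bytes = euro.encode("utf-8")
utf16be_bytes = euro.encode("utf-16-be")
print(hex(ord(euro)), utf8_bytes.hex(), utf16be_bytes.hex())   # 0x20ac e282ac 20ac
```

The first half shows why a code page must be known before a byte can be interpreted; the second half shows why the encoding scheme must be known as well, even when the code page is fixed.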
Examples
The following example shows how some CCSIDs are made up of other CCSIDs.
CCSIDs 932, 942, and 5028 are three variant Shift-JIS CCSIDs. All three are multi-byte character sets (MBCS). The double-byte character set (DBCS) portion is the same in each, but the single-byte character set (SBCS) portion differs: CCSID 932 uses the original code page 897, which is CCSID 897; CCSID 5028 uses an updated version of code page 897, which is CCSID 4993; and CCSID 942 uses a different SBCS from the other two, CCSID 1041.
Note also that CCSIDs 5028 and 4993 each differ by 4096 (1000 in hexadecimal) from the predecessor CCSID with the same code page identifier (932 and 897, respectively). This is a common way that CDRA denotes an upgraded CCSID.
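The composition and the 4096 offset described above can be written out as a small sketch. The `MixedCcsid` record and the `predecessor` helper are hypothetical names introduced only for illustration. The CCSID numbers come from the text, and the shared DBCS component is omitted because the text does not name it. The offset is described only as a common convention, so the helper does plain arithmetic and does not check that a predecessor actually exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixedCcsid:
    """Hypothetical record: a multi-byte CCSID and its single-byte part."""
    ccsid: int
    sbcs_ccsid: int


# Values taken from the text; the common DBCS part is not named there,
# so it is left out of this sketch.
SHIFT_JIS_VARIANTS = [
    MixedCcsid(ccsid=932, sbcs_ccsid=897),
    MixedCcsid(ccsid=942, sbcs_ccsid=1041),
    MixedCcsid(ccsid=5028, sbcs_ccsid=4993),
]

UPGRADE_OFFSET = 0x1000  # 4096 in decimal


def predecessor(ccsid: int) -> int:
    """Return the number 4096 below the given CCSID (arithmetic only)."""
    return ccsid - UPGRADE_OFFSET


print(predecessor(5028))  # 932
print(predecessor(4993))  # 897
```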
There are a few reasons for this complexity:
* Many of the CCSIDs are used in IBM databases, like IBM Db2, where a database field only supports an SBCS, DBCS or MBCS string. CCSIDs allow programs to tell which one is being used.
* When characters are added or replaced, as happened with the introduction of the euro currency sign, the use of a different CCSID shows whether stored strings support those character additions. This versioning is important for the integrity of the data.
* It enables reuse of resources among similar CCSIDs.
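The versioning point can be illustrated with a pair of EBCDIC code pages that Python happens to ship as codecs. This sketch assumes the codec names `cp037` and `cp1140`, which correspond to IBM code pages 37 and 1140, the latter being the euro-enabled variant of the former. This pair is not discussed in the text above and is chosen only as an example. `supports_euro` is a hypothetical helper.

```python
# cp037 and cp1140 are Python codec names, used here as stand-ins for two
# versions of a code page before and after the euro sign was added.
BYTE = b"\x9f"

old_char = BYTE.decode("cp037")    # currency sign, U+00A4
new_char = BYTE.decode("cp1140")   # euro sign, U+20AC
print(f"U+{ord(old_char):04X} U+{ord(new_char):04X}")   # U+00A4 U+20AC


def supports_euro(codec: str) -> bool:
    """Report whether the named codec can encode the euro sign."""
    try:
        "\u20ac".encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(supports_euro("cp037"), supports_euro("cp1140"))   # False True
```

The same stored byte decodes to two different characters depending on which version is assumed, which is why a distinct identifier for the updated code page matters for data integrity.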
External links
* IBM CDRA (character data representation architecture) glossary of terms
* IBM globalization terminology
* Complete description of IBM CDRA (This includes a more detailed description of the architecture surrounding CCSIDs.)
* List of CCSIDs supported on the IBM System i computer: https://web.archive.org/web/20150406044931/http://www-03.ibm.com/systems/i/software/globalization/ccsid_list.html