The Chinese Character Code for Information Interchange or CCCII is a character set developed by the Chinese Character Analysis Group in Taiwan. It was first published in 1980, and significantly expanded in 1982 and 1987. It is used mostly by library systems. It is one of the earliest established and most sophisticated encodings for traditional Chinese (predating the establishment of Big5 in 1984 and CNS 11643 in 1986). It is distinguished by its unique system for encoding simplified versions and other variants of its main set of hanzi characters.

A variant of an earlier version of CCCII is used by the Library of Congress as part of MARC-8, under the name East Asian Character Code (EACC, ANSI/NISO Z39.64), where it comprises part of MARC 21's JACKPHY support. However, EACC contains fewer characters than the most recent versions of CCCII. Work at Apple based on Research Libraries Group's CJK Thesaurus, which was used to maintain EACC, was one of the direct predecessors of Unicode's Unihan set.

Design

Byte ranges

CCCII is designed as a 94ⁿ set, as defined by ISO/IEC 2022. Each Chinese character is represented by a 3-byte code in which each byte is 7-bit, between 0x21 and 0x7E inclusive. Thus, the maximum number of Chinese characters representable in CCCII is 94×94×94 = 830584. The number of characters actually encodable is lower than this, because variant characters are encoded in related ISO 2022 planes under CCCII, so most of the code points have to be reserved for variants.

In practice, however, bytes outside of these ranges are sometimes used. The code 0x212320 is used by some implementations as an ideographic space. A CCCII specification used by libraries in Hong Kong uses codes starting with 0x2120 for punctuation and symbols. The first byte 0x7F is used by some variants to encode some otherwise unavailable Unified Repertoire and Ordering or CJK Unified Ideographs Extension A hanzi (e.g. 0x7F3449 for U+3449 or 0x7F796E for U+796E; note that the continuation bytes match the UCS-2BE code). This may include bytes outside of the 0x21–0x7E or even 0x20–0x7F range, e.g. 0x7F551C for U+551C, 0x7F5AA4 for U+5AA4 or 0x7F8EDA for U+8EDA.
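The byte-range rules above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this example; the sketch checks byte ranges and unpacks the 0x7F form, and says nothing about whether a given code is actually assigned in any CCCII edition.

```python
from typing import Optional, Tuple

LOW, HIGH = 0x21, 0x7E               # byte range of a 94-set
MAX_CODES = (HIGH - LOW + 1) ** 3    # 94 x 94 x 94 possible 3-byte codes


def split_code(code: int) -> Tuple[int, int, int]:
    """Split a 3-byte code such as 0x212320 into its three bytes."""
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def in_94_range(code: int) -> bool:
    """True if all three bytes lie within 0x21..0x7E."""
    return all(LOW <= b <= HIGH for b in split_code(code))


def ucs2_from_7f_escape(code: int) -> Optional[int]:
    """For codes whose first byte is 0x7F, return the Unicode code point
    carried by the two continuation bytes (big-endian); otherwise None."""
    first, high, low = split_code(code)
    if first != 0x7F:
        return None
    return (high << 8) | low


if __name__ == "__main__":
    print(MAX_CODES)                           # 830584
    print(in_94_range(0x212320))               # False: last byte is 0x20
    print(hex(ucs2_from_7f_escape(0x7F8EDA)))  # 0x8eda
```

Treating the code as a single integer keeps the example short; an implementation working on byte streams would read three bytes at a time instead.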

Interaction with ISO 2022

CCCII/EACC is not registered in the International Registry of Coded Character Sets to be Used with Escape Sequences and, as such, does not have a standard designation escape for use with ISO 2022. MARC-8 assigns EACC the private-use final byte 0x31 (ASCII 1) in its implementation of ANSI X3.41 (ISO 2022).
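As a rough illustration of what such a designation looks like on the wire, the following Python sketch builds an escape sequence from the byte 0x31. It assumes the short multi-byte G0 form ESC $ F; this form and the helper name are assumptions made for the example, and the sequences actually accepted by MARC-8 should be checked against the MARC 21 specification.

```python
ESC = 0x1B
DOLLAR = 0x24            # "$" introduces a multiple-byte set designation

EACC_FINAL_BYTE = 0x31   # private-use final byte assigned to EACC by MARC-8


def designate_multibyte_g0(final_byte: int) -> bytes:
    """Build ESC $ F for the given final byte F (assumed short G0 form)."""
    if not 0x30 <= final_byte <= 0x7E:
        raise ValueError("final byte must be in the range 0x30..0x7E")
    return bytes([ESC, DOLLAR, final_byte])


EACC_G0 = designate_multibyte_g0(EACC_FINAL_BYTE)

if __name__ == "__main__":
    print(EACC_G0)       # b'\x1b$1'
```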

Layers and variant characters

The 94 ISO 2022 planes are grouped into 16 layers of 6 planes each (except for layer 16, which contains the four planes 91–94). Layer 1 contains both non-hanzi and traditional Chinese characters, with the non-hanzi and most frequently used hanzi being placed in plane 1, and with the remaining five planes consisting of less common hanzi. Layer 2 contains simplified Chinese characters, with their row and cell numbers being the same as their traditional equivalents in layer 1. Layers 3 through 12 contain further variant forms, at row and cell numbers homologous to the first two layers.

The last four layers are used for other purposes. Specifically, layer 13 contains additional characters for Japanese support (kana and Japanese kokuji), and layer 14 contains additional characters for Korean support (hangul). Layer 15 is unused (reserved), while layer 16 is used for other characters.

This distinctive design has been criticized by Christian Wittern of the International Research Institute for Zen Buddhism at Hanazono University, who asserts that the relationship of character variants "is very complex and can not be expressed in a fixed, one-dimensional, hard-wired codetable". Ken Lunde describes it as "one of the most well thought-out character set standards from Taiwan", describing its structure as "to be truly admired", but concluding that OpenType variant form substitution can provide the same level of functionality.

CCCII defines roughly 53940 code points as of its 1987 edition, although a more recent draft from 1989 extends this to 75684 code points (comprising 44167 unique characters and 31517 variants). EACC, the variant used by the Library of Congress, includes only a smaller set of 15686 characters.
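The plane and layer arithmetic described in this section can be sketched in Python as follows. The function names are hypothetical, and the sketch only computes where a homologous variant slot would sit; whether a character is actually assigned at that slot depends on the CCCII edition and is not modelled here. It relies on the first byte of a code being 0x20 plus the plane number, which is consistent with the row headings in this article (for example 0x69 for plane 73).

```python
PLANES_PER_LAYER = 6
PLANE_BASE = 0x20        # first byte = PLANE_BASE + plane number


def plane_of(code: int) -> int:
    """Plane number (1..94) of a 3-byte code in the 94-set range."""
    return ((code >> 16) & 0xFF) - PLANE_BASE


def layer_of_plane(plane: int) -> int:
    """Layer number (1..16) that a plane belongs to."""
    if not 1 <= plane <= 94:
        raise ValueError("plane must be in 1..94")
    return (plane - 1) // PLANES_PER_LAYER + 1


def variant_slot(code: int, layer: int) -> int:
    """Code at the same row and cell as a layer-1 code, in layers 1..12."""
    if layer_of_plane(plane_of(code)) != 1:
        raise ValueError("code must lie in layer 1")
    if not 1 <= layer <= 12:
        raise ValueError("variant layers are 1..12")
    new_plane = plane_of(code) + PLANES_PER_LAYER * (layer - 1)
    return ((PLANE_BASE + new_plane) << 16) | (code & 0xFFFF)


if __name__ == "__main__":
    print(layer_of_plane(73))                 # 13 (Japanese support)
    print(layer_of_plane(91))                 # 16
    print(hex(variant_slot(0x213021, 2)))     # 0x273021
```

Because each layer is a fixed offset of six planes, moving between a root character and its variant slot only changes the first byte; the row and cell bytes are left untouched.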

Adoption

As of 1995, CCCII or EACC was used mostly in libraries in the United States, Hong Kong and Taiwan. Although CCCII promised pan-CJK coverage, its support was limited to specialized hardware; difficulty ascertaining when the root versus variant character should be used, exacerbated by a lack of firmly established reference glyphs, further limited its adoption, resulting in Big5 being more commonly used for Chinese in those territories outside of library use (since Unicode had yet to become widely adopted at the time). EACC is still in extensive use for specialized bibliographic purposes.

It was also an important precursor to Unicode: work at Apple on a CJK character cross-reference database based on Research Libraries Group's CJK Thesaurus, used to maintain EACC, was directly incorporated into the development of Unicode's Unihan set. Unicode hanzi characters are referenced to their corresponding CCCII and EACC codes in the Unihan database, in the keys kCCCII and kEACC; however, since Unicode's character unification criteria (based on those used by the Japanese JIS X 0208 and on those developed by the Association for a Common Chinese Code in China) differ from those used by CCCII, not all variant characters are individually mapped. Mapping tables for hanzi, kana and punctuation between EACC and Unicode are available from the Library of Congress.

Punctuation, symbol, kana and jamo charts

Following are charts for punctuation, symbols, kana and Hangul jamo, showing the characters and giving possible Unicode mappings. Where possible, these are referenced against published mapping data. Unicode mappings for Hangul syllables are omitted below for brevity, but are documented by the Library of Congress. CCCII hanzi number in the tens of thousands and are not shown below (except where they are also included in the non-hanzi range, as radicals or numerals), but mappings to Unicode are available from the Unihan database and from elsewhere.

Character set 0x2120 (plane 1, row 0: Hong Kong punctuation)

Although CCCII is usually a 94ⁿ set, and therefore does not usually use codes starting with 0x2120, the following layout is used by a variant used by libraries in Hong Kong:

Character set 0x2121 (plane 1, row 1: reserved for controls)

No characters are assigned in plane 1 row 1, which is reserved for control codes.

Character set 0x2122 (plane 1, row 2: mathematical operators)

This row contains mathematical operators. EACC leaves this row empty. The following table is referenced against sources from Taiwan.

The following table is referenced against CCCII data provided by the Hong Kong Innovative Users Group, a group of libraries in Hong Kong, and hosted by the University of Hong Kong. It uses an entirely different layout in this row:

Character set 0x2123 (plane 1, row 3: Roman and punctuation)

This row includes punctuation, western Arabic numerals and Roman letters. Compare row 3 of Wansung code and row 3 of GB 2312. Different variants variously encode the ideographic space (U+3000) at 0x212320 (which the MARC specification acknowledges), 0x212321 (which is listed in the ANSI standard, and is also acknowledged by MARC), or 0x21635F. EACC includes only the hyphen-minus, parentheses and ideographic space in this set.

Character set 0x212A (plane 1, row 10: internal IME characters and geta mark)

In EACC, this row includes several Private Use Area mapped characters used internally to represent character components by the RLIN input method, which is used by the Library of Congress for non-Roman cataloging. These component characters should only be used internally by an IME and, if encountered elsewhere, may be replaced with the geta mark (U+3013), which this row also includes at 0x212A46. This row is unassigned in CCCII, but the geta mark is also listed at that location in some mappings for CCCII.

Character set 0x212B (plane 1, row 11: punctuation)

This row contains various punctuation marks used in Chinese, in addition to other symbols. CCCII includes a set of 35 punctuation marks in this row. EACC includes only 13 characters in this row (shown boxed below).

Character sets 0x212C–0x212E (plane 1, rows 12–14: radicals and ordinals)

These rows contain Chinese radicals, Roman numerals, celestial stems and terrestrial branches.

Character set 0x212F (plane 1, row 15: Chinese numerals and bopomofo)

This row includes Chinese numerals and bopomofo characters. EACC includes only the ideographic zero (〇).

Character set 0x272B (plane 7, row 11: reference mark)

This row contains the reference mark (kome jirushi).

Character sets 0x272E–0x272F (plane 7, rows 14–15: alternative bopomofo)

A variant used by libraries in Hong Kong does not include bopomofo characters in plane 1 row 15, but includes them in a different layout in plane 7.

Character set 0x6921 (plane 73, row 1: Japanese punctuation)

This row is in plane 73, the first plane of layer 13, which contains characters included for Japanese support. It contains punctuation. Compare row 1 of JIS X 0208, whose layout this row tends to follow for the characters it includes.

Character set 0x6924 (plane 73, row 4: hiragana)

This row contains hiragana. Compare row 4 of JIS X 0208.

Character set 0x6925 (plane 73, row 5: katakana)

This row contains katakana. Compare row 5 of JIS X 0208, which this row corresponds to, besides the addition of the separate dakuten and handakuten.

Character sets 0x6F24–0x6F25 (plane 79, rows 4–5: jamo)

These rows contain Korean jamo.

Character set 0x6F76 (plane 79, row 86: archaic Hangul)

This row contains several historic Hangul characters no longer in regular use. Several of these are mapped to the

Character set 0x7B25 (plane 91, row 5: supplementary Katakana)

This row contains additional katakana used to write foreign phonemes.

References

* Some information on this page is based on the information on the CNS official website

External links

* CNS 11643 official web site (English version of pages available), which has information about the CCCII character set in the "Chinese Information Code" section
* Full mapping of EACC to Unicode, from Library of Congress