
The Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters. During the process called Han unification, the common (shared) characters were identified and named CJK Unified Ideographs. As of Unicode 16.0, Unicode defines a total of 97,680 CJK unified ideographs. The term ''ideographs'' is a misnomer, as the Chinese script is not ideographic but rather logographic. Until the early 20th century, Vietnam also used Chinese characters (chữ Nôm), so the abbreviation CJKV is sometimes used.


Sources

The Ideographic Research Group (IRG) is responsible for developing extensions to the encoded repertoires of CJK unified ideographs. The IRG processes proposals for new CJK unified ideographs submitted by its member bodies and, after several rounds of expert review, submits a consolidated set of characters to ISO/IEC JTC 1/SC 2 Working Group 2 (WG2) and the Unicode Technical Committee (UTC) for consideration for inclusion in the ISO/IEC 10646 and Unicode standards.

The following IRG member bodies have been involved in the standardization of CJK unified ideographs:
* China
* Hong Kong
* Japan
* South Korea
* North Korea
* Macau
* Taiwan, liaison member represented by the Taipei Computer Association (TCA)
* Vietnam
* Unicode Technical Committee (liaison member, also representing the United States)
* United Kingdom
* SAT (liaison member)

The ideographs submitted by the UTC and the United Kingdom are not specific to any particular region, but are characters which have been suggested for encoding by individual experts. The ideographs submitted by SAT are required for the SAT Daizōkyō text database. The table below gives the numbers of encoded CJK unified ideographs for each IRG source for Unicode 16.0. The total number of characters (260,840) far exceeds the number of encoded CJK unified ideographs (97,680) as many characters have more than one source.
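The gap between the two totals follows directly from counting a character once per source. The sketch below illustrates this with a toy mapping; the source tags and their assignment to code points are hypothetical and chosen only for illustration, while the two totals are the figures quoted above.

```python
# Hypothetical toy data: code point -> set of IRG source tags.
# The tags and their assignments are illustrative only, not real source data.
toy_sources = {
    0x4E00: {"G", "H", "J", "K", "T", "V"},
    0x4E01: {"G", "J", "T"},
    0x4E02: {"G"},
}

# Each character is encoded once, however many sources reference it.
encoded = len(toy_sources)

# Summing per-source counts counts a character once for every source it has.
per_source_total = sum(len(tags) for tags in toy_sources.values())

print(encoded, per_source_total)

# Figures quoted in the text for Unicode 16.0.
TOTAL_SOURCE_REFERENCES = 260_840
ENCODED_IDEOGRAPHS = 97_680

average_sources = TOTAL_SOURCE_REFERENCES / ENCODED_IDEOGRAPHS
print(f"about {average_sources:.2f} source references per encoded character")
```

On the quoted figures, an encoded ideograph has between two and three source references on average.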


UTC sources

The majority of characters submitted by the UTC to the IRG are derived from Unicode Technical Committee (UTC) documents. Other sources include:
* ''ABC Chinese-English Dictionary'' by John DeFrancis
* The Adobe-CNS1 glyph collection
* The Adobe-Japan1 glyph collection
* A Complete Checklist of Species and Subspecies of Chinese Birds (中国鸟类系统检索)
* The Great Nom Dictionary (Đại Tự Điển Chữ Nôm)
* Annotations to ''Shuowen Jiezi'' (annotated by Duan Yucai)
* GB18030-2000
* Required Character List Supplied by the Church of Jesus Christ of Latter-day Saints (Hong Kong)
* New Commercial Dictionary (商务新词典), Hong Kong
* Modern Chinese Dictionary (现代汉语词典), by Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office
* Working Group (WG2) documents


Ordering

The ordering of CJK Unified Ideographs within Unicode blocks (not counting those added to the block later) was initially determined by consulting the following four dictionaries. Primarily, characters were arranged in Kangxi Dictionary order; for characters not found in the Kangxi Dictionary, the other dictionaries were consulted, in order, to determine which Kangxi Dictionary character they should follow in the ordering.
# Kangxi Dictionary
# Dai Kan-Wa Jiten
# Hanyu Da Zidian
# Dae Jaweon

This system is not used for more recently added Unicode blocks. The Ideographic Research Group no longer uses the Dae Jaweon or the Dai Kan-Wa Jiten in its work. The Kangxi Dictionary and Hanyu Da Zidian are still used, both in existing character source references and as potential replacements for existing source references discovered to be erroneous. Similarly, although a (real or virtual) Kangxi Dictionary index was previously provided as part of the submission data for UTC-source characters, this is no longer the case. Instead, the stroke type of the first residual stroke (the first stroke which does not form part of the radical) is supplied with all submitted characters, and used to order characters with the same radical and stroke count within the new Unicode block.
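The newer ordering rule can be read as a three-part sort key: radical, then residual stroke count, then the type of the first residual stroke. The sketch below is only an illustration of that reading. The `Submission` record, its field names, the numeric stroke-type codes and the sample values are all hypothetical, and the actual IRG procedure may involve further tie-breaking.

```python
from typing import NamedTuple


class Submission(NamedTuple):
    """Hypothetical record for one submitted character (not an IRG format)."""
    label: str             # placeholder identifier for the character
    radical: int           # Kangxi radical number
    residual_strokes: int  # strokes that are not part of the radical
    first_stroke: int      # assumed numeric code for the first residual stroke type


def ordering_key(s: Submission) -> tuple[int, int, int]:
    # Radical first, then residual stroke count, then first residual stroke type.
    return (s.radical, s.residual_strokes, s.first_stroke)


# Made-up sample values, chosen only to exercise the sort key.
submissions = [
    Submission("c", radical=9, residual_strokes=4, first_stroke=3),
    Submission("a", radical=9, residual_strokes=4, first_stroke=1),
    Submission("d", radical=30, residual_strokes=2, first_stroke=2),
    Submission("b", radical=9, residual_strokes=3, first_stroke=5),
]

ordered = sorted(submissions, key=ordering_key)
print([s.label for s in ordered])  # ['b', 'a', 'c', 'd']
```

The first residual stroke only matters for "a" and "c", which share both radical and stroke count.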


CJK Unified Ideographs blocks


CJK Unified Ideographs

The basic block named ''CJK Unified Ideographs'' (4E00–9FFF) contains 20,992 basic Chinese characters in the range U+4E00 through U+9FFF. The block includes not only characters used in the Chinese writing system but also kanji used in the Japanese writing system, hanja in Korea, and chữ Nôm characters in Vietnamese. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. This block is also known as the ''Unified Repertoire and Ordering'' (URO), especially when it needs to be differentiated from the other CJK Unified Ideographs blocks.

The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.

The block is the result of Han unification, which was somewhat controversial within East Asia. Since single characters used in more than one of Chinese, Japanese and Korean were coded in the same location, and the modern typographical conventions and handwriting curricula differ slightly between regions (not necessarily along language boundaries: for example, Hong Kong and Taiwan, which both use Traditional Chinese, have slightly different local conventions), the appearance of a selected glyph could depend on the particular font being used. However, the URO applies the ''source separation rule'', meaning that pairs of characters treated as distinct in a character set used as a source for the URO (e.g. JIS X 0208, as used in Shift JIS) would remain pairs of separate characters in the new Unicode encoding.

Using variation selectors, it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set, which has 14,684 ideographic variation sequences, is an extreme example of the use of variation selectors.
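As a rough illustration of the two mechanisms described here, the following sketch tests whether a character falls in the URO range and builds a variation sequence by appending a selector to a base ideograph. The function names are hypothetical. The sketch assumes that ideographic variation sequences use the selectors VS17 to VS256 (U+E0100 through U+E01EF), and it does not check whether a given sequence is actually registered, or whether any font renders it differently.

```python
URO_FIRST, URO_LAST = 0x4E00, 0x9FFF   # range given in the text

# Assumption: ideographic variation sequences use VS17..VS256,
# which occupy U+E0100..U+E01EF.
VS17 = 0xE0100


def in_uro(ch: str) -> bool:
    """True if the single character lies in the basic CJK Unified Ideographs block."""
    return URO_FIRST <= ord(ch) <= URO_LAST


def make_variation_sequence(base: str, selector_number: int) -> str:
    """Append variation selector VS<selector_number> (17..256) to a base character."""
    if not 17 <= selector_number <= 256:
        raise ValueError("this sketch only handles VS17..VS256")
    return base + chr(VS17 + selector_number - 17)


block_size = URO_LAST - URO_FIRST + 1
print(block_size)                      # 20992, matching the count in the text

print(in_uro("中"), in_uro("㐀"))      # True False (U+3400 is in Extension A)

sequence = make_variation_sequence("葛", 17)
print([f"U+{ord(c):04X}" for c in sequence])  # ['U+845B', 'U+E0100']
```

The base character stays a single code point; the selector only requests a particular glyph, so text that ignores selectors still reads as the same character.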


Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.


Sources

Note: Most characters appear in multiple sources, so the sum of individual character counts (108,480) is far greater than the number of encoded characters (20,992). In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to code points between U+9FA6 and U+9FBB. Since then, further characters have been added to this block for various reasons, all summarized in the version history section below.


CJK Unified Ideographs Extension A

The block named ''CJK Unified Ideographs Extension A'' (3400–4DBF) contains 6,592 additional characters in the range U+3400 through U+4DBF.


Charts

3400-4DBF.


Sources

Note: Most characters appear in more than one source, so the sum of individual character counts (23,954) is far greater than the number of encoded characters (6,592).


CJK Unified Ideographs Extension B

The block named ''CJK Unified Ideographs Extension B'' (20000–2A6DF) contains 42,720 characters in the range U+20000 through U+2A6DF. These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Hán-Nôm characters that were formerly used to write Vietnamese.


Charts

20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.


Sources

Note: Many characters appear in more than one source, so the sum of individual character counts (99,784) is far greater than the number of encoded characters (42,720).


CJK Unified Ideographs Extension C

The block named ''CJK Unified Ideographs Extension C'' (2A700–2B73F) contains 4,154 characters in the range U+2A700 through U+2B739. It was initially added in Unicode 5.2 (2009).


Charts

2A700-2B73F.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,634) is greater than the number of encoded characters (4,154).


CJK Unified Ideographs Extension D

The block named ''CJK Unified Ideographs Extension D'' (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).


Charts

2B740–2B81F.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (239) is greater than the number of encoded characters (222).


CJK Unified Ideographs Extension E

The block named ''CJK Unified Ideographs Extension E'' (2B820–2CEAF) contains 5,762 characters in the range U+2B820 through U+2CEA1 that were added in Unicode 8.0 (2015).


Charts

2B820–2CEAF.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (5,919) is greater than the number of encoded characters (5,762).


CJK Unified Ideographs Extension F

The block named ''CJK Unified Ideographs Extension F'' (2CEB0–2EBEF) contains 7,473 characters in the range U+2CEB0 through U+2EBE0 that were added in Unicode 10.0 (2017). It includes more than 1,000 Sawndip characters for Zhuang.


Charts

2CEB0–2EBEF.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (7,775) is greater than the number of encoded characters (7,473).


CJK Unified Ideographs Extension G

A block named ''CJK Unified Ideographs Extension G'' was added as part of Unicode 13.0 to the Tertiary Ideographic Plane in the range U+30000 through U+3134F, containing 4,939 characters.
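The plane a block belongs to can be read off the code point itself, since each plane spans 65,536 code points. The short sketch below derives the plane number by shifting away the low 16 bits; the helper name `plane_of` and the small name table are illustrative choices, not part of any standard API.

```python
# Plane names used in this article; other planes are omitted from this sketch.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    3: "Tertiary Ideographic Plane",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a Unicode code point."""
    return code_point >> 16   # each plane holds 0x10000 code points


for cp in (0x4E00, 0x20000, 0x2EBF0, 0x30000):
    plane = plane_of(cp)
    print(f"U+{cp:04X}: plane {plane} ({PLANE_NAMES.get(plane, 'other')})")
```

Applied to the first code point of each block, this places the basic block and Extension A in plane 0, Extensions B through F and I in plane 2, and Extensions G and H in plane 3.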


Charts

30000–3134F.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (5,081) is greater than the number of encoded characters (4,939).


CJK Unified Ideographs Extension H

A block named ''CJK Unified Ideographs Extension H'' was added as part of Unicode 15.0 to the Tertiary Ideographic Plane in the range U+31350 through U+323AF, containing 4,192 characters.


Charts

31350–323AF.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,309) is greater than the number of encoded characters (4,192).


CJK Unified Ideographs Extension I

A block named ''CJK Unified Ideographs Extension I'' was added to the Supplementary Ideographic Plane in the range U+2EBF0 through U+2EE5F, containing 622 characters.


Charts

2EBF0–2EE5F.


Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (625) more than the number of encoded characters (622).
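The block ranges listed in the sections above can be collected into a small lookup table. The sketch below is one possible way to do this with a binary search over block start points; the table and the function name `unified_block` are illustrative. It reports block membership only: a code point inside a block range is not necessarily assigned, and the unified ideographs that sit in the CJK Compatibility Ideographs block are not covered.

```python
from bisect import bisect_right
from typing import Optional

# (first, last, name) for each block, using the ranges given in this article,
# sorted by first code point.
UNIFIED_BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
    (0x2CEB0, 0x2EBEF, "CJK Unified Ideographs Extension F"),
    (0x2EBF0, 0x2EE5F, "CJK Unified Ideographs Extension I"),
    (0x30000, 0x3134F, "CJK Unified Ideographs Extension G"),
    (0x31350, 0x323AF, "CJK Unified Ideographs Extension H"),
]
_STARTS = [first for first, _, _ in UNIFIED_BLOCKS]


def unified_block(code_point: int) -> Optional[str]:
    """Return the name of the unified-ideograph block containing code_point, if any."""
    i = bisect_right(_STARTS, code_point) - 1   # last block starting at or before it
    if i >= 0 and code_point <= UNIFIED_BLOCKS[i][1]:
        return UNIFIED_BLOCKS[i][2]
    return None


print(unified_block(ord("中")))   # CJK Unified Ideographs
print(unified_block(0x2EBF0))     # CJK Unified Ideographs Extension I
print(unified_block(ord("あ")))   # None (hiragana is not in any of these blocks)
```

Keeping the table sorted by start point is what allows the binary search; a linear scan over ten entries would work equally well and is arguably easier to read.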


CJK Compatibility Ideographs

The block named '' CJK Compatibility Ideographs'' (F900–FAFF) was created to retain round-trip compatibility with other standards. However, twelve characters in this block actually have the "Unified Ideograph" property: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨, and U+FA29 﨩. None of the other characters in this and other "Compatibility" blocks relate to CJK unification. While 龜 and 亀 are not considered unifiable, is considered a duplicate to .
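One way to observe the difference between the twelve unified ideographs and the rest of the block is through decomposition mappings. The sketch below relies on an assumption that is not stated in the text above: that true compatibility ideographs carry a canonical decomposition to a unified ideograph in the Unicode Character Database, while the twelve listed characters carry none. It uses Python's standard `unicodedata` module, whose data depends on the Unicode version shipped with the interpreter.

```python
import unicodedata

# The twelve code points listed in the text as having the "Unified Ideograph" property.
UNIFIED_IN_COMPAT = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]


def has_decomposition(code_point: int) -> bool:
    """True if the character has a decomposition mapping in the database."""
    return unicodedata.decomposition(chr(code_point)) != ""


# Assigned characters in F900..FAFF that have no decomposition mapping.
# Under the assumption above, this scan should pick up the twelve listed characters.
without_mapping = [
    cp for cp in range(0xF900, 0xFB00)
    if unicodedata.category(chr(cp)) != "Cn" and not has_decomposition(cp)
]
print([f"U+{cp:04X}" for cp in without_mapping])

# A true compatibility ideograph is replaced by its unified counterpart under NFC.
print(f"U+{ord(unicodedata.normalize('NFC', chr(0xF900))):04X}")  # U+8C48
```

A practical consequence is that normalizing text silently replaces true compatibility ideographs, whereas the twelve unified ones survive normalization unchanged.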


Charts

F900–FAFF.


Sources

Note: All characters appear in more than one source, so the sum of individual character counts (40) is greater than the number of encoded characters (12).


Known issues


Disunification


U+4039

The character U+4039 (䀹) was a unification of two different characters (one with the jiā 夾 phonetic and one with the shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings. The proposal to disunify U+4039 was accepted for Unicode 5.1, which encoded a new character at U+9FC3 (鿃) to represent shǎn.


Other 3 glyphs in Extension B

In CJK Unified Ideographs Extension B, some characters are incorrectly unified with others. These characters include U+2017B (𠅻), U+204AF (𠒯) and U+24CB2 (𤲲). The first two wrongly unify the Mainland Chinese and Vietnamese source glyphs, while the last wrongly unifies the Mainland Chinese and Taiwanese source glyphs.


Unifiable variants and exact duplicates

Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded by mistake. Additionally, an ISO/IEC JTC 1/SC 2 report has found that six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a ''de facto'' disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:
* U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
* U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
* U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the Chinese Mainland-, Taiwan- and Japan-source glyphs for U+8641
* U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
* U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
* U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
* U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
* U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23.")
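An application that wants the exact duplicates to compare equal has to fold them itself. The sketch below builds a translation table from the six exact-duplicate pairs listed above and applies it with `str.translate`. The direction of the mapping (towards the first code point of each pair) and the name `fold_exact_duplicates` are assumptions made for this illustration. The two semi-duplicates are deliberately left out, because their glyphs differ for some sources.

```python
import unicodedata

# Exact-duplicate pairs from the list above, as {duplicate: preferred}.
# Mapping towards the first-listed code point is an assumption of this sketch.
EXACT_DUPLICATES = {
    0x2420E: 0x3DB7,
    0x23515: 0x204F2,
    0x249E9: 0x249BC,
    0x2A415: 0x24BD2,
    0x26866: 0x26842,
    0x27EAF: 0xFA23,
}


def fold_exact_duplicates(text: str) -> str:
    """Replace each known exact duplicate with its counterpart."""
    return text.translate(EXACT_DUPLICATES)


duplicate = chr(0x2420E)

# Standard normalization leaves the duplicate untouched...
print(unicodedata.normalize("NFC", duplicate) == duplicate)   # True

# ...so the folding has to be done explicitly.
print(fold_exact_duplicates(duplicate) == chr(0x3DB7))        # True
```

Such folding is lossy and is better suited to search or comparison keys than to rewriting stored text.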


Other CJK ideographs in Unicode, not Unified

Apart from the ten blocks of "Unified Ideographs", Unicode has about a dozen more blocks with non-unified CJK characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have their (decomposable) counterparts in other blocks, the usages can be different. An example of a not-unified CJK-character is in the CJK Symbols and Punctuation block. Although it is not covered under "CJK Unified Ideographs", it is treated as a CJK character for all other intents and purposes.

Four blocks of compatibility characters are included for compatibility with legacy text handling systems and older character sets:
* CJK Compatibility (3300–33FF)
* CJK Compatibility Forms (FE30–FE4F)
* CJK Compatibility Ideographs (F900–FAFF)
* CJK Compatibility Ideographs Supplement (2F800–2FA1F)

They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Therefore, their use is discouraged.


Font support

The blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A, being parts of the Basic Multilingual Plane, are supported by the majority of CJK fonts. However, Japanese and Korean fonts usually have fewer characters (about 13,000 and 8,000, respectively) than Chinese fonts. Extensions B, C and D are supported by the additional fonts MingLiU-ExtB, MingLiU_HKSCS-ExtB, PMingLiU-ExtB and SimSun-ExtB, which have been included in Microsoft Windows since Vista.


Unicode version history


See also

* Han unification
* List of Unicode characters
* List of CJK fonts
* Ideographic Research Group
* Chinese cultural sphere


External links


* UK-Source Ideographs (Documents IRG N2107R2 and IRG N2232R)