The Chinese, Japanese and Korean (CJK) scripts share a common background, and their characters are collectively known as CJK characters. During the process called Han unification, the common (shared) characters were identified and named CJK Unified Ideographs. As of Unicode 16.0, Unicode defines a total of 97,680 CJK Unified Ideographs. The term ''ideographs'' is a misnomer, as the Chinese script is not ideographic but rather logographic. Until the early 20th century, Vietnam also used Chinese characters (Chữ Nôm), so the abbreviation CJKV is sometimes used.

Sources

The Ideographic Research Group (IRG) is responsible for developing extensions to the encoded repertoires of CJK unified ideographs. The IRG processes proposals for new CJK unified ideographs submitted by its member bodies and, after several rounds of expert review, submits a consolidated set of characters to ISO/IEC JTC 1/SC 2 Working Group 2 (WG2) and the Unicode Technical Committee (UTC) for consideration for inclusion in the ISO/IEC 10646 and Unicode standards.

The following IRG member bodies have been involved in the standardization of CJK unified ideographs:

* China
* Hong Kong
* Japan
* South Korea
* North Korea
* Macau
* Taiwan (liaison member, represented by the Taipei Computer Association (TCA))
* Vietnam
* Unicode Technical Committee (liaison member, also representing the United States)
* United Kingdom
* SAT (liaison member)

The ideographs submitted by the UTC and the United Kingdom are not specific to any particular region, but are characters that individual experts have suggested for encoding. The ideographs submitted by SAT are required for the SAT Daizōkyō text database.

The table below gives the numbers of encoded CJK unified ideographs for each IRG source for Unicode 16.0. The total number of characters (260,840) far exceeds the number of encoded CJK unified ideographs (97,680), as many characters have more than one source.

UTC sources

The majority of characters submitted by the UTC to the IRG are derived from Unicode Technical Committee (UTC) documents. Other sources include:

* ''ABC Chinese-English Dictionary'' by John DeFrancis
* The Adobe-CNS1 glyph collection
* The Adobe-Japan1 glyph collection
* A Complete Checklist of Species and Subspecies of Chinese Birds (中国鸟类系统检索)
* The Great Nom Dictionary (Đại Tự Điển Chữ Nôm)
* Annotations to ''Shuowen Jiezi'' (annotated by Duan Yucai)
* GB18030-2000
* Required Character List Supplied by the Church of Jesus Christ of Latter-day Saints (Hong Kong)
* New Commercial Dictionary (商务新词典), Hong Kong
* Modern Chinese Dictionary (现代汉语词典), by Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office
* Working Group (WG2) documents

Ordering

The ordering of CJK Unified Ideographs within Unicode blocks (not counting characters added to a block later) was initially determined by consulting the following four dictionaries. Characters were arranged primarily in Kangxi Dictionary order; for characters not found in the Kangxi Dictionary, the other dictionaries were consulted, in the order listed, to determine which Kangxi Dictionary character they should follow.

# Kangxi Dictionary
# Dai Kan-Wa Jiten
# Hanyu Da Zidian
# Dae Jaweon

This system is not used for more recently added Unicode blocks. The IRG no longer uses the Dae Jaweon or the Dai Kan-Wa Jiten in its work. The Kangxi Dictionary and Hanyu Da Zidian are still used, both in existing character source references and as potential replacements for existing source references found to be erroneous. Similarly, although a (real or virtual) Kangxi Dictionary index was previously provided as part of the submission data for UTC-source characters, this is no longer the case. Instead, the stroke type of the first residual stroke (the first stroke that does not form part of the radical) is supplied with all submitted characters and is used to order characters with the same radical and stroke count within the new Unicode block.
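The ordering rule described above for newer blocks (radical, then stroke count, then the type of the first residual stroke) can be sketched as a sort key. The record layout, the field names, the integer coding of stroke types, and the sample values are all hypothetical and chosen only for illustration; they are not taken from IRG submission data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for one submitted character."""
    label: str         # placeholder name, not a real code point
    radical: int       # radical number
    stroke_count: int  # stroke count used for ordering
    first_stroke: int  # type of the first residual stroke, as an assumed integer code


def ordering_key(s: Submission) -> tuple[int, int, int]:
    # Compare by radical first, then stroke count, then first residual stroke type.
    return (s.radical, s.stroke_count, s.first_stroke)


submissions = [
    Submission("c", radical=85, stroke_count=4, first_stroke=3),
    Submission("a", radical=9, stroke_count=5, first_stroke=1),
    Submission("d", radical=85, stroke_count=4, first_stroke=1),
    Submission("b", radical=85, stroke_count=2, first_stroke=5),
]

ordered = sorted(submissions, key=ordering_key)
print([s.label for s in ordered])  # ['a', 'b', 'd', 'c']
```

The first residual stroke only matters as a tie-breaker: "d" and "c" share a radical and a stroke count, so their relative order is decided by the third key component.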

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named ''CJK Unified Ideographs'' contains 20,992 basic Chinese characters in the range U+4E00 through U+9FFF. The block includes not only characters used in the Chinese writing system but also kanji used in the Japanese writing system, hanja used in Korea, and chữ Nôm characters used in Vietnamese. Many characters in this block are used in all three of the Chinese, Japanese and Korean writing systems, while others are used in only one or two of the three. This block is also known as the ''Unified Repertoire and Ordering'' (URO), especially when it needs to be differentiated from the other CJK Unified Ideographs blocks.

The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system, the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.

The block is the result of Han unification, which was somewhat controversial within East Asia. Because characters used in more than one of Chinese, Japanese and Korean were encoded at the same code point, and modern typographical conventions and handwriting curricula differ slightly between regions (not necessarily along language boundaries: for example, Hong Kong and Taiwan, which both use Traditional Chinese, have slightly different local conventions), the appearance of a given character can depend on the particular font being used. However, the URO applies the ''source separation rule'': pairs of characters treated as distinct in a character set used as a source for the URO (e.g. JIS X 0208, as used in e.g. Shift JIS) remain pairs of separate characters in the Unicode encoding.

Using variation selectors, it is possible to specify certain variant CJK ideographs within Unicode. The Adobe-Japan1 character set, which has 14,684 ideographic variation sequences, is an extreme example of the use of variation selectors.
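As a rough illustration of how variation selectors appear in text: an ideographic variation sequence consists of a base ideograph followed by a selector from the Variation Selectors Supplement block (U+E0100 through U+E01EF). The sketch below finds such pairs in a string. The function name is made up for this example, the base-character test covers only the URO range given above, and the sample sequence serves purely as input, without any claim that it is registered in a particular collection.

```python
IVS_FIRST, IVS_LAST = 0xE0100, 0xE01EF  # Variation Selectors Supplement
URO_FIRST, URO_LAST = 0x4E00, 0x9FFF    # basic CJK Unified Ideographs block


def find_variation_sequences(text: str) -> list[tuple[str, int]]:
    """Return (base character, selector code point) pairs found in text."""
    found = []
    # Walk over adjacent code point pairs: (text[0], text[1]), (text[1], text[2]), ...
    for base, selector in zip(text, text[1:]):
        if (URO_FIRST <= ord(base) <= URO_LAST
                and IVS_FIRST <= ord(selector) <= IVS_LAST):
            found.append((base, ord(selector)))
    return found


# U+845B followed by selector U+E0100, then a plain U+57CE with no selector.
sample = "\u845b\U000E0100\u57ce"
pairs = find_variation_sequences(sample)
print([(f"U+{ord(b):04X}", f"U+{s:05X}") for b, s in pairs])
```

A fuller implementation would also accept base characters from the extension blocks and would consult a registered sequence list rather than accepting any selector in range.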

Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.

Sources

Note: Most characters appear in multiple sources, so the sum of individual character counts (108,480) is far greater than the number of encoded characters (20,992). In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to code points between U+9FA6 and U+9FBB. Since then, further characters have been added to this block for various reasons, all summarized in the version history section below.

CJK Unified Ideographs Extension A

The block named ''CJK Unified Ideographs Extension A'' contains 6,592 additional characters in the range U+3400 through U+4DBF.

Charts

3400-4DBF.

Sources

Note: Most characters appear in more than one source, so the sum of individual character counts (23,954) is far greater than the number of encoded characters (6,592).

CJK Unified Ideographs Extension B

The block named ''CJK Unified Ideographs Extension B'' contains 42,720 characters in the range U+20000 through U+2A6DF. These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Hán-Nôm characters that were formerly used to write Vietnamese.

Charts

Note (Extension F sources): Some characters appear in more than one source, so the sum of individual character counts (7,775) is greater than the number of encoded characters (7,473).

CJK Unified Ideographs Extension G

A block named ''CJK Unified Ideographs Extension G'' was added as part of Unicode 13.0 to the Tertiary Ideographic Plane in the range U+30000 through U+3134F, containing 4,939 characters.

Charts

30000–3134F.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (5,081) is greater than the number of encoded characters (4,939).

CJK Unified Ideographs Extension H

A block named ''CJK Unified Ideographs Extension H'' was added as part of Unicode 15.0 to the Tertiary Ideographic Plane in the range U+31350 through U+323AF, containing 4,192 characters.

Charts

31350–323AF.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,309) is greater than the number of encoded characters (4,192).

CJK Unified Ideographs Extension I

A block named ''CJK Unified Ideographs Extension I'' was added to Unicode in the Supplementary Ideographic Plane in the range U+2EBF0 through U+2EE5F, containing 622 characters.

Charts

2EBF0–2EE5F.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (625) more than the number of encoded characters (622).
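The code point ranges quoted in the block descriptions above lend themselves to a simple lookup. The following sketch maps a character to its block name using only the ranges given in this article (blocks whose ranges are not quoted here are omitted); the table and function names are illustrative.

```python
from typing import Optional

# (first, last, name) for the blocks whose ranges are quoted in this article.
BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2EBF0, 0x2EE5F, "CJK Unified Ideographs Extension I"),
    (0x30000, 0x3134F, "CJK Unified Ideographs Extension G"),
    (0x31350, 0x323AF, "CJK Unified Ideographs Extension H"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for a single character, or None if not listed."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def block_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


print(block_of("\u4e00"))             # CJK Unified Ideographs
print(block_of("\U00020000"))         # CJK Unified Ideographs Extension B
print(block_size(0x4E00, 0x9FFF))     # 20992
print(block_size(0x30000, 0x3134F))   # 4944
```

A block's size in code points can exceed the number of characters assigned in it: the Extension G range spans 4,944 code points, while the article counts 4,939 characters in that block.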

CJK Compatibility Ideographs

The block named ''CJK Compatibility Ideographs'' (F900–FAFF) was created to retain round-trip compatibility with other standards. However, twelve characters in this block actually have the "Unified Ideograph" property: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨, and U+FA29 﨩. None of the other characters in this and other "Compatibility" blocks relate to CJK unification. While 龜 and 亀 are not considered unifiable, is considered a duplicate to .
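One practical consequence of this distinction can be observed with Unicode normalization: compatibility ideographs that duplicate a unified ideograph carry a canonical decomposition and are replaced during normalization, while the twelve characters listed above are left alone. The sketch below uses Python's standard `unicodedata` module; the results depend on the Unicode data shipped with the interpreter, and the helper name is invented for this example.

```python
import unicodedata

# The twelve code points in F900-FAFF that have the "Unified Ideograph" property.
UNIFIED_IN_COMPAT_BLOCK = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}


def survives_normalization(cp: int) -> bool:
    """True if NFC normalization leaves the character unchanged."""
    ch = chr(cp)
    return unicodedata.normalize("NFC", ch) == ch


# Each of the twelve unified characters keeps its own code point.
print(all(survives_normalization(cp) for cp in UNIFIED_IN_COMPAT_BLOCK))

# U+F900, an ordinary compatibility ideograph, is replaced by its unified counterpart.
print(survives_normalization(0xF900))
```

Text that has passed through NFC or NFD normalization will therefore generally no longer contain the ordinary compatibility ideographs, which is one reason they cannot be relied on to preserve glyph distinctions.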

Charts

F900–FAFF.

Sources

Note: All characters appear in more than one source, so the sum of individual character counts (40) is greater than the number of encoded characters (12).

Known issues

Disunification

U+4039

The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings. The proposal of disunification of U+4039 was accepted for Unicode 5.1, encoding a new character at U+9FC3 (鿃) to represent shǎn.

Other 3 glyphs in Extension B

In CJK Unified Ideographs Extension B, some characters were incorrectly unified with others. These include U+2017B (𠅻), U+204AF (𠒯) and U+24CB2 (𤲲). The first two wrongly unify the Chinese Mainland and Vietnamese sources of their glyphs, while the last wrongly unifies the Chinese Mainland and Taiwanese sources.

Unifiable variants and exact duplicates

Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded by mistake. Additionally, a report has found that six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a ''de facto'' disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:

* U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
* U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
* U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the Chinese Mainland-, Taiwan- and Japan-source glyphs for U+8641
* U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
* U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
* U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
* U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
* U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23")
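Software that searches or compares text may want to treat the exact duplicates listed above as equivalent. The mapping below is one possible sketch: it folds the second member of each exact-duplicate pair to the first. Which member to prefer, and the names `EXACT_DUPLICATES` and `fold_duplicates`, are choices made for this example rather than anything Unicode prescribes. The two semi-duplicate pairs (U+34A8/U+20457 and U+8641/U+27144) are deliberately left out because their glyphs differ between sources.

```python
# Second member of each exact-duplicate pair -> first member, as listed above.
EXACT_DUPLICATES = {
    0x2420E: 0x3DB7,
    0x23515: 0x204F2,
    0x249E9: 0x249BC,
    0x2A415: 0x24BD2,
    0x26866: 0x26842,
    0x27EAF: 0xFA23,
}


def fold_duplicates(text: str) -> str:
    """Replace each listed duplicate with its counterpart; leave other text as is."""
    # str.translate accepts a mapping from code point to code point.
    return text.translate(EXACT_DUPLICATES)


print(fold_duplicates("\U0002420E") == "\u3db7")  # True
print(fold_duplicates("abc") == "abc")            # True
```

Such folding is only appropriate for matching; stored text should normally keep its original code points.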

Other CJK ideographs in Unicode, not Unified

Apart from the ten blocks of "Unified Ideographs", Unicode has about a dozen more blocks containing CJK characters that are not unified. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have (decomposable) counterparts in other blocks, their usages can differ. An example of a non-unified CJK character is U+3007 〇 in the CJK Symbols and Punctuation block. Although it is not covered under "CJK Unified Ideographs", it is treated as a CJK character for all other intents and purposes.

Four blocks of compatibility characters are included for compatibility with legacy text handling systems and older character sets:

* CJK Compatibility (3300–33FF)
* CJK Compatibility Forms (FE30–FE4F)
* CJK Compatibility Ideographs (F900–FAFF)
* CJK Compatibility Ideographs Supplement (2F800–2FA1F)

They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Therefore, their use is discouraged.

Font support

The blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A, being parts of the Basic Multilingual Plane, are supported by the majority of CJK fonts. However, Japanese and Korean fonts usually have fewer characters (about 13,000 and 8,000, respectively) than Chinese fonts. Extensions B, C and D are supported by the additional fonts MingLiU-ExtB, MingLiU_HKSCS-ExtB, PMingLiU-ExtB and SimSun-ExtB, included in Microsoft Windows since Vista.
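Whether a given text needs a font with coverage beyond the Basic Multilingual Plane can be estimated from its code points: anything above U+FFFF lies in a supplementary plane. The helper below is a minimal sketch with made-up names; it says nothing about whether a particular font actually contains the glyphs.

```python
def plane_of(ch: str) -> int:
    """Plane number of a character; each plane holds 65,536 (0x10000) code points."""
    return ord(ch) >> 16


def non_bmp_characters(text: str) -> list[str]:
    """Characters of text that lie outside the Basic Multilingual Plane (plane 0)."""
    return [ch for ch in text if plane_of(ch) > 0]


# First code points of the URO, Extension A, Extension B and Extension G.
sample = "\u4e00\u3400\U00020000\U00030000"
print([f"U+{ord(c):04X} plane {plane_of(c)}" for c in sample])
print(len(non_bmp_characters(sample)))  # 2
```

Here the first two characters fall in plane 0, while the Extension B and Extension G characters fall in planes 2 and 3 and would need one of the supplementary fonts.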

Unicode version history

External links

UK-Source Ideographs (Documents IRG N2107R2 and IRG N2232R)
