
In computing, Chinese character encodings can be used to represent text written in the CJK languages (Chinese, Japanese, Korean) and, rarely, obsolete Vietnamese, all of which use Chinese characters. Several general-purpose character encodings accommodate Chinese characters, and some of them were developed specifically for Chinese.

In addition to Unicode (with its set of CJK Unified Ideographs), local encoding systems exist. The two primary "legacy" local systems are the Chinese Guobiao (or GB, "national standard") system, used in mainland China and Singapore, and the (mainly) Taiwanese Big5 system, used in Taiwan, Hong Kong and Macau. Guobiao is usually displayed using simplified characters and Big5 is usually displayed using traditional characters. There is, however, no mandated connection between the encoding system and the font used to display the characters; font and encoding are usually tied together for practical reasons.

The choice of encoding can also have political implications, as GB is the official standard of the People's Republic of China and Big5 is a ''de facto'' standard of Taiwan. In contrast to the situation with Japanese, there has been relatively little overt opposition to Unicode, which solves many of the issues involved with GB and Big5. Unicode is widely regarded as politically neutral, has good support for both simplified and traditional characters, and can be easily converted to and from GB and Big5. Furthermore, Unicode is not limited to Chinese, since it contains character codes for (nearly) every language.
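
The conversion between Unicode and the legacy encodings can be illustrated with a minimal Python sketch. It assumes the `gb2312` and `big5` codecs bundled with CPython; the helper name `transcode` is hypothetical and introduced only for this example.

```python
# Round trip between Unicode text and the two legacy encodings,
# using codec names that ship with CPython ("gb2312", "big5").
text = "中文"

gb_bytes = text.encode("gb2312")   # EUC-CN form of GB 2312
big5_bytes = text.encode("big5")

# Same characters, different byte sequences in each encoding.
print(gb_bytes.hex(" "), "|", big5_bytes.hex(" "))

# Decoding with the matching codec restores the original string.
assert gb_bytes.decode("gb2312") == text
assert big5_bytes.decode("big5") == text


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between legacy encodings with Unicode as the pivot."""
    return data.decode(src).encode(dst)


print(transcode(gb_bytes, "gb2312", "big5") == big5_bytes)
```

Transcoding of this kind only succeeds when the target encoding contains every character; a character missing from the target makes `encode` raise `UnicodeEncodeError`.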


Guobiao

The Guobiao (GB) line of character encodings starts with the Simplified Chinese charset GB 2312, published in 1980. Two encoding schemes existed for GB 2312: the commonly used EUC-CN, an 8-bit encoding that uses one or two bytes per character, and HZ, a 7-bit encoding used for Usenet posts. A traditional variant called GB/T 12345 was published in 1990. In 1993, the EUC-CN form was extended into GBK to include ''all'' Unicode 1.1 CJK ideographs, abandoning the ISO 2022 model. As a result, GBK includes traditional Chinese characters in addition to the simplified ones in GB 2312. GBK gained popularity through the widespread Code page 936 implementation found in Microsoft Windows 95.

In 2000, GB 18030 was published as GBK's successor. This new encoding adds a four-byte form, in effect a Unicode transformation format (UTF), that encodes all Unicode code points not previously encoded. In 2005, a revised GB 18030 was published to contain reference glyphs for scripts used by ethnic minorities in China, as well as glyphs from CJK Unified Ideographs Extension B, following updates to Unicode. Adobe-GB1 is the corresponding PostScript charset for GB encodings.
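
The differences between the GB encodings can be probed with a short Python sketch. It assumes CPython's bundled `hz`, `gb2312`, `gbk` and `gb18030` codecs behave as implementations of the corresponding standards; `can_encode` is a hypothetical helper written for this illustration.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Return True if `ch` has a code in the given Python codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# HZ wraps 7-bit GB 2312 codes in ~{ ... ~} escapes, so every byte is ASCII.
hz = "汉字".encode("hz")
print(hz, all(b < 0x80 for b in hz))

# EUC-CN sets the high bit on both bytes of a two-byte character.
euc = "汉字".encode("gb2312")
print(euc.hex(" "), all(b >= 0x80 for b in euc))

# 漢 (traditional) and 镕 are outside GB 2312 but inside GBK.
for ch in "漢镕":
    print(ch, can_encode(ch, "gb2312"), can_encode(ch, "gbk"))

# U+20000 (CJK Extension B) needs GB 18030's four-byte form.
ext_b = "\U00020000"
print(can_encode(ext_b, "gbk"), len(ext_b.encode("gb18030")))
```

Because each later encoding is designed as an extension of the previous one, text that encodes under `gb2312` is expected to produce the same bytes under `gbk` and `gb18030`.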


Big5

The Big5 family of character encodings starts with the initial definition by the consortium of five companies in Taiwan that developed it. It is a double-byte character set (DBCS) somewhat similar to Shift JIS, often combined with a single-byte character set (SBCS) such as ASCII. Quite a few vendor and official extensions exist, of which ETEN, HKSCS (Hong Kong) and Big5-2003 (as a part of CNS 11643 by Taiwan) are the best known. Adobe-CNS1 is the PostScript charset corresponding to the Big5 family of encodings.
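
How a double-byte set mixes with ASCII in one byte stream can be sketched as follows. This is an illustration under stated assumptions, not a validating parser: it assumes lead bytes lie in 0x81-0xFE, does not check trail bytes, and `split_big5` is a hypothetical name.

```python
def split_big5(data: bytes) -> list[bytes]:
    """Split Big5-with-ASCII bytes into one chunk per character.

    Assumes lead bytes 0x81-0xFE start a two-byte character and that
    everything else is a single byte; no trail-byte validation is done.
    """
    chunks = []
    i = 0
    while i < len(data):
        width = 2 if 0x81 <= data[i] <= 0xFE else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


mixed = "Big5 中文".encode("big5")
print(split_big5(mixed))

# A trail byte may fall in the ASCII range: the second byte of 功 is 0x5C,
# the same value as a backslash, so byte-wise ASCII processing is unsafe.
print("功".encode("big5"))
```

The scanner has to walk from the start of the stream, because a byte in the ASCII range can be either a standalone character or the second half of a two-byte one.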


Conversion

Before GBK, which includes both traditional and simplified characters, conversion between Traditional Chinese and Simplified Chinese charsets was complicated by the need to transcribe text between the two variants of Chinese, as each charset covers many of the other's characters only in its own variant. Conversion between traditional and simplified Chinese is usually problematic because the simplification of some traditional forms merged two or more different characters into one simplified form. The traditional-to-simplified (many-to-one) conversion is technically simple, but it often loses data when the target is GB 2312, since the distinction between the merged characters is gone. The opposite conversion is one-to-many: when traditional glyphs are assigned to simplified glyphs, some choices will inevitably be wrong in some usages. Thus simplified-to-traditional conversion often requires usage context or common phrase lists to resolve conflicts. This issue is less of a problem with newer standards such as GBK, GB 18030 and Unicode, which have separate code points for both simplified and traditional characters.

Another issue is that many of the encoding systems are missing characters. While the missing characters are often literary and not commonly used in ordinary text, this becomes a problem because people's names often contain them. Examples are the Taiwanese politician Wang Chien-shien, who has a character in his name that is not in some character systems, and former Chinese premier Zhu Rongji, whose character 镕 is not in GB 2312. The newest GB standard, GB 18030, has the complete character repertoire of Unicode 4.0, including the Unihan extensions in the Supplementary Ideographic Plane.
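
The asymmetry between the two conversion directions can be shown with a toy sketch. The tables below are tiny, hand-written assumptions for illustration only (發 and 髮 both simplify to 发); a real converter would load much larger mapping and phrase tables, and the function names are hypothetical.

```python
# Toy tables; a real converter would load far larger ones.
T2S = {"發": "发", "髮": "发", "頭": "头", "現": "现"}

# Invert the many-to-one table: each simplified form lists its candidates.
S2T: dict[str, list[str]] = {}
for trad, simp in T2S.items():
    S2T.setdefault(simp, []).append(trad)

# Phrase list used to pick the right candidate in context.
PHRASES = {"头发": "頭髮", "发现": "發現"}


def to_simplified(text: str) -> str:
    """Many-to-one: a plain per-character lookup is enough."""
    return "".join(T2S.get(ch, ch) for ch in text)


def to_traditional(text: str) -> str:
    """One-to-many: try known phrases first, then fall back to a guess."""
    out = []
    i = 0
    max_len = max(map(len, PHRASES))
    while i < len(text):
        # Longest phrase first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            phrase = text[i:i + n]
            if phrase in PHRASES:
                out.append(PHRASES[phrase])
                i += n
                break
        else:
            ch = text[i]
            # Fallback: first candidate, which may be the wrong one.
            out.append(S2T.get(ch, [ch])[0])
            i += 1
    return "".join(out)


print(to_simplified("頭髮"), to_simplified("發現"))
print(to_traditional("头发"), to_traditional("发现"))
print(S2T["发"])  # the ambiguous candidates for a lone 发
```

Without a matching phrase, the fallback simply picks the first candidate, which is exactly the kind of wrong choice the one-to-many mapping makes unavoidable.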


See also

* Chinese input methods for computers
* Han unification
* Four corner method
* Chinese character information technology


External links


* Chinese Encoding Converter
* ICU's Converter Explorer
* Chinese Character Codes
